Lesson 5 of 7

Assurance and Health Monitoring

Objective

In this lesson you will learn how to use Assurance and Catalyst Center health monitoring to validate client, device, and application health. We focus on verifying Catalyst Center service health, collecting diagnostic (RCA) bundles, checking telemetry and Assurance settings in the UI, and using the Catalyst Center logs and CLI utilities to troubleshoot health issues. This matters in production because Assurance is the single pane of glass that tells you whether switches, routers, and wireless controllers are sending telemetry and whether clients are healthy, which is essential for SLA enforcement and fast incident response.

Real-world scenario: A campus network operations team receives alerts that wired clients show degraded health. Before opening device tickets, you must confirm that Catalyst Center is receiving telemetry and that the devices are managed and reachable, then collect logs or RCA bundles to hand to TAC if needed.

Quick Recap

Refer to the topology used in Lesson 1. No new physical devices or IP addresses are added in this lesson — we operate against the existing Catalyst Center installation (the management system) accessible as:

  • Domain: lab.nhprep.com

All Assurance configuration changes described in the UI use the exact menu pathways shown in the reference material (Provision → Inventory, Design → Network Settings → Telemetry, Assurance → Network → Device 360).

Note: This lesson focuses on verifying and collecting health data from Catalyst Center and its services using magctl, the Catalyst Center CLI tool shown in the reference. Most operational configuration changes for Assurance (SNMP/NetFlow/syslog receivers, "Enable Application Telemetry") are performed in the Catalyst Center UI; where the reference prescribes UI steps, this lesson explains why the change matters and how to verify it.

Key Concepts

  • Assurance: The set of dashboards, checks, and analytics that combine telemetry, SNMP, NetFlow, syslog, and client data to provide health scores for devices, clients, and applications. In production, Assurance is used to prioritize incidents and quickly identify root causes.
  • Telemetry: Data that network devices send to Catalyst Center (streaming telemetry, SNMP, NetFlow, syslog). Telemetry must be enabled and flowing; Catalyst Center shows a "Telemetry Status" and re-checks it every six hours if the status is not good. Think of telemetry as the network’s "heartbeat."
  • Device Checks / Network Reasoner: Automated verification routines that validate Catalyst Center configuration for a device (manageability, reachability, site assignment). The Network Reasoner runs checks and provides actionable suggestions. In production, this reduces human error when onboarding devices.
  • RCA / Support Bundles: A pre-defined collection of logs and command outputs that captures system state for a service. Use RCA bundles when escalating to vendor support — they save time and ensure consistent information collection.
  • magctl: The local CLI utility used on Catalyst Center to collect logs and interact with services. The reference demonstrates how to use magctl to collect logs and view live logs for services.

When a device’s telemetry stops, Assurance cannot compute health scores and dashboards will show a blank or degraded health score. Always verify telemetry endpoints (SNMP trap server, Syslog server, NetFlow collector) are configured in Catalyst Center and that the device is marked as Managed / Reachable and assigned to a site.
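
The three device-side preconditions named above (Managed, Reachable, assigned to a site) can be expressed as a simple check. The following is a minimal sketch, assuming you have already copied a device's state out of Provision → Inventory; the DeviceRecord type and its field names are hypothetical and are not a Catalyst Center schema or API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DeviceRecord:
    # Hypothetical record for illustration; not a Catalyst Center data model.
    name: str
    manageability: str       # e.g. "Managed" or "Unmanaged"
    reachability: str        # e.g. "Reachable" or "Unreachable"
    site: Optional[str]      # None when the device has no site assignment


def assurance_blockers(dev: DeviceRecord) -> List[str]:
    """Return the preconditions this device fails, in a fixed order."""
    blockers = []
    if dev.manageability != "Managed":
        blockers.append("not Managed")
    if dev.reachability != "Reachable":
        blockers.append("not Reachable")
    if not dev.site:
        blockers.append("no site assignment")
    return blockers
```

An empty list means none of the three preconditions is blocking health scoring for that device; telemetry endpoints still need to be verified separately.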

Step-by-step configuration

Each step below follows the pattern: what we do, exact commands, what happened, verification. All commands shown are drawn from the reference.

Step 1: Verify Catalyst Center service availability with magctl

What we are doing: We display the magctl service help to confirm which service actions are available and to learn how to collect logs. This matters because magctl is the primary CLI tool for gathering service logs and generating RCA bundles for Catalyst Center.

magctl service --help

What just happened:
The --help option displays available magctl service subcommands and options. It tells you how to capture logs for a particular service and how to stream live logs. This equips you to collect diagnostics when Assurance dashboards look incorrect.

Real-world note: When dashboards show blank health scores, the first operational troubleshooting step is to check Catalyst Center service logs with magctl before touching device configs.

Verify:

magctl service --help

Expected output (complete):

Usage: magctl service [OPTIONS] COMMAND [ARGS]...

  Manage Catalyst Center services.

Options:
  -h, --help  Show this message and exit.

Commands:
  logs     Collect or view logs for a service
  status   Show high-level status of services (if available)
  --help   Show help for subcommands
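
If you script any of the checks in this lesson, it helps to confirm first that magctl is actually on the PATH of the host you are logged in to. This is a small sketch using only the Python standard library; the function name is an illustration, not part of any Cisco tooling.

```python
import shutil


def tool_available(name: str = "magctl") -> bool:
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None
```

Calling tool_available() before invoking magctl lets a script fail early with a clear message instead of a "command not found" error midway through a collection run.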

Step 2: Collect an RCA bundle for Assurance or Telemetry

What we are doing: We create a support/RCA bundle for the Assurance-related service so that we have a snapshot of logs and configuration to analyze or send to TAC. This is critical when device telemetry is missing or health scores are blank.

magctl service logs -r assurance

What just happened:
The -r option instructs magctl to capture a recorded RCA bundle for the named service (assurance). The CLI will gather relevant logs, configuration, and contextual data into a tarball. This preserves the system state at the time of the problem.

Real-world note: Generating RCA bundles early preserves transient log entries that might rotate or disappear; it reduces back-and-forth with support.

Verify:

magctl service logs -r assurance

Expected output (sample; the date, size, and file name will differ):

Collecting logs for service: assurance
Gathering configuration and relevant files...
Packaging collected files...
RCA bundle created: /var/tmp/magctl_rca_assurance_2026-04-02.tar.gz
Bundle size: 45MB
Location: /var/tmp/magctl_rca_assurance_2026-04-02.tar.gz
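
Before attaching a bundle to a case, it is worth confirming that the archive exists and opens cleanly. The sketch below assumes the bundle lands under /var/tmp with the file-name pattern shown in the sample output above; both the directory and the pattern are assumptions taken from that sample, so adjust them to whatever path your system reports.

```python
import glob
import os
import tarfile
from typing import List, Optional


def newest_bundle(directory: str = "/var/tmp",
                  pattern: str = "magctl_rca_assurance_*.tar.gz") -> Optional[str]:
    """Return the most recently modified file matching pattern, or None."""
    matches = glob.glob(os.path.join(directory, pattern))
    return max(matches, key=os.path.getmtime) if matches else None


def bundle_members(path: str) -> List[str]:
    """List member names; tarfile.ReadError is raised if the archive is unreadable."""
    with tarfile.open(path, "r:gz") as tar:
        return tar.getnames()
```

A None result from newest_bundle() means no matching archive was found, which usually points to a different output path or a failed collection.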

Step 3: Stream live Assurance logs for fast troubleshooting

What we are doing: We stream live logs from the Assurance service to observe runtime behavior while recreating the client or device issue. This helps identify errors as they happen (for example, telemetry ingestion errors, connectivity to SNMP/trap endpoints, or NetFlow parsing problems).

magctl service logs -rf assurance

What just happened:
The -rf option adds -f (follow) to the -r flag used in Step 2, so magctl keeps the log stream open and prints new entries as they arrive. You will see error or info lines related to telemetry ingestion, device checks, and health scoring logic. Use Ctrl+C to stop streaming.

Real-world note: Live logs are indispensable during incident reproduction. They show sequence and timestamps of failures while you test client connectivity or device telemetry.

Verify: (example streaming output)

magctl service logs -rf assurance

Expected streamed output (sample, complete lines):

2026-04-02T09:12:01Z INFO  assurance: Telemetry consumer started
2026-04-02T09:12:05Z INFO  assurance: Received SNMP trap from switch1 (serial XYZ)
2026-04-02T09:12:06Z WARN  assurance: Telemetry status check: WIRELESS_TELEMETRY missing for site Site-A
2026-04-02T09:13:10Z ERROR assurance: NetFlow parse error: flow 10.1.1.5 -> 8.8.8.8 malformed record
2026-04-02T09:14:00Z INFO  assurance: HealthScore calc for switch1: 85
2026-04-02T09:15:00Z INFO  assurance: HealthScore calc for client 00:11:22:33:44:55: 60
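
When the stream is busy, filtering for warnings, errors, and health scores is easier with a small parser. The sketch below assumes the line shape of the sample output above (timestamp, level, service name, message); real log lines may be formatted differently, in which case the regular expressions need adjusting. All function names are illustrative.

```python
import re
from typing import Dict, Iterable, List, NamedTuple

# Assumed line shape, taken from the sample output in this lesson:
#   <timestamp> <LEVEL> <service>: <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>[\w-]+):\s+(?P<msg>.*)$"
)
SCORE_RE = re.compile(r"^HealthScore calc for (?P<subject>.+): (?P<score>\d+)$")


class LogEntry(NamedTuple):
    ts: str
    level: str
    service: str
    msg: str


def parse_lines(lines: Iterable[str]) -> List[LogEntry]:
    """Parse lines that match the assumed shape; silently skip the rest."""
    entries = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            entries.append(LogEntry(**m.groupdict()))
    return entries


def problems(entries: Iterable[LogEntry]) -> List[LogEntry]:
    """Keep only WARN and ERROR entries."""
    return [e for e in entries if e.level in ("WARN", "ERROR")]


def health_scores(entries: Iterable[LogEntry]) -> Dict[str, int]:
    """Map each subject (device or client) to its last logged health score."""
    scores = {}
    for e in entries:
        m = SCORE_RE.match(e.msg)
        if m:
            scores[m["subject"]] = int(m["score"])
    return scores


SAMPLE = """\
2026-04-02T09:12:01Z INFO  assurance: Telemetry consumer started
2026-04-02T09:12:05Z INFO  assurance: Received SNMP trap from switch1 (serial XYZ)
2026-04-02T09:12:06Z WARN  assurance: Telemetry status check: WIRELESS_TELEMETRY missing for site Site-A
2026-04-02T09:13:10Z ERROR assurance: NetFlow parse error: flow 10.1.1.5 -> 8.8.8.8 malformed record
2026-04-02T09:14:00Z INFO  assurance: HealthScore calc for switch1: 85
2026-04-02T09:15:00Z INFO  assurance: HealthScore calc for client 00:11:22:33:44:55: 60
"""
```

Running problems(parse_lines(SAMPLE.splitlines())) on the sample leaves the WARN and ERROR lines, and health_scores() pulls out the switch and client scores so a degraded client stands out without scrolling.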

Step 4: Verify Telemetry and Assurance settings in the UI (Design → Network Settings → Telemetry)

What we are doing: We confirm Catalyst Center is configured with SNMP trap, syslog, and NetFlow collectors. This configuration is required for Assurance to receive device telemetry needed to compute health scores.

# No CLI command is defined for this UI action in the reference;
# verify using magctl logs to see Telemetry configuration-related messages:
magctl service logs -r telemetry

What just happened:
The UI path (Design → Network Settings → Telemetry) is where you enable/verify SNMP trap server, syslog server, and NetFlow collector. Because the reference prescribes UI changes here, we use the magctl telemetry logs to confirm Catalyst Center reports the configured telemetry endpoints and whether inbound telemetry is being received.

Real-world note: In production, misconfigured telemetry targets (wrong IP/port) are a common cause of blank health scores. Make UI changes carefully and verify with magctl logs.

Verify:

magctl service logs -r telemetry

Expected output (sample; addresses and timestamps will differ):

Collecting logs for service: telemetry
Telemetry configuration loaded: SNMP Trap: 172.16.100.10:162, Syslog: 172.16.100.20:514, NetFlow: 172.16.100.30:2055
2026-04-02T09:10:12Z INFO telemetry: SNMP trap listener bound on 0.0.0.0:162
2026-04-02T09:10:15Z INFO telemetry: Received syslog from switch1 (10.10.10.11)
2026-04-02T09:10:20Z INFO telemetry: NetFlow packet received from 10.10.10.12
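
To compare what Catalyst Center reports against what you entered in the UI, the configuration line can be split into named endpoints. This sketch assumes the "Telemetry configuration loaded:" line looks exactly like the sample above; the expected-port table simply mirrors that sample and should be replaced with the values you configured.

```python
import ipaddress
import re
from typing import Dict, Tuple

PREFIX = "Telemetry configuration loaded:"
PAIR_RE = re.compile(r"(?P<name>[A-Za-z ]+):\s*(?P<host>[\d.]+):(?P<port>\d+)")

# Ports copied from the sample line in this lesson; substitute your own values.
EXPECTED_PORTS = {"SNMP Trap": 162, "Syslog": 514, "NetFlow": 2055}


def parse_endpoints(line: str) -> Dict[str, Tuple[str, int]]:
    """Return {endpoint name: (ip, port)} from a configuration line."""
    if not line.startswith(PREFIX):
        raise ValueError("not a telemetry configuration line")
    endpoints = {}
    for m in PAIR_RE.finditer(line[len(PREFIX):]):
        ipaddress.ip_address(m["host"])  # raises ValueError on a malformed address
        endpoints[m["name"].strip()] = (m["host"], int(m["port"]))
    return endpoints


def port_mismatches(endpoints: Dict[str, Tuple[str, int]],
                    expected: Dict[str, int] = EXPECTED_PORTS) -> Dict[str, int]:
    """Return {name: reported port} for endpoints whose port differs from expected."""
    return {
        name: port
        for name, (_, port) in endpoints.items()
        if name in expected and expected[name] != port
    }
```

An empty result from port_mismatches() means the reported ports match the table; a wrong IP address still has to be checked by eye against the UI.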

Step 5: Use Network Reasoner / Device Checks to validate device state

What we are doing: We use the Catalyst Center UI (Assurance → Network → Device 360 or Tools → Network Reasoner) to run device checks that verify manageability (Managed), reachability (Reachable), and site assignment. Because these checks are UI-driven per the reference, we show how to collect the results and how to capture reasoner logs with magctl.

# Collect Network Reasoner logs for a device check invocation
magctl service logs -r network-reasoner

What just happened:
Network Reasoner runs a sequenced set of verification checks on devices and returns pass/fail plus remediation steps. We used magctl to capture the reasoner logs, which will contain the checks performed and any failures encountered (e.g., SNMPv3 credential mismatch, unreachable IP, missing site assignment).

Real-world note: Device checks reduce time-to-triage when onboarding devices or when a device’s health score drops unexpectedly.

Verify:

magctl service logs -r network-reasoner

Expected output (sample; device names and timestamps will differ):

Collecting logs for service: network-reasoner
2026-04-02T09:16:00Z INFO reasoner: Running device checks for device switch1 (serial: ABC123)
2026-04-02T09:16:05Z PASS reasoner: Manageability: Managed
2026-04-02T09:16:06Z PASS reasoner: Reachability: Reachable (ICMP and API respond)
2026-04-02T09:16:07Z FAIL reasoner: Telemetry: Wireless Telemetry not enabled for site Site-A
2026-04-02T09:16:10Z INFO reasoner: Suggested remediation: Enable Wireless Telemetry under Design -> Network Settings -> Telemetry for Site-A
Reasoner run completed for device switch1. Summary: 2 PASS, 1 FAIL.
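
For a run that covers many devices, tallying PASS and FAIL lines by script is faster than reading them. The sketch below assumes the reasoner lines follow the sample format above (timestamp, PASS or FAIL, "reasoner:", check name, detail); the function name is illustrative.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed shape: <timestamp> PASS|FAIL reasoner: <check>: <detail>
CHECK_RE = re.compile(r"^\S+\s+(PASS|FAIL)\s+reasoner:\s+([^:]+):\s+(.*)$")


def summarize(lines: Iterable[str]) -> Tuple[Counter, List[Tuple[str, str]]]:
    """Count PASS/FAIL results and collect (check, detail) for each failure."""
    counts = Counter()
    failures = []
    for line in lines:
        m = CHECK_RE.match(line.strip())
        if not m:
            continue  # INFO lines and the summary line are ignored
        result, check, detail = m.groups()
        counts[result] += 1
        if result == "FAIL":
            failures.append((check, detail))
    return counts, failures


SAMPLE = """\
2026-04-02T09:16:00Z INFO reasoner: Running device checks for device switch1 (serial: ABC123)
2026-04-02T09:16:05Z PASS reasoner: Manageability: Managed
2026-04-02T09:16:06Z PASS reasoner: Reachability: Reachable (ICMP and API respond)
2026-04-02T09:16:07Z FAIL reasoner: Telemetry: Wireless Telemetry not enabled for site Site-A
2026-04-02T09:16:10Z INFO reasoner: Suggested remediation: Enable Wireless Telemetry under Design -> Network Settings -> Telemetry for Site-A
Reasoner run completed for device switch1. Summary: 2 PASS, 1 FAIL.
"""
```

On the sample, summarize() counts two passes and one failure, matching the summary line, and returns the Telemetry check as the single item to remediate.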

Verification Checklist

  • Check 1: Catalyst Center services are reachable — verify magctl service --help and that magctl exists on the host. Use magctl service logs -r <service> for targeted verification.
  • Check 2: RCA bundle created for assurance — run magctl service logs -r assurance and confirm the bundle path (e.g., /var/tmp/magctl_rca_assurance_*.tar.gz).
  • Check 3: Telemetry endpoints configured and receiving data — check magctl service logs -r telemetry and confirm lines indicating SNMP trap/syslog/NetFlow packets received.
  • Check 4: Device checks show Manageability & Reachability pass — run Network Reasoner from UI and confirm via magctl service logs -r network-reasoner that checks pass or have clear remediation.

Common Mistakes

  • Symptom: Blank client or device health score in Assurance
    Cause: Telemetry not configured or telemetry packets not reaching Catalyst Center
    Fix: Verify Design → Network Settings → Telemetry; ensure SNMP trap, Syslog, NetFlow endpoints are correct; use magctl service logs -r telemetry to confirm inbound packets

  • Symptom: Device shows Unmanaged or Unreachable in Inventory
    Cause: Device not marked Managed or missing site assignment
    Fix: From Provision → Inventory, ensure device Manageability State = Managed and assigned to a site; run Network Reasoner for remediation steps

  • Symptom: RCA bundle not capturing expected logs
    Cause: Wrong service name or insufficient permissions
    Fix: Use exact service name as shown by magctl service --help; run command as the Catalyst Center user with proper privileges; confirm bundle appears under /var/tmp

  • Symptom: Repeated NetFlow parse errors in Assurance logs
    Cause: Unsupported or malformed NetFlow export from device
    Fix: Check device exporter configuration and flow version; confirm NetFlow collector address/port in Catalyst Center; capture NetFlow packets on collector for analysis

Key Takeaways

  • Always confirm telemetry ingestion before investigating client or device health in Assurance; telemetry is the source of truth for health scoring.
  • Use magctl to collect RCA bundles and stream live logs — this preserves transient information and speeds troubleshooting.
  • Network Reasoner / Device Checks are invaluable: they codify common verification steps (Manageability, Reachability, Telemetry) and provide actionable remediation guidance.
  • In production, configure and verify SNMP trap, syslog, and NetFlow targets in Catalyst Center (Design → Network Settings → Telemetry) and validate with live logs and reasoner checks to ensure Assurance can compute accurate health scores.

Tip: When you open a TAC case, include the RCA tarball produced by magctl service logs -r <service> and the timestamped streaming logs (magctl service logs -rf <service>) that reproduce the failure. Those artifacts dramatically reduce time-to-resolution.
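
To save the streaming logs to a file while you reproduce the failure, a short wrapper around the command can help. This is a sketch under stated assumptions: capture_stream is a hypothetical helper, the magctl flags are the ones used in this lesson, and the line cap exists only so that a follow-mode stream does not run forever.

```python
import subprocess
from typing import Sequence


def capture_stream(cmd: Sequence[str], out_path: str, max_lines: int = 1000) -> int:
    """Run cmd, copy up to max_lines of its output to out_path, return the count."""
    written = 0
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    try:
        with open(out_path, "w") as out:
            for line in proc.stdout:
                out.write(line)
                written += 1
                if written >= max_lines:
                    break
    finally:
        proc.stdout.close()
        proc.terminate()  # stops a follow-mode stream that would otherwise never end
        proc.wait()
    return written
```

A call such as capture_stream(["magctl", "service", "logs", "-rf", "assurance"], "assurance_live.log") would write the first 1000 streamed lines to a file that can be attached alongside the RCA tarball.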


This lesson showed how to verify that Assurance is healthy, how to collect RCA bundles and live logs using the Catalyst Center magctl utility, and how to use Network Reasoner / Device Checks to validate device state. In the next lesson we'll walk through remediation steps when telemetry ingestion fails and how to validate device-side configurations that affect Assurance.