Lesson 4 of 5

Troubleshooting Assurance

Objective

In this lesson you will troubleshoot Assurance data collection problems in a Catalyst Center environment: specifically broken syslog ingestion, missing SNMP traps, and absent NetFlow records. You will learn how to use the appliance CLI (maglev) and the container orchestration utility (magctl) to locate the failing microservice, inspect runtime logs, and recover service availability. This matters in production because Assurance telemetry (syslog, SNMP traps, NetFlow) powers troubleshooting, alerting, and analytics — when telemetry stops, operations teams lose visibility and SLAs can be breached.

Real-world scenario: a NOC reports that devices show up in the inventory but no syslog, traps, or flow records are visible in the Assurance dashboards. You must determine whether ingestion services are down, misconfigured, or receiving malformed traffic.

Quick Recap

Refer to the topology deployed in Lesson 1 (Catalyst Center cluster with namespaces for platform services and assurance microservices). This lesson does not add new network devices or IP addresses beyond the Catalyst Center access point shown previously. You will connect to the Catalyst Center management host using the maglev account and use magctl to inspect cluster resources.

Device Table

Device | Access User | Address / Port
Catalyst Center management host | maglev | Catalyst_Center_IP:2222

Tip: UI credentials are separate from Linux CLI access. When you need the web UI for cross-checking, use admin@lab.nhprep.com and password Lab@123.

Key Concepts

  • Microservices and Pods: Assurance functions run as Kubernetes-style microservices (pods). If telemetry stops, first check pod status — a crashed or evicted pod cannot receive or forward telemetry. Think of each pod like a lightweight appliance that must be "UP" to do its job.
  • Ingress of telemetry: Syslog and NetFlow are UDP-heavy protocols; packet loss, port blocks, or binding failures prevent ingestion. UDP is connectionless — if a pod crashes while UDP flows arrive, the packets are dropped silently.
  • Logging and diagnostics: Application logs inside the pod are the primary source to diagnose parsing or binding errors (for example, port already in use, malformed messages causing parser exceptions).
  • Control plane vs. data plane: magctl lets you inspect the control plane objects (pods, services). Restarts or redeploys at the control plane are often the fastest remediation. In production, coordinate restarts with stakeholders to avoid data gaps.
  • Analogy: Think of the cluster as a postal sorting center. Pods are sorting teams: if one team is absent or misconfigured, the mail (syslog/SNMP/NetFlow) piles up or gets discarded.
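The silent-failure property of UDP noted above can be demonstrated in a few lines of Python (port 59514 is an arbitrary, assumed-unused port and the message is an illustrative syslog-style string):

```python
import socket

# Demonstration: UDP sendto() "succeeds" whether or not anything is listening,
# which is why telemetry loss is silent when an ingest pod is down.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
message = b"<134>TEST: syslog ingestion check"
sent = sock.sendto(message, ("127.0.0.1", 59514))
sock.close()
print(sent == len(message))  # True: the OS accepted the datagram regardless
```

The sender gets no error and no retransmit, so collector logs and pod readiness are your only immediate signals of loss.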

Step-by-step configuration

Step 1: SSH to the Catalyst Center management host

What we are doing: Connect to the Catalyst Center host as the maglev user on the management SSH port. This gives you access to maglev and magctl utilities used to inspect and control microservices.

ssh maglev@Catalyst_Center_IP -p 2222

What just happened: You established an SSH session to the Catalyst Center management host. This host is the entry point for cluster-level commands. Using the maglev account is required because the UI username/password are separate from CLI credentials.

Real-world note: In production, SSH access is typically restricted to a jump host or bastion; always use MFA and audit SSH sessions.

Verify:

hostname
whoami

Expected output:

maglev@catalyst-center:~$ hostname
catalyst-center
maglev@catalyst-center:~$ whoami
maglev

Step 2: List Assurance-related pods with magctl

What we are doing: Query the cluster for all pods across namespaces to find Assurance-related services (syslog ingestion, SNMP trap forwarder, NetFlow collector). If a pod is CrashLoopBackOff, Pending, or NotReady, it explains telemetry loss.

magctl get pods --all-namespaces

What just happened: This magctl command lists all pod objects and their status across namespaces. It shows the pod name, namespace, readiness, status (Running, Pending, CrashLoopBackOff), restarts, and age. Finding a non-Running status points you to the failing component.

Real-world note: In large clusters, filter by namespace (for example, -n assurance) to reduce noise; magctl behaves like kubectl for listing resources.

Verify:

magctl get pods --all-namespaces

Expected output:

NAMESPACE           NAME                                      READY   STATUS             RESTARTS   AGE
assurance           ndp-assurance-0                           1/1     Running            0          12d
assurance           fusion-assurance-0                        1/1     CrashLoopBackOff   4          3h
assurance           syslog-ingest-0                           0/1     CrashLoopBackOff   10         2h
assurance           snmp-trap-receiver-0                      1/1     Running            0          12d
platform            postgres-0                                1/1     Running            0          12d
platform            inventory-0                               1/1     Running            0          12d

Interpretation: The syslog-ingest-0 and fusion-assurance-0 pods are unstable (CrashLoopBackOff) — these are the primary suspects for missing syslog and some assurance functions.
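The triage above can be sketched as a small parser (a hypothetical helper, not part of magctl) that flags any pod which is not fully Ready and Running in the tabular listing:

```python
def find_unhealthy_pods(pods_output):
    """Return (namespace, name, status) for pods that are not fully up.

    Expects tabular output like `magctl get pods --all-namespaces`:
    NAMESPACE NAME READY STATUS RESTARTS AGE
    """
    unhealthy = []
    for line in pods_output.strip().splitlines()[1:]:  # skip header row
        fields = line.split()
        namespace, name, ready, status = fields[0], fields[1], fields[2], fields[3]
        ready_now, ready_want = ready.split("/")
        if status != "Running" or ready_now != ready_want:
            unhealthy.append((namespace, name, status))
    return unhealthy

sample = """\
NAMESPACE   NAME                READY   STATUS             RESTARTS   AGE
assurance   ndp-assurance-0     1/1     Running            0          12d
assurance   syslog-ingest-0     0/1     CrashLoopBackOff   10         2h
"""
print(find_unhealthy_pods(sample))
# [('assurance', 'syslog-ingest-0', 'CrashLoopBackOff')]
```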

Step 3: Describe the failing pod to see recent events

What we are doing: Use magctl to describe the failing pod (syslog-ingest-0) to get recent events (scheduling problems, image pull errors, OOMKilled, or port binding failures).

magctl describe pod syslog-ingest-0 -n assurance

What just happened: The describe command prints detailed pod information including container state, last state, event messages from the kubelet (or equivalent), and reasons for restarts or failures. This reveals if the failure is due to resource limits, missing config, or port conflicts.

Real-world note: A CrashLoopBackOff often has a short-lived container that exits immediately due to misconfiguration (bad secret, missing config map, or runtime exception). Check events for "Back-off restarting failed container" and read the container logs.

Verify:

magctl describe pod syslog-ingest-0 -n assurance

Expected output:

Name:         syslog-ingest-0
Namespace:    assurance
Node:         node-02/10.10.2.5
Start Time:   Tue, 31 Mar 2025 09:15:22 -0700
Labels:       app=syslog-ingest
Status:       Running
IP:           10.244.2.18
Containers:
  syslog-ingest:
    Container ID:   docker://abcdef123456
    Image:          nhprep/syslog-ingest:2.3.7
    Image ID:       docker-pullable://nhprep/syslog-ingest@sha256:abcd
    Port:           514/udp
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 31 Mar 2025 09:17:01 -0700
      Finished:     Tue, 31 Mar 2025 09:17:02 -0700
    Ready:          False
    Restart Count:  10
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  2h                 default-scheduler  Successfully assigned assurance/syslog-ingest-0 to node-02
  Warning  BackOff    1h59m              kubelet            Back-off restarting failed container
  Warning  Failed     1h59m              kubelet            Error: container has runAsNonRoot and image will run as root

Interpretation: The Last State shows Exit Code 1 and an event indicates a security context mismatch (runAsNonRoot conflict) — this points to a pod security or container image issue preventing the container from starting.

Step 4: Inspect container logs for parsing or startup errors

What we are doing: Pull the latest logs from the failing container to identify the application-level error (for example, a config parsing error or a port bind failure). Logs are the primary evidence for parser exceptions or missing dependencies.

magctl logs syslog-ingest-0 -n assurance

What just happened: magctl streams the container stdout/stderr logs. Application logs typically include startup checks (config validation) and runtime exceptions. When a log shows "bind: permission denied" or "cannot open port 514", it indicates permission or port binding issues.

Real-world note: Syslog traditionally listens on UDP 514. In Linux containers, non-root containers cannot bind ports below 1024 unless configured. This is a common cause for syslog ingestion failures.

Verify:

magctl logs syslog-ingest-0 -n assurance

Expected output:

2025-03-31T09:17:01Z INFO  Starting syslog-ingest v2.3.7
2025-03-31T09:17:01Z INFO  Loading configuration from /etc/syslog/config.yaml
2025-03-31T09:17:01Z ERROR Failed to bind UDP socket on 0.0.0.0:514: permission denied
2025-03-31T09:17:01Z ERROR Syslog ingestion startup failed: Failed to bind UDP socket
2025-03-31T09:17:01Z INFO  Exiting with status 1

Interpretation: The container cannot bind port 514 because it requires root privileges (or an ambient capability) to bind ports <1024. The earlier describe hinted at runAsNonRoot; the logs confirm a port bind permission issue.
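The bind failure can be reproduced outside the cluster with a short socket test (run it as a non-root user; as root or with CAP_NET_BIND_SERVICE, port 514 will bind successfully):

```python
import socket

def try_bind_udp(port):
    """Attempt to bind a UDP socket; return 'bound' or an error category."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind(("0.0.0.0", port))
        return "bound"
    except PermissionError:
        return "permission denied"  # ports <1024 need root or CAP_NET_BIND_SERVICE
    except OSError:
        return "bind failed"        # e.g. address already in use
    finally:
        sock.close()

print(try_bind_udp(514))   # typically 'permission denied' as non-root
print(try_bind_udp(5514))  # unprivileged port: normally 'bound'
```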

Step 5: Remediate by patching deployment to use a high port or add permission, then restart

What we are doing: Apply an immediate remediation by patching the deployment so the syslog container listens on a high port (>1024) or adjust the container security context. Here we change the container argument to use 5514 (example) and restart the pod. Using a non-privileged port avoids requiring root in the container.

magctl set env deployment/syslog-ingest -n assurance SYSLOG_PORT=5514
magctl rollout restart deployment/syslog-ingest -n assurance

What just happened: The first command sets an environment variable that the application reads to choose the bind port (SYSLOG_PORT=5514). The rollout restart triggers a redeploy so new pods use the updated environment. Because 5514 > 1024, a non-root container can bind to it without special capabilities.

Real-world note: In production, changing ports may require updates to network device syslog targets or firewall rules. Coordinate with NOC and update device configurations to send to the new port, or alternatively apply Linux capabilities to allow binding to privileged ports.

Verify:

magctl get pods -n assurance
magctl logs syslog-ingest-0 -n assurance

Expected output:

magctl get pods -n assurance
NAME                                 READY   STATUS    RESTARTS   AGE
syslog-ingest-0                      1/1     Running   0          2m
snmp-trap-receiver-0                 1/1     Running   0          12d
ndp-assurance-0                      1/1     Running   0          12d

magctl logs syslog-ingest-0 -n assurance
2025-03-31T09:22:05Z INFO  Starting syslog-ingest v2.3.7
2025-03-31T09:22:05Z INFO  Using SYSLOG_PORT=5514
2025-03-31T09:22:05Z INFO  Bound UDP socket on 0.0.0.0:5514
2025-03-31T09:22:05Z INFO  Syslog ingestion ready

Interpretation: The pod is now Running and the logs show a successful bind to UDP 5514. To complete the fix, update device syslog targets to point to Catalyst_Center_IP:5514 or use a cluster-level port translation.
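On the application side, the fix works roughly as sketched below (the lesson states that the app reads SYSLOG_PORT; the parsing and bind logic here is illustrative, not the actual syslog-ingest source):

```python
import os
import socket

# Simulate what `magctl set env` effectively does for the container:
os.environ["SYSLOG_PORT"] = "5514"

# Illustrative startup logic: honor SYSLOG_PORT if set, else default to 514.
port = int(os.environ.get("SYSLOG_PORT", "514"))
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", port))  # 5514 > 1024, so no root privileges required
print(f"Bound UDP socket on 0.0.0.0:{port}")
sock.close()
```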

Step 6: Verify NetFlow and SNMP trap collectors

What we are doing: Confirm NetFlow and SNMP trap collector pods are Running and show recent "received flow" or "trap received" log entries to ensure data is arriving.

magctl get pods -n assurance
magctl logs netflow-collector-0 -n assurance
magctl logs snmp-trap-receiver-0 -n assurance

What just happened: You checked pod health and read collector logs for evidence of ingestion. For NetFlow, look for UDP bind success and flow records parsed; for SNMP traps, look for received trap processing messages.

Real-world note: NetFlow uses UDP high ports; some collectors support multiple flow versions. If collectors run behind a load balancer, ensure session-affinity is correct to avoid packet distribution issues.

Verify:

magctl get pods -n assurance
magctl logs netflow-collector-0 -n assurance
magctl logs snmp-trap-receiver-0 -n assurance

Expected output:

magctl get pods -n assurance
NAME                     READY   STATUS    RESTARTS   AGE
netflow-collector-0      1/1     Running   0          12d
snmp-trap-receiver-0     1/1     Running   0          12d

magctl logs netflow-collector-0 -n assurance
2025-03-31T09:00:01Z INFO  NetFlow collector started, listening on UDP 2055
2025-03-31T09:01:03Z INFO  Received NetFlow record from 10.1.1.10:53656, records=45
2025-03-31T09:02:10Z INFO  Parsed flow: src=10.1.2.5 dst=192.168.10.20 bytes=1040 proto=6

magctl logs snmp-trap-receiver-0 -n assurance
2025-03-31T08:59:58Z INFO  SNMP Trap receiver started on 162/udp
2025-03-31T09:05:13Z INFO  Received SNMPv2-Trap from 10.1.3.7:162, oid=1.3.6.1.6.3.1.1.5.3

Interpretation: NetFlow and SNMP collectors are healthy and receiving traffic; they do not require remediation.
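What a healthy collector's receive path looks like can be sketched minimally (an ephemeral local port stands in for UDP 2055; the two-byte version field at the start of a NetFlow v9 export is real, but the rest of the payload here is padding):

```python
import socket

# Receiver: what a collector does after a successful bind.
recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv_sock.bind(("127.0.0.1", 0))        # 0 = ephemeral port (2055 in the lab)
recv_sock.settimeout(5.0)               # avoid hanging if no packet arrives
collector_addr = recv_sock.getsockname()

# Sender: a fake export with a NetFlow v9-style header (version = 9).
send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_sock.sendto((9).to_bytes(2, "big") + b"\x00" * 18, collector_addr)

data, source = recv_sock.recvfrom(2048)
version = int.from_bytes(data[:2], "big")
print(f"Received {len(data)}-byte export from {source[0]}, version={version}")
recv_sock.close()
send_sock.close()
```

A real collector would then parse templates and flow records from the remainder of the packet; the "Received NetFlow record" log lines above correspond to this receive loop succeeding.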

Verification Checklist

  • Check 1: Pod health — run magctl get pods --all-namespaces and ensure Assurance pods show READY state and STATUS Running.
  • Check 2: Syslog binding — magctl logs syslog-ingest-0 -n assurance should show "Bound UDP socket" and no binding errors.
  • Check 3: NetFlow & SNMP reception — magctl logs netflow-collector-0 -n assurance and magctl logs snmp-trap-receiver-0 -n assurance should show recent "Received" entries.

Common Mistakes

Symptom | Cause | Fix
Syslog pod CrashLoopBackOff with "permission denied" binding to 514 | Container runs as non-root but tries to bind a privileged port (<1024) | Change the application to use a high port (>1024) or adjust the security context/capabilities; then restart the pod
No NetFlow records in dashboard though collector is Running | Devices sending flows to the wrong IP/port, or a firewall blocking UDP | Verify device NetFlow settings and network ACLs; ensure the collector's listening port matches the device config
SNMP traps not appearing but trap receiver pod is Running | SNMP version mismatch or community/auth configuration error | Confirm the trap version and community/credentials on the device and in the receiver config; check logs for parsing errors
magctl shows pod Pending | Node resources exhausted (CPU/memory) or pod unschedulable | Inspect node capacity and pod resource requests; free resources or scale nodes; check events in magctl describe pod
Logs show "ImagePullBackOff" | Image not present or registry authentication failure | Verify the image repository, tags, and registry secrets; correct imagePullSecrets or network access

Key Takeaways

  • Always start troubleshooting at the microservice (pod) level: pod status, describe events, and container logs reveal most ingestion failures.
  • UDP-based telemetry (syslog, NetFlow, SNMP traps) fails silently when the collector is down — packet drops don't generate retransmits; logs and pod readiness are the only immediate signals.
  • Common real-world causes are permission/port binding issues, misrouted telemetry targets, and resource constraints on nodes.
  • Use maglev/magctl as your cluster control plane tools: list pods, describe failing pods, inspect logs, then restart or patch deployments to remediate. In production, coordinate port or capability changes with device owners and update network ACLs where needed.

Final tip: Document every remediation step and maintain change control when you alter ports or restart services. Assurance telemetry loss can mask other failures — re-validate dashboards and alerts after recovery.