Root Cause Analysis with AI
Objective
In this lesson you will learn to use AI-driven root cause analysis data (telemetry and AIOps insights) to find the cause of network problems. You will inspect the telemetry connection state and subscription health, interpret status messages produced by the Machine Reasoning Engine, and practice the verification steps an engineer uses in production to confirm a collector connection and subscription integrity. This matters because AI correlation and dependency mapping require reliable, complete telemetry; without it, root cause analysis will be incomplete or misleading. Real-world scenario: an enterprise wireless deployment shows intermittent client drops. Before AIOps can correlate events and recommend fixes (for example, for a bad AP driver or rogue interference), you must confirm that the telemetry pipeline (device → collector → analysis) is healthy.
Topology
ASCII diagram (reference topology from Lesson 1, showing the wireless controller and telemetry collector with exact IPs):
+-----------------------------+                         +---------------------------+
| POD4 - C9800 - CL1          |                         | Telemetry Collector       |
| (Wireless Controller)       |                         | (Telemetry / AIOps)       |
|                             |                         |                           |
| Mgmt0: 192.168.4.7          |------------------------>| eth0: 172.100.1.53:25103  |
|                             |    TCP/TLS telemetry    |                           |
+-----------------------------+                         +---------------------------+
Tip: The controller source IP for telemetry in this lab is 192.168.4.7; the collector peer IP is 172.100.1.53 on port 25103. These exact addresses are used in all verification commands below.
Device Table
| Device | Role | Management IP / Listener |
|---|---|---|
| POD4 - C9800 - CL1 | Wireless Controller / Telemetry source | 192.168.4.7 |
| Telemetry Collector | Telemetry sink / AIOps backend | 172.100.1.53:25103 |
Quick Recap
- This lesson continues from Lesson 1 and uses the same core devices: the POD4 C9800 controller as the telemetry source and the telemetry collector on 172.100.1.53.
- We will not add new devices or IPs; we will inspect the telemetry pipeline and use the controller's telemetry show commands to confirm health and diagnose issues.
- Where a GUI action is required (for example, a Force Push of telemetry settings), the CLI verification commands here confirm the result.
Key Concepts (theory + practical implications)
- Telemetry Connection Lifecycle
  - Theory: Devices form a secure transport (typically TLS over TCP) to a telemetry collector. The device maintains a persistent connection so the collector receives streaming sensor/state data.
  - Practical: When the transport is up, AIOps receives live events; when it's down, correlation and dependency mapping cannot include that device.
- Subscriptions vs. Connection
  - Theory: A telemetry connection is the transport; subscriptions are the content negotiated (what data streams are enabled). A healthy deployment needs both an active connection and valid subscriptions.
  - Practical: A controller can have a transport up but invalid subscriptions (e.g., config mismatch), producing partial visibility.
- Status Codes and Meanings
  - Theory: Common status indicators include "Active" (connection and subscription OK), "Connecting" (transport establishment in progress; could be TLS certificate or firewall issues), and "N/A" (telemetry configuration missing).
  - Practical: Knowing what each status means directs the troubleshooting action: firewall/cert checks for Connecting; configuration push for N/A.
- AI Root Cause Analysis Dependency
  - Theory: The Machine Reasoning Engine correlates events across devices using telemetry and topology metadata. Missing sensor data or partial subscriptions break the correlation graphs.
  - Practical: Before trusting AIOps recommendations, validate telemetry completeness (connection + all expected subscriptions valid).
Analogy: Think of telemetry as a set of live cameras (connections), each with a set of lenses (subscriptions). If a camera is connected but a lens is missing or invalid, you see incomplete images, and AIOps cannot infer the full sequence of events.
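The status-to-action mapping described under "Status Codes and Meanings" can be sketched as a small lookup table. This is only an illustration: the three status strings come from this lesson, while the `NEXT_ACTION` table and the `next_action` helper are hypothetical names, and the action wording paraphrases the lesson's guidance.

```python
# Hypothetical lookup table. The status strings are the ones named in this
# lesson; the suggested actions paraphrase its troubleshooting guidance.
NEXT_ACTION = {
    "Active": "Transport is up; confirm that all subscriptions are valid.",
    "Connecting": "Check reachability, firewall rules for the collector port, "
                  "and TLS certificates/trust chain.",
    "N/A": "Telemetry configuration is missing; re-push telemetry settings.",
}


def next_action(status: str) -> str:
    """Return the suggested next troubleshooting step for a status string."""
    return NEXT_ACTION.get(
        status.strip(),
        "Unrecognized status; inspect the device manually.",
    )


print(next_action("Connecting"))
```

Keeping the mapping in data rather than in branching logic makes it easy to extend if your platform reports additional status strings.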
Step-by-step configuration / verification
Each step below follows the same pattern: what we are doing, the exact command(s) to run, what just happened, a real-world note, and a verification command with expected output.
Step 1: Verify telemetry transport (connection)
What we are doing: We confirm whether the controller has an active TCP/TLS session to the telemetry collector. This checks the transport layer for streaming telemetry — if it’s down, the collector receives nothing.
show telemetry connection all
What just happened: The command lists telemetry transport sessions the device maintains to collectors. It shows the collector peer address, the destination port (25103 in this lab), the VRF used, the controller source address, and a human-readable state such as "Active" or "Connecting". On a healthy transport you should see "Active" and "Connection up."
Real-world note: If the transport shows "Connecting", the next troubleshooting steps are network reachability tests, firewall inspection, and certificate validation between device and collector.
Verify:
Telemetry connections
Index Peer Address         Port  VRF Source Address  State          State Description
----- -------------------- ----- --- --------------- -------------- ---------------------
109   172.100.1.53         25103 0   192.168.4.7     Active         Connection up
- Interpretation: An entry like the one above indicates the device at 192.168.4.7 has established a telemetry session to 172.100.1.53:25103 and the connection is up.
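If you collect this output in a script, the check in Step 1 can be automated. The sketch below assumes the seven-column layout shown above and that the output has already been captured as text (how you capture it, for example over SSH, is outside this sketch). The names `parse_connections` and `collector_is_up` are hypothetical helpers, not part of any Cisco tooling.

```python
from typing import Dict, List

KEYS = ("index", "peer", "port", "vrf", "source", "state", "description")


def parse_connections(output: str) -> List[Dict[str, str]]:
    """Parse data rows from captured 'show telemetry connection all' text.

    Assumes the column order shown in this lesson. Splitting on whitespace
    makes the parser indifferent to column alignment.
    """
    rows = []
    for line in output.splitlines():
        # Split into at most 7 fields so a multi-word description stays whole.
        fields = line.split(None, 6)
        # Data rows start with a numeric index; headers and rules do not.
        if len(fields) >= 6 and fields[0].isdigit():
            fields += [""] * (7 - len(fields))  # description may be absent
            rows.append(dict(zip(KEYS, fields)))
    return rows


def collector_is_up(output: str, peer: str, port: str) -> bool:
    """True if a session to peer:port exists and its state is 'Active'."""
    return any(
        r["peer"] == peer and r["port"] == port and r["state"] == "Active"
        for r in parse_connections(output)
    )


SAMPLE = """\
Telemetry connections
Index Peer Address Port VRF Source Address State State Description
----- -------------------- ----- --- --------------- -------------- ---------------------
109 172.100.1.53 25103 0 192.168.4.7 Active Connection up
"""

print(collector_is_up(SAMPLE, "172.100.1.53", "25103"))  # prints True
```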
Step 2: Verify telemetry subscriptions (what data is being sent)
What we are doing: We check the subscription summary to confirm the expected telemetry streams exist and are valid. Subscriptions are the logical collections of sensor telemetry; if they are invalid, the collector cannot ingest certain data types.
show telemetry ietf subscription summary
What just happened: This command reports how many subscriptions the device supports and their status: total, valid, invalid, and whether they are dynamic, configured, or permanent. A fully healthy device will typically show zero invalid subscriptions.
Real-world note: In production, a large number of subscriptions (e.g., dozens or hundreds) is common; ensure the device's "Maximum supported" limit (here 128) is not being exceeded by your configuration or auto-generated subscriptions.
Verify:
Subscription Summary
====================
Maximum supported: 128
Subscription  Total   Valid   Invalid
---------------------------------------
All           112     112     0
Dynamic       0       0       0
Configured    112     112     0
Permanent     0       0       0
- Interpretation: This output shows 112 subscriptions configured and all are valid — the device is successfully sending the expected telemetry streams.
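The two numbers that matter in this summary are the invalid count and the headroom against "Maximum supported". A minimal sketch of extracting both from captured text follows; it assumes the layout shown above, and `parse_subscription_summary` is a hypothetical helper name.

```python
import re
from typing import Dict, Tuple

ROW = re.compile(r"\s*(All|Dynamic|Configured|Permanent)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")
MAXIMUM = re.compile(r"\s*Maximum supported:\s*(\d+)")


def parse_subscription_summary(output: str) -> Tuple[int, Dict[str, Dict[str, int]]]:
    """Return (maximum_supported, per-category counts) from summary text."""
    maximum = 0
    counts: Dict[str, Dict[str, int]] = {}
    for line in output.splitlines():
        m = MAXIMUM.match(line)
        if m:
            maximum = int(m.group(1))
            continue
        m = ROW.match(line)
        if m:
            counts[m.group(1)] = {
                "total": int(m.group(2)),
                "valid": int(m.group(3)),
                "invalid": int(m.group(4)),
            }
    return maximum, counts


SAMPLE = """\
Subscription Summary
====================
Maximum supported: 128
Subscription Total Valid Invalid
---------------------------------------
All 112 112 0
Dynamic 0 0 0
Configured 112 112 0
Permanent 0 0 0
"""

maximum, counts = parse_subscription_summary(SAMPLE)
headroom = maximum - counts["All"]["total"]  # subscriptions left before the limit
print(counts["All"]["invalid"], headroom)  # prints: 0 16
```

With the lab values, 112 of 128 subscriptions are in use, which leaves room for 16 more before the device limit is reached.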
Step 3: Interpret status messages and map to likely root causes
What we are doing: We translate status descriptions into actionable root cause hypotheses (for example: "Connecting – Cert/FW issue" suggests TLS or firewall issues). This step is purely investigative: it maps observed telemetry states to next troubleshooting steps.
show telemetry connection all
show telemetry ietf subscription summary
What just happened: Running both commands together gives a full view: the transport level (connection) and the application level (subscriptions). If the transport is "Active" and subscriptions are valid, the telemetry pipeline is healthy; any deviation indicates a focused next step (network path, firewall, certificates, or configuration push).
Real-world note: The Machine Reasoning Engine will often annotate a device with one of these summarized statuses so that an operator can quickly see whether missing telemetry is due to a transport (network/port), a security (cert), or a config issue.
Verify:
Telemetry connections
Index Peer Address         Port  VRF Source Address  State          State Description
----- -------------------- ----- --- --------------- -------------- ---------------------
109   172.100.1.53         25103 0   192.168.4.7     Active         Connection up
Subscription Summary
====================
Maximum supported: 128
Subscription  Total   Valid   Invalid
---------------------------------------
All           112     112     0
Dynamic       0       0       0
Configured    112     112     0
Permanent     0       0       0
- Interpretation: No action is required; both transport and subscriptions are healthy. If you saw "Connecting" instead of "Active", the next steps would be standard reachability checks (ping/traceroute) and certificate and trust-chain validation on both ends. Those commands are platform-specific and must be run where supported.
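When the state is stuck at "Connecting", one quick way to separate a firewall or routing problem from a certificate problem is a plain TCP probe of the collector port. The sketch below uses only Python's standard `socket` module; `tcp_port_open` is a hypothetical helper. It must be run from a host that shares the controller's network path to the collector (an assumption about your environment), and it does not validate TLS: a successful TCP connect only tells you the port is reachable.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout.

    This tests only the TCP handshake. A True result with the telemetry
    state still at "Connecting" points toward the TLS/certificate layer.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, unreachable hosts
        return False


# Example call with this lab's collector address (not executed here):
#   tcp_port_open("172.100.1.53", 25103)
```

A False result suggests checking ACLs, firewall rules, and routing first; a True result suggests moving on to certificate and trust-chain validation.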
Step 4: Force-push telemetry settings (GUI action) and confirm via CLI
What we are doing: When telemetry configuration is missing or stale (N/A status), the usual remedy is to re-push telemetry settings from the inventory or management UI. In many production environments you perform a Force Push from the management plane and then confirm on the CLI that the controller reports the expected state.
# GUI action (do this in the management UI):
# Inventory > Actions > Telemetry > Update Telemetry Settings
# Then verify on the controller using CLI:
show telemetry ietf subscription summary
show telemetry connection all
What just happened: The GUI Force Push updates the telemetry subscription configuration on the device. The CLI verification commands confirm that the device accepted the new subscriptions (valid count increases from 0 to the expected number) and that a transport to the collector is established.
Real-world note: Management plane force-push is often needed after software upgrades or when AP join profiles are changed; always verify via CLI that the device now reports valid subscriptions.
Verify:
Subscription Summary
====================
Maximum supported: 128
Subscription  Total   Valid   Invalid
---------------------------------------
All           112     112     0
Dynamic       0       0       0
Configured    112     112     0
Permanent     0       0       0
Telemetry connections
Index Peer Address         Port  VRF Source Address  State          State Description
----- -------------------- ----- --- --------------- -------------- ---------------------
109   172.100.1.53         25103 0   192.168.4.7     Active         Connection up
- Interpretation: After the push, both the subscription summary and transport show healthy values.
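A push may not be reflected on the device immediately, so a scripted check typically re-reads the summary a few times before declaring failure. The sketch below shows that polling pattern under stated assumptions: `fetch` is a hypothetical callable you supply that returns the current "All" row as a dict with `valid` and `invalid` counts, and the attempt count and delay are illustrative defaults, not values from any product documentation.

```python
import time
from typing import Callable, Dict


def wait_for_valid(
    fetch: Callable[[], Dict[str, int]],
    expected: int,
    attempts: int = 5,
    delay: float = 10.0,
) -> bool:
    """Poll until 'expected' subscriptions are valid and none are invalid.

    Returns True as soon as the condition holds, or False after the last
    attempt. 'fetch' is supplied by the caller (hypothetical here).
    """
    for attempt in range(attempts):
        summary = fetch()
        if summary["valid"] == expected and summary["invalid"] == 0:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before re-reading the device state
    return False


# Demonstration with a fake device that converges on the second read.
readings = iter([
    {"valid": 0, "invalid": 0},      # before the push takes effect
    {"valid": 112, "invalid": 0},    # after the push
])
print(wait_for_valid(lambda: next(readings), expected=112, delay=0))  # prints True
```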
Step 5: Intelligent Capture and AIOps analysis readiness
What we are doing: Confirm the controller is prepared for Intelligent Capture (iCAP) and anomaly capture features which provide packet-level and time-series evidence for root cause analysis. If AP stats and iCAP are not pushed correctly, AIOps cannot triangulate client/AP events.
show telemetry connection all
show telemetry ietf subscription summary
What just happened: These verification commands confirm that the device has the telemetry transport and subscriptions active — prerequisites for iCAP and other advanced capture workflows. The management UI setting for iCAP is enabled per-site or per-AP; when changed for newly added AP join profiles you may need to re-push settings (disable/enable in Assurance > Settings > Intelligent Capture Settings) and then confirm subscriptions via CLI.
Real-world note: AP stats collection is limited on the analysis platform (e.g., an internal limit of 1000 APs for AP stats). In large campuses, plan data retention and sampling to avoid oversubscription.
Verify:
Telemetry connections
Index Peer Address         Port  VRF Source Address  State          State Description
----- -------------------- ----- --- --------------- -------------- ---------------------
109   172.100.1.53         25103 0   192.168.4.7     Active         Connection up
Subscription Summary
====================
Maximum supported: 128
Subscription  Total   Valid   Invalid
---------------------------------------
All           112     112     0
Dynamic       0       0       0
Configured    112     112     0
Permanent     0       0       0
- Interpretation: With both transport and subscriptions valid, iCAP and anomaly detection features can stream data to the AIOps backend for correlation and root cause analysis.
Verification Checklist
- Check 1 (transport verification): run `show telemetry connection all` and confirm a line with Peer Address 172.100.1.53, Port 25103, Source Address 192.168.4.7, and State "Active" with "Connection up".
- Check 2 (subscription verification): run `show telemetry ietf subscription summary` and confirm "All 112 112 0" (Total/Valid/Invalid).
- Check 3 (post-config push validation): after performing a Force Push in the management UI, re-run the CLI commands above and confirm that the transport state is Active and all subscriptions are Valid.
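The checklist can be expressed as a single function that returns the list of failed checks, which is convenient for a pre-flight gate before acting on AIOps recommendations. This is a sketch only: it assumes the CLI output has already been parsed into two dicts, and the key names (`peer`, `port`, `source`, `state`, `total`, `valid`, `invalid`) and the `run_checklist` function are hypothetical. The expected values are the ones used in this lab.

```python
from typing import Dict, List

# Lab values from this lesson.
EXPECTED_PEER = "172.100.1.53"
EXPECTED_PORT = "25103"
EXPECTED_SOURCE = "192.168.4.7"
EXPECTED_SUBSCRIPTIONS = 112


def run_checklist(conn: Dict[str, str], subs: Dict[str, int]) -> List[str]:
    """Return a list of human-readable failures; empty means all checks pass."""
    failures = []

    # Check 1: transport to the expected collector is Active.
    if (conn.get("peer"), conn.get("port")) != (EXPECTED_PEER, EXPECTED_PORT):
        failures.append("Check 1: no session to the expected collector")
    elif conn.get("source") != EXPECTED_SOURCE:
        failures.append("Check 1: unexpected source address")
    elif conn.get("state") != "Active":
        failures.append(f"Check 1: state is {conn.get('state')!r}, expected 'Active'")

    # Check 2: all expected subscriptions exist and are valid.
    if subs.get("total", 0) != EXPECTED_SUBSCRIPTIONS:
        failures.append("Check 2: subscription total differs from expected")
    if subs.get("invalid", 0) > 0 or subs.get("valid", 0) != subs.get("total", 0):
        failures.append("Check 2: one or more subscriptions are not valid")

    return failures


healthy_conn = {"peer": "172.100.1.53", "port": "25103",
                "source": "192.168.4.7", "state": "Active"}
healthy_subs = {"total": 112, "valid": 112, "invalid": 0}
print(run_checklist(healthy_conn, healthy_subs))  # prints []
```

Check 3 is the same evaluation repeated after a Force Push, so it needs no separate logic: call the function again with freshly collected values.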
Common Mistakes
| Symptom | Cause | Fix |
|---|---|---|
| Telemetry connection shows "Connecting" | Firewall blocking port 25103 or TLS handshake failing (certificate/trust issue) | Verify network ACLs and confirm collector is reachable on 172.100.1.53:25103; validate certificates and trust chain on both sides |
| Subscriptions show zero or "Invalid" | Telemetry configuration not pushed or mismatched schema between device and collector | Force push telemetry settings from Inventory > Actions > Telemetry > Update Telemetry Settings and re-check subscription validity |
| iCAP captures not appearing | Intelligent Capture settings not applied to new AP join profiles or APs exceeding platform limits | Re-apply iCAP settings via Assurance > Settings > Intelligent Capture Settings (disable & enable for newly added profiles) and confirm AP counts under platform limits |
| Partial visibility in AIOps (some events missing) | Device has transport up but only a subset of subscriptions valid | Inspect subscription summary to find which subscriptions are invalid; re-push or correct config for the invalid streams |
Key Takeaways
- Confirm both the transport (connection) and subscriptions (what is being sent) — AIOps requires both to perform accurate root cause analysis.
- Common telemetry faults map to network (firewall/port), security (certificates), and configuration (missing telemetry settings); interpreting the status text (Active / Connecting / N/A) points you to the right domain to troubleshoot.
- After any management-plane change (push, upgrade, or AP join-profile change), always verify via `show telemetry connection all` and `show telemetry ietf subscription summary` to ensure the AIOps pipeline is healthy.
- In production environments, plan telemetry scale (maximum supported subscriptions) and iCAP scope (AP limits) to avoid overloading the analytics backend and to ensure dependable root cause reasoning.
Warning: Always validate telemetry health before acting on AI-suggested remedies. Poor or missing telemetry can cause incorrect or incomplete recommendations from the Machine Reasoning Engine.
This completes Lesson 3: "Root Cause Analysis with AI". In the next lesson we'll walk through interpreting AIOps event correlation graphs and using time-series capture evidence to validate the root cause hypotheses you derive from telemetry.