LLMs for Network Troubleshooting
Objective
In this lesson you will learn how to use Large Language Models (LLMs) embedded in an SD‑WAN management plane to perform natural‑language troubleshooting queries and accept AI‑driven recommendations. We will enable an AI assistant, create a network trace using NWPI metadata, ask the assistant to analyze path quality and recommend a path change, then verify and (optionally) apply that recommendation. This matters in production because operators can shorten mean‑time‑to‑repair by using natural language to surface root causes and precise policy actions across hundreds of sites.
Topology (Quick Recap from Lesson 1)
The SD‑WAN fabric used in this lab contains the management and control plane appliances plus two Edge locations (Branch and Data Center). This lesson does not add new devices; we reuse the same devices from Lesson 1.
ASCII topology (interfaces and IPs shown):
Internet
|
ISP Router (203.0.113.1)
|
Branch Edge (BR-EDGE) Gi0/0 203.0.113.2
BR-EDGE Gi0/1 10.1.1.1/24 ------- LAN 10.1.1.0/24
|
SD-WAN Fabric
vManage (SDMGR) 10.0.0.10
vSmart (Control) 10.0.0.11
vBond (Orchestration) 10.0.0.12
|
DC Edge (DC-EDGE) Gi0/0 198.51.100.2
DC-EDGE Gi0/1 10.2.2.1/24 ------- LAN 10.2.2.0/24
|
Internet
ISP Router (198.51.100.1)
Device table:
| Device | Role | Management IP |
|---|---|---|
| vManage | SD-WAN Manager / Analytics / AI Assistant | 10.0.0.10 |
| vSmart | Control Plane Controller | 10.0.0.11 |
| vBond | Orchestrator | 10.0.0.12 |
| BR-EDGE | Branch Edge Router | 203.0.113.2 |
| DC-EDGE | Data Center Edge Router | 198.51.100.2 |
Tip: In this lab, the management UI runs at https://lab.nhprep.com (mapped to vManage at 10.0.0.10); use the shared lab password Lab@123 for the hands-on exercises.
Key Concepts (theory + behavior)
- LLMs and AI Assistant: An LLM provides a natural‑language front end to operational data. In SD‑WAN Manager, the assistant accepts human queries (for example, "why is Sales app slow between Branch and DC?") and generates actionable readouts by combining documentation, telemetry, and live network state.
- Why this matters: Operators can reduce time spent translating CLI output into insights.
- NWPI (Network‑Wide Path Insight): NWPI is metadata embedded in SD‑WAN headers that enables routers to collect per‑flow path quality and telemetry. The first router writes NWPI metadata; subsequent routers use it to correlate flow telemetry back to the manager.
- Packet behavior: when NWPI tracing is active, the manager instructs an edge to tag flows; edge and transit devices emit correlated flow records to SD‑WAN Manager for path analysis.
- RAG (Retrieval‑Augmented Generation): LLMs often retrieve relevant documents/telemetry (the “retrieval” step) and then generate a natural‑language summary. The assistant queries telemetry indices (time series, path metrics) and synthesizes an explanation.
- Practical: This reduces the risk of hallucination because the assistant grounds its answers in retrieved live telemetry and reports a confidence score alongside each recommendation.
- Closed‑loop Automation: When the assistant recommends a policy change (for example, steering an App Group to a different path), the system can create a local policy change and apply it automatically.
- Production caution: closed‑loop actions must be controlled — audit every automated change, and define revert behavior in case centralized policy changes afterward.
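The retrieve-then-generate flow described above can be sketched in a few lines of Python. Everything here is illustrative: the index contents, record fields, and the prompt format are assumptions for this sketch, and a real assistant would query live telemetry stores and an LLM endpoint instead.

```python
# Minimal RAG sketch: pull matching telemetry records from each index
# (the "retrieval" step), then ground the operator's question in that
# evidence before generation. Index names mirror the lab output; the
# record fields are illustrative assumptions.

def retrieve(indexes, flow):
    """Collect telemetry records matching the flow from every index."""
    hits = []
    for name, records in indexes.items():
        hits += [r for r in records if r["flow"] == flow]
    return hits

def build_prompt(question, evidence):
    """Embed retrieved telemetry in the prompt so answers stay grounded."""
    lines = [f"- {e['path']}: latency={e['latency_ms']}ms loss={e['loss_pct']}%"
             for e in evidence]
    return question + "\nEvidence:\n" + "\n".join(lines)

indexes = {
    "TelemetryIndex": [
        {"flow": "10.1.1.10->10.2.2.20", "path": "private1",
         "latency_ms": 45, "loss_pct": 2.3},
    ],
    "NWPIIndex": [
        {"flow": "10.1.1.10->10.2.2.20", "path": "private2",
         "latency_ms": 30, "loss_pct": 0.4},
    ],
}

evidence = retrieve(indexes, "10.1.1.10->10.2.2.20")
prompt = build_prompt("Which path is better for this flow?", evidence)
print(prompt)
```

The generation step (sending the prompt to an LLM) is omitted; the point is that the model only ever sees evidence that was actually retrieved from the fabric.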
Step-by-step configuration
Step 1: Enable the AI Assistant in SD‑WAN Manager
What we are doing: Enable the built‑in AI assistant feature so operators can query the fabric in natural language. This step creates the assistant service and authorizes access from the vManage UI. In production, this is the first step before any AI queries can be issued.
configure
system
assistant enable
exit
exit
What just happened: The assistant enable command (issued under system configuration) turns on the AI assistant service inside vManage. The management plane now listens for assistant queries from authorized users and populates the correlation indices used by the assistant (telemetry, NWPI metadata endpoints). Enabling the assistant also prepares the manager to accept natural‑language requests and to fetch telemetry for RAG.
Real-world note: In production, enabling the assistant is often controlled by policy and privacy review — ensure telemetry sharing complies with organization rules.
Verify:
show assistant status
Assistant: Enabled
Version: 1.0.0
Status: Ready
Active Queries: 0
RAG Indexes: TelemetryIndex, NWPIIndex
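If you drive this check from a script rather than the CLI, the status output above parses cleanly into a dictionary. This is a sketch based on the lab output format only; real CLI output may differ, and the parser is not a product API.

```python
# Parse the `show assistant status` output shown above into a dict so
# automation can gate on it. Splits each line at the first colon; the
# raw text below is copied from the lab example.

def parse_status(raw):
    status = {}
    for line in raw.strip().splitlines():
        key, _, value = line.partition(":")
        status[key.strip()] = value.strip()
    return status

raw = """\
Assistant: Enabled
Version: 1.0.0
Status: Ready
Active Queries: 0
RAG Indexes: TelemetryIndex, NWPIIndex
"""

status = parse_status(raw)
assert status["Assistant"] == "Enabled"  # gate any automation on this
print(status["RAG Indexes"].split(", "))  # -> ['TelemetryIndex', 'NWPIIndex']
```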
Step 2: Create an NWPI trace on the Branch Edge
What we are doing: Instruct the SD‑WAN Manager to create an NWPI trace starting at the Branch Edge (BR-EDGE). This writes NWPI metadata into the packet headers for sampled flows so that subsequent devices can report correlated telemetry back to the manager. This trace is required for the assistant to produce precise path‑level insights.
request nwpi trace create device BR-EDGE flow-match src-ip 10.1.1.10 dst-ip 10.2.2.20 duration 300
What just happened: The manager sent a control message to BR-EDGE to enable NWPI tagging for flows matching the 10.1.1.10 → 10.2.2.20 tuple for 300 seconds. BR-EDGE will now insert NWPI metadata into outgoing packets for that flow, and intermediate SD‑WAN routers will send correlated path telemetry reports to the manager, allowing precise path‑quality reconstruction.
Real-world note: Use focused traces (single app or 5‑tuple) to avoid overwhelming telemetry ingestion in production.
Verify:
show nwpi trace status
TraceID: 101
Device: BR-EDGE
FlowMatch: src-ip 10.1.1.10 dst-ip 10.2.2.20
StartTime: 2025-04-02T10:00:00Z
DurationSeconds: 300
Status: Active
SamplesCollected: 124
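Before querying the assistant, it is worth waiting until the trace has collected enough samples — low sample counts are a common cause of low-confidence recommendations (see Common Mistakes below). A polling loop for that might look like the following sketch; get_trace_status() is a hypothetical stand-in for an API call or parsed CLI output, and the thresholds are illustrative.

```python
# Sketch: wait for an NWPI trace to accumulate enough samples before
# asking the assistant for analysis. get_trace_status() is a stand-in;
# a real script would query the manager's API.
import time

def get_trace_status(trace_id):
    # Stand-in returning the lab values from `show nwpi trace status`.
    return {"TraceID": trace_id, "Status": "Active", "SamplesCollected": 124}

def wait_for_samples(trace_id, minimum=50, timeout_s=300, poll_s=10):
    """Return True once the trace is Active with at least `minimum` samples."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        st = get_trace_status(trace_id)
        if st["Status"] == "Active" and st["SamplesCollected"] >= minimum:
            return True
        time.sleep(poll_s)
    return False

ready = wait_for_samples(101)
print("trace ready:", ready)
```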
Step 3: Ask the AI Assistant a troubleshooting question
What we are doing: Query the assistant in plain English to analyze recent path quality between the Branch and the DC for the traced flow and ask for recommendations. This demonstrates how operators can obtain correlated analysis without parsing multiple logs.
Assistant> "Show me path quality and recent loss/latency/jitter for flow src 10.1.1.10 dst 10.2.2.20 over the last 5 minutes. If there are better paths, recommend the action and the expected gain."
What just happened: The assistant uses the NWPIIndex and TelemetryIndex to retrieve time‑series metrics (latency, jitter, packet loss) for the sampled flow. It runs its internal predictive path‑recommendation model and synthesizes a natural‑language response including a recommended policy action (for example, steer to private2). The assistant performs RAG: it retrieves telemetry, then generates a clear recommendation.
Real-world note: The assistant includes confidence estimates — always inspect those before applying automatic policy changes.
Verify:
Assistant Response:
Summary: Between BR-EDGE (10.1.1.1) and DC-EDGE (10.2.2.1), the measured path 'private1' shows average latency 45 ms, 2.3% loss, jitter 8 ms. The alternative path 'private2' shows average latency 30 ms, 0.4% loss, jitter 3 ms.
Recommendation: Change AAR policy for AppGroup 'SalesApp' at BR-EDGE to prefer 'private2' (expected improvement: 35% overall quality).
Confidence: 92%
Actionable steps:
1) Create a copy of the AAR policy with new sequence matching AppGroup 'SalesApp'.
2) Apply to site BR-EDGE only.
3) Monitor for 15 minutes and verify application quality.
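To build intuition for where a figure like "expected improvement: 35% overall quality" can come from, here is one simple way a composite quality score could be derived from latency, loss, and jitter. The assistant's actual model is internal; the weights and normalization below are assumptions for this sketch, so the number it produces will not match the assistant's figure.

```python
# Illustrative composite path-quality penalty: a weighted sum of
# latency, loss, and jitter (lower is better). Weights are assumptions,
# not the assistant's internal model; loss is weighted heavily because
# a small loss percentage hurts application quality disproportionately.

def quality_penalty(latency_ms, loss_pct, jitter_ms,
                    w_lat=1.0, w_loss=10.0, w_jit=1.0):
    """Lower penalty means better path quality."""
    return w_lat * latency_ms + w_loss * loss_pct + w_jit * jitter_ms

# Metrics from the assistant's response above.
private1 = quality_penalty(45, 2.3, 8)   # current path -> 76.0
private2 = quality_penalty(30, 0.4, 3)   # candidate path -> 37.0

improvement = (private1 - private2) / private1
print(f"expected quality gain: {improvement:.0%}")  # -> 51% with these weights
```

Different weightings yield different gains, which is exactly why the assistant's confidence score matters: the predicted number depends on the model, and only post-change measurement confirms it.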
Step 4: Generate the recommended policy change (create a policy draft)
What we are doing: Instruct vManage to create a policy draft based on the assistant's recommendation. This is a non‑destructive change that allows review before application — a best practice for closed‑loop automation.
request policy aar draft create name "SalesApp_pref_private2" source-site BR-EDGE app-group SalesApp preference private2
What just happened: The manager created a copy of the AAR (Application Aware Routing) policy, scoped to BR-EDGE for the SalesApp application group, preferring the 'private2' transport. This draft is saved for review; no live traffic steering happens until the policy is applied.
Real-world note: Draft creation enables peer review and prevents accidental network-wide policy changes.
Verify:
show policy aar drafts
Name: SalesApp_pref_private2
SourceSite: BR-EDGE
AppGroup: SalesApp
Preference: private2
Status: Draft
CreatedBy: Assistant (automated)
CreatedAt: 2025-04-02T10:06:12Z
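The review step between draft and apply is where closed-loop automation stays safe. The sketch below shows one way a reviewer (or a pre-apply script) might check that the draft's scope matches operator intent; the field names mirror the draft output above, but this is not a product API.

```python
# Pre-apply review check for an AAR policy draft: confirm the scoping
# fields match what the operator intends before anything is pushed.
# Field names follow the `show policy aar drafts` output in this lesson.

def review_draft(draft, intent):
    """Return a list of mismatches between the draft and operator intent."""
    problems = []
    for field in ("SourceSite", "AppGroup", "Preference"):
        if draft.get(field) != intent.get(field):
            problems.append(
                f"{field}: draft={draft.get(field)} intent={intent.get(field)}")
    if draft.get("Status") != "Draft":
        problems.append("policy is not in Draft state")
    return problems

draft = {"Name": "SalesApp_pref_private2", "SourceSite": "BR-EDGE",
         "AppGroup": "SalesApp", "Preference": "private2", "Status": "Draft"}
intent = {"SourceSite": "BR-EDGE", "AppGroup": "SalesApp",
          "Preference": "private2"}

issues = review_draft(draft, intent)
print("OK to apply" if not issues else issues)
```

A scope mismatch caught here (wrong AppGroup or site) is exactly the failure mode listed under Common Mistakes as "no traffic shifts after applying".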
Step 5: Apply the policy (controlled closed‑loop) and monitor
What we are doing: Apply the approved draft policy to BR-EDGE and monitor path metrics to confirm improvement. This performs the actual traffic steering the assistant recommended.
request policy aar apply name "SalesApp_pref_private2" target-site BR-EDGE
What just happened: The manager pushed the new AAR policy to the BR-EDGE device. BR-EDGE updated its local routing/forwarding decision to prefer the private2 transport for SalesApp flows. Because NWPI tracing was active, the manager will observe the post‑change path metrics and compare them to the predicted improvement.
Real-world note: Apply changes during a maintenance window or gradually (per site) to limit blast radius.
Verify:
show policy aar active
Name: SalesApp_pref_private2
AppliedTo: BR-EDGE
Status: Active
AppliedAt: 2025-04-02T10:09:30Z
show nwpi path analytics flow 10.1.1.10 10.2.2.20
Flow: src 10.1.1.10 dst 10.2.2.20
BeforeChange: private1 AvgLatency=45ms Loss=2.3% Jitter=8ms
AfterChange: private2 AvgLatency=28ms Loss=0.2% Jitter=2ms
MeasuredImprovement: 38% overall quality gain
Verification Checklist
- Check 1: AI Assistant is enabled — verify with show assistant status and expect "Assistant: Enabled".
- Check 2: NWPI trace is active and collecting samples — verify with show nwpi trace status and expect Status: Active and SamplesCollected > 0.
- Check 3: Assistant produced a recommendation — verify by sending the natural‑language query and receiving a response with Recommendation and Confidence fields.
- Check 4: Policy is applied and metrics improved — verify with show nwpi path analytics and expect lower latency/loss on the recommended path.
Common Mistakes
| Symptom | Cause | Fix |
|---|---|---|
| Assistant returns "No telemetry available" | NWPI trace was not created or telemetry indexes are empty | Create NWPI trace with correct flow-match; verify show nwpi trace status |
| Recommendation confidence is low (< 50%) | Insufficient samples or short trace duration | Extend trace duration or broaden flow sampling; rerun query |
| Draft policy created but not applied | Operator did not apply draft or lacks permissions | Apply draft with request policy aar apply and ensure user has policy apply privileges |
| After applying policy, no traffic shifts | Policy scope mismatch (wrong AppGroup or site) | Inspect draft (show policy aar drafts) and confirm AppGroup and SourceSite match intended targets |
Key Takeaways
- An LLM‑based AI Assistant is a powerful natural‑language interface to SD‑WAN telemetry and recommendations; it relies on live telemetry (NWPI) and RAG to produce accurate, actionable results.
- NWPI traces are essential for correlating per‑flow metrics across the fabric; create focused traces to avoid unnecessary telemetry volume.
- Always create and review a policy draft before applying — closed‑loop automation must be controlled and auditable in production.
- Monitor after changes: the assistant predicts gains, but you must validate with measured telemetry and be prepared to revert if behavior differs.
Important: In production environments, coordinate AI-driven policy changes with change control and ensure telemetry privacy and compliance are respected when exposing data to LLMs.