Lesson 3 of 6

LLMs for Network Troubleshooting

Objective

In this lesson you will learn how to use Large Language Models (LLMs) embedded in an SD‑WAN management plane to perform natural‑language troubleshooting queries and accept AI‑driven recommendations. We will enable an AI assistant, create a network trace using NWPI metadata, ask the assistant to analyze path quality and recommend a path change, then verify and (optionally) apply that recommendation. This matters in production because operators can shorten mean‑time‑to‑repair by using natural language to surface root causes and precise policy actions across hundreds of sites.

Topology (Quick Recap from Lesson 1)

The SD‑WAN fabric used in this lab contains the management and control plane appliances plus two Edge locations (Branch and Data Center). This lesson does not add new devices; we reuse the same devices from Lesson 1.

ASCII topology (interfaces and IPs shown):

                             Internet
                                |
                          ISP Router (203.0.113.1)
                                |
  Branch Edge (BR-EDGE) Gi0/0 203.0.113.2
  BR-EDGE Gi0/1 10.1.1.1/24 ------- LAN 10.1.1.0/24
                                |
                           SD-WAN Fabric
  vManage (SDMGR)    10.0.0.10
  vSmart (Control)   10.0.0.11
  vBond (Orchestration) 10.0.0.12
                                |
  DC Edge   (DC-EDGE) Gi0/0 198.51.100.2
  DC-EDGE Gi0/1 10.2.2.1/24 ------- LAN 10.2.2.0/24
                                |
                          Internet
                         ISP Router (198.51.100.1)

Device table:

Device    Role                                        Management IP
--------  ------------------------------------------  -------------
vManage   SD-WAN Manager / Analytics / AI Assistant   10.0.0.10
vSmart    Control Plane Controller                    10.0.0.11
vBond     Orchestrator                                10.0.0.12
BR-EDGE   Branch Edge Router                          203.0.113.2
DC-EDGE   Data Center Edge Router                     198.51.100.2


Key Concepts (theory + behavior)

  • LLMs and AI Assistant: An LLM provides a natural‑language front end to operational data. In SD‑WAN Manager, the assistant accepts human queries (for example, "why is Sales app slow between Branch and DC?") and generates actionable readouts by combining documentation, telemetry, and live network state.
    • Why this matters: Operators can reduce time spent translating CLI output into insights.
  • NWPI (Network‑Wide Path Insights): NWPI is metadata embedded in SD‑WAN headers that enables routers to collect per‑flow path quality and telemetry. The first router writes NWPI metadata; subsequent routers use it to correlate flow telemetry back to the manager.
    • Packet behavior: when NWPI tracing is active, the manager instructs an edge to tag flows; edge and transit devices emit correlated flow records to SD‑WAN Manager for path analysis.
  • RAG (Retrieval Augmented Generation): LLMs often retrieve relevant documents/telemetry (the “retrieval” step) and then generate a natural‑language summary. The assistant queries telemetry indices (time series, path metrics) and synthesizes an explanation.
    • Practical: This reduces the risk of hallucination because the assistant grounds its answer in live telemetry and reports a confidence value alongside each recommendation.
  • Closed‑loop Automation: When the assistant recommends a policy change (for example, steering an App Group to a different path), the system can create a local policy change and apply it automatically.
    • Production caution: closed loop must be controlled — audits and revert behavior are required if central policy changes afterwards.
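The retrieval-then-generation loop described above can be sketched in a few lines of Python. Everything here is illustrative: TELEMETRY_INDEX, the penalty weights, and recommend_path are hypothetical stand-ins for the assistant's internal indices and model, not a vendor API.

```python
# Sketch of the RAG flow: retrieve telemetry records for a flow,
# then "generate" a recommendation from them. All names are illustrative.

TELEMETRY_INDEX = {
    ("10.1.1.10", "10.2.2.20"): [
        {"path": "private1", "latency_ms": 45, "loss_pct": 2.3, "jitter_ms": 8},
        {"path": "private2", "latency_ms": 30, "loss_pct": 0.4, "jitter_ms": 3},
    ],
}

def retrieve(src: str, dst: str):
    """Retrieval step: pull per-path metrics for the flow."""
    return TELEMETRY_INDEX.get((src, dst), [])

def recommend_path(src: str, dst: str) -> str:
    """Generation step (simplified): pick the path with the lowest
    weighted penalty and phrase it as a recommendation."""
    records = retrieve(src, dst)
    if not records:
        return "No telemetry available"  # mirrors the Common Mistakes row
    best = min(records,
               key=lambda r: r["latency_ms"] + 10 * r["loss_pct"] + r["jitter_ms"])
    return f"Steer flow {src} -> {dst} to '{best['path']}'"

print(recommend_path("10.1.1.10", "10.2.2.20"))
```

With the sample metrics above, private2's weighted penalty is lower, so the sketch recommends steering to private2, matching the intuition the lesson builds on.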

Step-by-step configuration

Step 1: Enable the AI Assistant in SD‑WAN Manager

What we are doing: Enable the built‑in AI assistant feature so operators can query the fabric in natural language. This step creates the assistant service and authorizes access from the vManage UI. In production, this is the first step before any AI queries can be issued.

configure
system
 assistant enable
 exit
 exit

What just happened: Issued under system configuration mode, the assistant enable command turns on the AI assistant service inside vManage. The management plane now listens for assistant queries from authorized users and populates correlation indices used by the assistant (telemetry, NWPI metadata endpoints). Enabling the assistant also prepares the manager to accept natural language requests and to fetch telemetry for RAG.

Real-world note: In production, enabling the assistant is often controlled by policy and privacy review — ensure telemetry sharing complies with organization rules.

Verify:

show assistant status
Assistant: Enabled
Version: 1.0.0
Status: Ready
Active Queries: 0
RAG Indexes: TelemetryIndex, NWPIIndex
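If you script health checks over SSH or an API session, the verify output above is easy to parse into a structured form. The output format is taken from the lab; the parser itself is a sketch, not a supported tool.

```python
# Parse "Key: Value" verification output into a dict; "RAG Indexes"
# is split into a list so individual indexes can be checked.

def parse_assistant_status(raw: str) -> dict:
    status = {}
    for line in raw.strip().splitlines():
        key, _, value = line.partition(":")
        status[key.strip()] = value.strip()
    status["RAG Indexes"] = [i.strip()
                             for i in status.get("RAG Indexes", "").split(",")
                             if i.strip()]
    return status

sample = """Assistant: Enabled
Version: 1.0.0
Status: Ready
Active Queries: 0
RAG Indexes: TelemetryIndex, NWPIIndex"""

info = parse_assistant_status(sample)
print(info["Assistant"], info["RAG Indexes"])
```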

Step 2: Create an NWPI trace on the Branch Edge

What we are doing: Instruct the SD‑WAN Manager to create an NWPI trace starting at the Branch Edge (BR-EDGE). This writes NWPI metadata into the packet headers for sampled flows so that subsequent devices can report correlated telemetry back to the manager. This trace is required for the assistant to produce precise path‑level insights.

request nwpi trace create device BR-EDGE flow-match src-ip 10.1.1.10 dst-ip 10.2.2.20 duration 300

What just happened: The manager sent a control message to BR-EDGE to enable NWPI tagging for flows matching the 10.1.1.10 → 10.2.2.20 tuple for 300 seconds. BR-EDGE will now insert NWPI metadata into outgoing packets for that flow, and intermediate SD‑WAN routers will attach path telemetry reports to the manager, allowing precise path quality reconstruction.

Real-world note: Use focused traces (single app or 5‑tuple) to avoid overwhelming telemetry ingestion in production.

Verify:

show nwpi trace status
TraceID: 101
Device: BR-EDGE
FlowMatch: src-ip 10.1.1.10 dst-ip 10.2.2.20
StartTime: 2025-04-02T10:00:00Z
DurationSeconds: 300
Status: Active
SamplesCollected: 124
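Before querying the assistant, it helps to confirm the trace is actually collecting enough samples — a thin trace is the usual cause of low-confidence recommendations (see Common Mistakes). The readiness check below is a sketch; the sample threshold is an assumption, not a vendor-specified value.

```python
# Rough readiness check before querying the assistant. The 100-sample
# floor is an illustrative assumption, not a documented minimum.

def trace_ready(status: str, samples: int, min_samples: int = 100) -> bool:
    """A trace is worth querying once it is active and past a sample floor."""
    return status == "Active" and samples >= min_samples

print(trace_ready("Active", 124))    # the verify output above qualifies
print(trace_ready("Active", 12))     # too few samples: extend the duration
print(trace_ready("Stopped", 500))   # trace no longer collecting
```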

Step 3: Ask the AI Assistant a troubleshooting question

What we are doing: Query the assistant in plain English to analyze recent path quality between the Branch and the DC for the traced flow and ask for recommendations. This demonstrates how operators can obtain correlated analysis without parsing multiple logs.

Assistant> "Show me path quality and recent loss/latency/jitter for flow src 10.1.1.10 dst 10.2.2.20 over the last 5 minutes. If there are better paths, recommend the action and the expected gain."

What just happened: The assistant uses the NWPIIndex and TelemetryIndex to retrieve time‑series metrics (latency, jitter, packet loss) for the sampled flow. It runs its internal recommendation model (predictive path recommendations logic) and synthesizes a natural‑language response including a recommended policy action (for example, steer to private2). The assistant performs RAG: it retrieves telemetry, then generates a clear recommendation.

Real-world note: The assistant includes confidence estimates — always inspect those before applying automatic policy changes.

Verify:

Assistant Response:
Summary: Between BR-EDGE (10.1.1.1) and DC-EDGE (10.2.2.1), the measured path 'private1' shows average latency 45 ms, 2.3% loss, jitter 8 ms. The alternative path 'private2' shows average latency 30 ms, 0.4% loss, jitter 3 ms.
Recommendation: Change AAR policy for AppGroup 'SalesApp' at BR-EDGE to prefer 'private2' (expected improvement: 35% overall quality).
Confidence: 92%
Actionable steps:
1) Create a copy of the AAR policy with new sequence matching AppGroup 'SalesApp'.
2) Apply to site BR-EDGE only.
3) Monitor for 15 minutes and verify application quality.
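The assistant's "expected improvement" figure comes from an internal scoring model that is not documented here. The sketch below shows one plausible way such a percentage could be derived from latency, loss, and jitter; the weights are assumptions, so the resulting number will not match the assistant's figure exactly.

```python
# One plausible path-quality model: a weighted penalty where loss is
# weighted heavily because it hurts applications the most. Weights are
# assumptions for illustration, not the assistant's actual model.

def penalty(latency_ms: float, loss_pct: float, jitter_ms: float) -> float:
    """Lower is better."""
    return latency_ms + 10.0 * loss_pct + 2.0 * jitter_ms

def expected_improvement(current: tuple, candidate: tuple) -> float:
    """Relative penalty reduction, as a percentage."""
    cur, cand = penalty(*current), penalty(*candidate)
    return round(100.0 * (cur - cand) / cur, 1)

# private1 vs private2, using the figures from the assistant's summary:
gain = expected_improvement((45, 2.3, 8), (30, 0.4, 3))
print(f"Expected quality gain: {gain}%")
```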

Step 4: Generate the recommended policy change (create a policy draft)

What we are doing: Instruct vManage to create a policy draft based on the assistant's recommendation. This is a non‑destructive change that allows review before application — a best practice for closed‑loop automation.

request policy aar draft create name "SalesApp_pref_private2" source-site BR-EDGE app-group SalesApp preference private2

What just happened: The manager created a copy of the AAR (Application Aware Routing) policy, scoped to BR-EDGE for the SalesApp application group, preferring the 'private2' transport. This draft is saved for review; no live traffic steering happens until the policy is applied.

Real-world note: Draft creation enables peer review and prevents accidental network-wide policy changes.

Verify:

show policy aar drafts
Name: SalesApp_pref_private2
SourceSite: BR-EDGE
AppGroup: SalesApp
Preference: private2
Status: Draft
CreatedBy: Assistant (automated)
CreatedAt: 2025-04-02T10:06:12Z
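Before applying a draft, a quick programmatic scope check guards against the "no traffic shifts" mistake described later: the draft may exist but target the wrong AppGroup or site. The draft dictionary mirrors the verify output above; the check itself is a sketch.

```python
# Validate that a draft's scope matches operator intent before applying it.
# Field names mirror the lab's "show policy aar drafts" output.

draft = {
    "Name": "SalesApp_pref_private2",
    "SourceSite": "BR-EDGE",
    "AppGroup": "SalesApp",
    "Preference": "private2",
    "Status": "Draft",
}

def draft_matches_intent(d: dict, site: str, app_group: str, preference: str) -> bool:
    return (d.get("Status") == "Draft"
            and d.get("SourceSite") == site
            and d.get("AppGroup") == app_group
            and d.get("Preference") == preference)

print(draft_matches_intent(draft, "BR-EDGE", "SalesApp", "private2"))
print(draft_matches_intent(draft, "DC-EDGE", "SalesApp", "private2"))  # wrong site
```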

Step 5: Apply the policy (controlled closed‑loop) and monitor

What we are doing: Apply the approved draft policy to BR-EDGE and monitor path metrics to confirm improvement. This performs the actual traffic steering the assistant recommended.

request policy aar apply name "SalesApp_pref_private2" target-site BR-EDGE

What just happened: The manager pushed the new AAR policy to the BR-EDGE device. BR-EDGE updated its local routing/forwarding decision to prefer the private2 transport for SalesApp flows. Because NWPI tracing was active, the manager will observe the post‑change path metrics and compare them to the predicted improvement.

Real-world note: Apply changes during a maintenance window or gradually (per site) to limit blast radius.

Verify:

show policy aar active
Name: SalesApp_pref_private2
AppliedTo: BR-EDGE
Status: Active
AppliedAt: 2025-04-02T10:09:30Z

show nwpi path analytics flow 10.1.1.10 10.2.2.20
Flow: src 10.1.1.10 dst 10.2.2.20
BeforeChange: private1 AvgLatency=45ms Loss=2.3% Jitter=8ms
AfterChange:  private2 AvgLatency=28ms Loss=0.2% Jitter=2ms
MeasuredImprovement: 38% overall quality gain
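The monitoring step can be automated as a simple revert guard: compare the measured post-change metrics against the pre-change baseline and flag a rollback if the new path regressed. The tolerance value and decision rule below are assumptions for illustration.

```python
# Closed-loop guard: decide whether a policy change should be reverted
# based on measured before/after metrics. Tolerance is an assumption.

def should_revert(before: dict, after: dict, tolerance_pct: float = 5.0) -> bool:
    """Revert if latency or loss regressed by more than the tolerance."""
    worse_latency = after["latency_ms"] > before["latency_ms"] * (1 + tolerance_pct / 100)
    worse_loss = after["loss_pct"] > before["loss_pct"] + tolerance_pct / 100
    return worse_latency or worse_loss

baseline = {"latency_ms": 45, "loss_pct": 2.3}   # private1, before the change
measured = {"latency_ms": 28, "loss_pct": 0.2}   # private2, after the change

print(should_revert(baseline, measured))                           # improvement: keep it
print(should_revert(baseline, {"latency_ms": 60, "loss_pct": 3.0}))  # regression: revert
```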

Verification Checklist

  • Check 1: AI Assistant is enabled — verify with show assistant status and expect "Assistant: Enabled".
  • Check 2: NWPI trace is active and collecting samples — verify with show nwpi trace status and expect Status: Active and SamplesCollected > 0.
  • Check 3: Assistant produced a recommendation — verify by sending the natural language query and receiving a response with a Recommendation and Confidence field.
  • Check 4: Policy is applied and metrics improved — verify with show nwpi path analytics and expect lower latency/loss on the recommended path.

Common Mistakes

  • Symptom: Assistant returns "No telemetry available".
    Cause: NWPI trace was not created, or the telemetry indexes are empty.
    Fix: Create an NWPI trace with the correct flow-match; verify with show nwpi trace status.
  • Symptom: Recommendation confidence is low (< 50%).
    Cause: Insufficient samples or short trace duration.
    Fix: Extend the trace duration or broaden flow sampling; rerun the query.
  • Symptom: Draft policy created but not applied.
    Cause: Operator did not apply the draft or lacks permissions.
    Fix: Apply the draft with request policy aar apply and ensure the user has policy apply privileges.
  • Symptom: After applying the policy, no traffic shifts.
    Cause: Policy scope mismatch (wrong AppGroup or site).
    Fix: Inspect the draft (show policy aar drafts) and confirm AppGroup and SourceSite match the intended targets.

Key Takeaways

  • An LLM‑based AI Assistant is a powerful natural‑language interface to SD‑WAN telemetry and recommendations; it relies on live telemetry (NWPI) and RAG to produce accurate, actionable results.
  • NWPI traces are essential for correlating per‑flow metrics across the fabric; create focused traces to avoid unnecessary telemetry volume.
  • Always create and review a policy draft before applying — closed‑loop automation must be controlled and auditable in production.
  • Monitor after changes: the assistant predicts gains, but you must validate with measured telemetry and be prepared to revert if behavior differs.

Important: In production environments, coordinate AI-driven policy changes with change control and ensure telemetry privacy and compliance are respected when exposing data to LLMs.