Lesson 2 of 5

SD-WAN Control and Data Plane Troubleshooting

Introduction

When an SD-WAN fabric is running smoothly, traffic flows across encrypted tunnels, policies steer applications to the best path, and the entire overlay operates transparently. The moment something breaks, however, you need a structured approach to isolate whether the problem sits in the control plane (OMP peering, policy distribution, TLOC advertisement) or the data plane (IPsec tunnels, BFD sessions, packet forwarding). Without that distinction, troubleshooting can spiral into hours of guesswork.

This lesson walks you through the control and data plane architecture of Catalyst SD-WAN, then focuses on the commands and methodology you need to diagnose real failures. By the end, you will be able to:

  • Verify OMP peering between WAN Edge routers and controllers
  • Inspect centralized control policy assignment and matching
  • Understand data plane security mechanisms including pairwise encryption, anti-replay, and integrity verification
  • Apply the correct order of operations when troubleshooting centralized data and application-aware routing (AAR) policies

Key Concepts

Control Plane vs. Data Plane

The SD-WAN architecture separates concerns into distinct planes. Understanding what belongs where is the first step in any troubleshooting workflow.

Plane      | Protocol / Mechanism              | Runs Over                                    | Purpose
Control    | OMP (Overlay Management Protocol) | Authenticated TLS/DTLS connections           | Advertises routes, TLOCs, encryption keys, and policies
Data       | IPsec tunnels                     | Public or private transport (Internet, MPLS) | Carries user traffic between WAN Edge routers
Management | NETCONF / HTTPS                   | VPN 512                                      | Configuration push, monitoring, telemetry

Overlay Management Protocol (OMP) is the control plane protocol of the SD-WAN fabric. It runs inside authenticated TLS or DTLS connections between WAN Edge routers and controllers. OMP advertises three categories of information:

  • Reachability -- IP subnets and TLOCs
  • Security -- Encryption keys
  • Policy -- Control policies, data policies, and application-aware routing policies

Important: WAN Edge routers need not connect to all controllers. As long as at least one controller peering is active, the edge can receive routing and policy updates.

Transport Locators (TLOCs)

A Transport Locator (TLOC) uniquely identifies a WAN transport connection on a WAN Edge router. Each TLOC is defined by three attributes:

Attribute     | Description
System IP     | The router's system IP address (loopback identity)
Color         | The transport type label (e.g., mpls, biz-internet, lte)
Encapsulation | The tunnel encapsulation type (IPsec or GRE)

WAN Edge routers advertise their local TLOCs to controllers via OMP. The controllers then re-advertise those TLOCs to all other WAN Edge routers (by default). This TLOC exchange is what builds the full-mesh IPsec fabric -- every edge learns every other edge's transport endpoints and can form direct tunnels. Control policies on the controller can influence which TLOCs are advertised to which edges, allowing you to build hub-and-spoke or partial-mesh topologies instead.
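
To make the TLOC model concrete, the following Python sketch treats a TLOC as the three-attribute tuple described above and enumerates the tunnels a default full mesh would contain. The Tloc class, the full_mesh helper, the system IPs, and the colors are illustrative assumptions rather than output from a real fabric, and the sketch deliberately ignores color restrictions and control policies.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Tloc:
    """A TLOC as described above: system IP + color + encapsulation."""
    system_ip: str
    color: str
    encap: str = "ipsec"


def full_mesh(tlocs):
    """Return every TLOC pair whose endpoints sit on two different routers.

    Simplification: a real fabric also honors color restrictions and
    control policies, which this sketch ignores.
    """
    return [
        (a, b)
        for a, b in combinations(tlocs, 2)
        if a.system_ip != b.system_ip
    ]


# Hypothetical routers and colors, chosen only for illustration.
tlocs = [
    Tloc("10.255.255.11", "mpls"),
    Tloc("10.255.255.11", "biz-internet"),
    Tloc("10.255.255.21", "mpls"),
    Tloc("10.255.255.21", "biz-internet"),
]

tunnels = full_mesh(tlocs)
for local, remote in tunnels:
    print(f"{local.system_ip}/{local.color} <-> {remote.system_ip}/{remote.color}")
```

With two routers and two transports each, the sketch yields four candidate tunnels. A control policy that filters TLOCs would correspond to removing entries from the input list for a given receiver.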

VPN Segmentation

SD-WAN uses VPNs (VRFs) to isolate traffic domains. Each VPN maintains its own forwarding table, and reachability within a VPN is advertised by OMP.

VPN     | Purpose
VPN 0   | Reserved for WAN uplinks (transport side)
VPN 512 | Reserved for management interfaces
VPN n   | User-defined LAN segments (service side)

VPNs are isolated from each other. This segmentation means that a routing or forwarding problem in one VPN does not automatically affect another -- always verify which VPN the affected traffic belongs to before diving deeper.

How It Works

Control Plane: OMP and Policy Distribution

Consider a topology with two WAN Edge routers and a controller:

  • WAN Edge 1: System IP 10.255.255.11, Site ID 10, transports on VPN1-A and VPN2-B
  • WAN Edge 2: System IP 10.255.255.21, Site ID 20, transports on VPN1-C and VPN2-D

Each edge establishes a DTLS/TLS tunnel to the controller and begins exchanging OMP updates. The OMP update carries subnets, TLOCs, and policies. The controller reflects these updates to other edges (subject to control policies), and IPsec tunnels are built between the edges using learned TLOC information. BFD runs inside these IPsec tunnels to detect path failures.

When a centralized control policy is applied on the controller, it filters or modifies OMP updates before they are sent to the WAN Edge routers. Troubleshooting a policy problem therefore starts at the controller.

Data Plane: Security Mechanisms

Once the IPsec data plane tunnels are established, three mechanisms protect traffic in transit:

Pairwise Key Encryption -- Each WAN Edge creates a separate session key for each transport and for each peer. These session keys are advertised through controllers using OMP. When Edge-A sends traffic to Edge-B, it uses session key "AB." Edge-B uses key "BA" to send traffic back. This pairwise model ensures that even if one key is compromised, only a single peer-to-peer relationship is affected.

Data Plane Integrity with NAT Traversal -- The SD-WAN Validator discovers each WAN Edge's public IP address, even when the edge sits behind NAT, and communicates that address back to the WAN Edge. The WAN Edge then computes the authentication header (AH) value based on the post-NAT public IP, so that packet integrity, including the IP headers, is preserved across NAT boundaries. The encryption algorithm used is AES256-GCM.

IPsec Anti-Replay Protection -- Encrypted packets are assigned sequence numbers. WAN Edge routers drop packets with duplicate sequence numbers (replayed packets) and packets with sequence numbers lower than the minimum of the sliding window (maliciously injected packets). When a packet with a higher sequence number arrives, the sliding window advances. The sliding window is CoS-aware, preventing low-priority traffic from advancing the window and causing high-priority packets to be dropped.
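
The anti-replay rules above can be sketched in a few lines of Python. This is a simplified model, not the router's implementation: the window size of 64 is a hypothetical value, and modeling CoS awareness as one independent window per class is an assumption made for illustration.

```python
class ReplayWindow:
    """Sliding anti-replay window over packet sequence numbers."""

    def __init__(self, size=64):
        self.size = size      # hypothetical window size, in packets
        self.highest = 0      # highest sequence number accepted so far
        self.seen = set()     # accepted sequence numbers still inside the window

    def accept(self, seq):
        """Return True if the packet is accepted, False if it is dropped."""
        lowest = self.highest - self.size + 1
        if seq < lowest:
            return False      # below the window minimum: drop
        if seq in self.seen:
            return False      # duplicate sequence number: replay, drop
        self.seen.add(seq)
        if seq > self.highest:
            self.highest = seq                      # the window advances
            floor = self.highest - self.size + 1
            self.seen = {s for s in self.seen if s >= floor}
        return True


class CosAwareReplay:
    """Assumed model of CoS awareness: one window per class of service."""

    def __init__(self, size=64):
        self.size = size
        self.windows = {}

    def accept(self, cos, seq):
        window = self.windows.setdefault(cos, ReplayWindow(self.size))
        return window.accept(seq)


window = ReplayWindow(size=64)
results = [window.accept(seq) for seq in (1, 1, 100, 30, 50)]
print(results)
```

In the usage lines, the second packet with sequence number 1 is dropped as a replay, 100 advances the window, 30 then falls below the window minimum, and 50 is still accepted because it lies inside the window. With per-class windows, a burst of high sequence numbers in one class leaves the other classes' windows untouched.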

Configuration Example

Verifying OMP Peering

Start by confirming that the WAN Edge has an active OMP session with the controller:

show omp peers 10.255.255.11 details

This command displays the OMP peer state, uptime, and route counts for the specified system IP. If the peer state is not "up," check the underlying DTLS/TLS connection and certificate validity.

Checking Control Policy Assignment

On the controller, verify which policies are applied to a specific peer and in which direction:

vsmart1# show support omp peer peer-ip 10.0.0.11 | include -pol
site-pol:      BRANCHES
route-pol-in:  None
route-pol-out: MY-CONTROL-POLICY-v1
data-pol-in:   None
data-pol-out:  None
pfr-pol:       None
mem-pol:       None
cflowd:        None

This output tells you that peer 10.0.0.11 belongs to site-list BRANCHES and has the control policy MY-CONTROL-POLICY-v1 applied in the outbound direction. No data policy or AAR policy is assigned.
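
When you check many peers, it can help to turn this output into a structure you can query. The Python sketch below parses the text exactly as shown above; the parse_policy_assignment function name is hypothetical, and how you collect the output from the controller is left out on purpose.

```python
def parse_policy_assignment(output):
    """Turn 'name: value' lines into a dict, mapping the text 'None' to None."""
    assignment = {}
    for line in output.splitlines():
        if ":" not in line:
            continue  # skip prompts, blank lines, and anything unexpected
        key, _, value = line.partition(":")
        value = value.strip()
        assignment[key.strip()] = None if value == "None" else value
    return assignment


# Sample text copied from the command output shown above.
sample = """\
site-pol:      BRANCHES
route-pol-in:  None
route-pol-out: MY-CONTROL-POLICY-v1
data-pol-in:   None
data-pol-out:  None
pfr-pol:       None
mem-pol:       None
cflowd:        None
"""

parsed = parse_policy_assignment(sample)
applied = {name: value for name, value in parsed.items() if value is not None}
print(applied)
```

Filtering out the None entries leaves only the site list and the outbound control policy, which is the same conclusion drawn from reading the output by eye.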

Verifying the Control Policy Definition

Confirm that the policy defined in the SD-WAN Manager was correctly translated into CLI on the controller:

vsmart1# show running-config policy control-policy MY-CONTROL-POLICY-v1
policy control-policy MY-CONTROL-POLICY-v1
  sequence 1
    match tloc
      site-list SITE-30
    !
    action accept
    !
  !
  sequence 11
    match tloc
      site-list SITE-40
    !
    action accept
    !
  !
  sequence 21
    match route
      prefix-list DEFAULT
      site-list SITE-30
    !
    action accept
      set preference 100
      service netsvc3 vpn 3
    !
  !
  sequence 31
    match route
      prefix-list DEFAULT
      site-list SITE-40
    !
    action accept
      set preference 50
      service IDP vpn 3
    !
  !
  default-action reject
  !
!

Verifying Policy Application to a Site List

Check the apply-policy section to confirm which site list receives the policy:

vsmart1# show running-config apply-policy site-list BRANCHES
apply-policy
  site-list BRANCHES
    control-policy MY-CONTROL-POLICY-v1 out
  !
!

And verify the site list membership:

vsmart1# show running-config policy lists site-list BRANCHES
policy lists
  site-list BRANCHES
    site-id 11-12
  !
!
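
A frequent mistake is assuming a site belongs to a site list when its ID falls outside the configured range. As a quick sanity check, the Python sketch below tests membership against entries written the way they appear above, either as a single ID or as a low-high range. The site_in_list helper is hypothetical.

```python
def site_in_list(site_id, entries):
    """Return True if site_id matches any entry such as '11-12' or '40'."""
    for entry in entries:
        low, _, high = entry.partition("-")
        high = high or low  # a single ID is a range of one
        if int(low) <= site_id <= int(high):
            return True
    return False


branches = ["11-12"]  # taken from the site-list BRANCHES output above
print(site_in_list(11, branches), site_in_list(40, branches))
```

Here site 11 is a member of BRANCHES and site 40 is not, so a router at site 40 would not receive a policy applied to this list.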

Testing Policy Match

Use the test policy match command on the controller to simulate which sequence a given route or TLOC would hit:

vsmart1# test policy match control-policy MY-CONTROL-POLICY-v1 site-id 40 ipv4-prefix DEFAULT
Found: "site-id 40 ipv4-prefix-list DEFAULT" matches policy MY-CONTROL-POLICY-v1 sequence 31
  sequence:  31
  match route [SITE-LIST PFX-LIST]
    site-list:        SITE-40
    IPv4 prefix-list: DEFAULT
  action: accept
    set: [PREF SERVICE]
    preference: 50
    service: 3
    vpn: 3

This confirms that a route originating from site 40 and matching the DEFAULT prefix list hits sequence 31, where the preference is set to 50 and the route is associated with service IDP in VPN 3.
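
The logic behind this result is first-match evaluation: sequences are checked in ascending order, the first one whose match conditions are all satisfied wins, and anything that matches no sequence falls through to the default action. The Python sketch below models the control policy shown earlier in this lesson under that reading. The contents of SITE-30 and SITE-40 (sites 30 and 40) and the helper names are assumptions for illustration.

```python
# Assumed site-list contents; the lesson does not show these definitions.
SITE_LISTS = {"SITE-30": {30}, "SITE-40": {40}}
DEFAULT_ACTION = "reject"


def sequence(seq, kind, site_list, prefix_list=None, **set_actions):
    """Build one policy sequence as a plain dict."""
    return {"seq": seq, "kind": kind, "site_list": site_list,
            "prefix_list": prefix_list, "action": "accept", "set": set_actions}


SEQUENCES = [
    sequence(1, "tloc", "SITE-30"),
    sequence(11, "tloc", "SITE-40"),
    sequence(21, "route", "SITE-30", "DEFAULT",
             preference=100, service="netsvc3", vpn=3),
    sequence(31, "route", "SITE-40", "DEFAULT",
             preference=50, service="IDP", vpn=3),
]


def match_policy(kind, site_id, prefix_list=None):
    """Return (sequence number, action, set actions) for the first match."""
    for entry in SEQUENCES:
        if entry["kind"] != kind:
            continue
        if site_id not in SITE_LISTS[entry["site_list"]]:
            continue
        if entry["prefix_list"] is not None and entry["prefix_list"] != prefix_list:
            continue
        return entry["seq"], entry["action"], entry["set"]
    return None, DEFAULT_ACTION, {}


print(match_policy("route", 40, "DEFAULT"))
print(match_policy("route", 50, "DEFAULT"))
```

The first call lands on sequence 31, in line with the test policy match output. The second call matches nothing and is rejected by the default action, which is the usual explanation when routes or TLOCs from an unlisted site silently disappear.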

Order of Operations for Data and AAR Policies

When troubleshooting data plane policy behavior, understanding the processing order is essential. The order of operations from service side to transport side is:

Step | Processing Stage             | Key Functions
1    | Local Ingress Policy         | Policing, admission control, classification and marking
2    | Centralized App-Route Policy | SLA-based path selection
3    | Centralized Data Policy      | Policing, admission control, classification, re-marking, path selection, services
4    | Routing and Forwarding       | Topology-driven forwarding
5    | Queueing and Scheduling      | Shaping, weighted scheduling
6    | Local Egress Policy          | Access lists, policing, re-marking

Best Practice: When a data policy does not seem to take effect, first verify whether a local ingress policy at step 1 is dropping or re-marking the traffic before it reaches the centralized policies at steps 2 and 3.
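
The same reasoning applies to any stage: whatever runs earlier in the pipeline can change or drop a packet before the stage you suspect ever sees it. As a small memory aid, the Python sketch below encodes the order of operations from the table and lists the stages to rule out first. The function name is hypothetical.

```python
# Service side to transport side, in processing order.
STAGES = [
    "Local Ingress Policy",
    "Centralized App-Route Policy",
    "Centralized Data Policy",
    "Routing and Forwarding",
    "Queueing and Scheduling",
    "Local Egress Policy",
]


def stages_to_check_first(suspect):
    """Return the stages that process a packet before the suspect stage."""
    return STAGES[:STAGES.index(suspect)]


print(stages_to_check_first("Centralized Data Policy"))
```

For a misbehaving centralized data policy, this returns the local ingress policy and the centralized app-route policy, in that order.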

Real-World Application

Application-Aware Routing in Production

In production deployments, Application-Aware Routing (AAR) policies define SLA classes with thresholds for latency, loss, and jitter. For example, an SLA class for a critical application might allow at most 150 ms of latency, 2% loss, and 10 ms of jitter. Each path through the fabric is continuously measured against these thresholds:

  • Path 1: 10 ms latency, 1% loss, 5 ms jitter -- meets SLA
  • Path 2: 200 ms latency, 3% loss, 10 ms jitter -- violates SLA
  • Path 3: 140 ms latency, 1% loss, 10 ms jitter -- meets SLA

If multiple paths meet the SLA, traffic is load-balanced (hashed) across them. If a path is marked as preferred and it meets the SLA, that path is chosen exclusively.
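
The selection logic in this example can be sketched in Python. The class and function names are hypothetical, and treating a measurement equal to the threshold as compliant is an assumption chosen so that Path 3, with exactly 10 ms of jitter, meets the SLA as stated above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Sla:
    latency_ms: float
    loss_pct: float
    jitter_ms: float


@dataclass(frozen=True)
class Path:
    name: str
    latency_ms: float
    loss_pct: float
    jitter_ms: float
    preferred: bool = False


def meets(path, sla):
    """A path complies when every metric is at or below its threshold (assumed)."""
    return (path.latency_ms <= sla.latency_ms
            and path.loss_pct <= sla.loss_pct
            and path.jitter_ms <= sla.jitter_ms)


def select_paths(paths, sla):
    """Return the preferred compliant paths if any, else all compliant paths."""
    compliant = [p for p in paths if meets(p, sla)]
    preferred = [p for p in compliant if p.preferred]
    return preferred or compliant


sla = Sla(latency_ms=150, loss_pct=2, jitter_ms=10)
paths = [
    Path("Path 1", 10, 1, 5),
    Path("Path 2", 200, 3, 10),
    Path("Path 3", 140, 1, 10),
]

print([p.name for p in select_paths(paths, sla)])

# Mark Path 3 as preferred: it meets the SLA, so it is chosen exclusively.
with_preference = [paths[0], paths[1], replace(paths[2], preferred=True)]
print([p.name for p in select_paths(with_preference, sla)])
```

Without a preference, Path 1 and Path 3 are both eligible and traffic would be hashed across them. With Path 3 marked as preferred, it is the only path returned. What happens when no path meets the SLA is outside the scope of this sketch.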

Cloud OnRamp for SaaS

Cloud OnRamp for SaaS continuously monitors performance from the edge to SaaS applications on both direct internet access (DIA) and backhaul paths. It selects the best-performing path based on loss and delay metrics and fails over automatically when performance degrades. The process is fully automated and supports multiple application profiles.

AppQoE Building Blocks

For environments that need to optimize application experience beyond path selection, the SD-WAN fabric provides several capabilities managed from the SD-WAN Manager: TCP optimization (window scaling, large initial windows, selective acknowledgement), byte-level caching and compression (DRE, LZW), protocol-agnostic forward error correction, packet duplication, and the BBR2 congestion algorithm.

Summary

  • OMP is the backbone of the control plane -- it distributes routes, TLOCs, encryption keys, and policies over authenticated TLS/DTLS sessions between WAN Edges and controllers.
  • TLOCs (System IP + Color + Encapsulation) build the IPsec fabric -- controllers reflect TLOCs to all edges by default, creating a full-mesh topology that can be shaped by control policies.
  • Data plane security is layered -- pairwise encryption keys, AES256-GCM integrity verification across NAT, and CoS-aware anti-replay sliding windows protect traffic end to end.
  • Troubleshoot control policies systematically -- verify OMP peering, check policy assignment direction, inspect the CLI translation, and use test policy match to confirm which sequence a route hits.
  • Understand the order of operations -- local ingress policies are processed before centralized AAR and data policies; overlooking an earlier stage that drops or re-marks traffic is a common source of unexpected behavior.

Next, continue building your skills by practicing centralized data policy creation and AAR SLA class configuration in a lab environment, applying the verification commands covered in this lesson to validate each step.