Advanced SD-WAN Policies and Troubleshooting

Introduction

You have just activated a centralized data policy on your SD-WAN fabric, and suddenly half of your branch sites lose connectivity to the data center. Where do you start? SD-WAN policies are one of the most powerful capabilities of the Cisco Catalyst SD-WAN architecture, but they are also one of the most common sources of production outages when misconfigured or poorly understood. Whether you are dealing with control policies that shape overlay routing, data policies that enforce per-VPN forwarding decisions, or application-aware routing (AAR) policies that steer traffic based on SLA thresholds, the ability to systematically troubleshoot SD-WAN policies is a skill every network engineer must develop.

This article provides a comprehensive, CLI-driven guide to advanced SD-WAN policy troubleshooting. We will cover the policy architecture, walk through detailed troubleshooting workflows for every policy type, explore the internal IOS-XE components responsible for policy programming, and examine real-world failure scenarios drawn from field escalations. By the end, you will have a practical toolkit for diagnosing and resolving even the most challenging SD-WAN policy issues.

What Are SD-WAN Policies and How Are They Organized?

Before diving into troubleshooting, it is essential to understand the SD-WAN policy architecture and how different policy types interact within the overlay fabric.

Component Naming

The Cisco Catalyst SD-WAN platform has undergone a naming refresh. The legacy component names map to newer designations as follows:

Legacy Name	New Name
vManage (NMS)	Catalyst SD-WAN Manager
vBond (Orchestrator)	Catalyst SD-WAN Validator
vSmart	Catalyst SD-WAN Controller

In practice, the CLI outputs continue to use the legacy names (vmanage, vbond, vsmart), and there are no plans to change this. For consistency with the actual commands and outputs you will encounter during troubleshooting, this article uses the legacy naming convention throughout.

Policy Categories

SD-WAN policies fall into two broad categories: centralized policies and localized policies.

Centralized Policies are defined on vManage and distributed across the network:

Topology and VPN Membership: Control Policy, VPN Membership Policy (UX 2.0: Topology)
Traffic Rules: App-Aware Routing Policy, Data Policy (Traffic Data), cFlowd template (UX 2.0: Application Priority and SLA)

Localized Policies are pushed via device templates or configuration groups:

Local Control Policy: Routing policies for IGP/BGP
Local Data Policy: QoS, ACLs, mirroring
Security Policy: Zone-Based Firewall (ZBFW), Unified Threat Defense (UTD), URL Filtering (URLF) (UX 2.0: NGFW, SIG/SSE, DNS Security)

The key architectural distinction is how each policy type reaches the network. Centralized control policies are enforced on vSmart, which manipulates OMP routing updates. Centralized data and AAR policies are sent from vSmart to WAN Edge routers via OMP and stored in volatile memory (Policy RIB). Localized policies are pushed via NETCONF as part of the device configuration.

Building Blocks of Centralized Policies

Every centralized policy is built from three components:

Groups of Interest (Lists): Prefixes, Sites, TLOCs, VPNs, Colors, SLAs
Policy Definition: The match-action sequences that define the policy logic
Policy Application: The apply directive that binds a policy to specific site lists and directions

Pro Tip: Control policies affect overlay routing decisions on vSmart. Data policies provide VPN-level, policy-based routing on the WAN Edge. AAR policies steer traffic according to configured SLA thresholds. Understanding which component enforces each policy type is the foundation of effective troubleshooting.
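
As a quick reference, the enforcement points described above can be captured in a small lookup table. This is only a sketch in Python: the dictionary keys and the where_to_look helper are illustrative names, not part of any Cisco tooling, and the descriptions simply restate the architecture summarized in this section.

```python
# Where each policy type is enforced and how it reaches that device,
# restating the architecture described in this section.
# The key names are illustrative labels, not CLI keywords.
POLICY_ENFORCEMENT = {
    "control": ("vSmart", "pushed to vSmart via NETCONF, applied to OMP updates"),
    "data": ("WAN Edge", "sent by vSmart over OMP, held in the volatile Policy RIB"),
    "app-route": ("WAN Edge", "sent by vSmart over OMP, held in the volatile Policy RIB"),
    "localized": ("WAN Edge", "pushed via NETCONF as part of the device configuration"),
}


def where_to_look(policy_type: str) -> str:
    """Return a one-line hint on where to start troubleshooting."""
    enforcer, delivery = POLICY_ENFORCEMENT[policy_type]
    return f"{policy_type}: enforced on {enforcer} ({delivery})"


if __name__ == "__main__":
    for name in POLICY_ENFORCEMENT:
        print(where_to_look(name))
```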

How Does the Generalized SD-WAN Policy Troubleshooting Workflow Operate?

When a policy is not behaving as expected, the first step is to determine which type of policy is involved and then follow the appropriate troubleshooting path. The generalized workflow proceeds as follows:

Is it a centralized policy?
- If yes, determine whether it is a control policy or a data/AAR/cFlowd policy.
- If no, check whether a local policy is assigned to the device template.
For centralized policies, start from vManage:
- Was the policy activated in the UI?
- Is a template assigned to vSmart?
- Was the NETCONF session from vManage to vSmart successful?
Move to the vSmart/WAN Edge side:
- Is OMP peering between the WAN Edge and vSmart established?
- Check policy/template assignment on the WAN Edge.
- Check policy match and actions.
For data/AAR policies, additional steps:
- Ensure ECMP availability.
- Check AAR statistics and logs.
- Verify whether data policy overrides are in effect.

Pro Tip: vManage automatically verifies the NETCONF session to vSmart and warns you in the UI if there are problems. Always check for these warnings before diving into CLI troubleshooting.

How to Troubleshoot SD-WAN Policies from the vManage Perspective

The vManage UI provides several checkpoints for verifying policy status before you move to the CLI.

Centralized Policy Verification

Check if a template is assigned to vSmart: Navigate to Configuration > Devices > Controllers and verify that vSmart has a device template attached.
Check if the policy was activated: Navigate to Configuration > Policies > Centralized Policy and confirm the policy status shows as active.
UX 2.0 catch: In the newer policy-groups/configuration-groups interface (accessed via Configuration > Topology), policy changes must be explicitly deployed after modification. This is a common oversight that differs from the classic UI behavior.

Localized Policy Verification

Check if the policy is assigned to a device template: Navigate to Configuration > Templates > Device Template > Additional Templates section.
UX 2.0 policy-groups: Navigate to Configuration > Policy-Groups and expand the policy-group details to verify assignment.

Common vManage-Side Issues

Policy activation failures: The centralized policy may fail to activate if there are validation errors in the policy definition.
Template assignment issues: When attaching devices to templates, configuration errors such as typos in AS path regular expressions (e.g., ^ ^*$ instead of a valid regex) will cause the push to fail.
Policy preview: Always use the policy preview feature in vManage to see the CLI representation of the policy before activation. This preview should match what you see on the vSmart controller.

Centralized Control Policies Troubleshooting: A Step-by-Step Workflow

Centralized control policies operate on the vSmart controller and manipulate OMP routing updates. The most important concept to remember is that control policies can be applied in either the inbound or outbound direction relative to vSmart, and this direction determines when the policy acts relative to OMP best-path selection.

Inbound policy: Applied before OMP best-path selection
Outbound policy: Applied after OMP best-path selection

This distinction is critical because applying a policy inbound can inadvertently remove backup paths before the best-path algorithm has a chance to evaluate them.
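
The effect of direction can be illustrated with a toy model. The Python sketch below is not how vSmart is implemented. It assumes a simplified best-path step that keeps only the highest-preference routes, a hypothetical policy that sets preference 200 on GW1 routes and 100 on GW2 routes, and a branch that can only use TLOCs of its own color. Gateway names, colors, and preference values are illustrative, and limits such as the number of paths vSmart advertises are ignored.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Route:
    gateway: str
    color: str
    preference: int = 0


def set_preference(routes):
    """Hypothetical control policy: prefer GW1 (200) over GW2 (100)."""
    prefs = {"GW1": 200, "GW2": 100}
    return [replace(r, preference=prefs[r.gateway]) for r in routes]


def best_paths(routes):
    """Simplified best-path selection: keep only the highest preference."""
    if not routes:
        return []
    top = max(r.preference for r in routes)
    return [r for r in routes if r.preference == top]


def advertised(routes, direction):
    """Routes the controller would advertise in this toy model."""
    if direction == "in":
        # Policy runs first, so best-path sees the modified preferences.
        return best_paths(set_preference(routes))
    # Policy runs after best-path, so all equal-cost routes survive.
    return set_preference(best_paths(routes))


def usable_at_branch(routes, branch_colors):
    """A branch can only resolve TLOCs whose color it is attached to."""
    return best_paths([r for r in routes if r.color in branch_colors])


# Routes left on the controller after GW1 loses its MPLS link.
AFTER_GW1_MPLS_FAILURE = [
    Route("GW1", "biz-internet"),
    Route("GW2", "mpls"),
    Route("GW2", "biz-internet"),
]

if __name__ == "__main__":
    for direction in ("in", "out"):
        result = usable_at_branch(
            advertised(AFTER_GW1_MPLS_FAILURE, direction), {"mpls"}
        )
        print(direction, result)
```

In this model an MPLS-only branch is left with no usable route when the policy is applied inbound, while the outbound variant still delivers the GW2 route over MPLS.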

Step-by-Step Troubleshooting Commands

Step 1: Check policy commit changes on vSmart.

vsmart1# show configuration commit changes <number>

This command reveals exactly what was committed, including all policy lists, control-policy sequences, match conditions, and actions. For example:

vsmart1# show configuration commit changes 0
!
! Created by: vmanage-admin
! Date: 2023-04-24 19:22:02
! Client: netconf
!
policy
 lists
  site-list BRANCHES
   site-id 11-12
  !
  site-list SITE-30
   site-id 30
  !
  prefix-list DEFAULT
   ip-prefix 0.0.0.0/0
  !
 !
 control-policy MY-CONTROL-POLICY-v1
  sequence 1
   match tloc
    site-list SITE-30
   !
   action accept
  !
  sequence 21
   match route
    prefix-list DEFAULT
    site-list SITE-30
   !
   action accept
    set
     preference 100
     service netsvc3 vpn 3
    !
   !
  !
  default-action reject
 !
!
apply-policy
 site-list BRANCHES
  control-policy MY-CONTROL-POLICY-v1 out
 !
!

Step 2: Check OMP peering between WAN Edge and vSmart.

vsmart1# show omp peers <system-ip>

Verify that the peer state is "up" and note the routes received/installed/sent (R/I/S) counters.

Step 3: Check which control policy is assigned and its direction.

vsmart1# show support omp peer peer-ip <system-ip> | include -pol

Example output:

site-pol: BRANCHES
route-pol-in: None
route-pol-out: MY-CONTROL-POLICY-v1
data-pol-in: None
data-pol-out: None
pfr-pol: None
mem-pol: None
cflowd: None

This tells you which policy is applied to this peer, which site-list it belongs to, and the direction of application.
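
When checking many peers, it can help to turn this output into a dictionary. The following Python sketch assumes the output has already been captured as text in the "name: value" form shown above. The function name parse_peer_policies is hypothetical and not part of any Cisco SDK.

```python
def parse_peer_policies(output: str) -> dict:
    """Parse 'name: value' lines into a dict, mapping 'None' to Python None."""
    policies = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not key/value pairs
        value = value.strip()
        policies[key.strip()] = None if value == "None" else value
    return policies


# Sample text copied from the example output in this article.
SAMPLE = """site-pol: BRANCHES
route-pol-in: None
route-pol-out: MY-CONTROL-POLICY-v1
data-pol-in: None
data-pol-out: None
pfr-pol: None
mem-pol: None
cflowd: None
"""

if __name__ == "__main__":
    parsed = parse_peer_policies(SAMPLE)
    applied = {name: value for name, value in parsed.items() if value}
    print(applied)
```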

Step 4: Verify the CLI representation on vSmart matches the vManage policy preview.

vsmart1# show running-config policy control-policy <name>
vsmart1# show running-config apply-policy site-list <name>
vsmart1# show running-config policy lists site-list <name>

Step 5: Test control policy match and actions.

vsmart1# test policy match control-policy <name> <conditions>

For example:

vsmart1# test policy match control-policy MY-CONTROL-POLICY-v1 site-id 40 ipv4-prefix DEFAULT

This command reveals which sequence in the policy matches the given conditions, including the specific match criteria and action statements (such as preference values and service settings).

Step 6: If all else fails, enable debugging on vSmart.

debug omp policy [level <high|low> peer-address <system-ip> prefix <IP prefix/length> direction <both|received|sent> vpn <number>]

Before version 20.12, debug logs are stored in /var/log/tmplog/vdebug. From version 20.12 onward, they are in /var/log/vdebug. Enable disk logging for debug messages:

vSmart1(config)# system logging disk enable priority debug

To view logs, enter vshell and use tail -f <filename>, or use show log <filename> tail -f, or monitor start <filename>.
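
If you script log collection across controllers on mixed releases, the version-dependent path can be selected programmatically. This minimal Python sketch assumes the software version is available as a dotted string such as "20.9.4". The helper name vdebug_path is illustrative.

```python
def vdebug_path(version: str) -> str:
    """Return the debug log path for a given controller software version.

    Assumes a dotted version string whose first two fields are integers.
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    if (major, minor) >= (20, 12):
        return "/var/log/vdebug"
    return "/var/log/tmplog/vdebug"


if __name__ == "__main__":
    for release in ("20.9.4", "20.12.1"):
        print(release, vdebug_path(release))
```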

Overlay Routing Troubleshooting for Missing Routes

When a control policy issue manifests as missing routes, follow this algorithm on the WAN Edge:

Check RIB/FIB: show ip route / show sdwan ip fib
Check OMP table: show sdwan omp route
Check TLOC information: show sdwan omp tloc
Check BFD sessions: show sdwan bfd sessions
Check local policy filtering: show sdwan run "sdwan omp", show sdwan run "policy", show run route-map

On vSmart, verify OMP route and TLOC tables: show omp route, show omp tloc.

How to Troubleshoot Centralized Data and AAR Policies on SD-WAN

Data and AAR policies are enforced on WAN Edge routers rather than on vSmart. They follow a different processing pipeline, and understanding the order of operations is essential.

Order of Operations (Service Side to Transport Side)

Order	Function	Description
1	Local Ingress Policy	Policing, Admission Control, Classification and Marking
2	Centralized App-Route Policy	SLA-Based Path Selection
3	Centralized Data Policy	Policing, Classification, Re-Marking, Path Selection, Services
4	Routing and Forwarding	Topology-Driven Forwarding
5	Queueing and Scheduling	WRR with LLQ, Congestion Avoidance
6	Local Egress Policy	Access Lists, Policing, Re-marking

The guiding principle is: Data Policy makes the final decision, but it considers the AAR SLA match. When a packet is subject to both an AAR policy and a data policy, the data policy path decision takes precedence with consideration for the AAR SLA class match.

Data and AAR Troubleshooting Workflow from vSmart

From the vSmart perspective, the workflow is similar to control policies, with an additional step for XML translation verification:

vsmart1# show support omp peer peer-ip <system-ip> | begin "Policy received" | until "Statistics"

This displays the complete XML-encoded policy that vSmart crafts and sends to the WAN Edge, including all data-policy sequences, VPN lists, prefix lists, and action statements.

Data and AAR Troubleshooting Workflow from WAN Edge

Step 1: Check policy assignment on WAN Edge.

For AAR and data policies received via OMP (volatile Policy RIB):

show sdwan policy from-vsmart

For localized policies that are part of a template:

show sdwan running-config "policy"

Step 2: Verify correct next-hop and egress interface selection.

show sdwan policy service-path vpn <name> interface <name> source-ip <ip-addr> dest-ip <ip-addr> protocol <id> src-port <number> dst-port <number> app <name> [all]

If the output shows "Next Hop: Blackhole," you have a policy or routing problem causing traffic to be dropped.
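
When running this check across many flows, the blackhole condition can be detected in captured output. The Python sketch below assumes only that the output contains a "Next Hop:" line as quoted above. The rest of the output format is not modeled, the second sample string is purely illustrative, and the function name is hypothetical.

```python
def is_blackholed(service_path_output: str) -> bool:
    """Return True if any 'Next Hop' line in the output reports Blackhole."""
    for line in service_path_output.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue
        if label.strip().lower() == "next hop":
            if value.strip().lower().startswith("blackhole"):
                return True
    return False


if __name__ == "__main__":
    print(is_blackholed("Next Hop: Blackhole"))
    # Illustrative non-blackhole sample; real output carries more fields.
    print(is_blackholed("Next Hop: IPsec"))
```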

Step 3: Verify correct policy match.

There are three methods:

3a. Use policy counters:

Add a count action to your policy sequence: action accept count <counter-name>

Then display counters:

show sdwan policy data-policy-filter
show sdwan policy app-route-policy-filter
show sdwan policy access-list-counters

3b. Use logging action:

Add log to a policy sequence action. This logs the first packet in each flow. Then filter the log for the destination you are testing:

show logging | include dst: <ip-address>

The log entry includes the VPN, source/destination IP and port, protocol, direction, policy name, sequence number, and result (accept/drop).

3c. Use packet-trace:

debug platform condition ipv4 <address>/<mask> both
debug platform packet-trace packet <number-of-packets> [fia-trace]
debug platform condition start
show platform packet-trace summary
show platform packet-trace packet <number>

To stop and clear:

debug platform condition stop
clear platform condition all

To trace only dropped packets, first check QFP drop statistics:

show platform hardware qfp active statistics drop detail

Then enable trace for a specific drop code:

debug platform packet-trace drop code <id>

Step 4: Other useful commands.

Verify AAR SLA class statistics:

show sdwan app-route stats

Check traffic flows and path taken using NetFlow data (requires policy flow-visibility or cFlowd template):

show sdwan app-fwd cflowd flows

Verify DPI application classification (requires policy app-visibility):

show sdwan app-fwd dpi flows

Generating Synthetic Traffic for Testing

Starting with IOS-XE 17.12, you can generate synthetic traffic probes from the CLI when no user is available at a remote site for testing:

request platform software sdwan synthetic-traffic probe vpn-id 1 url www.cisco.com

The probe result is reported in the system log and includes the application name, URL, source interface, latency, loss percentage, and score.

Understanding the IOS-XE Policy Programming Architecture

When troubleshooting stubborn policy issues, it helps to understand how policies are programmed internally within IOS-XE on the WAN Edge.

Software Architecture Overview

All IOS-XE-based platforms share a similar architecture:

Control Plane (RP): Runs the Linux kernel and multiple processes including SD-WAN daemons (vDaemon, OMPd, FTMd, ConfD, TTMd), IOS, DMI/NESD/NETCONF, and Forwarding Manager (FMAN-RP).
Data Plane (QFP): The Quantum Flow Processor uses the Feature Invocation Array (FIA) for traffic processing, including data and AAR policies, ACLs, QoS marking, security policies, and more.

Key differences across platforms are in the data plane implementation (dedicated CPU/linecard versus Linux software process) and crypto implementation (inline versus external hardware accelerator).

How Centralized Policies Are Installed

The policy installation flow follows this path:

vSmart sends the policy update over OMP (TCP session encapsulated in DTLS)
OMPd on the WAN Edge receives the XML policy, prunes unsupported configurations, and commits to ConfD
FPMd (Forwarding Policy Manager Daemon) parses the configuration and populates internal data structures
FMAN-RP passes through to FMAN-FP, which sets up object dependencies
FMAN-FP pushes the policy into the QFP data plane where PPE microcode implements packet processing features via the FIA

Low-Level Policy Verification Commands

AOM (Asynchronous Object Manager) status is the key indicator: "Done" means the object was programmed successfully; "Pending" means there is a problem.

Check FMAN-FP policy binding:

show platform software sdwan fp active policy bind summary

This shows the target VRF, direction, address family, class-group (policy) name, AOM ID, and AOM status.

Pro Tip: The VRF ID in the platform output does not always match its name. Use show ip vrf detail <name> | include Id to find the actual VRF ID.

Check policy sequences:

show platform software sdwan fp active policy class summary

If a sequence is missing from the output (for example, because it matches on a custom application and the SD-AVC definition download from vManage failed), the policy will not work as intended.

Check detailed policy programming in FMAN-FP:

show platform software sdwan f0 policy cg <id> detail

Verify policy in the QFP data plane:

show platform hardware qfp active interface if-name <interface> | include SDWAN
show platform hardware qfp active classification class-group-manager class-group client sdwan all
show platform hardware qfp active feature sdwan client policy class-group <id> detail

If you encounter log messages like %FMFP-3-OBJ_DWNLD_TO_DP_STUCK, these indicate that AOM download to the data plane is stuck, and low-level verification is warranted.

Real-World SD-WAN Policy Failures: Disjoined Underlays and Redundancy

Field escalations reveal several patterns of SD-WAN policy failures that are not immediately obvious. Understanding these scenarios will save you significant troubleshooting time.

Disjoined Underlay Without Control Policy

In overlays with disjoined underlays (where different transport types such as MPLS and Internet are not directly connected to each other), branch-to-branch traffic must traverse a gateway router. Without a control policy, OMP routes from a branch connected only to MPLS may include TLOCs reachable only via Internet, which the branch cannot resolve. This leads to ECMP distributing traffic across both gateways, and if one gateway loses its Internet link, approximately 50% of traffic will be blackholed.

Solution: Configure a control policy to rewrite TLOCs (hub-and-spoke topology):

policy
 lists
  site-list BR1
   site-id 1
  !
  site-list BR2
   site-id 2
  !
  site-list ALL_BRANCHES
   site-id 1
   site-id 2
  !
  tloc-list INET_TLOCS
   tloc 10.0.0.1 color biz-internet encap ipsec
   tloc 10.0.0.2 color biz-internet encap ipsec
  !
  tloc-list MPLS_TLOCS
   tloc 10.0.0.1 color mpls encap ipsec
   tloc 10.0.0.2 color mpls encap ipsec
  !
 !
 control-policy CHANGE_TLOC_NH
  sequence 10
   match route
    site-list BR1
    vpn 3
   !
   action accept
    set
     tloc-list INET_TLOCS
    !
   !
  !
  sequence 20
   match route
    site-list BR2
    vpn 3
   !
   action accept
    set
     tloc-list MPLS_TLOCS
    !
   !
  !
  default-action accept
 !
!
apply-policy
 site-list ALL_BRANCHES
  control-policy CHANGE_TLOC_NH out
 !
!

Other possible solutions include configuring TLOC extension, setting up IGP/BGP peering between gateways with bidirectional OMP redistribution, or enabling Multi-Regional Fabric "Light" with the transport-gateway enable command (available in IOS-XE 17.9 and later).

Active-Standby Redundancy Failure with Control Policy Direction

A common mistake is using an inbound control policy on vSmart to set OMP route preference for active-standby gateway redundancy. When the active gateway loses its MPLS link, an inbound policy may have already filtered out backup paths before best-path selection, leaving no valid alternatives for branches that can only reach the gateway via MPLS.

The fix: Use an outbound control policy instead. Outbound policies apply after best-path selection on vSmart, so backup paths are preserved:

apply-policy
 site-list ALL_BRANCHES
  control-policy PREFER_GW1 out
 !
!

An alternative is to set TLOC preference directly on the gateway interface rather than using a control policy:

sdwan
 interface GigabitEthernet2
  tunnel-interface
   encapsulation ipsec preference 200
   exit
  exit
 !
!

The set tloc-list Trap

When using set tloc-list with preferences in a control policy to implement multi-level backup, a dangerous behavior can occur. If the preferred gateway (GW1) stops advertising a route because it lost LAN connectivity, vSmart still executes the policy as instructed and replaces the route's TLOC with the TLOC list, which still includes GW1's TLOCs. Because GW1's TLOCs remain reachable at the overlay level, branches keep sending traffic to GW1 even though the destination behind it is no longer reachable, and that traffic is blackholed.

Solution: Match on TLOC conditionally so that TLOC rewrite only happens when the corresponding gateway is actually advertising the route:

control-policy DC_PREFERENCES_FIX
 sequence 10
  match route
   site-list DCs
   tloc-list GW1_TLOCS
  !
  action accept
   set
    tloc-list GW1_TLOCS_W_PREF
   !
  !
 !
 sequence 20
  match route
   site-list DCs
   tloc-list GW2_TLOCS
  !
  action accept
   set
    tloc-list GW2_TLOCS_W_PREF
   !
  !
 !
 default-action accept
!

How Does Traffic Engineering with set tloc-action Work in SD-WAN Policies?

The set tloc-action feature allows you to steer traffic through an intermediate router (such as a DC gateway) on its way to the final destination. However, it requires specific configuration on the intermediate router and has important limitations.

Mandatory Prerequisite

When using set tloc-action primary in a control policy, you must configure service TE in the global VRF on the intermediate router:

sdwan
 service TE vrf global
!

Without this configuration, the traffic-engineered path will show as "Inv,U" (Invalid, Unresolved) in the OMP route table, and traffic will take the direct path instead of the intended intermediate hop. Note that this same configuration is also a prerequisite for dynamic on-demand tunnels (ODT) to function correctly.

Disjoined Underlay Limitation

The tloc-action feature is only supported end-to-end if the transport color is the same from the source site to the intermediate hop and from the intermediate hop to the final destination. If different colors are involved (for example, biz-internet to the intermediate hop but MPLS from the intermediate hop to the destination), the path will remain unresolved and invalid.
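
The two conditions for a usable traffic-engineered path can be summarized as a simple check. This Python sketch only restates the rules described in this section: the function name is hypothetical, and the "resolved" return value is a placeholder label rather than an actual OMP route status string.

```python
def te_path_state(service_te_enabled: bool,
                  color_to_hop: str,
                  color_from_hop: str) -> str:
    """Return the expected state of a tloc-action path in this toy model."""
    if not service_te_enabled:
        # Intermediate router lacks 'service TE vrf global'.
        return "Inv,U"
    if color_to_hop != color_from_hop:
        # Different colors on the two legs leave the path unresolved.
        return "Inv,U"
    return "resolved"  # placeholder label, not a CLI status code


if __name__ == "__main__":
    print(te_path_state(True, "mpls", "mpls"))
    print(te_path_state(True, "biz-internet", "mpls"))
    print(te_path_state(False, "mpls", "mpls"))
```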

Common SD-WAN Policy Pitfalls and Field-Proven Solutions

Device Cannot Install Policy After Reload

In large-scale deployments (thousands of routers), a hub router may fail to install any policy after a software upgrade and reload. The root cause is the punt policer, a rate limiter for control plane packets sent to the CPU. When thousands of spokes simultaneously try to establish tunnels after the hub comes back online, the punt policer drops excessive control plane traffic, including traffic for the OMP sessions that carry policy updates.

Diagnosis:

show platform software punt-policer drop-only

Look for high drop counts on punt cause 11 ("For-us data").

Solution:

platform punt-policer 11 10000 high

Traffic Blackholing with DIA Policy

Direct Internet Access (DIA) data policies can cause traffic blackholing when NBAR (Deep Packet Inspection) misclassifies an internal application as a SaaS application (such as Office 365). Internal traffic matching the DIA policy gets NAT'd out the Internet-facing interface instead of being forwarded through the overlay.

Solution: Insert a data policy sequence above the NAT sequence that explicitly accepts traffic destined for RFC1918 addresses:

policy
 data-policy VPN_1_NAT
  vpn-list VPN_1
   sequence 1
    match
     destination-data-prefix-list RFC1918
    !
    action accept
    !
   !
  !
 !
!

Starting with version 20.13/17.13, RFC1918 addresses are excluded from DIA/CoR evaluation by default.
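
The logic of placing an RFC1918 match ahead of the NAT sequence can be sketched in Python with the standard ipaddress module. This is an illustrative model only: the function names and the "overlay" and "nat-dia" labels are assumptions, and the classified_as_saas flag stands in for whatever NBAR reports for the flow.

```python
import ipaddress

# The three private ranges defined by RFC 1918.
RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(destination: str) -> bool:
    """True if the destination falls inside any RFC 1918 range."""
    address = ipaddress.ip_address(destination)
    return any(address in network for network in RFC1918)


def dia_decision(destination: str, classified_as_saas: bool) -> str:
    """First-match model: the RFC1918 sequence is evaluated before NAT."""
    if is_rfc1918(destination):
        return "overlay"  # sequence 1: accept without NAT
    if classified_as_saas:
        return "nat-dia"  # later sequence: NAT out the Internet interface
    return "overlay"


if __name__ == "__main__":
    # An internal server misclassified as SaaS still stays on the overlay.
    print(dia_decision("10.1.2.3", classified_as_saas=True))
    print(dia_decision("8.8.8.8", classified_as_saas=True))
```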

DPI Classification Failures

NBAR application recognition can fail for several reasons:

DNS is encrypted (DNS over HTTPS, DNS over TLS)
Asymmetric routing in dual-homed sites causes DNS replies to arrive on a different path
DNS traffic does not pass through the router performing classification
DNS traffic is forwarded in a VRF different from where the application data flows
DNS pipelining: multiple DNS requests sent over the same UDP stream. IOS-XE 17.12 and later can handle up to 32 consecutive requests; older versions recognize only the first request.

Incorrect DSCP Marking in cFlowd

When a data policy sets DSCP marking but show sdwan app-fwd cflowd flows shows DSCP as "0," the marking is actually being applied correctly (confirmed by packet-trace and tcpdump). The issue is that the cFlowd template does not collect the DSCP output field by default. The solution is to enable it on the vSmart cFlowd template:

vsmart1(config)# policy cflowd-template test-cflowd-template
vsmart1(config)# customized-ipv4-record-fields collect-tos collect-dscp-output
vsmart1(config)# commit

Typical SD-WAN Policy Issues Quick Reference

Generic Policy Issues

Wrong default-action (reject vs. accept)
Wrong direction of policy application (in vs. out, from-tunnel vs. from-service)
Policy scope too narrow or too wide (site-id not specified in match, action applied to entire site-list)
Typos and misconfigurations (missing prefix from a prefix-list, wrong mask, wrong site-id)
Forgetting that once a packet matches a policy sequence, it is final and further sequences are not processed
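
First-match semantics and the role of the default action can be illustrated with a short Python sketch. It is a generic model, not the actual policy engine: sequences are represented as hypothetical (number, match, action) tuples and packets as plain dictionaries.

```python
def evaluate(sequences, default_action, packet):
    """Return (sequence_number, action) for the first matching sequence.

    A sequence matches when every field in its match dict equals the
    corresponding packet field. Later sequences are never consulted once
    a match is found; the default action applies only if nothing matches.
    """
    for number, match, action in sorted(sequences, key=lambda s: s[0]):
        if all(packet.get(field) == value for field, value in match.items()):
            return number, action
    return None, default_action


if __name__ == "__main__":
    policy = [
        (10, {"dst_port": 443}, "accept"),
        # Never reached for port 443 traffic: sequence 10 matches first.
        (20, {"dst_port": 443, "src": "10.1.1.1"}, "drop"),
    ]
    print(evaluate(policy, "reject", {"dst_port": 443, "src": "10.1.1.1"}))
    print(evaluate(policy, "reject", {"dst_port": 80, "src": "10.1.1.1"}))
```

The second sequence is shadowed by the broader first one, which is the same ordering mistake that often hides behind an unexpected accept or drop.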

Control Policy Issues

Control policy applied inbound before OMP best-path selection, causing backup paths to be lost
Unconditional TLOC rewrites with set tloc-list where vSmart ignores TLOC state
Using set tloc-action without enabling service TE on the intermediate WAN Edge
Attempting to join different transport colors with set tloc-action

AAR and Data Policy Issues

Return traffic is asymmetric; AAR is unidirectional so traffic may return on a different color if the remote device has no symmetric policy
ECMP paths not available across multiple colors, limiting AAR choices
AAR convergence misunderstanding: by default, it may take up to 1 hour for AAR to change a path (app-route poll-interval 600 seconds multiplied by multiplier 6)
BFD hello-interval affects how many samples feed the app-route statistics (accuracy) but not AAR reaction time (convergence), which is governed by the app-route poll-interval and multiplier
Data policy overrides AAR but considers AAR SLA class match (20.6 and later behavior)
DIA NAT fallback or SIG fallback-to-routing not configured by default
First packet match failures causing policy bypass; may need policy flow-stickiness-disable (17.6 and later)
Fragmented UDP packets matching wrong policy sequences because fragment headers lack port information

AAR Timing Details

AAR statistics collection works with a sliding window approach:

For each polling interval period, AAR collects BFD Hello packets to measure loss and latency
Statistics are retained for the duration defined by polling interval multiplied by multiplier (default: 10 minutes multiplied by 6 = 60 minutes)
Statistics are stored in six buckets indexed 0 through 5, with bucket 0 containing the latest data
Every polling interval, the newest statistics go into bucket 0, bucket 5 is discarded, and remaining buckets shift
AAR calculates the mean of loss and latency across all buckets and compares to configured SLA thresholds
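
The sliding window described above can be modeled with a fixed-length deque. The Python sketch below is a simplified illustration, not the actual implementation: the class name, the strict greater-than comparison, and the sample values (6% loss against a hypothetical 2% SLA threshold) are assumptions.

```python
from collections import deque


class AppRouteWindow:
    """Toy model of the app-route statistics buckets."""

    def __init__(self, multiplier: int = 6):
        # Index 0 holds the newest bucket; the oldest falls off the end.
        self.buckets = deque(maxlen=multiplier)

    def add_poll(self, loss_pct: float, latency_ms: float) -> None:
        """Record the statistics for one completed polling interval."""
        self.buckets.appendleft((loss_pct, latency_ms))

    def mean(self):
        """Mean loss and latency across all buckets currently held."""
        count = len(self.buckets)
        if count == 0:
            return 0.0, 0.0
        loss = sum(bucket[0] for bucket in self.buckets) / count
        latency = sum(bucket[1] for bucket in self.buckets) / count
        return loss, latency

    def violates(self, max_loss_pct: float, max_latency_ms: float) -> bool:
        loss, latency = self.mean()
        return loss > max_loss_pct or latency > max_latency_ms


def polls_until_violation(window, loss_pct, latency_ms, max_loss, max_latency):
    """Count polling intervals of degraded stats before the SLA is violated."""
    for polls in range(1, window.buckets.maxlen + 1):
        window.add_poll(loss_pct, latency_ms)
        if window.violates(max_loss, max_latency):
            return polls
    return None


if __name__ == "__main__":
    window = AppRouteWindow()
    for _ in range(6):
        window.add_poll(0.0, 20.0)  # healthy history fills all buckets
    polls = polls_until_violation(window, 6.0, 20.0, 2.0, 100.0)
    print(polls, "polling intervals before the averaged loss exceeds 2%")
```

Because healthy buckets dilute the average, a sustained 6% loss in this example only pushes the mean above 2% after three polling intervals, which is 30 minutes with the default 10-minute interval.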

Frequently Asked Questions

What is the difference between centralized and localized SD-WAN policies?

Centralized policies are defined on vManage and enforced either on vSmart (control policies) or on WAN Edge routers via OMP distribution (data and AAR policies). They are stored in volatile memory on the WAN Edge. Localized policies are pushed via NETCONF as part of the device configuration template and include local routing policies (OSPF/BGP), local data policies (QoS, ACLs), and security policies (ZBFW, UTD). Centralized policies affect overlay-level routing and forwarding, while localized policies handle device-specific behaviors.

Why does my SD-WAN control policy cause traffic blackholing after a gateway failure?

This typically happens when using an inbound control policy to set route preferences or when using set tloc-list unconditionally. Inbound policies filter paths before best-path selection, potentially removing backup paths. Unconditional TLOC list rewrites cause vSmart to continue advertising TLOCs for a gateway that has lost service-side connectivity. The solution is to use outbound control policies and match on TLOC conditionally so that TLOC rewrite only applies when the gateway is actively advertising the route.

How long does it take for AAR to react to an SLA violation?

By default, AAR may take up to one hour to change a forwarding path. This is because the app-route poll-interval defaults to 600 seconds and the multiplier defaults to 6, meaning AAR averages statistics over a 60-minute window. To achieve faster convergence, reduce the poll-interval and/or multiplier values. Note that the BFD hello-interval affects the accuracy of statistics but does not directly control AAR convergence time.

How can I test SD-WAN policy behavior without a user at the remote site?

Starting with IOS-XE 17.12, you can use the synthetic traffic probe feature to generate test traffic from the CLI: request platform software sdwan synthetic-traffic probe vpn-id 1 url www.cisco.com. The probe result is reported in the system log and includes latency, loss percentage, and score. For more detailed analysis, use the NWPI (Network-Wide Path Insights) tool available in vManage 20.12.1 and later, which traces flows end-to-end and provides advanced insight views.

What should I check if a WAN Edge router does not install any policy after a reload?

Check the QFP drop counters using show platform hardware qfp active statistics drop detail. If you see high counts for "PuntPerCausePolicerDrops" (drop code 206), the punt policer is rate-limiting control plane traffic, preventing OMP sessions from delivering the policy. Use show platform software punt-policer drop-only to confirm, and increase the policer rate with platform punt-policer 11 <rate> high. This is common in large-scale deployments where thousands of spokes try to establish tunnels simultaneously after a hub reload.

How do I prevent DIA policies from blackholing internal traffic?

Add a data policy sequence above your DIA/NAT sequence that matches RFC1918 destination addresses and accepts them without NAT. This ensures internal traffic is forwarded through the overlay rather than being misclassified by NBAR and sent out the DIA circuit. Starting with version 20.13/17.13, RFC1918 exclusion from DIA evaluation is handled by default.

Conclusion

Mastering SD-WAN policy troubleshooting requires a structured approach that begins at the vManage UI, progresses through vSmart controller verification, and extends deep into the IOS-XE forwarding architecture on the WAN Edge. The key takeaways from this guide are:

Always start with the generalized workflow: determine the policy type, verify activation and assignment in vManage, then move to CLI verification on vSmart and WAN Edge.
Understand the direction of policy application: inbound vs. outbound control policies behave fundamentally differently, and choosing the wrong direction is a common cause of failover failures.
Use the full troubleshooting toolkit: from show configuration commit changes and test policy match on vSmart, to show sdwan policy service-path, packet-trace, and NWPI on the WAN Edge.
Know the common pitfalls: unconditional TLOC rewrites, missing service TE for traffic engineering, DPI misclassification with DIA, and punt policer exhaustion in large-scale environments.
Verify policy programming at every level: from OMP to FPMd to FMAN-FP to QFP, using AOM status as your primary health indicator.

SD-WAN policy design and troubleshooting is an advanced skill that directly impacts production network reliability.