7 Ways to Fail with Catalyst SD-WAN and How to Fix Them
Introduction
You have spent months planning, designing, and deploying your Catalyst SD-WAN fabric. The control connections are green, overlays are forming, and traffic is flowing exactly the way you expected. Then, without any warning, BFD sessions start flapping across hundreds of sites. Or worse — corporate application traffic silently takes the wrong path and nobody notices until users start filing tickets. These are not hypothetical horror stories. They are real-world SD-WAN mistakes that catch even experienced network engineers off guard.
The old engineering proverb rings painfully true: "Anything that can go wrong will go wrong, and at the worst possible time." And in the context of Catalyst SD-WAN, an equally dangerous cousin of that law applies: "Making assumptions is a perfect recipe for a disaster."
This article walks through seven of the most consequential ways that Catalyst SD-WAN deployments fail in production. For each failure mode, you will learn exactly what happens, why it happens, how to confirm the root cause with real CLI commands, and — most importantly — how to prevent it from ever happening in your network. Whether you are preparing for a certification exam or managing a live SD-WAN fabric, these Catalyst SD-WAN tips will save you hours of troubleshooting and protect your deployment from avoidable outages.
Before diving in, let us make sure we are aligned on the key terminology that appears throughout this article.
Essential Catalyst SD-WAN Terminology
Understanding the building blocks of Catalyst SD-WAN is critical before you can understand how they break. The table below summarizes the acronyms and architectural constructs referenced throughout this article.
| Term | Description |
|---|---|
| CP | Control-plane connections (DTLS/TLS tunnels between edges and controllers) |
| DP | Data-plane connections (IPSec tunnels with BFD running on top) |
| IPSec Re-key | The process of refreshing and redistributing IPSec key material along with the SPI, maintained by the FTMd process |
| TLOC | A data structure that uniquely identifies a control-plane or data-plane tunnel endpoint (tunnel interface) |
| DIA | Direct Internet Access — local Internet breakout from the edge device |
| FIF | First packet in the flow — the initial packet used for classification decisions |
| OMP | The control-plane protocol in Catalyst SD-WAN that distributes routes, TLOCs, and policies |
| vDaemon | A process running on each edge device and controller to establish and maintain control-plane tunnels |
| FTMd | A process running on each edge device to dynamically calculate the overlay FIB and maintain IPSec/BFD sessions |
With this foundation in place, let us examine how the SD-WAN data plane is constructed automatically — because understanding that process reveals exactly where the pitfalls hide.
How Does Catalyst SD-WAN Build Its Data Plane Automatically?
One of the defining characteristics of Catalyst SD-WAN is that the data plane is built dynamically. Understanding this process is essential to diagnosing every failure mode discussed in this article.
Here is the sequence at a high level:
- Underlay transport connectivity — Each WAN edge device connects to one or more transport networks (for example, business Internet and MPLS).
- Control-plane tunnel establishment — Each edge builds DTLS or TLS tunnels to the controllers (vSmart). The vDaemon process on each device is responsible for establishing and maintaining these tunnels.
- OMP route and TLOC exchange — Once control connections are up, the edge sends OMP updates containing its local TLOCs (tunnel endpoints) and learned routes (from BGP, OSPF, connected, or static) to vSmart. The controller reflects these updates to other edges.
- IPSec tunnel formation — When a WAN edge receives remote TLOC information via OMP, the FTMd process dynamically calculates the overlay FIB and establishes IPSec tunnels to the remote TLOCs.
- BFD session activation — BFD runs on top of each IPSec tunnel to detect path failures quickly.
The critical dependency chain here is: BFD relies on IPSec. IPSec relies on OMP TLOC updates, local TLOC state, and control connections. If OMP is not exchanging TLOCs properly, IPSec tunnels cannot form. If there is no IPSec tunnel, there is no BFD session. Every failure mode in this article traces back to a break somewhere in this chain.
By default, Catalyst SD-WAN builds a full mesh of IPSec tunnels. Every edge attempts to establish a direct data-plane tunnel to every other edge for which it receives a TLOC via OMP. There are no warning signs and no guardrails when this behavior begins to push a platform beyond its limits.
Now let us walk through the seven most common SD-WAN mistakes and how to fix each one.
Failure 1: Hitting Platform Scale Limits Without Knowing It
This is arguably the most insidious failure because it does not announce itself. It creeps in gradually as your network grows.
What You Have
A network with a few hundred sites and devices. The number of devices evolves naturally over time as branches are added. Lower-end hardware platforms — such as the ISR 1000 series — are used for smaller branch offices.
What You Do
You deploy a basic device template to ensure control connections and data-plane connectivity. There are no specific traffic policies, no topology restrictions, and no additional configurations beyond the basics. Everything works perfectly for weeks or even months.
What Goes Wrong
Then, without any warning signs, you start seeing BFD flaps on most — if not all — devices in the network. Some tunnels remain up while others keep flapping with no discernible pattern. You confirm there are no ISP-related issues. Bandwidth consumption is well within the envelope. No software failures are detected. Yet the fabric is unstable.
Why It Happens
The root cause is platform scale limits. For example, the safe limit for the ISR 1000 platform is 200 IPSec tunnels. Since Catalyst SD-WAN builds full-mesh IPSec tunnels by default, the number of tunnels grows quadratically as you add sites. There is no call-admission control — the platform will keep trying to establish new tunnels even after it has reached its limit. This destabilizes already-established tunnels, causing the flapping behavior you observe.
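The quadratic growth is easy to underestimate. The sketch below is a simplified model of the per-edge tunnel count in a default full-mesh fabric — it assumes every color can build a tunnel to every other color, with no tunnel-group or restrict settings — but it shows how quickly a modest fabric crosses the 200-tunnel ISR 1000 guideline:

```python
def tunnels_per_edge(num_sites: int, transports_per_site: int) -> int:
    """Approximate IPSec tunnel (BFD session) count on one edge in a
    default full-mesh fabric: one tunnel from each local TLOC to each
    remote TLOC. Simplified model -- assumes every color reaches every
    other color, with no restrict or tunnel-group limits."""
    remote_tlocs = (num_sites - 1) * transports_per_site
    return transports_per_site * remote_tlocs

print(tunnels_per_edge(101, 1))   # 100 tunnels -> comfortable
print(tunnels_per_edge(201, 1))   # 200 tunnels -> at the ISR 1000 limit
print(tunnels_per_edge(101, 2))   # 400 tunnels -> far beyond it
```

Note the last case: adding a second transport to a 101-site fabric quadruples the per-edge tunnel count, because both the local and the remote TLOC counts double.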
How to Confirm the Issue
The diagnostic process follows the dependency chain: BFD depends on IPSec, which depends on OMP TLOCs and control connections. Start at the top and work down.
Step 1 — Check BFD summary:
# show sdwan bfd summary
Look at the key counters. In a problem scenario, you might see output like this:
sessions-total 400
sessions-up 174
sessions-max 400
sessions-flap 81077
poll-interval 600000
When sessions-up is significantly lower than sessions-total and the sessions-flap counter is extremely high (81,077 in this example), you have a clear indication that tunnels are churning.
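The churn signature above is simple enough to check programmatically. This is a hypothetical helper (the function names and the 90%/1000 thresholds are illustrative choices, not Cisco values) that parses the counter lines shown above and flags the pattern:

```python
def parse_bfd_summary(output: str) -> dict:
    """Parse 'show sdwan bfd summary' counter lines into a dict.
    Hypothetical helper -- field names taken from the sample output above."""
    counters = {}
    for line in output.strip().splitlines():
        key, value = line.split()
        counters[key] = int(value)
    return counters

def looks_like_scale_issue(c: dict, up_ratio: float = 0.9) -> bool:
    """Flag churn: many sessions down plus a very large flap count.
    Thresholds are illustrative -- tune them for your network."""
    return (c["sessions-up"] < up_ratio * c["sessions-total"]
            and c["sessions-flap"] > 1000)

sample = """sessions-total 400
sessions-up 174
sessions-max 400
sessions-flap 81077
poll-interval 600000"""
print(looks_like_scale_issue(parse_bfd_summary(sample)))  # True
```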
Step 2 — Verify control connections are stable:
# show sdwan control connections
Check the uptime column. If control connections have been up for days (for example, 4:23:12:30), they are not the problem. If they are also flapping, investigate that first.
Step 3 — Check OMP TLOC counts:
# show sdwan omp summary
Look at the tlocs-received counter. This value increases each time a TLOC on a remote edge flaps. If tlocs-received is climbing rapidly (for example, 802 and growing), remote edges are flapping and sending constant OMP updates.
Step 4 — Verify local TLOCs are stable:
# show sdwan control local-properties
Check the uptime for each local TLOC. If the uptime is very short, investigate the WAN interfaces first.
Step 5 — Check the platform scale limit for SD-WAN tunnels. This is a platform-specific value that you should always know for every hardware model in your network.
Step 6 — Enable debug traces for FTMd:
# debug platform software sdwan ftmbfd
# debug platform software sdwan ftmipsec
# debug platform software sdwan ftmttm-events
# show logging profile sdwan start last 5 minutes
Step 7 — Look for the smoking gun. In the log output, watch for messages like this:
*Oct 1 13:49:01.702: %Cisco-SDWAN-CEDGE1-FTMD-6-INFO-1000015: R0/0: FTMD: Tunnel Add to TLOC 10.2.1.7.biz-internet Failed. Reason Out of resources
The message Tunnel Add ... Failed. Reason Out of resources confirms that the IPSec tunnel scale limit has been reached.
How to Prevent It
Several strategies exist to avoid this SD-WAN mistake:
- Implement centralized control policy for hub-and-spoke topologies. If your design is hub-and-spoke, configure a control policy on vSmart to prevent direct tunnels between spokes. This dramatically reduces the tunnel count on branch devices.
- Use on-demand tunnels. If you still need occasional direct spoke-to-spoke communication, enable on-demand tunnels (similar in concept to DMVPN). Tunnels are built only when traffic demands it and torn down after a timeout.
- Deploy Multi-Region Fabric (MRF). Also known as Hierarchical SD-WAN (H-SDWAN), MRF segments the overlay into regions, limiting the scope of full-mesh tunnel formation within each region.
- Upgrade hardware. If you genuinely need full-mesh connectivity (which is very unlikely in practice), the only option is to move to a platform with a higher tunnel scale.
Pro Tip: Always document the IPSec tunnel scale limit for every hardware platform in your network. Build a spreadsheet that projects tunnel growth as new sites are added, and set internal thresholds well below the platform maximum.
Failure 2: Making False Assumptions About VPN0 and Data-Plane Tunnels
This failure involves a creative but ultimately flawed network design that works partially — control plane comes up perfectly, but the data plane refuses to follow.
The Scenario
You have a topology where an SD-WAN edge device does not have direct IP connectivity to controllers or other edges. Between the edge and the rest of the SD-WAN fabric sits an intermediate non-SD-WAN device. The edge can build a manual IPSec or GRE tunnel to this intermediate device and gain IP reachability to the controllers.
What You Do
You configure:
- A manual IPSec tunnel in a VRF on the SD-WAN edge, pointing to the intermediate device
- A default route in VPN 0 pointing to the tunnel
- Route leaking from the VRF into VPN 0 and back
- A loopback interface in VPN 0 configured as the TLOC
What You Get
Control connections come up successfully. You can ping in VPN 0 from the loopback TLOC to other TLOC interfaces using the manual IPSec tunnel. But no BFD session is established. No BFD packets are sent or received. The data plane is completely dead.
Why It Happens
The assumption was that having IP connectivity and a default route in VPN 0 would be enough to establish both control-plane and data-plane sessions. This assumption is partially correct: it is enough for control-plane DTLS/TLS connections. However, it is not enough for data-plane IPSec/BFD tunnels when the associated WAN interface (the manual IPSec tunnel) is assigned to a non-VPN 0 VRF.
The FTMd process — which is responsible for creating and maintaining data-plane IPSec/BFD sessions — requires the WAN interface to be in VPN 0. When the tunnel interface belongs to a different VRF, FTMd simply cannot use it, even though VPN 0 has a default route pointing to it.
How to Confirm the Issue
Step 1 — Check the default route in VPN 0:
# show ip route
Gateway of last resort is 192.168.4.1 to network 0.0.0.0
S* 0.0.0.0/0 [1/0] via 192.168.4.1
Note which interface the default route points to.
Step 2 — Examine the interface configuration:
interface Tunnel1
 vrf forwarding LTE
 ip address 192.168.4.2 255.255.255.252
 tunnel source Cellular0/2/0
 tunnel mode ipsec ipv4
 tunnel destination 10.10.10.10
 tunnel path-mtu-discovery
 tunnel protection ipsec profile ipsec-profile-1
The key line is vrf forwarding LTE. The tunnel interface is in the LTE VRF, not in VPN 0.
Step 3 — Enable FTMd debug traces:
# debug platform software sdwan ftmall
# show logging process ftminternal
Step 4 — Look for the error message:
2024/09/19 07:34:58.877468369 {ftmd_R0-0}{255}: [ftmd] [30679]: UUID: 0, ra: 0 (ERR): No interface Tunnel1 in default VPN
The message No interface Tunnel1 in default VPN confirms that FTMd cannot use this interface for data-plane tunnel establishment because it is not in VPN 0.
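If you are reviewing large log captures, this specific error string is worth scanning for automatically. A small hypothetical helper, based on the error format shown above:

```python
import re

# Hypothetical log-scan helper: flags FTMd lines reporting a WAN
# interface that is not in VPN 0, using the error string shown above.
PATTERN = re.compile(r"No interface (\S+) in default VPN")

def find_vrf_mismatches(log_text: str) -> list:
    """Return interface names that FTMd rejected as not being in VPN 0."""
    return PATTERN.findall(log_text)

log = ("2024/09/19 07:34:58 {ftmd_R0-0}: [ftmd] (ERR): "
       "No interface Tunnel1 in default VPN")
print(find_vrf_mismatches(log))  # ['Tunnel1']
```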
How to Prevent It
This design represents a corner case that has not been validated, so there is no officially supported answer. However, you can try:
- Move the manual IPSec tunnel into VPN 0. This places the tunnel interface in the correct VRF for FTMd to use it.
- Bind the loopback TLOC to the tunnel interface. This ensures the TLOC association maps to an interface that FTMd can find in VPN 0.
Pro Tip: Before implementing any non-standard design, always verify whether the topology is a validated design. Assumptions about what "should" work based on IP reachability alone do not account for process-level VRF requirements within the SD-WAN architecture. The control plane and data plane have different interface requirements — never assume that what works for one will work for the other.
Failure 3: Misforwarding Traffic Due to Application Misclassification
This failure is subtle and dangerous because traffic appears to be flowing — it is just flowing to the wrong place.
What You Have
A Catalyst SD-WAN network with DIA (Direct Internet Access) configured for local Internet breakout.
What You Want
Specific application traffic should be steered to the Internet via DIA. All other traffic should follow normal overlay routing.
What You Do
You enable NBAR (application visibility) and configure a data policy to match the target traffic and steer it to DIA (the example below uses a simplified source-IP match; the same steering action applies when matching applications by name):
policy data-policy DIA_NAT
 vpn-list VPN1
  sequence 10
   match
    source-ip 10.0.0.1/32
   !
   action accept
    nat use-vpn0
   !
  !
  default-action accept
 !
What You Get
The targeted application traffic is successfully forwarded to the Internet via DIA. However, internal corporate applications start reporting connectivity issues. Additionally, some applications that should be taking DIA show their own connectivity problems. Traffic is being misforwarded.
Why It Happens
The root cause lies in how application classification and flow handling interact. Understanding three key concepts is essential:
First-Packet Match (FPM): This is the ideal scenario where classification is final on the very first packet of a flow. When FPM is achieved, the forwarding decision is correct from the start.
Non-FPM classification: For many applications, final classification is not possible on the first packet. The engine needs to inspect several packets before it can confidently identify the application. During this period, the classification state is "Not final."
Flow stickiness: This is enabled by default to ensure forwarding consistency. Once a forwarding decision is made for a flow, subsequent packets in that flow follow the same path — even if the classification changes.
Here is where the SD-WAN mistakes compound:
| Scenario | FPM | Flow Stickiness | Result |
|---|---|---|---|
| Best case | Yes | Enabled | Traffic correctly forwarded from the first packet |
| Overlay trap | No | Enabled (default) | Some Internet-destined flows may initially be forwarded to the overlay based on non-final classification, and flow stickiness keeps them there |
| Path change | No | Disabled | Flows initially forwarded to overlay may switch to DIA mid-flow after final classification, breaking connections |
With flow stickiness enabled (the default), some flows destined for the Internet with non-FPM matching might be forwarded to the overlay and stay there — leading to misforwarding. With flow stickiness disabled, some flows initially forwarded to the overlay might change their forwarding path to DIA after final classification. Not every application flow can survive such a mid-stream path change.
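The interaction of non-final classification and flow stickiness can be illustrated with a toy model. This is not the actual forwarding code — it just simulates the two behaviors described in the table above for a flow whose classifier guesses "overlay" before settling on "dia":

```python
def forward_flow(packet_paths: list, sticky: bool) -> list:
    """Simulate per-packet forwarding for one flow.

    packet_paths: the path the classifier *would* pick for each packet
    as classification converges (non-final guess first, final later).
    With stickiness on, the first decision is pinned for the whole flow;
    with it off, each packet follows the current classification.
    Illustrative model only."""
    if sticky:
        return [packet_paths[0]] * len(packet_paths)
    return packet_paths

# Non-final first packet guessed "overlay"; final classification is "dia":
guesses = ["overlay", "overlay", "dia", "dia"]
print(forward_flow(guesses, sticky=True))   # stays on overlay: misforwarded
print(forward_flow(guesses, sticky=False))  # switches mid-flow: may break the session
```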
How to Confirm the Issue
Use the packet trace (fia-trace) feature to inspect classification for the traffic of interest. With NBAR enabled, the fia-trace output includes classification details:
Feature: NBAR
Packet number in flow: 1
FIF packet
Classification state: Not final
Classification name: ms-services
App classification
Classification ID: 1737 [CANA-L7:777]
Candidate classification sources: SDAVC-L3-L4: ms-services [1737]
The key fields to examine are:
- Packet number in flow: 1 — This is the first packet (FIF).
- Classification state: Not final — The classification has not reached a final determination yet. This means the forwarding decision made on this packet could be incorrect, and subsequent behavior depends on whether flow stickiness is enabled.
If you see Classification state: Not final on the FIF packet for traffic that is being misforwarded, you have found your root cause.
Pro Tip: When designing DIA policies with application-based matching, test each application's classification behavior with fia-trace before deploying to production. Know which applications achieve FPM and which do not. For applications that do not achieve FPM, consider using IP prefix or FQDN matching instead of application-name matching, as these provide deterministic first-packet classification.
Failure 4: Overlooking IPSec Re-Key Impact on Tunnel Stability
The IPSec re-key process involves refreshing and redistributing IPSec key material along with the Security Parameter Index (SPI). This process is managed by FTMd on each edge device. While re-keying is essential for security, it can introduce transient instability if the device is already operating near its resource limits.
When a device is handling hundreds of IPSec tunnels close to its platform maximum, the simultaneous re-key of many tunnels can compete for CPU and memory resources. Combined with the scale-limit issues discussed in Failure 1, re-key events can trigger cascading BFD flaps that appear random but are actually resource-driven.
The diagnostic approach follows the same methodology as Failure 1: check BFD summary, verify control connections, inspect OMP TLOC stability, and look for Out of resources messages in the FTMd logs. The prevention strategy is the same — ensure your tunnel count stays well below platform maximums so that re-key events do not push the device over the edge.
Failure 5: Ignoring the Relationship Between Control-Plane and Data-Plane Health
Many engineers troubleshoot BFD and IPSec problems in isolation without first verifying that the control plane is healthy. This leads to wasted hours chasing data-plane symptoms when the root cause is a control-plane issue.
Remember the dependency chain: No control connection means no OMP exchange. No OMP exchange means no TLOC distribution. No TLOC distribution means no IPSec tunnel formation. No IPSec tunnel means no BFD session.
A Systematic Troubleshooting Approach for SD-WAN Mistakes
Always start your investigation from the bottom of the stack and work upward:
- Transport layer — Is the underlay reachable? Can the edge reach the controller IPs?
- Control connections — Are DTLS/TLS sessions to vSmart established and stable?
# show sdwan control connections
Check the uptime column. If uptime is measured in seconds or minutes rather than days, the control plane is flapping and that is your primary problem.
- OMP state — Are TLOCs being exchanged? Are route counts stable?
# show sdwan omp summary
Watch the tlocs-received and tlocs-installed counters. If tlocs-received is climbing rapidly while tlocs-installed remains static, remote edges are flapping.
- Local TLOC state — Are local tunnel endpoints stable?
# show sdwan control local-properties
- Data plane — Only after confirming the above layers are healthy should you focus on BFD and IPSec.
# show sdwan bfd summary
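The ladder above can be sketched as an ordered check list. In this illustrative skeleton the health results are supplied as a dict (in practice you would populate them by parsing the corresponding show commands); the function reports the lowest unhealthy layer so you know where to start:

```python
# Ordered troubleshooting ladder: (layer, command to validate it).
LADDER = [
    ("transport",           "ping <controller-ip>"),
    ("control connections", "show sdwan control connections"),
    ("omp",                 "show sdwan omp summary"),
    ("local tlocs",         "show sdwan control local-properties"),
    ("data plane",          "show sdwan bfd summary"),
]

def first_broken_layer(results: dict):
    """Walk the ladder bottom-up; return the first unhealthy layer,
    or None if every layer checks out."""
    for layer, _cmd in LADDER:
        if not results.get(layer, False):
            return layer
    return None

# Example: control plane healthy, but OMP is churning.
status = {"transport": True, "control connections": True,
          "omp": False, "local tlocs": True, "data plane": False}
print(first_broken_layer(status))  # 'omp' -- start there, not at BFD
```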
Pro Tip: Build this troubleshooting ladder into your team's standard operating procedures. Print it out and pin it to the wall of your NOC. When BFD issues are reported, resist the urge to jump straight to data-plane debugging — always validate the control plane first.
Failure 6: Deploying DIA Without Understanding Classification Behavior
This failure is closely related to Failure 3 but focuses specifically on the design decision to deploy DIA with application-based policies without fully understanding how the classification engine works.
How Does Application Classification Work in Catalyst SD-WAN?
When NBAR or SD-AVC is enabled, a data policy can match traffic based on application name. A DIA policy typically matches specific applications and applies a nat use-vpn0 action to send them to the local Internet breakout.
The classification engine uses several sources to identify applications:
- Layer 3/Layer 4 headers — IP addresses and port numbers (fastest, but least specific)
- Layer 7 deep packet inspection — Payload analysis (most accurate, but requires multiple packets)
- SD-AVC — Centralized application visibility and control
The first packet in the flow (FIF) is the critical decision point. When the data policy evaluates the FIF packet, the classification may or may not be final. The Classification state field in the fia-trace output tells you exactly where you stand.
The Danger of Non-Final Classification
When classification is Not final on the FIF packet, the edge device makes a forwarding decision based on its best guess. Depending on the flow-stickiness setting:
- Flow stickiness ON (default): The initial forwarding decision sticks for the entire flow, even if the classification changes later. This provides consistency but can result in permanently misforwarded flows.
- Flow stickiness OFF: The forwarding path can change mid-flow when classification becomes final. This provides accuracy but can break stateful connections.
Neither option is perfect. The best approach is to design your policies so that the applications you are steering to DIA achieve FPM whenever possible. For applications where FPM is not achievable, consider alternative matching criteria such as destination IP prefixes or FQDNs, which provide deterministic matching on the first packet.
Building a Reliable DIA Policy
| Matching Criteria | FPM Possible? | Recommended For |
|---|---|---|
| Source/Destination IP | Yes (always) | Known server ranges |
| FQDN | Yes (with DNS inspection) | SaaS applications with known domains |
| Application name (NBAR) | Depends on app | Applications with confirmed FPM behavior |
Pro Tip: Before deploying any application-based DIA policy, run fia-trace for each target application in a lab or pilot site. Document which applications achieve FPM and which do not. This information should drive your policy design, not the other way around.
Failure 7: Stretching Design Creativity Beyond Validated Boundaries
As demonstrated in Failure 2, network engineers are creative problem solvers. When faced with a constraint — such as the lack of direct IP reachability between an SD-WAN edge and the rest of the fabric — they find workarounds. Manual IPSec tunnels through intermediate devices, route leaking across VRFs, loopback-based TLOCs — these are clever solutions born from deep networking expertise.
The problem is that Catalyst SD-WAN is a tightly integrated system where multiple processes (vDaemon, FTMd, OMP) interact in specific ways. An approach that provides IP reachability at the routing table level may not satisfy the process-level requirements of FTMd, which expects certain interfaces to exist in specific VRFs.
Common Patterns That Lead to This SD-WAN Mistake
- Using non-VPN 0 interfaces for SD-WAN transport — FTMd requires the WAN-facing interface to be in VPN 0. Route leaking and default routes do not change this requirement.
- Assuming control-plane success implies data-plane capability — Control-plane DTLS/TLS connections and data-plane IPSec tunnels have different interface and VRF requirements. A successful control connection does not guarantee data-plane tunnel formation.
- Deploying unvalidated topologies without testing — Corner-case designs that have not been tested may work partially or exhibit unexpected behavior under load.
How to Stay Within Safe Boundaries
- Always verify that your proposed design aligns with validated reference architectures
- Test non-standard designs thoroughly in a lab environment before production deployment
- When control plane works but data plane does not, immediately suspect VRF or interface assignment issues
- Use the FTMd debug traces (debug platform software sdwan ftmall) to identify exactly why tunnels are not forming
Pro Tip: If you find yourself building elaborate workarounds involving manual tunnels, VRF route leaking, and creative TLOC placement, take a step back and ask whether the underlying connectivity problem should be solved at a different layer. Sometimes the right answer is to fix the underlay rather than work around it in the overlay.
Key SD-WAN Best Practices: Summary Table
| Failure Mode | Root Cause | Key Diagnostic Command | Prevention Strategy |
|---|---|---|---|
| Scale limit hit | Full-mesh tunnels exceed platform capacity | show sdwan bfd summary | Control policy, on-demand tunnels, MRF |
| False assumptions | Manual tunnel in non-VPN 0 VRF | show logging process ftminternal | Keep WAN interfaces in VPN 0 |
| Misclassification | Non-FPM app classification with flow stickiness | fia-trace (packet trace) | Use IP/FQDN matching for non-FPM apps |
| Re-key instability | Resource contention during bulk re-key | show sdwan bfd summary | Stay well below tunnel scale limits |
| CP/DP confusion | Troubleshooting data plane before validating control plane | show sdwan control connections | Follow the systematic troubleshooting ladder |
| DIA design gaps | App-based DIA without FPM validation | fia-trace (packet trace) | Test classification behavior before deployment |
| Creative overreach | Unvalidated designs with VRF workarounds | debug platform software sdwan ftmall | Use validated reference architectures |
Frequently Asked Questions
What is the default tunnel behavior in Catalyst SD-WAN?
By default, Catalyst SD-WAN builds a full mesh of IPSec data-plane tunnels between all edge devices. When a WAN edge receives TLOC information from remote edges via OMP, the FTMd process automatically establishes IPSec tunnels to every remote TLOC. BFD sessions then run on top of each IPSec tunnel. There are no built-in guardrails or call-admission controls to prevent the platform from exceeding its tunnel scale limits. This is why networks that grow organically can silently hit platform limits and experience sudden BFD flapping.
How can I check if my SD-WAN device has hit its tunnel scale limit?
Start with show sdwan bfd summary and look at the sessions-total, sessions-up, and sessions-flap counters. If sessions-up is significantly lower than sessions-total and the flap counter is very high, you are likely hitting a scale issue. Confirm by enabling FTMd debug traces with debug platform software sdwan ftmbfd, debug platform software sdwan ftmipsec, and debug platform software sdwan ftmttm-events, then check the logs with show logging profile sdwan start last 5 minutes. The definitive confirmation is a log message containing Tunnel Add to TLOC ... Failed. Reason Out of resources.
What is the difference between FPM and non-FPM classification in SD-WAN?
First-Packet Match (FPM) means the application classification is final on the very first packet of a flow. This is the ideal scenario because the forwarding decision is correct immediately. Non-FPM classification means the engine cannot determine the application identity from the first packet alone and needs to inspect additional packets. During this interim period, the classification state is "Not final," and the forwarding decision may be based on an incorrect classification. You can verify the classification state for any flow using the packet trace (fia-trace) feature — look for the Classification state field in the output.
Why does my SD-WAN control plane work but the data plane does not?
The most common cause is an interface or VRF mismatch. Control-plane DTLS/TLS connections and data-plane IPSec/BFD tunnels have different requirements. The FTMd process, which is responsible for data-plane tunnel creation, requires the WAN-facing interface to be in VPN 0 (the transport VRF). If the interface is assigned to a different VRF, FTMd will log an error such as No interface Tunnel1 in default VPN and refuse to create data-plane tunnels — even though the control plane established its connections through the same path successfully. Always verify interface VRF assignments when control plane works but data plane does not.
What is Multi-Region Fabric (MRF) and how does it help with SD-WAN scale?
Multi-Region Fabric, also known as Hierarchical SD-WAN (H-SDWAN), segments the SD-WAN overlay into multiple regions. Instead of building a full mesh of IPSec tunnels across the entire network, each region maintains its own mesh. Inter-region traffic passes through border nodes. This dramatically reduces the tunnel count on individual edge devices, keeping them well within their platform scale limits even in large deployments with hundreds of sites.
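A back-of-the-envelope comparison makes the scale benefit concrete. The model below is deliberately simplified — it assumes one transport per site, evenly sized regions, and approximates border-node overhead as a flat count — but the order-of-magnitude difference is the point:

```python
def full_mesh_tunnels_per_edge(sites: int) -> int:
    """Per-edge tunnel count in a single full mesh (one transport assumed)."""
    return sites - 1

def mrf_tunnels_per_edge(sites: int, regions: int, border_nodes: int = 2) -> int:
    """Rough per-edge count with MRF: full mesh inside the local region
    only, plus tunnels to the region's border nodes. Simplified model --
    even region sizes, one transport, border overhead approximated."""
    region_size = sites // regions
    return (region_size - 1) + border_nodes

print(full_mesh_tunnels_per_edge(400))       # 399 tunnels per edge
print(mrf_tunnels_per_edge(400, regions=8))  # 51 tunnels per edge
```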
Should I disable flow stickiness when using application-based DIA policies?
There is no universally correct answer. With flow stickiness enabled (the default), misclassified flows stay on their initial path for the entire flow duration — providing consistency but potentially misforwarding traffic. With flow stickiness disabled, flows can change paths mid-stream when classification becomes final — providing accuracy but potentially breaking stateful connections. The best approach is to avoid the dilemma entirely by using matching criteria that achieve FPM (such as IP prefixes or FQDNs) for DIA policies, so that the first forwarding decision is always correct regardless of the flow-stickiness setting.
Conclusion
Catalyst SD-WAN is a powerful platform, but its automation and abstraction can mask critical issues until they manifest as production outages. The seven failure modes covered in this article — from hitting platform scale limits to misforwarding traffic due to classification behavior — share a common thread: they all stem from gaps between how engineers assume the system works and how it actually works at the process level.
The most important SD-WAN best practices to take away are:
- Know your platform limits. Every hardware model has a defined IPSec tunnel scale limit. Know it, track it, and stay well below it.
- Understand the dependency chain. BFD depends on IPSec, which depends on OMP, which depends on control connections. Always troubleshoot from the bottom up.
- Never assume control-plane success means data-plane capability. They have different interface and VRF requirements.
- Test application classification before deploying DIA policies. Use fia-trace to understand FPM behavior for every target application.
- Use validated designs. Creative workarounds may solve the immediate IP reachability problem but break process-level assumptions within the SD-WAN architecture.
- Implement topology controls. Centralized control policies, on-demand tunnels, and Multi-Region Fabric are your tools for managing scale.
- Build systematic troubleshooting procedures. Follow the diagnostic ladder — transport, control connections, OMP, local TLOCs, then data plane — every time.
These are not theoretical recommendations. They are drawn from real production cases where networks failed and engineers had to diagnose, fix, and prevent recurrence. Invest the time to understand these failure modes now, and you will save far more time when your production fabric is on the line.
For hands-on practice with Catalyst SD-WAN deployment, troubleshooting, and policy design, explore the SD-WAN courses available at nhprep.com. Building deep familiarity with these concepts in a lab environment is the best way to ensure you never encounter these failures unprepared in production.