SD-WAN Scalability and Design
Objective
In this lesson you will learn SD‑WAN scalability and design patterns: why full‑mesh, hub‑and‑spoke, and regional‑hub topologies are selected, how they affect control and data plane scale, and how to implement and verify a simple hub‑spoke + regional‑hub design on IOS routers to demonstrate the routing and policy consequences. This matters in production because tunnel counts, control plane state, and security inspection all impact CPU, memory, and failure domains — making the right topology essential for predictable performance in large enterprises.
Real-world scenario: A global enterprise has 200 branch sites and three regional data centers. Direct full‑mesh tunnels between every branch would overwhelm device resources and complicate inspection policies. Instead, the network team must choose regional hubs to reduce tunnel counts and centralize security inspection while preserving optimal SaaS performance where needed.
Quick Recap
We use the same topology introduced in Lesson 1. No new devices are added in this lesson — we focus on topology choices and routing/design changes. The devices in this lesson are:
- Controller (management/controller)
- WAN‑HUB (global/wide area hub)
- REGIONAL‑HUB (regional concentrator)
- BRANCH1 (edge branch)
ASCII topology with exact IP addresses on every interface:

                  Controller
                    (CTRL)
            +---------------------+
            | Gi0/0 192.168.100.1 |
            +----------+----------+
                       |
                       | 192.168.100.0/24
                       |
            +----------+----------+
            | WAN-HUB             |
            | Gi0/2 192.168.100.2 |
            | Gi0/0 10.0.0.1/30   |
            | Gi0/1 10.0.1.1/30   |
            +----+-----------+----+
                 |           |
     10.0.0.0/30 |           | 10.0.1.0/30
                 |           |
       +---------+------+ +--+-------------+
       | REGIONAL-HUB   | | BRANCH1        |
       | Gi0/0 10.0.0.2 | | Gi0/1 10.0.1.2 |
       +----------------+ +----------------+
Device IP addressing summary:
| Device | Interface | IP Address | Role |
|---|---|---|---|
| Controller | Gi0/0 | 192.168.100.1/24 | Controller/management |
| WAN-HUB | Gi0/2 | 192.168.100.2/24 | Controller link |
| WAN-HUB | Gi0/0 | 10.0.0.1/30 | Link to REGIONAL-HUB |
| WAN-HUB | Gi0/1 | 10.0.1.1/30 | Link to BRANCH1 |
| REGIONAL-HUB | Gi0/0 | 10.0.0.2/30 | Regional concentrator |
| BRANCH1 | Gi0/1 | 10.0.1.2/30 | Branch edge |
Tip: For examples and credentials use lab.nhprep.com and password Lab@123 where a domain or password is required.
Key Concepts
Before touching the CLI, understand these principles — theory + practical implications:
- Mesh complexity (O(N^2) tunnels): In a full mesh, every site maintains a tunnel to every other site. The number of tunnels grows as N*(N-1)/2. Protocol-level impact: each tunnel consumes CPU, RAM (state), DTLS/TLS sessions for control/crypto, and increases state to be maintained during churn. In production this matters because an increase in tunnels can saturate CPU and cause control instability.
- Hub‑and‑spoke concentration: Branches point traffic to a hub. This reduces tunnel count dramatically (O(N)) and centralizes stateful services (NGFW, IPS, URL filtering). Packet flow: branch → hub (inspection/peering) → destination. In production this is used when a central security posture or regional peering is required.
- Regional hubs (hierarchical design): Use multiple hubs by geography. This partitions the network into clusters. Benefit: reduces intercontinental hairpinning while keeping per-device scale within limits. Think of it like multiple “mini‑meshes” with central points of concentration.
- Controller vs data plane separation: Controllers provide policy, orchestration and may push route policies, but data plane tunnels carry user traffic. Scalability decisions affect both: more devices => more control sessions to manage, and more tunnels => more data plane resource usage.
- Security inspection cost: NGFW/IPS/URL‑filtering require memory and CPU. When you centralize inspection at a hub, the hub must be sized to handle peak concurrent sessions and throughput. In practice, design for sustained traffic plus bursts and use regional hubs to distribute load.
Analogy: Think of full mesh like every colleague calling every colleague individually — the phone system overloads quickly. Hub‑and‑spoke is like everyone calling a receptionist (hub) who connects or handles the call.
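To put numbers on the tunnel growth described above, here is a minimal Python sketch applied to the 200‑branch, three‑hub scenario. The function names are illustrative, and the model rests on assumptions the lesson does not state: each branch builds exactly one tunnel to its hub, branches are split evenly across regional hubs, and the regional hubs are fully meshed with each other.

```python
from math import ceil

def full_mesh_tunnels(sites: int) -> int:
    """Tunnels in a full mesh: one per pair of sites, N*(N-1)/2."""
    return sites * (sites - 1) // 2

def hub_and_spoke_tunnels(branches: int) -> int:
    """Single hub: one tunnel per branch."""
    return branches

def regional_hub_tunnels(branches: int, hubs: int) -> int:
    """One tunnel per branch to its regional hub, plus a full mesh between hubs."""
    return branches + full_mesh_tunnels(hubs)

def busiest_regional_hub(branches: int, hubs: int) -> int:
    """Tunnels terminating on the most loaded hub when branches are split evenly."""
    return ceil(branches / hubs) + (hubs - 1)

BRANCHES, HUBS = 200, 3
print("full mesh (branches only):", full_mesh_tunnels(BRANCHES))          # 19900
print("single hub-and-spoke:     ", hub_and_spoke_tunnels(BRANCHES))      # 200
print("three regional hubs:      ", regional_hub_tunnels(BRANCHES, HUBS)) # 203
print("busiest regional hub:     ", busiest_regional_hub(BRANCHES, HUBS)) # 69
```

Under these assumptions a branch in the full mesh terminates 199 tunnels, while the busiest regional hub terminates 69 and each branch only one. Real designs often add a second tunnel per branch for redundancy, which doubles the branch‑side figures but keeps growth linear.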
Step-by-step configuration
Each step below shows commands, explains why, and gives verification output you should see. All configuration examples use IOS-style commands.
Step 1: Configure basic interface addressing and loopbacks
What we are doing: Assign IP addresses to physical interfaces and create a loopback for stable router ID and management. This matters because stable IDs and correct IP addressing are prerequisites for routing adjacencies and for controller discovery.
WAN-HUB# configure terminal
WAN-HUB(config)# hostname WAN-HUB
WAN-HUB(config)# interface GigabitEthernet0/2
WAN-HUB(config-if)# ip address 192.168.100.2 255.255.255.0
WAN-HUB(config-if)# no shutdown
WAN-HUB(config-if)# exit
WAN-HUB(config)# interface GigabitEthernet0/0
WAN-HUB(config-if)# ip address 10.0.0.1 255.255.255.252
WAN-HUB(config-if)# no shutdown
WAN-HUB(config-if)# exit
WAN-HUB(config)# interface GigabitEthernet0/1
WAN-HUB(config-if)# ip address 10.0.1.1 255.255.255.252
WAN-HUB(config-if)# no shutdown
WAN-HUB(config-if)# exit
WAN-HUB(config)# interface Loopback0
WAN-HUB(config-if)# ip address 1.1.1.1 255.255.255.255
WAN-HUB(config-if)# exit
WAN-HUB(config)# end
What just happened: Each interface now has an IP and is administratively up; the loopback provides a stable identifier (useful for routing protocols and controller registration). At the protocol level, OSPF (configured in Step 2) uses 1.1.1.1 as its router ID and advertises the loopback prefix as a reachability anchor that does not depend on any single physical link. In SD‑WAN architectures, this loopback often represents the device ID used by the control plane.
Real-world note: Use loopbacks for stable router IDs; physical interfaces can flap and you will want your control plane identity to remain stable.
Verify:
WAN-HUB# show ip interface brief
Interface IP-Address OK? Method Status Protocol
GigabitEthernet0/0 10.0.0.1 YES manual up up
GigabitEthernet0/1 10.0.1.1 YES manual up up
GigabitEthernet0/2 192.168.100.2 YES manual up up
Loopback0 1.1.1.1 YES manual up up
Step 2: Establish OSPF area 0 to demonstrate reachability and full‑mesh routing behavior
What we are doing: Enable OSPF on the hub, regional hub, and branch link IPs. OSPF simulates the overlay routing distribution and allows you to observe adjacency formation and route propagation across hubs and branches.
WAN-HUB# configure terminal
WAN-HUB(config)# router ospf 1
WAN-HUB(config-router)# router-id 1.1.1.1
WAN-HUB(config-router)# network 1.1.1.1 0.0.0.0 area 0
WAN-HUB(config-router)# network 10.0.0.0 0.0.0.3 area 0
WAN-HUB(config-router)# network 10.0.1.0 0.0.0.3 area 0
WAN-HUB(config-router)# network 192.168.100.0 0.0.0.255 area 0
WAN-HUB(config-router)# end
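Repeat the equivalent OSPF commands on REGIONAL‑HUB (router-id 2.2.2.2) and BRANCH1 (router-id 3.3.3.3), advertising each router's own connected networks in area 0. If you also create a Loopback0 on each router with an address that matches its router ID (2.2.2.2/32 and 3.3.3.3/32), advertise it as well so the other routers learn a route to it.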
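The network statements for the other two routers follow mechanically from the addressing table: take each interface address, find its subnet, and use the inverted (wildcard) mask. The sketch below derives them with Python's standard ipaddress module. The Loopback0 addresses are an assumption (the lesson only gives the router IDs for these two devices), and the helper name ospf_config is illustrative.

```python
import ipaddress

# Interface addresses from the addressing table. The /32 loopback entries are
# assumptions: the lesson states only the router IDs for these devices.
DEVICES = {
    "REGIONAL-HUB": {"router_id": "2.2.2.2", "interfaces": ["2.2.2.2/32", "10.0.0.2/30"]},
    "BRANCH1": {"router_id": "3.3.3.3", "interfaces": ["3.3.3.3/32", "10.0.1.2/30"]},
}

def ospf_config(router_id, interfaces, process_id=1, area=0):
    """Build IOS-style OSPF lines; hostmask is the wildcard mask IOS expects."""
    lines = [f"router ospf {process_id}", f" router-id {router_id}"]
    for addr in interfaces:
        net = ipaddress.ip_interface(addr).network
        lines.append(f" network {net.network_address} {net.hostmask} area {area}")
    return lines

for name, dev in DEVICES.items():
    print(f"! {name}")
    print("\n".join(ospf_config(dev["router_id"], dev["interfaces"])))
```

For BRANCH1 this yields network 10.0.1.0 0.0.0.3 area 0, which mirrors the statement already configured on WAN‑HUB for the same link.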
What just happened: OSPF starts sending Hello packets on all interfaces in the specified networks (default Hello interval 10s on broadcast networks). Adjacent routers that share a common area and interface parameters will form neighbor relationships and exchange LSAs, populating the link‑state database and building routes. This is a stand‑in for the SD‑WAN overlay where control plane learning distributes routes to endpoints.
Real-world note: In SD‑WAN, control plane overlays often handle routing abstractions; here OSPF lets you visualize route propagation and scaling effects (LSA count, SPF frequency).
Verify:
WAN-HUB# show ip ospf neighbor
Neighbor ID Pri State Dead Time Address Interface
2.2.2.2 1 FULL/DR 00:00:34 10.0.0.2 GigabitEthernet0/0
3.3.3.3 1 FULL/DR 00:00:36 10.0.1.2 GigabitEthernet0/1
WAN-HUB# show ip route
Codes: C - connected, O - OSPF, S - static
C 1.1.1.1/32 is directly connected, Loopback0
C 10.0.0.0/30 is directly connected, GigabitEthernet0/0
C 10.0.1.0/30 is directly connected, GigabitEthernet0/1
C 192.168.100.0/24 is directly connected, GigabitEthernet0/2
O 2.2.2.2/32 [110/2] via 10.0.0.2, 00:00:36, GigabitEthernet0/0
O 3.3.3.3/32 [110/2] via 10.0.1.2, 00:00:34, GigabitEthernet0/1
Step 3: Implement hub‑and‑spoke routing (static default at branch pointing to hub)
What we are doing: Configure a default route on BRANCH1 that points to its hub (WAN‑HUB). This concentrates internet/centralized services traffic at the hub and illustrates the traffic path that enforces security/inspection.
BRANCH1# configure terminal
BRANCH1(config)# ip route 0.0.0.0 0.0.0.0 10.0.1.1
BRANCH1(config)# end
What just happened: Branch1 now forwards any traffic for unknown destinations to the hub (next hop 10.0.1.1). In production, this is how branches send user traffic to centralized security stacks or regional hubs. It reduces the need to build direct tunnels to every destination and simplifies policy enforcement.
Real-world note: If you centralize too much traffic you risk creating a hub bottleneck. Use regional hubs to distribute load geographically.
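The default route works because routers pick the longest matching prefix: a destination on the directly connected /30 matches that entry, and everything else falls through to 0.0.0.0/0. The following Python sketch models that lookup with a reduced version of BRANCH1's table. It is a simplified illustration, not how IOS implements forwarding, and the destination 203.0.113.10 is an arbitrary documentation address.

```python
import ipaddress

# BRANCH1's table reduced to the two entries that matter for this step.
ROUTES = [
    ("10.0.1.0/30", "connected, GigabitEthernet0/1"),
    ("0.0.0.0/0", "static via 10.0.1.1"),
]

def lookup(destination, routes=ROUTES):
    """Return (prefix, description) for the longest matching prefix, or None."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), how)
        for prefix, how in routes
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None  # no route: the packet would be dropped
    net, how = max(matches, key=lambda m: m[0].prefixlen)
    return str(net), how

for dst in ("10.0.1.1", "192.168.100.1", "203.0.113.10"):
    print(dst, "->", lookup(dst))
```

Removing the default entry from ROUTES makes the last two lookups return None, which corresponds to a branch with no path toward the hub for unknown destinations.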
Verify:
BRANCH1# show ip route
Codes: C - connected, O - OSPF, S - static, * - candidate default
C 10.0.1.0/30 is directly connected, GigabitEthernet0/1
O 1.1.1.1/32 [110/2] via 10.0.1.1, 00:02:12, GigabitEthernet0/1
S* 0.0.0.0/0 [1/0] via 10.0.1.1
BRANCH1# show ip route 0.0.0.0
Routing entry for 0.0.0.0/0, supernet
  Known via "static", distance 1, metric 0, candidate default path
  Routing Descriptor Blocks:
  * 10.0.1.1
      Route metric is 0, traffic share count is 1
Step 4: Simulate regional hub aggregation (OSPF summary or redistribution to reduce LSA churn)
What we are doing: On the regional hub, summarize internal prefixes to limit the number of LSAs propagated toward the central hub. This shrinks the link‑state database that WAN‑HUB must hold and makes its SPF runs cheaper.
REGIONAL-HUB# configure terminal
REGIONAL-HUB(config)# router ospf 1
REGIONAL-HUB(config-router)# area 0 range 10.0.0.0 255.255.255.252
REGIONAL-HUB(config-router)# end
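What just happened: The area range command tells an area border router (ABR) to advertise a single summary (type 3) LSA for the range in place of the individual prefixes that fall inside it. Two caveats apply to this lab. First, the command only takes effect on an ABR, and the area number names the area whose prefixes are being summarized; in a real design the regional branches would sit in a non‑backbone area behind REGIONAL‑HUB and the range would be configured for that area. Second, with every router in area 0 and a /30 range that covers a single link, IOS accepts the command but nothing is actually summarized, so treat it here as a syntax demonstration. In large deployments, summarizing at regional ABRs reduces the number and size of LSAs that each hub must process, lowering CPU and memory footprint during SPF runs.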
Real-world note: Summarization is a key tool when scaling routing domains. Use it to keep control plane state within device capacity.
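To see what a useful summary looks like, the sketch below collapses a set of contiguous branch prefixes into the smallest covering ranges and prints the matching area range commands. The four /24 prefixes and the use of area 1 are hypothetical (they are not part of the lab topology); they assume the regional branches live in a non‑backbone area with the regional hub acting as ABR, since area range only takes effect on an ABR.

```python
import ipaddress

# Hypothetical branch LANs behind REGIONAL-HUB (not part of the lab topology).
REGION_PREFIXES = ["10.1.0.0/24", "10.1.1.0/24", "10.1.2.0/24", "10.1.3.0/24"]

def summarize(prefixes):
    """Collapse prefixes into the fewest networks that cover exactly the same space."""
    nets = [ipaddress.ip_network(p) for p in prefixes]
    return list(ipaddress.collapse_addresses(nets))

def area_range_commands(prefixes, area):
    """Render one IOS-style 'area <id> range' line per summary."""
    return [
        f"area {area} range {s.network_address} {s.netmask}"
        for s in summarize(prefixes)
    ]

print(area_range_commands(REGION_PREFIXES, area=1))
# ['area 1 range 10.1.0.0 255.255.252.0']
```

Four component routes become one /22 toward the backbone. If the address plan has gaps (for example only 10.1.0.0/24 and 10.1.2.0/24), the prefixes do not collapse and two ranges remain, which is one reason contiguous per‑region addressing matters for scale.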
Verify:
REGIONAL-HUB# show ip ospf database router
OSPF Router with ID (2.2.2.2) (Process ID 1)
Rtr Count: 3
Link ID ADV Router Age Seq# CkSum Link count
2.2.2.2 2.2.2.2 300 0x80000002 0x00ab 3
REGIONAL-HUB# show ip route
Codes: C - connected, O - OSPF, S - static
C 10.0.0.0/30 is directly connected, GigabitEthernet0/0
O 1.1.1.1/32 [110/2] via 10.0.0.1, 00:01:02, GigabitEthernet0/0
Step 5: Apply a simple access‑list on the hub to illustrate centralized security policy
What we are doing: Configure an extended IP access‑list on WAN‑HUB that drops traffic the security policy forbids (here Telnet, an insecure cleartext protocol) and permits everything else. This simulates a minimal centralized security posture and shows how policy can be enforced at the hub.
WAN-HUB# configure terminal
WAN-HUB(config)# ip access-list extended HUB-SECURITY
WAN-HUB(config-ext-nacl)# deny tcp any any eq 23
WAN-HUB(config-ext-nacl)# permit ip any any
WAN-HUB(config-ext-nacl)# exit
WAN-HUB(config)# interface GigabitEthernet0/1
WAN-HUB(config-if)# ip access-group HUB-SECURITY in
WAN-HUB(config-if)# end
What just happened: The ACL denies Telnet (TCP 23) on traffic entering Gi0/1 (from branch). Applied at the hub, this provides a simple example of central policy enforcement. In production, stateful NGFWs/IPS provide far richer inspection (application awareness, user identity, URL filtering), but ACLs remain useful for quick filters or as a first‑drop mechanism.
Real-world note: Stateful inspection and URL filtering require more resources than a simple ACL. When centralizing inspection, ensure the hub platform supports NGFW/IPS capacity consistent with expected throughput.
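IOS evaluates an ACL top‑down, stops at the first matching entry, and falls back to an implicit deny when no entry matches. The Python sketch below models that behavior for HUB-SECURITY. It is deliberately simplified: it matches only on protocol and destination port, ignores addresses, and the tuple format is an invention for this example.

```python
# Each entry: (action, protocol, destination port or None for "any port").
HUB_SECURITY = [
    ("deny", "tcp", 23),      # deny tcp any any eq 23
    ("permit", "ip", None),   # permit ip any any
]

def evaluate(acl, protocol, dst_port=None):
    """Return the action of the first matching entry (first match wins)."""
    for action, acl_proto, acl_port in acl:
        proto_ok = acl_proto == "ip" or acl_proto == protocol  # "ip" matches any protocol
        port_ok = acl_port is None or acl_port == dst_port
        if proto_ok and port_ok:
            return action
    return "deny"  # implicit deny at the end of a non-empty IOS ACL

print(evaluate(HUB_SECURITY, "tcp", 23))   # deny   (Telnet)
print(evaluate(HUB_SECURITY, "tcp", 22))   # permit (SSH)
print(evaluate(HUB_SECURITY, "icmp"))      # permit
```

Swapping the two entries makes Telnet match permit ip any any first, so the deny never fires. The same ordering mistake is a common cause of an ACL that appears correct but does not filter.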
Verify:
WAN-HUB# show access-lists HUB-SECURITY
Extended IP access list HUB-SECURITY
    10 deny tcp any any eq telnet
    20 permit ip any any
Verification Checklist
- Check 1: OSPF adjacencies are formed. Verify with show ip ospf neighbor on WAN‑HUB and expect neighbor entries for 2.2.2.2 and 3.3.3.3.
- Check 2: Branch default route points to the hub. Verify with show ip route on BRANCH1 and expect a static 0.0.0.0/0 route via 10.0.1.1.
- Check 3: Regional summarization is in effect. Verify with show ip ospf database on REGIONAL‑HUB and expect summarized LSAs or reduced LSA counts (summary LSAs are generated only when the router is an ABR).
- Check 4: Hub ACL is present and applied. Verify with show access-lists HUB-SECURITY and show ip interface GigabitEthernet0/1 to confirm the ACL is inbound.
Common Mistakes
| Symptom | Cause | Fix |
|---|---|---|
| OSPF neighbors never reach FULL | Mismatched area or network statement, or interface down | Verify show ip interface brief; ensure router ospf network statements include the interface and area is consistent |
| Branch traffic bypasses hub (direct internet) | Missing default route on branch or incorrect next‑hop | Configure ip route 0.0.0.0 0.0.0.0 <hub-ip> and verify with show ip route |
| Hub CPU spikes under load | Centralized inspection without proper sizing | Distribute inspection to regional hubs or resize hub platform; summarize routes to reduce SPF churn |
| ACL blocks unintended traffic | Overly broad ACL sequence or wrong direction | Check show access-lists and show ip interface to ensure ACL entries and direction are correct; test with targeted pings |
Key Takeaways
- Full mesh creates O(N^2) tunnel/state growth; it is simple but does not scale for hundreds of sites.
- Hub‑and‑spoke reduces tunnel counts and centralizes security, but hubs must be sized to handle aggregated traffic and inspection.
- Regional hubs provide a practical compromise — they localize traffic and inspection, balancing performance and manageability.
- Always plan controller/data‑plane separation and control‑plane state (LSAs, sessions) when designing for scale: summarize and limit what each device must process.
Final real-world reminder: In production SD‑WAN deployments, choose topology and hub sizing based on measured concurrent sessions and throughput requirements. Design for growth and failure — regional distribution of services reduces single points of failure and keeps control plane churn manageable.