Self-Healing Networks
Objective
Build a self-healing mechanism on a Cisco router that detects a next-hop failure and automatically remediates by switching the default route to a pre‑configured backup path using Infrastructure-as-Code (IaC) practices. This matters in production because automated failover reduces mean time to repair (MTTR) and prevents human error during emergency changes. Real-world scenario: an enterprise edge router loses its primary ISP; automatic rerouting to a backup ISP keeps services (VPNs, Internet access, SaaS) alive while operators investigate.
Quick Recap
This lesson continues the topology from Lesson 1. We add a backup edge router (R3) and an automation/management host (MGMT) to demonstrate monitoring and verification.
ASCII topology (exact IPs shown on every interface)
R2 and R3 are other edge routers; R1 is the core device we will make self-healing:
       [R2]                         [R3]
Gi0/0 10.0.12.2/24           Gi0/2 10.0.13.2/24
        |                            |
        |                            |
   10.0.12.0/24                 10.0.13.0/24
        |                            |
      Gi0/0                        Gi0/2
                     [R1]
              Gi0/0 10.0.12.1/24
              Gi0/1 192.168.1.1/24
              Gi0/2 10.0.13.1/24
                      |
                      |
                    [SW1]
                    Gi1/0
              SVI 192.168.1.10/24
                      |
                 [MGMT host]
             eth0 192.168.1.100/24
Device table
| Device | Hostname | Relevant Interfaces / IPs |
|---|---|---|
| Core Router | R1 | Gi0/0 10.0.12.1/24; Gi0/1 192.168.1.1/24; Gi0/2 10.0.13.1/24 |
| Primary Edge | R2 | Gi0/0 10.0.12.2/24 |
| Backup Edge | R3 | Gi0/2 10.0.13.2/24 |
| Switch | SW1 | SVI 192.168.1.10/24; Gi1/0 uplink to R1 Gi0/1 |
| Management Host | MGMT | eth0 192.168.1.100/24 (lab.nhprep.com) |
Domain names, passwords and organization used in examples:
- Domain: lab.nhprep.com
- Passwords: Lab@123
- Organization: NHPREP
Key Concepts (theory before CLI)
- IP SLA (Service‑Level Agreement) probes: IP SLA runs active probes (e.g., ICMP echo) to test reachability to a target. The router sends ICMP echo packets from a specified source interface at configured intervals. In production, IP SLA provides an accurate measurement of path or next-hop liveness rather than relying only on interface state.
- Tracking objects (track): A track object maps the state of an IP SLA (or other objects) to a simple boolean state (up/down). Other features (like routing or policies) can reference a track to make automated decisions.
- Embedded Event Manager (EEM): EEM is the on-box automation engine that reacts to events (such as a track going down) by running CLI commands or scripts, or by generating syslog messages. Think of EEM as a lightweight "automation webhook" inside the router.
- Failover by modifying routing table: For rapid remediation we modify the routing table (for example, replace default route via primary next-hop with a route via backup). This is a fast, local change and requires no external controller; in production networks this is often combined with BFD, dynamic routing, or SD-WAN policies for more complex scenarios.
- Real-world flow: IP SLA probes -> track object reflects status -> EEM applet reacts to track state change -> EEM applies CLI changes (ip route) to switch next-hop -> operators are notified via syslog/email/NETCONF (EEM can be extended to call external systems).
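The probe, track, and EEM chain described in the list above can be modeled off-box as a small state machine. The following is a minimal Python sketch, not router code: the `Router` class and its method names are hypothetical, and it assumes that a single probe result is enough to flip the track.

```python
from dataclasses import dataclass, field

PRIMARY = "10.0.12.2"   # R2, from the lab topology
BACKUP = "10.0.13.2"    # R3, from the lab topology


@dataclass
class Router:
    """Toy model of R1: one track object and one default route."""
    default_next_hop: str = PRIMARY
    track_up: bool = True
    log: list = field(default_factory=list)

    def probe_result(self, reachable: bool) -> None:
        """Feed one IP SLA probe result into the track object."""
        if reachable == self.track_up:
            return  # no state change, so no event fires
        self.track_up = reachable
        self._on_track_change()

    def _on_track_change(self) -> None:
        """Stand-in for the EEM reaction to a track state change."""
        if self.track_up:
            self.default_next_hop = PRIMARY
            self.log.append("Track 1 UP - restoring primary route")
        else:
            self.default_next_hop = BACKUP
            self.log.append("Track 1 DOWN - initiating failover")


r1 = Router()
for reachable in [True, True, False, False, True]:
    r1.probe_result(reachable)
    print(reachable, "->", r1.default_next_hop)
```

The point of the model is that actions fire on state changes only, not on every probe, which is why repeated failed probes do not re-run the remediation.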
Step-by-step configuration
Each step below shows WHAT we are doing, the EXACT commands, WHY they matter, and verification with expected output. The sample outputs are abbreviated for readability; the exact format varies by IOS release.
Step 1: Configure base interfaces and static default route (on R1)
What we are doing: Configure interface IPs, an initial default route pointing at the primary edge (R2), and a local enable secret. This establishes normal forwarding toward the primary ISP/edge.
configure terminal
hostname R1
enable secret Lab@123
interface GigabitEthernet0/0
ip address 10.0.12.1 255.255.255.0
no shutdown
exit
interface GigabitEthernet0/1
ip address 192.168.1.1 255.255.255.0
no shutdown
exit
interface GigabitEthernet0/2
ip address 10.0.13.1 255.255.255.0
no shutdown
exit
ip route 0.0.0.0 0.0.0.0 10.0.12.2
end
write memory
What just happened:
- The three GigabitEthernet interfaces were assigned IP addresses and brought up so R1 can reach R2 (10.0.12.2), R3 (10.0.13.2), and the management VLAN.
- The default route points at the primary next-hop 10.0.12.2 (R2). All traffic without a more specific route will use that next-hop until changed.
Real-world note: In production you often use dynamic routing (BGP/OSPF) or floating static routes; here we use a static default route to focus on the self‑healing mechanism.
Verify:
show ip interface brief
Interface IP-Address OK? Method Status Protocol
GigabitEthernet0/0 10.0.12.1 YES manual up up
GigabitEthernet0/1 192.168.1.1 YES manual up up
GigabitEthernet0/2 10.0.13.1 YES manual up up
Loopback0 unassigned YES manual administratively down down
show ip route
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP, L - local
Gateway of last resort is 10.0.12.2 to network 0.0.0.0
S* 0.0.0.0/0 [1/0] via 10.0.12.2
C 10.0.12.0/24 is directly connected, GigabitEthernet0/0
C 10.0.13.0/24 is directly connected, GigabitEthernet0/2
C 192.168.1.0/24 is directly connected, GigabitEthernet0/1
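In keeping with the IaC goal of this lesson, the Step 1 configuration could be generated from data instead of typed by hand. The sketch below is one possible approach in Python using only the standard library; `render_base_config` and the dictionary layout are hypothetical names, and the enable secret is deliberately left out so that no password lives in the script.

```python
from ipaddress import ip_interface

# Interface data copied from the device table for R1.
R1_INTERFACES = {
    "GigabitEthernet0/0": "10.0.12.1/24",
    "GigabitEthernet0/1": "192.168.1.1/24",
    "GigabitEthernet0/2": "10.0.13.1/24",
}
PRIMARY_NEXT_HOP = "10.0.12.2"


def render_base_config(hostname, interfaces, default_next_hop):
    """Return IOS-style config lines for interfaces and a static default route."""
    lines = [f"hostname {hostname}"]
    for name, cidr in interfaces.items():
        iface = ip_interface(cidr)  # splits address and prefix length
        lines += [
            f"interface {name}",
            f" ip address {iface.ip} {iface.netmask}",
            " no shutdown",
            " exit",
        ]
    lines.append(f"ip route 0.0.0.0 0.0.0.0 {default_next_hop}")
    return lines


config_lines = render_base_config("R1", R1_INTERFACES, PRIMARY_NEXT_HOP)
print("\n".join(config_lines))
```

Keeping the addresses in one data structure means the device table and the pushed configuration cannot silently disagree.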
Step 2: Configure IP SLA to monitor the primary next-hop and a track object (on R1)
What we are doing: Create an IP SLA probe that pings the primary next-hop (10.0.12.2) every 5 seconds, schedule it, and create a track object that maps the SLA reachability to a boolean. This allows the router to detect when the primary next-hop becomes unreachable even if the interface remains up.
configure terminal
ip sla 1
icmp-echo 10.0.12.2 source-interface GigabitEthernet0/0
frequency 5
exit
ip sla schedule 1 life forever start-time now
track 1 ip sla 1 reachability
end
write memory
What just happened:
- `ip sla 1` configures an active ICMP probe to 10.0.12.2 from R1's Gi0/0. The router sends an ICMP echo every 5 seconds.
- `ip sla schedule` starts the IP SLA immediately and runs it forever.
- `track 1` ties the reachability of the IP SLA probe to a track object. If the probe cannot reach 10.0.12.2, the track state becomes down; when it can reach it again, the track becomes up. Using IP SLA + track is more reliable than relying solely on interface line protocol because a next-hop can be unreachable while the local interface stays up (for example, remote ISP failure).
Real-world note: IP SLA should use a realistic frequency in production (not too frequent to avoid load, not too infrequent to delay detection). Use a frequency that balances sensitivity and overhead.
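To make the sensitivity versus overhead trade-off concrete, the detection delay can be estimated with simple arithmetic. The Python sketch below uses a simplified model: it assumes one failed probe flips the track, that a probe is declared failed only when its timeout expires, and that the timeout is 5000 ms (an assumed default; confirm the value on your platform). The function name and the optional track delay parameter are illustrative.

```python
def detection_window(frequency_s, timeout_ms=5000, delay_down_s=0):
    """Return (best, worst) seconds from failure to track-down in a simplified model.

    Assumptions (not from the lesson): a single failed probe flips the track,
    the failure is total, and a probe fails only after its timeout expires.
    """
    timeout_s = timeout_ms / 1000
    # Best case: the failure happens just before a probe is sent.
    best = timeout_s + delay_down_s
    # Worst case: the failure happens just after a probe succeeded.
    worst = frequency_s + timeout_s + delay_down_s
    return best, worst


print(detection_window(5))                                    # lab setting: frequency 5
print(detection_window(5, timeout_ms=1000, delay_down_s=10))  # shorter timeout, 10 s hold-down
```

Under these assumptions the lab's `frequency 5` gives a detection window of roughly 5 to 10 seconds; a shorter timeout tightens it, while any hold-down delay adds directly to it.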
Verify:
show ip sla configuration
IP SLAs configuration:
IP SLA: 1
Type: ICMP Echo
Target address: 10.0.12.2
Source interface: GigabitEthernet0/0
Frequency: 5 (seconds)
Status: Active/Scheduled
show track
Track 1
Type: IP SLA
Object: IP SLA 1
Current state: Up
Last state change: 00:00:12
(If the primary next-hop is reachable, the track shows Current state: Up. If it's unreachable, it will show Current state: Down.)
Step 3: Create an EEM applet to failover to the backup route when the track goes down (on R1)
What we are doing: Create an EEM applet that listens for track 1 state = down and then modifies the IP routing table: remove the default via primary and install a default via backup (10.0.13.2). This is the automatic remediation step.
configure terminal
event manager applet SLA_FAILOVER
event track 1 state down
action 1.0 syslog priority informational msg "EEM: Track 1 DOWN - initiating failover"
action 2.0 cli command "enable"
action 3.0 cli command "configure terminal"
action 4.0 cli command "no ip route 0.0.0.0 0.0.0.0 10.0.12.2"
action 5.0 cli command "ip route 0.0.0.0 0.0.0.0 10.0.13.2"
action 6.0 syslog priority informational msg "EEM: Default route switched to 10.0.13.2"
exit
end
write memory
What just happened:
- The EEM applet `SLA_FAILOVER` is registered and will be triggered when `track 1` goes down.
- When triggered, the applet logs a message, enters enable/config mode, removes the static default via 10.0.12.2, and inserts a default route via 10.0.13.2. This changes the forwarding behavior immediately and locally on R1 without operator intervention.
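Real-world note: EEM executes CLI actions as though a human typed them. Its route changes land in the running configuration only, so until someone saves it the startup configuration still holds the primary route. Log every action and keep the commands narrow to avoid unintended config drift. In production, pair EEM with change control and notifications.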
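One way to limit drift is to generate the applet text from parameters so that the next-hop addresses appear in exactly one place. The following Python sketch renders the same lines as the `SLA_FAILOVER` applet above; `render_failover_applet` is a hypothetical helper, not part of any Cisco tooling.

```python
def render_failover_applet(track_id, primary, backup, name="SLA_FAILOVER"):
    """Return IOS-style lines for an applet that swaps the static default route."""
    default = "0.0.0.0 0.0.0.0"
    return [
        f"event manager applet {name}",
        f" event track {track_id} state down",
        f' action 1.0 syslog priority informational msg "EEM: Track {track_id} DOWN - initiating failover"',
        ' action 2.0 cli command "enable"',
        ' action 3.0 cli command "configure terminal"',
        f' action 4.0 cli command "no ip route {default} {primary}"',
        f' action 5.0 cli command "ip route {default} {backup}"',
        f' action 6.0 syslog priority informational msg "EEM: Default route switched to {backup}"',
    ]


applet_lines = render_failover_applet(1, primary="10.0.12.2", backup="10.0.13.2")
print("\n".join(applet_lines))
```

Because the removed and installed next hops come from the same two arguments, a typo cannot leave the applet deleting one route while the lab documentation names another.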
Verify (trigger simulation):
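To test, shut down R2's interface facing R1 (on R2), or otherwise make 10.0.12.2 unreachable. Once a probe fails and track 1 goes down, the EEM applet runs. If R1 and R2 share a direct link that loses carrier when R2's interface is shut, R1's Gi0/0 also goes down and the connected 10.0.12.0/24 entry disappears from the routing table. The sample output below assumes R1's Gi0/0 stays up, which is common on virtual lab links.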
Example verification commands and expected outputs AFTER failover:
show ip route
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP, L - local
Gateway of last resort is 10.0.13.2 to network 0.0.0.0
S* 0.0.0.0/0 [1/0] via 10.0.13.2
C 10.0.12.0/24 is directly connected, GigabitEthernet0/0
C 10.0.13.0/24 is directly connected, GigabitEthernet0/2
C 192.168.1.0/24 is directly connected, GigabitEthernet0/1
show event manager history
EVENT MANAGER HISTORY
Applet Name: SLA_FAILOVER
Time: 00:12:32 Date: Thu Apr 2 2026
Event: track 1 state down
Action 1: syslog: EEM: Track 1 DOWN - initiating failover
Action 4: cli: no ip route 0.0.0.0 0.0.0.0 10.0.12.2
Action 5: cli: ip route 0.0.0.0 0.0.0.0 10.0.13.2
Action 6: syslog: EEM: Default route switched to 10.0.13.2
show track
Track 1
Type: IP SLA
Object: IP SLA 1
Current state: Down
Last state change: 00:00:18
(The show ip route output now shows the default via 10.0.13.2 — the automated failover occurred.)
Step 4: Create an EEM applet to restore the original route when the track returns (on R1)
What we are doing: Create a second EEM applet that detects when track 1 becomes up again and restores the original default via 10.0.12.2, removing the temporary backup route. This ensures the network returns to its intended primary path after recovery.
configure terminal
event manager applet SLA_RECOVER
event track 1 state up
action 1.0 syslog priority informational msg "EEM: Track 1 UP - restoring primary route"
action 2.0 cli command "enable"
action 3.0 cli command "configure terminal"
action 4.0 cli command "no ip route 0.0.0.0 0.0.0.0 10.0.13.2"
action 5.0 cli command "ip route 0.0.0.0 0.0.0.0 10.0.12.2"
action 6.0 syslog priority informational msg "EEM: Default route restored to 10.0.12.2"
exit
end
write memory
What just happened:
- The `SLA_RECOVER` applet runs when the track reports up. It removes the backup default route and reinstalls the primary default route, returning traffic to the original path. This automated restoration avoids manual changes and keeps the network aligned with normal traffic engineering.
Real-world note: Automatic restoration is convenient but can oscillate if the monitored target flaps. In production, consider adding hysteresis (longer failure detection windows or threshold counters) to avoid route flapping.
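The effect of hysteresis can be illustrated with a short simulation. The Python sketch below counts track state changes when a flip requires several consecutive agreeing probes. This probe-count scheme is a simplification chosen for clarity; on the router, the comparable control is typically a time-based delay on the track object. The function name and the sample probe sequence are made up for illustration.

```python
def track_transitions(probe_results, down_after=1, up_after=1):
    """Count track state changes when a flip needs N consecutive agreeing probes."""
    state_up = True
    streak = 0        # consecutive probes that disagree with the current state
    transitions = 0
    for ok in probe_results:
        if ok == state_up:
            streak = 0  # probe agrees with the current state; reset the counter
            continue
        streak += 1
        needed = up_after if ok else down_after
        if streak >= needed:
            state_up = ok
            streak = 0
            transitions += 1
    return transitions


# A flapping next-hop: isolated lost probes, then one sustained outage.
flappy = [True, False, True, True, False, True,
          False, False, False, True, True, True]
print(track_transitions(flappy))                            # no hysteresis
print(track_transitions(flappy, down_after=3, up_after=3))  # 3-probe hold-down
```

With no hold-down, this sequence causes six route swaps; requiring three consecutive results reduces that to two, one failover and one recovery, at the cost of slower detection.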
Verify (after recovery):
show ip route
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP, L - local
Gateway of last resort is 10.0.12.2 to network 0.0.0.0
S* 0.0.0.0/0 [1/0] via 10.0.12.2
C 10.0.12.0/24 is directly connected, GigabitEthernet0/0
C 10.0.13.0/24 is directly connected, GigabitEthernet0/2
C 192.168.1.0/24 is directly connected, GigabitEthernet0/1
show event manager history
EVENT MANAGER HISTORY
Applet Name: SLA_RECOVER
Time: 00:22:05 Date: Thu Apr 2 2026
Event: track 1 state up
Action 1: syslog: EEM: Track 1 UP - restoring primary route
Action 4: cli: no ip route 0.0.0.0 0.0.0.0 10.0.13.2
Action 5: cli: ip route 0.0.0.0 0.0.0.0 10.0.12.2
Action 6: syslog: EEM: Default route restored to 10.0.12.2
show track
Track 1
Type: IP SLA
Object: IP SLA 1
Current state: Up
Last state change: 00:00:06
Verification Checklist
- Check 1: IP SLA is active and probing the primary next-hop. Verify with `show ip sla configuration`; expect IP SLA 1 targeting 10.0.12.2 with Status Active/Scheduled.
- Check 2: Failover triggers when the primary next-hop is unreachable. Simulate by making 10.0.12.2 unreachable, then verify that `show track` shows `Current state: Down` and `show ip route` shows the default via 10.0.13.2.
- Check 3: Automatic recovery reinstalls the primary route. Restore reachability to 10.0.12.2, then verify that `show track` shows `Up`, `show ip route` shows the default via 10.0.12.2, and `show event manager history` contains entries for both applets.
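The route checks above can also be scripted from the MGMT host once the `show ip route` text has been collected. The Python sketch below covers only the parsing step and assumes the output is already available as a string; how it is retrieved (SSH, an API, a saved file) is left open, and `default_next_hop` is a hypothetical helper name.

```python
import re

# Matches the static default route line, e.g. "S* 0.0.0.0/0 [1/0] via 10.0.13.2".
DEFAULT_ROUTE_RE = re.compile(
    r"^\s*S\*\s+0\.0\.0\.0/0\s+\[\d+/\d+\]\s+via\s+([\d.]+)",
    re.MULTILINE,
)


def default_next_hop(show_ip_route_text):
    """Return the next hop of the static default route, or None if absent."""
    match = DEFAULT_ROUTE_RE.search(show_ip_route_text)
    return match.group(1) if match else None


sample = """\
Gateway of last resort is 10.0.13.2 to network 0.0.0.0
S* 0.0.0.0/0 [1/0] via 10.0.13.2
C 10.0.12.0/24 is directly connected, GigabitEthernet0/0
"""
print(default_next_hop(sample))
```

A check script can then compare the returned address against 10.0.12.2 or 10.0.13.2, depending on whether the primary is expected to be up.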
Common Mistakes
| Symptom | Cause | Fix |
|---|---|---|
| IP SLA shows Status: Not scheduled | You created the IP SLA but did not schedule it | Run `ip sla schedule <id> life forever start-time now` to start the SLA |
| Track remains down even though next-hop is reachable | IP SLA source interface is wrong, or an ACL blocks ICMP | Check the source interface in `show ip sla configuration` and ensure ICMP is permitted along the path; fix the ACLs or change the source interface |
| EEM applet does not change routes | The applet lacked the required privilege level, or its CLI actions are wrong | Ensure the EEM actions include `cli command "enable"` and `cli command "configure terminal"` before the configuration commands; check `show event manager policy registered` and `show event manager history` |
| Route oscillation (flapping) between primary and backup | IP SLA probing or thresholds are too aggressive, so transient failures trigger action | Lengthen the IP SLA probe interval (a larger `frequency` value) or add hysteresis to detection, for example a `delay down`/`delay up` timer on the track object; consider BFD or routing-protocol failover instead |
Key Takeaways
- Use IP SLA + track + EEM to implement on‑box self‑healing: IP SLA measures reachability, track captures object state, and EEM performs automated remediation.
- Self‑healing via route changes is fast and local, but must be designed with safeguards (hysteresis, notifications) to avoid flapping and unintended config drift.
- In production, integrate self‑healing with logging/alerting so operators are informed (EEM syslog, SNMP traps, or external automation). Automation should complement, not replace, change-control processes.
- Understand the protocol behavior: IP SLA sends actual probes (ICMP in this lab) at configured intervals; EEM reacts to events and executes CLI in enable mode — test carefully in a lab before deploying to production.
Tip: Treat EEM applets as code — version control the router config snippets and document the intended behavior so operators and automated systems can audit changes.
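As a small illustration of auditing against a version-controlled snippet, the Python sketch below diffs the intended lines against what is currently running, using the standard `difflib` module. The `config_drift` helper and the two one-line inputs are illustrative; in practice both inputs would be fuller config extracts.

```python
import difflib


def config_drift(intended_lines, running_lines):
    """Return unified-diff lines between the intended snippet and what is running."""
    return list(difflib.unified_diff(
        intended_lines, running_lines,
        fromfile="intended", tofile="running", lineterm="",
    ))


intended = ["ip route 0.0.0.0 0.0.0.0 10.0.12.2"]
running = ["ip route 0.0.0.0 0.0.0.0 10.0.13.2"]  # state after a failover
for line in config_drift(intended, running):
    print(line)
```

An empty result means the router matches the repository. A non-empty diff after a failover is expected, and is a useful signal that the device is still running on the backup path.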
Lesson 5 extends this by showing how to centralize EEM events in an external automation system (pulling logs and applying IaC config changes) so that the router and an automation controller work together for orchestrated remediation.