Lesson 4 of 6

Self-Healing Networks

Objective

Build a self-healing mechanism on a Cisco router that detects a next-hop failure and automatically remediates by switching the default route to a pre‑configured backup path using Infrastructure-as-Code (IaC) practices. This matters in production because automated failover reduces mean time to repair (MTTR) and prevents human error during emergency changes. Real-world scenario: an enterprise edge router loses its primary ISP; automatic rerouting to a backup ISP keeps services (VPNs, Internet access, SaaS) alive while operators investigate.

Quick Recap

This lesson continues the topology from Lesson 1. We add a backup edge router (R3) and an automation/management host (MGMT) to demonstrate monitoring and verification.

ASCII topology (exact IPs shown on every interface)

R2 and R3 are other edge routers; R1 is the core device we will make self-healing:

         [R2]                            [R3]
   Gi0/0 10.0.12.2/24              Gi0/2 10.0.13.2/24
          |                               |
     10.0.12.0/24                    10.0.13.0/24
          |                               |
   Gi0/0 10.0.12.1/24              Gi0/2 10.0.13.1/24
          +-------------[R1]--------------+
                          |
                 Gi0/1 192.168.1.1/24
                          |
                    192.168.1.0/24
                          |
                        Gi1/0
                        [SW1]
                 SVI 192.168.1.10/24
                          |
                     [MGMT host]
                 eth0 192.168.1.100/24

Device table

Device          | Hostname | Relevant Interfaces / IPs
----------------|----------|--------------------------------------------------------------
Core Router     | R1       | Gi0/0 10.0.12.1/24; Gi0/1 192.168.1.1/24; Gi0/2 10.0.13.1/24
Primary Edge    | R2       | Gi0/0 10.0.12.2/24
Backup Edge     | R3       | Gi0/2 10.0.13.2/24
Switch          | SW1      | SVI 192.168.1.10/24; Gi1/0 uplink to R1 Gi0/1
Management Host | MGMT     | eth0 192.168.1.100/24 (lab.nhprep.com)

Domain names, passwords and organization used in examples:

  • Domain: lab.nhprep.com
  • Passwords: Lab@123
  • Organization: NHPREP

Key Concepts (theory before CLI)

  • IP SLA (Service‑Level Agreement) probes: IP SLA runs active probes (e.g., ICMP echo) to test reachability to a target. The router sends ICMP echo packets from a specified source interface at configured intervals. In production, IP SLA provides an accurate measurement of path or next-hop liveness rather than relying only on interface state.
  • Tracking objects (track): A track object maps the state of an IP SLA (or other objects) to a simple boolean state (up/down). Other features (like routing or policies) can reference a track to make automated decisions.
  • Embedded Event Manager (EEM): EEM is the on-box automation engine that reacts to events (such as a track object going down) by running CLI commands or scripts and by generating syslog messages. Think of EEM as a lightweight "automation webhook" inside the router.
  • Failover by modifying routing table: For rapid remediation we modify the routing table (for example, replace default route via primary next-hop with a route via backup). This is a fast, local change and requires no external controller; in production networks this is often combined with BFD, dynamic routing, or SD-WAN policies for more complex scenarios.
  • Real-world flow: IP SLA probes -> track object reflects status -> EEM applet reacts to track state change -> EEM applies CLI changes (ip route) to switch next-hop -> operators are notified via syslog/email/NETCONF (EEM can be extended to call external systems).

Step-by-step configuration

Each step below shows what we are doing, the exact commands, why they matter, and verification commands with expected output. The sample outputs are abbreviated; exact formatting varies by IOS release.

Step 1: Configure base interfaces and static default route (on R1)

What we are doing: Configure interface IPs, an initial default route pointing at the primary edge (R2), and a local enable secret. This establishes normal forwarding toward the primary ISP/edge.

configure terminal
hostname R1
enable secret Lab@123
interface GigabitEthernet0/0
 ip address 10.0.12.1 255.255.255.0
 no shutdown
exit
interface GigabitEthernet0/1
 ip address 192.168.1.1 255.255.255.0
 no shutdown
exit
interface GigabitEthernet0/2
 ip address 10.0.13.1 255.255.255.0
 no shutdown
exit
ip route 0.0.0.0 0.0.0.0 10.0.12.2
end
write memory

What just happened:

  • The three GigabitEthernet interfaces were assigned IP addresses and brought up so R1 can reach R2 (10.0.12.2), R3 (10.0.13.2), and the management VLAN.
  • The default route points at the primary next-hop 10.0.12.2 (R2). All traffic without a more specific route will use that next-hop until changed.

Real-world note: In production you often use dynamic routing (BGP/OSPF) or floating static routes; here we use a static default route to focus on the self‑healing mechanism.
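
For comparison, the floating-static alternative mentioned above could look roughly like the following. This is a sketch and not part of the lab configuration: it assumes track object 1 from Step 2 already exists, and the administrative distance of 10 on the backup route is an arbitrary illustrative value.

```
! Sketch only: would be used in place of the plain static default in this step.
! Primary default route, installed only while track 1 reports up.
ip route 0.0.0.0 0.0.0.0 10.0.12.2 track 1
! Backup default route with a higher administrative distance (10 is an example).
! It enters the routing table only when the primary route is withdrawn.
ip route 0.0.0.0 0.0.0.0 10.0.13.2 10
```

With this design the route switch needs no EEM applet. The lesson uses EEM instead because an applet can run arbitrary remediation and logging commands, not just swap a route.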

Verify:

show ip interface brief
Interface              IP-Address      OK? Method Status                Protocol
GigabitEthernet0/0     10.0.12.1       YES manual up                    up
GigabitEthernet0/1     192.168.1.1     YES manual up                    up
GigabitEthernet0/2     10.0.13.1       YES manual up                    up
Loopback0              unassigned      YES manual administratively down down

show ip route
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP, L - local
Gateway of last resort is 10.0.12.2 to network 0.0.0.0

S*    0.0.0.0/0 [1/0] via 10.0.12.2
C     10.0.12.0/24 is directly connected, GigabitEthernet0/0
C     10.0.13.0/24 is directly connected, GigabitEthernet0/2
C     192.168.1.0/24 is directly connected, GigabitEthernet0/1

Step 2: Configure IP SLA to monitor the primary next-hop and a track object (on R1)

What we are doing: Create an IP SLA probe that pings the primary next-hop (10.0.12.2) every 5 seconds, schedule it, and create a track object that maps the SLA reachability to a boolean. This allows the router to detect when the primary next-hop becomes unreachable even if the interface remains up.

configure terminal
ip sla 1
 icmp-echo 10.0.12.2 source-interface GigabitEthernet0/0
 frequency 5
exit
ip sla schedule 1 life forever start-time now
track 1 ip sla 1 reachability
end
write memory

What just happened:

  • ip sla 1 configures an active ICMP probe to 10.0.12.2 from R1's Gi0/0. The router will send an ICMP echo every 5 seconds.
  • ip sla schedule starts the IP SLA immediately and runs it forever.
  • track 1 ties the reachability of the IP SLA probe to a track object. If the probe cannot reach 10.0.12.2, the track state becomes down; when it can reach it again, the track becomes up. Using IP SLA + track is more reliable than relying solely on interface line protocol because a next-hop can be unreachable while the local interface stays up (for example, remote ISP failure).

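Real-world note: Choose the IP SLA frequency to balance detection time against probe overhead. With the 5-second frequency used here and the default ICMP echo timeout (5000 ms), a failure is typically detected within about 10 seconds. Probing more often shortens detection but adds load and makes the track more sensitive to transient packet loss.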
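
The trade-off can be made explicit by setting the probe's timeout and threshold alongside its frequency. The sketch below uses illustrative values, not tuned recommendations. It relies on two IOS behaviors: a scheduled operation cannot be edited in place, so the operation is removed and recreated, and the values must satisfy frequency (seconds) >= timeout (milliseconds) >= threshold (milliseconds), so the threshold is lowered before the timeout.

```
! Illustrative values only.
! threshold 500  : replies slower than 500 ms are flagged as over threshold
! timeout 1000   : a probe with no reply after 1000 ms counts as failed
! frequency 5    : one probe every 5 seconds
no ip sla 1
ip sla 1
 icmp-echo 10.0.12.2 source-interface GigabitEthernet0/0
 threshold 500
 timeout 1000
 frequency 5
exit
ip sla schedule 1 life forever start-time now
```

With track 1 ip sla 1 reachability, a reply slower than the threshold is generally still treated as reachable; only a missing reply brings the track down. Removing the operation briefly drives track 1 down, so if the EEM applets from the later steps are already configured, expect them to fire while the probe is being recreated.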

Verify:

show ip sla configuration
IP SLAs configuration:
IP SLA: 1
 Type: ICMP Echo
 Target address: 10.0.12.2
 Source interface: GigabitEthernet0/0
 Frequency: 5 (seconds)
 Status: Active/Scheduled

show track
Track 1
  Type: IP SLA
  Object: IP SLA 1
  Current state: Up
  Last state change: 00:00:12

(If the primary next-hop is reachable, the track shows Current state: Up. If it's unreachable, it will show Current state: Down.)

Step 3: Create an EEM applet to failover to the backup route when the track goes down (on R1)

What we are doing: Create an EEM applet that listens for track 1 state = down and then modifies the IP routing table: remove the default via primary and install a default via backup (10.0.13.2). This is the automatic remediation step.

configure terminal
event manager applet SLA_FAILOVER
 event track 1 state down
 action 1.0 syslog priority informational msg "EEM: Track 1 DOWN - initiating failover"
 action 2.0 cli command "enable"
 action 3.0 cli command "configure terminal"
 action 4.0 cli command "no ip route 0.0.0.0 0.0.0.0 10.0.12.2"
 action 5.0 cli command "ip route 0.0.0.0 0.0.0.0 10.0.13.2"
 action 6.0 syslog priority informational msg "EEM: Default route switched to 10.0.13.2"
exit
end
write memory

What just happened:

  • The EEM applet SLA_FAILOVER is registered and will be triggered when track 1 goes down.
  • When triggered, the applet logs a message, enters enable/config mode, removes the static default via 10.0.12.2, and inserts a default route via 10.0.13.2. This changes the forwarding behavior immediately and locally on R1 without operator intervention.

Real-world note: EEM executes CLI actions as though a human typed them. Use logging and careful commands to avoid unintended config drift. In production, pair EEM with change control and notifications.
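
As one precaution in that spirit, the applet can leave configuration mode explicitly once the routes have been changed. A minimal sketch, assuming the SLA_FAILOVER applet above is already configured; the label 5.5 is chosen only because EEM runs actions in sorted label order, which places it between 5.0 and 6.0.

```
! Adds one action to the existing applet; the other actions are unchanged.
event manager applet SLA_FAILOVER
 action 5.5 cli command "end"
```

On routers that enforce AAA command authorization, EEM CLI actions may also need a user to run as, set with event manager session cli username followed by a user name. Whether that applies depends on the deployment.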

Verify (trigger simulation):
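To test, make 10.0.12.2 unreachable from R1, for example by shutting down R2's interface facing R1 (on R2) or by filtering ICMP on R2. If R1 and R2 are cabled back to back, shutting R2's interface also takes R1's Gi0/0 down, so 10.0.12.0/24 disappears from R1's connected routes; the sample output below assumes Gi0/0 stays up. After the probe detects the failure, the EEM applet runs.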
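
To reproduce the case where the interface stays up but the next-hop stops answering, one option is an ACL on R2 that drops only the probe traffic. This is a lab-only sketch; the ACL name BLOCK_SLA is hypothetical.

```
! On R2, lab only. BLOCK_SLA is a hypothetical name.
! Drops ICMP echo requests from R1's Gi0/0 address and permits everything else.
ip access-list extended BLOCK_SLA
 deny icmp host 10.0.12.1 host 10.0.12.2 echo
 permit ip any any
exit
interface GigabitEthernet0/0
 ip access-group BLOCK_SLA in
end
! To end the simulated failure, remove the ACL from the interface:
!   interface GigabitEthernet0/0
!    no ip access-group BLOCK_SLA in
```

Because the link itself stays up, this exercises the scenario IP SLA is meant to catch: a live interface with an unresponsive next-hop.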

Example verification commands and expected outputs AFTER failover:

show ip route
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP, L - local
Gateway of last resort is 10.0.13.2 to network 0.0.0.0

S*    0.0.0.0/0 [1/0] via 10.0.13.2
C     10.0.12.0/24 is directly connected, GigabitEthernet0/0
C     10.0.13.0/24 is directly connected, GigabitEthernet0/2
C     192.168.1.0/24 is directly connected, GigabitEthernet0/1

show event manager history
EVENT MANAGER HISTORY
Applet Name: SLA_FAILOVER
 Time: 00:12:32  Date: Thu Apr  2 2026
 Event: track 1 state down
 Action 1: syslog: EEM: Track 1 DOWN - initiating failover
 Action 4: cli: no ip route 0.0.0.0 0.0.0.0 10.0.12.2
 Action 5: cli: ip route 0.0.0.0 0.0.0.0 10.0.13.2
 Action 6: syslog: EEM: Default route switched to 10.0.13.2

show track
Track 1
  Type: IP SLA
  Object: IP SLA 1
  Current state: Down
  Last state change: 00:00:18

(The show ip route output now shows the default via 10.0.13.2 — the automated failover occurred.)

Step 4: Create an EEM applet to restore the original route when the track returns (on R1)

What we are doing: Create a second EEM applet that detects when track 1 becomes up again and restores the original default via 10.0.12.2, removing the temporary backup route. This ensures the network returns to its intended primary path after recovery.

configure terminal
event manager applet SLA_RECOVER
 event track 1 state up
 action 1.0 syslog priority informational msg "EEM: Track 1 UP - restoring primary route"
 action 2.0 cli command "enable"
 action 3.0 cli command "configure terminal"
 action 4.0 cli command "no ip route 0.0.0.0 0.0.0.0 10.0.13.2"
 action 5.0 cli command "ip route 0.0.0.0 0.0.0.0 10.0.12.2"
 action 6.0 syslog priority informational msg "EEM: Default route restored to 10.0.12.2"
exit
end
write memory

What just happened:

  • The SLA_RECOVER applet will run when the track reports up. It removes the backup default route and reinstalls the primary default route, returning traffic to the original path. This automated restoration avoids manual changes and keeps the network aligned with normal traffic engineering.

Real-world note: Automatic restoration is convenient, but the default route can oscillate if the monitored target flaps. In production, add hysteresis, for example a delay on the track object (delay down / delay up) or a longer failure detection window, so that brief flaps do not move the route.
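
One way to add hysteresis is a delay on the track object itself. A minimal sketch with arbitrary example timers: the track must see the probe failing for 15 seconds before it reports down (roughly three probe intervals at frequency 5) and succeeding for 60 seconds before it reports up again.

```
! Example timers only; tune them to the environment.
track 1 ip sla 1 reachability
 delay down 15 up 60
```

A longer up delay than down delay is a common choice, since falling back to a primary path that is still unstable usually costs more than staying on the backup a little longer.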

Verify (after recovery):

show ip route
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP, L - local
Gateway of last resort is 10.0.12.2 to network 0.0.0.0

S*    0.0.0.0/0 [1/0] via 10.0.12.2
C     10.0.12.0/24 is directly connected, GigabitEthernet0/0
C     10.0.13.0/24 is directly connected, GigabitEthernet0/2
C     192.168.1.0/24 is directly connected, GigabitEthernet0/1

show event manager history
EVENT MANAGER HISTORY
Applet Name: SLA_RECOVER
 Time: 00:22:05  Date: Thu Apr  2 2026
 Event: track 1 state up
 Action 1: syslog: EEM: Track 1 UP - restoring primary route
 Action 4: cli: no ip route 0.0.0.0 0.0.0.0 10.0.13.2
 Action 5: cli: ip route 0.0.0.0 0.0.0.0 10.0.12.2
 Action 6: syslog: EEM: Default route restored to 10.0.12.2

show track
Track 1
  Type: IP SLA
  Object: IP SLA 1
  Current state: Up
  Last state change: 00:00:06

Verification Checklist

  • Check 1: IP SLA is active and probing primary next-hop — verify with show ip sla configuration and expect IP SLA 1 configured to 10.0.12.2 and Status Active/Scheduled.
  • Check 2: Failover triggers when primary next-hop is unreachable — simulate by making 10.0.12.2 unreachable and verify show track shows Current state: Down and show ip route shows default via 10.0.13.2.
  • Check 3: Automatic recovery reinstalls the primary route — restore reachability to 10.0.12.2 and verify show track shows Up, show ip route shows default via 10.0.12.2, and show event manager history contains entries for both applets.

Common Mistakes

Symptom | Cause | Fix
--------|-------|----
IP SLA shows Status: Not scheduled | You created the IP SLA but did not schedule it | Run ip sla schedule <id> life forever start-time now to start the SLA
Track remains down even though next-hop is reachable | IP SLA source-interface wrong or ACL blocks ICMP | Verify ip sla <id> source-interface and ensure ICMP is permitted along the path; fix ACLs or change source-interface
EEM applet does not change routes | Applet did not have proper privileges or CLI commands used wrong | Ensure EEM actions include cli command "enable" and cli command "configure terminal" before configuration commands; check show event manager policy registered and show event manager history
Route oscillation (flapping) between primary and backup | IP SLA frequency or thresholds too aggressive causing transient failures to trigger action | Increase IP SLA frequency interval or implement averaging/hysteresis in detection; consider BFD or routing protocol failover instead
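
When one of these symptoms appears, the following commands on R1 are a reasonable first pass. They are shown without sample output because the format differs between IOS releases.

```
! Probe results: check the latest return code and the success/failure counters
show ip sla statistics 1
! One-line state summary per track object
show track brief
! Confirm that both applets are registered
show event manager policy registered
! Show each CLI command an applet issues and the reply it receives
debug event manager action cli
! Turn debugging off when finished
undebug all
```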

Key Takeaways

  • Use IP SLA + track + EEM to implement on‑box self‑healing: IP SLA measures reachability, track captures object state, and EEM performs automated remediation.
  • Self‑healing via route changes is fast and local, but must be designed with safeguards (hysteresis, notifications) to avoid flapping and unintended config drift.
  • In production, integrate self‑healing with logging/alerting so operators are informed (EEM syslog, SNMP traps, or external automation). Automation should complement, not replace, change-control processes.
  • Understand the protocol behavior: IP SLA sends actual probes (ICMP in this lab) at configured intervals; EEM reacts to events and executes CLI in enable mode — test carefully in a lab before deploying to production.
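
As a small step toward that integration, the EEM syslog messages from this lab can be forwarded to the management host. This sketch assumes a syslog collector is listening on MGMT (192.168.1.100), which this lesson does not set up.

```
! Assumes a syslog collector is running on the MGMT host.
! Send syslog messages, including the EEM ones, to 192.168.1.100.
logging host 192.168.1.100
! Forward messages of severity informational (6) and more severe.
logging trap informational
```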

Tip: Treat EEM applets as code — version control the router config snippets and document the intended behavior so operators and automated systems can audit changes.
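
Because the applets change the running configuration without saving it, the running and startup configurations differ while failover is active. One way to see that drift from the CLI, assuming the IOS release supports the configuration archive diff feature:

```
! Compare the saved configuration with what is currently running
show archive config differences nvram:startup-config system:running-config
```

During failover the difference should be limited to the default route; anything else points to an unplanned change worth auditing against the version-controlled snippets.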


Lesson 5 extends this by centralizing EEM events in an external automation system (pulling logs and applying IaC config changes), so that the router and an automation controller work together for orchestrated remediation.