Automated Remediation
Objective
In this lesson you will configure AI-style automated remediation on edge routers using Embedded Event Manager (EEM) applets. You will create EEM policies that detect common failures (BGP adjacency loss and interface flaps) from syslog messages and execute predefined remediation actions (clear BGP neighbor or reset the interface). Automated remediation reduces mean time to repair (MTTR) in production networks by performing predictable corrective tasks immediately when a problem is detected.
Real-world scenario: in an SD‑WAN edge deployment, a remote branch may experience transient interface flaps or BGP session drops. Automatically issuing a controlled reset (clear or shutdown/no shutdown) can restore service without waiting for manual intervention or a ticket, while appropriate logging preserves an audit trail.
Quick Recap
This lesson continues from the topology used earlier in the lab (Hub and two Branch routers). No new devices are added in this lesson.
Topology (ASCII):
+----------------------------+
|            HUB             |
| Gi0/0: 10.0.0.1/30         |
| Gi0/1: 10.0.1.1/30         |
+----------------------------+
      |                |
 10.0.0.0/30      10.0.1.0/30
      |                |
+-------------+  +-------------+
|     BR1     |  |     BR2     |
| Gi0/0:      |  | Gi0/0:      |
| 10.0.0.2/30 |  | 10.0.1.2/30 |
+-------------+  +-------------+
Device table:
| Device | Hostname | Primary Interfaces (exact names) |
|---|---|---|
| Hub router | HUB | GigabitEthernet0/0, GigabitEthernet0/1 |
| Branch 1 | BR1 | GigabitEthernet0/0 |
| Branch 2 | BR2 | GigabitEthernet0/0 |
IP addressing:
| Link | Device/Interface | IP Address |
|---|---|---|
| Hub–BR1 | HUB Gi0/0 | 10.0.0.1/30 |
| Hub–BR1 | BR1 Gi0/0 | 10.0.0.2/30 |
| Hub–BR2 | HUB Gi0/1 | 10.0.1.1/30 |
| Hub–BR2 | BR2 Gi0/0 | 10.0.1.2/30 |
Domain, password, organization used in examples:
- domain: lab.nhprep.com
- local administrative password examples: Lab@123
- organization: NHPREP
Key Concepts
- Embedded Event Manager (EEM): a local IOS scripting facility that can react to events (syslog messages, timers, SNMP traps) and execute CLI commands. Think of EEM like a local automation “watchdog” that performs a sequence of commands when a trigger happens.
- Syslog-driven remediation: network protocols (BGP, OSPF, interface link events) emit standardized syslog messages when their state changes. EEM can match those syslog messages and run remediation. In production, syslog is a reliable, low-latency signal for state changes.
- Remediation actions and protocol behavior: for BGP, running "clear ip bgp <neighbor-ip>" performs a hard reset: it tears down the TCP session and forces the session to re-establish. Routes learned from the peer are withdrawn and then relearned once the session comes back, which causes route churn. For an interface reset (shutdown/no shutdown), the device toggles the interface's administrative state, causing the link to renegotiate, which often clears transient hardware or driver issues.
- Safety and auditing: automated actions should be logged and rate-limited. EEM supports syslog messages as part of the action sequence so remediation events are recorded. In production, remediation policies must be carefully scoped to avoid loops (for example, don’t continuously clear BGP if the underlying layer is down).
- Analogy: Think of EEM as a local "auto-mechanic" in the router that listens for warning lights (syslog messages). When a specific light comes on, it performs a pre-approved repair step and writes a note in the maintenance log.
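The loop-avoidance point above can be sketched as an applet that checks the underlying link before clearing BGP. This is a minimal sketch rather than a tested policy. It assumes an EEM version that supports the regexp and if/else/end actions (EEM 3.0 or later), and that neighbor 10.0.0.2 is reached over GigabitEthernet0/0 as in this lab. The applet name and the "skipped" log text are hypothetical, and the applet is meant as an alternative to an unguarded one, not to run alongside it.

```text
configure terminal
event manager applet BGP_REM_NBR_10_0_0_2_GUARDED
 event syslog pattern "BGP-5-ADJCHANGE: neighbor 10.0.0.2 Down"
 action 1.0 cli command "enable"
 ! Assumption: 10.0.0.2 is reached over GigabitEthernet0/0 (lab topology)
 action 2.0 cli command "show interfaces GigabitEthernet0/0 | include line protocol"
 ! $_regexp_result is set to 1 when the pattern matches the CLI output
 action 3.0 regexp "line protocol is up" "$_cli_result"
 action 4.0 if $_regexp_result eq 1
 action 5.0  cli command "clear ip bgp 10.0.0.2"
 action 6.0  syslog msg "Automated remediation executed on HUB: cleared BGP neighbor 10.0.0.2"
 action 7.0 else
 action 8.0  syslog msg "Automated remediation skipped on HUB: GigabitEthernet0/0 is down"
 action 9.0 end
end
```

With this guard, a neighbor-down event caused by a failed link produces only a "skipped" audit message, since clearing BGP cannot help until the link returns.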
Step-by-step configuration
Step 1: Verify existing BGP adjacency and baseline
What we are doing: Confirm the current BGP neighbor and baseline operational state so we know what “normal” looks like before automation. This matters because remediation actions will reference neighbor IPs and rely on BGP state transitions visible in syslog.
enable
show ip bgp summary
What just happened: The show ip bgp summary command displays BGP peer status, prefixes received, and uptime. We use the neighbor IP (10.0.0.2) in remediation policies, so verifying it exists avoids accidental commands against the wrong neighbor.
Real-world note: Always confirm peer IPs and ASNs before writing automation; a typo could clear the wrong session in production.
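To double-check a single peer before referencing it in an applet, a per-neighbor query can be used alongside the summary. This is a small sketch; the include filter is just one convenient choice, and the exact output lines vary by IOS version.

```text
! Show only the remote-AS and session-state lines for the peer
show ip bgp neighbors 10.0.0.2 | include remote AS|BGP state
```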
Verify:
show ip bgp summary
BGP router identifier 10.0.0.1, local AS number 65000
BGP table version is 1, main routing table version 1
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
10.0.0.2 4 65001 102 101 1 0 0 00:23:10 5
10.0.1.2 4 65002 50 49 1 0 0 02:12:45 3
Step 2: Create EEM applet to remediate BGP adjacency downs
What we are doing: Configure an EEM applet on the HUB to detect syslog messages that indicate the BGP neighbor (10.0.0.2) went down and then clear that neighbor to force a reconnection. This matters because clearing the BGP session can resolve transient TCP/BGP state issues automatically.
configure terminal
event manager applet BGP_REM_NBR_10_0_0_2
event syslog pattern "BGP-5-ADJCHANGE: neighbor 10.0.0.2 Down"
action 1.0 cli command "enable"
action 2.0 cli command "clear ip bgp 10.0.0.2"
action 3.0 syslog msg "Automated remediation executed on HUB: cleared BGP neighbor 10.0.0.2"
end
What just happened:
- "event manager applet BGP_REM_NBR_10_0_0_2" creates an EEM policy.
- "event syslog pattern" tells EEM to trigger when the router logs the BGP neighbor-down message for 10.0.0.2.
- The "cli command" actions run CLI commands on match; here we enter privileged mode and issue "clear ip bgp 10.0.0.2".
- The "syslog msg" action writes an audit message to the logging buffer so operators can see the remediation event.
Real-world note: The syslog pattern must match the message text your IOS version actually produces. Test the pattern in a lab first and capture the full message (show logging) to confirm that it triggers.
Verify:
show running-config | section event manager
event manager applet BGP_REM_NBR_10_0_0_2
event syslog pattern BGP-5-ADJCHANGE: neighbor 10.0.0.2 Down
action 1.0 cli command enable
action 2.0 cli command clear ip bgp 10.0.0.2
action 3.0 syslog msg Automated remediation executed on HUB: cleared BGP neighbor 10.0.0.2
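Beyond checking the running configuration, it can help to confirm that the applet is registered with EEM and to watch what it does when it fires. The commands below are a sketch of that workflow; output formats differ between releases, so none is shown here.

```text
! Confirm the applet is registered and see its event specification
show event manager policy registered
! After a trigger, list recent EEM events and their run times
show event manager history events
! Lab only: trace each CLI command the applet sends (turn off with "undebug all")
debug event manager action cli
```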
Step 3: Create EEM applet to reset an interface on link down
What we are doing: Configure an EEM applet to detect interface Gi0/1 going down and execute a shutdown/no shutdown sequence to attempt to recover transient link issues. This matters because many physical-layer anomalies can be resolved by toggling the administrative state, and doing so automatically reduces manual intervention.
configure terminal
event manager applet IF_RESET_GI0_1
event syslog pattern "%LINK-3-UPDOWN: Interface GigabitEthernet0/1, changed state to down"
action 1.0 cli command "enable"
action 2.0 cli command "configure terminal"
action 3.0 cli command "interface GigabitEthernet0/1"
action 4.0 cli command "shutdown"
action 5.0 cli command "no shutdown"
action 6.0 cli command "end"
action 7.0 syslog msg "Automated remediation executed on HUB: GigabitEthernet0/1 shutdown/no shutdown"
end
What just happened:
- The applet listens for the interface-down syslog message for GigabitEthernet0/1.
- On match, it enters global configuration, navigates to the interface, applies "shutdown" followed immediately by "no shutdown", then exits config mode.
- The final "syslog msg" provides an auditable record that remediation was performed.
Real-world note: Never put aggressive reset logic on production customer-facing trunks or uplinks without safeguards. Add rate-limiting or counters to prevent repeated toggles, and scope each applet to a specific interface and message.
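One way to add such a safeguard is to rate-limit the trigger and pause between the two state changes. The sketch below re-enters the applet, replaces its event line, and adds one action. It assumes your IOS release accepts the ratelimit keyword on event syslog (check with ? in the CLI). The 300-second and 5-second values are illustrative, not recommendations. Rate-limiting matters here because "no shutdown" on a link that is still physically down can log the same down message again and retrigger the applet.

```text
configure terminal
event manager applet IF_RESET_GI0_1
 ! Run at most once every 300 seconds (illustrative value)
 event syslog pattern "%LINK-3-UPDOWN: Interface GigabitEthernet0/1, changed state to down" ratelimit 300
 ! Label 4.5 sorts between "shutdown" (4.0) and "no shutdown" (5.0)
 action 4.5 wait 5
end
```

Action labels are sorted as strings, so 4.5 slots in between the existing 4.0 and 5.0 without renumbering the applet.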
Verify:
show running-config | section event manager
event manager applet IF_RESET_GI0_1
event syslog pattern %LINK-3-UPDOWN: Interface GigabitEthernet0/1, changed state to down
action 1.0 cli command enable
action 2.0 cli command configure terminal
action 3.0 cli command interface GigabitEthernet0/1
action 4.0 cli command shutdown
action 5.0 cli command no shutdown
action 6.0 cli command end
action 7.0 syslog msg Automated remediation executed on HUB: GigabitEthernet0/1 shutdown/no shutdown
Step 4: Configure local logging so remediation events are stored locally
What we are doing: Ensure the router keeps remediation audit messages in the logging buffer so operators can review automation actions after the fact. This matters because automation must be auditable — logs show what automated steps ran and when.
configure terminal
logging buffered 4096
service timestamps log datetime msec
end
What just happened:
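- "logging buffered 4096" sets the local logging buffer to 4096 bytes so EEM-generated messages are retained in memory. This is a small buffer; on a busy router older messages are overwritten quickly, so a larger value may be needed.
- "service timestamps log datetime msec" adds millisecond-resolution timestamps to log messages for accurate event correlation.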
Real-world note: In production, forward logs to a centralized syslog server or SIEM so automated remediation events are preserved even if a device fails.
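Forwarding to a remote collector could look like the following. This is a sketch under assumptions: 192.0.2.10 is a hypothetical server address from the documentation range, and the trap level assumes the EEM audit messages are logged at informational severity.

```text
configure terminal
 ! 192.0.2.10 is a hypothetical syslog server address
 logging host 192.0.2.10
 ! Send messages of severity informational (6) and more severe
 logging trap informational
end
```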
Verify:
show logging | include Automated remediation executed
Apr 1 12:34:05.123: Automated remediation executed on HUB: cleared BGP neighbor 10.0.0.2
Apr 1 13:01:22.456: Automated remediation executed on HUB: GigabitEthernet0/1 shutdown/no shutdown
Step 5: Test remediation by forcing a BGP neighbor down (lab test)
What we are doing: Simulate a BGP adjacency loss by administratively shutting the BR1 interface, then observe automated remediation on the HUB and confirm the applet reported the action. This matters because practical testing ensures the policy triggers and that the remediation performs as expected.
On BR1:
enable
configure terminal
interface GigabitEthernet0/0
shutdown
end
On HUB (observe logs and BGP):
show logging | include BGP-5-ADJCHANGE
show logging | include Automated remediation executed
show ip bgp summary
What just happened:
- Shutting BR1 Gi0/0 causes the link to go down; HUB records a BGP adjacency-down syslog message.
- The EEM applet on HUB matches that message and issues "clear ip bgp 10.0.0.2".
- The remediation syslog message appears in the logging buffer, and "show ip bgp summary" will reflect the cleared or re-established session.
Real-world note: Always run tests like this during maintenance windows in production. In SD‑WAN networks, coordinate with controllers/orchestrators before forcing changes.
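Clearing the neighbor cannot re-establish the session while the link to BR1 is still down, so the test needs a restore step before the session can come back. A minimal sketch on BR1, mirroring the shutdown sequence used above:

```text
enable
configure terminal
interface GigabitEthernet0/0
 ! Re-enable the link so the BGP session to HUB can re-establish
 no shutdown
end
```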
Verify:
show logging | include BGP-5-ADJCHANGE
Apr 1 14:22:10.789: %BGP-5-ADJCHANGE: neighbor 10.0.0.2 Down BGP neighbor is down
show logging | include Automated remediation executed
Apr 1 14:22:10.892: Automated remediation executed on HUB: cleared BGP neighbor 10.0.0.2
show ip bgp summary
BGP router identifier 10.0.0.1, local AS number 65000
BGP table version is 2, main routing table version 2
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
10.0.0.2 4 65001 103 104 2 0 0 00:00:03 5
(Above, the Up/Down timer of 00:00:03 indicates the session was re-established very recently. The session can only come back once BR1 Gi0/0 has been re-enabled with no shutdown.)
Verification Checklist
- Check 1: EEM applet for BGP remediation exists. "show running-config | section event manager" should list BGP_REM_NBR_10_0_0_2.
- Check 2: EEM applet for interface reset exists. "show running-config | section event manager" should list IF_RESET_GI0_1.
- Check 3: Remediation events are logged. "show logging | include Automated remediation executed" should show the EEM syslog messages after testing.
Common Mistakes
| Symptom | Cause | Fix |
|---|---|---|
| EEM applet never triggers | Syslog pattern does not exactly match the router’s message format | Capture an actual syslog message (show logging) and update the event syslog pattern to match exact text |
| Remediation runs repeatedly in a loop | Underlying physical issue persists (e.g., cable) and EEM has no rate limiting | Add logic to the applet (counters/time checks) or move to a policy that requires multiple occurrences before action |
| clear ip bgp affected wrong neighbor | Wrong neighbor IP used in the applet | Verify neighbor IP with show ip bgp summary and update the applet config |
| No audit trail of automated actions | No local or remote logging configured | Configure logging buffered and forward logs to a centralized syslog server |
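The "multiple occurrences before action" fix in the table can be sketched with the occurs and period options of the syslog event detector. The values are illustrative, and the keyword order accepted may vary by release, so treat this as a starting point to verify in the lab.

```text
configure terminal
event manager applet BGP_REM_NBR_10_0_0_2
 ! Trigger only after 3 matching messages within 120 seconds (illustrative values)
 event syslog occurs 3 period 120 pattern "BGP-5-ADJCHANGE: neighbor 10.0.0.2 Down"
end
```

The trade-off is that a single, isolated neighbor-down event no longer triggers remediation; the applet acts only when the session is flapping repeatedly.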
Key Takeaways
- EEM provides a powerful, local automation mechanism to perform immediate remediation based on events such as syslog messages — valuable for reducing MTTR in SD‑WAN edge networks.
- Always validate syslog patterns and test applets in a lab before applying to production; exact message text and IOS versions matter.
- Automated remediation changes protocol state (e.g., clearing BGP restarts the TCP session and triggers route withdrawal and re-advertisement), so be mindful of the operational impact.
- Logging and auditing of remediation actions are essential for post-mortem analysis; forward logs to a centralized system in production.
Tip: In production SD‑WAN deployments, pair local EEM remediation with centralized analytics so you can correlate automated actions against higher-level AI insights and avoid conflicting automated responses.