Lesson 5 of 6

Automated Remediation

Objective

In this lesson you will configure AI-style (rule-based, closed-loop) automated remediation on edge routers using Embedded Event Manager (EEM) applets. You will create EEM policies that detect two common failures, BGP adjacency loss and interface link-down events, from syslog messages and then run predefined remediation actions (clearing the BGP neighbor or resetting the interface). Automated remediation reduces mean time to repair (MTTR) in production networks because predictable corrective tasks run as soon as the problem is detected.

Real-world scenario: in an SD‑WAN edge deployment a remote branch may experience transient interface flaps or BGP session drops. Automatically issuing a controlled reset (clear or shutdown/no shutdown) can restore service without waiting for manual intervention or opening a ticket, while appropriate logging preserves an audit trail.

Quick Recap

This lesson continues from the topology used earlier in the lab (Hub and two Branch routers). No new devices are added in this lesson.

Topology (ASCII):

                         +---------------------+
                         |        HUB          |
                         |  Gi0/0: 10.0.0.1/30 |
                         |  Gi0/1: 10.0.1.1/30 |
                         +---------------------+
                            |               |
                10.0.0.0/30 |               | 10.0.1.0/30
                            |               |
          +---------------------+     +---------------------+
          |        BR1          |     |        BR2          |
          |  Gi0/0: 10.0.0.2/30 |     |  Gi0/0: 10.0.1.2/30 |
          +---------------------+     +---------------------+

Device table:

Device        Hostname   Primary interfaces (exact names)
Hub router    HUB        GigabitEthernet0/0, GigabitEthernet0/1
Branch 1      BR1        GigabitEthernet0/0
Branch 2      BR2        GigabitEthernet0/0

IP addressing:

Link      Device/Interface   IP address
Hub–BR1   HUB Gi0/0          10.0.0.1/30
Hub–BR1   BR1 Gi0/0          10.0.0.2/30
Hub–BR2   HUB Gi0/1          10.0.1.1/30
Hub–BR2   BR2 Gi0/0          10.0.1.2/30

Domain, password, organization used in examples:

  • domain: lab.nhprep.com
  • local administrative password examples: Lab@123
  • organization: NHPREP

Key Concepts

  • Embedded Event Manager (EEM): a local IOS scripting facility that can react to events (syslog messages, timers, SNMP traps) and execute CLI commands. Think of EEM like a local automation “watchdog” that performs a sequence of commands when a trigger happens.
  • Syslog-driven remediation: network protocols (BGP, OSPF) and interface link events emit well-defined syslog messages when their state changes. EEM can match those messages and run remediation. Because the EEM syslog event detector watches messages generated locally on the router, it reacts within moments of the state change and does not depend on a remote collector.
  • Remediation actions and protocol behavior: for BGP, running "clear ip bgp <neighbor-address>" performs a hard reset. It tears down the TCP session and the peers re-establish the session from scratch; routes learned from that neighbor are withdrawn and relearned once the session is back up, which causes route churn. This is not a graceful restart. For an interface reset (shutdown/no shutdown), the device toggles the interface administrative state and forces the link to renegotiate, which often clears transient hardware or driver issues.
  • Safety and auditing: automated actions should be logged and rate-limited. EEM can write syslog messages as part of its action sequence, so each remediation event is recorded. In production, scope remediation policies carefully to avoid loops (for example, don't keep clearing BGP when the underlying link is down, because a clear cannot restore the link).
  • Analogy: Think of EEM as a local "auto-mechanic" in the router that listens for warning lights (syslog messages). When a specific light comes on, it performs a pre-approved repair step and writes a note in the maintenance log.
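Every applet in this lesson follows the same trigger, repair, and audit structure. The skeleton below is a generic sketch; the applet name EXAMPLE_SKELETON and the angle-bracket placeholders are hypothetical and must be replaced with a real pattern, command, and message before use:

event manager applet EXAMPLE_SKELETON
 event syslog pattern "<regular expression matching the trigger message>"
 action 1.0 cli command "enable"
 action 2.0 cli command "<pre-approved remediation command>"
 action 3.0 syslog msg "<audit text describing what was done>"

The event line is the warning light, the cli command actions are the repair step, and the syslog msg action is the note in the maintenance log. EEM runs actions in alphanumeric order of their labels, so keep labels consistently formatted (1.0, 2.0, 3.0) to control execution order.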

Step-by-step configuration

Step 1: Verify existing BGP adjacency and baseline

What we are doing: On HUB, confirm the current BGP neighbors and baseline operational state so you know what “normal” looks like before adding automation. This matters because the remediation actions reference neighbor IPs and rely on BGP state transitions that appear in syslog.

enable
show ip bgp summary

What just happened: The show ip bgp summary command displays BGP peer status, prefixes received, and uptime. We use the neighbor IP (10.0.0.2) in remediation policies, so verifying it exists avoids accidental commands against the wrong neighbor.

Real-world note: Always confirm peer IPs and ASNs before writing automation; a typo could clear the wrong session in production.

Verify:

show ip bgp summary
BGP router identifier 10.0.0.1, local AS number 65000
BGP table version is 1, main routing table version 1
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.0.0.2        4 65001     102     101        1    0    0 00:23:10        5
10.0.1.2        4 65002      50      49        1    0    0 02:12:45        3
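To check a single peer instead of the whole table, you can filter the detailed neighbor output on HUB. This is a sketch; the exact wording of the state line can differ between IOS releases:

show ip bgp neighbors 10.0.0.2 | include BGP state

A healthy peer reports a line beginning with "BGP state = Established". Recording this before enabling automation gives you a reference point for the test in Step 5.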

Step 2: Create EEM applet to remediate BGP adjacency downs

What we are doing: Configure an EEM applet on the HUB to detect syslog messages that indicate the BGP neighbor (10.0.0.2) went down and then clear that neighbor to force a reconnection. This matters because clearing the BGP session can resolve transient TCP/BGP state issues automatically.

configure terminal
event manager applet BGP_REM_NBR_10_0_0_2
 event syslog pattern "BGP-5-ADJCHANGE: neighbor 10.0.0.2 Down"
 action 1.0 cli command "enable"
 action 2.0 cli command "clear ip bgp 10.0.0.2"
 action 3.0 syslog msg "Automated remediation executed on HUB: cleared BGP neighbor 10.0.0.2"
end

What just happened:

  • event manager applet BGP_REM_NBR_10_0_0_2 creates an EEM policy.
  • event syslog pattern tells EEM to trigger when a logged message matches the given regular expression, here the BGP neighbor-down message for 10.0.0.2.
  • The cli command actions run interactive CLI commands on match; here we enable privileged mode and issue clear ip bgp 10.0.0.2.
  • The syslog msg action writes an audit message to the logging buffer so operators can see the remediation event.

Real-world note: The pattern must match the exact message text your IOS release produces. Test the pattern in the lab first, and capture the full message from the log so you can confirm the pattern triggers correctly.
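Because the pattern is a regular expression, each unescaped dot in 10.0.0.2 matches any single character. A stricter variant escapes the dots so only the literal address matches. The sketch below is an assumption to adapt: confirm the exact message text on your release first (for example with show logging | include ADJCHANGE), and note that the applet name BGP_REM_NBR_STRICT is hypothetical:

event manager applet BGP_REM_NBR_STRICT
 event syslog pattern "%BGP-5-ADJCHANGE: neighbor 10\.0\.0\.2 Down"
 action 1.0 cli command "enable"
 action 2.0 cli command "clear ip bgp 10.0.0.2"
 action 3.0 syslog msg "Automated remediation executed on HUB: cleared BGP neighbor 10.0.0.2"

Configure either this applet or BGP_REM_NBR_10_0_0_2, not both; otherwise a single neighbor-down message triggers two clears of the same session.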

Verify:

show running-config | section event manager
event manager applet BGP_REM_NBR_10_0_0_2
 event syslog pattern BGP-5-ADJCHANGE: neighbor 10.0.0.2 Down
 action 1.0 cli command enable
 action 2.0 cli command clear ip bgp 10.0.0.2
 action 3.0 syslog msg Automated remediation executed on HUB: cleared BGP neighbor 10.0.0.2
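Before waiting for a real failure, you can exercise the same action sequence on demand. A sketch, assuming a hypothetical test applet named BGP_REM_MANUAL that uses event none, which registers a policy that runs only when invoked manually with event manager run:

configure terminal
event manager applet BGP_REM_MANUAL
 event none
 action 1.0 cli command "enable"
 action 2.0 cli command "clear ip bgp 10.0.0.2"
 action 3.0 syslog msg "Manual EEM test on HUB: cleared BGP neighbor 10.0.0.2"
end
event manager run BGP_REM_MANUAL

Running it clears a live session, so use it only in the lab or during a maintenance window, and remove the test applet afterwards with no event manager applet BGP_REM_MANUAL.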

Step 3: Create EEM applet to reset an interface on link down

What we are doing: Configure an EEM applet to detect interface Gi0/1 going down and execute a shutdown/no shutdown sequence to attempt to recover transient link issues. This matters because many physical-layer anomalies can be resolved by toggling the administrative state, and doing so automatically reduces manual intervention.

configure terminal
event manager applet IF_RESET_GI0_1
 event syslog pattern "%LINK-3-UPDOWN: Interface GigabitEthernet0/1, changed state to down"
 action 1.0 cli command "enable"
 action 2.0 cli command "configure terminal"
 action 3.0 cli command "interface GigabitEthernet0/1"
 action 4.0 cli command "shutdown"
 action 5.0 cli command "no shutdown"
 action 6.0 cli command "end"
 action 7.0 syslog msg "Automated remediation executed on HUB: GigabitEthernet0/1 shutdown/no shutdown"
end

What just happened:

  • The applet listens for the interface-down syslog message for GigabitEthernet0/1.
  • On match, it enters global configuration, navigates to the interface, applies shutdown followed immediately by no shutdown, then exits config mode.
  • The final syslog msg provides an auditable record that remediation was performed.

Real-world note: Never put aggressive reset logic on production customer-facing trunks or uplinks without safeguards. In production, add rate limiting or occurrence counters to prevent repeated toggles, and scope each applet to a single, specific interface.
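One way to add those safeguards with standard EEM options is sketched below. It is an assumption-based variant, not a drop-in policy: occurs 3 ... period 300 makes the applet fire only after three matching messages within 300 seconds (acting on repeated flaps rather than a single event), and the wait action leaves the interface shut for a few seconds before re-enabling it. Re-entering the applet under the same name updates the existing IF_RESET_GI0_1 instead of adding a second policy; tune both values for your environment:

configure terminal
event manager applet IF_RESET_GI0_1
 event syslog occurs 3 pattern "%LINK-3-UPDOWN: Interface GigabitEthernet0/1, changed state to down" period 300
 action 1.0 cli command "enable"
 action 2.0 cli command "configure terminal"
 action 3.0 cli command "interface GigabitEthernet0/1"
 action 4.0 cli command "shutdown"
 action 4.5 wait 5
 action 5.0 cli command "no shutdown"
 action 6.0 cli command "end"
 action 7.0 syslog msg "Automated remediation executed on HUB: GigabitEthernet0/1 shutdown/no shutdown"
end

The label 4.5 sorts between 4.0 and 5.0, so the wait runs after shutdown and before no shutdown.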

Verify:

show running-config | section event manager
event manager applet IF_RESET_GI0_1
 event syslog pattern %LINK-3-UPDOWN: Interface GigabitEthernet0/1, changed state to down
 action 1.0 cli command enable
 action 2.0 cli command configure terminal
 action 3.0 cli command interface GigabitEthernet0/1
 action 4.0 cli command shutdown
 action 5.0 cli command no shutdown
 action 6.0 cli command end
 action 7.0 syslog msg Automated remediation executed on HUB: GigabitEthernet0/1 shutdown/no shutdown
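The configuration check confirms the applet exists, not that the link recovered. After the applet fires, you can check the interface state and its link messages directly (the exact wording varies slightly by platform):

show ip interface brief | include GigabitEthernet0/1
show logging | include GigabitEthernet0/1

A recovered link shows both Status and Protocol as up; "administratively down" means the no shutdown action did not run.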

Step 4: Configure local logging so remediation events are stored locally

What we are doing: Ensure the router keeps remediation audit messages in the logging buffer so operators can review automation actions after the fact. This matters because automation must be auditable — logs show what automated steps ran and when.

configure terminal
logging buffered 4096
service timestamps log datetime msec
end

What just happened:

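  • logging buffered 4096 sets the local in-memory syslog buffer to 4096 bytes so EEM-generated messages are retained. This is small; on a busy router, use a larger value so audit messages are not overwritten before anyone reads them.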
  • service timestamps log datetime msec adds timestamps to logs for accurate event correlation.

Real-world note: In production, forward logs to a centralized syslog server or SIEM so automated remediation events are preserved even if the device fails or reloads.
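A minimal forwarding sketch follows. The collector address 192.0.2.50 is a hypothetical documentation address; substitute your own syslog server:

configure terminal
logging host 192.0.2.50
logging trap informational
end

logging trap informational sends severities 0 through 6 to the collector, which covers the EEM audit messages, since the syslog msg action logs at informational level by default.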

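Verify (the remediation messages appear only after an applet has fired, for example during the test in Step 5):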

show logging | include Automated remediation executed
Apr  1 12:34:05.123: %HA_EM-6-LOG: BGP_REM_NBR_10_0_0_2: Automated remediation executed on HUB: cleared BGP neighbor 10.0.0.2
Apr  1 13:01:22.456: %HA_EM-6-LOG: IF_RESET_GI0_1: Automated remediation executed on HUB: GigabitEthernet0/1 shutdown/no shutdown

Step 5: Test remediation by forcing a BGP neighbor down (lab test)

What we are doing: Simulate a BGP adjacency loss by administratively shutting the BR1 interface, then observe automated remediation on the HUB and confirm the applet reported the action. This matters because practical testing ensures the policy triggers and that the remediation performs as expected.

On BR1:

enable
configure terminal
interface GigabitEthernet0/0
shutdown
end

On HUB (observe logs and BGP):

show logging | include BGP-5-ADJCHANGE
show logging | include Automated remediation executed
show ip bgp summary

What just happened:

  • Shutting BR1 Gi0/0 takes the link down, and HUB logs a BGP adjacency-down message for 10.0.0.2.
  • The EEM applet on HUB matches that message and issues clear ip bgp 10.0.0.2.
  • The remediation message appears in the logging buffer. The session stays down while BR1 Gi0/0 is shut; once BR1 Gi0/0 is re-enabled with no shutdown, show ip bgp summary shows the session re-established with a short Up/Down timer.

Real-world note: Always run tests like this during maintenance windows in production. In SD‑WAN networks, coordinate with controllers/orchestrators before forcing changes.
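The BGP session cannot re-establish while BR1 Gi0/0 is shut down, so the clear alone will not bring it back. Restore the link on BR1 before checking show ip bgp summary on HUB:

On BR1:

configure terminal
interface GigabitEthernet0/0
no shutdown
end

This also illustrates the scoping point from Key Concepts: clearing BGP cannot repair a link that is down.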

Verify:

show logging | include BGP-5-ADJCHANGE
Apr  1 14:22:10.789: %BGP-5-ADJCHANGE: neighbor 10.0.0.2 Down Interface flap

show logging | include Automated remediation executed
Apr  1 14:22:10.892: %HA_EM-6-LOG: BGP_REM_NBR_10_0_0_2: Automated remediation executed on HUB: cleared BGP neighbor 10.0.0.2

show ip bgp summary
BGP router identifier 10.0.0.1, local AS number 65000
BGP table version is 2, main routing table version 2
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.0.0.2        4 65001     103     104        2    0    0 00:00:03        5

(Above, an Up/Down value of 00:00:03 means the session re-established a few seconds earlier, after BR1 Gi0/0 was brought back up.)

Verification Checklist

  • Check 1: EEM applet for BGP remediation exists — show running-config | section event manager should list BGP_REM_NBR_10_0_0_2.
  • Check 2: EEM applet for interface reset exists — show running-config | section event manager should list IF_RESET_GI0_1.
  • Check 3: Remediation events are logged — show logging | include Automated remediation executed should show the EEM syslog messages after testing.
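Two EEM show commands complement the checklist. The first lists the policies EEM has registered (both applets above should appear); the second lists recent events and the policies they triggered:

show event manager policy registered
show event manager history events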

Common Mistakes

  • Symptom: EEM applet never triggers
    Cause: The syslog pattern does not exactly match the router's message format.
    Fix: Capture an actual syslog message (show logging) and update the event syslog pattern to match the exact text.
  • Symptom: Remediation runs repeatedly in a loop
    Cause: The underlying physical issue persists (for example, a bad cable) and EEM has no rate limiting.
    Fix: Add logic to the applet (counters or time checks), or move to a policy that requires multiple occurrences before acting.
  • Symptom: clear ip bgp affected the wrong neighbor
    Cause: The wrong neighbor IP was used in the applet.
    Fix: Verify the neighbor IP with show ip bgp summary and update the applet configuration.
  • Symptom: No audit trail of automated actions
    Cause: No local or remote logging is configured.
    Fix: Configure logging buffered and forward logs to a centralized syslog server.
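When an applet triggers but its CLI actions appear to do nothing, debugging the cli action shows each command the applet sends and the response it receives. Enable it briefly in the lab only, since debugs add load, and turn it off when done:

debug event manager action cli
undebug all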

Key Takeaways

  • EEM provides a powerful, local automation mechanism to perform immediate remediation based on events such as syslog messages — valuable for reducing MTTR in SD‑WAN edge networks.
  • Always validate syslog patterns and test applets in a lab before applying to production; exact message text and IOS versions matter.
  • Automated remediation changes protocol state (for example, clearing BGP restarts the TCP session and triggers route withdrawal and re-advertisement), so be mindful of the operational impact.
  • Logging and auditing of remediation actions are essential for post-mortem analysis; forward logs to a centralized system in production.

Tip: In production SD‑WAN deployments, pair local EEM remediation with centralized analytics so you can correlate automated actions against higher-level AI insights and avoid conflicting automated responses.