SDA Day-2 Assurance and Troubleshooting
Introduction
Once your SD-Access fabric is deployed and clients are onboarded, the real work begins: keeping everything running smoothly. Day-2 operations cover the assurance, monitoring, and troubleshooting tasks that network engineers perform on an ongoing basis to ensure the fabric delivers the performance and reliability that users expect.
This lesson focuses on the two most critical day-2 disciplines in an SDA environment. First, we look at assurance monitoring, which gives you visibility into application performance, wireless client health, and RF conditions across the fabric. Second, we walk through system upgrade monitoring and troubleshooting, which is essential for keeping your controller platform current without introducing downtime or failures.
By the end of this lesson, you will be able to:
- Configure Application Assurance on a wireless controller to gain visibility into application performance metrics
- Use Intelligent Capture features to troubleshoot wireless client and AP issues
- Monitor a system upgrade through every phase and identify where failures occur
- Read system-updater logs to pinpoint the root cause of upgrade failures
- Troubleshoot application upgrade errors using the correct CLI commands
Key Concepts
Application Assurance
Application Assurance is a feature on the Catalyst 9800 Wireless Controller that provides deep visibility into how applications perform across your wireless network. It leverages flow monitors to track metrics such as loss, jitter, and delay for traffic flowing through your access points. Application Assurance also enables DNS Service monitoring, giving you insight into DNS resolution behavior on your network.
Intelligent Capture
Intelligent Capture is a set of on-demand diagnostic capabilities available through the controller's Device 360 View. It provides both AP-focused and client-focused troubleshooting tools.
| Focus Area | Capabilities |
|---|---|
| Client-focused | Anomaly Capture, Client statistics (collected over 30-second intervals when AP Statistics is enabled) |
| AP-focused | AP statistics, RF Spectrum Analysis, Over-the-Air (OTA) capture |
Important: Client statistics are collected directly from the APs over 30-second intervals. This collection only occurs when AP Statistics is enabled on the controller.
System Upgrade Phases
When upgrading the controller platform, the process is divided into well-defined phases. Understanding these phases is essential for monitoring progress and knowing where to look when something goes wrong.
| Phase | Progress Range | Description |
|---|---|---|
| Preparation | 0% to 31% | Downloads packages to nodes, disables applications |
| System Upgrade | 32% to 94% | Upgrades Linux kernel, Docker, Kubernetes, Maglev services, refreshes certificates |
| Applications Upgrade | 32% to 94% | Upgrades individual application packages after system upgrade completes |
| Post Upgrade | 95% to 100% | Final verification and cleanup |
Note: Most upgrade-related field issues are seen during the Preparation phase, specifically up to the 31% mark where applications are being shut down.
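The phase table can be turned into a small lookup for anyone who polls the progress value in a script. The following is a minimal sketch, not part of any Cisco tooling: the `PHASES` list and the `phases_for_progress` helper are hypothetical names, and the ranges are copied directly from the table. Because the table lists the same range for System Upgrade and Applications Upgrade, the helper returns every phase that matches.

```python
# Progress ranges copied from the phase table above (inclusive bounds).
PHASES = [
    ("Preparation", 0, 31),
    ("System Upgrade", 32, 94),
    ("Applications Upgrade", 32, 94),
    ("Post Upgrade", 95, 100),
]


def phases_for_progress(percent: int) -> list[str]:
    """Return every phase whose progress range contains the given value."""
    if not 0 <= percent <= 100:
        raise ValueError(f"progress must be between 0 and 100, got {percent}")
    return [name for name, low, high in PHASES if low <= percent <= high]


print(phases_for_progress(7))   # ['Preparation']
print(phases_for_progress(34))  # ['System Upgrade', 'Applications Upgrade']
print(phases_for_progress(97))  # ['Post Upgrade']
```

A value at or below 31 points at the Preparation phase, which is where the note above says most field issues appear.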
How It Works
Application Assurance Flow Monitoring
Application Assurance works by attaching IPv4 flow monitors to wireless policy profiles on the Catalyst 9800 controller. When you enable this feature, the controller temporarily shuts down the associated policy profile, applies the flow monitor configuration, and then brings the profile back up.
Three types of flow monitors provide different layers of visibility:
- avc_ipv4_assurance -- General application visibility and control flow data
- avc_ipv4_assurance_dns -- DNS service monitoring to track resolution performance
- avc_ipv4_assurance_rtp -- RTP flow monitoring for voice and video quality metrics including loss, jitter, and delay
For FlexConnect deployments using local switching, Application Experience support (loss, jitter, delay metrics) is available starting from IOS-XE version 17.10. In this mode, the flow monitors use a v9 export format, so the monitor names change to avc_ipv4_assurance_v9 and avc_ipv4_assurance_rtp_v9.
Intelligent Capture Workflow
Intelligent Capture is accessed through the AP's Device 360 View in the controller management interface. The workflow for the two most commonly used features is straightforward:
Spectrum Analysis:
- Navigate to the Access Point Device 360 View
- Select Intelligent Capture
- Start Spectrum Analysis
This gives you a real-time view of the RF spectrum environment around that AP, helping you identify interference sources and channel utilization issues.
Over-the-Air (OTA) Capture:
- Navigate to the AP Device 360 View
- Select Run OTA Capture
- Choose the band, radio, channel width, and channel
The OTA Sniffer mode turns the selected radio into a dedicated sniffer, which means that radio can no longer serve clients while capturing. It captures all packets on the specified channel. This feature requires IOS-XE version 17.11 and AP firmware version 2.3.7 or higher.
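Checking a running version against the minimums quoted above is easy to get wrong with plain string comparison (for example, "17.9" sorts after "17.11" as text). The sketch below compares dotted versions numerically. It assumes purely numeric version strings; suffixes such as a trailing letter are not handled, and `parse_version` and `meets_minimum` are hypothetical helper names.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Split a dotted numeric version such as '17.11.1' into integers."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(running: str, minimum: str) -> bool:
    """Return True when the running version is at or above the minimum."""
    run, req = parse_version(running), parse_version(minimum)
    width = max(len(run), len(req))
    # Pad the shorter tuple with zeros so '17.11' compares equal to '17.11.0'.
    run += (0,) * (width - len(run))
    req += (0,) * (width - len(req))
    return run >= req


# Minimums taken from the text above.
print(meets_minimum("17.12.1", "17.11"))  # True
print(meets_minimum("17.9.4", "17.11"))   # False
print(meets_minimum("2.3.7", "2.3.7"))    # True
```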
System Upgrade Process
The upgrade process follows a structured sequence. During the Preparation phase (0% to 31%), packages are downloaded and copied to all nodes in the cluster. The system then disables user applications before beginning the actual upgrade work.
The System Upgrade phase (32% to 94%) is broken down into multiple sub-phases:
- Quick checks of system memory requirements in the / and data partitions, NTP service status, old file clean-ups, and system setting changes (the upgrade can fail at this stage if requirements are not met)
- Upgrade of the Linux Kernel, Docker, and Kubernetes
- Upgrade of Maglev Server and its services (Kong, RabbitMQ, GlusterFS, MongoDB, Cassandra, and others)
- Certificate refresh
- Cluster health check
- Nodes are upgraded one at a time in a cluster with multiple checks and balances in place
- Restarts typically occur after the Linux Kernel upgrade and after the Kubernetes upgrade if required
Once the system upgrade finishes, the Applications Upgrade begins automatically. Each application package is upgraded individually, and you can track the status of each package with the maglev package status command.
Configuration Example
Configuring Application Assurance on a Policy Profile
To enable Application Assurance on a standard (centrally switched) wireless policy profile:
C9800(config)#wireless profile policy xxx
shutdown
ipv4 flow monitor avc_ipv4_assurance input
ipv4 flow monitor avc_ipv4_assurance_dns input
ipv4 flow monitor avc_ipv4_assurance_rtp input
no shutdown
Important: The policy profile must be shut down before applying the flow monitors. After configuration is complete, issue no shutdown to bring the profile back online. Shutting down the profile temporarily disconnects clients on that policy.
For a FlexConnect local switching policy profile (supported from version 17.10):
C9800(config)#wireless profile policy PP4IoT
shutdown
no central association
no central switching
no central dhcp
ipv4 flow monitor avc_ipv4_assurance_v9 input
ipv4 flow monitor avc_ipv4_assurance_rtp_v9 input
no shutdown
Notice the FlexConnect profile uses the _v9 variant of the flow monitors and includes the no central association, no central switching, and no central dhcp commands that define local switching behavior.
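When the same change has to be applied to many policy profiles, it can help to generate the command list rather than type it each time. The sketch below only assembles the commands already shown in this section; it does not connect to a controller. The function name `policy_profile_config` and its parameters are hypothetical. It emits shutdown in both modes, following the note that the profile must be shut down before flow monitors are applied.

```python
# Flow monitor names as listed in this lesson.
CENTRAL_MONITORS = [
    "avc_ipv4_assurance",
    "avc_ipv4_assurance_dns",
    "avc_ipv4_assurance_rtp",
]
FLEX_MONITORS = [
    "avc_ipv4_assurance_v9",
    "avc_ipv4_assurance_rtp_v9",
]


def policy_profile_config(profile: str, flex_local_switching: bool = False) -> list[str]:
    """Build the config lines for one wireless policy profile."""
    lines = [f"wireless profile policy {profile}", " shutdown"]
    if flex_local_switching:
        # Commands that define local switching behavior.
        lines += [
            " no central association",
            " no central switching",
            " no central dhcp",
        ]
        monitors = FLEX_MONITORS
    else:
        monitors = CENTRAL_MONITORS
    lines += [f" ipv4 flow monitor {name} input" for name in monitors]
    lines.append(" no shutdown")
    return lines


print("\n".join(policy_profile_config("PP4IoT", flex_local_switching=True)))
```

Review the generated lines before pasting them into a session, since the shutdown step disconnects clients on that profile.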
Monitoring a System Upgrade
Check upgrade status at any time with:
$ maglev system_updater update_info
During the download phase you will see output similar to:
System update status:
Version successfully installed : 1.7.1013
Version currently processed : 1.8.222
Update phase : Downloading the host update packages
Update details : Copying the host packages to all the nodes
Progress : 7%
State : DOWNLOADING_UPDATES
Sub-State : INSTALLED_SYSTEMUPDATER
When the system upgrade completes successfully:
System update status:
Version successfully installed : 1.8.222
State : INSTALLING_UPDATES
Sub-State : COMPLETED
Details : The system has been successfully updated
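If you capture this output in a script (for example, to poll it during a maintenance window), the "key : value" layout is simple to parse. The sketch below is one way to do it, assuming the output keeps the layout shown above; `parse_update_info` is a hypothetical helper, and the sample text is copied from the download-phase example.

```python
SAMPLE = """System update status:
Version successfully installed : 1.7.1013
Version currently processed : 1.8.222
Update phase : Downloading the host update packages
Update details : Copying the host packages to all the nodes
Progress : 7%
State : DOWNLOADING_UPDATES
Sub-State : INSTALLED_SYSTEMUPDATER
"""


def parse_update_info(output: str) -> dict[str, str]:
    """Turn 'key : value' lines into a dictionary, skipping lines without a value."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


info = parse_update_info(SAMPLE)
progress = int(info["Progress"].rstrip("%"))
print(info["State"], progress)  # DOWNLOADING_UPDATES 7
```

The header line "System update status:" has no value after the colon, so it is skipped rather than stored as an empty field.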
Troubleshooting Upgrade Failures
If the upgrade fails, the status output shows the failure clearly:
Update phase : failed
Update details : Updating node 10.10.10.10 failed
Progress : 34%
State : FAILED
Sub-State : INSTALLED_HOST_COMPONENTS
Use these log commands to investigate system upgrade failures:
$ magctl service logs -r system-updater
$ cat log/maglev-node-updater-<IP Addr>.log
$ cat log/maglev-hook-installer.log
The key systemd services involved in upgrading Linux and Kubernetes are maglev-node-updater and maglev-hook-installer.
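The failure output names the node that failed, which tells you which node-updater log to open. The sketch below derives the list of log commands from a parsed status. It assumes the "Updating node <IP> failed" wording shown above and uses the log paths exactly as written in this lesson; `log_commands_for_failure` is a hypothetical helper that only builds strings and does not run anything.

```python
import re

# Fields taken from the failed-upgrade example above.
FAILED_STATUS = {
    "Update phase": "failed",
    "Update details": "Updating node 10.10.10.10 failed",
    "Progress": "34%",
    "State": "FAILED",
    "Sub-State": "INSTALLED_HOST_COMPONENTS",
}


def log_commands_for_failure(fields: dict[str, str]) -> list[str]:
    """Return the log commands worth running for a failed system upgrade."""
    if fields.get("State") != "FAILED":
        return []
    commands = ["magctl service logs -r system-updater"]
    match = re.search(
        r"Updating node (\d{1,3}(?:\.\d{1,3}){3}) failed",
        fields.get("Update details", ""),
    )
    if match:
        # Node-specific log, named after the failed node's IP address.
        commands.append(f"cat log/maglev-node-updater-{match.group(1)}.log")
    commands.append("cat log/maglev-hook-installer.log")
    return commands


for command in log_commands_for_failure(FAILED_STATUS):
    print(command)
```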
For application upgrade failures, check the application status and logs:
$ maglev package status
$ magctl appstack status
$ magctl service logs -r <affected application>
The magctl appstack status command shows the status of all services including pod readiness, restart counts, and node assignments. The workflow-worker and maglev-server services are the maglev services responsible for upgrading applications.
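On a busy cluster the service list is long, so filtering for unhealthy entries saves time. The sketch below is built on an assumption: that the output is a whitespace-separated table with a header row containing NAME, READY, STATUS, and RESTARTS columns, similar to a Kubernetes pod listing. Verify the layout on your own system before relying on it. The sample rows and the `unhealthy_pods` helper are hypothetical.

```python
# Hypothetical sample; the column layout is an assumption, not captured output.
SAMPLE = """NAMESPACE      NAME                  READY  STATUS            RESTARTS  AGE
maglev-system  maglev-server-abc12   1/1    Running           0         2d
fusion         example-service-xyz   0/1    CrashLoopBackOff  12        2d
maglev-system  example-job-12345     0/1    Completed         0         2d
"""


def unhealthy_pods(output: str, max_restarts: int = 0) -> list[dict[str, str]]:
    """Flag pods that are not fully ready, not running, or restarting too often."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    idx = {name: header.index(name) for name in ("NAME", "READY", "STATUS", "RESTARTS")}
    flagged = []
    for line in lines[1:]:
        cols = line.split()
        status = cols[idx["STATUS"]]
        if status == "Completed":
            continue  # finished jobs are expected to show 0/1 ready
        ready_now, _, ready_want = cols[idx["READY"]].partition("/")
        restarts = int(cols[idx["RESTARTS"]])
        if status != "Running" or ready_now != ready_want or restarts > max_restarts:
            flagged.append(
                {"name": cols[idx["NAME"]], "status": status, "restarts": str(restarts)}
            )
    return flagged


for pod in unhealthy_pods(SAMPLE):
    print(pod["name"], pod["status"], pod["restarts"])
```

Any pod this flags is a candidate for magctl service logs -r with the affected application name.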
Real-World Application
Assurance in Production
In production SDA deployments, Application Assurance is essential for validating that the fabric is meeting service-level expectations. When users report poor voice or video quality, the RTP flow monitors provide concrete data on loss, jitter, and delay rather than relying on subjective reports.
DNS monitoring is particularly valuable in large campus networks where DNS resolution delays can cause application timeouts that are difficult to diagnose without proper instrumentation.
For sites using FlexConnect with local switching -- common in branch office deployments -- the Application Experience support starting in version 17.10 closes a major visibility gap that previously existed for locally switched traffic.
Intelligent Capture in Troubleshooting
Spectrum Analysis is the go-to tool when you suspect RF interference is degrading wireless performance. Rather than deploying a dedicated spectrum analyzer and walking the floor, you can remotely enable spectrum analysis on any AP and get immediate visibility.
OTA capture is invaluable for diagnosing association failures, roaming issues, or authentication problems. Keep in mind that enabling OTA sniffer mode takes a radio out of service, so plan captures during maintenance windows or on APs with redundant coverage.
Upgrade Best Practices
Best Practice: Before starting an upgrade, verify that the system meets memory requirements for the / and data partitions and that NTP is synchronized. These are the most common causes of early upgrade failure.
Since nodes are upgraded one at a time in a cluster, the upgrade process is designed to maintain service availability. However, the Preparation phase (0% to 31%) is where most field issues occur, particularly around the application shutdown step. If an upgrade fails at this point, check the system-updater logs first.
Summary
- Application Assurance on the Catalyst 9800 uses flow monitors (assurance, DNS, and RTP) attached to policy profiles to provide visibility into application performance, with FlexConnect local switching support from version 17.10
- Intelligent Capture offers client-focused (anomaly capture, client statistics) and AP-focused (AP statistics, spectrum analysis, OTA capture) diagnostics accessible through the Device 360 View; OTA capture requires IOS-XE version 17.11 and AP firmware version 2.3.7 or higher
- System upgrades follow a structured process: Preparation (0-31%), System Upgrade (32-94%), and Post Upgrade (95-100%), with most field issues occurring in the Preparation phase
- Use maglev system_updater update_info to track upgrade progress, and magctl service logs -r system-updater along with node-updater logs to troubleshoot failures
- Application upgrade failures can be diagnosed with magctl appstack status and maglev package status to identify which service is stuck or in an error state
Next steps: Continue building your day-2 operations skills by practicing upgrade procedures in a lab environment and familiarizing yourself with the full range of Intelligent Capture capabilities available on different AP models and controller versions.