SDA Day-2 Assurance and Troubleshooting

Introduction

Once your SD-Access fabric is deployed and clients are onboarded, the real work begins: keeping everything running smoothly. Day-2 operations cover the assurance, monitoring, and troubleshooting tasks that network engineers perform on an ongoing basis to ensure the fabric delivers the performance and reliability that users expect.

This lesson focuses on the two most critical day-2 disciplines in an SDA environment. First, we look at assurance monitoring, which gives you visibility into application performance, wireless client health, and RF conditions across the fabric. Second, we walk through system upgrade monitoring and troubleshooting, which is essential for keeping your controller platform current without introducing downtime or failures.

By the end of this lesson, you will be able to:

Configure Application Assurance on a wireless controller to gain visibility into application performance metrics
Use Intelligent Capture features to troubleshoot wireless client and AP issues
Monitor a system upgrade through every phase and identify where failures occur
Read system-updater logs to pinpoint the root cause of upgrade failures
Troubleshoot application upgrade errors using the correct CLI commands

Key Concepts

Application Assurance

Application Assurance is a feature on the Catalyst 9800 Wireless Controller that provides deep visibility into how applications perform across your wireless network. It leverages flow monitors to track metrics such as loss, jitter, and delay for traffic flowing through your access points. Application Assurance also enables DNS Service monitoring, giving you insight into DNS resolution behavior on your network.

Intelligent Capture

Intelligent Capture is a set of on-demand diagnostic capabilities available through the controller's Device 360 View. It provides both AP-focused and client-focused troubleshooting tools.

Focus Area	Capabilities
Client-focused	Anomaly Capture, Client statistics (collected over 30-second intervals when AP Statistics is enabled)
AP-focused	AP statistics, RF Spectrum Analysis, Over-the-Air (OTA) capture

Important: Client statistics are collected directly from the APs over 30-second intervals. This collection only occurs when AP Statistics is enabled on the controller.

System Upgrade Phases

When upgrading the controller platform, the process is divided into well-defined phases. Understanding these phases is essential for monitoring progress and knowing where to look when something goes wrong.

Phase	Progress Range	Description
Preparation	0% to 31%	Downloads packages to nodes, disables applications
System Upgrade	32% to 94%	Upgrades Linux kernel, Docker, Kubernetes, Maglev services, refreshes certificates
Applications Upgrade	Tracked per package	Upgrades individual application packages after the system upgrade completes; check status with the package status command
Post Upgrade	95% to 100%	Final verification and cleanup

Note: Most upgrade-related field issues are seen during the Preparation phase, specifically up to the 31% mark where applications are being shut down.
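
As a quick aid when reading upgrade status, the percentage bands above can be turned into a lookup. This is a minimal sketch, not part of any Cisco tooling; the function name upgrade_phase is made up for illustration, and it only models the percentage-based phases (the Applications Upgrade is tracked separately with the package status command).

```python
def upgrade_phase(progress: int) -> str:
    """Map a system-update progress percentage to the phase named in the lesson."""
    if not 0 <= progress <= 100:
        raise ValueError(f"progress must be between 0 and 100, got {progress}")
    if progress <= 31:
        return "Preparation"      # packages downloaded, applications disabled
    if progress <= 94:
        return "System Upgrade"   # kernel, Docker, Kubernetes, Maglev services
    return "Post Upgrade"         # final verification and cleanup


print(upgrade_phase(7))   # Preparation
print(upgrade_phase(34))  # System Upgrade
```

A failure reported at 34%, for example, falls just inside the System Upgrade band, which tells you the Preparation work had already finished.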

How It Works

Application Assurance Flow Monitoring

Application Assurance works by attaching IPv4 flow monitors to wireless policy profiles on the Catalyst 9800 controller. When you enable this feature, the controller temporarily shuts down the associated policy profile, applies the flow monitor configuration, and then brings the profile back up.

Three types of flow monitors provide different layers of visibility:

avc_ipv4_assurance -- General application visibility and control flow data
avc_ipv4_assurance_dns -- DNS service monitoring to track resolution performance
avc_ipv4_assurance_rtp -- RTP flow monitoring for voice and video quality metrics including loss, jitter, and delay

For FlexConnect deployments using local switching, Application Experience support (loss, jitter, delay metrics) is available starting from IOS-XE version 17.10. In this mode, the flow monitors use a v9 export format, so the monitor names change to avc_ipv4_assurance_v9 and avc_ipv4_assurance_rtp_v9.
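
The choice of monitor names can be summarized in a few lines of Python. This is an illustrative sketch only: the helper names (parse_version, monitors_for) are hypothetical, and the version gate simply encodes the 17.10 requirement stated above by comparing the first two fields of the version string.

```python
CENTRAL_MONITORS = [
    "avc_ipv4_assurance",
    "avc_ipv4_assurance_dns",
    "avc_ipv4_assurance_rtp",
]
FLEX_LOCAL_MONITORS = [
    "avc_ipv4_assurance_v9",
    "avc_ipv4_assurance_rtp_v9",
]
FLEX_MIN_VERSION = (17, 10)  # first IOS-XE release named in the lesson


def parse_version(text: str) -> tuple:
    """Return (major, minor) from a dotted version string such as '17.10.1'."""
    major, minor = text.split(".")[:2]
    return (int(major), int(minor))


def monitors_for(local_switching: bool, iosxe_version: str) -> list:
    """Pick the flow monitor names for a policy profile."""
    if not local_switching:
        return list(CENTRAL_MONITORS)
    if parse_version(iosxe_version) < FLEX_MIN_VERSION:
        raise ValueError("FlexConnect local switching needs IOS-XE 17.10 or later")
    return list(FLEX_LOCAL_MONITORS)


print(monitors_for(True, "17.10.1"))
```

Comparing versions as integer tuples avoids the string-comparison trap where "17.9" would sort after "17.10".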

Intelligent Capture Workflow

Intelligent Capture is accessed through the AP's Device 360 View in the controller management interface. The workflow for the two most commonly used features is straightforward:

Spectrum Analysis:

Navigate to the Access Point Device 360 View
Select Intelligent Capture
Start Spectrum Analysis

This gives you a real-time view of the RF spectrum environment around that AP, helping you identify interference sources and channel utilization issues.

Over-the-Air (OTA) Capture:

Navigate to the AP Device 360 View
Select Run OTA Capture
Choose the band, radio, channel width, and channel

OTA sniffer mode turns the selected radio into a dedicated sniffer that captures all packets on the specified channel. While the capture is running, that radio cannot serve clients. This feature requires IOS-XE version 17.11 and AP firmware version 2.3.7 or higher.

System Upgrade Process

The upgrade process follows a structured sequence. During the Preparation phase (0% to 31%), packages are downloaded and copied to all nodes in the cluster. The system then disables user applications before beginning the actual upgrade work.

The System Upgrade phase (32% to 94%) is broken down into multiple sub-phases:

Quick checks of system memory requirements in / and data partitions, NTP service status, old file clean-ups, and system setting changes (the upgrade can fail at this stage if requirements are not met)
Upgrade of the Linux Kernel, Docker, and Kubernetes
Upgrade of Maglev Server and its services (Kong, RabbitMQ, GlusterFS, MongoDB, Cassandra, and others)
Certificate refresh
Cluster health check

Throughout this phase, nodes in a cluster are upgraded one at a time, with multiple checks and balances in place. Restarts typically occur after the Linux kernel upgrade and after the Kubernetes upgrade, if required.

Once the system upgrade finishes, the Applications Upgrade begins automatically. Each application package is upgraded individually, and you can track their status using the package status command.

Configuration Example

Configuring Application Assurance on a Policy Profile

To enable Application Assurance on a standard (centrally switched) wireless policy profile:

C9800(config)#wireless profile policy xxx
  shutdown
  ipv4 flow monitor avc_ipv4_assurance input
  ipv4 flow monitor avc_ipv4_assurance_dns input
  ipv4 flow monitor avc_ipv4_assurance_rtp input
  no shutdown

Important: The policy profile must be shut down before applying the flow monitors. After configuration is complete, issue no shutdown to bring the profile back online. This will temporarily disconnect clients on that policy.

For a FlexConnect local switching policy profile (supported from version 17.10):

C9800(config)#wireless profile policy PP4IoT
  shutdown
  no central association
  no central switching
  no central dhcp
  ipv4 flow monitor avc_ipv4_assurance_v9 input
  ipv4 flow monitor avc_ipv4_assurance_rtp_v9 input
  no shutdown

Notice the FlexConnect profile uses the _v9 variant of the flow monitors and includes the no central association, no central switching, and no central dhcp commands that define local switching behavior.
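
If you maintain many policy profiles, it can help to generate these command blocks rather than type them. The following is a sketch under stated assumptions: render_policy_profile is a hypothetical helper, it only emits the commands shown in this lesson, and it always wraps the flow monitor lines in shutdown / no shutdown to follow the rule stated above. Review the output before pasting it into a controller.

```python
def render_policy_profile(name, monitors, local_switching=False):
    """Return the CLI lines for one wireless policy profile as a list of strings."""
    lines = [f"wireless profile policy {name}", "  shutdown"]
    if local_switching:
        # Commands from the lesson that define FlexConnect local switching.
        lines += [
            "  no central association",
            "  no central switching",
            "  no central dhcp",
        ]
    lines += [f"  ipv4 flow monitor {monitor} input" for monitor in monitors]
    lines.append("  no shutdown")
    return lines


flex_lines = render_policy_profile(
    "PP4IoT",
    ["avc_ipv4_assurance_v9", "avc_ipv4_assurance_rtp_v9"],
    local_switching=True,
)
print("\n".join(flex_lines))
```

Passing the monitor names in as an argument keeps the helper independent of which monitor set a given release supports.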

Monitoring a System Upgrade

Check upgrade status at any time with:

$ maglev system_updater update_info

During the download phase you will see output similar to:

System update status:
  Version successfully installed : 1.7.1013
  Version currently processed    : 1.8.222
  Update phase     : Downloading the host update packages
  Update details   : Copying the host packages to all the nodes
  Progress         : 7%
  State            : DOWNLOADING_UPDATES
  Sub-State        : INSTALLED_SYSTEMUPDATER

When the system upgrade completes successfully:

System update status:
  Version successfully installed : 1.8.222
  State            : INSTALLING_UPDATES
  Sub-State        : COMPLETED
  Details          : The system has been successfully updated
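
Because the status output is a series of "label : value" lines, it is easy to turn into a dictionary for scripting or periodic polling. The sketch below parses text you have already captured; it does not run the maglev command itself, and the function names are illustrative rather than part of any Cisco tool. The sample text is the download-phase output shown above.

```python
SAMPLE = """\
System update status:
  Version successfully installed : 1.7.1013
  Version currently processed    : 1.8.222
  Update phase     : Downloading the host update packages
  Update details   : Copying the host packages to all the nodes
  Progress         : 7%
  State            : DOWNLOADING_UPDATES
  Sub-State        : INSTALLED_SYSTEMUPDATER
"""


def parse_update_info(text):
    """Split each 'label : value' line into a dictionary entry."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():  # skips the header line, which has no value
            fields[key.strip()] = value.strip()
    return fields


def progress_percent(fields):
    """Return the Progress field as an integer, or None when it is absent."""
    value = fields.get("Progress")
    return int(value.rstrip("%")) if value else None


fields = parse_update_info(SAMPLE)
print(fields["State"], progress_percent(fields))  # DOWNLOADING_UPDATES 7
```

Returning None for a missing Progress field matters because the completed-upgrade output shown above does not include that line.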

Troubleshooting Upgrade Failures

If the upgrade fails, the status output shows the failure clearly:

  Update phase     : failed
  Update details   : Updating node 10.10.10.10 failed
  Progress         : 34%
  State            : FAILED
  Sub-State        : INSTALLED_HOST_COMPONENTS

Use these log commands to investigate system upgrade failures:

$ magctl service logs -r system-updater
$ cat log/maglev-node-updater-<IP Addr>.log
$ cat log/maglev-hook-installer.log

The key systemd services involved in upgrading Linux and Kubernetes are maglev-node-updater and maglev-hook-installer.
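
When the failure message names a node, as in "Updating node 10.10.10.10 failed", you can derive which log files to open first. This is a minimal sketch: failed_node_logs is a hypothetical helper, and it assumes the <IP Addr> placeholder in the log file name is replaced by the dotted address exactly as it appears in the status output.

```python
import re

NODE_FAILURE = re.compile(r"Updating node (\d{1,3}(?:\.\d{1,3}){3}) failed")


def failed_node_logs(update_details):
    """Return the log paths to inspect for a node failure, or [] if none is named."""
    match = NODE_FAILURE.search(update_details)
    if match is None:
        return []
    node_ip = match.group(1)
    return [
        f"log/maglev-node-updater-{node_ip}.log",  # assumed substitution of <IP Addr>
        "log/maglev-hook-installer.log",
    ]


for path in failed_node_logs("Updating node 10.10.10.10 failed"):
    print(path)
```

An empty list means the status text did not name a node, in which case the system-updater service logs are the place to start.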

For application upgrade failures, check the application status and logs:

$ maglev package status
$ magctl appstack status
$ magctl service logs -r <affected application>

The magctl appstack status command shows the status of all services including pod readiness, restart counts, and node assignments. The workflow-worker and maglev-server services are the maglev services responsible for upgrading applications.

Real-World Application

Assurance in Production

In production SDA deployments, Application Assurance is essential for validating that the fabric is meeting service-level expectations. When users report poor voice or video quality, the RTP flow monitors provide concrete data on loss, jitter, and delay rather than relying on subjective reports.

DNS monitoring is particularly valuable in large campus networks where DNS resolution delays can cause application timeouts that are difficult to diagnose without proper instrumentation.

For sites using FlexConnect with local switching -- common in branch office deployments -- the Application Experience support starting in version 17.10 closes a major visibility gap that previously existed for locally switched traffic.

Intelligent Capture in Troubleshooting

Spectrum Analysis is the go-to tool when you suspect RF interference is degrading wireless performance. Rather than deploying a dedicated spectrum analyzer and walking the floor, you can remotely enable spectrum analysis on any AP and get immediate visibility.

OTA capture is invaluable for diagnosing association failures, roaming issues, or authentication problems. Keep in mind that enabling OTA sniffer mode takes a radio out of service, so plan captures during maintenance windows or on APs with redundant coverage.

Upgrade Best Practices

Best Practice: Before starting an upgrade, verify that the system meets memory requirements for the / and data partitions and that NTP is synchronized. These are the most common causes of early upgrade failure.
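
The free-space part of this pre-check is simple to script. The sketch below uses Python's standard shutil.disk_usage; the required sizes are deliberately left to the caller because this lesson does not state them, and the data partition mount point shown in the comment is an assumption. It does not cover the NTP check.

```python
import shutil

GIB = 1024 ** 3


def free_gib(path):
    """Return the free space, in GiB, on the filesystem that holds path."""
    return shutil.disk_usage(path).free / GIB


def partitions_below(required_gib):
    """Return the paths whose free space is under the caller-supplied requirement.

    required_gib maps a mount point to a minimum in GiB, for example
    {"/": 10, "/data": 50}. Those numbers and the /data path are placeholders,
    not documented requirements; take the real values from the release notes.
    """
    return [path for path, need in required_gib.items() if free_gib(path) < need]


print(f"{free_gib('/'):.1f} GiB free on /")
```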

Since nodes are upgraded one at a time in a cluster, the upgrade process is designed to maintain service availability. However, the Preparation phase (0% to 31%) is where most field issues occur, particularly around the application shutdown step. If an upgrade fails at this point, check the system-updater logs first.

Summary

Application Assurance on the Catalyst 9800 uses flow monitors (assurance, DNS, and RTP) attached to policy profiles to provide visibility into application performance, with FlexConnect local switching support from IOS-XE version 17.10
Intelligent Capture offers client-focused (anomaly capture, client statistics) and AP-focused (AP statistics, spectrum analysis, OTA capture) diagnostics accessible through the Device 360 View; OTA capture requires IOS-XE version 17.11 and AP firmware 2.3.7 or higher
System upgrades follow a structured process: Preparation (0-31%), System Upgrade (32-94%), and Post Upgrade (95-100%), with most field issues occurring in the Preparation phase
Use maglev system_updater update_info to track upgrade progress, and magctl service logs -r system-updater along with node-updater logs to troubleshoot failures
Application upgrade failures can be diagnosed with magctl appstack status and maglev package status to identify which service is stuck or in an error state

Next steps: Continue building your day-2 operations skills by practicing upgrade procedures in a lab environment and familiarizing yourself with the full range of Intelligent Capture capabilities available on different AP models and controller versions.