Catalyst Center Troubleshooting
Objective
In this lesson you will learn how to troubleshoot common Catalyst Center problems using the built-in CLI tooling: checking service and appstack status, reading service logs (raw, truncated, and live), examining system-upgrade artifacts, and performing controlled service restarts. These skills matter in production because upgrades and application failures are among the most common causes of degraded management-plane visibility, and rapid diagnosis prevents long outages for device inventory, provisioning, and telemetry. Real-world scenario: during an upgrade, some application pods fail to become ready; you must inspect appstack status, correlate with system-updater and maglev logs, and decide whether to restart a container or escalate to a redeploy.
Quick Recap
Refer to the topology used in earlier lessons. This lesson does not add devices; we focus on Catalyst Center nodes that appear in diagnostic output:
ASCII topology (management/control plane view)
[Operator Console]
|
| (management)
|
[Catalyst Center Node A]
- eth0: 10.128.249.10
- pod IP: 169.254.41.122
[Catalyst Center Node B]
- pod IP: 169.254.1.21
Important: all IPs shown above are taken directly from the diagnostic output that Catalyst Center produces during upgrades and failure conditions. In production, these are the addresses you will see when correlating pod readiness to node reachability.
Device Table
| Device | Role | Management IPs (from output) |
|---|---|---|
| Catalyst Center Node A | maglev / application host | 10.128.249.10, pod 169.254.41.122 |
| Catalyst Center Node B | node that received update | pod 169.254.1.21 |
Key Concepts
- magctl / maglev: magctl is the CLI used to inspect and control Catalyst Center services (appstack, service logs, restarts). maglev is the underlying system updater and package manager. Think of magctl as the remote control and maglev as the engine — you query magctl to see the engine state and read maglev logs to understand what the engine did.
- Appstack vs Service: An appstack view (magctl appstack status) shows Kubernetes pods and readiness for an application. A service-level view (magctl service logs/status/restart) targets the logical service managed by Catalyst Center. When a pod fails to become ready, start with appstack to see which pod/node is affected, then inspect the service logs.
- Upgrade phases: System Update progression has phases (Preparation 0–31%, Upgrade 32–94%, Post 95–100%). During the 32–94% phase, node-updater logs and system-updater logs are the most useful artifacts — they show per-node timing and errors such as "Node update took longer to complete in node 169.254.1.21".
- Log types and tailing: Use `magctl service logs -r <service>` to dump raw logs, append `| tail -n N` to see the last N lines, and use `magctl service logs -rf <service>` to follow live output (like `tail -f`). Use `cat` on maglev log files for node-specific history.
- Restart semantics: `magctl service restart <service>` performs a soft restart (container restart). `magctl service restart -d <service>` performs a hard restart (pod deletion and recreation); a hard restart can cause loss of non-persistent in-container data. Always understand persistence before choosing `-d`.
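The tailing concepts above can be demonstrated off-appliance. This is a minimal sketch using a plain file, since `magctl` only exists on a Catalyst Center node; the `tail` semantics are identical to what `| tail -n N` and the `-rf` flag give you.

```shell
# Create a small sample log (stand-in for a service log dump).
printf 'line1\nline2\nline3\nline4\n' > /tmp/demo.log

# Last 2 lines, analogous to: magctl service logs -r system-updater | tail -n 2
tail -n 2 /tmp/demo.log

# Live follow (magctl service logs -rf <service>) behaves like:
#   tail -f /tmp/demo.log    # blocks and streams new lines; Ctrl-C to stop
```

The follow form is shown as a comment because it blocks until interrupted.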
Step-by-step configuration / troubleshooting
Step 1: Inspect appstack status
What we are doing: Query the appstack to find unhealthy or not-ready pods. This quickly shows which application or pod is failing and on which node — essential first step to localize the problem.
magctl appstack status
What just happened: The command enumerates all appstacks (namespaces and pod names) and prints the READY count, STATUS, RESTARTS, AGE, IP, and NODE. The NODE column ties a pod back to the worker node (e.g., 10.128.249.10) while the IP column shows the pod IP (e.g., 169.254.41.122). Seeing READY 1/2 indicates the pod containers are not fully ready.
Real-world note: In large deployments, appstack status lets you spot a partially-ready application quickly; start here before tailing logs.
Verify:
magctl appstack status
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
fusion apic-em-pki-broker-service-595ddd545b-txqnt 1/2 Running 93 6d4h 169.254.41.122 10.128.249.10 <none> <none>
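When many pods are listed, scanning the READY column by eye is error-prone. This is a sketch of an `awk` filter that flags any pod whose READY count is not full; the sample line is copied from the verify output above (assumption: the same column layout), and on a real node you would pipe `magctl appstack status` into the same filter.

```shell
# Sample appstack line saved from the lesson's verify output.
cat > /tmp/appstack.txt <<'EOF'
fusion apic-em-pki-broker-service-595ddd545b-txqnt 1/2 Running 93 6d4h 169.254.41.122 10.128.249.10 <none> <none>
EOF

# Column 3 is READY (e.g. 1/2); print name, node, and readiness for
# any pod where the two halves of READY do not match.
awk '{ split($3, r, "/"); if (r[1] != r[2]) print $2, "on node", $8, "is", $3 }' /tmp/appstack.txt
```

On the appliance the equivalent one-liner would be `magctl appstack status | awk '...'` with the header line skipped.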
Step 2: Check package and system-updater state
What we are doing: Examine the package deployment status and system updater info to determine whether a system or application upgrade is in progress, paused, or failed. This helps correlate app failures to recent upgrade steps.
maglev package status
maglev system_updater update_info
What just happened: maglev package status reports application versions (deployed and available) and status (e.g., UPGRADE_ERROR). maglev system_updater update_info provides high-level updater state including installed version, current version being processed, and the sub-state (e.g., INSTALLING_UPDATES or COMPLETED). If package status shows UPGRADE_ERROR it means an application upgrade encountered an exception or timeout.
Real-world note: During rolling upgrades, package mismatches between deployed and available versions can be normal; UPGRADE_ERROR needs immediate attention.
Verify:
maglev package status
NAME DISPLAY_NAME DEPLOYED AVAILABLE STATUS
network-visibility Network Controller Platform 2.1.718.60779 2.1.720.60128 UPGRADE_ERROR -Exception in task -Maximum wait time 5400 seconds exceeded for the following services to be ready: apic-em-pki-broker-service
maglev system_updater update_info
System update status:
Version successfully installed : 1.8.222
Updater State:
Currently processed version : 1.8.222
State : INSTALLING_UPDATES
Sub-State : COMPLETED
Details :The system has been successfully updated
Source : system-updater
Abort pending : False
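Because `maglev package status` lists every application, a quick grep tells you whether anything is in the UPGRADE_ERROR state. A minimal sketch, using a sample line from the verify output above; on the appliance you would run `maglev package status | grep UPGRADE_ERROR` directly.

```shell
# Saved package-status output (truncated sample from the lesson).
cat > /tmp/pkg_status.txt <<'EOF'
network-visibility Network Controller Platform 2.1.718.60779 2.1.720.60128 UPGRADE_ERROR -Exception in task
EOF

# Count packages in UPGRADE_ERROR; a non-zero count needs attention.
grep -c 'UPGRADE_ERROR' /tmp/pkg_status.txt
```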
Step 3: Read system-updater and node-updater logs
What we are doing: Retrieve the system-updater service logs and node-updater log for the affected node (169.254.1.21). These logs contain stack traces and timing messages that explain why a node update took longer or failed.
magctl service logs -r system-updater
magctl service logs -r system-updater | tail -n 50
cat log/maglev-node-updater-169.254.1.21.log
cat log/maglev-hook-installer.log
What just happened: The first command prints the full system-updater log. Piping to tail shows the recent lines; this is useful to find final errors. The cat of maglev-node-updater-169.254.1.21.log pulls the node-specific log — many node-related timing and exception messages are recorded there. maglev-hook-installer.log contains hook-related installation messages. You will often find messages like "Node update took longer to complete in node 169.254.1.21" and a system-updater error noting the node issue.
Real-world note: Node-specific logs help determine if the issue is node-local (resource exhaustion, reboot) versus application-specific.
Verify:
magctl service logs -r system-updater
| 2004 | 2025-01-30T00:56:41.565Z | ERROR| 57 | ThreadPoolExecutor-4_2 | 140303126214400 | node-updater | node_updater.py:709 | Node update took longer to complete in node 169.254.1.21 |
| 2005 | 2025-01-30T00:56:41.589Z | ERROR| 57 | MainThread | 140304732464960 | system-updater | system_update_orchestrator.py:452 | Status: 1/Node update took longer to complete in 169.254.1.21
cat log/maglev-node-updater-169.254.1.21.log
2025-01-30T00:56:40.000Z INFO Node update started for node 169.254.1.21
2025-01-30T00:56:41.565Z ERROR Node update took longer to complete in node 169.254.1.21
cat log/maglev-hook-installer.log
2025-01-30 00:55:00 INFO Hook installer started
2025-01-30 00:56:30 ERROR Hook install timeout for node 169.254.1.21
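Node-updater logs can be long, so it helps to extract just the timing failures. This sketch filters a saved copy of the node log (sample lines from the verify output above) for the "took longer" message; on the appliance the source would be `cat log/maglev-node-updater-169.254.1.21.log`.

```shell
# Saved node-updater log (sample lines from the lesson).
cat > /tmp/node-updater.log <<'EOF'
2025-01-30T00:56:40.000Z INFO Node update started for node 169.254.1.21
2025-01-30T00:56:41.565Z ERROR Node update took longer to complete in node 169.254.1.21
EOF

# Pull only the timing-failure lines.
grep 'took longer' /tmp/node-updater.log
```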
Step 4: Inspect failing application service logs
What we are doing: Pull the logs for the failing application (apic-em-pki-broker-service) to identify application-side errors like connection failures, missing files, or configuration errors.
magctl service logs -r apic-em-pki-broker-service
magctl service status apic-em-pki-broker-service
magctl service display apic-em-pki-broker-service
What just happened: magctl service logs -r dumps the raw application logs; these often contain stack traces (e.g., "There was an exception while connecting to the URL http://localhost:16029/pki-is-ejbca-ready") or configuration errors (e.g., "DiskConfig file:/media/floppy/config.json not found."). magctl service status shows concise status for the service, and magctl service display shows stateful info. Together they help determine whether the problem is internal to the service or due to an external dependency.
Real-world note: Application logs are the most direct evidence of why a particular service is not ready. For example, a missing local file or downstream service can cause readiness probes to fail repeatedly.
Verify:
magctl service logs -r apic-em-pki-broker-service
2025-01-24 07:06:29,791 | ERROR| pool-4-thread-1 | | c.c.e.pki.impl.utils.MakeRestCalls | There was an exception while connecting to the URL http://localhost:16029/pki-is-ejbca-ready
2025-01-24 07:06:30,828 | ERROR| main | | c.c.grapevine.api.SecurityManager | DiskConfig file:/media/floppy/config.json not found.
magctl service status apic-em-pki-broker-service
Name: apic-em-pki-broker-service
Status: Degraded
Ready: 1/2
Restarts: 93
Node: 10.128.249.10
Pod IP: 169.254.41.122
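Application log dumps are verbose, so a first pass usually isolates the ERROR lines. A sketch on the two sample lines from the verify output above; on the appliance you would pipe `magctl service logs -r apic-em-pki-broker-service` through the same filter.

```shell
# Saved application log (sample lines from the lesson).
cat > /tmp/pki.log <<'EOF'
2025-01-24 07:06:29,791 | ERROR| pool-4-thread-1 | | c.c.e.pki.impl.utils.MakeRestCalls | There was an exception while connecting to the URL http://localhost:16029/pki-is-ejbca-ready
2025-01-24 07:06:30,828 | ERROR| main | | c.c.grapevine.api.SecurityManager | DiskConfig file:/media/floppy/config.json not found.
EOF

# Show only ERROR-level entries.
grep 'ERROR' /tmp/pki.log
```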
Step 5: Controlled service restart (soft and hard)
What we are doing: Attempt a soft restart first to let the container gracefully restart; if that fails, perform a hard restart (pod deletion) as an escalation. Understand the difference: hard restart recreates the pod and can result in loss of non-persistent in-container data.
magctl service restart apic-em-pki-broker-service
# If soft restart does not resolve, escalate:
magctl service restart -d apic-em-pki-broker-service
What just happened: The first command asks the orchestration layer to restart the service container(s) in-place — useful to recover transient faults. The -d flag performs a pod deletion and recreation (hard restart) — this forces a fresh pod lifecycle and can clear corrupted in-memory state but will drop ephemeral container data. Use magctl appstack status afterward to confirm readiness.
Warning: Use hard restart only when you are certain ephemeral state is not required (or you have backups). Hard restart is equivalent to deleting the pod and letting the controller re-create it.
Verify:
magctl service restart apic-em-pki-broker-service
# Expected result: command returns success and you then check appstack
magctl appstack status
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
fusion apic-em-pki-broker-service-595ddd545b-txqnt 2/2 Running 94 6d4h 169.254.41.122 10.128.249.10 <none> <none>
If soft restart did not resolve:
magctl service restart -d apic-em-pki-broker-service
magctl appstack status
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
fusion apic-em-pki-broker-service-595ddd545b-xxxxx 1/1 Running 0 0m 169.254.41.130 10.128.249.10 <none> <none>
(Expect the pod name and IP to change after hard restart; above shows the logical outcome of recreation.)
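The soft-then-hard escalation above can be expressed as a small decision helper. This is a sketch under the assumption that you have captured the READY value (e.g. `1/2`) for the service; the `magctl` commands themselves appear only as comments because they exist only on the appliance, and `check_and_escalate` is a hypothetical helper name.

```shell
# Decide whether a restart is needed, given a READY value like "1/2".
check_and_escalate() {
  ready="$1"
  if [ "${ready%/*}" = "${ready#*/}" ]; then
    echo "healthy: no restart needed"
  else
    echo "not ready: try soft restart first"   # magctl service restart <service>
    echo "if still not ready: hard restart"    # magctl service restart -d <service>
  fi
}

check_and_escalate "1/2"
```

In practice you would re-run `magctl appstack status` after the soft restart and only escalate to `-d` if READY is still partial.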
Step 6: Live monitor upgrade progress
What we are doing: Follow the system-updater logs in real time during an upgrade or while a restart is happening to confirm progression or to catch recurrent errors as they occur.
magctl service logs -rf system-updater
What just happened: The -rf flag streams the system-updater log live to your console, equivalent to tail -f. This is vital during an upgrade window when you need real-time feedback on operations (node updates, hook executions, timeouts).
Real-world note: Use live tailing during maintenance windows to immediately respond to timeouts (e.g., maximum wait time exceeded) and to capture transient errors that do not persist in static logs.
Verify: Run the command and observe live entries similar to the earlier error lines. You should see continuous output; when the process completes you will see state transitions similar to the update_info output.
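During a live follow you usually want only the error lines, not the full stream. The real command blocks until interrupted, so this sketch demonstrates the same filter on a finite simulated stream; the piped form with `magctl` is shown as a comment.

```shell
# Live form (blocks until Ctrl-C):
#   magctl service logs -rf system-updater | grep -i 'error'

# Finite demo of the same filter on a simulated stream:
printf '%s\n' \
  '2025-01-30T00:56:40Z INFO  hook executed' \
  '2025-01-30T00:56:41Z ERROR Node update took longer to complete' \
  | grep -i 'error'
```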
Verification Checklist
- Check 1: Appstack shows pods READY for the affected application. Verify with `magctl appstack status`.
- Check 2: System updater completed or indicates an actionable error. Verify with `maglev system_updater update_info`.
- Check 3: Node-specific updater logs do not show repeated timeouts for node 169.254.1.21. Verify with `cat log/maglev-node-updater-169.254.1.21.log`.
- Check 4: Application logs no longer show connection or missing-file errors. Verify with `magctl service logs -r apic-em-pki-broker-service`.
Common Mistakes
| Symptom | Cause | Fix |
|---|---|---|
| App shows READY 1/2 repeatedly | Readiness probe failing due to missing local file or dependency | Inspect magctl service logs -r <service> and fix missing config or dependency; soft restart if transient |
| Upgrade stalls with UPGRADE_ERROR | Maximum wait time exceeded for specific service(s) | Check maglev package status and magctl appstack status to identify stuck services; inspect logs and perform controlled restart if safe |
| Node update shows "took longer to complete" for 169.254.1.21 | Node-specific hook or resource issue during upgrade | Inspect cat log/maglev-node-updater-169.254.1.21.log to determine root cause; remediation may require node-level recovery |
| Hard restart used instead of soft restart | Operator escalated to pod deletion without confirming persistence needs | Avoid -d unless confirmed; soft restart is less disruptive. If used, ensure any ephemeral data was not critical |
Key Takeaways
- Always start troubleshooting with `magctl appstack status` to identify the failing application/pod and the node involved; this localizes your efforts quickly.
- Use `maglev package status` and `maglev system_updater update_info` to correlate application failures with upgrade activity; many issues are upgrade-timing related.
- Logs are your primary source of truth: `magctl service logs -r` for raw dumps, `| tail -n N` for recent lines, and `magctl service logs -rf` for live monitoring. Node-specific files such as `log/maglev-node-updater-169.254.1.21.log` contain valuable per-node details.
- Restart carefully: prefer soft restarts, and only use hard restart (`-d`) when you understand the consequences for non-persistent in-container data.
Tip: In production, combine these CLI checks with the Catalyst Center UI at lab.nhprep.com (read-only dashboards) and an out-of-band access method to the node before performing destructive actions.
This completes Lesson 7: Catalyst Center Troubleshooting. You should now be able to identify failing appstacks, collect the correct logs, interpret upgrade-related errors, and perform safe recovery actions.