Lesson 7 of 7

Catalyst Center Troubleshooting

Objective

In this lesson you will learn how to troubleshoot common Catalyst Center problems using the built‑in CLI tooling: checking service and appstack status, reading service logs (raw, truncated, and live), examining system-upgrade artifacts, and performing controlled service restarts. These troubleshooting skills matter in production because upgrades and application failures are the most frequent causes of degraded management-plane visibility; rapid diagnosis prevents long outages for device inventory, provisioning, and telemetry. Real-world scenario: during an upgrade some application pods fail to become ready — you must inspect appstack status, correlate with system-updater and maglev logs, and decide whether to restart a container or escalate to re-deploy.

Quick Recap

Refer to the topology used in earlier lessons. This lesson does not add devices; we focus on Catalyst Center nodes that appear in diagnostic output:

ASCII topology (management/control plane view)

[Operator Console]
    |
    | (management)
    |
[Catalyst Center Node A]
    - eth0: 10.128.249.10
    - pod IP: 169.254.41.122

[Catalyst Center Node B]
    - pod IP: 169.254.1.21

Important: all IPs shown above are taken directly from the diagnostic output that Catalyst Center produces during upgrades and failure conditions. In production, these are the addresses you will see when correlating pod readiness to node reachability.

Device Table

Device                    Role                         Management IPs (from output)
Catalyst Center Node A    maglev / application host    10.128.249.10, pod 169.254.41.122
Catalyst Center Node B    node that received update    pod 169.254.1.21

Key Concepts

  • magctl / maglev: magctl is the CLI used to inspect and control Catalyst Center services (appstack, service logs, restarts). maglev is the underlying system updater and package manager. Think of magctl as the remote control and maglev as the engine — you query magctl to see the engine state and read maglev logs to understand what the engine did.
  • Appstack vs Service: An appstack view (magctl appstack status) shows Kubernetes pods and readiness for an application. A service-level view (magctl service logs/status/restart) targets the logical service managed by Catalyst Center. When a pod fails to become ready, start with appstack to see which pod/node is affected, then inspect the service logs.
  • Upgrade phases: System Update progression has phases (Preparation 0–31%, Upgrade 32–94%, Post 95–100%). During the 32–94% phase, node-updater logs and system-updater logs are the most useful artifacts — they show per-node timing and errors such as "Node update took longer to complete in node 169.254.1.21".
  • Log types and tailing: Use magctl service logs -r <service> to dump the raw logs, pipe through tail -n N to see only the last N lines, and use magctl service logs -rf <service> to follow live output (like tail -f). Use cat on the maglev log files for node-specific history.
  • Restart semantics: magctl service restart <service> performs a soft restart (container restart). magctl service restart -d <service> performs a hard restart (pod deletion and recreation) — hard restart can cause loss of non-persistent in-container data. Always understand persistence before choosing -d.

Step-by-step configuration / troubleshooting

Step 1: Inspect appstack status

What we are doing: Query the appstack to find unhealthy or not-ready pods. This quickly shows which application or pod is failing and on which node — essential first step to localize the problem.

magctl appstack status

What just happened: The command enumerates all appstacks (namespaces and pod names) and prints the READY count, STATUS, RESTARTS, AGE, IP, and NODE. The NODE column ties a pod back to the worker node (e.g., 10.128.249.10) while the IP column shows the pod IP (e.g., 169.254.41.122). Seeing READY 1/2 indicates the pod containers are not fully ready.

Real-world note: In large deployments, appstack status lets you spot a partially-ready application quickly; start here before tailing logs.

Verify:

magctl appstack status
NAMESPACE           NAME                                                  READY  STATUS    RESTARTS   AGE    IP             NODE         NOMINATED NODE   READINESS GATES
fusion              apic-em-pki-broker-service-595ddd545b-txqnt           1/2    Running   93         6d4h  169.254.41.122  10.128.249.10 <none>            <none>
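The READY column lends itself to mechanical filtering when many appstacks are listed. A minimal POSIX-sh sketch, assuming the column layout shown above; the demonstration row is the sample from the verify output, not live data:

```shell
#!/bin/sh
# List pods whose READY count (column 3, "n/m") is not full.
# Column positions are assumed from the appstack status layout above.
not_ready() {
  awk 'NR > 1 { split($3, r, "/"); if (r[1] != r[2]) print $2 " on node " $8 " is " $3 }'
}

# On a real appliance (skipped silently where magctl is absent):
command -v magctl >/dev/null && magctl appstack status | not_ready

# Offline demonstration with the sample row from the verify output:
printf '%s\n%s\n' \
  'NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE' \
  'fusion apic-em-pki-broker-service-595ddd545b-txqnt 1/2 Running 93 6d4h 169.254.41.122 10.128.249.10' |
  not_ready
```

The demonstration prints the pod name, its node, and the partial READY count, which is exactly the triple you need before moving to service logs.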

Step 2: Check package and system-updater state

What we are doing: Examine the package deployment status and system updater info to determine whether a system or application upgrade is in progress, paused, or failed. This helps correlate app failures to recent upgrade steps.

maglev package status
maglev system_updater update_info

What just happened: maglev package status reports application versions (deployed and available) and status (e.g., UPGRADE_ERROR). maglev system_updater update_info provides high-level updater state including installed version, current version being processed, and the sub-state (e.g., INSTALLING_UPDATES or COMPLETED). If package status shows UPGRADE_ERROR it means an application upgrade encountered an exception or timeout.

Real-world note: During rolling upgrades, package mismatches between deployed and available versions can be normal; UPGRADE_ERROR needs immediate attention.

Verify:

maglev package status
NAME                 DISPLAY_NAME                           DEPLOYED     AVAILABLE     STATUS
network-visibility   Network Controller Platform            2.1.718.60779  2.1.720.60128 UPGRADE_ERROR -Exception in task -Maximum wait time 5400 seconds exceeded for the following services to be ready: apic-em-pki-broker-service
maglev system_updater update_info
System update status:
Version successfully installed : 1.8.222
Updater State:
    Currently processed version : 1.8.222
    State : INSTALLING_UPDATES
    Sub-State : COMPLETED
Details :The system has been successfully updated
Source : system-updater
Abort pending : False
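When many packages are listed, you mostly care about which ones sit in UPGRADE_ERROR. A small sketch, assuming the column layout of the package status output above (the demonstration line is a shortened copy of the sample row):

```shell
#!/bin/sh
# Print the package name (column 1) of any row flagged UPGRADE_ERROR.
# Layout assumed from the `maglev package status` output shown above.
failed_packages() {
  awk '/UPGRADE_ERROR/ { print $1 }'
}

# On a real appliance (skipped silently where maglev is absent):
command -v maglev >/dev/null && maglev package status | failed_packages

# Offline demonstration with an abbreviated sample row:
echo 'network-visibility   Network Controller Platform   2.1.718.60779  2.1.720.60128 UPGRADE_ERROR -Exception in task' |
  failed_packages
```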

Step 3: Read system-updater and node-updater logs

What we are doing: Retrieve the system-updater service logs and node-updater log for the affected node (169.254.1.21). These logs contain stack traces and timing messages that explain why a node update took longer or failed.

magctl service logs -r system-updater
magctl service logs -r system-updater | tail -n 50
cat log/maglev-node-updater-169.254.1.21.log
cat log/maglev-hook-installer.log

What just happened: The first command prints the full system-updater log. Piping to tail shows the recent lines; this is useful to find final errors. The cat of maglev-node-updater-169.254.1.21.log pulls the node-specific log — many node-related timing and exception messages are recorded there. maglev-hook-installer.log contains hook-related installation messages. You will often find messages like "Node update took longer to complete in node 169.254.1.21" and a system-updater error noting the node issue.

Real-world note: Node-specific logs help determine if the issue is node-local (resource exhaustion, reboot) versus application-specific.

Verify:

magctl service logs -r system-updater
| 2004 | 2025-01-30T00:56:41.565Z | ERROR| 57 | ThreadPoolExecutor-4_2 | 140303126214400 | node-updater | node_updater.py:709 | Node update took longer to complete in node 169.254.1.21 |
| 2005 | 2025-01-30T00:56:41.589Z | ERROR| 57 | MainThread | 140304732464960 | system-updater | system_update_orchestrator.py:452 | Status: 1/Node update took longer to complete in 169.254.1.21
cat log/maglev-node-updater-169.254.1.21.log
2025-01-30T00:56:40.000Z INFO Node update started for node 169.254.1.21
2025-01-30T00:56:41.565Z ERROR Node update took longer to complete in node 169.254.1.21
cat log/maglev-hook-installer.log
2025-01-30 00:55:00 INFO Hook installer started
2025-01-30 00:56:30 ERROR Hook install timeout for node 169.254.1.21
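Because the same node IP appears across system-updater, node-updater, and hook-installer logs, a reusable filter helps with correlation. A sketch (the grep patterns assume the log formats shown above; the demonstration lines are copied from the node-updater sample):

```shell
#!/bin/sh
# Pull ERROR lines mentioning a given node IP out of whatever log text is
# piped in; reusable across system-updater, node-updater, and hook logs.
node_errors() {
  grep 'ERROR' | grep -F "$1"
}

# On a real appliance, for example:
# magctl service logs -r system-updater | node_errors 169.254.1.21
# cat log/maglev-hook-installer.log     | node_errors 169.254.1.21

# Offline demonstration with two lines from the verify output:
printf '%s\n%s\n' \
  '2025-01-30T00:56:40.000Z INFO Node update started for node 169.254.1.21' \
  '2025-01-30T00:56:41.565Z ERROR Node update took longer to complete in node 169.254.1.21' |
  node_errors 169.254.1.21
```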

Step 4: Inspect failing application service logs

What we are doing: Pull the logs for the failing application (apic-em-pki-broker-service) to identify application-side errors like connection failures, missing files, or configuration errors.

magctl service logs -r apic-em-pki-broker-service
magctl service status apic-em-pki-broker-service
magctl service display apic-em-pki-broker-service

What just happened: magctl service logs -r dumps the raw application logs; these often contain stack traces (e.g., "There was an exception while connecting to the URL http://localhost:16029/pki-is-ejbca-ready") or configuration errors (e.g., "DiskConfig file:/media/floppy/config.json not found."). magctl service status shows concise status for the service, and magctl service display shows stateful info. Together they help determine whether the problem is internal to the service or due to an external dependency.

Real-world note: Application logs are the most direct evidence of why a particular service is not ready. For example, a missing local file or downstream service can cause readiness probes to fail repeatedly.

Verify:

magctl service logs -r apic-em-pki-broker-service
2025-01-24 07:06:29,791 | ERROR| pool-4-thread-1 | | c.c.e.pki.impl.utils.MakeRestCalls | There was an exception while connecting to the URL http://localhost:16029/pki-is-ejbca-ready
2025-01-24 07:06:30,828 | ERROR| main | | c.c.grapevine.api.SecurityManager | DiskConfig file:/media/floppy/config.json not found.
magctl service status apic-em-pki-broker-service
Name: apic-em-pki-broker-service
Status: Degraded
Ready: 1/2
Restarts: 93
Node: 10.128.249.10
Pod IP: 169.254.41.122
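In a long application log it helps to tally which component is logging errors before reading individual stack traces. A sketch that assumes the pipe-delimited format shown above (logger class in field 5); the demonstration lines are the two from the verify output:

```shell
#!/bin/sh
# Tally ERROR lines per logging component (field 5 of the pipe-delimited
# format above) to see which class is failing most often.
error_components() {
  awk -F'|' '/ERROR/ { gsub(/^ +| +$/, "", $5); print $5 }' | sort | uniq -c | sort -rn
}

# On a real appliance (skipped silently where magctl is absent):
command -v magctl >/dev/null && \
  magctl service logs -r apic-em-pki-broker-service | error_components

# Offline demonstration with the two sample lines from the verify output:
printf '%s\n%s\n' \
  '2025-01-24 07:06:29,791 | ERROR| pool-4-thread-1 | | c.c.e.pki.impl.utils.MakeRestCalls | There was an exception while connecting to the URL http://localhost:16029/pki-is-ejbca-ready' \
  '2025-01-24 07:06:30,828 | ERROR| main | | c.c.grapevine.api.SecurityManager | DiskConfig file:/media/floppy/config.json not found.' |
  error_components
```

A component that dominates the tally (for example the REST-call helper) usually points at the dependency blocking the readiness probe.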

Step 5: Controlled service restart (soft and hard)

What we are doing: Attempt a soft restart first to let the container gracefully restart; if that fails, perform a hard restart (pod deletion) as an escalation. Understand the difference: hard restart recreates the pod and can result in loss of non-persistent in-container data.

magctl service restart apic-em-pki-broker-service
# If soft restart does not resolve, escalate:
magctl service restart -d apic-em-pki-broker-service

What just happened: The first command asks the orchestration layer to restart the service container(s) in-place — useful to recover transient faults. The -d flag performs a pod deletion and recreation (hard restart) — this forces a fresh pod lifecycle and can clear corrupted in-memory state but will drop ephemeral container data. Use magctl appstack status afterward to confirm readiness.

Warning: Use hard restart only when you are certain ephemeral state is not required (or you have backups). Hard restart is equivalent to deleting the pod and letting the controller re-create it.

Verify:

magctl service restart apic-em-pki-broker-service
# Expected result: command returns success and you then check appstack
magctl appstack status
NAMESPACE           NAME                                                  READY  STATUS    RESTARTS   AGE    IP             NODE         NOMINATED NODE   READINESS GATES
fusion              apic-em-pki-broker-service-595ddd545b-txqnt           2/2    Running   94         6d4h  169.254.41.122  10.128.249.10 <none>            <none>

If soft restart did not resolve:

magctl service restart -d apic-em-pki-broker-service
magctl appstack status
NAMESPACE           NAME                                                  READY  STATUS    RESTARTS   AGE    IP             NODE         NOMINATED NODE   READINESS GATES
fusion              apic-em-pki-broker-service-595ddd545b-xxxxx           2/2    Running   0          0m    169.254.41.130  10.128.249.10 <none>            <none>

(Expect the pod name and IP to change after hard restart; above shows the logical outcome of recreation.)
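Rather than re-running appstack status by hand after a restart, the readiness check can be scripted. A sketch: is_ready is pure text parsing of a status row (READY in column 3), while wait_for_ready is only usable on a real appliance and assumes a hypothetical 10-second poll interval:

```shell
#!/bin/sh
# Return 0 when a status row reports full readiness (READY column "n/n").
is_ready() {
  echo "$1" | awk '{ split($3, r, "/"); exit (NF >= 3 && r[1] == r[2] ? 0 : 1) }'
}

# Poll appstack status until the named pod is fully ready or a timeout
# lapses. Sketch for a real appliance; requires magctl on the PATH.
wait_for_ready() {  # usage: wait_for_ready <pod-name-prefix> [timeout-seconds]
  deadline=$(( $(date +%s) + ${2:-300} ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    row=$(magctl appstack status 2>/dev/null | grep "$1" | head -n 1)
    [ -n "$row" ] && is_ready "$row" && return 0
    sleep 10
  done
  return 1
}

# Offline demonstration with a partial row from the verify output:
is_ready 'fusion apic-em-pki-broker-service-595ddd545b-txqnt 1/2 Running 93' \
  && echo ready || echo 'not ready'
```

A soft restart followed by wait_for_ready gives you an objective signal for whether escalation to a hard restart is warranted.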

Step 6: Live monitor upgrade progress

What we are doing: Follow the system-updater logs in real time during an upgrade or while a restart is happening to confirm progression or to catch recurrent errors as they occur.

magctl service logs -rf system-updater

What just happened: The -rf flag streams the system-updater log live to your console, equivalent to tail -f. This is vital during an upgrade window when you need real-time feedback on operations (node updates, hook executions, timeouts).

Real-world note: Use live tailing during maintenance windows to immediately respond to timeouts (e.g., maximum wait time exceeded) and to capture transient errors that do not persist in static logs.

Verify: Run the command and observe live entries similar to the earlier error lines. You should see continuous output; when the process completes you will see state transitions similar to the update_info output.
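A live stream is noisy during an upgrade, so it can help to surface only actionable lines. A sketch; the filter pattern is an assumption based on the log samples shown earlier in this lesson, and the first demonstration line is invented filler:

```shell
#!/bin/sh
# Keep only errors and updater state transitions from a log stream.
# Pattern assumed from the sample log lines earlier in this lesson.
actionable() {
  grep -E 'ERROR|State :'
}

# During a maintenance window (real appliance):
# magctl service logs -rf system-updater | actionable

# Offline demonstration (first line is illustrative filler):
printf '%s\n%s\n' \
  '| 2003 | 2025-01-30T00:56:40.100Z | INFO | node-updater | heartbeat |' \
  '| 2004 | 2025-01-30T00:56:41.565Z | ERROR| 57 | ThreadPoolExecutor-4_2 | 140303126214400 | node-updater | node_updater.py:709 | Node update took longer to complete in node 169.254.1.21 |' |
  actionable
```

When filtering a live stream with GNU grep, --line-buffered avoids output lagging behind the events it reports.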

Verification Checklist

  • Check 1: Appstack shows pods READY for the affected application. Verify with magctl appstack status.
  • Check 2: System updater completed or indicates actionable error. Verify with maglev system_updater update_info.
  • Check 3: Node-specific updater logs do not show repeated timeouts for node 169.254.1.21. Verify by cat log/maglev-node-updater-169.254.1.21.log.
  • Check 4: Application logs no longer show connection or missing-file errors. Verify with magctl service logs -r apic-em-pki-broker-service.

Common Mistakes

Symptom: App shows READY 1/2 repeatedly
Cause:   Readiness probe failing due to a missing local file or dependency
Fix:     Inspect magctl service logs -r <service> and fix the missing config or dependency; soft restart if transient

Symptom: Upgrade stalls with UPGRADE_ERROR
Cause:   Maximum wait time exceeded for specific service(s)
Fix:     Check maglev package status and magctl appstack status to identify stuck services; inspect logs and perform a controlled restart if safe

Symptom: Node update shows "took longer to complete" for 169.254.1.21
Cause:   Node-specific hook or resource issue during the upgrade
Fix:     Inspect log/maglev-node-updater-169.254.1.21.log to determine the root cause; remediation may require node-level recovery

Symptom: Hard restart used instead of soft restart
Cause:   Operator escalated to pod deletion without confirming persistence needs
Fix:     Avoid -d unless confirmed; a soft restart is less disruptive. If used, ensure any ephemeral data was not critical

Key Takeaways

  • Always start troubleshooting with magctl appstack status to identify the failing application/pod and the node involved; this localizes your efforts quickly.
  • Use maglev package status and maglev system_updater update_info to correlate application failures with upgrade activity; many issues are upgrade-timing related.
  • Logs are your primary source of truth: magctl service logs -r for raw dumps, | tail -n N for recent lines, and magctl service logs -rf for live monitoring. Node-specific files such as log/maglev-node-updater-169.254.1.21.log contain valuable per-node details.
  • Restart carefully: prefer soft restarts, and only use hard restart (-d) when you understand the consequences to non-persistent in-container data.

Tip: In production, combine these CLI checks with the Catalyst Center UI at lab.nhprep.com (read-only dashboards) and an out-of-band access method to the node before performing destructive actions.

This completes Lesson 7: Catalyst Center Troubleshooting. You should now be able to identify failing appstacks, collect the correct logs, interpret upgrade-related errors, and perform safe recovery actions.