Lesson 3 of 7

Device Discovery and Inventory

Objective

Discover network devices using Catalyst Center, validate inventory and credentials, and manage software-image readiness and telemetry configuration. In this lesson you will perform discovery checks, view distribution/activation status for images, and force a telemetry configuration push for managed devices. In production networks this matters because centralized inventory and telemetry keep devices visible, healthy, and ready for automated upgrades and analytics; without an accurate inventory and working telemetry you cannot safely scale image distribution or AI analytics.

Quick Recap

Refer to the topology established in Lesson 1. This lesson does NOT add new routers or switches. We will interact with the Catalyst Center appliance and its application stack, and the Assurance/AI Analytics components already established.

ASCII Topology (only the Catalyst Center application endpoint is shown; this is the exact IP available in the reference material):

[Management LAN]
      |
      | eth0:169.254.43.143
      |
+--------------------------------------------+
| Catalyst Center (apiproxy)                 |
|  - service: ai-network-analytics/apiproxy  |
|  - hostname: lab.nhprep.com                |
+--------------------------------------------+

Tip: The apiproxy IP 169.254.43.143 is the API-facing address shown by the platform CLI. In many deployments the API proxy lives on the management plane and is queried by UI/API and internal services.

Key Concepts (theory + practical)

  • Device Discovery & Inventory

    • Theory: Inventory is a canonical list of managed devices and their attributes (model, firmware, credentials, health). Discovery populates this list using protocols like SNMP/SSH/HTTPS (Catalyst Center uses collectors/telemetry to ingest data).
    • Practical: Before pushing images or telemetry configs, make sure each device is "Managed & Reachable" in Inventory; otherwise pushes will fail or timeout.
  • Image Distribution vs Activation

    • Theory: Image provisioning is two phases — distribution (copying the image to the device) and activation (setting boot variables/reloading or switching packages). Failures can occur in either phase.
    • Practical: Distribution can fail for lack of flash space; activation can fail because of misconfiguration (wrong boot-config, unsupported platform). Always check detailed provisioning logs.
  • Telemetry Configuration Push

    • Theory: Telemetry settings are configuration payloads sent to devices so they export operational data to the controllers/collectors. A forced configuration push re-applies these settings when you suspect drift.
    • Practical: Catalyst Center provides a GUI action "Update Telemetry Settings → Force Configuration Push" to reapply telemetry. Use this when telemetry shows "not connected" or the device lacks telemetry metrics.
  • Collectors, Pipelines, Kafka (data path)

    • Theory: Telemetry flows from devices → collectors → Kafka (broker) → analytics pipelines (Flink) → UI. Any lag or failure at collectors/Kafka/pipelines interrupts telemetry.
    • Practical: Use the Collectors and Pipelines views to check uptime, topics, exceptions and clear Kafka lag if needed.
  • Cloud Connectivity for AI Analytics

    • Theory: Many AI features require outbound HTTPS (TCP 443) connectivity to cloud hosts for model learning and analytics.
    • Practical: If AI Analytics features fail, check that the platform can reach cloud hosts over TCP 443. This is the most common root cause in TAC reports.
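That last connectivity check is easy to script. The sketch below is an illustrative stdlib helper, not a Catalyst Center tool; the cloud hostname in the example is a placeholder you would replace with the AI Analytics endpoint documented for your region.

```python
import socket

def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refusals, and timeouts
        return False

if __name__ == "__main__":
    # "api.example-cloud.com" is a placeholder, not a real Cisco endpoint.
    for host in ["api.example-cloud.com"]:
        status = "reachable" if can_reach(host) else "NOT reachable"
        print(f"{host}:443 {status}")
```

Running this from the appliance (or a host in the same egress path) quickly confirms or rules out the TCP 443 root cause before deeper debugging.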

Step-by-step configuration

Each step below includes the action, exact commands or UI actions (as presented in the reference material), explanation of what happened and why it matters, plus a verification action with expected output.

Step 1: Verify Catalyst Center application stack (AI Analytics) status

What we are doing: Use the platform CLI to confirm the AI Analytics application components (apiproxy) are running. This validates that the platform services that handle device inventory and analytics are up.

# On the Catalyst Center management shell, run:
magctl appstack status

What just happened:
This command queries the platform orchestrator for the application stack status. It lists namespaces, component names, readiness, and the IP addresses of service endpoints (for example the apiproxy). If apiproxy is not READY, the UI/API will not respond reliably to inventory or telemetry operations.

Real-world note: Operators often check this after upgrades or service restarts; a non-running apiproxy will block inventory sync and API-driven automation.

Verify:

# Expected complete output (example from a live system)
NAMESPACE                    NAME                       READY   STATUS    RESTARTS   AGE   IP               NODE               NOMINATED NODE   READINESS GATES
ai-network-analytics         apiproxy-85998b7d5d-gqgpq  1/1     Running   1          38d  169.254.43.143   <node-name>        <none>           <none>
  • The important fields: READY 1/1, STATUS Running, and the IP 169.254.43.143 for apiproxy. If READY != 1/1 or STATUS is not Running, stop and troubleshoot the platform before proceeding.
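If you want to gate automation on this check, the status line can be parsed programmatically. The sketch below is a hypothetical wrapper: it assumes the kubectl-style columns shown above and that `magctl` is on the PATH (i.e., it runs on the appliance shell).

```python
import subprocess

def apiproxy_ready(status_output: str) -> bool:
    """Return True if the apiproxy line shows READY 1/1 and STATUS Running."""
    for line in status_output.splitlines():
        fields = line.split()
        # Columns assumed: NAMESPACE NAME READY STATUS ...
        if len(fields) >= 4 and fields[1].startswith("apiproxy"):
            return fields[2] == "1/1" and fields[3] == "Running"
    return False

if __name__ == "__main__":
    try:
        # Only works on the Catalyst Center shell where magctl exists.
        out = subprocess.run(["magctl", "appstack", "status"],
                             capture_output=True, text=True).stdout
    except FileNotFoundError:
        out = ""  # not on the appliance; nothing to parse
    print("apiproxy ready" if apiproxy_ready(out) else "apiproxy NOT ready")
```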

Step 2: Confirm the device is "Managed & Reachable" and re-run readiness checks

What we are doing: From the Catalyst Center Inventory UI, verify the target device is marked Managed & Reachable. If a device shows "Needs Update", re-run the readiness check and examine details to discover distribution/activation or configuration problems.

# GUI steps (Inventory → select device)
# 1) Click the device to open its Inventory page
# 2) If the device status shows "Needs Update", click "Needs Update"
# 3) Click "See Details" to view image provisioning and readiness information

What just happened:
You inspected the inventory status for the device and invoked the readiness check UI. The "See Details" pane provides enhanced visibility into distribution and activation steps, revealing whether an image failed to copy (distribution) or failed to take effect (activation).

Real-world note: In production, image push fails most often due to insufficient flash or incorrect activation commands; use the details view to pinpoint which phase failed.

Verify:

# Expected GUI observations (paraphrased; view via "See Details")
# - Provisioning stages listed: "Distribution", "Activation", "Inventory Update"
# - If distribution failed: message indicates insufficient flash space on the device
# - If activation failed: message indicates configuration or activation mismatch preventing activation
  • Verify the device shows "Managed" and "Reachable". If it remains unreachable, confirm network reachability (management VLAN, SNMP/SSH credentials, or firewall rules).
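The same pre-check can be done over the Catalyst Center Intent API: GET /dna/intent/api/v1/network-device returns, among other attributes, collectionStatus and reachabilityStatus per device. The sketch below shows only the filtering logic against a sample payload; authentication and the HTTPS call itself are left out, and the sample hostnames are made up.

```python
def unready_devices(devices: list) -> list:
    """Return devices that are not both Managed and Reachable.

    Each dict mirrors the shape of items in the "response" list from
    GET /dna/intent/api/v1/network-device.
    """
    return [
        d for d in devices
        if d.get("collectionStatus") != "Managed"
        or d.get("reachabilityStatus") != "Reachable"
    ]

if __name__ == "__main__":
    # Sample payload; in practice you would fetch this with an
    # authenticated GET against your Catalyst Center (lab.nhprep.com here).
    sample = [
        {"hostname": "edge-1", "collectionStatus": "Managed",
         "reachabilityStatus": "Reachable"},
        {"hostname": "edge-2", "collectionStatus": "Partial Collection Failure",
         "reachabilityStatus": "Unreachable"},
    ]
    for d in unready_devices(sample):
        print(f"{d['hostname']}: {d['collectionStatus']} / {d['reachabilityStatus']}")
```

Anything this filter returns should be fixed in Inventory before attempting an image or telemetry push.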

Step 3: Push telemetry configuration to the device (Force Configuration Push)

What we are doing: From Inventory, select one or more devices and use the "Update Telemetry Settings" action, then choose "Force Configuration Push" to reapply telemetry configuration to the device(s). This fixes drifted or missing telemetry settings.

# GUI steps (Inventory → select devices)
# 1) Select device(s) checkbox
# 2) Click "Update Telemetry Settings"
# 3) In the popup, ensure the selected devices are correct and choose "Force Configuration Push"
# 4) Click "Next" to start the push

What just happened:
Catalyst Center queued a configuration push for telemetry settings to the selected devices. Internally, the platform sends the telemetry configuration payload (SNMP/streaming telemetry/collectors details) to the device via the configured management protocol. This re-establishes telemetry data flow to collectors for analytics.

Real-world note: Use force push during maintenance windows when devices have been observed to stop exporting telemetry after a config rollback, or after an upgrade that reset telemetry settings.

Verify:

# Verification via GUI:
# - Open Assurance → Dashboards → Health → Client (or Telemetry Status dashboard)
# - The selected device should move from "Telemetry: Not reporting" to "Telemetry: Reporting" or show recent telemetry timestamp
  • Alternatively, check device logs via Inventory → Device → Logs to see a log entry indicating telemetry config was applied.
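Programmatically, "reporting" can be treated as a freshness test on the last telemetry timestamp. This is an illustrative helper with an assumed 15-minute threshold, not a Catalyst Center API.

```python
from datetime import datetime, timedelta, timezone

def is_reporting(last_push, now=None, max_age=timedelta(minutes=15)):
    """Treat a device as reporting if its last telemetry sample is recent."""
    now = now or datetime.now(timezone.utc)
    return (now - last_push) <= max_age

if __name__ == "__main__":
    # Fixed "now" so the example is deterministic; timestamps are made up.
    now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
    fresh = datetime(2024, 5, 1, 11, 55, tzinfo=timezone.utc)
    stale = datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc)
    print(is_reporting(fresh, now))  # True
    print(is_reporting(stale, now))  # False
```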

Step 4: Inspect Collectors and Pipelines status; clear Kafka lag if necessary

What we are doing: From the platform UI, check Collector status, Pipeline slots and Kafka topic lags to ensure the telemetry ingestion path is healthy. If ingestion is backlogged, clearing Kafka lag or restarting a pipeline may be necessary.

# GUI steps:
# 1) Go to Collect & Ingest → Collectors; click a Collector to view status, uptime, service name
# 2) Go to Collect & Ingest → Pipelines; click a Pipeline to view metrics, connected topics, exceptions
# 3) Go to Kafka view and inspect subscribed pipelines; if lag present, use the "Clear Kafka lag" action (available in the UI)

What just happened:
You validated the telemetry ingestion components. Collectors receive device telemetry and publish it to Kafka topics; pipelines consume those topics for analytics. Clearing a Kafka lag tells the platform to drop or reset the consumer offset for a topic so that processing can continue (use with caution).

Warning: Clearing a Kafka lag can result in data loss for that time window. Always assess the impact on analytics and retention before clearing.

Verify:

# Expected observations in the UI:
# - Collector status: shows "Up" with uptime and service names
# - Pipeline view: lists active slots, metrics, and any exceptions
# - Kafka view: shows subscribed pipelines and provides action to clear lag if present
  • If pipeline slots are all occupied or pipelines report exceptions, escalate to platform/admin operations for deeper debugging.
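The "lag" in the Kafka view is the standard Kafka definition: per partition, the broker's log-end offset minus the consumer group's committed offset. A minimal sketch of that arithmetic, with made-up offsets:

```python
def consumer_lag(end_offsets, committed):
    """Per-partition lag = log-end offset minus committed consumer offset."""
    return {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}

if __name__ == "__main__":
    # Hypothetical telemetry topic with three partitions.
    end = {0: 1500, 1: 980, 2: 2040}
    committed = {0: 1500, 1: 900, 2: 1700}
    lags = consumer_lag(end, committed)
    print(lags)                              # {0: 0, 1: 80, 2: 340}
    print("total lag:", sum(lags.values()))  # total lag: 420
```

Clearing the lag in the UI effectively jumps the committed offset forward, which is why the skipped window of telemetry is lost.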

Step 5: Use Service Logs to investigate provisioning or telemetry failures

What we are doing: When distribution/activation or telemetry pushes fail, open the Device's Logs view (or the Dashboard/Grafana logs) to review service logs filtered by device to find exact error messages.

# GUI steps:
# 1) Inventory → select device → Click "Logs" to view service logs filtered for this device
# 2) Alternatively, open Assurance → Dashboards → Grafana (System Logs) to view log aggregation

What just happened:
You retrieved device-specific service logs that aggregate provisioning, telemetry, and collector events. These logs often contain actionable error messages (e.g., insufficient storage, authentication failure, or API timeouts) that pinpoint remediation steps.

Real-world note: Logs in Grafana/Kibana provide a time-correlated view across services — use timestamps to correlate a failed image activation with collector restarts or network events.

Verify:

# Expected log excerpts (paraphrased examples you might see)
# - "Image distribution failed: device flash capacity insufficient"
# - "Activation failed: boot variable not set or platform mismatch"
# - "Telemetry push applied successfully; last push timestamp: <timestamp>"
  • Use the timestamps and correlated entries to create an incident timeline for troubleshooting.
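Building the incident timeline can be automated by sorting correlated log lines on their timestamps. The log format here is a hypothetical example; adapt the parsing to whatever your Grafana/Kibana export actually emits.

```python
from datetime import datetime

def build_timeline(lines):
    """Parse 'ISO-timestamp message' lines and return them time-ordered."""
    events = []
    for line in lines:
        ts_text, _, message = line.partition(" ")
        events.append((datetime.fromisoformat(ts_text), message))
    return sorted(events)

if __name__ == "__main__":
    # Paraphrased example entries, deliberately out of order.
    raw = [
        "2024-05-01T10:07:12 Activation failed: boot variable not set",
        "2024-05-01T10:01:03 Image distribution started",
        "2024-05-01T10:05:44 Image distribution completed",
    ]
    for ts, msg in build_timeline(raw):
        print(ts.isoformat(), msg)
```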

Verification Checklist

  • Check 1: Confirm apiproxy is Running and READY 1/1 with IP 169.254.43.143 using magctl appstack status.

    • How: Run magctl appstack status and verify the apiproxy line shows READY 1/1 and STATUS Running.
  • Check 2: Device is marked "Managed & Reachable" in Inventory and readiness checks pass.

    • How: Inventory → select device → ensure no "Needs Update" flag; if present, run "See Details" and resolve any distribution/activation issues.
  • Check 3: Telemetry configuration push completed and the Telemetry Status dashboard shows recent telemetry for the device.

    • How: Inventory → Update Telemetry Settings → Force Configuration Push; then Assurance → Telemetry Status to confirm reporting.
  • Check 4: Collectors and Pipelines show healthy status and no Kafka lag blocking pipelines.

    • How: Collect & Ingest → Collectors and Pipelines → Kafka view; clear lag only if you understand implications.

Common Mistakes

  • Symptom: Device remains "Unreachable" in Inventory
    • Cause: Management network/firewall blocking management traffic, or wrong credentials
    • Fix: Verify management IP connectivity, confirm SNMP/SSH/HTTPS credentials, and re-run the readiness check
  • Symptom: Image distribution stalls or fails
    • Cause: Insufficient flash space on the device
    • Fix: Clean up device flash or choose a smaller image; re-run distribution after freeing space
  • Symptom: Activation fails after successful distribution
    • Cause: Boot variables or activation commands misconfigured
    • Fix: Review the activation configuration in "See Details"; correct boot variables and reattempt activation
  • Symptom: Telemetry shows no data despite push success
    • Cause: Collector or Kafka ingestion pipeline lag/exception
    • Fix: Check Collectors and Pipelines status; inspect Kafka topics and clear lag only after understanding the data-loss risk

Key Takeaways

  • Inventory accuracy and device reachability are prerequisites for safe image distribution and telemetry ingestion — always confirm the device is "Managed & Reachable" before pushing images or telemetry.
  • Distribution and activation are distinct phases; failure in either requires different remediation (free flash vs. activation config).
  • Force-pushing telemetry settings is a common remediation for telemetry drift but verify collector/pipeline health afterward.
  • The primary systemic points to check when analytics fail are: platform app stack (apiproxy), collector uptime, Kafka topic lags, and outbound HTTPS connectivity for AI cloud features.

Final operational tip: Keep a reproducible checklist for every automated action (image push, telemetry push) — include a pre-check (inventory + platform health), the change window, and post-checks (logs + telemetry timestamps). This minimizes surprises in production.


End of Lesson 3 of 7 — Device Discovery and Inventory.