Lesson 3 of 5

Troubleshooting Provisioning

Objective

In this lesson you'll learn how to debug and resolve provisioning failures in Catalyst Center Software Image Management (SWIM) workflows, focusing on template/CLI push errors, file distribution and activation failures, and automatic rollback behavior. Troubleshooting provisioning is critical in production because image or configuration pushes that fail silently can leave network devices unreachable or running unsupported software. In a real environment, this workflow is used when rolling out OS upgrades or configuration templates to many access switches and wireless controllers across branches.

Topology

Reference the topology from Lesson 1. This lesson does not add new devices or interfaces; we assume the Catalyst Center (management plane) can reach managed devices over their management network. All Catalyst Center service hostnames are under the lab.nhprep.com domain.

Important: this lesson focuses on the provisioning control plane (Catalyst Center services + device-side install commands). It intentionally does not change data-plane forwarding.

Device Table

Device | Role | Management FQDN
Catalyst Center | Provisioning / SWIM Server | swim.lab.nhprep.com
Managed Switch | Target device for image/template push | switch1.lab.nhprep.com
Managed WLC (example) | Wireless controller | wlc9800.lab.nhprep.com

Key Concepts — Theory before CLI

  • SWIM Provisioning Workflow: Provisioning begins with creating a request (via the provisioning-service REST API), which becomes a task (task-service) and is executed by the orchestration-engine. The orchestration engine invokes supporting microservices (spf-service, spf-service-manager, network-validation service, network-programmer) to validate, generate, and push device configuration. Think of the orchestration-engine as the conductor and the supporting services as specialist musicians.

    Packet/flow note: the orchestration process is API-driven. Device interaction typically happens over management-plane protocols (HTTPS/SCP/SFTP), so network reachability and protocol access are prerequisites.

  • Distribution vs Activation: An image update has two distinct phases: distribute (copy the image into device flash and stage it) and activate (run the commands that switch boot variables or install and commit the image). On some platforms part of the activation work is performed during distribution, so check the platform-specific behavior before deciding which phase a failure belongs to.

    Real-world analogy: Distribution is like shipping the parts to the factory; activation is assembling them and turning the machine on.

  • Common Failure Modes:

    • Insufficient flash space prevents distribution (copy stage), causing a distribution failure.
    • Misconfigured activation commands or template syntax causes activation failure.
    • Template/CLI push errors occur when generated device config doesn’t match device state (e.g., wrong template variables), or the network-programmer cannot apply the CLI.
    • Telemetry or connectivity issues may make a device appear unmanaged or unreachable, which prevents provisioning from starting.
  • Rollback Behavior: When activation fails, robust workflows issue rollbacks to return the device to its last known good state. If rollback fails, devices must be remediated manually to avoid service disruption.

  • File Transfer Protocols and Pre-checks: SWIM supports HTTPS, SCP, and SFTP for image transfer, including WLC images. Pre-checks include flash memory availability, startup-config checks, and service entitlement validation.
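
The staged flow described in the concepts above can be sketched as a small driver that stops at the first failing stage and attempts a rollback only when activation fails. This is a minimal illustration, not Catalyst Center's implementation; the function name, the stage callables, and the returned labels are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical stage callables: each returns True on success.
Stage = Callable[[], bool]

def run_provisioning(readiness: Stage, distribute: Stage,
                     activate: Stage, rollback: Stage) -> Dict[str, str]:
    """Run the stages in order; roll back only if activation fails."""
    if not readiness():
        return {"failed_stage": "readiness", "action": "fix pre-checks"}
    if not distribute():
        return {"failed_stage": "distribution", "action": "check flash/transfer"}
    if not activate():
        # Activation failed: try to return to the last known good image.
        if rollback():
            return {"failed_stage": "activation", "action": "rolled back"}
        return {"failed_stage": "activation", "action": "manual remediation"}
    return {"failed_stage": "none", "action": "done"}

# Example: activation fails and the rollback succeeds.
result = run_provisioning(lambda: True, lambda: True, lambda: False, lambda: True)
print(result)
```

The ordering mirrors the troubleshooting order used in this lesson: a later stage is never examined until the earlier one has passed.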

Step-by-step configuration

Each step below contains the exact command(s), why they matter, and verification with expected output.

Step 1: Re-run Readiness / Pre-check via provisioning-service

What we are doing: Trigger the provisioning-service to re-run the readiness checks for a device so we can surface distribution/activation pre-check failures (flash space, config register, entitlement, protocol reachability). This is typically the first troubleshooting action to capture deterministic pre-check results.

curl -X POST https://provisioning-service.lab.nhprep.com/api/v1/devices/switch1.lab.nhprep.com/readiness \
  -H "Content-Type: application/json" \
  -u "admin:Lab@123" \
  -k

What just happened: This POST asks the provisioning-service to re-evaluate pre-checks on switch1. The service will invoke network-validation and spf-service checks (flash, startup-config validation, file transfer protocol reachability) and return a JSON status showing pass/fail details.

Real-world note: In production, re-running readiness helps you avoid wasteful distribution attempts and surfaces issues like insufficient flash or missing entitlements before an image copy.

Verify:

curl -X GET https://provisioning-service.lab.nhprep.com/api/v1/devices/switch1.lab.nhprep.com/readiness/status \
  -H "Accept: application/json" \
  -u "admin:Lab@123" \
  -k

{
  "device": "switch1.lab.nhprep.com",
  "readiness": "FAILED",
  "checks": {
    "flash_space": {
      "status": "FAILED",
      "required_mb": 512,
      "available_mb": 120,
      "message": "Insufficient flash space"
    },
    "startup_config": {
      "status": "PASSED"
    },
    "file_transfer_protocol": {
      "status": "PASSED",
      "protocols": ["HTTPS","SCP"]
    }
  }
}
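
When many devices are checked, it can help to reduce each readiness response to just the failing checks. The sketch below assumes the JSON shape shown in the sample response above (the field names come from that sample, not from a documented schema); the helper name failed_checks is hypothetical.

```python
import json

# Sample response copied from the readiness status output above.
RESPONSE = """
{
  "device": "switch1.lab.nhprep.com",
  "readiness": "FAILED",
  "checks": {
    "flash_space": {"status": "FAILED", "required_mb": 512,
                    "available_mb": 120, "message": "Insufficient flash space"},
    "startup_config": {"status": "PASSED"},
    "file_transfer_protocol": {"status": "PASSED", "protocols": ["HTTPS", "SCP"]}
  }
}
"""

def failed_checks(payload: dict) -> dict:
    """Return {check_name: detail} for every check whose status is not PASSED."""
    failures = {}
    for name, check in payload.get("checks", {}).items():
        if check.get("status") != "PASSED":
            detail = check.get("message", "no message")
            # For flash checks, report the shortfall in MB when both fields exist.
            if "required_mb" in check and "available_mb" in check:
                shortfall = check["required_mb"] - check["available_mb"]
                detail += f" (short by {shortfall} MB)"
            failures[name] = detail
    return failures

report = failed_checks(json.loads(RESPONSE))
print(report)
```

For the sample above, the only failing check is flash_space, short by 512 - 120 = 392 MB.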

Step 2: Inspect Distribution Details (See Details) and Logs

What we are doing: Retrieve the distribution/activation detail log for the device to identify the exact failure step and error message. This surfaces whether the issue is a file copy failure, an activation CLI error, or a template generation problem.

curl -X GET https://provisioning-service.lab.nhprep.com/api/v1/devices/switch1.lab.nhprep.com/distribution/details \
  -H "Accept: application/json" \
  -u "admin:Lab@123" \
  -k

What just happened: The server returned granular distribution/activation steps and the failure messages. Typical returned fields include step name (COPY, ACTIVATE), status, and error logs produced by network-programmer or orchestration-engine.

Real-world note: The "See Details" view in the management UI performs the same API call; reviewing logs saves manual SSHing to devices for initial triage.

Verify (re-run the same GET and confirm which step failed):

curl -X GET https://provisioning-service.lab.nhprep.com/api/v1/devices/switch1.lab.nhprep.com/distribution/details \
  -H "Accept: application/json" \
  -u "admin:Lab@123" \
  -k

{
  "device": "switch1.lab.nhprep.com",
  "distribution": {
    "step": "COPY",
    "status": "FAILED",
    "error": "Write failed: insufficient flash space"
  },
  "activation": {
    "step": "ACTIVATE",
    "status": "NOT_RUN"
  }
}
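
A short helper can turn this response into a single answer to "which phase failed?". This is a sketch that assumes the response shape from the sample above; the function name failing_phase is hypothetical.

```python
def failing_phase(details: dict) -> str:
    """Return 'distribution', 'activation', or 'none' for a details response."""
    # Phases are checked in execution order: activation does not run
    # until distribution has succeeded.
    for phase in ("distribution", "activation"):
        if details.get(phase, {}).get("status") == "FAILED":
            return phase
    return "none"

# Sample copied from the response above.
sample = {
    "device": "switch1.lab.nhprep.com",
    "distribution": {"step": "COPY", "status": "FAILED",
                     "error": "Write failed: insufficient flash space"},
    "activation": {"step": "ACTIVATE", "status": "NOT_RUN"},
}
phase = failing_phase(sample)
print(phase)
```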

Step 3: Free Flash Space on the Device (device CLI)

What we are doing: Manually free flash space on the target device so the distribution copy can succeed. The install workflow requires sufficient available flash for the image.

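enable
dir flash:
delete /force flash:old-image.bin

What just happened: dir flash: lists the files so you can identify old or unnecessary images, and delete removes the named file. delete is a privileged EXEC command, so it is run outside configuration mode, and the file is removed immediately; no configuration save is needed. On install-mode platforms, install remove inactive can also clean up unused packages. This operation is necessary because distribution cannot proceed when required flash is unavailable.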

Real-world note: In production, maintain an image rotation policy to avoid hitting flash limits during scheduled upgrades.

Verify:

show file systems

Expected output (sample; values are illustrative):

Filesystem            Size(b)      Free(b)      Type  Flags
flash:                1610612736   805306368    flash  rw
bootflash:            2147483648   1073741824   flash  rw

(Free(b) on the filesystem that receives the image must cover the pre-check requirement from Step 1. 512 MB is 536870912 bytes, so both rows above pass.)
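
Because the device reports free space in bytes while the pre-check states its requirement in MB, a quick conversion avoids misreading the output. The sketch below assumes the pre-check's "MB" means MiB (1024 × 1024 bytes); the helper names are hypothetical.

```python
MIB = 1024 * 1024  # bytes per MiB; assumed meaning of "MB" in the pre-check

def free_mb(free_bytes: int) -> float:
    """Convert a Free(b) value from the device into MB."""
    return free_bytes / MIB

def has_room(free_bytes: int, required_mb: int) -> bool:
    """True when the free space meets or exceeds the pre-check requirement."""
    return free_mb(free_bytes) >= required_mb

print(free_mb(1073741824))        # 1024.0
print(has_room(1073741824, 512))  # True
print(has_room(125829120, 512))   # False: 120 MB free, as in the Step 1 failure
```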

Step 4: Re-run Distribution (install add / pre-download for WLC)

What we are doing: Trigger the distribution phase again so the image is copied to the device. The management system runs platform-specific commands for this: on IOS XE install-mode platforms it uses install add file, and on Catalyst 9800 WLCs (eWLC) the added image can then be pre-downloaded to joined APs with ap image predownload.

install add file flash:cat9k_iosxe.bin
! For Catalyst 9800 WLCs, after install add of the controller image:
ap image predownload

What just happened: On platforms using install-based upgrades, install add file stages the image as an inactive package. On a Catalyst 9800, ap image predownload then pushes the matching AP image to joined access points ahead of activation. These staged images are necessary before activation commits the new software.

Real-world note: Distribution copies are often performed in parallel across many devices; bandwidth and storage considerations matter.

Verify:

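show install summary

Expected output (sample, simplified; the version strings are placeholders):

[ Switch 1 ] Installed Package(s) Information:
State (St): I - Inactive, U - Activated & Uncommitted,
            C - Activated & Committed, D - Deactivated & Uncommitted
Type  St   Filename/Version
IMG   I    <new-version>
IMG   C    <current-version>

For WLC:

show ap image

What to look for (not literal output): pre-download reported as completed for the APs joined to wlc9800.lab.nhprep.com.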

Step 5: Attempt Activation (install activate / install commit) and handle activation errors

What we are doing: Activate the staged image. If activation fails due to CLI/template errors, capture the activation output and trigger rollback where available.

install activate
install commit

What just happened: install activate activates the package staged by install add and reloads the device onto the new image; install commit, run after the device comes back up, makes the change persistent. An activated but uncommitted image is reverted when the auto-abort timer expires. If activation fails because of misconfiguration or incompatible ISSU, the device should emit error messages and may attempt to revert.

Real-world note: Activation errors commonly stem from platform compatibility or incorrect activation sequences. Always review compatibility matrix before activation.

Verify:

show install summary

Expected output (sample, simplified; the version string is a placeholder):

Type  St   Filename/Version
IMG   C    <new-version>

(St C means Activated & Committed. A state of U means the image is active but install commit has not run yet.)

If activation fails, expected error log sample (retrieved from device or provisioning-service logs):

ERROR: Activation failed: CLI push error at line 42: 'no such command: system install commit'
Rollback initiated.
Rollback result: SUCCESS
Device returned to previous active image: cat9k_iosxe_old.bin
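
When triaging many devices, the key question after an activation failure is whether the rollback succeeded or the device needs manual remediation. The sketch below parses the sample log above with regular expressions; it assumes the line formats shown in that sample, which are illustrative rather than a documented log format, and the function name summarize is hypothetical.

```python
import re

# Sample log copied from the activation failure output above.
LOG = """ERROR: Activation failed: CLI push error at line 42: 'no such command: system install commit'
Rollback initiated.
Rollback result: SUCCESS
Device returned to previous active image: cat9k_iosxe_old.bin"""

def summarize(log: str) -> dict:
    """Extract the activation error, rollback result, and restored image."""
    failed = re.search(r"^ERROR: Activation failed: (.+)$", log, re.M)
    rollback = re.search(r"^Rollback result: (\w+)$", log, re.M)
    image = re.search(r"^Device returned to previous active image: (\S+)$", log, re.M)
    rolled_back_ok = bool(rollback) and rollback.group(1) == "SUCCESS"
    return {
        "activation_error": failed.group(1) if failed else None,
        "rollback": rollback.group(1) if rollback else "NOT_ATTEMPTED",
        "restored_image": image.group(1) if image else None,
        # Manual work is needed when activation failed and rollback did not succeed.
        "needs_manual_fix": bool(failed) and not rolled_back_ok,
    }

summary = summarize(LOG)
print(summary["rollback"], summary["restored_image"], summary["needs_manual_fix"])
```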

Step 6: Troubleshoot Template / CLI Push Errors and Reapply Telemetry Configs

What we are doing: If the activation failure indicates a template or CLI push error from the network-programmer, inspect the generated template, correct variables, and re-run network-programmer push. Also force a telemetry config push from Inventory if telemetry configs are missing.

# Re-generate device config via spf-service-manager (conceptual REST call)
curl -X POST https://spf-service-manager.lab.nhprep.com/api/v1/devices/switch1.lab.nhprep.com/generate-config \
  -H "Content-Type: application/json" \
  -d '{"template":"iosxe-image-activate","variables":{"hostname":"switch1","image":"cat9k_iosxe.bin"}}' \
  -u "admin:Lab@123" -k

# Force telemetry config push from Inventory (via provisioning-service)
curl -X POST https://provisioning-service.lab.nhprep.com/api/v1/devices/update-telemetry \
  -H "Content-Type: application/json" \
  -d '{"devices":["switch1.lab.nhprep.com"], "force":"true"}' \
  -u "admin:Lab@123" -k

What just happened: The first call asks the spf-service-manager to render the device-specific CLI from the template and variable set. The second call forces Catalyst Center to re-push telemetry settings to the device — useful when telemetry is missing and blocks orchestration interactions.

Real-world note: Template variable mismatches are a leading cause of CLI push errors; always review rendered CLI before applying at scale.
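
One way to catch variable mismatches before a push is to compare the placeholders in a template against the variables you intend to supply. The sketch below uses Python's string.Template as a stand-in for the real template engine, so the $name placeholder syntax, the template text, and the helper name missing_variables are assumptions for illustration only.

```python
from string import Template

def missing_variables(template_text: str, variables: dict) -> list:
    """Return placeholder names in template_text that have no value supplied."""
    names = []
    # Template.pattern is the compiled regex string.Template uses to find
    # $name and ${name} placeholders.
    for match in Template.pattern.finditer(template_text):
        name = match.group("named") or match.group("braced")
        if name and name not in names:
            names.append(name)
    return [n for n in names if n not in variables]

# Hypothetical template text; the variable set matches the request in this step.
TEMPLATE = "hostname $hostname\nip default-gateway ${mgmt_gateway}\n"
missing = missing_variables(TEMPLATE, {"hostname": "switch1", "image": "cat9k_iosxe.bin"})
print(missing)
```

Running a check like this across all target devices before a bulk push surfaces missing variables without touching any device.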

Verify:

# Check task status that performed the config push
curl -X GET https://task-service.lab.nhprep.com/api/v1/tasks/12345/status \
  -H "Accept: application/json" \
  -u "admin:Lab@123" \
  -k

{
  "taskId": "12345",
  "status": "COMPLETED",
  "result": {
    "device": "switch1.lab.nhprep.com",
    "operation": "telemetry_push",
    "status": "SUCCESS"
  }
}

If the CLI push failed, expected result:

{
  "taskId": "12346",
  "status": "FAILED",
  "result": {
    "device": "switch1.lab.nhprep.com",
    "operation": "config_push",
    "error": "Template rendering error: missing variable 'mgmt_gateway' at line 7"
  }
}
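
Tasks run asynchronously, so a script usually has to poll the task status until it reaches a terminal state. The sketch below shows that loop with the HTTP call abstracted behind a callable, so it runs without a lab. COMPLETED and FAILED come from the samples above; the RUNNING and TIMED_OUT labels and the function name wait_for_task are assumptions.

```python
import time
from typing import Callable

TERMINAL = {"COMPLETED", "FAILED"}  # terminal statuses seen in the samples above

def wait_for_task(fetch: Callable[[], dict], max_polls: int = 5,
                  delay_s: float = 0.0) -> dict:
    """Call fetch() until the task reaches a terminal status or polls run out."""
    last: dict = {}
    for _ in range(max_polls):
        last = fetch()
        if last.get("status") in TERMINAL:
            return last
        time.sleep(delay_s)
    # Give up without guessing the outcome; the caller decides what to do next.
    return {**last, "status": "TIMED_OUT"}

# Fake fetcher standing in for the GET on the task status endpoint.
responses = iter([
    {"taskId": "12346", "status": "RUNNING"},
    {"taskId": "12346", "status": "FAILED",
     "result": {"error": "Template rendering error: missing variable 'mgmt_gateway' at line 7"}},
])
final = wait_for_task(lambda: next(responses))
print(final["status"], "-", final["result"]["error"])
```

In a real script, fetch would wrap the task status request and delay_s would be set to a few seconds.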

Verification Checklist

  • Check 1: Device readiness must PASS — use provisioning-service readiness status to confirm (see Step 1). Expected: readiness == PASSED.
  • Check 2: Image must be staged on the device; verify with the show commands in Step 4. Expected: image listed as added on the switch, or pre-download completed on the WLC.
  • Check 3: The new image must be active and committed; verify with the install summary command in Step 5. Expected: the new image is the running image and its state is committed.
  • Check 4: Template/CLI pushes must complete without errors — verify task-service for config_push task status. Expected: status == COMPLETED and result status == SUCCESS.
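
The four checks above can be combined into one pass/fail summary per device. This is a minimal sketch; the function name and parameters are hypothetical, and the values are assumed to have been collected with the commands and calls from Steps 1 to 6.

```python
def checklist(readiness: str, image_staged: bool,
              active: bool, committed: bool, task: dict) -> dict:
    """Map each checklist item to True (pass) or False (fail)."""
    return {
        "check1_readiness": readiness == "PASSED",
        "check2_image_staged": image_staged,
        "check3_activation": active and committed,
        "check4_config_push": task.get("status") == "COMPLETED"
        and task.get("result", {}).get("status") == "SUCCESS",
    }

# Example using the successful task response shown in Step 6.
task = {"taskId": "12345", "status": "COMPLETED",
        "result": {"device": "switch1.lab.nhprep.com",
                   "operation": "telemetry_push", "status": "SUCCESS"}}
results = checklist("PASSED", True, True, True, task)
print(all(results.values()))
```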

Common Mistakes

Symptom | Cause | Fix
Distribution fails with "insufficient flash space" | Device flash contains old images/logs; pre-check required more free space | Delete unused images/files on device; re-run readiness checks; maintain image rotation policy
Activation fails with CLI error | Template variables missing or invalid commands for platform | Inspect rendered template from spf-service-manager; correct variables; validate platform compatibility and re-run config push
Device shows as "Needs Update" and provisioning cannot reach it | Device not managed / unreachable or telemetry configs missing | Ensure device is managed and reachable from Catalyst Center; force telemetry configuration push from Inventory
File transfer repeatedly times out | Firewall/proxy blocking file transfer ports or OCSP/CRL lookups failing SSL validation | Allow transfer protocols (HTTPS/SCP/SFTP) and OCSP/CRL URLs (e.g., ocsp.quovadisglobal.com, crl.quovadisglobal.com, *.identrust.com) or configure proxy with proper certificate checks

Key Takeaways

  • Provisioning is multi-stage: readiness/pre-checks → distribution → activation. Failures can occur at any stage and should be diagnosed in that order.
  • Always run readiness checks before distribution — they catch common issues (flash, config, protocols) and save time.
  • Template rendering errors are common when variables are missing or templates target the wrong platform; always validate rendered CLI output before mass deployment.
  • Catalyst Center provides detailed logs via the "See Details" view (or equivalent REST endpoints). Use these logs and task-service status to pinpoint whether failure happened in distribution, activation, or template generation.
  • In production, have a rollback plan (and test it): many workflows initiate rollback automatically, but be prepared for manual remediation if rollback cannot complete.

Tip: Treat the orchestration logs and task responses as the “single source of truth” when troubleshooting — they tell you whether the failure was a network transfer, a device CLI error, or a template render problem.


End of Lesson 3 — Troubleshooting Provisioning.