Lesson 5 of 5

Change Management Automation

Objective

In this lesson you will build an automated change management workflow using a Serverless Workflow that integrates with NSO via RESTCONF. You will implement a pre-check (inventory & version retrieval), an approval gate, and a post-check (verification). This matters in production because automated pre/post checks and explicit approval gates reduce human error, provide repeatability, and create an auditable trail for changes such as software upgrades or configuration pushes. Real-world scenario: NOC operators use these workflows to perform rolling XR OS upgrades across leaf/spine routers while ensuring compatibility, obtaining manager approval, and verifying success automatically.

Quick Recap

Refer to the topology described in Lesson 1. This lesson uses the management-plane components only — the NSO/CWM control plane and the managed XR devices. No new physical devices are added in this lesson. The NSO/CWM server will be reached at the management domain host:

  • NSO / CWM management host: lab.nhprep.com (credentials: username: admin, password: Lab@123)
  • Managed devices are represented in NSO device inventory (device names: xr1, xr2)

ASCII Topology (management-plane view — IPs shown for management interfaces):

[NSO/CWM Server] lab.nhprep.com (mgmt: 192.0.2.10/24)
        |
        | 192.0.2.11
[Mgmt Switch] (mgmt VLAN)
        |
   +----+-------------------+
   |                        |
xr1 (mgmt: 192.0.2.21)   xr2 (mgmt: 192.0.2.22)

Tip: The diagram uses documentation-range IPs (192.0.2.0/24). In your lab, use the management IPs assigned in NSO device definitions.

Key Concepts

  • Serverless Workflow DSL — A JSON/YAML workflow definition that describes tasks, conditions, and transitions. Think of it as an automation playbook: the workflow orchestrates calls (RESTCONF, adapters) and branches based on results.
    • In production, workflows centralize change logic so operators do not manually call multiple APIs.
  • NSO RESTCONF API & exec-any — NSO exposes managed-device interactions via RESTCONF. The exec any capability in the CLI NED lets NSO execute device CLI commands when needed (for example, retrieving "show version").
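    • When you call RESTCONF to invoke exec-any, NSO opens a CLI session to the device through the NED, runs the command, and returns the command output in the RESTCONF response. That output is typically raw CLI text, so the caller is responsible for parsing it.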
  • Pre-checks / Post-checks — Pre-checks validate inventory state and compatibility (e.g., current software version, free space). Post-checks verify the change (e.g., version after upgrade). These are idempotent queries rather than destructive changes.
    • Practically, pre-checks allow gating: if a device reports an incompatible version or insufficient disk space, the workflow aborts before any change is made.
  • Approval Gate — A human approval step that pauses workflow execution until an authorized user confirms. This provides separation of duties and an auditable approval timestamp.
    • In production, approvals are required for high-impact operations (e.g., cluster-wide upgrades).
  • Adapters & Utility Functions — CWM can call language-specific adapters (Go, Python) to validate input, transform data (jq), or interface with external systems (ticketing, monitoring).
    • Utility adapters enforce schema validation and encapsulate reusable logic.
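
The schema validation that a utility adapter performs can be sketched in plain Python. This is a minimal illustration, not CWM's adapter API: the function name validate_upgrade_input is hypothetical, and the field names (devices, imageURL, force) are taken from the upgrade payload used later in this lesson.

```python
from urllib.parse import urlparse


def validate_upgrade_input(payload):
    """Return a list of error strings; an empty list means the input is acceptable."""
    errors = []

    devices = payload.get("devices")
    if not isinstance(devices, list) or not devices:
        errors.append("devices must be a non-empty list")
    elif not all(isinstance(d, str) and d for d in devices):
        errors.append("every device name must be a non-empty string")

    image_url = payload.get("imageURL")
    if not isinstance(image_url, str):
        errors.append("imageURL must be a string")
    else:
        parsed = urlparse(image_url)
        # Assumption: only these transfer schemes are allowed in this lab.
        if parsed.scheme not in ("ftp", "http", "https") or not parsed.netloc:
            errors.append("imageURL must be an ftp/http/https URL with a host")

    # force is optional and defaults to False.
    if not isinstance(payload.get("force", False), bool):
        errors.append("force must be a boolean")

    return errors


if __name__ == "__main__":
    print(validate_upgrade_input({"devices": [], "imageURL": "xr.bin"}))
```

Returning a list of errors rather than raising on the first problem lets the workflow report every input issue in a single abort message.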

Step-by-step configuration

Each step below provides the commands, why they matter, and verification output.

Step 1: Create a Serverless Workflow to retrieve XR version (pre-check)

What we are doing: Define a Serverless Workflow that calls NSO RESTCONF to retrieve the XR version for devices xr1 and xr2. This pre-check collects device versions and decides whether to continue.

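# workflow-get-xr-version.yaml
# Simplified for this lab: the state types and action refs below are shorthand.
# (Serverless Workflow spec 0.8 expresses such calls as `operation` states with functionRef actions.)
id: nhprep-get-xr-version
version: "1.0"
name: nhprep-get-xr-version
start: getVersions
states:
  # Pre-check: collect the current software version of each device.
  - name: getVersions
    type: action
    action:
      ref: http://lab.nhprep.com:8080/restconf/operations/ncs:get-device-version
      input:
        devices: ["xr1", "xr2"]
    transition: evaluateResults
  # Abort if the pre-check failed; otherwise wait for approval.
  - name: evaluateResults
    type: switch
    dataConditions:
      - condition: ${ .result.failed == true }
        transition: abort
      - condition: ${ true }
        transition: approvalGate
  # Approval gate: pause until approvalEvent arrives.
  - name: approvalGate
    type: event
    onEvents:
      - eventRef: approvalEvent
        transition: proceedOrAbort
  # Continue only when the approver explicitly set approved to true.
  - name: proceedOrAbort
    type: switch
    dataConditions:
      - condition: ${ .eventData.approved == true }
        transition: performChange
      - condition: ${ true }
        transition: abort
  - name: performChange
    type: action
    action:
      ref: http://lab.nhprep.com:8080/restconf/operations/cwm:invoke-xr-upgrade
    transition: postCheck
  # Post-check: same call as the pre-check so results can be compared directly.
  - name: postCheck
    type: action
    action:
      ref: http://lab.nhprep.com:8080/restconf/operations/ncs:get-device-version
      input:
        devices: ["xr1", "xr2"]
    end: true
  # Terminal state for a failed pre-check or a denied approval.
  # Added so the `abort` transitions above resolve; adapt it to your CWM conventions.
  - name: abort
    type: inject
    data:
      aborted: true
    end: true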

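What just happened: The workflow definition describes the following sequence:

  • getVersions calls an NSO operation to collect device versions (pre-check).
  • evaluateResults aborts if the pre-check reported a failure; otherwise it transitions to approvalGate, which waits for approvalEvent.
  • proceedOrAbort continues to performChange only when the event carries approved: true; any other value aborts the workflow.
  • performChange invokes the xr-upgrade workflow via CWM (this is the parent workflow for upgrades).
  • postCheck re-runs the version query to verify success.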

Real-world note: Keeping pre- and post-check calls identical simplifies comparison logic and reduces errors when validating results.
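
To make the branching easier to reason about, the state transitions in the workflow above can be traced offline. The sketch below is only a paper model of the YAML, not how CWM executes workflows; the helper names next_state and trace are hypothetical, and abort is assumed to be a terminal state.

```python
def next_state(state, data):
    """Return the name of the state that follows `state`, or None at the end."""
    if state == "getVersions":
        return "evaluateResults"
    if state == "evaluateResults":
        failed = data.get("result", {}).get("failed") is True
        return "abort" if failed else "approvalGate"
    if state == "approvalGate":
        return "proceedOrAbort"
    if state == "proceedOrAbort":
        approved = data.get("eventData", {}).get("approved") is True
        return "performChange" if approved else "abort"
    if state == "performChange":
        return "postCheck"
    return None  # postCheck and abort end the workflow


def trace(data, start="getVersions"):
    """List every state visited for the given workflow data."""
    path, state = [], start
    while state is not None:
        path.append(state)
        state = next_state(state, data)
    return path


if __name__ == "__main__":
    print(trace({"result": {"failed": False}, "eventData": {"approved": True}}))
    print(trace({"result": {"failed": True}}))
```

Tracing a few inputs this way is a cheap check that every branch ends either in postCheck or in abort before the definition is uploaded.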

Verify:

# Upload the workflow definition to CWM (example curl)
curl -u admin:Lab@123 -X POST \
  -H "Content-Type: application/yaml" \
  --data-binary @workflow-get-xr-version.yaml \
  https://lab.nhprep.com:8443/cwm/api/workflows

# Expected response (complete)
{
  "status": "success",
  "message": "Workflow nhprep-get-xr-version uploaded",
  "id": "nhprep-get-xr-version",
  "version": "1.0"
}

Step 2: Implement a Python pre-check script that calls NSO RESTCONF

What we are doing: Create a small Python script that invokes NSO RESTCONF to retrieve the device version using the NSO RESTCONF URI and then prints a JSON summary for the workflow to consume. This encapsulates the RESTCONF call and parsing logic.

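# get_device_version.py
import json

import requests

# NSO RESTCONF base URL; same host, scheme, and port the Step 1 workflow uses.
nso_host = "http://lab.nhprep.com:8080"
username = "admin"
password = "Lab@123"

# RESTCONF (RFC 8040) media type for JSON-encoded YANG data.
HEADERS = {
    "Content-Type": "application/yang-data+json",
    "Accept": "application/yang-data+json",
}

def get_version(device):
    """Ask NSO to run "show version" on one device and return the decoded JSON reply."""
    # Lab path; the exact live-status action path and payload keys depend on the installed NED.
    url = f"{nso_host}/restconf/data/tailf-ncs:devices/device={device}/live-status/show"
    payload = {"cmd": "show version"}
    resp = requests.post(url, auth=(username, password), headers=HEADERS,
                         json=payload, timeout=30,
                         verify=False)  # lab only; verify matters once nso_host uses https
    resp.raise_for_status()  # fail on 401/404/5xx instead of parsing an error body
    return resp.json()

if __name__ == "__main__":
    devices = ["xr1", "xr2"]
    result = {}
    for d in devices:
        result[d] = get_version(d)
    print(json.dumps({"versions": result}))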

What just happened: The script posts to an NSO RESTCONF URI that proxies a CLI command (show version) to each device through the CLI NED, then prints the collected responses as a single JSON document. Encapsulating this logic ensures consistent parsing and reuse in workflows or CI pipelines.

Real-world note: In production, serve RESTCONF over HTTPS with certificate verification enabled and wrap this call with retry/backoff; verify=False is set here only for lab convenience.
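
The retry/backoff wrapper mentioned above could look roughly like this. It is a generic sketch rather than part of the lab script: call_with_retry is a hypothetical helper, and the attempt count and delays are illustrative defaults.

```python
import time


def call_with_retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**n seconds and try again.

    The last failure is re-raised once `attempts` tries have been used.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:  # in real code, catch only transient errors (timeouts, 5xx)
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        # Stand-in for a RESTCONF call that fails twice before succeeding.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("temporary failure")
        return {"version": "7.5.1"}

    print(call_with_retry(flaky, base_delay=0.1))
```

In the script above, the call inside the loop would become something like call_with_retry(lambda: get_version(d)). The sleep parameter is injectable so the behaviour can be tested without real delays.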

Verify:

python3 get_device_version.py

Expected output (full JSON):

{
  "versions": {
    "xr1": {
      "platform": "NCS5500",
      "version": "7.5.1",
      "uptime": "12 days, 3 hours"
    },
    "xr2": {
      "platform": "NCS5500",
      "version": "7.5.0",
      "uptime": "5 days, 7 hours"
    }
  }
}
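
Given output shaped like the JSON above, the gating decision can be sketched as a small function. The names parse_version and precheck, the minimum parameter, and the three labels are assumptions for illustration; the sketch also assumes purely numeric dotted versions, so suffixed release strings would need extra parsing.

```python
def parse_version(text):
    """Turn '7.5.1' into (7, 5, 1) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def precheck(versions, target, minimum="7.5.0"):
    """Label each device 'skip', 'blocked', or 'upgrade'.

    skip    : already at or above the target version
    blocked : below the (hypothetical) minimum version that may upgrade directly
    upgrade : eligible for the change
    """
    plan = {}
    for device, info in versions.items():
        current = parse_version(info["version"])
        if current >= parse_version(target):
            plan[device] = "skip"
        elif current < parse_version(minimum):
            plan[device] = "blocked"
        else:
            plan[device] = "upgrade"
    return plan


if __name__ == "__main__":
    versions = {"xr1": {"version": "7.5.1"}, "xr2": {"version": "7.5.0"}}
    print(precheck(versions, target="8.0.1"))
```

A workflow could abort when any device is labelled blocked and pass only the upgrade devices on to the change step.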

Step 3: Add an approval gate (manual event) to the workflow

What we are doing: Use the approvalGate state defined in Step 1, which pauses the workflow until a user posts an approvalEvent. This enforces human validation before the change.

# Example: Post an approval event to CWM to resume the workflow
curl -u admin:Lab@123 -X POST \
  -H "Content-Type: application/json" \
  -d '{"workflowId":"nhprep-get-xr-version","event":"approvalEvent","eventData":{"approved":true,"approver":"operator1"}}' \
  https://lab.nhprep.com:8443/cwm/api/events

What just happened: The curl posts a JSON event (approvalEvent) that CWM correlates to the paused workflow instance. When CWM receives the event with approved:true, the workflow resumes and continues to the performChange state.

Real-world note: Approval events are typically integrated with ticketing systems (e.g., ServiceNow) to provide auditability and map to change requests.
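
When the approval comes from a script or a ticketing integration rather than a hand-typed curl command, the event body can be built programmatically. The sketch below mirrors the payload shape of the curl example above; build_approval_event is a hypothetical helper, the ticket field is an assumption, and whether the timestamp is set by the client or by CWM is not specified in this lesson.

```python
import json
from datetime import datetime, timezone


def build_approval_event(workflow_id, approved, approver, ticket=None):
    """Build the JSON-serializable body for an approvalEvent."""
    if not approver:
        raise ValueError("approver is required for the audit trail")
    event = {
        "workflowId": workflow_id,
        "event": "approvalEvent",  # must match the event name in the workflow exactly
        "eventData": {
            "approved": bool(approved),
            "approver": approver,
            # UTC timestamp in the same format as the sample instance output.
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        },
    }
    if ticket:
        event["eventData"]["ticket"] = ticket  # hypothetical change-request reference
    return event


if __name__ == "__main__":
    body = build_approval_event("nhprep-get-xr-version", True, "operator1")
    print(json.dumps(body, indent=2))
```

Refusing to build an event without an approver keeps anonymous approvals out of the audit trail.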

Verify:

# Query workflow instance status
curl -u admin:Lab@123 -X GET \
  https://lab.nhprep.com:8443/cwm/api/workflows/instances/nhprep-get-xr-version

# Expected output (complete)
{
  "instanceId": "inst-0001",
  "workflowId": "nhprep-get-xr-version",
  "status": "RUNNING",
  "currentState": "performChange",
  "lastEvent": {
    "eventName": "approvalEvent",
    "eventData": {
      "approved": true,
      "approver": "operator1",
      "timestamp": "2026-04-02T12:34:56Z"
    }
  }
}

Step 4: Invoke the xr-upgrade parent workflow (performChange)

What we are doing: Trigger the xr-upgrade parent workflow via CWM — this workflow is responsible for copying the image to devices and performing the compatibility matrix check before upgrade. We call it as an action from our main workflow.

# Sample invocation payload to start xr-upgrade (via CWM)
curl -u admin:Lab@123 -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "workflow":"invoke-xr-upgrade",
    "input": {
      "devices": ["xr1","xr2"],
      "imageURL": "ftp://192.0.2.100/images/xr-8.0.1.bin",
      "force": false
    }
  }' \
  https://lab.nhprep.com:8443/cwm/api/workflows/invoke

What just happened: The CWM API received the request to start the xr-upgrade workflow with the devices and image location. The xr-upgrade workflow will:

  • copy the image to the device (using NETCONF/CLI ops via NSO),
  • run an upgrade matrix check to ensure compatibility,
  • perform the upgrade according to the workflow logic (stop/start), and
  • report status back to the parent workflow.

Real-world note: Use a reliable repository (HTTP/FTP/SMB) with checksum verification. The workflow should validate the image checksum as part of the matrix check.
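
The checksum validation mentioned above can be sketched with Python's hashlib. This assumes the repository publishes a SHA-256 digest for each image, which this lesson does not specify; sha256_of and image_matches are hypothetical helper names.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large images are never loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_matches(path, expected_hex):
    """Return True when the file's SHA-256 equals the published digest."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


if __name__ == "__main__":
    import sys

    # Usage: python3 verify_image.py <image-file> <expected-sha256>
    if len(sys.argv) == 3:
        print("OK" if image_matches(sys.argv[1], sys.argv[2]) else "MISMATCH")
```

Running this check against the staged image before the copy step means a corrupted download aborts the workflow instead of reaching a device.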

Verify:

# Check the xr-upgrade workflow instance
curl -u admin:Lab@123 -X GET \
  "https://lab.nhprep.com:8443/cwm/api/workflows/instances?filter=workflow:invoke-xr-upgrade"

# Expected output (complete)
[
  {
    "instanceId": "upgrade-20260402-01",
    "workflowId": "invoke-xr-upgrade",
    "status": "RUNNING",
    "devices": ["xr1","xr2"],
    "progress": {
      "xr1": "image copied",
      "xr2": "matrix_check_passed"
    }
  }
]
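
Because the upgrade runs for a while, a caller usually polls the instance until it leaves the RUNNING status. The sketch below takes the status lookup as a function argument so it makes no assumption about the CWM API itself; wait_for_instance is a hypothetical helper, and terminal status names such as COMPLETED are assumptions because this lesson only shows RUNNING.

```python
import time


def wait_for_instance(fetch_status, interval=10.0, timeout=3600.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns something other than "RUNNING".

    Returns the final status string; raises TimeoutError if the instance is
    still running once `timeout` seconds have passed.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status != "RUNNING":
            return status
        if clock() >= deadline:
            raise TimeoutError("workflow instance is still RUNNING")
        sleep(interval)


if __name__ == "__main__":
    # Stand-in for a function that queries the instance and returns its "status" field.
    replies = iter(["RUNNING", "RUNNING", "COMPLETED"])
    print(wait_for_instance(lambda: next(replies), interval=0.1))
```

A bounded timeout matters here: an upgrade that hangs should surface as a failure rather than leave the parent workflow waiting indefinitely.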

Step 5: Post-check — verify device versions after upgrade

What we are doing: After the upgrade completes, re-run the get-device-version pre-check to compare pre/post results and confirm the upgrade succeeded.

# Trigger post-check (this is the same RESTCONF call from Step 2)
python3 get_device_version.py

What just happened: The same script queries NSO, which in turn queries devices. Differences in the "version" fields between pre- and post-check indicate whether the upgrade occurred. The workflow should store both pre and post results for audit.

Real-world note: Post-checks should include service-level tests (e.g., BGP session state, dataplane tests) in addition to version checks for full validation.

Verify:

# Expected full JSON output after successful upgrade
{
  "versions": {
    "xr1": {
      "platform": "NCS5500",
      "version": "8.0.1",
      "uptime": "0 days, 1 hour"
    },
    "xr2": {
      "platform": "NCS5500",
      "version": "8.0.1",
      "uptime": "0 days, 0 hours, 40 minutes"
    }
  }
}
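
Storing both snapshots makes the comparison mechanical. The following sketch assumes the pre- and post-check results have the JSON shape shown in this lesson; compare_versions and its status labels are hypothetical.

```python
def compare_versions(pre, post, target):
    """Build a per-device report from the pre- and post-check snapshots."""
    report = {}
    for device, info in pre["versions"].items():
        before = info["version"]
        after = post["versions"].get(device, {}).get("version")
        if after is None:
            status = "missing"      # device did not answer the post-check
        elif after == target:
            status = "upgraded"
        elif after == before:
            status = "unchanged"    # activation or reboot probably did not happen
        else:
            status = "unexpected"   # changed, but not to the target version
        report[device] = {"before": before, "after": after, "status": status}
    return report


if __name__ == "__main__":
    pre = {"versions": {"xr1": {"version": "7.5.1"}, "xr2": {"version": "7.5.0"}}}
    post = {"versions": {"xr1": {"version": "8.0.1"}, "xr2": {"version": "8.0.1"}}}
    for device, row in compare_versions(pre, post, target="8.0.1").items():
        print(device, row)
```

Any status other than upgraded is a reason to stop a rolling upgrade and inspect the xr-upgrade workflow logs before continuing.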

Verification Checklist

  • Check 1: Pre-check collected versions for all devices — verify by running get_device_version.py and comparing to workflow pre-check output.
  • Check 2: Approval gate pauses the workflow — verify by checking workflow instance status and ensuring currentState is approvalGate before posting approval.
  • Check 3: xr-upgrade invoked and reports progress — verify via CWM workflow instance query showing device progress entries.
  • Check 4: Post-check shows target version — verify by running get_device_version.py and confirming version equals desired image version.

Common Mistakes

  • Symptom: Workflow fails at getVersions with authentication errors
    • Cause: Wrong NSO credentials in workflow or Python script
    • Fix: Ensure credentials are admin / Lab@123 and NSO has matching user; update workflow adapter if using vault integration
  • Symptom: Approval event not correlated; workflow stays paused
    • Cause: Sent event uses wrong instanceId or event name mismatch
    • Fix: Query workflow instance to find correct instanceId, then post event with exact event name approvalEvent
  • Symptom: xr-upgrade fails due to image copy error
    • Cause: FTP/HTTP server unreachable or wrong image URL
    • Fix: Verify the imageURL is reachable from devices; ensure FTP server IP and path are correct and accessible
  • Symptom: Post-check shows old version even after successful workflow
    • Cause: Device did not reboot or image activation step failed
    • Fix: Check device boot logic; consult xr-upgrade workflow logs to see if activation step completed; verify device boot variables

Key Takeaways

  • Use pre-checks to detect incompatible device state before applying changes; this prevents wasted change attempts.
  • An approval gate provides human control and auditability — integrate with ticketing in production.
  • Post-checks verify the change and should include both control-plane (versions, protocol states) and dataplane tests.
  • Encapsulate RESTCONF interactions in scripts or adapters (Python/Go) and validate inputs with JSON schema — this makes workflows robust and reusable.

Final practical insight: In production, always run workflow changes against a small pilot group (one device or one site) first, record the outcomes, and only then scale to the full fleet. Automation reduces manual work — but only if the workflow contains the right checks, approvals, and rollback/abort logic.
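
The pilot-first rollout can be expressed as a simple batching step that runs before the upgrade workflow is invoked. This is a sketch only: plan_waves is a hypothetical helper, and device names beyond xr1 and xr2 are invented for illustration.

```python
def plan_waves(devices, pilot_size=1, wave_size=2):
    """Split devices into a pilot group followed by fixed-size waves."""
    if pilot_size < 1 or wave_size < 1:
        raise ValueError("pilot_size and wave_size must be at least 1")
    pilot = devices[:pilot_size]
    rest = devices[pilot_size:]
    waves = [pilot] + [rest[i:i + wave_size] for i in range(0, len(rest), wave_size)]
    return [wave for wave in waves if wave]  # drop empty groups


if __name__ == "__main__":
    # xr3 to xr5 are hypothetical extra devices.
    for number, wave in enumerate(plan_waves(["xr1", "xr2", "xr3", "xr4", "xr5"])):
        label = "pilot" if number == 0 else f"wave {number}"
        print(label, wave)
```

Each group would go through the full pre-check, approval, change, and post-check cycle, and the next group starts only after the previous one passes its post-check.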