Change Management Automation
Objective
In this lesson you will build an automated change management workflow using a Serverless Workflow that integrates with NSO via RESTCONF. You will implement a pre-check (inventory & version retrieval), an approval gate, and a post-check (verification). This matters in production because automated pre/post checks and explicit approval gates reduce human error, provide repeatability, and create an auditable trail for changes such as software upgrades or configuration pushes. Real-world scenario: NOC operators use these workflows to perform rolling XR OS upgrades across leaf/spine routers while ensuring compatibility, obtaining manager approval, and verifying success automatically.
Quick Recap
Refer to the topology described in Lesson 1. This lesson uses the management-plane components only — the NSO/CWM control plane and the managed XR devices. No new physical devices are added in this lesson. The NSO/CWM server will be reached at the management domain host:
- NSO / CWM management host: lab.nhprep.com (credentials: username: admin, password: Lab@123)
- Managed devices are represented in NSO device inventory (device names: xr1, xr2)
ASCII Topology (management-plane view — IPs shown for management interfaces):
        [NSO/CWM Server] lab.nhprep.com (mgmt: 192.0.2.10/24)
                 |
                 | 192.0.2.11
        [Mgmt Switch] (mgmt VLAN)
                 |
            +----+----+
            |         |
           xr1       xr2
 (mgmt: 192.0.2.21) (mgmt: 192.0.2.22)
Tip: The diagram uses documentation-range IPs (192.0.2.0/24). In your lab, use the management IPs assigned in NSO device definitions.
Key Concepts
- Serverless Workflow DSL: A JSON/YAML workflow definition that describes tasks, conditions, and transitions. Think of it as an automation playbook: the workflow orchestrates calls (RESTCONF, adapters) and branches based on results.
  - In production, workflows centralize change logic so operators do not manually call multiple APIs.
- NSO RESTCONF API & exec any: NSO exposes managed-device interactions via RESTCONF. The exec any capability in the CLI NED lets NSO execute device CLI commands when needed (for example, retrieving "show version").
  - When you call RESTCONF to invoke exec any, NSO opens a CLI session to the device through the NED, runs the command, and returns the command output in the RESTCONF response.
- Pre-checks / Post-checks: Pre-checks validate inventory state and compatibility (e.g., current software version, free space). Post-checks verify the change (e.g., version after upgrade). Both are read-only, repeatable queries rather than destructive changes.
  - Practically, pre-checks allow gating: if a device reports an incompatible version or insufficient disk space, the workflow aborts before the change is applied.
- Approval Gate: A human approval step that pauses workflow execution until an authorized user confirms. This provides separation of duties and an auditable approval timestamp.
  - In production, approvals are required for high-impact operations (e.g., cluster-wide upgrades).
- Adapters & Utility Functions: CWM can call language-specific adapters (Go, Python) to validate input, transform data (jq), or interface with external systems (ticketing, monitoring).
  - Utility adapters enforce schema validation and encapsulate reusable logic.
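To make the validation role of a utility adapter concrete, here is a minimal sketch that checks a workflow input before any RESTCONF call is made. The function name `validate_input` and the choice of fields (`devices`, `imageURL`, `force`) are assumptions for illustration; this is not the CWM adapter API, and a real adapter would more likely validate against a JSON schema.

```python
def validate_input(payload):
    """Return a list of validation errors; an empty list means the input is usable."""
    errors = []

    devices = payload.get("devices")
    if not isinstance(devices, list) or not devices:
        errors.append("devices must be a non-empty list")
    elif not all(isinstance(d, str) and d for d in devices):
        errors.append("every device name must be a non-empty string")

    image_url = payload.get("imageURL")
    if not isinstance(image_url, str) or "://" not in image_url:
        errors.append("imageURL must be a URL string")

    # "force" is optional in this sketch and defaults to False.
    if not isinstance(payload.get("force", False), bool):
        errors.append("force must be a boolean")

    return errors
```

Returning a list of errors rather than raising on the first problem lets the workflow report every issue to the operator in one pass.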
Step-by-step configuration
Each step below provides the commands, why they matter, and verification output.
Step 1: Create a Serverless Workflow to retrieve XR version (pre-check)
What we are doing: Define a Serverless Workflow that calls NSO RESTCONF to retrieve the XR version for devices xr1 and xr2. This pre-check collects device versions and decides whether to continue.
# workflow-get-xr-version.yaml
id: nhprep-get-xr-version
version: "1.0"
name: nhprep-get-xr-version
states:
  # Pre-check: collect the current version from each device
  - name: getVersions
    type: action
    action:
      ref: http://lab.nhprep.com:8080/restconf/operations/ncs:get-device-version
      input:
        devices: ["xr1", "xr2"]
    transition: evaluateResults
  # Abort on a failed pre-check, otherwise wait for approval
  - name: evaluateResults
    type: switch
    dataConditions:
      - condition: "${ .result.failed == true }"
        transition: abort
      - condition: "${ true }"
        transition: approvalGate
  # Pause until an approvalEvent arrives
  - name: approvalGate
    type: event
    onEvents:
      - eventRef: approvalEvent
    transition: proceedOrAbort
  - name: proceedOrAbort
    type: switch
    dataConditions:
      - condition: "${ .eventData.approved == true }"
        transition: performChange
      - condition: "${ true }"
        transition: abort
  - name: performChange
    type: action
    action:
      ref: http://lab.nhprep.com:8080/restconf/operations/cwm:invoke-xr-upgrade
    transition: postCheck
  # Post-check: same query as the pre-check
  - name: postCheck
    type: action
    action:
      ref: http://lab.nhprep.com:8080/restconf/operations/ncs:get-device-version
      input:
        devices: ["xr1", "xr2"]
    end: true
  # Terminal state reached when the pre-check fails or approval is denied
  - name: abort
    end: true
What just happened: The workflow definition registers a sequence:
- getVersions calls an NSO operation to collect device versions (pre-check).
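- evaluateResults aborts if the pre-check reported a failure; otherwise it hands off to approvalGate, which waits for approvalEvent.
- proceedOrAbort continues to performChange only when the event carries approved: true; any other value aborts.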
- performChange invokes the xr-upgrade workflow via CWM (this is the parent workflow for upgrades).
- postCheck re-runs the version query to verify success.
Real-world note: Keeping pre- and post-check calls identical simplifies comparison logic and reduces errors when validating results.
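The branching in this workflow can be traced with a small pure-Python model. This is only a sketch for reasoning about the transitions: `next_state` and `trace` are hypothetical helpers, the state names mirror the YAML above, and nothing here reflects how CWM executes workflows internally.

```python
def next_state(state, data):
    """Return the state that follows `state`, or None if `state` is terminal."""
    if state == "getVersions":
        return "evaluateResults"
    if state == "evaluateResults":
        failed = data.get("result", {}).get("failed") is True
        return "abort" if failed else "approvalGate"
    if state == "approvalGate":
        return "proceedOrAbort"
    if state == "proceedOrAbort":
        approved = data.get("eventData", {}).get("approved") is True
        return "performChange" if approved else "abort"
    if state == "performChange":
        return "postCheck"
    return None  # postCheck and abort end the workflow


def trace(data, start="getVersions"):
    """Follow transitions from `start` and return the list of visited states."""
    path = [start]
    while (nxt := next_state(path[-1], data)) is not None:
        path.append(nxt)
    return path
```

Calling `trace({"result": {"failed": True}})` stops at `abort` right after `evaluateResults`, which is the behavior the pre-check gate is meant to guarantee.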
Verify:
# Upload the workflow definition to CWM (example curl)
curl -u admin:Lab@123 -X POST \
-H "Content-Type: application/yaml" \
--data-binary @workflow-get-xr-version.yaml \
https://lab.nhprep.com:8443/cwm/api/workflows
# Expected response (complete)
{
"status": "success",
"message": "Workflow nhprep-get-xr-version uploaded",
"id": "nhprep-get-xr-version",
"version": "1.0"
}
Step 2: Implement a Python pre-check script that calls NSO RESTCONF
What we are doing: Create a small Python script that calls the NSO RESTCONF API to retrieve each device's version and prints a JSON summary for the workflow to consume. This encapsulates the RESTCONF call and parsing logic.
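# get_device_version.py
import json
import sys

import requests

# Lab values; adjust scheme, host, and port to match your NSO deployment.
nso_host = "https://lab.nhprep.com:8080"
username = "admin"
password = "Lab@123"


def get_version(device):
    """Ask NSO to run 'show version' on one device and return the JSON reply."""
    url = f"{nso_host}/restconf/data/tailf-ncs:devices/device={device}/live-status/show"
    payload = {"cmd": "show version"}
    # verify=False skips TLS certificate checks (lab only); timeout avoids hanging forever.
    resp = requests.post(url, auth=(username, password), json=payload,
                         verify=False, timeout=30)
    resp.raise_for_status()  # fail on HTTP 4xx/5xx instead of parsing an error body
    return resp.json()


if __name__ == "__main__":
    devices = ["xr1", "xr2"]
    result = {}
    for d in devices:
        try:
            result[d] = get_version(d)
        except requests.RequestException as exc:
            print(f"pre-check failed for {d}: {exc}", file=sys.stderr)
            sys.exit(1)
    print(json.dumps({"versions": result}))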
What just happened: The script posts to an NSO RESTCONF URI that proxies a CLI command (show version) to each device through the CLI NED, then prints the collected responses as a single JSON document on stdout. Encapsulating this logic ensures consistent parsing and reuse in workflows or CI pipelines.
Real-world note: In production, wrap this call with retry/backoff and TLS verification; here we set verify=False for lab convenience.
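The retry/backoff advice can be sketched as a small wrapper. `call_with_retry` is a hypothetical helper, not part of `requests` or NSO; the attempt count and delays are assumptions to tune for your environment, and the `sleep` parameter exists only so the delay can be replaced in tests.

```python
import time


def call_with_retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception wait base_delay * 2**n, then retry.

    Gives up and re-raises the last exception after `attempts` tries.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
```

With the script above, a call might look like `call_with_retry(lambda: get_version("xr1"))`. In real use, catch only transient errors (for example `requests.RequestException`) rather than every `Exception`.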
Verify:
python3 get_device_version.py
Expected output (full JSON):
{
"versions": {
"xr1": {
"platform": "NCS5500",
"version": "7.5.1",
"uptime": "12 days, 3 hours"
},
"xr2": {
"platform": "NCS5500",
"version": "7.5.0",
"uptime": "5 days, 7 hours"
}
}
}
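Given output in this shape, the gating decision can be expressed in a few lines. This is a sketch under assumptions: `parse_version` and `devices_needing_upgrade` are hypothetical helper names, and the parser assumes purely numeric dotted versions such as 7.5.1, so builds with suffixes would need extra handling.

```python
def parse_version(text):
    """Turn '7.5.1' into (7, 5, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def devices_needing_upgrade(versions, target):
    """Return the names of devices whose reported version is lower than `target`."""
    wanted = parse_version(target)
    return sorted(
        name for name, info in versions.items()
        if parse_version(info["version"]) < wanted
    )


# Pre-check data copied from the expected output above
precheck = {
    "xr1": {"platform": "NCS5500", "version": "7.5.1"},
    "xr2": {"platform": "NCS5500", "version": "7.5.0"},
}
print(devices_needing_upgrade(precheck, "8.0.1"))  # ['xr1', 'xr2']
```

Comparing tuples of integers avoids the string-comparison trap where "7.10.0" sorts before "7.9.9".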
Step 3: Add an approval gate (manual event) to the workflow
What we are doing: Configure the workflow approvalEvent so the workflow pauses and waits until a user posts an approval event. This enforces human validation before the change.
# Example: Post an approval event to CWM to resume the workflow
curl -u admin:Lab@123 -X POST \
-H "Content-Type: application/json" \
-d '{"workflowId":"nhprep-get-xr-version","event":"approvalEvent","eventData":{"approved":true,"approver":"operator1"}}' \
https://lab.nhprep.com:8443/cwm/api/events
What just happened: The curl command posts a JSON event (approvalEvent) that CWM correlates to the paused workflow instance. When CWM receives the event with approved: true, the workflow resumes and continues to the performChange state.
Real-world note: Approval events are typically integrated with ticketing systems (e.g., ServiceNow) to provide auditability and map to change requests.
Verify:
# Query workflow instance status
curl -u admin:Lab@123 -X GET \
https://lab.nhprep.com:8443/cwm/api/workflows/instances/nhprep-get-xr-version
# Expected output (complete)
{
"instanceId": "inst-0001",
"workflowId": "nhprep-get-xr-version",
"status": "RUNNING",
"currentState": "performChange",
"lastEvent": {
"eventName": "approvalEvent",
"eventData": {
"approved": true,
"approver": "operator1",
"timestamp": "2026-04-02T12:34:56Z"
}
}
}
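A script or CI job can check the same fields programmatically. The sketch below assumes the instance document has exactly the shape shown above; `waiting_for_approval` and `approval_recorded` are hypothetical helper names.

```python
def waiting_for_approval(instance):
    """True while the instance is paused at the approval gate."""
    return instance.get("currentState") == "approvalGate"


def approval_recorded(instance):
    """True when the last event is an approvalEvent carrying approved=true."""
    event = instance.get("lastEvent") or {}
    return (
        event.get("eventName") == "approvalEvent"
        and event.get("eventData", {}).get("approved") is True
    )
```

Checking `waiting_for_approval` before posting the event, and `approval_recorded` afterwards, gives a simple before/after assertion for the gate.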
Step 4: Invoke the xr-upgrade parent workflow (performChange)
What we are doing: Trigger the xr-upgrade parent workflow via CWM — this workflow is responsible for copying the image to devices and performing the compatibility matrix check before upgrade. We call it as an action from our main workflow.
# Sample invocation payload to start xr-upgrade (via CWM)
curl -u admin:Lab@123 -X POST \
-H "Content-Type: application/json" \
-d '{
"workflow":"invoke-xr-upgrade",
"input": {
"devices": ["xr1","xr2"],
"imageURL": "ftp://192.0.2.100/images/xr-8.0.1.bin",
"force": false
}
}' \
https://lab.nhprep.com:8443/cwm/api/workflows/invoke
What just happened: The CWM API received the request to start the xr-upgrade workflow with the devices and image location. The xr-upgrade workflow will:
- copy the image to the device (using NETCONF/CLI ops via NSO),
- run an upgrade matrix check to ensure compatibility,
- perform the upgrade according to the workflow logic (stop/start), and
- report status back to the parent workflow.
Real-world note: Use a reliable repository (HTTP/FTP/SMB) with checksum verification. The workflow should validate the image checksum as part of the matrix check.
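Checksum validation of a staged image can be sketched with the standard library. The choice of SHA-256 and the helper names `sha256_of` and `image_matches` are assumptions here; use whichever digest your image repository publishes.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large images are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_matches(path, expected_hex):
    """Compare the computed digest with the published one, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Running this check before the copy step means a corrupted download is caught on the repository side rather than after the image has reached the device.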
Verify:
# Check the xr-upgrade workflow instance
curl -u admin:Lab@123 -X GET \
"https://lab.nhprep.com:8443/cwm/api/workflows/instances?filter=workflow:invoke-xr-upgrade"
# Expected output (complete)
[
{
"instanceId": "upgrade-20260402-01",
"workflowId": "invoke-xr-upgrade",
"status": "RUNNING",
"devices": ["xr1","xr2"],
"progress": {
"xr1": "image copied",
"xr2": "matrix_check_passed"
}
}
]
Step 5: Post-check — verify device versions after upgrade
What we are doing: After the upgrade completes, re-run the version query from Step 2 and compare the result with the pre-check output to confirm the upgrade succeeded.
# Trigger post-check (this is the same RESTCONF call from Step 2)
python3 get_device_version.py
What just happened: The same script queries NSO, which in turn queries devices. Differences in the "version" fields between pre- and post-check indicate whether the upgrade occurred. The workflow should store both pre and post results for audit.
Real-world note: Post-checks should include service-level tests (e.g., BGP session state, dataplane tests) in addition to version checks for full validation.
Verify:
# Expected full JSON output after successful upgrade
{
"versions": {
"xr1": {
"platform": "NCS5500",
"version": "8.0.1",
"uptime": "0 days, 1 hour"
},
"xr2": {
"platform": "NCS5500",
"version": "8.0.1",
"uptime": "0 days, 0 hours, 40 minutes"
}
}
}
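The pre/post comparison can be automated with a short helper. This is a sketch: `verify_upgrade` is a hypothetical name, and it assumes both results use the `versions` structure printed by get_device_version.py.

```python
def verify_upgrade(pre, post, target):
    """Return {device: (old_version, new_version, ok)}; ok means new_version == target."""
    report = {}
    for name, info in post.items():
        old = pre.get(name, {}).get("version")
        new = info.get("version")
        report[name] = (old, new, new == target)
    return report


# Version fields taken from the pre-check and post-check outputs in this lesson
pre = {"xr1": {"version": "7.5.1"}, "xr2": {"version": "7.5.0"}}
post = {"xr1": {"version": "8.0.1"}, "xr2": {"version": "8.0.1"}}
report = verify_upgrade(pre, post, "8.0.1")
print(all(ok for _, _, ok in report.values()))  # True
```

Keeping the old and new versions side by side in the report also gives the audit record that the workflow is expected to store.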
Verification Checklist
- Check 1: Pre-check collected versions for all devices — verify by running get_device_version.py and comparing to workflow pre-check output.
- Check 2: Approval gate pauses the workflow — verify by checking workflow instance status and ensuring currentState is approvalGate before posting approval.
- Check 3: xr-upgrade invoked and reports progress — verify via CWM workflow instance query showing device progress entries.
- Check 4: Post-check shows target version — verify by running get_device_version.py and confirming version equals desired image version.
Common Mistakes
| Symptom | Cause | Fix |
|---|---|---|
| Workflow fails at getVersions with authentication errors | Wrong NSO credentials in workflow or Python script | Ensure credentials are admin / Lab@123 and NSO has matching user; update workflow adapter if using vault integration |
| Approval event not correlated — workflow stays paused | Sent event uses wrong instanceId or event name mismatch | Query workflow instance to find correct instanceId, then post event with exact event name approvalEvent |
| xr-upgrade fails due to image copy error | FTP/HTTP server unreachable or wrong image URL | Verify the imageURL is reachable from devices; ensure FTP server IP and path are correct and accessible |
| Post-check shows old version even after successful workflow | Device did not reboot or image activation step failed | Check device boot logic; consult xr-upgrade workflow logs to see if activation step completed; verify device boot variables |
Key Takeaways
- Use pre-checks to detect incompatible device state before applying changes; this prevents wasted change attempts.
- An approval gate provides human control and auditability — integrate with ticketing in production.
- Post-checks verify the change and should include both control-plane (versions, protocol states) and dataplane tests.
- Encapsulate RESTCONF interactions in scripts or adapters (Python/Go) and validate inputs with JSON schema — this makes workflows robust and reusable.
Final practical insight: In production, always run workflow changes against a small pilot group (one device or one site) first, record the outcomes, and only then scale to the full fleet. Automation reduces manual work — but only if the workflow contains the right checks, approvals, and rollback/abort logic.
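One way to encode the pilot-first approach is to split the device list into batches before handing it to the workflow. This is a minimal sketch: `rollout_batches` and its default sizes are assumptions, not a CWM feature.

```python
def rollout_batches(devices, pilot_size=1, batch_size=5):
    """Yield a pilot batch first, then the remaining devices in fixed-size batches."""
    pilot, rest = devices[:pilot_size], devices[pilot_size:]
    if pilot:
        yield pilot
    for start in range(0, len(rest), batch_size):
        yield rest[start:start + batch_size]
```

A driver would run the full pre-check, approval, change, and post-check cycle on the first batch and continue to later batches only if the post-check report is clean.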