Agentic Workflows for Networking
Objective
Build an agentic workflow that automates troubleshooting and optimization for an enterprise network. You will register an AI Agent, bind it to a set of tools (Tool A, Tool B, Tool C, Tool D), create a workflow that reacts to telemetry events, and verify automated remediation. This matters in production because routine incidents (link flaps, congestion affecting GPU clusters, RDMA errors) must be detected and remediated quickly — reducing mean time to repair (MTTR) and avoiding costly job stalls in AI/ML compute environments.
Topology (Quick Recap)
Refer to the topology from Lesson 1. This lesson does not add new routers, switches, or interfaces; it builds agentic orchestration on top of the existing fabric and control-plane systems. All platform operations are performed against the central orchestration endpoint at lab.nhprep.com.
Tip: For this lesson we interact with the orchestration/control hub API at lab.nhprep.com. Use the credentials shown in the examples (password: Lab@123). In production, use short-lived tokens and RBAC.
Device Table
| Device | Role | Management Endpoint |
|---|---|---|
| Control Hub | AI Agent orchestration and telemetry ingestion | lab.nhprep.com |
| Telemetry Collector | Streams device telemetry/events to Control Hub | (part of Lesson 1 topology) |
| Agent Runner | Executes tools and configuration actions (logical) | lab.nhprep.com (agent runtime) |
Introduction
In this lesson we create an agentic workflow: an automated sequence run by AI Agents that receives telemetry, diagnoses issues, and applies remediation steps. In production, this is used when an AI training cluster experiences periodic network stalls (for example, RDMA errors or packet drops causing All-Reduce to slow down). An agentic workflow reduces human intervention and allows consistent, auditable remediation steps.
Key Concepts (Theory + Practical)
- Agentic Workflow — A sequence of automated actions executed by an agent in response to triggers (telemetry, alerts, time-based). Think of it as a recipe: trigger → gather data → analyze → take action. In production, workflows provide repeatable remediation for recurring issues.
- AI Agent and Tools — The AI Agent is an orchestrator that can call tools (Tool A..D). Tools correspond to capabilities: telemetry query, config change, ticket creation, or runbook execution. Each tool is invoked via API calls. Agents manage state and decide next steps based on tool outputs.
- Telemetry & Event Triggers — Telemetry is pushed via events (webhooks) or polled from collectors. When a telemetry event arrives indicating a problem (e.g., high RDMA retransmits), the workflow is triggered. Webhooks deliver JSON payloads; the agent parses them and decides actions.
- RDMA Impacts on Networking — RDMA transports (RoCEv2, iWARP) bypass the standard TCP/IP stack, producing extremely low-latency, high-throughput flows that can saturate fabric links and magnify the impact of microbursts. In production, agent workflows must correlate RDMA-related metrics (QCN, PFC, retransmits) with port counters and queue drops before reconfiguring QoS or path selection.
- Idempotence, Safety, and Auditing — Every automated change must be idempotent (safe to repeat), respect escalation thresholds (so settings don't flap), and be logged for audit. The agent keeps run history and outputs for verification.
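The trigger → gather → analyze → act recipe, together with the idempotence guard, can be sketched as a minimal decision loop. This is an illustrative sketch only: the function and variable names (run_workflow, applied_changes, the reduce-be-l1 template) stand in for the lab platform's Tool A..D capabilities and are not a real SDK.

```python
# Minimal sketch of an agentic remediation loop: trigger -> gather -> analyze -> act.
# All names here are hypothetical stand-ins for the Tool A..D capabilities.

RETRANSMIT_THRESHOLD = 100  # escalation threshold: act only above this value

def run_workflow(event, telemetry, applied_changes, audit_log):
    """Process one telemetry event; idempotent and fully audited."""
    node = event["payload"]["node"]
    # Gather (Tool A): read the metric for the affected node.
    retransmits = telemetry.get(node, 0)
    # Analyze (Tool B): remediate only past the escalation threshold.
    if retransmits <= RETRANSMIT_THRESHOLD:
        audit_log.append((node, "no_action", retransmits))
        return "no_action"
    # Idempotence guard: re-running the workflow must not re-apply the change.
    if applied_changes.get(node) == "reduce-be-l1":
        audit_log.append((node, "already_applied", retransmits))
        return "already_applied"
    # Act (Tool C) and log (Tool D).
    applied_changes[node] = "reduce-be-l1"
    audit_log.append((node, "applied:reduce-be-l1", retransmits))
    return "applied"
```

Running the same event twice applies the change once and records "already_applied" the second time, which is exactly the safe-to-repeat property the bullet above demands.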
Step-by-step configuration
Step 1: Register the AI Agent
What we are doing: Register a new agent "Agentic-Troubleshooter" in the Control Hub so it can be scheduled and run workflows. This matters because the agent identity binds tooling, permissions, and the runtime environment.
# Register an agent via Control Hub API
curl -u admin:Lab@123 -X POST https://lab.nhprep.com/api/agents \
-H "Content-Type: application/json" \
-d '{
"name": "Agentic-Troubleshooter",
"runtime": "serverless",
"tools": ["Tool A", "Tool B", "Tool C", "Tool D"],
"owner": "NHPREP"
}'
What just happened: The POST request created an agent record on the orchestration platform. The platform binds the listed tools to the agent and provisions a runtime handle (serverless container or runner) that will receive workflow invocations. The agent will later execute actions by calling tools via the platform’s service bus.
Real-world note: In production, agent registrations are gated by RBAC and approval workflows. Never run agents with overly broad permissions.
Verify:
# Query the agent list to confirm registration
curl -u admin:Lab@123 -X GET https://lab.nhprep.com/api/agents
# Expected output (complete JSON)
[
{
"id": "agent-001",
"name": "Agentic-Troubleshooter",
"runtime": "serverless",
"tools": [
"Tool A",
"Tool B",
"Tool C",
"Tool D"
],
"owner": "NHPREP",
"status": "idle",
"last_seen": "2026-03-30T10:05:23Z"
}
]
Step 2: Define Tools and Capabilities
What we are doing: Register the tools the agent will call. Tools map to capabilities like telemetry query (Tool A), diagnostic parser (Tool B), config apply (Tool C), and ticketing/logging (Tool D). Mapping tools explicitly ensures the agent cannot call arbitrary actions.
# Register Tool A (telemetry), Tool B (diagnostics), Tool C (config), Tool D (logging)
curl -u admin:Lab@123 -X POST https://lab.nhprep.com/api/tools \
-H "Content-Type: application/json" \
-d '[
{"name":"Tool A", "type":"telemetry_query", "endpoint":"https://lab.nhprep.com/api/telemetry"},
{"name":"Tool B", "type":"diagnostics", "endpoint":"https://lab.nhprep.com/api/diagnostics"},
{"name":"Tool C", "type":"config_apply", "endpoint":"https://lab.nhprep.com/api/config"},
{"name":"Tool D", "type":"logging", "endpoint":"https://lab.nhprep.com/api/logs"}
]'
What just happened: The Control Hub now has named capabilities. When the agent invokes "Tool C", the platform routes that request to the config apply service which performs validated changes. This decouples decision logic (agent) from action machinery (tools), improving safety and auditability.
Real-world note: Separating tools allows central approval; e.g., Tool C may require a signed change template before applying to production.
Verify:
# List registered tools
curl -u admin:Lab@123 -X GET https://lab.nhprep.com/api/tools
# Expected output
[
{
"name": "Tool A",
"type": "telemetry_query",
"endpoint": "https://lab.nhprep.com/api/telemetry",
"status": "available"
},
{
"name": "Tool B",
"type": "diagnostics",
"endpoint": "https://lab.nhprep.com/api/diagnostics",
"status": "available"
},
{
"name": "Tool C",
"type": "config_apply",
"endpoint": "https://lab.nhprep.com/api/config",
"status": "available"
},
{
"name": "Tool D",
"type": "logging",
"endpoint": "https://lab.nhprep.com/api/logs",
"status": "available"
}
]
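The decoupling this step enables — decision logic in the agent, action machinery behind named tools — can be sketched as a dispatch table. The handlers below are hypothetical stubs, not the lab platform's actual services; the point is that the agent can only invoke registered capabilities, and a gate (like a signed change template) can sit inside the tool.

```python
# Sketch of routing agent requests through a tool registry.
# Handler functions are illustrative stubs for Tool A and Tool C.

def query_telemetry(params):
    """Stand-in for Tool A: return a canned metric sample."""
    return {"rdma_retransmits": 250}

APPROVED_TEMPLATES = {"reduce-be-l1"}  # central approval list (assumption)

def apply_config(params):
    """Stand-in for Tool C: only approved templates may be applied."""
    template = params.get("template")
    if template not in APPROVED_TEMPLATES:
        raise ValueError(f"unapproved change template: {template}")
    return {"applied": template}

TOOL_REGISTRY = {
    "Tool A": query_telemetry,
    "Tool C": apply_config,
}

def invoke(tool_name, params):
    """Route a request to a registered capability only; nothing arbitrary."""
    handler = TOOL_REGISTRY.get(tool_name)
    if handler is None:
        raise KeyError(f"tool unavailable: {tool_name}")
    return handler(params)
```

An unregistered tool name fails fast with "tool unavailable" — the same symptom listed in Common Mistakes below when a tool endpoint is mis-registered.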
Step 3: Create the Agentic Workflow
What we are doing: Define the workflow that reacts to telemetry events indicating RDMA retransmits or queue drops, queries telemetry, runs diagnostics, and performs a safe remediation (e.g., adjust QoS or shift traffic). Workflows define steps, conditionals, and escalation thresholds.
# Create a workflow that triggers on 'rdma_issue' events
curl -u admin:Lab@123 -X POST https://lab.nhprep.com/api/workflows \
-H "Content-Type: application/json" \
-d '{
"name": "RDMA_AutoRemediate",
"agent": "Agentic-Troubleshooter",
"trigger": {"type":"event","event_type":"rdma_issue"},
"steps": [
{"id":"s1","tool":"Tool A","action":"query","params":{"metric":"rdma_retransmits","window":"120s"}},
{"id":"s2","tool":"Tool B","action":"analyze","params":{"threshold":100,"operation":"greater_than"}},
{"id":"s3","tool":"Tool C","action":"apply","params":{"change_type":"qos_adjust","template":"reduce-be-l1"}},
{"id":"s4","tool":"Tool D","action":"log","params":{"level":"info","message":"RDMA remediation executed"}}
],
"safety": {"manual_approval_required": false, "backout": true}
}'
What just happened: The Control Hub stored the workflow and bound it to the Agentic-Troubleshooter. The trigger listens for "rdma_issue" events. When triggered, the agent runs step s1 to fetch recent RDMA metrics, step s2 to decide if remediation is warranted, step s3 to apply config changes if thresholds are exceeded, and step s4 to log the action. The safety settings enable automated rollback if the remediation causes worse metrics.
Real-world note: For RDMA-heavy clusters, prefer conservative QoS adjustments and use staged rollouts to avoid widespread disruption.
Verify:
# Get workflow details
curl -u admin:Lab@123 -X GET https://lab.nhprep.com/api/workflows/RDMA_AutoRemediate
# Expected output
{
"id": "workflow-100",
"name": "RDMA_AutoRemediate",
"agent": "Agentic-Troubleshooter",
"trigger": {"type":"event","event_type":"rdma_issue"},
"steps": [
{"id":"s1","tool":"Tool A","action":"query"},
{"id":"s2","tool":"Tool B","action":"analyze"},
{"id":"s3","tool":"Tool C","action":"apply"},
{"id":"s4","tool":"Tool D","action":"log"}
],
"safety": {"manual_approval_required": false, "backout": true},
"status": "active"
}
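The conditional in step s2 — remediate only when the metric exceeds the threshold with the given operation — can be sketched as a small evaluator. The operation names mirror the step params above ("greater_than", threshold 100); the function itself is a hypothetical illustration of how the diagnostics tool might gate step s3.

```python
# Sketch of the analyze step (s2) gating the remediation step (s3).
# Mirrors the workflow params above: {"threshold": 100, "operation": "greater_than"}.

OPERATIONS = {
    "greater_than": lambda value, threshold: value > threshold,
    "less_than": lambda value, threshold: value < threshold,
}

def analyze(step_params, metric_value):
    """Return 'remediate' or 'no_action' per the step's condition."""
    condition = OPERATIONS[step_params["operation"]]
    if condition(metric_value, step_params["threshold"]):
        return "remediate"
    return "no_action"
```

With the simulated payload from Step 4 (250 retransmits against a threshold of 100), this evaluator returns "remediate", which is why step s3 runs.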
Step 4: Simulate an Incident (Trigger the Workflow)
What we are doing: Post a telemetry event that simulates elevated RDMA retransmits, which should trigger the workflow. This tests the end-to-end automation without injecting changes into production devices.
# Send a simulated telemetry event
curl -u admin:Lab@123 -X POST https://lab.nhprep.com/api/events \
-H "Content-Type: application/json" \
-d '{
"event_type": "rdma_issue",
"source": "telemetry-collector-01",
"timestamp": "2026-03-30T10:15:00Z",
"payload": {"node":"compute-rack-2","rdma_retransmits":250,"severity":"high"}
}'
What just happened: The Control Hub received the event and matched it to the RDMA_AutoRemediate workflow trigger. The agent runtime was scheduled to execute the workflow steps. The agent fetched telemetry (Tool A), analyzed it (Tool B), and — since retransmits exceeded the threshold — performed an approved QoS change via Tool C and logged the action via Tool D.
Real-world note: Use synthetic events in a staging environment. In production, events come from telemetry collectors or alerting systems.
Verify:
# Check recent workflow runs
curl -u admin:Lab@123 -X GET https://lab.nhprep.com/api/workflows/workflow-100/runs
# Expected output
[
{
"run_id": "run-20260330-01",
"workflow_id": "workflow-100",
"trigger": {"type":"event","event_type":"rdma_issue"},
"start_time": "2026-03-30T10:15:01Z",
"end_time": "2026-03-30T10:15:18Z",
"status": "success",
"steps": [
{"id":"s1","status":"success","output":{"rdma_retransmits":250}},
{"id":"s2","status":"success","output":{"decision":"remediate"}},
{"id":"s3","status":"success","output":{"applied":"qos_template:reduce-be-l1","rollback_token":"rb-xyz"}},
{"id":"s4","status":"success","output":{"log_id":"log-678"}}
]
}
]
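The event-to-workflow matching that just happened can be sketched as an exact comparison on event_type. This is an assumption about how the Control Hub matches triggers (the code below is illustrative, not the platform's implementation), but it makes the most common failure mode concrete: a single-character typo in event_type means nothing fires.

```python
# Sketch of trigger matching: an incoming event fires every workflow whose
# trigger event_type matches the payload exactly (hypothetical logic).

WORKFLOWS = [
    {"name": "RDMA_AutoRemediate",
     "trigger": {"type": "event", "event_type": "rdma_issue"}},
]

def match_workflows(event):
    """Return the names of workflows triggered by this event."""
    return [
        w["name"]
        for w in WORKFLOWS
        if w["trigger"]["type"] == "event"
        and w["trigger"]["event_type"] == event.get("event_type")
    ]
```

Note that "rdma_issues" (plural) matches nothing — the exact-string comparison is why the first entry in the Common Mistakes table tells you to verify the event_type string character for character.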
Step 5: Review Logs and Rollback Safety
What we are doing: Inspect the log entry and verify the rollback token exists (for safe backout). This step ensures remediations are auditable and reversible.
# Fetch the log and rollback status
curl -u admin:Lab@123 -X GET https://lab.nhprep.com/api/logs/log-678
curl -u admin:Lab@123 -X GET https://lab.nhprep.com/api/config/rollback/rb-xyz
# Expected outputs
# Log
{
"log_id": "log-678",
"workflow_run": "run-20260330-01",
"message": "RDMA remediation executed",
"details": {"node":"compute-rack-2","change":"applied qos_template:reduce-be-l1"},
"timestamp":"2026-03-30T10:15:18Z"
}
# Rollback token info
{
"rollback_token": "rb-xyz",
"associated_change": "qos_template:reduce-be-l1",
"status": "ready",
"instructions": "Call rollback endpoint with token to revert"
}
What just happened: The log confirms the agent executed the remediation and created a rollback token. The rollback endpoint maintains the change snapshot (configuration before change) so the action is reversible if metrics worsen.
Real-world note: Always verify rollbacks in staging. Automated rollbacks should be conditioned on monitored metrics, not just time.
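Conditioning rollback on monitored metrics rather than time can be sketched as a simple before/after comparison. The names and the 5% tolerance below are assumptions for illustration; a production backout policy would use your own SLOs and a longer observation window.

```python
# Sketch of metric-conditioned backout: revert only if the remediation made
# the monitored metric worse (names and tolerance are illustrative).

def should_rollback(before, after, tolerance=1.05):
    """Roll back when the post-change metric exceeds pre-change by >5%."""
    return after > before * tolerance

def maybe_rollback(token, before, after, rollback_fn):
    """Invoke the rollback endpoint with the token only when metrics worsened."""
    if should_rollback(before, after):
        rollback_fn(token)  # e.g., POST the token to the rollback endpoint
        return "rolled_back"
    return "kept"
```

If retransmits climb from 250 to 300 after the QoS change, the rb-xyz token is used and the change is reverted; if they fall, the change is kept and the token simply expires with the run history.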
Verification Checklist
- Check 1: Agent registration exists — verify with `GET /api/agents`; it should show "Agentic-Troubleshooter" with status "idle".
- Check 2: Workflow active — verify `GET /api/workflows/RDMA_AutoRemediate` returns status "active" and steps s1..s4.
- Check 3: Successful run and outputs — verify `GET /api/workflows/{id}/runs` shows a "success" run with step outputs and a rollback token.
Common Mistakes
| Symptom | Cause | Fix |
|---|---|---|
| Workflow never triggers | Trigger event_type mismatched (typo) or event source not sending events | Verify event_type in workflow and the telemetry event payload; resend with exact event_type string |
| Agent reports "tool unavailable" | Tool endpoint mis-registered or network connectivity issue between runner and tool | Confirm GET /api/tools shows status "available"; fix endpoint URL or network ACLs |
| Remediation applied but metrics worsen | Change was too aggressive or wrong template selected | Use rollback token to revert; adjust thresholds and test in staging before production rollout |
| No rollback token created | Tool C did not snapshot pre-change state or lacked permissions | Ensure config tool has pre-change snapshot capability and necessary permissions; re-register Tool C with proper service account |
Key Takeaways
- Agentic workflows automate detection, diagnosis, and remediation and should be built with idempotence, safety (rollback), and auditing in mind.
- Separate decision logic (agent) from action mechanisms (tools). This separation enables central approvals and protects production systems.
- For AI/ML clusters using RDMA (RoCEv2, iWARP), correlate RDMA metrics with queue drops and port counters before making changes; small misconfigurations can have outsized impact.
- Always validate workflows in staging, ensure rollback tokens are available, and use RBAC to limit what automated agents can change.
Important: In production, replace static credentials (used here for lab clarity) with short-lived tokens and follow your organization's change management procedures.
If you completed these steps, you now have an agentic workflow that can detect RDMA-related issues and perform a safe, auditable remediation — the basis for scalable, automated network operations in AI/ML environments.