Lesson 6 of 7

Building AI Skills as a Network Engineer

Objective

This lesson teaches a practical, vendor-neutral learning path for network engineers who want to add AI/ML skills to their toolkit. You will set up a reproducible local environment, run a small ML experiment, learn how to deploy a simple model service, and map the networking concerns that matter when AI workloads move from a laptop to production. This matters in production because AI workloads introduce high-bandwidth, low-latency flows (and sometimes RDMA) that change how you design fabrics, QoS, and monitoring in a data center.

Introduction

In this lesson we will not change router or L2/L3 configs in the lab topology from Lesson 1; instead we will build the human and technical foundation you need to work with AI/ML workloads. You will create a local, reproducible Python environment, run a basic ML training job, expose the model as a service, and record the experiment for later auditing. In production networks, these steps become part of on-prem AI pipelines where compute, storage, and the network must be designed to meet throughput and latency requirements.

Quick Recap

Reference the topology and devices already deployed in Lesson 1. This lesson adds no new network devices or IP addresses. All work is done from your workstation or a lab VM that is already connected to the existing lab network. No change to network interfaces or IP addressing is required for the exercises in this lesson.

Device Table

Device | Role | IP / Notes
Lab VM / Workstation | AI/ML client and experiment host | No new IPs; use existing workstation connected to the Lesson 1 topology
Repository / Archive | Experiment and artifact storage (optional) | Use lab.nhprep.com host if you have a remote git server; otherwise local only

Tip: Keep your experiment VM on the same rack network as target compute when you later test performance — data locality matters.

Key Concepts (before hands-on)

  • AI/ML Pipeline Stages — Training, Validation, and Inference. Training needs sustained compute and heavy data movement; inference is latency-sensitive. Know which stage you are targeting because networking and observability requirements differ.
  • Data Locality and Bandwidth — Large datasets and model checkpoints drive high egress/ingress rates. In production, fabrics for training often prioritize throughput and congestion control (e.g., RoCE environments), whereas inference often prioritizes low latency and QoS.
  • RDMA and RoCEv2 — RDMA bypasses the OS stack and reduces CPU overhead, enabling higher effective throughput and lower latency. When RDMA is used, standard TCP/IP behaviors (retransmit, congestion) change; network hardware must support lossless behavior and proper ECN/PAUSE handling.
  • Experiment Reproducibility — Use virtual environments, explicit package versions, and version control for code and data manifests. Reproducible runs make it possible to troubleshoot whether a network issue or compute configuration is causing degraded ML performance.

Real-world analogy: Think of training as multiple trucks delivering crates (data) to a processing plant (GPU racks). If the roads (network) are congested or unreliable, the processing lines stall; if you move the plant closer to warehouses (data locality), throughput improves.

Steps (Hands-on)

Follow the steps below. Each step includes the commands you run, why they matter, and a verification command with expected output.

Step 1: Prepare a reproducible Python workspace

What we are doing: Create a workspace directory, initialize a Python virtual environment, and install core ML packages. This isolates dependencies so future experiments are reproducible and avoids polluting system Python.

mkdir -p ~/nhprep_ai_lab
cd ~/nhprep_ai_lab
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install numpy pandas scikit-learn flask

What just happened:

  • mkdir and cd created and moved into your workspace.
  • python3 -m venv venv created an isolated Python environment where packages will be installed locally.
  • source venv/bin/activate activated that environment so pip and python refer to it.
  • pip install installed deterministic packages used for a simple ML experiment and a tiny web service for inference.

Real-world note: In production, teams use container images with pinned package versions instead of plain venv for stronger reproducibility and deployment consistency.
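That container approach can be sketched concretely. The Dockerfile below is illustrative only (base image tag and file layout are assumptions, not part of this lab); it assumes you first captured pinned versions with pip freeze > requirements.txt from the active venv:

```dockerfile
# Illustrative sketch, not a tested lab artifact.
FROM python:3.9-slim

WORKDIR /app

# Pin exact versions so every build resolves the same dependency set
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the experiment code produced in Steps 2 and 3
COPY train_model.py serve_model.py ./
CMD ["python", "serve_model.py"]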

Verify:

python -c "import sys, numpy, pandas, sklearn; print('py', sys.version.split()[0], 'numpy', numpy.__version__, 'pandas', pandas.__version__, 'sklearn', sklearn.__version__)"

Expected output (example; actual versions may vary):

py 3.9.13 numpy 1.24.3 pandas 2.0.1 sklearn 1.3.0

Step 2: Create and run a small ML training script

What we are doing: Create a small script that trains a scikit-learn classifier on a synthetic dataset. This teaches the full local cycle: data → train → evaluate and gives you a baseline to measure compute and network behavior when scaled up.

cat > train_model.py <<'PY'
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import joblib
X, y = make_classification(n_samples=5000, n_features=20, n_informative=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
joblib.dump(clf, 'model.joblib')
print('Accuracy:', acc)
PY

python train_model.py

What just happened:

  • The script synthesized a dataset to avoid external data dependencies (important for reproducibility).
  • We trained a RandomForest model and saved it to model.joblib.
  • This models a simple training workload; in production you would replace it with a distributed training framework and a real dataset.

Real-world note: Synthetic runs let you check end-to-end pipeline behavior (e.g., storage throughput and model save/load) without pulling large datasets across the network.
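To turn this run into a measurable baseline, you can time the training phase and record the checkpoint size. The helper below is a generic sketch: train_fn and artifact_path are placeholders, and the fake_train demo stands in for the real train_model.py workload.

```python
import os
import tempfile
import time


def measure_run(train_fn, artifact_path):
    """Time a zero-argument training callable and report the size of the
    artifact it writes. In the lab you would wrap the body of
    train_model.py; both argument names are placeholders for this sketch."""
    start = time.perf_counter()
    train_fn()
    elapsed = time.perf_counter() - start
    return {
        "train_seconds": round(elapsed, 3),
        "artifact_bytes": os.path.getsize(artifact_path),
    }


if __name__ == "__main__":
    # Stand-in "training" step that just writes a 1 KiB file.
    demo_path = os.path.join(tempfile.gettempdir(), "demo_model.bin")

    def fake_train():
        with open(demo_path, "wb") as f:
            f.write(b"\x00" * 1024)

    print(measure_run(fake_train, demo_path))
```

Recording these two numbers per run gives you a compute/storage baseline to compare against once data starts crossing the network.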

Verify:

ls -l model.joblib
python -c "import joblib; m=joblib.load('model.joblib'); print(type(m))"

Expected output:

-rw-r--r-- 1 youruser yourgroup 123456 Apr  2 12:34 model.joblib
<class 'sklearn.ensemble._forest.RandomForestClassifier'>

Step 3: Expose the model as a simple inference service

What we are doing: Using Flask (installed in Step 1), create a minimal REST endpoint that loads the model and serves predictions. This mirrors how inference is exposed in production (though production systems add containers, autoscaling, and load balancers).

cat > serve_model.py <<'PY'
from flask import Flask, request, jsonify
import joblib
import numpy as np
app = Flask(__name__)
model = joblib.load('model.joblib')
@app.route('/predict', methods=['POST'])
def predict():
    data = request.json.get('data')
    arr = np.array(data)
    preds = model.predict(arr).tolist()
    return jsonify({'predictions': preds})
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
PY

python serve_model.py &

What just happened:

  • serve_model.py spins up a Flask app on port 5000 that accepts JSON arrays and returns predictions.
  • The app loads model.joblib so the model is served in-process.

Real-world note: In production, the Flask app would be containerized and fronted by a reverse proxy or ingress; you would also add authentication, logging, and health checks.
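Hand-typing 20 feature values into a curl command is error-prone. A small stdlib-only helper can validate and serialize the request body for you; this is a sketch, and the default feature count simply mirrors the make_classification(n_features=20) call in train_model.py:

```python
import json


def build_payload(rows, n_features=20):
    """Validate row widths and serialize a request body for /predict.

    n_features defaults to 20 to match train_model.py in this lab;
    adjust it if your model expects a different input width.
    """
    for row in rows:
        if len(row) != n_features:
            raise ValueError(f"expected {n_features} features, got {len(row)}")
    return json.dumps({"data": [[float(v) for v in row] for row in rows]})


if __name__ == "__main__":
    print(build_payload([[0.1] * 20]))
```

You can substitute the printed body for the hand-typed JSON literal in the curl verification command below.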

Verify:

curl -s -H "Content-Type: application/json" -d '{"data":[[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0,0,0,0,0,0,0,0,0,0,0]]}' http://127.0.0.1:5000/predict

Expected output:

{"predictions":[0]}

Step 4: Version control and artifact manifest

What we are doing: Initialize a local git repository to track code and create a simple experiment manifest that records versions and environment info. This is essential for reproducibility and for correlating network/compute issues with a particular run.

git init
git add train_model.py serve_model.py model.joblib
git commit -m "Initial ml experiment: train + serve"
python -c "import json, platform, sklearn; print(json.dumps({'python': platform.python_version(), 'sklearn': sklearn.__version__}))" > experiment_manifest.json
git add experiment_manifest.json
git commit -m "Add experiment manifest"

What just happened:

  • You created a git repository and committed the code and artifact manifest.
  • experiment_manifest.json records runtime versions so you (or another engineer) can reproduce the same environment later.

Real-world note: Many organizations centralize artifacts in a remote git or artifact store; if you push to a remote, ensure credentials and network access comply with corporate policy. Use lab.nhprep.com if your organization provides an internal git server.
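One way to make the manifest more useful is to fingerprint the saved model, so a later run can confirm it is comparing against the same artifact. The sketch below hashes arbitrary files with the standard library; the manifest fields beyond the Python version are illustrative additions, not part of the lab's minimal manifest:

```python
import hashlib
import json
import os
import platform
import tempfile


def build_manifest(artifact_paths):
    """Return runtime info plus a SHA-256 digest for each artifact file."""
    manifest = {"python": platform.python_version(), "artifacts": {}}
    for path in artifact_paths:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            # Hash in chunks so large checkpoints don't need to fit in memory
            for chunk in iter(lambda: f.read(65536), b""):
                digest.update(chunk)
        manifest["artifacts"][path] = digest.hexdigest()
    return manifest


if __name__ == "__main__":
    # In the lab you would pass ["model.joblib"]; a temp file keeps
    # this sketch self-contained.
    demo = os.path.join(tempfile.gettempdir(), "demo_artifact.bin")
    with open(demo, "wb") as f:
        f.write(b"demo")
    print(json.dumps(build_manifest([demo]), indent=2))
```

Committing the digest alongside the manifest lets another engineer verify they pulled the exact model you trained.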

Verify:

git log --oneline
cat experiment_manifest.json

Expected output:

<commit-hash-2> Add experiment manifest
<commit-hash-1> Initial ml experiment: train + serve
{"python": "3.9.13", "sklearn": "1.3.0"}

Step 5: Network-aware checklist for scaling experiments

What we are doing: Run simple local networking checks and note the network requirements you'll need when scaling. We measure latency to a remote data node and list the items to evaluate for production (no privileged system changes required here).

ping -c 3 data-node.lab.nhprep.com
traceroute -m 10 data-node.lab.nhprep.com || true

What just happened:

  • ping gives basic reachability and RTTs to a data node.
  • traceroute shows the network path; use this to confirm traffic follows the expected spine/super-spine elements and does not cross unexpected network domains.

Real-world note: For training, you’ll want to measure sustained throughput (iperf or RDMA tests) and check for packet drops and congestion; for RoCEv2 environments, you’ll coordinate with server and switch teams to enable lossless transport.

Verify: Example expected output for ping (IP and host are illustrative):

PING data-node.lab.nhprep.com (10.1.2.50) 56(84) bytes of data.
64 bytes from 10.1.2.50: icmp_seq=1 ttl=64 time=0.732 ms
64 bytes from 10.1.2.50: icmp_seq=2 ttl=64 time=0.679 ms
64 bytes from 10.1.2.50: icmp_seq=3 ttl=64 time=0.689 ms

--- data-node.lab.nhprep.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 0.679/0.700/0.732/0.020 ms

Example expected traceroute excerpt:

traceroute to data-node.lab.nhprep.com (10.1.2.50), 10 hops max, 60 byte packets
 1  10.0.0.1 (10.0.0.1)  0.269 ms  0.241 ms  0.216 ms
 2  10.0.1.1 (10.0.1.1)  0.345 ms  0.329 ms  0.317 ms
 3  10.1.2.50 (10.1.2.50)  0.689 ms  0.673 ms  0.660 ms

Warning: When you scale to real datasets and distributed training, measure sustained throughput (not just RTT) and confirm fabric features (lossless vs lossy) before enabling RDMA.
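Before running iperf, a quick back-of-envelope calculation sets expectations: divide the payload size by the usable link throughput. The 80% efficiency factor below is a rough assumption standing in for protocol overhead and congestion, not a measured value:

```python
def transfer_seconds(size_gb, link_gbps, efficiency=0.8):
    """Estimate wall-clock time to move size_gb gigabytes over a link.

    efficiency is an assumed fraction of line rate actually achieved;
    replace it with your own iperf measurements when you have them.
    """
    size_gbits = size_gb * 8  # bytes-based size to bits on the wire
    return size_gbits / (link_gbps * efficiency)


if __name__ == "__main__":
    # e.g., a 500 GB dataset over a 25 Gbps link at 80% efficiency
    print(f"{transfer_seconds(500, 25):.0f} s")  # prints: 200 s
```

If the estimate says a checkpoint sync takes minutes rather than seconds, that is a signal to revisit data locality before blaming the fabric.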

Verification Checklist

  • Check 1: Virtual environment active — verify with python -c "import sys; print(sys.prefix)" and ensure it points to ~/nhprep_ai_lab/venv.
  • Check 2: Model artifact exists and loads — verify ls -l model.joblib and python -c "import joblib; print(type(joblib.load('model.joblib')))".
  • Check 3: Inference endpoint responds — verify with the curl request to http://127.0.0.1:5000/predict and expect JSON with predictions.
  • Check 4: Experiment manifest committed — verify with git log --oneline and cat experiment_manifest.json.

Common Mistakes

Symptom | Cause | Fix
Virtualenv packages not found | Not activating the venv (source venv/bin/activate) | Activate the venv before running pip/python; verify that which python points into the venv
Model file not found when serving | model.joblib not in current working directory or not committed | Confirm ls -l model.joblib; ensure the server process starts from the project root or use an absolute path
Inference endpoint times out | Firewall or process bound to wrong interface | Ensure the Flask app uses host='0.0.0.0' and check local firewall rules; curl localhost to test
Reproduced results differ across runs | Packages/versions differ or random seeds not set | Record package versions in the manifest, set random seeds, or use deterministic framework settings

Key Takeaways

  • A reproducible local environment (venv or container) and an experiment manifest are the foundation for debugging whether performance problems are compute-, storage-, or network-related.
  • AI workloads have distinct networking demands: training favors throughput and sometimes lossless transport (RDMA), while inference favors low latency and strict QoS. Understand which stage you are supporting.
  • Start small with synthetic datasets and local services to learn the full lifecycle: train → save → serve → monitor. This isolates network variables before you scale to distributed training.
  • Track code, artifacts, and environment metadata in version control so you can correlate observed network metrics with specific experiments and reproduce behavior in production.

Important: As you move from local experiments to on-prem or cloud deployments, collaborate early with server and network teams about fabric behavior (e.g., ECN, PFC, and congestion management) — AI/ML workloads can reveal subtle issues in network design.


This lesson focused on building practical skills and a workflow for network engineers to begin running and troubleshooting AI/ML experiments. Practice these steps until you can reproduce runs consistently; in the next lesson we will map these experiments onto production fabric considerations and test RDMA-aware workloads.