15.1 Network Operations

The previous part covered security: threats, segmentation, IEC 62443, and hardening. A hardened network still requires ongoing operations to stay reliable. This chapter covers the processes and tools that keep a network running: documentation, change management, monitoring, and disaster recovery.

A network without documentation takes longer to troubleshoot. A network without change management suffers unplanned outages. A network without monitoring discovers problems only when users complain. Operations processes prevent these failures.

Good documentation is the foundation of network operations. Without it, troubleshooting takes longer, changes introduce errors, and institutional knowledge is lost when staff leave.

| Document | Purpose |
| --- | --- |
| Physical diagram | Shows device locations, cables, rack positions |
| Logical diagram | Shows IP addresses, VLANs, routing |
| Cable map | Documents port-to-port connections |
| IPAM (IP Address Management) | Tracks IP assignments, subnets, DHCP/DNS |
| Asset inventory | Records model, serial number, firmware, location |

Manual asset inventories go stale as soon as a device is replaced. The following script queries each switch via SNMP and writes a CSV inventory report that includes the switch's IP address, hostname, system description (which typically carries the firmware version), and uptime.

# Written against the synchronous hlapi of pysnmp 4.x; later releases may expose a different API.
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity
)
import csv
import sys

# Scalar objects from the system group, listed in report column order.
OIDS = {
    "sysName": "1.3.6.1.2.1.1.5.0",
    "sysDescr": "1.3.6.1.2.1.1.1.0",
    "sysUpTime": "1.3.6.1.2.1.1.3.0",
    "sysContact": "1.3.6.1.2.1.1.4.0",
    "sysLocation": "1.3.6.1.2.1.1.6.0",
}


def snmp_get(host, community, oid):
    """Return the value of one OID as a string, or "N/A" on an SNMP error."""
    it = getCmd(SnmpEngine(), CommunityData(community, mpModel=1),  # mpModel=1 selects SNMPv2c
                UdpTransportTarget((host, 161), timeout=2, retries=1),
                ContextData(), ObjectType(ObjectIdentity(oid)))
    err_ind, err_stat, _, var_binds = next(it)
    if err_ind or err_stat:
        return "N/A"
    return str(var_binds[0][1])


def ticks_to_days(ticks_str):
    """Convert sysUpTime (hundredths of a second) to days; pass "N/A" through."""
    try:
        return f"{int(ticks_str) / 8640000:.1f} days"
    except ValueError:
        return ticks_str


switches = ["192.168.50.1", "192.168.50.2", "192.168.50.3",
            "192.168.50.4", "192.168.50.5"]
community = "public"  # example value; use the community string configured on your switches

writer = csv.writer(sys.stdout)
# One column per entry in OIDS, after the IP address.
writer.writerow(["IP", "Hostname", "Description", "Uptime", "Contact", "Location"])
for host in switches:
    row = [host]
    for name, oid in OIDS.items():
        val = snmp_get(host, community, oid)
        if name == "sysUpTime":
            val = ticks_to_days(val)
        row.append(val)
    # Write the full row; truncating it to five fields would put sysContact
    # under the "Location" header and drop sysLocation.
    writer.writerow(row)

Run this script weekly, redirect the CSV output to a file, and commit that file to version control. Comparing consecutive reports reveals firmware changes, reboots (uptime resets), and hostname misconfigurations.
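The comparison of consecutive reports can itself be scripted. The following is a minimal sketch, assuming each report is a CSV with a header row that contains at least `IP` and `Uptime` columns and uptime values formatted like `12.3 days`. The function names `load_report`, `uptime_days`, and `compare_reports` are illustrative, not part of any library.

```python
import csv
import io


def load_report(csv_text):
    """Parse an inventory CSV into a dict keyed by IP address."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["IP"]: row for row in reader}


def uptime_days(value):
    """Extract the number from a string like '12.3 days'; None if unparsable."""
    try:
        return float(value.split()[0])
    except (ValueError, IndexError, AttributeError):
        return None


def compare_reports(old_text, new_text):
    """Return human-readable findings for two consecutive inventory reports."""
    old, new = load_report(old_text), load_report(new_text)
    findings = []
    for ip in sorted(old.keys() | new.keys()):
        if ip not in new:
            findings.append(f"{ip}: missing from new report")
            continue
        if ip not in old:
            findings.append(f"{ip}: new device")
            continue
        for field, old_val in old[ip].items():
            new_val = new[ip].get(field)
            if field == "Uptime":
                # Uptime always changes; only a decrease suggests a reboot.
                before, after = uptime_days(old_val), uptime_days(new_val)
                if before is not None and after is not None and after < before:
                    findings.append(f"{ip}: rebooted (uptime {old_val} -> {new_val})")
            elif old_val != new_val:
                findings.append(f"{ip}: {field} changed: {old_val!r} -> {new_val!r}")
    return findings
```

Treating uptime separately keeps the normal week-to-week increase out of the findings, so only reboots and real field changes are reported.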

Change management controls modifications to the network to prevent unplanned outages. Every change follows a process.

| Type | Approval | Example |
| --- | --- | --- |
| Standard | Pre-approved (follows a template) | Adding a new access port to an existing VLAN |
| Normal | Requires CAB review | Adding a new VLAN, changing firewall rules |
| Emergency | Expedited approval (post-implementation review) | Restoring service after an outage |

A CAB (Change Advisory Board) reviews normal changes. The CAB includes the network engineer, the OT manager, and a representative from operations. The CAB evaluates the risk, the rollback plan, and the maintenance window.

Every change request includes a rollback plan that answers:

  • What specific commands or actions reverse the change?
  • How long does the rollback take?
  • What is the verification step after rollback?
  • Who has the authority to initiate rollback?

A change without a tested rollback plan does not get approved.
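The rollback checklist above can be encoded as a simple pre-approval check. This is a sketch only: the class names, field names, and the `APPROVAL_PATHS` mapping are hypothetical and simply mirror the change types and rollback questions described in this section. It is not taken from any change-management tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Approval route per change type, as described in the table above.
APPROVAL_PATHS = {
    "standard": "pre-approved template",
    "normal": "CAB review",
    "emergency": "expedited approval with post-implementation review",
}


@dataclass
class RollbackPlan:
    steps: List[str] = field(default_factory=list)  # commands or actions that reverse the change
    duration_minutes: Optional[int] = None          # how long the rollback takes
    verification: str = ""                          # check performed after rollback
    authorized_by: str = ""                         # who may initiate rollback
    tested: bool = False                            # has the plan been rehearsed?

    def gaps(self) -> List[str]:
        """List every rollback question that is still unanswered."""
        gaps = []
        if not self.steps:
            gaps.append("no rollback steps")
        if self.duration_minutes is None:
            gaps.append("no rollback duration")
        if not self.verification.strip():
            gaps.append("no verification step")
        if not self.authorized_by.strip():
            gaps.append("no rollback authority named")
        if not self.tested:
            gaps.append("rollback plan not tested")
        return gaps


@dataclass
class ChangeRequest:
    summary: str
    change_type: str  # "standard", "normal", or "emergency"
    rollback: RollbackPlan

    def approval_path(self) -> str:
        return APPROVAL_PATHS[self.change_type]

    def ready_for_approval(self) -> bool:
        """A request is ready only when the rollback plan has no gaps."""
        return not self.rollback.gaps()
```

A request whose `rollback.gaps()` list is non-empty would be sent back to the requester before it reaches the CAB.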

HiOS has two configuration stores: running config (active, in RAM) and startup config (saved, in flash). Changes take effect immediately but are lost on reboot unless saved. A baseline configuration is a known-good configuration used as a reference.

The following script uses netmiko to connect to all switches via SSH, download the running configuration, and save it to a file. Run it nightly via cron and store the output in Git for version history.

import os
from datetime import date
from pathlib import Path

from netmiko import ConnectHandler

# Read the password from the environment so it never ends up in the script
# or in Git. SWITCH_PASSWORD is an example variable name.
PASSWORD = os.environ["SWITCH_PASSWORD"]
HOSTS = ["192.168.50.1", "192.168.50.2"]

backup_dir = Path(f"backups/{date.today()}")
backup_dir.mkdir(parents=True, exist_ok=True)

for host in HOSTS:
    device = {"host": host, "device_type": "generic", "username": "admin",
              "password": PASSWORD, "port": 22}
    try:
        # The context manager closes the SSH session even if a command fails.
        with ConnectHandler(**device) as conn:
            config = conn.send_command("show running-config")
            hostname = conn.send_command("show system info").split("\n")[0]
        filename = backup_dir / f"{host}.cfg"
        filename.write_text(config)
        print(f"OK: {host} ({hostname}) -> {filename}")
    except Exception as e:
        # Report the failure and continue with the next switch.
        print(f"FAIL: {host}: {e}")
Compare the current configuration to the baseline to detect unauthorized changes. After running the backup script, use diff to compare today’s backup to the baseline:

diff backups/baseline/192.168.50.1.cfg backups/2026-04-29/192.168.50.1.cfg

Any difference indicates a configuration change. If the change was not documented in a change request, investigate immediately.
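The same comparison can run inside the nightly job with Python's standard `difflib` module. The sketch below is one possible approach, not part of the backup script above. The function name `config_drift` is illustrative, and skipping lines that start with `!` is an assumption (many CLI configurations use that prefix for comments and timestamps); adjust it to what your devices emit.

```python
import difflib


def config_drift(baseline_text, current_text, ignore_prefixes=("!",)):
    """Return unified-diff lines between two configs; empty list means no drift."""

    def normalize(text):
        # Drop blank lines, trailing whitespace, and comment-style lines
        # so cosmetic differences do not trigger an investigation.
        return [
            line.rstrip()
            for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith(ignore_prefixes)
        ]

    return list(difflib.unified_diff(
        normalize(baseline_text),
        normalize(current_text),
        fromfile="baseline",
        tofile="current",
        lineterm="",
    ))
```

A caller would read both files with `Path.read_text()`, pass the contents to `config_drift`, and raise an alert when the returned list is not empty.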

| Method | What It Measures | Granularity | Use Case |
| --- | --- | --- | --- |
| SNMP polling | Interface counters (bytes, errors, drops) | Per-interface, per-poll-interval | Bandwidth utilization, error trends |
| NetFlow/sFlow | Per-flow records (src/dst IP, port, bytes) | Per-flow | Traffic analysis, anomaly detection |
| Packet capture | Full packet content | Per-packet | Protocol debugging, deep inspection |

SNMP polling answers “how much traffic is on this interface?” Flow data answers “who is talking to whom?” Packet capture answers “what are they saying?”

The following script polls one interface on one switch every 10 seconds and prints an alert when inbound or outbound utilization exceeds a threshold. It calculates utilization from the difference between consecutive counter readings.

# Written against the synchronous hlapi of pysnmp 4.x; later releases may expose a different API.
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity
)
import time


def snmp_get_int(host, community, oid):
    """Return one counter value as an int, or None if the query fails."""
    it = getCmd(SnmpEngine(), CommunityData(community, mpModel=1),  # mpModel=1 selects SNMPv2c
                UdpTransportTarget((host, 161), timeout=2, retries=1),
                ContextData(), ObjectType(ObjectIdentity(oid)))
    err_ind, err_stat, _, var_binds = next(it)
    if err_ind or err_stat:
        # Returning 0 here would look like a counter reset and distort the rate.
        return None
    return int(var_binds[0][1])


HOST = "192.168.50.1"
COMMUNITY = "public"  # example value
IFACE_INDEX = 1
INTERVAL = 10  # seconds between polls
THRESHOLD_PERCENT = 80
SPEED_BPS = 1_000_000_000  # 1 Gbps

oid_in = f"1.3.6.1.2.1.31.1.1.1.6.{IFACE_INDEX}"    # ifHCInOctets
oid_out = f"1.3.6.1.2.1.31.1.1.1.10.{IFACE_INDEX}"  # ifHCOutOctets

prev_in = snmp_get_int(HOST, COMMUNITY, oid_in)
prev_out = snmp_get_int(HOST, COMMUNITY, oid_out)
prev_time = time.monotonic()

while True:
    time.sleep(INTERVAL)
    curr_in = snmp_get_int(HOST, COMMUNITY, oid_in)
    curr_out = snmp_get_int(HOST, COMMUNITY, oid_out)
    now = time.monotonic()
    if None in (prev_in, prev_out, curr_in, curr_out):
        print(f"Port {IFACE_INDEX}: SNMP query failed, skipping this interval")
    else:
        # Divide by the measured interval, which includes SNMP round-trip time.
        elapsed = now - prev_time
        bps_in = (curr_in - prev_in) * 8 / elapsed
        bps_out = (curr_out - prev_out) * 8 / elapsed
        util_in = bps_in / SPEED_BPS * 100
        util_out = bps_out / SPEED_BPS * 100
        print(f"Port {IFACE_INDEX}: IN={util_in:.1f}% OUT={util_out:.1f}%")
        if util_in > THRESHOLD_PERCENT or util_out > THRESHOLD_PERCENT:
            print(f"  ALERT: utilization exceeds {THRESHOLD_PERCENT}%")
    prev_in, prev_out, prev_time = curr_in, curr_out, now

This script uses the 64-bit HC (High Capacity) counters from the IF-MIB, which in practice do not wrap on high-speed interfaces. The 32-bit counters wrap about every 34 seconds on a 1 Gbps link at full utilization (2^32 bytes × 8 bits ÷ 10^9 bit/s ≈ 34.4 s).
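The wrap interval follows directly from the counter width and the link speed. The sketch below works through that arithmetic and shows one common way to compute a delta across a single wrap when only 32-bit counters are available. The function names are illustrative, and the modulo trick assumes the counter wrapped at most once between two polls.

```python
def wrap_seconds(counter_bits, link_bps):
    """Seconds until an octet counter of this width wraps at full line rate."""
    return (2 ** counter_bits) * 8 / link_bps


def counter_delta(prev, curr, counter_bits=32):
    """Octets counted between two readings, tolerating at most one wrap."""
    return (curr - prev) % (2 ** counter_bits)


SECONDS_PER_YEAR = 365 * 24 * 3600

print(f"32-bit at 1 Gbps: {wrap_seconds(32, 1e9):.1f} s")
print(f"64-bit at 1 Gbps: {wrap_seconds(64, 1e9) / SECONDS_PER_YEAR:.0f} years")
```

With a 10-second poll interval a 32-bit counter on a saturated 1 Gbps link wraps at most once per interval, so `counter_delta` still yields a correct rate. At longer intervals or higher speeds the 64-bit counters are the safer choice.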

| Term | Definition |
| --- | --- |
| RPO (Recovery Point Objective) | Maximum acceptable data loss, measured in time |
| RTO (Recovery Time Objective) | Maximum acceptable time to restore service |
| MTTR (Mean Time to Repair) | Average time to restore a failed component |
| MTBF (Mean Time Between Failures) | Average time between failures |

| Architecture | How It Works | RTO | Cost |
| --- | --- | --- | --- |
| Cold site | Empty facility with power and cooling. Equipment shipped after disaster. | Days | Low |
| Warm site | Facility with equipment installed but not running. Data restored from backup. | Hours | Medium |
| Hot site | Facility with equipment running and data replicated. Manual switchover. | Minutes | High |
| Active-passive | Standby system receives data replication. Automatic or manual failover. | Minutes | High |
| Active-active | Both systems handle traffic simultaneously. Load balanced. No failover needed. | Seconds | Highest |
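MTBF and MTTR are often combined into a steady-state availability estimate, MTBF / (MTBF + MTTR). The sketch below applies that widely used formula. The function names and the sample figures (10,000 hours between failures, 4 hours to repair) are made-up illustration values, not measurements from any device.

```python
HOURS_PER_YEAR = 8760  # 365 days


def availability(mtbf_hours, mttr_hours):
    """Fraction of time a component is expected to be in service."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(mtbf_hours, mttr_hours):
    """Expected hours out of service per year for the same component."""
    return (1 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


a = availability(10_000, 4)
print(f"Availability: {a:.4%}")
print(f"Expected downtime: {downtime_hours_per_year(10_000, 4):.1f} h/year")
```

Halving MTTR, for example by keeping a tested configuration backup and a spare switch on site, roughly halves the expected downtime without changing how often the component fails.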

For OT networks, the MRP ring topology provides active-passive redundancy at the network level (sub-200 ms failover). At the application layer, redundant SCADA servers in active-passive mode provide historian and HMI continuity.

A backup that has never been tested is not a backup. It is a hope. Test backups by:

  1. Restoring a switch configuration from backup to a lab switch quarterly
  2. Verifying the restored configuration matches the baseline
  3. Documenting the restore procedure and time
  4. Including backup restoration in the disaster recovery drill
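The results of each restore test can be recorded and checked against the RTO target. The following is a minimal sketch, assuming the restore time is measured by hand or by the drill script. The `RestoreDrill` fields and the `evaluate_drill` function are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RestoreDrill:
    device: str                    # lab switch the backup was restored to
    restore_minutes: float         # measured time from start to verified restore
    config_matches_baseline: bool  # result of comparing restored config to baseline


def evaluate_drill(drill: RestoreDrill, rto_minutes: float) -> List[str]:
    """Return the problems found in one restore drill; empty list means pass."""
    problems = []
    if not drill.config_matches_baseline:
        problems.append(f"{drill.device}: restored configuration differs from baseline")
    if drill.restore_minutes > rto_minutes:
        problems.append(
            f"{drill.device}: restore took {drill.restore_minutes:g} min, "
            f"RTO target is {rto_minutes:g} min"
        )
    return problems
```

Keeping these records alongside the backups gives the disaster recovery drill a history of measured restore times to compare against.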

Automate inventory and backups

Schedule the SNMP inventory script weekly and the netmiko backup script nightly. Manual documentation goes out of date as soon as a device changes.

Every change needs a rollback plan

The CAB reviews risk and rollback before approving. A change without a tested rollback plan does not get approved.

Test your backups

A backup that has never been restored is not a backup. Test quarterly. Measure restore time. That is your real RTO.

Operations keep the network running. When something breaks despite good operations, structured troubleshooting finds the root cause. The next chapter covers the CompTIA 7-step troubleshooting methodology, OSI-based diagnostics, and a Python toolkit for automating common troubleshooting tasks.

  • CompTIA Network+ N10-009 Exam Objectives, Domain 3: Network Operations
  • RFC 3411 — An Architecture for Describing SNMP Management Frameworks (IETF, 2002)
  • ITIL Foundation: ITIL 4 Edition (AXELOS, 2019)
  • RFC 2863 — The Interfaces Group MIB (IETF, 2000)