15.1 Network Operations

The previous part covered security: threats, segmentation, IEC 62443, and hardening. A hardened network still requires ongoing operations to stay reliable. This chapter covers the processes and tools that keep a network running: documentation, change management, monitoring, and disaster recovery.

Why Operations Matter

A network without documentation takes longer to troubleshoot. A network without change management suffers unplanned outages. A network without monitoring discovers issues only when users complain. Operations processes address these gaps.

Network Documentation

Good documentation is the foundation of network operations. Without documentation, troubleshooting takes longer, changes introduce errors, and institutional knowledge is lost when staff leave.

Document	Purpose
Physical diagram	Shows device locations, cables, rack positions
Logical diagram	Shows IP addresses, VLANs, routing
Cable map	Documents port-to-port connections
IPAM (IP Address Management)	Tracks IP assignments, subnets, DHCP/DNS
Asset inventory	Records model, serial number, firmware, location

Automated Network Inventory

Manual asset inventories become outdated the moment a device is replaced. The following script queries switches via SNMP and writes a CSV inventory report with IP address, hostname, system description (which on many devices includes the firmware version), uptime, and location.

from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity
)
import csv, sys

OIDS = {
    "sysName":  "1.3.6.1.2.1.1.5.0",
    "sysDescr": "1.3.6.1.2.1.1.1.0",
    "sysUpTime":"1.3.6.1.2.1.1.3.0",
    "sysContact":"1.3.6.1.2.1.1.4.0",
    "sysLocation":"1.3.6.1.2.1.1.6.0",
}

def snmp_get(host, community, oid):
    it = getCmd(SnmpEngine(), CommunityData(community, mpModel=1),
                UdpTransportTarget((host, 161), timeout=2, retries=1),
                ContextData(), ObjectType(ObjectIdentity(oid)))
    err_ind, err_stat, _, var_binds = next(it)
    if err_ind or err_stat:
        return "N/A"
    return str(var_binds[0][1])

def ticks_to_days(ticks_str):
    try:
        return f"{int(ticks_str) / 8640000:.1f} days"
    except ValueError:
        return ticks_str

switches = ["192.168.50.1", "192.168.50.2", "192.168.50.3",
            "192.168.50.4", "192.168.50.5"]
community = "public"

writer = csv.writer(sys.stdout)
writer.writerow(["IP", "Hostname", "Description", "Uptime", "Location"])
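# Query only the OIDs that match the CSV header, in header order.
# (Slicing the full OIDS dict would place sysContact in the Location column.)
COLUMNS = ["sysName", "sysDescr", "sysUpTime", "sysLocation"]

for host in switches:
    row = [host]
    for name in COLUMNS:
        val = snmp_get(host, community, OIDS[name])
        if name == "sysUpTime":
            # sysUpTime is reported in hundredths of a second
            val = ticks_to_days(val)
        row.append(val)
    writer.writerow(row)  # IP, Name, Descr, Uptime, Location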

Run this script weekly and store the output in version control. Comparing consecutive reports reveals firmware changes, reboots (uptime resets), and hostname misconfigurations.
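The comparison of consecutive reports can itself be scripted. The following is a minimal sketch, assuming both reports use the CSV header written by the inventory script above (IP, Hostname, Description, Uptime, Location) and that uptime is formatted as "N.N days". The function names and the sample rows are hypothetical and only illustrate the idea.

```python
import csv
import io

def load_report(text):
    """Parse an inventory CSV (header row included) into {ip: row_dict}."""
    return {row["IP"]: row for row in csv.DictReader(io.StringIO(text))}

def uptime_days(value):
    """Convert '12.3 days' to 12.3; return None for 'N/A' or other text."""
    try:
        return float(value.split()[0])
    except (ValueError, IndexError):
        return None

def compare_reports(old_text, new_text):
    """Return human-readable findings between two inventory reports."""
    old, new = load_report(old_text), load_report(new_text)
    findings = []
    for ip in sorted(old.keys() | new.keys()):
        if ip not in new:
            findings.append(f"{ip}: missing from new report")
            continue
        if ip not in old:
            findings.append(f"{ip}: new device")
            continue
        # A changed description often means a firmware change.
        for field in ("Hostname", "Description"):
            if old[ip][field] != new[ip][field]:
                findings.append(
                    f"{ip}: {field} changed: {old[ip][field]!r} -> {new[ip][field]!r}")
        # Uptime that went down between reports means the device rebooted.
        before = uptime_days(old[ip]["Uptime"])
        after = uptime_days(new[ip]["Uptime"])
        if before is not None and after is not None and after < before:
            findings.append(f"{ip}: rebooted (uptime {before} -> {after} days)")
    return findings

# Made-up sample data for illustration only.
HEADER = "IP,Hostname,Description,Uptime,Location\n"
last_week = HEADER + "192.168.50.1,sw1,firmware 1.0,30.0 days,Rack A\n"
this_week = HEADER + "192.168.50.1,sw1,firmware 1.1,0.5 days,Rack A\n"

for finding in compare_reports(last_week, this_week):
    print(finding)
```

In practice the two texts would be read from the report files committed to version control, for example with `Path(...).read_text()`.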

Change Management

Change management controls modifications to the network and helps reduce unplanned outages. Every change follows a process.

Change Types

Type	Approval	Example
Standard	Pre-approved (follows a template)	Adding a new access port to an existing VLAN
Normal	Requires CAB review	Adding a new VLAN, changing firewall rules
Emergency	Expedited approval (post-implementation review)	Restoring service after an outage

A CAB (Change Advisory Board) reviews normal changes. The CAB includes the network engineer, the OT manager, and a representative from operations. The CAB evaluates the impact of the change, the rollback plan, and the maintenance window.

Change Management Process

Rollback Plan Requirements

Every change request includes a rollback plan that answers:

What specific commands or actions reverse the change?
How long does the rollback take?
What is the verification step after rollback?
Who has the authority to initiate rollback?

A change without a tested rollback plan does not get approved.
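One way to enforce these requirements is to record the rollback plan as structured data and reject incomplete plans before they reach the CAB. The following is a minimal sketch under that assumption. The `RollbackPlan` class, its field names, and `rollback_gaps` are hypothetical and do not come from any change management tool.

```python
from dataclasses import dataclass, field

@dataclass
class RollbackPlan:
    steps: list = field(default_factory=list)  # commands or actions that reverse the change
    duration_minutes: int = 0                   # how long the rollback takes
    verification: str = ""                      # check performed after rollback
    authorized_by: str = ""                     # who may initiate rollback
    tested: bool = False                        # has the plan been rehearsed?

def rollback_gaps(plan):
    """Return the unanswered rollback questions (empty list means complete)."""
    gaps = []
    if not plan.steps:
        gaps.append("no reversal steps")
    if plan.duration_minutes <= 0:
        gaps.append("no rollback duration")
    if not plan.verification.strip():
        gaps.append("no verification step")
    if not plan.authorized_by.strip():
        gaps.append("no rollback authority")
    if not plan.tested:
        gaps.append("plan not tested")
    return gaps

# Illustrative plan; the values are placeholders.
plan = RollbackPlan(
    steps=["restore previous configuration backup"],
    duration_minutes=10,
    verification="confirm connectivity to the affected devices",
    authorized_by="on-call network engineer",
    tested=False,
)
gaps = rollback_gaps(plan)
print("approve" if not gaps else f"reject: {', '.join(gaps)}")
```

The example plan is rejected because it has not been tested, which mirrors the rule stated above.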

Configuration Management

HiOS has two configuration stores: the running config (active, in RAM) and the startup config (saved, in flash). Changes to the running config take effect immediately but are lost on reboot unless saved to the startup config. A baseline configuration is a known-good configuration used as a reference.

Automated Configuration Backup

The following script uses netmiko to connect to switches via SSH, download the running configuration, and save the configuration to a file. Run the script nightly via cron and store the output in Git for version history.

from netmiko import ConnectHandler
from pathlib import Path
from datetime import date

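# Placeholder credentials for illustration only. Load real credentials from
# a secrets store or environment variables instead of hard-coding them.
SWITCHES = [
    {"host": "192.168.50.1", "device_type": "generic", "username": "admin",
     "password": "SecurePass123!", "port": 22},
    {"host": "192.168.50.2", "device_type": "generic", "username": "admin",
     "password": "SecurePass123!", "port": 22},
]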

backup_dir = Path(f"backups/{date.today()}")
backup_dir.mkdir(parents=True, exist_ok=True)

for sw in SWITCHES:
    try:
        # The context manager closes the SSH session even if a command fails.
        with ConnectHandler(**sw) as conn:
            config = conn.send_command("show running-config")
            hostname = conn.send_command("show system info").split("\n")[0]
        filename = backup_dir / f"{sw['host']}.cfg"
        filename.write_text(config)
        print(f"OK: {sw['host']} ({hostname}) -> {filename}")
    except Exception as e:
        # Report the failure and continue with the next switch.
        print(f"FAIL: {sw['host']}: {e}")

Configuration Drift Detection

Compare the current configuration to the baseline to detect unauthorized changes. After running the backup script, use diff to compare the current backup to the baseline:

diff backups/baseline/192.168.50.1.cfg backups/2026-04-29/192.168.50.1.cfg

Any difference indicates a configuration change. If the change was not documented in a change request, investigate immediately.
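The same comparison can be done in Python when the check needs to run from the backup job itself. The following is a minimal sketch using the standard `difflib` module. It assumes that lines starting with `!` are comments (for example, save timestamps) that should not count as drift. That prefix and the sample configuration lines are assumptions for illustration, not taken from HiOS output.

```python
import difflib

def config_drift(baseline_text, current_text, ignore_prefixes=("!",)):
    """Return unified-diff lines between baseline and current configuration.

    Blank lines and lines starting with any of ignore_prefixes are skipped,
    so volatile comments such as timestamps do not register as drift.
    """
    def significant(text):
        return [ln.rstrip() for ln in text.splitlines()
                if ln.strip() and not ln.lstrip().startswith(ignore_prefixes)]

    return list(difflib.unified_diff(
        significant(baseline_text), significant(current_text),
        fromfile="baseline", tofile="current", lineterm=""))

# Made-up configuration snippets for illustration only.
baseline = "! saved 2026-04-01\nvlan 10\nvlan 20\n"
current = "! saved 2026-04-29\nvlan 10\nvlan 30\n"

drift = config_drift(baseline, current)
if drift:
    print("\n".join(drift))
else:
    print("no drift")
```

An empty result means the configurations match after filtering. Any non-empty result should map to an approved change request.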

Network Monitoring

SNMP Polling vs Flow Data

Method	What the Method Measures	Granularity	Use Case
SNMP polling	Interface counters (bytes, errors, drops)	Per-interface, per-poll-interval	Bandwidth utilization, error trends
NetFlow/sFlow	Per-flow records (src/dst IP, port, bytes)	Per-flow	Traffic analysis, anomaly detection
Packet capture	Full packet content	Per-packet	Protocol debugging, deep inspection

SNMP polling answers “how much traffic is on this interface?” Flow data answers “who is talking to whom?” Packet capture answers “what are the devices saying?”

SNMP Interface Monitoring with Alerting

The following script polls interface utilization on switches every 10 seconds and alerts when utilization exceeds a threshold. The script calculates utilization by comparing consecutive counter readings.

from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity
)
import time

def snmp_get_int(host, community, oid):
    """Return the counter value as an int, or None if the query fails."""
    it = getCmd(SnmpEngine(), CommunityData(community, mpModel=1),
                UdpTransportTarget((host, 161), timeout=2, retries=1),
                ContextData(), ObjectType(ObjectIdentity(oid)))
    err_ind, err_stat, _, var_binds = next(it)
    if err_ind or err_stat:
        return None
    return int(var_binds[0][1])

HOST = "192.168.50.1"
COMMUNITY = "public"
IFACE_INDEX = 1
INTERVAL = 10
THRESHOLD_PERCENT = 80
SPEED_BPS = 1_000_000_000  # 1 Gbps

oid_in = f"1.3.6.1.2.1.31.1.1.1.6.{IFACE_INDEX}"   # ifHCInOctets
oid_out = f"1.3.6.1.2.1.31.1.1.1.10.{IFACE_INDEX}"  # ifHCOutOctets

prev_in = snmp_get_int(HOST, COMMUNITY, oid_in)
prev_out = snmp_get_int(HOST, COMMUNITY, oid_out)

while True:
    time.sleep(INTERVAL)
    curr_in = snmp_get_int(HOST, COMMUNITY, oid_in)
    curr_out = snmp_get_int(HOST, COMMUNITY, oid_out)
    if None in (prev_in, prev_out, curr_in, curr_out):
        # A failed poll would produce a bogus delta, so skip this interval.
        print(f"Port {IFACE_INDEX}: SNMP poll failed, skipping interval")
        prev_in, prev_out = curr_in, curr_out
        continue
    # Octet delta -> bits per second over the polling interval
    bps_in = (curr_in - prev_in) * 8 / INTERVAL
    bps_out = (curr_out - prev_out) * 8 / INTERVAL
    util_in = bps_in / SPEED_BPS * 100
    util_out = bps_out / SPEED_BPS * 100
    print(f"Port {IFACE_INDEX}: IN={util_in:.1f}% OUT={util_out:.1f}%")
    if util_in > THRESHOLD_PERCENT or util_out > THRESHOLD_PERCENT:
        print(f"  ALERT: utilization exceeds {THRESHOLD_PERCENT}%")
    prev_in, prev_out = curr_in, curr_out

This script uses HC (High Capacity) counters (64-bit) from the IF-MIB. The 64-bit counters do not wrap around on high-speed interfaces. The 32-bit counters wrap every 34 seconds on a 1 Gbps link at full utilization.
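The 34-second figure follows from the counter size: a 32-bit octet counter holds 2^32 octets, and a 1 Gbps link at full utilization moves 125,000,000 octets per second. The following sketch reproduces that arithmetic and shows one common way to compute a delta across a single wrap. The function names are hypothetical.

```python
def wrap_seconds(counter_bits, link_bps):
    """Seconds for an octet counter to wrap at 100% link utilization."""
    max_octets = 2 ** counter_bits
    return max_octets * 8 / link_bps

def counter_delta(prev, curr, counter_bits=64):
    """Octets between two readings, allowing for at most one wrap.

    This cannot distinguish a wrap from a counter reset after a reboot,
    so compare sysUpTime as well when that distinction matters.
    """
    return (curr - prev) % (2 ** counter_bits)

print(f"{wrap_seconds(32, 1_000_000_000):.1f} s")  # prints "34.4 s"
print(counter_delta(4_294_967_290, 10, counter_bits=32))  # prints 16
```

With a 10-second polling interval, a 32-bit counter on a 1 Gbps link wraps at most once between polls, but on faster links it can wrap more than once, which is why the HC counters are the safer choice.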

Disaster Recovery

Term	Definition
RPO (Recovery Point Objective)	Maximum acceptable data loss, measured in time
RTO (Recovery Time Objective)	Maximum acceptable time to restore service
MTTR (Mean Time to Repair)	Average time to restore a failed component
MTBF (Mean Time Between Failures)	Average time between failures of a component
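MTBF and MTTR are often combined into a steady-state availability estimate, availability = MTBF / (MTBF + MTTR). This formula is not stated in the table above and is added here as a commonly used convention. The input values in the sketch are made-up examples, not measurements.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def annual_downtime_hours(avail, hours_per_year=8760):
    """Expected downtime per year for a given availability fraction."""
    return (1 - avail) * hours_per_year

# Illustrative numbers only: one failure every 2000 hours, 4 hours to repair.
a = availability(mtbf_hours=2000, mttr_hours=4)
print(f"availability: {a:.4%}")
print(f"expected downtime: {annual_downtime_hours(a):.1f} hours/year")
```

Halving MTTR roughly halves the expected downtime, which is one reason tested restore procedures matter as much as reliable hardware.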

Active-Active vs Active-Passive

Architecture	How the Architecture Works	RTO	Cost
Cold site	Empty facility with power and cooling. Equipment shipped after disaster.	Days	Low
Warm site	Facility with equipment installed but not running. Data restored from backup.	Hours	Medium
Hot site	Facility with equipment running and data replicated. Manual switchover.	Minutes	High
Active-passive	Standby system receives data replication. Automatic or manual failover.	Minutes	High
Active-active	Both systems handle traffic simultaneously. Load balanced. No failover needed.	Seconds	Highest

For OT networks, the MRP ring topology supports active-passive redundancy at the network layer (sub-200 ms failover). At the application layer, redundant SCADA servers in active-passive mode support historian and HMI continuity.

Backup Testing

A backup that has never been tested is not a backup. Test backups by:

Restoring a switch configuration from backup to a lab switch quarterly
Verifying the restored configuration matches the baseline
Documenting the restore procedure and time
Including backup restoration in the disaster recovery drill
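A restore drill can record its own result so that the measured restore time is available for comparison with the RTO. The following is a minimal sketch. The `restore` argument is a hypothetical callable that performs the restore on the lab switch and returns the resulting configuration text, and how it does that depends on the device and tooling.

```python
import time

def run_restore_drill(restore, baseline_text, rto_seconds):
    """Time a restore callable and check its result against the baseline."""
    start = time.monotonic()
    restored_text = restore()  # hypothetical: restores and returns the config text
    elapsed = time.monotonic() - start
    return {
        "restore_seconds": elapsed,
        "matches_baseline": restored_text.strip() == baseline_text.strip(),
        "within_rto": elapsed <= rto_seconds,
    }

# Stand-in restore function for illustration; a real one would talk to the switch.
def fake_restore():
    return "vlan 10\nvlan 20\n"

result = run_restore_drill(fake_restore, "vlan 10\nvlan 20", rto_seconds=900)
print(result)
```

Storing the result of each quarterly drill alongside the backups gives a history of restore times to set against the documented RTO.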

Key Takeaways

Automate inventory and backups

Run the SNMP inventory script weekly and the netmiko backup script nightly. Manual documentation becomes outdated as soon as a device changes.

Every change needs a rollback plan

The CAB reviews the impact and the rollback plan before approving. A change without a tested rollback plan does not get approved.

Test your backups

A backup that has never been restored is not a backup. Test quarterly. Measure restore time. That measurement is the real RTO.

What Comes Next

Operations keep the network running. When something breaks despite good operations, structured troubleshooting finds the root cause. The next chapter covers the CompTIA 7-step troubleshooting methodology, OSI-based diagnostics, and a Python toolkit for automating common troubleshooting tasks.

References

CompTIA Network+ N10-009 Exam Objectives, Domain 3: Network Operations
RFC 3411 — An Architecture for Describing SNMP Management Frameworks (IETF, 2002)
ITIL Foundation: ITIL 4 Edition (AXELOS, 2019)
RFC 2863 — The Interfaces Group MIB (IETF, 2000)