The previous part covered security: threats, segmentation, IEC 62443, and hardening. A hardened network still requires ongoing operations to stay reliable. This chapter covers the processes and tools that keep a network running: documentation, change management, monitoring, and disaster recovery.
A network without documentation takes longer to troubleshoot. A network without change management suffers unplanned outages. A network without monitoring discovers problems only when users complain. Operations processes prevent these failures.
Good documentation is the foundation of network operations. Without it, troubleshooting takes longer, changes introduce errors, and institutional knowledge is lost when staff leave.
| Document | Purpose |
|---|---|
| Physical diagram | Shows device locations, cables, rack positions |
| Logical diagram | Shows IP addresses, VLANs, routing |
| Cable map | Documents port-to-port connections |
| IPAM (IP Address Management) | Tracks IP assignments, subnets, DHCP/DNS |
| Asset inventory | Records model, serial number, firmware, location |
Manual asset inventories become outdated the moment a device is replaced. The following script queries all switches via SNMP and generates a network inventory report with hostname, IP, firmware version, and uptime.
```python
# Uses the synchronous hlapi from pysnmp 4.x; recent pysnmp releases have
# moved to an asyncio-based API.
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity)
import csv
import sys

# System-group OIDs; sysDescr typically carries the model and firmware version.
OIDS = {
    "sysName":     "1.3.6.1.2.1.1.5.0",
    "sysDescr":    "1.3.6.1.2.1.1.1.0",
    "sysUpTime":   "1.3.6.1.2.1.1.3.0",
    "sysContact":  "1.3.6.1.2.1.1.4.0",
    "sysLocation": "1.3.6.1.2.1.1.6.0",
}


def snmp_get(host, community, oid):
    """Return one OID value as a string, or "N/A" on any SNMP error."""
    it = getCmd(SnmpEngine(),
                CommunityData(community, mpModel=1),  # mpModel=1 selects SNMPv2c
                UdpTransportTarget((host, 161), timeout=2, retries=1),
                ContextData(),
                ObjectType(ObjectIdentity(oid)))
    err_ind, err_stat, _, var_binds = next(it)
    if err_ind or err_stat:
        return "N/A"
    return str(var_binds[0][1])


def ticks_to_days(ticks_str):
    """Convert sysUpTime (hundredths of a second) to days."""
    try:
        return f"{int(ticks_str) / 8640000:.1f} days"
    except ValueError:
        return ticks_str


switches = ["192.168.50.1", "192.168.50.2", "192.168.50.3",
            "192.168.50.4", "192.168.50.5"]
community = "public"

writer = csv.writer(sys.stdout)
# One column per OID, in the same order as OIDS. The original header omitted
# Contact, which mislabeled the contact value as Location and dropped the location.
writer.writerow(["IP", "Hostname", "Description", "Uptime", "Contact", "Location"])
for host in switches:
    row = [host]
    for name, oid in OIDS.items():
        val = snmp_get(host, community, oid)
        if name == "sysUpTime":
            val = ticks_to_days(val)
        row.append(val)
    writer.writerow(row)
```

Run this script weekly and store the output in version control. Comparing consecutive reports reveals firmware changes, reboots (uptime resets), and hostname misconfigurations.
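Comparing two consecutive reports can itself be scripted. The following is a minimal sketch, assuming the CSV column names produced by the inventory script; the function names and the sample rows are invented for illustration, and the sample only includes the columns the comparison reads.

```python
import csv
import io


def load_report(csv_text):
    """Parse an inventory CSV into a dict keyed by the IP column."""
    return {row["IP"]: row for row in csv.DictReader(io.StringIO(csv_text))}


def uptime_days(value):
    """Extract the number from strings like '12.5 days'; None if unparseable."""
    try:
        return float(value.split()[0])
    except (ValueError, IndexError):
        return None


def compare_reports(old, new):
    """Return human-readable findings for two parsed reports."""
    findings = []
    for ip in sorted(old.keys() & new.keys()):
        before, after = old[ip], new[ip]
        if before["Hostname"] != after["Hostname"]:
            findings.append(f"{ip}: hostname {before['Hostname']} -> {after['Hostname']}")
        if before["Description"] != after["Description"]:
            findings.append(f"{ip}: description changed (possible firmware update)")
        up_old, up_new = uptime_days(before["Uptime"]), uptime_days(after["Uptime"])
        if up_old is not None and up_new is not None and up_new < up_old:
            findings.append(f"{ip}: uptime dropped from {up_old} to {up_new} days (reboot)")
    for ip in sorted(new.keys() - old.keys()):
        findings.append(f"{ip}: new device")
    for ip in sorted(old.keys() - new.keys()):
        findings.append(f"{ip}: missing from latest report")
    return findings


# Invented sample data standing in for two weekly reports.
LAST_WEEK = """IP,Hostname,Description,Uptime
192.168.50.1,sw-line1,Switch FW 1.0,40.0 days
192.168.50.2,sw-line2,Switch FW 1.0,40.0 days
"""
THIS_WEEK = """IP,Hostname,Description,Uptime
192.168.50.1,sw-line1,Switch FW 1.1,0.5 days
192.168.50.2,sw-line2,Switch FW 1.0,47.0 days
"""

for finding in compare_reports(load_report(LAST_WEEK), load_report(THIS_WEEK)):
    print(finding)
```

Treating any drop in uptime as a reboot is a simplification; a device that was unreachable in one report shows "N/A" and is skipped by the uptime check.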
Change management controls modifications to the network to prevent unplanned outages. Every change follows a process.
| Type | Approval | Example |
|---|---|---|
| Standard | Pre-approved (follows a template) | Adding a new access port to an existing VLAN |
| Normal | Requires CAB review | Adding a new VLAN, changing firewall rules |
| Emergency | Expedited approval (post-implementation review) | Restoring service after an outage |
A CAB (Change Advisory Board) reviews normal changes. The CAB includes the network engineer, the OT manager, and a representative from operations. The CAB evaluates the risk, the rollback plan, and the maintenance window.
Every change request includes a rollback plan that answers:
A change without a tested rollback plan does not get approved.
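The approval rule can be expressed as a simple gate in a ticketing or automation tool. This is a sketch only: the `ChangeRequest` class, its field names, and the `blocking_reasons` method are hypothetical and do not correspond to any particular change management product.

```python
from dataclasses import dataclass
from enum import Enum


class ChangeType(Enum):
    STANDARD = "standard"    # pre-approved, follows a template
    NORMAL = "normal"        # requires CAB review
    EMERGENCY = "emergency"  # expedited, reviewed after implementation


@dataclass
class ChangeRequest:
    summary: str
    change_type: ChangeType
    rollback_plan: str = ""
    rollback_tested: bool = False
    cab_approved: bool = False

    def blocking_reasons(self):
        """Return reasons the change cannot proceed; an empty list means it can."""
        reasons = []
        if not self.rollback_plan.strip():
            reasons.append("no rollback plan")
        elif not self.rollback_tested:
            reasons.append("rollback plan not tested")
        if self.change_type is ChangeType.NORMAL and not self.cab_approved:
            reasons.append("CAB review pending")
        return reasons


request = ChangeRequest("Add VLAN 30 for packaging line", ChangeType.NORMAL)
print(request.blocking_reasons())
```

The sketch applies the rollback requirement to every change type, including emergency changes, which matches the rule that every change request includes a rollback plan.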
HiOS has two configuration stores: running config (active, in RAM) and startup config (saved, in flash). Changes take effect immediately but are lost on reboot unless saved. A baseline configuration is a known-good configuration used as a reference.
The following script uses netmiko to connect to all switches via SSH, download the running configuration, and save it to a file. Run it nightly via cron and store the output in Git for version history.
```python
from netmiko import ConnectHandler
from pathlib import Path
from datetime import date

# Example credentials only; do not hardcode real passwords in a script.
SWITCHES = [
    {"host": "192.168.50.1", "device_type": "generic", "username": "admin",
     "password": "SecurePass123!", "port": 22},
    {"host": "192.168.50.2", "device_type": "generic", "username": "admin",
     "password": "SecurePass123!", "port": 22},
]

backup_dir = Path(f"backups/{date.today()}")  # for example backups/2026-04-29
backup_dir.mkdir(parents=True, exist_ok=True)

for sw in SWITCHES:
    try:
        # The context manager closes the SSH session even if a command fails.
        with ConnectHandler(**sw) as conn:
            config = conn.send_command("show running-config")
            # Uses the first line of "show system info" as the device label.
            hostname = conn.send_command("show system info").split("\n")[0]
        filename = backup_dir / f"{sw['host']}.cfg"
        filename.write_text(config)
        print(f"OK: {sw['host']} ({hostname}) -> {filename}")
    except Exception as e:
        print(f"FAIL: {sw['host']}: {e}")
```

Compare the current configuration to the baseline to detect unauthorized changes. After running the backup script, use diff to compare today’s backup to the baseline:

```
diff backups/baseline/192.168.50.1.cfg backups/2026-04-29/192.168.50.1.cfg
```

Any difference indicates a configuration change. If the change was not documented in a change request, investigate immediately.
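The same comparison can run inside the nightly job so that drift is reported without a manual step. A minimal sketch using the standard library's `difflib`; the function name `config_drift` and the sample configuration lines are invented and are not real switch syntax.

```python
import difflib


def config_drift(baseline_text, current_text, name="switch"):
    """Return unified-diff lines between baseline and current config text."""
    return list(difflib.unified_diff(
        baseline_text.splitlines(),
        current_text.splitlines(),
        fromfile=f"baseline/{name}.cfg",
        tofile=f"current/{name}.cfg",
        lineterm="",
    ))


# Invented sample content standing in for two backup files.
baseline = "hostname sw1\nvlan 10\n"
current = "hostname sw1\nvlan 10\nvlan 99\n"

drift = config_drift(baseline, current, "192.168.50.1")
if drift:
    print("\n".join(drift))
else:
    print("no drift")
```

In the nightly job, the two arguments would come from `Path(...).read_text()` on the baseline file and on today's backup file. An empty result means the configuration matches the baseline.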
| Method | What It Measures | Granularity | Use Case |
|---|---|---|---|
| SNMP polling | Interface counters (bytes, errors, drops) | Per-interface, per-poll-interval | Bandwidth utilization, error trends |
| NetFlow/sFlow | Per-flow records (src/dst IP, port, bytes) | Per-flow | Traffic analysis, anomaly detection |
| Packet capture | Full packet content | Per-packet | Protocol debugging, deep inspection |
SNMP polling answers “how much traffic is on this interface?” Flow data answers “who is talking to whom?” Packet capture answers “what are they saying?”
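To illustrate the "who is talking to whom" question, the sketch below sums bytes per source and destination pair from a handful of flow records. The record layout (`src`, `dst`, `bytes`) is a simplified, hypothetical format; real NetFlow or sFlow collectors export more fields and in their own schemas.

```python
from collections import Counter


def top_talkers(flows, n=3):
    """Sum bytes per (src, dst) pair and return the n largest conversations."""
    totals = Counter()
    for flow in flows:
        totals[(flow["src"], flow["dst"])] += flow["bytes"]
    return totals.most_common(n)


# Invented flow records for illustration.
FLOWS = [
    {"src": "192.168.50.10", "dst": "192.168.50.20", "bytes": 4000},
    {"src": "192.168.50.11", "dst": "192.168.50.20", "bytes": 1500},
    {"src": "192.168.50.10", "dst": "192.168.50.20", "bytes": 3000},
    {"src": "192.168.50.12", "dst": "192.168.50.30", "bytes": 200},
]

for (src, dst), total in top_talkers(FLOWS):
    print(f"{src} -> {dst}: {total} bytes")
```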
The following script polls the utilization of one interface on one switch every 10 seconds and alerts when utilization exceeds a threshold. It calculates utilization by comparing consecutive counter readings. To cover all switches, run it per host and interface index.

```python
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity)
import time


def snmp_get_int(host, community, oid):
    """Return a counter value as int, or None if the poll failed."""
    it = getCmd(SnmpEngine(),
                CommunityData(community, mpModel=1),  # mpModel=1 selects SNMPv2c
                UdpTransportTarget((host, 161), timeout=2, retries=1),
                ContextData(),
                ObjectType(ObjectIdentity(oid)))
    err_ind, err_stat, _, var_binds = next(it)
    if err_ind or err_stat:
        return None
    return int(var_binds[0][1])


HOST = "192.168.50.1"
COMMUNITY = "public"
IFACE_INDEX = 1
INTERVAL = 10               # seconds between polls
THRESHOLD_PERCENT = 80
SPEED_BPS = 1_000_000_000   # 1 Gbps

oid_in = f"1.3.6.1.2.1.31.1.1.1.6.{IFACE_INDEX}"    # ifHCInOctets
oid_out = f"1.3.6.1.2.1.31.1.1.1.10.{IFACE_INDEX}"  # ifHCOutOctets

prev_in = snmp_get_int(HOST, COMMUNITY, oid_in)
prev_out = snmp_get_int(HOST, COMMUNITY, oid_out)

while True:
    time.sleep(INTERVAL)
    curr_in = snmp_get_int(HOST, COMMUNITY, oid_in)
    curr_out = snmp_get_int(HOST, COMMUNITY, oid_out)
    if None in (prev_in, prev_out, curr_in, curr_out):
        # Returning 0 on a failed poll would produce a bogus delta, so skip
        # the interval and re-baseline on the next successful reading.
        print(f"Port {IFACE_INDEX}: poll failed, skipping this interval")
        prev_in, prev_out = curr_in, curr_out
        continue
    bps_in = (curr_in - prev_in) * 8 / INTERVAL
    bps_out = (curr_out - prev_out) * 8 / INTERVAL
    util_in = bps_in / SPEED_BPS * 100
    util_out = bps_out / SPEED_BPS * 100
    print(f"Port {IFACE_INDEX}: IN={util_in:.1f}% OUT={util_out:.1f}%")
    if util_in > THRESHOLD_PERCENT or util_out > THRESHOLD_PERCENT:
        print(f"  ALERT: utilization exceeds {THRESHOLD_PERCENT}%")
    prev_in, prev_out = curr_in, curr_out
```

This script uses the 64-bit HC (High Capacity) counters from the IF-MIB, which in practice do not wrap on high-speed interfaces. The 32-bit counters wrap about every 34 seconds on a 1 Gbps link at full utilization.
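The 34-second figure follows from the counter size: a 32-bit octet counter holds 2^32 bytes, and a 1 Gbps link moves 10^9 bits per second. The sketch below works through that arithmetic and shows one common way to compute a delta across a single wrap; the function names are illustrative.

```python
def wrap_seconds(counter_bits, speed_bps):
    """Seconds until an octet counter wraps at sustained line rate."""
    return (2 ** counter_bits) * 8 / speed_bps


def counter_delta(prev, curr, counter_bits=64):
    """Difference between two readings, allowing for at most one wrap."""
    return (curr - prev) % (2 ** counter_bits)


GBPS = 1_000_000_000
SECONDS_PER_YEAR = 365 * 24 * 3600

print(f"32-bit counter at 1 Gbps: {wrap_seconds(32, GBPS):.1f} s")
print(f"64-bit counter at 1 Gbps: {wrap_seconds(64, GBPS) / SECONDS_PER_YEAR:.0f} years")

# A 32-bit counter that wrapped between two polls still yields the true delta.
print(counter_delta(2**32 - 10, 5, counter_bits=32))
```

The modulo approach is only valid when the counter wraps at most once between polls, which is why 32-bit counters need a poll interval shorter than the wrap time.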
| Term | Definition |
|---|---|
| RPO (Recovery Point Objective) | Maximum acceptable data loss, measured in time |
| RTO (Recovery Time Objective) | Maximum acceptable time to restore service |
| MTTR (Mean Time to Repair) | Average time to restore a failed component |
| MTBF (Mean Time Between Failures) | Average time between failures |
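MTBF and MTTR combine into steady-state availability through the standard relation availability = MTBF / (MTBF + MTTR). The sketch below applies it; the input figures are invented for illustration and are not measurements from any device.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(mtbf_hours, mttr_hours):
    """Expected downtime per year implied by the availability figure."""
    return (1 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


# Hypothetical component: fails on average every 10,000 hours, 4 hours to repair.
a = availability(10_000, 4)
print(f"availability: {a:.4%}")
print(f"downtime per year: {annual_downtime_hours(10_000, 4):.1f} h")
```

Halving MTTR has nearly the same effect on availability as doubling MTBF, which is why spare parts and tested restore procedures matter as much as reliable hardware.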
| Architecture | How It Works | RTO | Cost |
|---|---|---|---|
| Cold site | Empty facility with power and cooling. Equipment shipped after disaster. | Days | Low |
| Warm site | Facility with equipment installed but not running. Data restored from backup. | Hours | Medium |
| Hot site | Facility with equipment running and data replicated. Manual switchover. | Minutes | High |
| Active-passive | Standby system receives data replication. Automatic or manual failover. | Minutes | High |
| Active-active | Both systems handle traffic simultaneously. Load balanced. No failover needed. | Seconds | Highest |
For OT networks, the MRP ring topology provides active-passive redundancy at the network layer (sub-200 ms failover). At the application layer, redundant SCADA servers in active-passive mode provide historian and HMI continuity.
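A backup that has never been tested is not a backup. It is a hope. Test backups by restoring them on a regular schedule, such as quarterly, and measuring how long the restore takes. The measured restore time is the real RTO.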
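Measuring restore time can be built into the test itself. The sketch below times a restore procedure and compares it with the RTO target; the `restore` callable is a hypothetical stand-in for whatever actually loads a saved configuration onto a spare or lab device.

```python
import time


def timed_restore(restore, rto_seconds):
    """Run a restore callable, time it, and compare the result with the RTO target."""
    start = time.monotonic()
    ok = bool(restore())
    elapsed = time.monotonic() - start
    return {
        "restored": ok,
        "seconds": elapsed,
        "within_rto": ok and elapsed <= rto_seconds,
    }


def fake_restore():
    """Placeholder that pretends the restore succeeded."""
    time.sleep(0.01)
    return True


result = timed_restore(fake_restore, rto_seconds=4 * 3600)
print(f"restored={result['restored']} in {result['seconds']:.2f} s, "
      f"within RTO: {result['within_rto']}")
```

A failed restore is reported as outside the RTO regardless of how quickly it failed, since a fast failure does not restore service.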
Automate inventory and backups
SNMP inventory scripts and netmiko backup scripts run on a schedule (inventory weekly, backups nightly). Manual documentation becomes outdated immediately.
Every change needs a rollback plan
The CAB reviews risk and rollback before approving. A change without a tested rollback plan does not get approved.
Test your backups
A backup that has never been restored is not a backup. Test quarterly. Measure restore time. That is your real RTO.
Operations keep the network running. When something breaks despite good operations, structured troubleshooting finds the root cause. The next chapter covers the CompTIA 7-step troubleshooting methodology, OSI-based diagnostics, and a Python toolkit for automating common troubleshooting tasks.