The previous chapter covered network operations: documentation, change management, and monitoring. When monitoring detects a problem, structured troubleshooting finds the root cause. This chapter provides the methodology, the tools, and a Python toolkit for automating common diagnostics.
Without a structured approach, engineers chase symptoms. They reboot switches, swap cables, and change configurations randomly. A methodology prevents wasted effort and ensures the root cause is found, not just the symptom.
Scenario: An HMI in production cell 3 loses connectivity to the SCADA server every 15 minutes for approximately 30 seconds.
Step 1: Identify the problem. Interview the operator: “The screen goes gray every 15 minutes, then comes back.” Check the NMS: no link-down events on the HMI port. Check recent changes: a new VFD was installed in cell 3 yesterday.
Step 2: Establish a theory. The new VFD generates electromagnetic interference (EMI) that causes CRC errors on the nearby switch port. CRC errors trigger retransmissions that exceed the SCADA timeout.
Step 3: Test the theory. Check CRC error counters on the HMI port: Basic Settings → Port Statistics. The counter increments by 50 to 100 every 15 minutes. Check the VFD duty cycle: it ramps to full speed every 15 minutes. Theory confirmed: VFD switching noise causes CRC errors.
Step 4: Plan of action. Reroute the Ethernet cable away from the VFD power cable. Use shielded cable (S/FTP Cat 6A). Maintain 200 mm separation between data and power cables. Rollback: restore original cable routing if the new routing causes other issues.
Step 5: Implement. During the next maintenance window, reroute the cable and replace it with shielded cable.
Step 6: Verify. Monitor CRC counters for 24 hours. Zero new CRC errors. HMI connectivity is stable. Operator confirms no more gray screens.
Step 7: Document. Record in the change management system: “Cell 3 HMI intermittent connectivity caused by EMI from new VFD. Resolved by rerouting Ethernet cable with 200 mm separation and upgrading to S/FTP Cat 6A.”
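That record can be kept in a structured form so it is searchable the next time the symptom appears. A minimal sketch as a Python dataclass; the field names are illustrative and not tied to any particular change management system:

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ProblemRecord:
    """One entry in the solved-problem log (illustrative schema)."""
    location: str      # e.g. "Cell 3"
    symptom: str       # what the operator observed
    root_cause: str    # the confirmed cause, not the first theory
    resolution: str    # what actually fixed it
    date_resolved: date = field(default_factory=date.today)

    def summary(self) -> str:
        return (f"{self.location}: {self.symptom} -> caused by "
                f"{self.root_cause}; fixed by {self.resolution}")


record = ProblemRecord(
    location="Cell 3",
    symptom="HMI intermittent connectivity",
    root_cause="EMI from new VFD",
    resolution="Rerouted Ethernet cable (200 mm separation), S/FTP Cat 6A",
)
print(record.summary())
```

Even a flat log of such entries answers the key question during the next incident: has this happened before, and what fixed it?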
The bottom-up approach works for most connectivity problems: start at Layer 1 (cabling, link state) and work up through addressing, transport, and application checks until a layer fails.
Physical-layer problems cause the most frustrating intermittent failures. They pass basic connectivity tests but fail under load or at specific times.
Both wiring standards work. The problem occurs when one end of a cable uses T568A and the other uses T568B. This creates a crossover cable, which causes link failures on devices that do not support Auto-MDI/X. Modern switches support Auto-MDI/X, but older PLCs and industrial devices may not.
| Pin | T568A | T568B |
|---|---|---|
| 1 | White/Green | White/Orange |
| 2 | Green | Orange |
| 3 | White/Orange | White/Green |
| 6 | Orange | Green |
Pins 4, 5, 7, and 8 carry the blue and brown pairs in both standards; only the green and orange pairs swap. Use T568B consistently throughout the plant, and label both ends of every cable.
A split pair occurs when a wire pair is split across two different twisted pairs in the cable. The cable passes a continuity test (all pins connect) but fails performance tests because the split pair loses the noise-canceling benefit of twisting. Split pairs cause CRC errors under load and pass basic ping tests.
Detect split pairs with a cable tester that measures NEXT (Near-End Crosstalk). A continuity tester alone does not detect split pairs.
A TDR (Time Domain Reflectometer) sends a pulse down a copper cable and measures the reflection. It reports the distance to a fault (open, short, or impedance mismatch) in meters. Use a TDR to locate a cable break without pulling the entire cable.
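The distance calculation behind a TDR is simple: the pulse travels at a fraction of the speed of light (the cable's nominal velocity of propagation, NVP), and the measured reflection time covers the round trip. A sketch of the formula, assuming a typical NVP of 0.66 for copper twisted pair (check the cable datasheet for the real value):

```python
C = 299_792_458  # speed of light in vacuum, m/s

def tdr_distance_to_fault(reflection_time_ns: float, nvp: float = 0.66) -> float:
    """Distance to a cable fault in meters.

    reflection_time_ns: round-trip time of the reflected pulse, in ns.
    nvp: nominal velocity of propagation as a fraction of c;
         ~0.66 is typical for Cat 5e/6 copper (assumption, not universal).
    """
    seconds = reflection_time_ns * 1e-9
    # Divide by 2 because the pulse travels to the fault and back
    return (nvp * C * seconds) / 2

# A reflection arriving 500 ns after the pulse: fault roughly 49.5 m out
print(f"{tdr_distance_to_fault(500):.1f} m")
```

This is why the TDR asks for the cable type before measuring: an NVP that is off by a few percent shifts the reported fault location by meters.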
An OTDR (Optical Time Domain Reflectometer) does the same for fiber-optic cables. It measures attenuation along the fiber and identifies splice points, connectors, and breaks with their exact distance from the test point.
| Tool | Cable Type | Measures | Use Case |
|---|---|---|---|
| Continuity tester | Copper | Pin-to-pin connectivity | Verify wiring |
| Cable certifier | Copper | NEXT, return loss, length | Certify Cat 6A compliance |
| TDR | Copper | Distance to fault | Locate cable breaks |
| OTDR | Fiber | Attenuation, splice loss, break location | Locate fiber faults |
| Light meter | Fiber | Optical power (dBm) | Verify SFP and fiber link budget |
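The light meter's numbers feed directly into a link-budget check: transmit power minus all losses must exceed receiver sensitivity by a safety margin. A rough sketch with typical single-mode 1310 nm figures as defaults; the real values come from the SFP and fiber datasheets:

```python
def link_budget_ok(tx_power_dbm: float, rx_sensitivity_dbm: float,
                   fiber_km: float, splices: int, connectors: int,
                   loss_per_km: float = 0.35,   # typical SM fiber @ 1310 nm
                   splice_loss: float = 0.1,    # per fusion splice
                   connector_loss: float = 0.5, # per mated connector pair
                   margin: float = 3.0) -> tuple[bool, float]:
    """Return (link OK?, expected Rx power in dBm).

    Default loss figures are typical assumptions, not guarantees --
    substitute datasheet values for a real design.
    """
    total_loss = (fiber_km * loss_per_km
                  + splices * splice_loss
                  + connectors * connector_loss)
    received = tx_power_dbm - total_loss
    return received - margin >= rx_sensitivity_dbm, received


ok, rx = link_budget_ok(tx_power_dbm=-3.0, rx_sensitivity_dbm=-20.0,
                        fiber_km=10, splices=4, connectors=2)
print(f"Expected Rx power: {rx:.1f} dBm, link {'OK' if ok else 'marginal'}")
```

If the light meter reads significantly below the expected Rx power, a splice or connector is absorbing more than its budgeted loss; the OTDR then locates which one.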
Wireless problems are harder to diagnose because the medium is invisible and shared.
| Metric | Good | Marginal | Poor | Meaning |
|---|---|---|---|---|
| RSSI (signal strength) | > -65 dBm | -65 to -75 dBm | < -75 dBm | How strong the signal is at the client |
| SNR (signal-to-noise ratio) | > 25 dB | 15 to 25 dB | < 15 dB | Signal strength relative to background noise |
| Channel utilization | < 50% | 50 to 80% | > 80% | How busy the channel is |
| Retry rate | < 10% | 10 to 20% | > 20% | Percentage of frames that required retransmission |
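The thresholds in the table translate directly into a classification helper. A sketch that rates RSSI and SNR using those same cutoffs, and shows the relationship SNR = RSSI minus noise floor:

```python
def classify_rssi(rssi_dbm: float) -> str:
    """Rate client signal strength per the thresholds in the table above."""
    if rssi_dbm > -65:
        return "good"
    if rssi_dbm >= -75:
        return "marginal"
    return "poor"


def classify_snr(snr_db: float) -> str:
    """Rate signal-to-noise ratio per the thresholds in the table above."""
    if snr_db > 25:
        return "good"
    if snr_db >= 15:
        return "marginal"
    return "poor"


# SNR is derived from RSSI and the measured noise floor
rssi, noise = -68.0, -92.0
snr = rssi - noise  # 24 dB
print(classify_rssi(rssi), classify_snr(snr))  # marginal marginal
```

The derived SNR explains a common surprise: a client with acceptable RSSI can still perform poorly near a VFD or welder, because the noise floor rises and the SNR drops into the marginal band.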
| Symptom | Likely Cause | Diagnostic |
|---|---|---|
| Low throughput, good signal | Co-channel interference | Check channel utilization, survey for overlapping APs |
| Intermittent disconnects | Roaming between APs with different configs | Verify all APs share the same SSID, security, and VLAN |
| Good signal, high retry rate | Hidden node problem | Enable RTS/CTS, reposition APs |
| No connection in specific area | Dead zone (signal blocked by metal) | Perform a site survey with a Wi-Fi analyzer |
Industrial environments have unique interference sources: VFDs (variable frequency drives) generate broadband noise, welding equipment creates impulse noise, and metal structures (cabinets, conveyors, ductwork) block and reflect signals. Perform a site survey with a Wi-Fi analyzer before deploying APs in a plant.
PROFINET RT requires frames to arrive within the configured cycle time (typically 1 to 4 ms). A violation causes the IO controller to declare the IO device as failed, which stops the associated process.
Detect cycle time violations with tshark:
```shell
# Capture PROFINET RT frames and show timing
tshark -i eth0 -f "ether proto 0x8892" -T fields \
  -e frame.time_delta_displayed \
  -e pn_rt.frame_id \
  -e pn_rt.cycle_counter \
  | awk '$1 > 0.004 {print "VIOLATION: " $0}'
```

A `frame.time_delta_displayed` greater than the cycle time (0.004 seconds for a 4 ms cycle) indicates a violation. Common causes: switch queue congestion (too much non-PROFINET traffic on the same VLAN), an STP topology change (temporary forwarding interruption), or a slow link in the path.
MRP ring oscillation occurs when two devices claim the MRM (Media Redundancy Manager) role in the same ring domain. Both MRMs open and close the ring simultaneously, causing continuous topology changes and MAC table flushes.
Detect oscillation by monitoring MRP topology change events:
```shell
# Count MRP topology changes per second
tshark -i eth0 -f "ether proto 0x88e3" -T fields \
  -e frame.time_epoch -e mrp.type \
  | grep "TopologyChange" | cut -d. -f1 | uniq -c
```

More than one topology change per minute (outside of a real cable failure) indicates oscillation. Verify that exactly one MRM exists per ring domain.
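The same per-minute counting can be done offline on exported timestamps. A sketch, assuming the epoch timestamps of topology-change frames have already been extracted (for example, from a tshark field export):

```python
from collections import Counter


def oscillation_minutes(change_times: list[float], threshold: int = 1) -> list[int]:
    """Return the minute buckets containing more than `threshold` MRP
    topology changes -- a crude oscillation indicator.

    change_times: epoch timestamps (seconds) of topology-change frames.
    """
    per_minute = Counter(int(t // 60) for t in change_times)
    return sorted(minute for minute, n in per_minute.items() if n > threshold)


# Three changes within the same minute suggest duelling MRMs,
# not a one-off cable failure
times = [1000.0, 1000.4, 1000.9, 5000.0]
print(oscillation_minutes(times))  # [16]
```

A real cable failure produces one burst of changes and then silence; a duelling-MRM oscillation keeps every minute bucket hot until the second MRM is demoted.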
Modbus TCP uses a request-response pattern. The SCADA server sends a request and waits for a response. If the response does not arrive within the timeout (typically 1 to 5 seconds), the SCADA server marks the device as unreachable.
```shell
# Show Modbus TCP response times
tshark -i eth0 -f "tcp port 502" -T fields \
  -e frame.time_delta_displayed \
  -e ip.src -e ip.dst \
  -e modbus.func_code \
  | awk '$1 > 1.0 {print "SLOW: " $0}'
```

Response times greater than 1 second indicate network congestion, PLC CPU overload, or a routing problem. Consistent timeouts from a single PLC point to a PLC issue. Timeouts from all PLCs point to a network issue.
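That localization rule, one PLC versus all PLCs, is mechanical enough to automate. A sketch, assuming per-PLC timeout counts have already been tallied (the dict shape and threshold are illustrative):

```python
def localize_timeouts(timeout_counts: dict[str, int], min_events: int = 5) -> str:
    """Apply the rule from the text: timeouts from one PLC point to that
    PLC; timeouts from all PLCs point to the network.

    timeout_counts: PLC IP -> number of slow or timed-out responses.
    min_events: below this count, noise is ignored (assumed threshold).
    """
    affected = [ip for ip, n in timeout_counts.items() if n >= min_events]
    if not affected:
        return "no significant timeouts"
    if len(affected) == len(timeout_counts):
        return "network issue (all PLCs affected)"
    if len(affected) == 1:
        return f"device issue: {affected[0]}"
    return f"partial: check the shared path to {affected}"


counts = {"192.168.10.11": 23, "192.168.10.12": 0, "192.168.10.13": 1}
print(localize_timeouts(counts))  # device issue: 192.168.10.11
```

The middle case, several but not all PLCs affected, usually means a shared element in their path: one switch, one uplink, or one congested VLAN.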
The following script extends the basic diagnostic toolkit with DNS resolution checks, SNMP reachability, and MRP ring status via SNMP.
```python
import socket
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class DiagResult:
    host: str
    checks: list[dict] = field(default_factory=list)

    def add(self, layer: str, check: str, passed: bool, detail: str = ""):
        self.checks.append(
            {"layer": layer, "check": check, "passed": passed, "detail": detail}
        )

    def report(self):
        print(f"\n{'=' * 60}")
        print(f"Diagnostic Report: {self.host}")
        print(f"{'=' * 60}")
        for c in self.checks:
            icon = "PASS" if c["passed"] else "FAIL"
            print(f"[{icon}] [{c['layer']}] {c['check']}")
            if c["detail"]:
                print(f"       {c['detail']}")
        failed = [c for c in self.checks if not c["passed"]]
        print(f"\n{'All checks passed' if not failed else f'{len(failed)} check(s) failed'}")


def check_ping(host: str) -> tuple[bool, str]:
    result = subprocess.run(
        ["ping", "-c", "4", "-W", "1", host],
        capture_output=True, text=True,
    )
    if result.returncode == 0:
        for line in result.stdout.splitlines():
            if "rtt" in line or "min/avg/max" in line:
                return True, line.strip()
        return True, "Reachable"  # reachable even if no rtt summary line
    return False, "No response"


def check_port(host: str, port: int) -> tuple[bool, str]:
    try:
        with socket.create_connection((host, port), timeout=2.0):
            return True, f"Port {port}/tcp open"
    except OSError as e:  # covers ConnectionRefusedError and timeouts
        return False, f"Port {port}/tcp: {e}"


def check_dns(hostname: str) -> tuple[bool, str]:
    try:
        addr = socket.gethostbyname(hostname)
        return True, f"Resolves to {addr}"
    except socket.gaierror as e:
        return False, f"DNS resolution failed: {e}"


def check_snmp(host: str, community: str = "public") -> tuple[bool, str]:
    try:
        from pysnmp.hlapi import (
            getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
            ContextData, ObjectType, ObjectIdentity,
        )
        it = getCmd(
            SnmpEngine(),
            CommunityData(community, mpModel=1),  # SNMPv2c
            UdpTransportTarget((host, 161), timeout=2, retries=1),
            ContextData(),
            ObjectType(ObjectIdentity("1.3.6.1.2.1.1.5.0")),  # sysName
        )
        err_ind, err_stat, _, var_binds = next(it)
        if err_ind:
            return False, f"SNMP error: {err_ind}"
        return True, f"sysName={var_binds[0][1]}"
    except Exception as e:
        return False, f"SNMP failed: {e}"


def run_diagnostics(host: str, ports: list[int], hostname: str = "") -> DiagResult:
    result = DiagResult(host=host)

    # Layer 3: ping
    passed, detail = check_ping(host)
    result.add("L3", f"Ping {host}", passed, detail)

    # Layer 4: TCP ports
    for port in ports:
        passed, detail = check_port(host, port)
        result.add("L4", f"TCP port {port}", passed, detail)

    # Layer 7: DNS
    if hostname:
        passed, detail = check_dns(hostname)
        result.add("L7", f"DNS resolve {hostname}", passed, detail)

    # Layer 7: SNMP
    passed, detail = check_snmp(host)
    result.add("L7", f"SNMP query {host}", passed, detail)

    return result


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "192.168.1.1"
    hostname = sys.argv[2] if len(sys.argv) > 2 else ""
    diag = run_diagnostics(target, ports=[22, 80, 443, 502, 44818], hostname=hostname)
    diag.report()
```

Running this against a Hirschmann switch at 192.168.50.1 with hostname sw-cell1.plant.local:

```shell
python diag.py 192.168.50.1 sw-cell1.plant.local
```

A failed ping points to Layers 1 to 3. A successful ping with a failed port check points to Layer 4 (a firewall, or a service that is not running). A failed DNS check points to a name resolution problem. A failed SNMP check alongside a successful ping points to SNMP configuration (wrong community string, or SNMPv2c disabled).
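These interpretation rules are themselves automatable. A sketch of a triage helper that consumes a DiagResult-style check list (the dict shape matches what the toolkit's `add` method stores):

```python
def triage(checks: list[dict]) -> str:
    """Map failed checks to a probable cause, using the interpretation
    rules from the text. Each check is a dict with keys
    "layer", "check", and "passed".
    """
    def failed(layer: str) -> bool:
        return any(not c["passed"] for c in checks if c["layer"] == layer)

    if failed("L3"):
        return "Ping failed: suspect Layers 1-3 (cabling, link, addressing, routing)"
    if failed("L4"):
        return "Ping OK but TCP port closed: firewall or service not running"
    if failed("L7"):
        return "L7 failure: check DNS or SNMP configuration (community string, version)"
    return "All checks passed"


checks = [
    {"layer": "L3", "check": "Ping", "passed": True},
    {"layer": "L4", "check": "TCP port 502", "passed": False},
]
print(triage(checks))  # Ping OK but TCP port closed: firewall or service not running
```

The ordering matters: a failed ping makes the higher-layer results meaningless, so the function reports the lowest failing layer first.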
| Scenario | Wireshark Filter |
|---|---|
| MRP frames | eth.type == 0x88e3 |
| PROFINET RT | eth.type == 0x8892 |
| LLDP | eth.type == 0x88cc |
| Modbus TCP | tcp.port == 502 |
| EtherNet/IP | tcp.port == 44818 or udp.port == 2222 |
| OPC UA | tcp.port == 4840 |
| CRC errors | eth.fcs_bad == 1 |
| TCP retransmissions | tcp.analysis.retransmission |
| ARP conflicts | arp.duplicate-address-detected |
| Slow TCP responses | tcp.analysis.ack_rtt > 0.1 |
| Counter | Meaning | Likely Cause |
|---|---|---|
| CRC errors | Frames with bad FCS | Bad cable, SFP, or EMI |
| Runts | Frames < 64 bytes | Collision (half-duplex) or NIC fault |
| Giants | Frames > 1518 bytes | MTU misconfiguration |
| Output drops | Frames dropped (queue full) | Congestion |
| Late collisions | Collisions after 64 bytes | Duplex mismatch or cable too long |
Check on Hirschmann HiOS at Basic Settings → Port Statistics. Check on Linux with ethtool -S eth0 | grep -E "error|drop|crc".
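The `ethtool -S` check is easy to fold into the Python toolkit. A sketch that parses the counter output and flags nonzero error counters; the exact counter names vary by NIC driver, so the sample output below is illustrative:

```python
import re


def flag_error_counters(ethtool_output: str) -> dict[str, int]:
    """Return nonzero error/drop/CRC counters from `ethtool -S` text.

    Assumes the common "name: value" line format; counter names
    are driver-specific.
    """
    flagged = {}
    for line in ethtool_output.splitlines():
        m = re.match(r"\s*([\w.]+):\s*(\d+)\s*$", line)
        if m and re.search(r"error|drop|crc", m.group(1), re.IGNORECASE):
            value = int(m.group(2))
            if value > 0:
                flagged[m.group(1)] = value
    return flagged


sample = """\
NIC statistics:
     rx_packets: 1024
     rx_crc_errors: 17
     tx_dropped: 0
"""
print(flag_error_counters(sample))  # {'rx_crc_errors': 17}
```

Run it periodically and compare snapshots: an error counter that is nonzero but static is history; one that increments between runs is an active fault.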
Document every solved problem
Step 7 is the most skipped and most valuable. A documented solution prevents the same problem from taking hours next time.
Split pairs pass continuity, fail performance
A continuity tester does not detect split pairs. Use a cable certifier that measures NEXT. Split pairs cause CRC errors under load.
tshark detects OT protocol violations
Use tshark filters for PROFINET cycle time violations, MRP oscillation, and Modbus timeout patterns. These are the three most common OT network problems.