The previous chapter covered network operations: documentation, change management, and monitoring. When monitoring detects an issue, structured troubleshooting finds the root cause. This chapter covers the methodology, the tools, and a Python toolkit for automating common diagnostics.
Without a structured approach, engineers chase symptoms: they reboot switches, swap cables, and change configurations at random. A methodology reduces wasted effort and locates the root cause rather than the symptom.
Scenario: An HMI in production cell 3 loses connectivity to the SCADA server every 15 minutes for approximately 30 seconds.
Step 1: Identify the issue. Interview the operator: “The screen goes gray every 15 minutes, then comes back.” Check the NMS: no link-down events on the HMI port. Check recent changes: a new VFD was installed in cell 3 yesterday.
Step 2: Establish a theory. The new VFD generates electromagnetic interference (EMI) that causes CRC errors on the nearby switch port. Frames with CRC errors are dropped, and the resulting retransmissions exceed the SCADA timeout.
Step 3: Test the theory. Check CRC counters on the HMI port: Basic Settings → Port Statistics. The counter increments by 50 to 100 every 15 minutes. Check the VFD duty cycle: the VFD ramps to full speed every 15 minutes. Theory confirmed: the error bursts coincide with the VFD ramp, so VFD switching noise is causing the CRC errors.
Step 4: Plan of action. Reroute the Ethernet cable away from the VFD power cable. Use shielded cable (S/FTP Cat 6A). Maintain 200 mm separation between data and power cables. Rollback: restore original cable routing if the new routing causes other issues.
Step 5: Implement. During the next maintenance window, reroute the cable and replace it with shielded S/FTP Cat 6A cable.
Step 6: Verify. Monitor CRC counters for 24 hours. The counters show zero new CRC errors, HMI connectivity is stable, and the operator confirms no more gray screens.
Step 7: Document. Record in the change management system: “Cell 3 HMI intermittent connectivity caused by EMI from new VFD. Resolved by rerouting Ethernet cable with 200 mm separation and upgrading to S/FTP Cat 6A.”
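The counter check in Step 3 can also be scripted when the counter is polled at regular intervals. The following is a minimal sketch, not part of the original toolkit: the function names, the 50-error threshold (taken from the low end of the observed 50 to 100 range), and the sample data are all illustrative assumptions, and counter wraparound is ignored.

```python
from statistics import mean


def crc_bursts(samples, min_delta=50):
    """Return (timestamp, delta) for each poll where the counter grew by at least min_delta."""
    bursts = []
    for (_, c_prev), (t_cur, c_cur) in zip(samples, samples[1:]):
        delta = c_cur - c_prev
        if delta >= min_delta:
            bursts.append((t_cur, delta))
    return bursts


def mean_burst_interval(bursts):
    """Average seconds between consecutive bursts, or None with fewer than two bursts."""
    times = [t for t, _ in bursts]
    if len(times) < 2:
        return None
    return mean(b - a for a, b in zip(times, times[1:]))


# Hypothetical samples polled every 5 minutes: (seconds since start, CRC counter)
samples = [(0, 100), (300, 100), (600, 100), (900, 175),
           (1200, 175), (1500, 175), (1800, 240), (2100, 240)]

bursts = crc_bursts(samples)
print(bursts)
print(mean_burst_interval(bursts))
```

A burst interval that matches the duty cycle of nearby equipment (here 900 seconds, or 15 minutes) supports the EMI theory. An irregular pattern suggests looking elsewhere.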
The bottom-up approach works for most connectivity issues. Start at Layer 1 and work up through the stack.
Physical-layer issues cause the most frustrating intermittent problems. The link passes basic connectivity tests, but faults appear under load or at specific times.
Both wiring standards work. The problem occurs when one end of a cable uses T568A and the other uses T568B. This combination creates a crossover cable, which causes link failures on devices that do not support Auto-MDI/X. Modern switches support Auto-MDI/X, but older PLCs and industrial devices do not.
| Pin | T568A | T568B |
|---|---|---|
| 1 | White/Green | White/Orange |
| 2 | Green | Orange |
| 3 | White/Orange | White/Green |
| 6 | Orange | Green |
Use T568B consistently throughout the plant. Label both ends of every cable.
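The mismatch can be expressed as a small check over the pinout table above. This is an illustrative sketch only: the function name and the "miswired" label are assumptions, and only pins 1, 2, 3, and 6 are modeled because those are the pins on which the two standards differ.

```python
# Conductor colors on the pins where T568A and T568B differ (from the table above)
T568A = {1: "White/Green", 2: "Green", 3: "White/Orange", 6: "Orange"}
T568B = {1: "White/Orange", 2: "Orange", 3: "White/Green", 6: "Green"}

# In a crossover cable, pin 1 connects to pin 3 and pin 2 connects to pin 6
CROSSOVER_MAP = {1: 3, 2: 6, 3: 1, 6: 2}


def classify_cable(end_a, end_b):
    """Classify a cable from the conductor colors seen at each end."""
    if end_a == end_b:
        return "straight-through"
    if all(end_a[pin] == end_b[other] for pin, other in CROSSOVER_MAP.items()):
        return "crossover"
    return "miswired"


print(classify_cable(T568B, T568B))
print(classify_cable(T568A, T568B))
```

The first call reports a straight-through cable and the second a crossover, which is the combination that fails on devices without Auto-MDI/X.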
A split pair occurs when the 2 conductors of one signal pair are wired onto 2 different twisted pairs in the cable. The cable passes a continuity test (every pin connects to the correct pin at the far end) but fails performance tests because the signal pair loses the noise-canceling benefit of twisting. Split pairs cause CRC errors under load yet still pass basic ping tests.
Detect split pairs with a cable tester that measures NEXT (Near-End Crosstalk). A continuity tester alone does not detect split pairs.
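To make the definition concrete, the sketch below models which physical twisted pair each pin is terminated on and reports any signal pair whose 2 conductors land on different twisted pairs. The function name and the color labels are illustrative assumptions. A real cable requires a NEXT measurement because this information is not visible from continuity alone.

```python
# Ethernet signal pairs by pin number
SIGNAL_PAIRS = [(1, 2), (3, 6), (4, 5), (7, 8)]


def find_split_pairs(pin_to_twisted_pair):
    """Return the signal pairs whose two pins sit on different twisted pairs."""
    return [(a, b) for a, b in SIGNAL_PAIRS
            if pin_to_twisted_pair[a] != pin_to_twisted_pair[b]]


# Correct termination: each signal pair shares one twisted pair
correct = {1: "orange", 2: "orange", 3: "green", 6: "green",
           4: "blue", 5: "blue", 7: "brown", 8: "brown"}

# Common mistake: pairs punched down in sequence (1-2, 3-4, 5-6, 7-8) at both ends
sequential = {1: "orange", 2: "orange", 3: "green", 4: "green",
              5: "blue", 6: "blue", 7: "brown", 8: "brown"}

print(find_split_pairs(correct))
print(find_split_pairs(sequential))
```

The sequential wiring connects every pin to the matching pin at the far end, so continuity passes, but signal pairs 3/6 and 4/5 are each split across the green and blue twisted pairs.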
A TDR (Time Domain Reflectometer) sends a pulse down a copper cable and measures the reflection. The TDR reports the distance to a fault (open, short, or impedance mismatch) in meters. Use a TDR to locate a cable break without pulling the entire cable.
An OTDR (Optical Time Domain Reflectometer) does the same for fiber-optic cables. The OTDR measures attenuation along the fiber and identifies splice points, connectors, and breaks with their exact distance from the test point.
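Both instruments derive distance from the round-trip time of the reflected pulse and the propagation speed in the medium. The sketch below shows the arithmetic for copper. The velocity factor of 0.68 and the 400 ns round-trip time are assumed example values. The real figure (often printed on the cable datasheet as NVP) varies by cable type.

```python
C = 299_792_458  # speed of light in vacuum, m/s


def distance_to_fault(round_trip_s, velocity_factor):
    """Distance in meters: the pulse travels to the fault and back, so divide by 2."""
    return C * velocity_factor * round_trip_s / 2


# Assumed example: reflection arrives 400 ns after the pulse, NVP of 0.68
d = distance_to_fault(400e-9, 0.68)
print(f"Fault at approximately {d:.1f} m")
```

An incorrect velocity factor shifts the reported distance proportionally, so set the value for the installed cable before trusting the reading.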
| Tool | Cable Type | Measures | Use Case |
|---|---|---|---|
| Continuity tester | Copper | Pin-to-pin connectivity | Verify wiring |
| Cable certifier | Copper | NEXT, return loss, length | Certify Cat 6A compliance |
| TDR | Copper | Distance to fault | Locate cable breaks |
| OTDR | Fiber | Attenuation, splice loss, break location | Locate fiber faults |
| Light meter | Fiber | Optical power (dBm) | Verify SFP and fiber link budget |
Wireless issues are harder to diagnose because the medium is invisible and shared.
| Metric | Good | Marginal | Poor | Meaning |
|---|---|---|---|---|
| RSSI (signal strength) | > -65 dBm | -65 to -75 dBm | < -75 dBm | How strong the signal is at the client |
| SNR (signal-to-noise ratio) | > 25 dB | 15 to 25 dB | < 15 dB | Signal strength relative to background noise |
| Channel utilization | < 50% | 50 to 80% | > 80% | How busy the channel is |
| Retry rate | < 10% | 10 to 20% | > 20% | Percentage of frames that required retransmission |
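The thresholds in the table above translate directly into a grading function, which is useful when metrics are exported from a controller or survey tool. This is a minimal sketch: the metric key names and function names are assumptions chosen for illustration, and values that fall exactly on a boundary are graded as marginal.

```python
# (good threshold, poor threshold, higher_is_better), taken from the table above
THRESHOLDS = {
    "rssi_dbm": (-65, -75, True),
    "snr_db": (25, 15, True),
    "channel_util_pct": (50, 80, False),
    "retry_pct": (10, 20, False),
}


def grade(value, good, poor, higher_is_better):
    """Grade one metric as good, marginal, or poor."""
    if higher_is_better:
        if value > good:
            return "good"
        if value < poor:
            return "poor"
    else:
        if value < good:
            return "good"
        if value > poor:
            return "poor"
    return "marginal"


def grade_link(metrics):
    """Grade every supplied metric against the table thresholds."""
    return {name: grade(value, *THRESHOLDS[name]) for name, value in metrics.items()}


# Hypothetical client reading
print(grade_link({"rssi_dbm": -60, "snr_db": 20,
                  "channel_util_pct": 85, "retry_pct": 10}))
```

A client with good RSSI but poor channel utilization, as in this example, points toward interference or congestion rather than coverage.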
| Symptom | Likely Cause | Diagnostic |
|---|---|---|
| Low throughput, good signal | Co-channel interference | Check channel utilization, survey for overlapping APs |
| Intermittent disconnects | Roaming between APs with different configs | Verify that APs share the same SSID, security, and VLAN |
| Good signal, high retry rate | Hidden node problem | Enable RTS/CTS, reposition APs |
| No connection in specific area | Dead zone (signal blocked by metal) | Perform a site survey with a Wi-Fi analyzer |
Industrial environments have unique interference sources: VFDs (variable frequency drives) generate broadband noise, welding equipment creates impulse noise, and metal structures (cabinets, conveyors, ductwork) block and reflect signals. Perform a site survey with a Wi-Fi analyzer before deploying APs in a plant.
PROFINET RT requires frames to arrive within the configured cycle time (typically 1 to 4 ms). A violation causes the IO controller to declare the IO device failed, which stops the associated process.
Detect cycle time violations with tshark:
```bash
# Capture PROFINET RT frames and show timing
tshark -i eth0 -f "ether proto 0x8892" -T fields \
  -e frame.time_delta_displayed \
  -e pn_rt.frame_id \
  -e pn_rt.cycle_counter \
  | awk '$1 > 0.004 {print "VIOLATION: " $0}'
```

A frame.time_delta_displayed greater than the cycle time (0.004 seconds for a 4 ms cycle) indicates a violation. Common causes: switch queue congestion (too much non-PROFINET traffic on the same VLAN), STP topology change (temporary forwarding interruption), or a slow link in the path.
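The delta in the capture above is measured between consecutive frames regardless of sender, so on a segment with several IO devices a late frame from one device can be masked by on-time frames from the others. The sketch below groups frames by frame ID before comparing gaps. It assumes the timestamps have already been parsed into integer microseconds (to avoid floating-point comparisons). The function name and sample data are illustrative.

```python
def find_violations(frames, cycle_us=4000):
    """Return (frame_id, previous_timestamp, gap) for each gap above the cycle time.

    frames: iterable of (timestamp_us, frame_id) in capture order.
    """
    last_seen = {}
    violations = []
    for ts, frame_id in frames:
        prev = last_seen.get(frame_id)
        if prev is not None and ts - prev > cycle_us:
            violations.append((frame_id, prev, ts - prev))
        last_seen[frame_id] = ts
    return violations


# Hypothetical capture: two frame IDs on a 4 ms cycle, one late frame for 0x8001
frames = [(0, "0x8000"), (1000, "0x8001"), (4000, "0x8000"),
          (5000, "0x8001"), (8000, "0x8000"), (15000, "0x8001")]

print(find_violations(frames))
```

Real captures include jitter, so a practical threshold would add a margin on top of the configured cycle time instead of using the exact value.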
MRP ring oscillation occurs when 2 devices claim the MRM (Media Redundancy Manager) role in the same ring domain. Both MRMs open and close the ring simultaneously, causing continuous topology changes and MAC table flushes.
Detect oscillation by monitoring MRP topology change events:
```bash
# Count MRP topology changes per second
tshark -i eth0 -f "ether proto 0x88e3" -T fields \
  -e frame.time_epoch -e mrp.type \
  | grep "TopologyChange" | cut -d. -f1 | uniq -c
```

More than 1 topology change per minute (outside of a real cable disconnection) indicates oscillation. Verify that exactly 1 MRM exists per ring domain.
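Because the threshold is stated per minute, it can help to bucket the topology change timestamps by minute before judging them. The following is a minimal sketch that assumes the epoch timestamps of the topology change frames have already been extracted. The function names and sample timestamps are illustrative.

```python
from collections import Counter


def changes_per_minute(epoch_times):
    """Count events per one-minute bucket, keyed by minute index."""
    return Counter(int(t // 60) for t in epoch_times)


def oscillating_minutes(epoch_times, max_per_minute=1):
    """Return the start time (epoch seconds) of each minute that exceeds the threshold."""
    counts = changes_per_minute(epoch_times)
    return sorted(minute * 60 for minute, n in counts.items() if n > max_per_minute)


# Hypothetical topology change timestamps (epoch seconds)
events = [1000.2, 1001.7, 1003.1, 1150.0, 1300.5, 1301.0]

print(oscillating_minutes(events))
```

A single flagged minute may still correspond to a real cable event. Flagged minutes that repeat continuously are the pattern that suggests 2 competing MRMs.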
Modbus TCP uses a request-response pattern. The SCADA server sends a request and waits for a response. If the response does not arrive within the timeout (typically 1 to 5 seconds), then the SCADA server marks the device as unreachable.
```bash
# Show Modbus TCP response times
tshark -i eth0 -f "tcp port 502" -T fields \
  -e frame.time_delta_displayed \
  -e ip.src -e ip.dst \
  -e modbus.func_code \
  | awk '$1 > 1.0 {print "SLOW: " $0}'
```

Response times greater than 1 second indicate network congestion, PLC CPU overload, or a routing issue. Consistent timeouts from a single PLC point to a PLC issue. Timeouts from all PLCs point to a network issue.
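The single-PLC versus all-PLCs rule can be applied automatically once response times are grouped by device. This sketch assumes a dictionary of response times in seconds keyed by PLC address. The function name, the addresses, and the choice to label a partial set of slow PLCs as a PLC issue are illustrative assumptions.

```python
def classify_timeouts(response_times, timeout_s=1.0):
    """Return (verdict, slow_plcs) from per-PLC response times in seconds."""
    slow = sorted(plc for plc, times in response_times.items()
                  if any(t > timeout_s for t in times))
    if not slow:
        return "ok", slow
    # Every monitored PLC is slow: suspect the shared network path
    if len(response_times) > 1 and len(slow) == len(response_times):
        return "network", slow
    # Otherwise suspect the individual devices (a simplification in this sketch)
    return "plc", slow


# Hypothetical measurements
one_slow = {"10.0.3.11": [0.02, 0.03],
            "10.0.3.12": [0.02, 1.8, 2.4],
            "10.0.3.13": [0.01]}
all_slow = {"10.0.3.11": [1.5], "10.0.3.12": [2.0]}

print(classify_timeouts(one_slow))
print(classify_timeouts(all_slow))
```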
The following script extends the basic diagnostic toolkit with DNS resolution and SNMP reachability checks.
```python
import socket
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class DiagResult:
    host: str
    checks: list[dict] = field(default_factory=list)

    def add(self, layer: str, check: str, passed: bool, detail: str = ""):
        self.checks.append({"layer": layer, "check": check,
                            "passed": passed, "detail": detail})

    def report(self):
        print(f"\n{'=' * 60}")
        print(f"Diagnostic Report: {self.host}")
        print(f"{'=' * 60}")
        for c in self.checks:
            icon = "PASS" if c["passed"] else "FAIL"
            print(f"[{icon}] [{c['layer']}] {c['check']}")
            if c["detail"]:
                print(f"    {c['detail']}")
        failed = [c for c in self.checks if not c["passed"]]
        summary = "All checks passed" if not failed else f"{len(failed)} check(s) failed"
        print(f"\n{summary}")


def check_ping(host: str) -> tuple[bool, str]:
    # -c (count) and -W (timeout in seconds) are Linux ping options
    result = subprocess.run(
        ["ping", "-c", "4", "-W", "1", host],
        capture_output=True, text=True
    )
    if result.returncode == 0:
        for line in result.stdout.splitlines():
            if "rtt" in line or "min/avg/max" in line:
                return True, line.strip()
        # Replies arrived but no summary line was found
        return True, "Reply received"
    return False, "No response"


def check_port(host: str, port: int) -> tuple[bool, str]:
    try:
        with socket.create_connection((host, port), timeout=2.0):
            return True, f"Port {port}/tcp open"
    except OSError as e:  # includes ConnectionRefusedError and timeouts
        return False, f"Port {port}/tcp: {e}"


def check_dns(hostname: str) -> tuple[bool, str]:
    try:
        addr = socket.gethostbyname(hostname)
        return True, f"Resolves to {addr}"
    except socket.gaierror as e:
        return False, f"DNS resolution failed: {e}"


def check_snmp(host: str, community: str = "public") -> tuple[bool, str]:
    try:
        # Uses the synchronous pysnmp hlapi; imported here so the other
        # checks still run when pysnmp is not installed
        from pysnmp.hlapi import (
            getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
            ContextData, ObjectType, ObjectIdentity
        )
        it = getCmd(SnmpEngine(),
                    CommunityData(community, mpModel=1),  # mpModel=1 is SNMPv2c
                    UdpTransportTarget((host, 161), timeout=2, retries=1),
                    ContextData(),
                    ObjectType(ObjectIdentity("1.3.6.1.2.1.1.5.0")))  # sysName.0
        err_ind, err_stat, _, var_binds = next(it)
        if err_ind:
            return False, f"SNMP error: {err_ind}"
        if err_stat:
            return False, f"SNMP error status: {err_stat.prettyPrint()}"
        return True, f"sysName={var_binds[0][1]}"
    except Exception as e:
        return False, f"SNMP failed: {e}"


def run_diagnostics(host: str, ports: list[int], hostname: str = "") -> DiagResult:
    result = DiagResult(host=host)

    # Layer 3: ping
    passed, detail = check_ping(host)
    result.add("L3", f"Ping {host}", passed, detail)

    # Layer 4: TCP ports
    for port in ports:
        passed, detail = check_port(host, port)
        result.add("L4", f"TCP port {port}", passed, detail)

    # Layer 7: DNS
    if hostname:
        passed, detail = check_dns(hostname)
        result.add("L7", f"DNS resolve {hostname}", passed, detail)

    # Layer 7: SNMP
    passed, detail = check_snmp(host)
    result.add("L7", f"SNMP query {host}", passed, detail)

    return result


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "192.168.1.1"
    hostname = sys.argv[2] if len(sys.argv) > 2 else ""
    diag = run_diagnostics(target, ports=[22, 80, 443, 502, 44818], hostname=hostname)
    diag.report()
```

Running this script against a Hirschmann switch at 192.168.50.1 with hostname sw-cell1.plant.local:

```bash
python diag.py 192.168.50.1 sw-cell1.plant.local
```

A failed ping points to Layer 1 to 3. A successful ping with a failed port check points to Layer 4 (firewall or service not running). A failed DNS check points to a name resolution issue. A failed SNMP check with a successful ping points to SNMP configuration (wrong community string, SNMPv2c disabled).
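The interpretation rules above can be captured in a small helper so that a report ends with a suggested starting point. This is a hedged sketch and not part of the script: the function name, its parameters, and the hint strings are illustrative. Port and SNMP results are ignored when ping fails, because those failures are then expected consequences.

```python
def interpret(ping_ok, ports_ok, dns_ok=None, snmp_ok=None):
    """Map check outcomes to the layer to investigate first.

    dns_ok and snmp_ok may be None when the check was not run.
    """
    hints = []
    if not ping_ok:
        hints.append("Layer 1 to 3")
    else:
        if not ports_ok:
            hints.append("Layer 4: firewall or service not running")
        if snmp_ok is False:
            hints.append("SNMP configuration: wrong community string or SNMPv2c disabled")
    if dns_ok is False:
        hints.append("Name resolution")
    return hints


# Hypothetical outcome: host answers ping, a port is closed, SNMP does not answer
print(interpret(ping_ok=True, ports_ok=False, dns_ok=True, snmp_ok=False))
```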
| Scenario | Wireshark Filter |
|---|---|
| MRP frames | eth.type == 0x88e3 |
| PROFINET RT | eth.type == 0x8892 |
| LLDP | eth.type == 0x88cc |
| Modbus TCP | tcp.port == 502 |
| EtherNet/IP | tcp.port == 44818 or udp.port == 2222 |
| OPC UA | tcp.port == 4840 |
| CRC errors | eth.fcs.status == "Bad" |
| TCP retransmissions | tcp.analysis.retransmission |
| ARP conflicts | arp.duplicate-address-detected |
| Slow TCP responses | tcp.analysis.ack_rtt > 0.1 |
| Counter | Meaning | Likely Cause |
|---|---|---|
| CRC errors | Frames with bad FCS | Bad cable, SFP, or EMI |
| Runts | Frames < 64 bytes | Collision (half-duplex) or NIC fault |
| Giants | Frames > 1518 bytes | MTU misconfiguration |
| Output drops | Frames dropped (queue full) | Congestion |
| Late collisions | Collisions after 64 bytes | Duplex mismatch or cable too long |
On Hirschmann HiOS, check these counters at Basic Settings → Port Statistics. On Linux, use ethtool -S eth0 | grep -E "error|drop|crc".
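On Linux, the same filter can be applied in Python so that only counters with a nonzero value are reported. The sketch below parses text in the name: value layout that ethtool -S prints. The function name and the sample output are illustrative, and the actual counter names vary by NIC driver.

```python
import re

# Same pattern as the grep filter above
PATTERN = re.compile(r"error|drop|crc")


def nonzero_error_counters(ethtool_output):
    """Return {counter_name: value} for matching counters with a value above zero."""
    counters = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        if sep and value.isdigit() and PATTERN.search(name) and int(value) > 0:
            counters[name] = int(value)
    return counters


# Hypothetical ethtool -S output
sample = """NIC statistics:
     rx_packets: 184467
     rx_crc_errors: 75
     rx_dropped: 0
     tx_errors: 0
     rx_missed_errors: 3
"""

print(nonzero_error_counters(sample))
```

To use live data, pass the standard output of subprocess.run(["ethtool", "-S", "eth0"], capture_output=True, text=True) to the function.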
Document every solved issue
Step 7 is the most skipped and most valuable. A documented solution keeps the same issue from taking hours next time.
Split pairs pass continuity, not performance
A continuity tester does not detect split pairs. Use a cable certifier that measures NEXT. Split pairs cause CRC errors under load.
tshark detects OT protocol violations
Use tshark filters for PROFINET cycle time violations, MRP oscillation, and Modbus timeout patterns. These are the 3 most common OT network issues.