
16.1 Troubleshooting Methodology

The previous chapter covered network operations: documentation, change management, and monitoring. When monitoring detects a problem, structured troubleshooting finds the root cause. This chapter provides the methodology, the tools, and a Python toolkit for automating common diagnostics.

Without a structured approach, engineers chase symptoms. They reboot switches, swap cables, and change configurations randomly. A methodology prevents wasted effort and ensures the root cause is found, not just the symptom.

Scenario: An HMI in production cell 3 loses connectivity to the SCADA server every 15 minutes for approximately 30 seconds.

Step 1: Identify the problem. Interview the operator: “The screen goes gray every 15 minutes, then comes back.” Check the NMS: no link-down events on the HMI port. Check recent changes: a new VFD was installed in cell 3 yesterday.

Step 2: Establish a theory. The new VFD generates electromagnetic interference (EMI) that causes CRC errors on the nearby switch port. CRC errors trigger retransmissions that exceed the SCADA timeout.

Step 3: Test the theory. Check CRC error counters on the HMI port: Basic Settings → Port Statistics. The counter increments by 50 to 100 every 15 minutes. Check the VFD duty cycle: it ramps to full speed every 15 minutes. Theory confirmed: VFD switching noise causes CRC errors.
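The correlation in Step 3 can be checked from periodic counter samples. The following is a minimal sketch, assuming the samples are `(timestamp_seconds, crc_counter)` tuples collected by some poller; the function names and sample values are hypothetical, and counter wraps or resets are ignored.

```python
def crc_bursts(samples, min_increase=1):
    """Return (timestamp, increase) for each interval where the counter grew.

    samples: list of (timestamp_seconds, crc_counter) tuples in time order.
    """
    bursts = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        delta = cur - prev
        if delta >= min_increase:
            bursts.append((ts, delta))
    return bursts


def burst_intervals(bursts):
    """Seconds between consecutive bursts; a steady value suggests a periodic source."""
    times = [ts for ts, _ in bursts]
    return [b - a for a, b in zip(times, times[1:])]


# Hypothetical samples taken every 5 minutes (timestamps in seconds).
samples = [(0, 100), (300, 100), (600, 100), (900, 172),
           (1200, 172), (1500, 172), (1800, 251)]
bursts = crc_bursts(samples)
print(bursts)                   # [(900, 72), (1800, 79)]
print(burst_intervals(bursts))  # [900]
```

A constant interval of 900 seconds matches the 15-minute VFD ramp, which supports the theory before any cable is touched.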

Step 4: Plan of action. Reroute the Ethernet cable away from the VFD power cable. Use shielded cable (S/FTP Cat 6A). Maintain 200 mm separation between data and power cables. Rollback: restore original cable routing if the new routing causes other issues.

Step 5: Implement. During the next maintenance window, reroute the cable and replace it with shielded cable.

Step 6: Verify. Monitor CRC counters for 24 hours. Zero new CRC errors. HMI connectivity is stable. Operator confirms no more gray screens.

Step 7: Document. Record in the change management system: “Cell 3 HMI intermittent connectivity caused by EMI from new VFD. Resolved by rerouting Ethernet cable with 200 mm separation and upgrading to S/FTP Cat 6A.”

The bottom-up approach works for most connectivity problems. Start at Layer 1 and work up.

Physical-layer problems cause the most frustrating intermittent failures. They pass basic connectivity tests but fail under load or at specific times.

Both wiring standards work. The problem occurs when one end of a cable uses T568A and the other uses T568B. This creates a crossover cable, which causes link failures on devices that do not support Auto-MDI/X. Modern switches support Auto-MDI/X, but older PLCs and industrial devices may not.

| Pin | T568A | T568B |
|-----|-------|-------|
| 1 | White/Green | White/Orange |
| 2 | Green | Orange |
| 3 | White/Orange | White/Green |
| 6 | Orange | Green |

Use T568B consistently throughout the plant. Label both ends of every cable.
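The mismatch described above can be expressed as a small lookup. This is an illustrative sketch that covers only the four pins listed in the table; the function names are hypothetical.

```python
# Pin-to-color maps for the four pins listed in the table above.
T568A = {1: "White/Green", 2: "Green", 3: "White/Orange", 6: "Orange"}
T568B = {1: "White/Orange", 2: "Orange", 3: "White/Green", 6: "Green"}


def identify_standard(pinout):
    """Return 'T568A', 'T568B', or 'unknown' for one cable end."""
    if pinout == T568A:
        return "T568A"
    if pinout == T568B:
        return "T568B"
    return "unknown"


def classify_cable(end_a, end_b):
    """Classify a cable from the pinouts observed at both ends."""
    a, b = identify_standard(end_a), identify_standard(end_b)
    if "unknown" in (a, b):
        return "miswired"
    return "straight-through" if a == b else "crossover"


print(classify_cable(T568B, T568B))  # straight-through
print(classify_cable(T568A, T568B))  # crossover
```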

A split pair occurs when the two conductors of one signal pair are terminated on wires from two different twisted pairs. The cable passes a continuity test (every pin connects to the same pin at the far end) but fails performance tests because the signal pair loses the noise-canceling benefit of twisting. Split pairs pass basic ping tests but cause CRC errors under load.

Detect split pairs with a cable tester that measures NEXT (Near-End Crosstalk). A continuity tester alone does not detect split pairs.
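To illustrate why continuity alone is not enough, the sketch below checks a wire map for split pairs. It assumes you know which physical twisted pair (identified here by color name) each pin is terminated on; the function name and the example wire maps are hypothetical.

```python
# Ethernet signal pairs by pin number (pins that must share one twisted pair).
SIGNAL_PAIRS = [(1, 2), (3, 6), (4, 5), (7, 8)]


def find_split_pairs(pin_to_twisted_pair):
    """Return the signal pairs whose two pins sit on different twisted pairs."""
    return [(a, b) for a, b in SIGNAL_PAIRS
            if pin_to_twisted_pair[a] != pin_to_twisted_pair[b]]


# Correct T568B termination: pins 3 and 6 share the green pair.
t568b = {1: "orange", 2: "orange", 3: "green", 4: "blue",
         5: "blue", 6: "green", 7: "brown", 8: "brown"}

# Pairs punched down in sequence at both ends: every pin still connects
# straight through, so a continuity tester passes this cable.
sequential = {1: "orange", 2: "orange", 3: "green", 4: "green",
              5: "blue", 6: "blue", 7: "brown", 8: "brown"}

print(find_split_pairs(t568b))       # []
print(find_split_pairs(sequential))  # [(3, 6), (4, 5)]
```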

A TDR (Time Domain Reflectometer) sends a pulse down a copper cable and measures the reflection. It reports the distance to a fault (open, short, or impedance mismatch) in meters. Use a TDR to locate a cable break without pulling the entire cable.
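The distance a TDR reports follows from the round-trip time of the pulse and the cable's nominal velocity of propagation (NVP). The sketch below shows the arithmetic; the default NVP of 0.65 and the 400 ns example are assumptions, so use the value from the cable datasheet in practice.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458


def tdr_distance_m(round_trip_s, nvp=0.65):
    """Distance to a fault in meters.

    round_trip_s: time between sending the pulse and receiving the reflection.
    nvp: signal speed in the cable as a fraction of the speed of light (assumed).
    The pulse travels to the fault and back, hence the division by 2.
    """
    return nvp * SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2


# A reflection arriving 400 ns after the pulse, with the assumed NVP of 0.65.
print(round(tdr_distance_m(400e-9), 1))  # 39.0
```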

An OTDR (Optical Time Domain Reflectometer) does the same for fiber-optic cables. It measures attenuation along the fiber and identifies splice points, connectors, and breaks with their exact distance from the test point.

| Tool | Cable Type | Measures | Use Case |
|------|------------|----------|----------|
| Continuity tester | Copper | Pin-to-pin connectivity | Verify wiring |
| Cable certifier | Copper | NEXT, return loss, length | Certify Cat 6A compliance |
| TDR | Copper | Distance to fault | Locate cable breaks |
| OTDR | Fiber | Attenuation, splice loss, break location | Locate fiber faults |
| Light meter | Fiber | Optical power (dBm) | Verify SFP and fiber link budget |

Wireless problems are harder to diagnose because the medium is invisible and shared.

| Metric | Good | Marginal | Poor | Meaning |
|--------|------|----------|------|---------|
| RSSI (signal strength) | > -65 dBm | -65 to -75 dBm | < -75 dBm | How strong the signal is at the client |
| SNR (signal-to-noise ratio) | > 25 dB | 15 to 25 dB | < 15 dB | Signal strength relative to background noise |
| Channel utilization | < 50% | 50 to 80% | > 80% | How busy the channel is |
| Retry rate | < 10% | 10 to 20% | > 20% | Percentage of frames that required retransmission |

| Symptom | Likely Cause | Diagnostic |
|---------|--------------|------------|
| Low throughput, good signal | Co-channel interference | Check channel utilization, survey for overlapping APs |
| Intermittent disconnects | Roaming between APs with different configs | Verify all APs share the same SSID, security, and VLAN |
| Good signal, high retry rate | Hidden node problem | Enable RTS/CTS, reposition APs |
| No connection in specific area | Dead zone (signal blocked by metal) | Perform a site survey with a Wi-Fi analyzer |
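The thresholds in the metrics table can be applied mechanically to survey readings. The following sketch encodes them as a lookup; the metric keys and the `rate` function are hypothetical names, and the sample readings are made up.

```python
# (good_limit, poor_limit, higher_is_better) taken from the metrics table above.
THRESHOLDS = {
    "rssi_dbm": (-65, -75, True),
    "snr_db": (25, 15, True),
    "channel_utilization_pct": (50, 80, False),
    "retry_rate_pct": (10, 20, False),
}


def rate(metric, value):
    """Return 'good', 'marginal', or 'poor' for one reading."""
    good, poor, higher_is_better = THRESHOLDS[metric]
    if higher_is_better:
        if value > good:
            return "good"
        if value < poor:
            return "poor"
    else:
        if value < good:
            return "good"
        if value > poor:
            return "poor"
    return "marginal"


# Hypothetical readings from one client position.
readings = {"rssi_dbm": -60, "snr_db": 20,
            "channel_utilization_pct": 85, "retry_rate_pct": 25}
for metric, value in readings.items():
    print(f"{metric}: {value} -> {rate(metric, value)}")
```

In this example the signal is good but the retry rate is poor, which matches the "good signal, high retry rate" row of the symptom table.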

Industrial environments have unique interference sources: VFDs (variable frequency drives) generate broadband noise, welding equipment creates impulse noise, and metal structures (cabinets, conveyors, ductwork) block and reflect signals. Perform a site survey with a Wi-Fi analyzer before deploying APs in a plant.

PROFINET RT requires frames to arrive within the configured cycle time (typically 1 to 4 ms). A violation causes the IO controller to declare the IO device as failed, which stops the associated process.

Detect cycle time violations with tshark:

# Capture PROFINET RT frames and flag gaps longer than a 4 ms cycle
# (-l flushes output after each packet so awk sees frames as they arrive)
tshark -l -i eth0 -f "ether proto 0x8892" -T fields \
  -e frame.time_delta_displayed \
  -e pn_rt.frame_id \
  -e pn_rt.cycle_counter \
  | awk '$1 > 0.004 {print "VIOLATION: " $0}'

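A frame.time_delta_displayed greater than the cycle time (0.004 seconds for a 4 ms cycle) indicates a violation. The delta is measured between consecutive captured frames regardless of sender, so narrow the capture to a single IO device when several share the link; otherwise frames from other devices can mask the gap. Common causes: switch queue congestion (too much non-PROFINET traffic on the same VLAN), STP topology change (temporary forwarding interruption), or a slow link in the path.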
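When several IO devices share the capture point, the gap has to be computed per stream rather than across all frames. The sketch below does this for tshark output, assuming the capture was exported with `-T fields -e frame.time_relative -e pn_rt.frame_id` (a different field selection than the command above). The function name, the jitter tolerance, and the sample lines are illustrative assumptions.

```python
def find_cycle_violations(lines, cycle_time_s=0.004, tolerance_s=0.0005):
    """Return (frame_id, timestamp, gap) for gaps above cycle time plus tolerance.

    lines: tshark field output, one frame per line: "<seconds> <frame_id>".
    The gap is tracked separately for every frame ID.
    """
    last_seen = {}
    violations = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or incomplete lines
        ts, frame_id = float(parts[0]), parts[1]
        prev = last_seen.get(frame_id)
        if prev is not None and ts - prev > cycle_time_s + tolerance_s:
            violations.append((frame_id, ts, ts - prev))
        last_seen[frame_id] = ts
    return violations


# Hypothetical capture: stream 0x8000 skips one cycle between 0.004 s and 0.013 s.
sample = ["0.000\t0x8000", "0.002\t0x8001", "0.004\t0x8000",
          "0.006\t0x8001", "0.010\t0x8001", "0.013\t0x8000"]
for frame_id, ts, gap in find_cycle_violations(sample):
    print(f"VIOLATION: frame_id={frame_id} at {ts:.3f}s gap={gap * 1000:.1f} ms")
```

In the sample, no two consecutive frames are more than 4 ms apart, so a delta across all frames would report nothing, while the per-stream check finds the 9 ms gap on 0x8000.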

MRP ring oscillation occurs when two devices claim the MRM (Media Redundancy Manager) role in the same ring domain. Both MRMs open and close the ring simultaneously, causing continuous topology changes and MAC table flushes.

Detect oscillation by monitoring MRP topology change events:

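# Count MRP topology change frames per second
# -T fields prints mrp.type as a number rather than a name, so select the
# TopologyChange TLV (0x03 in the Wireshark MRP dissector) with a display filter
tshark -l -i eth0 -f "ether proto 0x88e3" -Y "mrp.type == 0x03" -T fields \
  -e frame.time_epoch \
  | cut -d. -f1 | uniq -c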

More than one topology change per minute (outside of a real cable failure) indicates oscillation. Verify that exactly one MRM exists per ring domain.
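The one-per-minute rule of thumb can be applied to the captured timestamps directly. This is a minimal sketch, assuming you already have the epoch timestamps of the topology change frames as floats; the function names and sample values are hypothetical.

```python
from collections import Counter


def topology_changes_per_minute(epoch_times):
    """Count topology change frames in each one-minute bucket."""
    return Counter(int(t // 60) for t in epoch_times)


def oscillating_minutes(epoch_times, max_per_minute=1):
    """Return the start time (seconds) of each minute that exceeds the limit."""
    counts = topology_changes_per_minute(epoch_times)
    return sorted(minute * 60 for minute, n in counts.items()
                  if n > max_per_minute)


# Hypothetical timestamps (seconds since the start of the capture).
times = [0.5, 10.2, 20.1, 75.0, 130.0, 131.5]
print(oscillating_minutes(times))  # [0, 120]
```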

Modbus TCP uses a request-response pattern. The SCADA server sends a request and waits for a response. If the response does not arrive within the timeout (typically 1 to 5 seconds), the SCADA server marks the device as unreachable.

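# Show Modbus TCP responses that took longer than 1 second
# modbus.response_time is Wireshark's measured time between a request and
# its response; the delta between consecutive frames would also count idle
# time between polls
tshark -l -i eth0 -f "tcp port 502" -Y "modbus.response_time" -T fields \
  -e modbus.response_time \
  -e ip.src -e ip.dst \
  -e modbus.func_code \
  | awk '$1 > 1.0 {print "SLOW: " $0}'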

Response times greater than 1 second indicate network congestion, PLC CPU overload, or a routing problem. Consistent timeouts from a single PLC point to a PLC issue. Timeouts from all PLCs point to a network issue.
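The single-PLC versus all-PLC rule can be written down as a small decision function. The sketch below assumes you have counted slow responses per PLC address yourself; the function name, return strings, and addresses are hypothetical.

```python
def classify_timeouts(slow_by_plc, all_plcs):
    """Suggest where to look based on which PLCs show slow responses.

    slow_by_plc: dict mapping PLC address to its number of slow responses.
    all_plcs: every PLC address polled by the SCADA server.
    """
    affected = {plc for plc, count in slow_by_plc.items() if count > 0}
    if not affected:
        return "no timeouts"
    if len(all_plcs) > 1 and affected == set(all_plcs):
        return "network issue"
    if len(affected) == 1:
        return f"PLC issue: {next(iter(affected))}"
    return "inconclusive"


plcs = ["10.0.3.11", "10.0.3.12", "10.0.3.13"]  # hypothetical addresses
print(classify_timeouts({"10.0.3.12": 7}, plcs))       # PLC issue: 10.0.3.12
print(classify_timeouts({p: 3 for p in plcs}, plcs))   # network issue
```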

The following script extends the basic diagnostic toolkit with DNS resolution and SNMP reachability checks.

import socket
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class DiagResult:
    """Collects the outcome of every check run against one host."""
    host: str
    checks: list[dict] = field(default_factory=list)

    def add(self, layer: str, check: str, passed: bool, detail: str = ""):
        self.checks.append({"layer": layer, "check": check,
                            "passed": passed, "detail": detail})

    def report(self):
        print(f"\n{'=' * 60}")
        print(f"Diagnostic Report: {self.host}")
        print(f"{'=' * 60}")
        for c in self.checks:
            icon = "PASS" if c["passed"] else "FAIL"
            print(f"[{icon}] [{c['layer']}] {c['check']}")
            if c["detail"]:
                print(f"    {c['detail']}")
        failed = [c for c in self.checks if not c["passed"]]
        summary = f"{len(failed)} check(s) failed" if failed else "All checks passed"
        print(f"\n{summary}")


def check_ping(host: str) -> tuple[bool, str]:
    # Linux ping syntax: 4 echo requests, 1 second timeout per reply
    result = subprocess.run(
        ["ping", "-c", "4", "-W", "1", host],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        return False, "No response"
    # Report the RTT summary line when ping prints one
    for line in result.stdout.splitlines():
        if "rtt" in line or "min/avg/max" in line:
            return True, line.strip()
    return True, "Reply received"


def check_port(host: str, port: int) -> tuple[bool, str]:
    try:
        with socket.create_connection((host, port), timeout=2.0):
            return True, f"Port {port}/tcp open"
    except OSError as e:  # covers refused connections and timeouts
        return False, f"Port {port}/tcp: {e}"


def check_dns(hostname: str) -> tuple[bool, str]:
    try:
        addr = socket.gethostbyname(hostname)
        return True, f"Resolves to {addr}"
    except socket.gaierror as e:
        return False, f"DNS resolution failed: {e}"


def check_snmp(host: str, community: str = "public") -> tuple[bool, str]:
    # Written against the synchronous pysnmp.hlapi API (pysnmp 4.x);
    # other pysnmp releases may expose a different interface.
    try:
        from pysnmp.hlapi import (
            getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
            ContextData, ObjectType, ObjectIdentity
        )
        # mpModel=1 selects SNMPv2c; the OID is sysName.0
        it = getCmd(SnmpEngine(), CommunityData(community, mpModel=1),
                    UdpTransportTarget((host, 161), timeout=2, retries=1),
                    ContextData(),
                    ObjectType(ObjectIdentity("1.3.6.1.2.1.1.5.0")))
        err_ind, err_stat, _, var_binds = next(it)
        if err_ind:
            return False, f"SNMP error: {err_ind}"
        if err_stat:
            return False, f"SNMP error: {err_stat.prettyPrint()}"
        return True, f"sysName={var_binds[0][1]}"
    except Exception as e:
        return False, f"SNMP failed: {e}"


def run_diagnostics(host: str, ports: list[int],
                    hostname: str = "") -> DiagResult:
    result = DiagResult(host=host)
    # Layer 3: ping
    passed, detail = check_ping(host)
    result.add("L3", f"Ping {host}", passed, detail)
    # Layer 4: TCP ports
    for port in ports:
        passed, detail = check_port(host, port)
        result.add("L4", f"TCP port {port}", passed, detail)
    # Layer 7: DNS (only when a hostname was supplied)
    if hostname:
        passed, detail = check_dns(hostname)
        result.add("L7", f"DNS resolve {hostname}", passed, detail)
    # Layer 7: SNMP
    passed, detail = check_snmp(host)
    result.add("L7", f"SNMP query {host}", passed, detail)
    return result


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "192.168.1.1"
    hostname = sys.argv[2] if len(sys.argv) > 2 else ""
    diag = run_diagnostics(target, ports=[22, 80, 443, 502, 44818],
                           hostname=hostname)
    diag.report()

Running this against a Hirschmann switch at 192.168.50.1 with hostname sw-cell1.plant.local:

python diag.py 192.168.50.1 sw-cell1.plant.local

A failed ping points to Layer 1 to 3. A successful ping with a failed port check points to Layer 4 (firewall or service not running). A failed DNS check points to a name resolution problem. A failed SNMP check with successful ping points to SNMP configuration (wrong community string, SNMPv2c disabled).
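The interpretation rules above can be sketched as a helper that turns check outcomes into hints. The function name and signature are hypothetical, and it is not part of the script above; `ports_ok` is assumed to mean that every TCP port check passed.

```python
def interpret(ping_ok, ports_ok, dns_ok=None, snmp_ok=None):
    """Map check outcomes to the area to investigate first.

    dns_ok and snmp_ok may be None when the check was not run.
    """
    hints = []
    if not ping_ok:
        hints.append("Layer 1 to 3")
    else:
        if not ports_ok:
            hints.append("Layer 4: firewall or service not running")
        if snmp_ok is False:
            hints.append("SNMP configuration: community string or SNMPv2c disabled")
    if dns_ok is False:
        hints.append("Name resolution")
    return hints


print(interpret(ping_ok=True, ports_ok=False, dns_ok=True, snmp_ok=False))
print(interpret(ping_ok=False, ports_ok=False))
```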

| Scenario | Wireshark Filter |
|----------|------------------|
| MRP frames | eth.type == 0x88e3 |
| PROFINET RT | eth.type == 0x8892 |
| LLDP | eth.type == 0x88cc |
| Modbus TCP | tcp.port == 502 |
| EtherNet/IP | tcp.port == 44818 or udp.port == 2222 |
| OPC UA | tcp.port == 4840 |
| CRC errors | eth.fcs.status == "Bad" |
| TCP retransmissions | tcp.analysis.retransmission |
| ARP conflicts | arp.duplicate-address-detected |
| Slow TCP responses | tcp.analysis.ack_rtt > 0.1 |

| Counter | Meaning | Likely Cause |
|---------|---------|--------------|
| CRC errors | Frames with bad FCS | Bad cable, SFP, or EMI |
| Runts | Frames < 64 bytes | Collision (half-duplex) or NIC fault |
| Giants | Frames > 1518 bytes | MTU misconfiguration |
| Output drops | Frames dropped (queue full) | Congestion |
| Late collisions | Collisions after 64 bytes | Duplex mismatch or cable too long |

Check on Hirschmann HiOS at Basic Settings → Port Statistics. Check on Linux with ethtool -S eth0 | grep -E "error|drop|crc".
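On Linux, the same grep can be done in Python when the counters need to feed another tool. The sketch below parses text in the `name: value` layout that `ethtool -S` prints; the counter names in the sample are hypothetical because they vary by NIC driver.

```python
import re


def parse_ethtool_stats(text):
    """Parse 'name: value' lines from ethtool -S output into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def nonzero_error_counters(stats):
    """Keep counters whose name mentions error, drop, or crc and that are above zero."""
    pattern = re.compile(r"error|drop|crc", re.IGNORECASE)
    return {k: v for k, v in stats.items() if v and pattern.search(k)}


# Hypothetical output; real counter names depend on the driver.
sample = """NIC statistics:
     rx_packets: 1200
     rx_crc_errors: 57
     tx_dropped: 0
     rx_errors: 57
"""
print(nonzero_error_counters(parse_ethtool_stats(sample)))
# {'rx_crc_errors': 57, 'rx_errors': 57}
```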

Document every solved problem

Step 7 is the most skipped and most valuable. A documented solution prevents the same problem from taking hours next time.

Split pairs pass continuity, fail performance

A continuity tester does not detect split pairs. Use a cable certifier that measures NEXT. Split pairs cause CRC errors under load.

tshark detects OT protocol violations

Use tshark filters for PROFINET cycle time violations, MRP oscillation, and Modbus timeout patterns. These are the three most common OT network problems.
