
16.1 Troubleshooting Methodology

The previous chapter covered network operations: documentation, change management, and monitoring. When monitoring detects an issue, structured troubleshooting finds the root cause. This chapter covers the methodology, the tools, and a Python toolkit for automating common diagnostics.

Without a structured approach, engineers chase symptoms: they reboot switches, swap cables, and change configurations at random. A methodology reduces wasted effort and locates the root cause, not the symptom.

Scenario: An HMI in production cell 3 loses connectivity to the SCADA server every 15 minutes for approximately 30 seconds.

Step 1: Identify the issue. Interview the operator: “The screen goes gray every 15 minutes, then comes back.” Check the NMS: no link-down events on the HMI port. Check recent changes: a new VFD was installed in cell 3 yesterday.

Step 2: Establish a theory. The new VFD generates electromagnetic interference (EMI) that causes CRC errors on the nearby switch port. Frames with CRC errors are dropped, and the resulting retransmissions exceed the SCADA timeout.

Step 3: Test the theory. Check CRC counters on the HMI port: Basic Settings → Port Statistics. The counter increments by 50 to 100 every 15 minutes. Check the VFD duty cycle: the VFD ramps to full speed every 15 minutes. Theory confirmed: VFD switching noise causes the CRC errors.

Step 4: Plan of action. Reroute the Ethernet cable away from the VFD power cable. Use shielded cable (S/FTP Cat 6A). Maintain 200 mm separation between data and power cables. Rollback: restore original cable routing if the new routing causes other issues.

Step 5: Implement. During the next maintenance window, reroute the cable and replace it with shielded cable.

Step 6: Verify. Monitor CRC counters for 24 hours. Result: zero new CRC errors, and HMI connectivity is stable. The operator confirms no more gray screens.

Step 7: Document. Record in the change management system: “Cell 3 HMI intermittent connectivity caused by EMI from new VFD. Resolved by rerouting Ethernet cable with 200 mm separation and upgrading to S/FTP Cat 6A.”
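The correlation test in Step 3 can be sketched in Python. This is a minimal illustration, assuming the CRC counter has already been sampled at regular intervals (for example by an SNMP poller); the function names and the sample data are hypothetical.

```python
from statistics import median

def find_error_bursts(samples, min_delta=1):
    """Return the timestamps at which the CRC counter increased.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    bursts = []
    for (_, prev), (t_cur, cur) in zip(samples, samples[1:]):
        if cur - prev >= min_delta:
            bursts.append(t_cur)
    return bursts

def burst_period(bursts):
    """Median spacing in seconds between consecutive bursts, or None."""
    gaps = [b - a for a, b in zip(bursts, bursts[1:])]
    return median(gaps) if gaps else None

# Hypothetical data: one reading every 5 minutes, counter jumps every 15 minutes
samples = [(0, 0), (300, 0), (600, 0), (900, 70), (1200, 70),
           (1500, 70), (1800, 150), (2100, 150), (2400, 150), (2700, 210)]
bursts = find_error_bursts(samples)
print("Bursts at:", bursts, "period (s):", burst_period(bursts))
```

A burst period that matches the VFD ramp interval (900 seconds in the scenario) supports the EMI theory. A period that does not match points to a different cause.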

The bottom-up approach works for most connectivity issues. Start at Layer 1 (physical) and work up, confirming each layer before moving to the next.

Physical-layer issues cause the most frustrating intermittent conditions. These conditions pass basic connectivity tests but appear under load or at specific times.

Both wiring standards work. Problems arise when 1 end of a cable is terminated as T568A and the other as T568B. This combination creates a crossover cable, which prevents the link from coming up on devices that do not support Auto-MDI/X. Modern switches support Auto-MDI/X, but older PLCs and industrial devices often do not.

| Pin | T568A | T568B |
| --- | --- | --- |
| 1 | White/Green | White/Orange |
| 2 | Green | Orange |
| 3 | White/Orange | White/Green |
| 6 | Orange | Green |

Use T568B consistently throughout the plant. Label both ends of every cable.
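The mismatch described above can be checked programmatically once both ends have been read off. The following is a minimal sketch; the full 8-pin color lists extend the 4 pins shown in the table with the remaining TIA-568 positions, and `classify_cable` is a hypothetical helper name.

```python
# Pin 1 to pin 8 conductor colors for each termination standard
T568A = ["white/green", "green", "white/orange", "blue",
         "white/blue", "orange", "white/brown", "brown"]
T568B = ["white/orange", "orange", "white/green", "blue",
         "white/blue", "green", "white/brown", "brown"]

def classify_cable(end_a, end_b):
    """Classify a cable from the color order observed at each end."""
    standards = (T568A, T568B)
    if end_a == end_b and end_a in standards:
        return "straight-through"
    if end_a in standards and end_b in standards:
        # One end is T568A and the other is T568B
        return "crossover"
    return "miswired"

print(classify_cable(T568B, T568B))  # straight-through
print(classify_cable(T568A, T568B))  # crossover
```

Note that this check only compares color order. It does not detect split pairs, because a split pair can have correct pin-to-pin continuity.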

A split pair occurs when the 2 conductors of a signal pair are terminated on wires from 2 different twisted pairs in the cable. The cable passes a continuity test (every pin connects to the correct pin) but fails performance tests because the signal pair loses the noise-canceling benefit of twisting. Split pairs cause CRC errors under load yet still pass basic ping tests.

Detect split pairs with a cable tester that measures NEXT (Near-End Crosstalk). A continuity tester alone does not detect split pairs.

A TDR (Time Domain Reflectometer) sends a pulse down a copper cable and measures the reflection. The TDR reports the distance to a fault (open, short, or impedance mismatch) in meters. Use a TDR to locate a cable break without pulling the entire cable.
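The distance a TDR reports follows from the round-trip time of the reflected pulse: distance = (NVP × c × t) / 2, where NVP is the cable's nominal velocity of propagation as a fraction of the speed of light. The sketch below illustrates the arithmetic; the NVP of 0.68 and the 400 ns round-trip time are assumed example values, and the real NVP comes from the cable datasheet.

```python
SPEED_OF_LIGHT_M_S = 299_792_458

def tdr_distance_m(round_trip_s, nvp):
    """Distance to the fault in meters.

    round_trip_s: time between sending the pulse and seeing the reflection.
    nvp: nominal velocity of propagation (0 < nvp <= 1), from the datasheet.
    """
    if not 0 < nvp <= 1:
        raise ValueError("NVP must be a fraction between 0 and 1")
    # Divide by 2 because the pulse travels to the fault and back
    return nvp * SPEED_OF_LIGHT_M_S * round_trip_s / 2

# Assumed example: reflection seen after 400 ns on a cable with NVP 0.68
distance = tdr_distance_m(400e-9, 0.68)
print(f"Fault at about {distance:.1f} m")
```

An incorrect NVP setting skews the reported distance proportionally, so it is worth confirming the value against a cable of known length.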

An OTDR (Optical Time Domain Reflectometer) does the same for fiber-optic cables. The OTDR measures attenuation along the fiber and identifies splice points, connectors, and breaks with their exact distance from the test point.

| Tool | Cable Type | Measures | Use Case |
| --- | --- | --- | --- |
| Continuity tester | Copper | Pin-to-pin connectivity | Verify wiring |
| Cable certifier | Copper | NEXT, return loss, length | Certify Cat 6A compliance |
| TDR | Copper | Distance to fault | Locate cable breaks |
| OTDR | Fiber | Attenuation, splice loss, break location | Locate fiber faults |
| Light meter | Fiber | Optical power (dBm) | Verify SFP and fiber link budget |

Wireless issues are harder to diagnose because the medium is invisible and shared.

| Metric | Good | Marginal | Poor | Meaning |
| --- | --- | --- | --- | --- |
| RSSI (signal strength) | > -65 dBm | -65 to -75 dBm | < -75 dBm | How strong the signal is at the client |
| SNR (signal-to-noise ratio) | > 25 dB | 15 to 25 dB | < 15 dB | Signal strength relative to background noise |
| Channel utilization | < 50% | 50 to 80% | > 80% | How busy the channel is |
| Retry rate | < 10% | 10 to 20% | > 20% | Percentage of frames that required retransmission |

| Symptom | Likely Cause | Diagnostic |
| --- | --- | --- |
| Low throughput, good signal | Co-channel interference | Check channel utilization, survey for overlapping APs |
| Intermittent disconnects | Roaming between APs with different configs | Verify that APs share the same SSID, security, and VLAN |
| Good signal, high retry rate | Hidden node condition | Enable RTS/CTS, reposition APs |
| No connection in specific area | Dead zone (signal blocked by metal) | Perform a site survey with a Wi-Fi analyzer |
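The thresholds in the metrics table can be applied mechanically to readings from a client or AP. The following is a minimal sketch using those thresholds; the function names and the sample readings are hypothetical, and SNR is derived as RSSI minus the noise floor.

```python
def classify(value, good, poor, higher_is_better):
    """Map a reading to 'good', 'marginal', or 'poor'."""
    if higher_is_better:
        if value > good:
            return "good"
        return "poor" if value < poor else "marginal"
    if value < good:
        return "good"
    return "poor" if value > poor else "marginal"

def assess_link(rssi_dbm, noise_floor_dbm, channel_util_pct, retry_pct):
    """Rate each wireless metric against the thresholds in the table."""
    snr_db = rssi_dbm - noise_floor_dbm
    return {
        "rssi": classify(rssi_dbm, -65, -75, higher_is_better=True),
        "snr": classify(snr_db, 25, 15, higher_is_better=True),
        "channel_utilization": classify(channel_util_pct, 50, 80, higher_is_better=False),
        "retry_rate": classify(retry_pct, 10, 20, higher_is_better=False),
    }

# Hypothetical reading: strong signal but many retries
report = assess_link(rssi_dbm=-60, noise_floor_dbm=-92,
                     channel_util_pct=35, retry_pct=25)
print(report)
```

A result with a good signal and a poor retry rate matches the hidden node row of the symptom table.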

Industrial environments have unique interference sources: VFDs (variable frequency drives) generate broadband noise, welding equipment creates impulse noise, and metal structures (cabinets, conveyors, ductwork) block and reflect signals. Perform a site survey with a Wi-Fi analyzer before deploying APs in a plant.

PROFINET RT requires frames to arrive within the configured cycle time (typically 1 to 4 ms). A violation causes the IO controller to declare the IO device as inoperable, which stops the associated process.

Detect cycle time violations with tshark:

# Capture PROFINET RT frames and show timing
tshark -i eth0 -f "ether proto 0x8892" -T fields \
-e frame.time_delta_displayed \
-e pn_rt.frame_id \
-e pn_rt.cycle_counter \
| awk '$1 > 0.004 {print "VIOLATION: " $0}'

A frame.time_delta_displayed greater than the cycle time (0.004 seconds for a 4 ms cycle) indicates a violation. Common causes: switch queue congestion (too much non-PROFINET traffic on the same VLAN), STP topology change (temporary forwarding interruption), or a slow link in the path.
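When several IO devices share the capture interface, consecutive frames can come from different senders, so a per-stream comparison is more informative than a global delta. The sketch below groups frames by frame ID and flags gaps above the cycle time. It assumes tshark output with two fields per line (`frame.time_epoch` and `pn_rt.frame_id`); the function name, the jitter tolerance, and the sample lines are hypothetical.

```python
def find_cycle_violations(lines, cycle_time_s=0.004, tolerance_s=0.0005):
    """Flag gaps between frames with the same frame ID that exceed the cycle time.

    lines: iterable of "<epoch_seconds> <frame_id>" strings.
    tolerance_s: assumed allowance for normal jitter before flagging.
    """
    last_seen = {}
    violations = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or incomplete lines
        timestamp, frame_id = float(parts[0]), parts[1]
        previous = last_seen.get(frame_id)
        if previous is not None and timestamp - previous > cycle_time_s + tolerance_s:
            violations.append((frame_id, timestamp, timestamp - previous))
        last_seen[frame_id] = timestamp
    return violations

# Hypothetical capture: stream 0x8001 misses a cycle, stream 0x8002 is on time
capture = ["10.000\t0x8001", "10.002\t0x8002", "10.004\t0x8001",
           "10.006\t0x8002", "10.010\t0x8002", "10.013\t0x8001"]
for frame_id, ts, gap in find_cycle_violations(capture):
    print(f"VIOLATION: frame ID {frame_id} at {ts:.3f}, gap {gap * 1000:.1f} ms")
```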

MRP ring oscillation occurs when 2 devices claim the MRM (Media Redundancy Manager) role in the same ring domain. Both MRMs open and close the ring simultaneously, causing continuous topology changes and MAC table flushes.

Detect oscillation by monitoring MRP topology change events:

# Count MRP topology changes per second
tshark -i eth0 -f "ether proto 0x88e3" -T fields \
-e frame.time_epoch -e mrp.type \
| grep "TopologyChange" | cut -d. -f1 | uniq -c

More than 1 topology change per minute (outside of a real cable disconnection) indicates oscillation. Verify that exactly 1 MRM exists per ring domain.
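The per-minute threshold can be applied to captured timestamps with a few lines of Python. This is a minimal sketch, assuming the epoch timestamps of topology change frames have already been extracted from the capture; the function names and sample data are hypothetical.

```python
from collections import Counter

def topology_changes_per_minute(epoch_times):
    """Count topology change events in each one-minute bucket."""
    return Counter(int(t // 60) for t in epoch_times)

def oscillating_minutes(epoch_times, max_per_minute=1):
    """Return the start time (epoch seconds) of every minute over the limit."""
    counts = topology_changes_per_minute(epoch_times)
    return sorted(minute * 60 for minute, n in counts.items()
                  if n > max_per_minute)

# Hypothetical timestamps: three changes within one minute, then isolated events
events = [120.5, 121.0, 121.4, 185.0, 300.2]
print(oscillating_minutes(events))
```

Minutes that coincide with a known cable disconnection should be excluded before treating the result as evidence of oscillation.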

Modbus TCP uses a request-response pattern. The SCADA server sends a request and waits for a response. If the response does not arrive within the timeout (typically 1 to 5 seconds), then the SCADA server marks the device as unreachable.

# Show Modbus TCP response times
tshark -i eth0 -f "tcp port 502" -T fields \
-e frame.time_delta_displayed \
-e ip.src -e ip.dst \
-e modbus.func_code \
| awk '$1 > 1.0 {print "SLOW: " $0}'

Response times greater than 1 second indicate network congestion, PLC CPU overload, or a routing issue. Consistent timeouts from a single PLC point to a PLC issue. Timeouts from all PLCs point to a network issue.
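The single-PLC versus all-PLCs rule can be expressed as a small classifier. The sketch below assumes response times have already been measured per PLC; the function name, the labels, and the sample data are hypothetical.

```python
def classify_timeouts(response_times, threshold_s=1.0):
    """Decide whether slow Modbus responses point to one PLC or the network.

    response_times: dict mapping a PLC name to a list of response times (seconds).
    Returns (verdict, set_of_slow_plcs).
    """
    slow = {plc for plc, times in response_times.items()
            if any(t > threshold_s for t in times)}
    if not slow:
        return "healthy", slow
    if len(response_times) == 1:
        return "inconclusive", slow  # nothing to compare against
    if len(slow) == 1:
        return "plc", slow
    if slow == set(response_times):
        return "network", slow
    return "mixed", slow

# Hypothetical measurements: only plc2 exceeds the 1 second threshold
measured = {"plc1": [0.02, 0.03], "plc2": [0.02, 1.8], "plc3": [0.04]}
print(classify_timeouts(measured))
```

A "mixed" verdict (some but not all PLCs slow) is worth mapping against the topology, since the affected PLCs may share a switch or uplink.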

The following script extends the basic diagnostic toolkit with DNS resolution checks and SNMP reachability. The SNMP check requires the third-party pysnmp package.

import socket
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class DiagResult:
    """Collects the outcome of every check run against one host."""
    host: str
    checks: list[dict] = field(default_factory=list)

    def add(self, layer: str, check: str, passed: bool, detail: str = ""):
        self.checks.append({"layer": layer, "check": check,
                            "passed": passed, "detail": detail})

    def report(self):
        print(f"\n{'='*60}")
        print(f"Diagnostic Report: {self.host}")
        print(f"{'='*60}")
        for c in self.checks:
            icon = "PASS" if c["passed"] else "FAIL"
            print(f"[{icon}] [{c['layer']}] {c['check']}")
            if c["detail"]:
                print(f"       {c['detail']}")
        failed = [c for c in self.checks if not c["passed"]]
        summary = f"{len(failed)} check(s) failed" if failed else "All checks passed"
        print(f"\n{summary}")


def check_ping(host: str) -> tuple[bool, str]:
    # Linux ping syntax: 4 echo requests, 1 second wait per reply
    result = subprocess.run(
        ["ping", "-c", "4", "-W", "1", host],
        capture_output=True, text=True
    )
    if result.returncode == 0:
        for line in result.stdout.splitlines():
            if "rtt" in line or "min/avg/max" in line:
                return True, line.strip()
        return True, "Host responded"
    return False, "No response"


def check_port(host: str, port: int) -> tuple[bool, str]:
    try:
        with socket.create_connection((host, port), timeout=2.0):
            return True, f"Port {port}/tcp open"
    except OSError as e:  # covers refused connections and timeouts
        return False, f"Port {port}/tcp: {e}"


def check_dns(hostname: str) -> tuple[bool, str]:
    try:
        addr = socket.gethostbyname(hostname)
        return True, f"Resolves to {addr}"
    except socket.gaierror as e:
        return False, f"DNS resolution failed: {e}"


def check_snmp(host: str, community: str = "public") -> tuple[bool, str]:
    try:
        # Imported here so the other checks still run without pysnmp installed
        from pysnmp.hlapi import (
            getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
            ContextData, ObjectType, ObjectIdentity
        )
        # mpModel=1 selects SNMPv2c; the OID is sysName.0
        it = getCmd(SnmpEngine(), CommunityData(community, mpModel=1),
                    UdpTransportTarget((host, 161), timeout=2, retries=1),
                    ContextData(),
                    ObjectType(ObjectIdentity("1.3.6.1.2.1.1.5.0")))
        err_ind, err_stat, _, var_binds = next(it)
        if err_ind:
            return False, f"SNMP error: {err_ind}"
        if err_stat:
            return False, f"SNMP error: {err_stat.prettyPrint()}"
        return True, f"sysName={var_binds[0][1]}"
    except Exception as e:
        return False, f"SNMP failed: {e}"


def run_diagnostics(host: str, ports: list[int],
                    hostname: str = "") -> DiagResult:
    result = DiagResult(host=host)
    # Layer 3: ping
    passed, detail = check_ping(host)
    result.add("L3", f"Ping {host}", passed, detail)
    # Layer 4: TCP ports
    for port in ports:
        passed, detail = check_port(host, port)
        result.add("L4", f"TCP port {port}", passed, detail)
    # Layer 7: DNS (only when a hostname is supplied)
    if hostname:
        passed, detail = check_dns(hostname)
        result.add("L7", f"DNS resolve {hostname}", passed, detail)
    # Layer 7: SNMP
    passed, detail = check_snmp(host)
    result.add("L7", f"SNMP query {host}", passed, detail)
    return result


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "192.168.1.1"
    hostname = sys.argv[2] if len(sys.argv) > 2 else ""
    diag = run_diagnostics(target, ports=[22, 80, 443, 502, 44818],
                           hostname=hostname)
    diag.report()

Running this script against a Hirschmann switch at 192.168.50.1 with hostname sw-cell1.plant.local:

python diag.py 192.168.50.1 sw-cell1.plant.local

An unsuccessful ping points to Layer 1 to 3. A successful ping with an unsuccessful port check points to Layer 4 (firewall or service not running). An unsuccessful DNS check points to a name resolution issue. An unsuccessful SNMP check with a successful ping points to SNMP configuration (wrong community string, SNMPv2c disabled).
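These interpretation rules can be encoded so that a report ends with a suggested starting point. The following is a minimal sketch; `interpret` is a hypothetical helper that is not part of the script above, and the wording of each finding is illustrative.

```python
def interpret(ping_ok, ports, dns_ok, snmp_ok):
    """Translate check outcomes into a list of likely fault areas.

    ports: dict mapping a TCP port number to True (open) or False.
    dns_ok: True, False, or None when no hostname was checked.
    """
    findings = []
    if not ping_ok:
        findings.append("Layer 1 to 3: host unreachable")
    else:
        closed = sorted(p for p, ok in ports.items() if not ok)
        if closed:
            findings.append(f"Layer 4: ports {closed} blocked by a firewall "
                            "or service not running")
        if not snmp_ok:
            findings.append("SNMP configuration: wrong community string "
                            "or SNMPv2c disabled")
    if dns_ok is False:
        findings.append("Name resolution: hostname does not resolve")
    return findings

# Hypothetical outcome: host answers ping but Modbus TCP is closed
for finding in interpret(True, {22: True, 502: False}, None, True):
    print(finding)
```

The port and SNMP rules are only evaluated after a successful ping, because an unreachable host makes every higher-layer check fail for the same reason.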

| Scenario | Wireshark Filter |
| --- | --- |
| MRP frames | eth.type == 0x88e3 |
| PROFINET RT | eth.type == 0x8892 |
| LLDP | eth.type == 0x88cc |
| Modbus TCP | tcp.port == 502 |
| EtherNet/IP | tcp.port == 44818 or udp.port == 2222 |
| OPC UA | tcp.port == 4840 |
| CRC errors | eth.fcs.status == "Bad" (eth.fcs_bad == 1 on older Wireshark versions) |
| TCP retransmissions | tcp.analysis.retransmission |
| ARP conflicts | arp.duplicate-address-detected |
| Slow TCP responses | tcp.analysis.ack_rtt > 0.1 |

| Counter | Meaning | Likely Cause |
| --- | --- | --- |
| CRC errors | Frames with bad FCS | Bad cable, SFP, or EMI |
| Runts | Frames < 64 bytes | Collision (half-duplex) or NIC fault |
| Giants | Frames > 1518 bytes | MTU misconfiguration |
| Output drops | Frames dropped (queue full) | Congestion |
| Late collisions | Collisions after 64 bytes | Duplex mismatch or cable too long |

Check on Hirschmann HiOS at Basic Settings → Port Statistics. Check on Linux with ethtool -S eth0 | grep -E "error|drop|crc".
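On Linux, the same filter can be applied in Python so that only non-zero counters are reported. This is a minimal sketch that parses the text printed by `ethtool -S`; counter names vary by NIC driver, so the sample output and the function name are hypothetical.

```python
import re

def nonzero_error_counters(ethtool_output):
    """Return error, drop, and CRC counters with a value above zero."""
    counters = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if not sep or not value.isdigit():
            continue  # header line or non-numeric statistic
        if re.search(r"error|drop|crc", name) and int(value) > 0:
            counters[name] = int(value)
    return counters

# Hypothetical output; in practice use
# subprocess.run(["ethtool", "-S", "eth0"], capture_output=True, text=True).stdout
sample = """NIC statistics:
     rx_packets: 1000
     rx_crc_errors: 57
     tx_dropped: 0
     rx_missed_errors: 3
"""
print(nonzero_error_counters(sample))
```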

Document every solved issue

Step 7 is the most skipped and most valuable. A documented solution keeps the same issue from taking hours next time.

Split pairs pass continuity, not performance

A continuity tester does not detect split pairs. Use a cable certifier that measures NEXT. Split pairs cause CRC errors under load.

tshark detects OT protocol violations

Use tshark filters for PROFINET cycle time violations, MRP oscillation, and Modbus timeout patterns. These are the 3 most common OT network issues.
