The previous chapter covered network operations: documentation, change management, and monitoring. When monitoring detects a problem, structured troubleshooting finds the root cause. This chapter provides the methodology, the tools, and a Python toolkit for automating common diagnostics.
Without a structured approach, engineers chase symptoms. They reboot switches, swap cables, and change configurations randomly. A methodology prevents wasted effort and ensures the root cause is found, not just the symptom.
Scenario: An HMI in production cell 3 loses connectivity to the SCADA server every 15 minutes for approximately 30 seconds.
Step 1: Identify the problem. Interview the operator: “The screen goes gray every 15 minutes, then comes back.” Check the NMS: no link-down events on the HMI port. Check recent changes: a new VFD was installed in cell 3 yesterday.
Step 2: Establish a theory. The new VFD generates electromagnetic interference (EMI) that causes CRC errors on the nearby switch port. CRC errors trigger retransmissions that exceed the SCADA timeout.
Step 3: Test the theory. Check CRC error counters on the HMI port: Basic Settings → Port Statistics. The counter increments by 50 to 100 every 15 minutes. Check the VFD duty cycle: it ramps to full speed every 15 minutes. Theory confirmed: VFD switching noise causes CRC errors.
Step 4: Plan of action. Reroute the Ethernet cable away from the VFD power cable. Use shielded cable (S/FTP Cat 6A). Maintain 200 mm separation between data and power cables. Rollback: restore original cable routing if the new routing causes other issues.
Step 5: Implement. During the next maintenance window, reroute the cable and replace it with shielded cable.
Step 6: Verify. Monitor CRC counters for 24 hours. Zero new CRC errors. HMI connectivity is stable. Operator confirms no more gray screens.
Step 7: Document. Record in the change management system: “Cell 3 HMI intermittent connectivity caused by EMI from new VFD. Resolved by rerouting Ethernet cable with 200 mm separation and upgrading to S/FTP Cat 6A.”
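That record can be kept in a structured form so it is searchable the next time the symptom appears. A minimal sketch as a Python dataclass; the field names are illustrative and not tied to any particular change management system:

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ProblemRecord:
    """One entry in the solved-problem log (illustrative schema)."""
    location: str      # e.g. "Cell 3"
    symptom: str       # what the operator observed
    root_cause: str    # the confirmed cause, not the first theory
    resolution: str    # what actually fixed it
    date_resolved: date = field(default_factory=date.today)

    def summary(self) -> str:
        return (f"{self.location}: {self.symptom} -> caused by "
                f"{self.root_cause}; fixed by {self.resolution}")


record = ProblemRecord(
    location="Cell 3",
    symptom="HMI intermittent connectivity",
    root_cause="EMI from new VFD",
    resolution="Rerouted Ethernet cable (200 mm separation), S/FTP Cat 6A",
)
print(record.summary())
```

Even a flat log of such entries answers the key question during the next incident: has this happened before, and what fixed it?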
The bottom-up approach works for most connectivity problems: start at Layer 1 (cabling, link state) and work up through addressing, transport, and application checks until a layer fails.
Physical-layer problems cause the most frustrating intermittent failures. They pass basic connectivity tests but fail under load or at specific times.
Both wiring standards work. The problem occurs when one end of a cable uses T568A and the other uses T568B. This creates a crossover cable, which causes link failures on devices that do not support Auto-MDI/X. Modern switches support Auto-MDI/X, but older PLCs and industrial devices may not.
| Pin | T568A | T568B |
|---|---|---|
| 1 | White/Green | White/Orange |
| 2 | Green | Orange |
| 3 | White/Orange | White/Green |
| 6 | Orange | Green |
Pins 4, 5, 7, and 8 carry the blue and brown pairs in both standards; only the green and orange pairs swap. Use T568B consistently throughout the plant, and label both ends of every cable.
A split pair occurs when a wire pair is split across two different twisted pairs in the cable. The cable passes a continuity test (all pins connect) but fails performance tests because the split pair loses the noise-canceling benefit of twisting. Split pairs cause CRC errors under load and pass basic ping tests.
Detect split pairs with a cable tester that measures NEXT (Near-End Crosstalk). A continuity tester alone does not detect split pairs.
A TDR (Time Domain Reflectometer) sends a pulse down a copper cable and measures the reflection. It reports the distance to a fault (open, short, or impedance mismatch) in meters. Use a TDR to locate a cable break without pulling the entire cable.
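The distance calculation behind a TDR is simple: the pulse travels at a fraction of the speed of light (the cable's nominal velocity of propagation, NVP), and the measured reflection time covers the round trip. A sketch of the formula, assuming a typical NVP of 0.66 for copper twisted pair (check the cable datasheet for the real value):

```python
C = 299_792_458  # speed of light in vacuum, m/s

def tdr_distance_to_fault(reflection_time_ns: float, nvp: float = 0.66) -> float:
    """Distance to a cable fault in meters.

    reflection_time_ns: round-trip time of the reflected pulse, in ns.
    nvp: nominal velocity of propagation as a fraction of c;
         ~0.66 is typical for Cat 5e/6 copper (assumption, not universal).
    """
    seconds = reflection_time_ns * 1e-9
    # Divide by 2 because the pulse travels to the fault and back
    return (nvp * C * seconds) / 2

# A reflection arriving 500 ns after the pulse: fault roughly 49.5 m out
print(f"{tdr_distance_to_fault(500):.1f} m")
```

This is why the TDR asks for the cable type before measuring: an NVP that is off by a few percent shifts the reported fault location by meters.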
An OTDR (Optical Time Domain Reflectometer) does the same for fiber-optic cables. It measures attenuation along the fiber and identifies splice points, connectors, and breaks with their exact distance from the test point.
| Tool | Cable Type | Measures | Use Case |
|---|---|---|---|
| Continuity tester | Copper | Pin-to-pin connectivity | Verify wiring |
| Cable certifier | Copper | NEXT, return loss, length | Certify Cat 6A compliance |
| TDR | Copper | Distance to fault | Locate cable breaks |
| OTDR | Fiber | Attenuation, splice loss, break location | Locate fiber faults |
| Light meter | Fiber | Optical power (dBm) | Verify SFP and fiber link budget |
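The light meter's numbers feed directly into a link-budget check: transmit power minus all losses must exceed receiver sensitivity by a safety margin. A rough sketch with typical single-mode 1310 nm figures as defaults; the real values come from the SFP and fiber datasheets:

```python
def link_budget_ok(tx_power_dbm: float, rx_sensitivity_dbm: float,
                   fiber_km: float, splices: int, connectors: int,
                   loss_per_km: float = 0.35,   # typical SM fiber @ 1310 nm
                   splice_loss: float = 0.1,    # per fusion splice
                   connector_loss: float = 0.5, # per mated connector pair
                   margin: float = 3.0) -> tuple[bool, float]:
    """Return (link OK?, expected Rx power in dBm).

    Default loss figures are typical assumptions, not guarantees --
    substitute datasheet values for a real design.
    """
    total_loss = (fiber_km * loss_per_km
                  + splices * splice_loss
                  + connectors * connector_loss)
    received = tx_power_dbm - total_loss
    return received - margin >= rx_sensitivity_dbm, received


ok, rx = link_budget_ok(tx_power_dbm=-3.0, rx_sensitivity_dbm=-20.0,
                        fiber_km=10, splices=4, connectors=2)
print(f"Expected Rx power: {rx:.1f} dBm, link {'OK' if ok else 'marginal'}")
```

If the light meter reads significantly below the expected Rx power, a splice or connector is absorbing more than its budgeted loss; the OTDR then locates which one.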
Wireless problems are harder to diagnose because the medium is invisible and shared.
| Metric | Good | Marginal | Poor | Meaning |
|---|---|---|---|---|
| RSSI (signal strength) | > -65 dBm | -65 to -75 dBm | < -75 dBm | How strong the signal is at the client |
| SNR (signal-to-noise ratio) | > 25 dB | 15 to 25 dB | < 15 dB | Signal strength relative to background noise |
| Channel utilization | < 50% | 50 to 80% | > 80% | How busy the channel is |
| Retry rate | < 10% | 10 to 20% | > 20% | Percentage of frames that required retransmission |
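The thresholds in the table translate directly into a classification helper. A sketch that rates RSSI and SNR using those same cutoffs, and shows the relationship SNR = RSSI minus noise floor:

```python
def classify_rssi(rssi_dbm: float) -> str:
    """Rate client signal strength per the thresholds in the table above."""
    if rssi_dbm > -65:
        return "good"
    if rssi_dbm >= -75:
        return "marginal"
    return "poor"


def classify_snr(snr_db: float) -> str:
    """Rate signal-to-noise ratio per the thresholds in the table above."""
    if snr_db > 25:
        return "good"
    if snr_db >= 15:
        return "marginal"
    return "poor"


# SNR is derived from RSSI and the measured noise floor
rssi, noise = -68.0, -92.0
snr = rssi - noise  # 24 dB
print(classify_rssi(rssi), classify_snr(snr))  # marginal marginal
```

The derived SNR explains a common surprise: a client with acceptable RSSI can still perform poorly near a VFD or welder, because the noise floor rises and the SNR drops into the marginal band.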
| Symptom | Likely Cause | Diagnostic |
|---|---|---|
| Low throughput, good signal | Co-channel interference | Check channel utilization, survey for overlapping APs |
| Intermittent disconnects | Roaming between APs with different configs | Verify all APs share the same SSID, security, and VLAN |
| Good signal, high retry rate | Hidden node problem | Enable RTS/CTS, reposition APs |
| No connection in specific area | Dead zone (signal blocked by metal) | Perform a site survey with a Wi-Fi analyzer |
Industrial environments have unique interference sources: VFDs (variable frequency drives) generate broadband noise, welding equipment creates impulse noise, and metal structures (cabinets, conveyors, ductwork) block and reflect signals. Perform a site survey with a Wi-Fi analyzer before deploying APs in a plant.
PROFINET RT requires frames to arrive within the configured cycle time (typically 1 to 4 ms). A violation causes the IO controller to declare the IO device as failed, which stops the associated process.
Detect cycle time violations with tshark:
```shell
# Capture PROFINET RT frames and show timing
tshark -i eth0 -f "ether proto 0x8892" -T fields \
  -e frame.time_delta_displayed \
  -e pn_rt.frame_id \
  -e pn_rt.cycle_counter \
  | awk '$1 > 0.004 {print "VIOLATION: " $0}'
```

A `frame.time_delta_displayed` greater than the cycle time (0.004 seconds for a 4 ms cycle) indicates a violation. Common causes: switch queue congestion (too much non-PROFINET traffic on the same VLAN), an STP topology change (temporary forwarding interruption), or a slow link in the path.
MRP ring oscillation occurs when two devices claim the MRM (Media Redundancy Manager) role in the same ring domain. Both MRMs open and close the ring simultaneously, causing continuous topology changes and MAC table flushes.
Detect oscillation by monitoring MRP topology change events:
```shell
# Count MRP topology changes per second
tshark -i eth0 -f "ether proto 0x88e3" -T fields \
  -e frame.time_epoch -e mrp.type \
  | grep "TopologyChange" | cut -d. -f1 | uniq -c
```

More than one topology change per minute (outside of a real cable failure) indicates oscillation. Verify that exactly one MRM exists per ring domain.
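The same per-minute counting can be done offline on exported timestamps. A sketch, assuming the epoch timestamps of topology-change frames have already been extracted (for example, from a tshark field export):

```python
from collections import Counter


def oscillation_minutes(change_times: list[float], threshold: int = 1) -> list[int]:
    """Return the minute buckets containing more than `threshold` MRP
    topology changes -- a crude oscillation indicator.

    change_times: epoch timestamps (seconds) of topology-change frames.
    """
    per_minute = Counter(int(t // 60) for t in change_times)
    return sorted(minute for minute, n in per_minute.items() if n > threshold)


# Three changes within the same minute suggest duelling MRMs,
# not a one-off cable failure
times = [1000.0, 1000.4, 1000.9, 5000.0]
print(oscillation_minutes(times))  # [16]
```

A real cable failure produces one burst of changes and then silence; a duelling-MRM oscillation keeps every minute bucket hot until the second MRM is demoted.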
Modbus TCP uses a request-response pattern. The SCADA server sends a request and waits for a response. If the response does not arrive within the timeout (typically 1 to 5 seconds), the SCADA server marks the device as unreachable.
```shell
# Show Modbus TCP response times
tshark -i eth0 -f "tcp port 502" -T fields \
  -e frame.time_delta_displayed \
  -e ip.src -e ip.dst \
  -e modbus.func_code \
  | awk '$1 > 1.0 {print "SLOW: " $0}'
```

Response times greater than 1 second indicate network congestion, PLC CPU overload, or a routing problem. Consistent timeouts from a single PLC point to a PLC issue. Timeouts from all PLCs point to a network issue.
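That localization rule, one PLC versus all PLCs, is mechanical enough to automate. A sketch, assuming per-PLC timeout counts have already been tallied (the dict shape and threshold are illustrative):

```python
def localize_timeouts(timeout_counts: dict[str, int], min_events: int = 5) -> str:
    """Apply the rule from the text: timeouts from one PLC point to that
    PLC; timeouts from all PLCs point to the network.

    timeout_counts: PLC IP -> number of slow or timed-out responses.
    min_events: below this count, noise is ignored (assumed threshold).
    """
    affected = [ip for ip, n in timeout_counts.items() if n >= min_events]
    if not affected:
        return "no significant timeouts"
    if len(affected) == len(timeout_counts):
        return "network issue (all PLCs affected)"
    if len(affected) == 1:
        return f"device issue: {affected[0]}"
    return f"partial: check the shared path to {affected}"


counts = {"192.168.10.11": 23, "192.168.10.12": 0, "192.168.10.13": 1}
print(localize_timeouts(counts))  # device issue: 192.168.10.11
```

The middle case, several but not all PLCs affected, usually means a shared element in their path: one switch, one uplink, or one congested VLAN.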
The following script extends the basic diagnostic toolkit with DNS resolution checks, SNMP reachability, and MRP ring status via SNMP.
```python
import socket
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class DiagResult:
    host: str
    checks: list[dict] = field(default_factory=list)

    def add(self, layer: str, check: str, passed: bool, detail: str = ""):
        self.checks.append(
            {"layer": layer, "check": check, "passed": passed, "detail": detail}
        )

    def report(self):
        print(f"\n{'=' * 60}")
        print(f"Diagnostic Report: {self.host}")
        print(f"{'=' * 60}")
        for c in self.checks:
            icon = "PASS" if c["passed"] else "FAIL"
            print(f"[{icon}] [{c['layer']}] {c['check']}")
            if c["detail"]:
                print(f"       {c['detail']}")
        failed = [c for c in self.checks if not c["passed"]]
        print(f"\n{'All checks passed' if not failed else f'{len(failed)} check(s) failed'}")


def check_ping(host: str) -> tuple[bool, str]:
    result = subprocess.run(
        ["ping", "-c", "4", "-W", "1", host],
        capture_output=True, text=True,
    )
    if result.returncode == 0:
        for line in result.stdout.splitlines():
            if "rtt" in line or "min/avg/max" in line:
                return True, line.strip()
        return True, "Reachable"  # reachable even if no rtt summary line
    return False, "No response"


def check_port(host: str, port: int) -> tuple[bool, str]:
    try:
        with socket.create_connection((host, port), timeout=2.0):
            return True, f"Port {port}/tcp open"
    except OSError as e:  # covers ConnectionRefusedError and timeouts
        return False, f"Port {port}/tcp: {e}"


def check_dns(hostname: str) -> tuple[bool, str]:
    try:
        addr = socket.gethostbyname(hostname)
        return True, f"Resolves to {addr}"
    except socket.gaierror as e:
        return False, f"DNS resolution failed: {e}"


def check_snmp(host: str, community: str = "public") -> tuple[bool, str]:
    try:
        from pysnmp.hlapi import (
            getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
            ContextData, ObjectType, ObjectIdentity,
        )
        it = getCmd(
            SnmpEngine(),
            CommunityData(community, mpModel=1),  # SNMPv2c
            UdpTransportTarget((host, 161), timeout=2, retries=1),
            ContextData(),
            ObjectType(ObjectIdentity("1.3.6.1.2.1.1.5.0")),  # sysName
        )
        err_ind, err_stat, _, var_binds = next(it)
        if err_ind:
            return False, f"SNMP error: {err_ind}"
        return True, f"sysName={var_binds[0][1]}"
    except Exception as e:
        return False, f"SNMP failed: {e}"


def run_diagnostics(host: str, ports: list[int], hostname: str = "") -> DiagResult:
    result = DiagResult(host=host)

    # Layer 3: ping
    passed, detail = check_ping(host)
    result.add("L3", f"Ping {host}", passed, detail)

    # Layer 4: TCP ports
    for port in ports:
        passed, detail = check_port(host, port)
        result.add("L4", f"TCP port {port}", passed, detail)

    # Layer 7: DNS
    if hostname:
        passed, detail = check_dns(hostname)
        result.add("L7", f"DNS resolve {hostname}", passed, detail)

    # Layer 7: SNMP
    passed, detail = check_snmp(host)
    result.add("L7", f"SNMP query {host}", passed, detail)

    return result


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "192.168.1.1"
    hostname = sys.argv[2] if len(sys.argv) > 2 else ""
    diag = run_diagnostics(target, ports=[22, 80, 443, 502, 44818], hostname=hostname)
    diag.report()
```

Running this against a Hirschmann switch at 192.168.50.1 with hostname sw-cell1.plant.local:

```shell
python diag.py 192.168.50.1 sw-cell1.plant.local
```

A failed ping points to Layers 1 to 3. A successful ping with a failed port check points to Layer 4 (a firewall, or a service that is not running). A failed DNS check points to a name resolution problem. A failed SNMP check alongside a successful ping points to SNMP configuration (wrong community string, or SNMPv2c disabled).
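These interpretation rules are themselves automatable. A sketch of a triage helper that consumes a DiagResult-style check list (the dict shape matches what the toolkit's `add` method stores):

```python
def triage(checks: list[dict]) -> str:
    """Map failed checks to a probable cause, using the interpretation
    rules from the text. Each check is a dict with keys
    "layer", "check", and "passed".
    """
    def failed(layer: str) -> bool:
        return any(not c["passed"] for c in checks if c["layer"] == layer)

    if failed("L3"):
        return "Ping failed: suspect Layers 1-3 (cabling, link, addressing, routing)"
    if failed("L4"):
        return "Ping OK but TCP port closed: firewall or service not running"
    if failed("L7"):
        return "L7 failure: check DNS or SNMP configuration (community string, version)"
    return "All checks passed"


checks = [
    {"layer": "L3", "check": "Ping", "passed": True},
    {"layer": "L4", "check": "TCP port 502", "passed": False},
]
print(triage(checks))  # Ping OK but TCP port closed: firewall or service not running
```

The ordering matters: a failed ping makes the higher-layer results meaningless, so the function reports the lowest failing layer first.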
| Scenario | Wireshark Filter |
|---|---|
| MRP frames | eth.type == 0x88e3 |
| PROFINET RT | eth.type == 0x8892 |
| LLDP | eth.type == 0x88cc |
| Modbus TCP | tcp.port == 502 |
| EtherNet/IP | tcp.port == 44818 or udp.port == 2222 |
| OPC UA | tcp.port == 4840 |
| CRC errors | eth.fcs_bad == 1 |
| TCP retransmissions | tcp.analysis.retransmission |
| ARP conflicts | arp.duplicate-address-detected |
| Slow TCP responses | tcp.analysis.ack_rtt > 0.1 |
| Counter | Meaning | Likely Cause |
|---|---|---|
| CRC errors | Frames with bad FCS | Bad cable, SFP, or EMI |
| Runts | Frames < 64 bytes | Collision (half-duplex) or NIC fault |
| Giants | Frames > 1518 bytes | MTU misconfiguration |
| Output drops | Frames dropped (queue full) | Congestion |
| Late collisions | Collisions after 64 bytes | Duplex mismatch or cable too long |
Check on Hirschmann HiOS at Basic Settings → Port Statistics. Check on Linux with ethtool -S eth0 | grep -E "error|drop|crc".
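The `ethtool -S` check is easy to fold into the Python toolkit. A sketch that parses the counter output and flags nonzero error counters; the exact counter names vary by NIC driver, so the sample output below is illustrative:

```python
import re


def flag_error_counters(ethtool_output: str) -> dict[str, int]:
    """Return nonzero error/drop/CRC counters from `ethtool -S` text.

    Assumes the common "name: value" line format; counter names
    are driver-specific.
    """
    flagged = {}
    for line in ethtool_output.splitlines():
        m = re.match(r"\s*([\w.]+):\s*(\d+)\s*$", line)
        if m and re.search(r"error|drop|crc", m.group(1), re.IGNORECASE):
            value = int(m.group(2))
            if value > 0:
                flagged[m.group(1)] = value
    return flagged


sample = """\
NIC statistics:
     rx_packets: 1024
     rx_crc_errors: 17
     tx_dropped: 0
"""
print(flag_error_counters(sample))  # {'rx_crc_errors': 17}
```

Run it periodically and compare snapshots: an error counter that is nonzero but static is history; one that increments between runs is an active fault.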
Document every solved problem
Step 7 is the most skipped and most valuable. A documented solution prevents the same problem from taking hours next time.
Split pairs pass continuity, fail performance
A continuity tester does not detect split pairs. Use a cable certifier that measures NEXT. Split pairs cause CRC errors under load.
tshark detects OT protocol violations
Use tshark filters for PROFINET cycle time violations, MRP oscillation, and Modbus timeout patterns. These are the three most common OT network problems.