Recently a client approached me because regular interruptions on their network were starting to frustrate them. After a short period of monitoring with PRTG, it was clear they were suffering periodic bouts of packet loss on the LAN. The loss seemed to peak during working hours, and the slowdowns were experienced by users on any of the 13 access layer switches.
Packet Loss % for pings on the LAN
A quick look at the spanning-tree details on the core switches showed the “last change occurred” counters being reset every couple of minutes, with the originating port changing each time, which I initially took to indicate an STP loop.
Huge number of topology changes at a high frequency originating from edge ports
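For anyone wanting to run the same check, a filter along these lines trims the spanning-tree detail output down to the topology change counters. This is a sketch of the commonly used form rather than the exact command from this engagement, and the output wording can vary slightly between IOS releases.

```
! Run from privileged EXEC on a core switch.
! The include filter keeps only the per-VLAN header line, the topology
! change counter with its "last change occurred" timer, and the port the
! last change came from.
show spanning-tree detail | include ieee|occurr|from|is exec
```

A healthy VLAN typically shows the last change as hours or days ago. A timer that keeps resetting to a few seconds or minutes, with a different "from" port each time, points at end-device ports generating the changes.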
After a bit more research I found articles indicating that, while you can normally get away without PortFast or BPDU Guard, leaving them out can cause headaches in larger networks such as this client's. What seemed to be happening was that a Topology Change Notification (TCN) was being flooded every time a device was added to or removed from the network. The resulting excess TCN / TCA traffic, combined with the reduced aging time, exacerbated the issues with their already large broadcast domain. One of those articles describes the mechanism as follows:
Another common issue caused by flooding is Spanning-Tree Protocol (STP) Topology Change Notification (TCN). TCN is designed to correct forwarding tables after the forwarding topology has changed. This is necessary to avoid a connectivity outage, as after a topology change some destinations previously accessible via particular ports might become accessible via different ports. TCN operates by shortening the forwarding table aging time, such that if the address is not relearned, it will age out and flooding will occur.
TCNs are triggered by a port that is transitioning to or from the forwarding state. After the TCN, even if the particular destination MAC address has aged out, flooding should not happen for long in most cases since the address will be relearned. The issue might arise when TCNs are occurring repeatedly with short intervals. The switches will constantly be fast-aging their forwarding tables so flooding will be nearly constant.
Normally, a TCN is rare in a well-configured network. When the port on a switch goes up or down, there is eventually a TCN once the STP state of the port is changing to or from forwarding. When the port is flapping, repetitive TCNs and flooding occurs.
Ports with the STP portfast feature enabled will not cause TCNs when going to or from the forwarding state. Configuration of portfast on all end-device ports (such as printers, PCs, servers, and so on) should limit TCNs to a low amount.
As the client was using trunked ports to the edge devices with no STP protections in place (BPDU Guard or PortFast), I remediated this in the short term with the “spanning-tree portfast trunk” command.
Original port configuration with the addition of “spanning-tree portfast trunk”
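As a rough illustration of the change, a trunked edge port with the command added might look like the sketch below. The interface name is hypothetical and is not taken from the client's configuration.

```
! Hypothetical interface name; substitute the trunk that faces the edge device.
interface GigabitEthernet1/0/10
 switchport mode trunk
 ! Treat this trunk as an edge port so that link up/down events on it
 ! no longer generate TCNs.
 ! Newer IOS releases spell this "spanning-tree portfast edge trunk".
 spanning-tree portfast trunk
```

This should only be applied to trunks that terminate on end devices. On a trunk that leads to another switch, skipping the listening and learning states can create a temporary bridging loop.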
After rolling this out across their switches, the rogue TCNs disappeared completely and the packet loss on the client’s LAN was eliminated. I also took the opportunity to move all their switches to Rapid PVST so that any future STP convergence is much quicker.
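The mode change itself is a single global command on each switch. The sketch below assumes a Catalyst-style IOS switch currently running the default PVST+ mode.

```
! Global configuration mode.
! Switches spanning tree from PVST+ to Rapid PVST+ for all VLANs.
spanning-tree mode rapid-pvst
```

The active mode can be confirmed afterwards with "show spanning-tree summary". Changing mode restarts spanning tree on the switch, so it is safest done in a maintenance window. Rapid PVST+ also relies on edge ports being marked as such, so the PortFast configuration still matters after the change.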
Going forward, the client's network will be reconfigured to make better use of VLANs and reduce the size of the broadcast domain (two subnets currently share VLAN 1), together with manual pruning of VLANs on trunks where possible. The switch configurations will also be cleaned up to follow a consistent approach, with regular access switch ports using PortFast and BPDU Guard.
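A minimal sketch of what that target configuration could look like is shown below. The interface ranges and VLAN numbers are placeholders for illustration only, not the client's actual design.

```
! Hypothetical interface range and VLAN number.
! Regular access ports: PortFast suppresses TCNs on link up/down, and
! BPDU Guard err-disables the port if a switch is plugged in by mistake.
interface range GigabitEthernet1/0/1 - 24
 switchport mode access
 switchport access vlan 10
 spanning-tree portfast
 spanning-tree bpduguard enable
!
! Hypothetical uplink. Manual pruning: carry only the VLANs that the
! switch at the far end actually needs.
interface GigabitEthernet1/0/48
 switchport mode trunk
 switchport trunk allowed vlan 10,20
```

The same access-port behaviour can be set globally with "spanning-tree portfast default" and "spanning-tree portfast bpduguard default", which helps keep the switch configurations consistent.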