AWS NLB with TLS listener and tcp_sack

When news of the TCP_SACK panic vulnerability came out, we followed much of the world in applying the “sledgehammer” mitigation until updated kernels became available and we had a chance to perform updates and reboots:

echo 0 > /proc/sys/net/ipv4/tcp_sack
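For anyone who wants the change to survive a reboot, the equivalent persistent form (assuming a distribution that reads /etc/sysctl.d; the file name below is just an example) is:

# apply immediately
sysctl -w net.ipv4.tcp_sack=0
# and persist across reboots
echo "net.ipv4.tcp_sack = 0" > /etc/sysctl.d/99-disable-tcp-sack.conf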

This largely went without any significant impact, except in one very specific configuration in AWS, where a TLS listener is attached to an NLB. Normally you'd use an ALB, not an NLB, for https termination, but in this particular case we needed to serve both https and gitd traffic from the same IP address, which is difficult to achieve with an ALB.
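For context, the setup in question looks roughly like the sketch below (the load balancer name, ARNs and subnets are placeholders, not our actual configuration): one NLB carrying a TLS listener on 443 that terminates https, plus a plain TCP listener on 9418 that passes git-daemon traffic straight through to the same instances.

aws elbv2 create-load-balancer --name git-frontend --type network \
    --subnets subnet-aaaa subnet-bbbb

# TLS termination for https happens on the NLB itself
aws elbv2 create-listener --load-balancer-arn <nlb-arn> \
    --protocol TLS --port 443 \
    --certificates CertificateArn=<acm-certificate-arn> \
    --default-actions Type=forward,TargetGroupArn=<https-target-group-arn>

# plain TCP pass-through for the git daemon
aws elbv2 create-listener --load-balancer-arn <nlb-arn> \
    --protocol TCP --port 9418 \
    --default-actions Type=forward,TargetGroupArn=<gitd-target-group-arn>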

Shortly after turning off tcp_sack, our external monitoring service started reporting intermittent availability alerts for the https endpoint: the check would fail once every 10 minutes or so, then almost immediately recover. All other checks for that server were reporting green, so the on-call sysops staff ascribed this to the fact that the server is located in the ap-southeast-1 region, which routinely sees high latency and availability blips when monitored from North American endpoints.

However, after the situation had persisted for over 24 hours and we started receiving complaints from clients, it became clear that something was going wrong with TLS termination. Since the only change made to those systems was the tcp_sack modification, that was the first thing we rolled back, with immediate and dramatic results:

[Availability graph]
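For the record, the rollback was simply re-enabling SACK on the affected instances:

echo 1 > /proc/sys/net/ipv4/tcp_sack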

The near-solid red in the graph is a bit misleading: in reality availability was intermittent, and the problem was more apparent with some web clients than with others. Accessing the site with Firefox, for example, almost always succeeded (needless to say, when the site “obviously works just fine” as far as the troubleshooter is concerned, that seriously muddles the issue).

From what we can surmise, if the client didn't immediately send the payload after completing the TLS handshake, that TCP session had a high chance of eventually timing out. I'm not 100% sure where tcp_sack fits into the exchange between the NLB's TLS listener and the EC2 instance, since we have no visibility into the TCP traffic after it leaves the box, but that exchange obviously makes use of selective acks, or disabling them wouldn't have had such a dramatic effect. We equally have no explanation for why check latency dropped by about 100 ms over the same period.
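If you want to check whether you're hitting the same failure mode, a simple way to recreate the "idle after handshake" condition (the hostname is a placeholder) is to complete the handshake with openssl s_client and wait a while before sending anything:

openssl s_client -connect git.example.org:443
# wait 30 seconds or so after the handshake completes before typing the HTTP request

In our broken state, a session left idle like this had a good chance of eventually timing out, which is exactly the behaviour described above.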

Unfortunately, we couldn't afford to keep the affected systems in their broken state for in-depth troubleshooting, but hopefully this experience is useful to others facing similar issues.