Friday 9 April 2010

TCP Throughput Over Long Fat Networks : “Why am I not getting the throughput I’m paying for?”

Throughput. We all know what that is: how much ‘stuff’ you can move in a period of time, which in the IT world means bits or bytes per second. We also know intuitively what limits throughput: the capacity of a resource and the competing demands for it.

Perhaps harder to grasp is the idea that throughput is also limited by physical distance and other network delays which, in the case of TCP, it is. We don’t notice this every day because our ubiquitous use of TCP, as HTTP browsing, rarely touches this throughput limit. There can, however, be real business consequences wherever enterprises move large volumes of data over long physical distances.

Luckily this TCP throughput limit only exists within individual TCP sessions and doesn’t affect capacity for multiple, concurrent TCP flows. Fortunately there’s also a solution which we’ll come to later.

So what is this TCP throughput constraint and why does it exist?

Back in December 1974 Mud (no, not Elvis) had a UK Number 1 with ‘Lonely this Christmas’ and the first TCP specification (RFC675) was published with a 16-bit window size. The window size is the maximum amount of data which can be sent before an acknowledgement is required. The receiver can vary the window size to assert flow control, but a 16-bit field can’t advertise more than 65535 bytes. As the Round Trip Time (RTT) of the network increases, the sender has to wait longer for an acknowledgement before starting to send the next window.

Queuing delays at network hops and long-distance propagation delays are significant components of the RTT, and some of that delay is finite and can’t be tuned out of the system. Change any of these delay components and the maximum TCP throughput changes with it; it can fluctuate in the course of a transmission and can even be different in each direction.

By way of example, a network path with an RTT of 10ms has a TCP throughput limit of 52Mbit/s (65535 bytes × 8 bits, sent at most once per 10ms round trip) including protocol overhead which, with large frames, is about 49Mbit/s of real application data; the so-called ‘goodput’. Halve the RTT and the throughput doubles; double the RTT and it halves. Even if the Gigabit link to your data centre has an RTT of only 1ms, a bog-standard TCP session could only half fill it. Equally enlightening is that 1ms of extra RTT will halve 500Mbit/s to 250Mbit/s but only reduces 50Mbit/s to about 45Mbit/s; the effect of RTT variability is worse at higher throughputs.
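The arithmetic behind those figures is simple enough to sketch in a few lines of Python. This is only an illustration of the "one window per round trip" ceiling; the function name and the choice of RTT values are mine, and protocol overhead is deliberately ignored.

```python
# Ceiling on a single TCP session when at most one window can be
# delivered per round trip (no window scaling, overhead ignored).
MAX_UNSCALED_WINDOW_BYTES = 65535  # largest value a 16-bit window field can hold


def window_limited_throughput_mbps(rtt_ms, window_bytes=MAX_UNSCALED_WINDOW_BYTES):
    """Return the throughput ceiling in Mbit/s for a given RTT in milliseconds."""
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    bits_per_window = window_bytes * 8
    rtt_seconds = rtt_ms / 1000.0
    return bits_per_window / rtt_seconds / 1_000_000


# Illustrative RTT values only; substitute measurements from your own path.
for rtt in (1, 2, 5, 10, 11, 20):
    print(f"RTT {rtt:>2} ms -> {window_limited_throughput_mbps(rtt):7.1f} Mbit/s")
```

Running it for 1ms and 2ms, and then for 10ms and 11ms, shows why the same extra millisecond hurts far more on a fast path than on a slow one.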

The relationship between the path capacity and the path latency determines where performance-limiting factors may lie. Paths with high capacity and high latency will be limited by TCP throughput whereas paths with low capacity and low latency are limited by the data link capacity. The product of the two is referred to as the ‘bandwidth delay product’ and is a measure of how much data could be in transit (in the pipe but not yet received) at any point in time. Networks with a high bandwidth delay product are called ‘Long Fat Networks’ or LFNs for short.

If the bandwidth delay product is greater than the TCP window size then the entire window can be in transit at once. The sender then has nothing more it is allowed to send; it must sit and wait while the data arrives, the receiver clocks it into its buffers and an acknowledgement makes the return trip. This stop-start pumping effect reduces the effective throughput across LFNs.
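A quick way to see which side of the line a given path falls on is to compute the bandwidth delay product and compare it with the unscaled window. The sketch below assumes the link speed is given in Mbit/s and the RTT in milliseconds; the function names and example paths are hypothetical.

```python
# Compare a path's bandwidth delay product (BDP) with the TCP window.
MAX_UNSCALED_WINDOW_BYTES = 65535


def bandwidth_delay_product_bytes(link_mbps, rtt_ms):
    """Bytes that can be in flight on the path: capacity multiplied by RTT."""
    bits_in_flight = link_mbps * 1_000_000 * rtt_ms / 1000
    return bits_in_flight / 8


def is_window_limited(link_mbps, rtt_ms, window_bytes=MAX_UNSCALED_WINDOW_BYTES):
    """True when the pipe holds more data than one window can supply."""
    return bandwidth_delay_product_bytes(link_mbps, rtt_ms) > window_bytes


# Example paths are illustrative assumptions, not measurements.
for mbps, rtt in ((1000, 1), (10, 10), (100, 80)):
    bdp = bandwidth_delay_product_bytes(mbps, rtt)
    verdict = "window limited" if is_window_limited(mbps, rtt) else "link limited"
    print(f"{mbps} Mbit/s at {rtt} ms RTT: BDP {bdp:,.0f} bytes, {verdict}")
```

The Gigabit path with 1ms of RTT comes out at 125,000 bytes in flight, roughly twice the 65535 byte window, which is why a single unscaled session can only about half fill it.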

So what can be done about this TCP throughput limit?

RFC1323 ‘TCP Extensions for High Performance’ offers a solution called Window Scaling. This is a TCP Option (kind 3) exchanged by both sides during TCP connection establishment to indicate their willingness to shift the advertised window bitwise to the left, doubling it in size with each bit shifted. A window scale factor of 0 indicates scaling capability but zero scaling; a window scale factor of 1 doubles the maximum advertised window from 65.5KByte to 131KByte. The highest scale factor of 14 can increase the maximum window to just over 1GByte.
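The shift arithmetic can be sketched as follows. The helper names are my own, and the second function simply searches for the smallest scale factor whose maximum window would cover a required number of bytes in flight; it says nothing about what a real stack will actually negotiate.

```python
# Window Scale arithmetic: the 16-bit window is shifted left by the scale factor.
MAX_UNSCALED_WINDOW_BYTES = 65535
MAX_SCALE_FACTOR = 14  # highest shift count permitted by RFC1323


def scaled_window_bytes(scale_factor):
    """Largest window that can be advertised with the given scale factor."""
    if not 0 <= scale_factor <= MAX_SCALE_FACTOR:
        raise ValueError("scale factor must be between 0 and 14")
    return MAX_UNSCALED_WINDOW_BYTES << scale_factor


def smallest_scale_factor_for(required_bytes):
    """Smallest scale factor whose maximum window covers required_bytes, or None."""
    for scale_factor in range(MAX_SCALE_FACTOR + 1):
        if scaled_window_bytes(scale_factor) >= required_bytes:
            return scale_factor
    return None


for factor in (0, 1, 7, 14):
    print(f"scale factor {factor:>2}: max window {scaled_window_bytes(factor):,} bytes")
```

For the 125,000 byte example of a Gigabit path with 1ms of RTT, a scale factor of 1 is already enough on paper.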

Most TCP stacks have incorporated the Window Scale option by default for some time (it’s been around since 1992 after all) but there are a few considerations. Firstly, the TCP receive buffer size must be large enough to absorb the increased window size. Secondly, there’s a reliability trade-off: a lost frame within a large scaled window can force a lot of data to be retransmitted, in the worst case (without selective acknowledgements) everything sent after the lost frame, with a further impact on net throughput. Thirdly, TCP congestion avoidance schemes may still limit the ability of the window to ever achieve its maximum size. Fourthly, some network devices, most notably firewalls, have been known to re-write the Window Scale option with unpredictable results. Fifthly, aggressive TCP window scaling can create unfair bandwidth hogging, which might not always be desirable.

So how can you engineer the TCP throughput required by particular business applications?

The first step is to understand the real world performance objectives or KPIs. If the raw throughput of individual TCP sessions is paramount then begin by looking at whether the bandwidth-delay product of the proposed network path exceeds the standard TCP window size.

The next step is to make sure that the end systems are tuned both in terms of buffers and two-way support for RFC1323. I’d always recommend throughput testing in conjunction with protocol analysis to validate this and also highlight the effect of congestion avoidance schemes.
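On the buffer side, an application can ask for a larger receive buffer on a per-socket basis. The snippet below is a minimal sketch using Python’s standard socket module; the 256KByte figure is an arbitrary assumption, and the operating system is free to adjust or cap whatever you request (Linux, for instance, doubles the value and limits it to a system-wide maximum), so always read back what was actually granted.

```python
import socket


def request_receive_buffer(sock, size_bytes):
    """Ask the OS for a receive buffer of size_bytes and return what it reports."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size_bytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


# The request is made before connecting, because the window scale factor
# is fixed during connection establishment.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    requested = 256 * 1024  # hypothetical target, e.g. derived from the path's BDP
    granted = request_receive_buffer(sock, requested)
    print(f"requested {requested} bytes, OS reports {granted} bytes")
```

If the reported value is well short of the bandwidth delay product, the system-wide limits need raising before any amount of per-socket tuning will help.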

I advise impairment testing too because introducing incremental delays and loss under controlled conditions can give you a really good feel for the point at which the tuned system fails to yield throughput returns. A method of monitoring the throughput being achieved may also be necessary to prove that the network service is delivering to throughput KPIs and the business is getting what it’s paying for.