The Options
If you've read about TCP in any capacity, you've probably heard of the TCP_NODELAY flag and Nagle's algorithm. If you haven't, a brief summary is that Nagle's algorithm was added to most OSes' TCP stacks and enabled by default to prevent "tinygrams" - TCP segments carrying only a few bytes of data, where the 40 bytes of TCP/IP headers needed to send the segment in the right direction outweigh the data actually being sent. It does this by only allowing a single unacknowledged small segment to be in flight at any time. After sending one, Nagle's algorithm will hold onto data until enough accumulates to form a reasonably sized segment, or the tinygram is ACKnowledged.
In a time when bandwidth was considerably more expensive this made sense, but does it still make sense nowadays? The common wisdom is no. Blog posts like "It's always TCP_NODELAY" suggest that Nagle's algorithm is dated and unnecessary; bandwidth isn't precious any more and we care more about reducing latency than we do about efficient use of network resources. I disagree, but we'll get onto that later.
If you've read the aforementioned blog post, you'll have heard of TCP_QUICKACK and delayed acknowledgements. Like Nagle's algorithm, delayed ACKs exist to use network resources efficiently. Upon receiving data, an ACK can be delayed for a short amount of time to allow even more data to arrive that also needs acknowledging. Those ACKs can then be bundled together, requiring one 40 byte segment rather than, let's say, five 40 byte segments to acknowledge the received data. Better yet, the delay gives the receiving side time to produce data of its own, allowing the ACK to be bundled with that data - essentially letting us ACK without spending any unnecessary bytes.
Both of these ideas are good but they can both - separately and combined - cause issues too. If you want to send small segments, Nagle's algorithm will slow you down considerably. If you're never sending outgoing data, or no more data comes in for you to ACK, then delayed acknowledgements will introduce latency for no efficiency gains.
TCP_NODELAY is the socket option for disabling Nagle's algorithm and TCP_QUICKACK is for disabling delayed acknowledgements. It confuses me occasionally too.
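For reference, both are plain socket options on Linux. Here's a minimal sketch of setting them on a freshly created socket - the helper name is mine and error handling is omitted, so treat it as illustrative rather than production code:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Sketch: disable Nagle's algorithm and request quickack mode. */
int make_low_latency_socket(void)
{
	int sock = socket(AF_INET, SOCK_STREAM, 0);
	int one = 1;

	/* Disable Nagle's algorithm: small segments are sent immediately. */
	setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

	/* Ask for quickack mode: ACKs are not delayed - for now, at least. */
	setsockopt(sock, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));

	return sock;
}

Whether you actually want either option depends entirely on your traffic pattern, as we'll see.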
The Problem
If you think these options can be bad on their own, you should see them when they're combined. If a client running Nagle's algorithm is holding onto data and the remote is using delayed acknowledgements and doesn't have anything to respond with, then the client's data is going to get held up. The client is waiting for an ACK from the remote and the remote is waiting for a timer to expire. This situation can result in a considerable delay in the transmission of data - sometimes up to 200ms for no reason at all.
This situation may sound a little contrived but it happens more than you might think. In 2015 - documented in this blog post - someone encountered this exact issue. They were attempting to use Ruby to proxy a request through HAProxy (an OSS TCP/HTTP load balancer).
The Ruby library split the HTTP request into two TCP segments - one with headers and one with the body - and Nagle's algorithm sent the first segment but held back the second. HAProxy received the first segment but delayed its ACK until it 1) had data to send back or 2) its timer expired. In this case, a proxy has nothing to return until it receives the full request and as such it had to wait for its timer. This meant there was a full 200ms wait until it ACKnowledged and subsequently received the body.
Is this all the fault of Nagle's algorithm? Is this the fault of delayed ACKs? Would setting quickack with TCP_QUICKACK solve this issue?
TCP_QUICKACK Is Awful
Well... it's actually not that simple. You'd think disabling delayed acknowledgements would be as simple as setting TCP_QUICKACK = TRUE but it isn't. The man page isn't even helpful. Apparently, some of the time (but it won't tell you exactly when) delayed acknowledgements will be re-enabled. Great. Thanks guys.
TCP_QUICKACK is not permanent, it only enables a switch to or from quickack mode. Subsequent operation of the TCP protocol will once again enter/leave quickack mode depending on internal protocol processing and factors such as delayed ack timeouts occurring and data transfer.
In many places online, it's suggested to simply re-enable this value after every single write to ensure this value remains set. That's crazy - surely this documented behaviour is incorrect. I'm not alone in thinking this, even John Nagle seems to think so...
TCP_QUICKACK, which turns off delayed ACKs, is in Linux, but the manual page is very confused about what it actually does. Apparently it turns itself off after a while. I wish someone would get that right.
As it happens, John and I are wrong. The tiny test below proves it. I enable QUICKACK, send a bunch of data, and by the time my connection closes TCP_QUICKACK has been disabled. Even checking mid-connection shows that TCP_QUICKACK is disabled.
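A minimal sketch of that kind of test looks roughly like this - the address, port and payload size are placeholders, so point it at any listening TCP server:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int main(void)
{
	int sock = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in addr = {0};
	int one = 1, val = 0;
	socklen_t len = sizeof(val);
	char buf[4096];

	addr.sin_family = AF_INET;
	addr.sin_port = htons(8080);            /* placeholder port */
	inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

	if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
		perror("connect");
		return 1;
	}

	/* Enable quickack, then push a bunch of data through the socket. */
	setsockopt(sock, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
	memset(buf, 'a', sizeof(buf));
	for (int i = 0; i < 16; i++)
		write(sock, buf, sizeof(buf));

	/* Read the option back: a 0 here means the kernel has quietly
	 * re-entered ping-pong mode and delayed ACKs are back. */
	getsockopt(sock, IPPROTO_TCP, TCP_QUICKACK, &val, &len);
	printf("TCP_QUICKACK after sending: %d\n", val);

	close(sock);
	return 0;
}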
tcpdump tells us the same story:
> tcpdump
I don't like not knowing things, so let's dive into the behaviour of TCP_QUICKACK and figure out what's going on.
The Kernel
int one = 1;
int sock = socket(AF_INET, SOCK_STREAM, 0);
setsockopt(sock, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
Using C, we can create a socket and enable the flag using the socket() and setsockopt() functions. When we look into the setsockopt() function we can see that it delegates to the do_tcp_setsockopt() function which delegates to the __tcp_sock_set_quickack() function.
/* net/ipv4/tcp.c (abridged): do_tcp_setsockopt() hands TCP_QUICKACK to this helper */
void __tcp_sock_set_quickack(struct sock *sk, int val)
{
	if (!val) {
		inet_csk_enter_pingpong_mode(sk);
		return;
	}

	inet_csk_exit_pingpong_mode(sk);
	/* ... any ACK already scheduled is pushed out here ... */
	if (!(val & 1))
		inet_csk_enter_pingpong_mode(sk);
}
If val is zero, TCP_QUICKACK is disabled by immediately entering "ping-pong" mode. Otherwise, we exit ping-pong mode and send any ACKs that are waiting. Oddly, if val is even, we re-enter ping-pong mode afterwards. This behaviour isn't documented anywhere. The functions for entering and exiting this mode simply set the value of the icsk_ack.pingpong field on the socket.
/* include/net/inet_connection_sock.h (abridged) */
static inline void inet_csk_exit_pingpong_mode(struct sock *sk)
{
	inet_csk(sk)->icsk_ack.pingpong = 0;
}

static inline void inet_csk_enter_pingpong_mode(struct sock *sk)
{
	/* the threshold is read from a sysctl-controlled value */
	inet_csk(sk)->icsk_ack.pingpong =
		READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_pingpong_thresh);
}
When exiting, this field is set to 0. When entering, it is set to a value which can be controlled via sysctl(), an interface for reading and writing kernel attributes. So what is ping-pong mode and why would we want to be able to tune its parameters?
A TCP connection can have a few different traffic patterns, and ping-pong mode is used to describe what is also known as an "interactive" pattern. This is a connection in which data is sent back and forth between both ends, with each end responding to the other. An example would be telnet, where every keypress is a TCP segment and every keypress is echoed back to the user. Something like an FTP file download would be considered "bulk transfer" rather than an interactive session.
In the interactive pattern latency is critical, so you'd expect delayed acknowledgements to be the last thing you want - and yet this is exactly where TCP_QUICKACK gets disabled. It's confusing at first, but the nature of interactive traffic means quick ACKs aren't necessary.
Delayed acknowledgements exist to reduce the number of segments that carry no data, only an ACK. In interactive mode there is always a segment about to be sent straight back (i.e. echoing a keypress in telnet), so the ACK can ride along with it for free.
The functions above tell us that TCP_QUICKACK is only relevant for "bulk-transfer" type traffic, and that makes sense. In a one-sided bulk transfer there's no data being sent the other way for the ACKs to piggyback on, so many empty segments would be sent. Delayed acknowledgements reduce the number of segments at the cost of potentially increased latency.
Why ping-pong mode is re-enabled immediately if an even value is passed to setsockopt() is unknown to me.
From the code we've seen so far, you'd think TCP_QUICKACK was permanent as long as we set it to 1, so let's look deeper and see if we can spot where it changes. Searching for functions with quickack in their name turns up these few, and they tell an interesting story.
/* net/ipv4/tcp_input.c (abridged) */
static void tcp_incr_quickack(struct sock *sk, unsigned int max_quickacks)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	unsigned int quickacks = tcp_sk(sk)->rcv_wnd / (2 * icsk->icsk_ack.rcv_mss);

	if (quickacks == 0)
		quickacks = 2;
	quickacks = min(quickacks, max_quickacks);
	if (quickacks > icsk->icsk_ack.quick)
		icsk->icsk_ack.quick = quickacks;
}

/* net/ipv4/tcp_output.c (abridged) */
static void tcp_dec_quickack_mode(struct sock *sk, const unsigned int pkts)
{
	struct inet_connection_sock *icsk = inet_csk(sk);

	if (icsk->icsk_ack.quick) {
		if (pkts >= icsk->icsk_ack.quick) {
			icsk->icsk_ack.quick = 0;
			/* Leaving quickack mode we deflate ATO. */
			icsk->icsk_ack.ato = TCP_ATO_MIN;
		} else
			icsk->icsk_ack.quick -= pkts;
	}
}
The functions incr_quickack() and dec_quickack() make it pretty clear how TCP_QUICKACK works. The socket contains a field, icsk_ack.quick, which is a count of the number of quick ACKs allowed to be made.
The decrementing function decrements by the number of segments sent and, if the counter hits 0, sets the ACK delay timer to its minimum value. My guess for why it does this is so that any older, larger value left over from before we enabled QUICKACK doesn't cause unnecessary latency once we leave the mode.
The incrementing function increments the counter, but not beyond a maximum number provided to the function.
unsigned int quickacks = tcp_sk(sk)->rcv_wnd / (2 * icsk->icsk_ack.rcv_mss);
This line in the incr_quickack() function determines how much to attempt to increment the counter by. It divides the size of the receive window by twice the maximum segment size. Briefly, the receive window is the amount of buffer space available for incoming data; its size is advertised in every ACK to help the sender transmit efficiently without overwhelming the receiver. The maximum segment size is the largest amount of data a single TCP segment can carry.
Combined in that equation, it works out how many full sized segments it would take to fill half the buffer. I don't know for sure, but my assumption is that this is to prevent the buffer being overrun by issuing too many quick ACKs. It forces the number of quick ACKs to scale with the window size: when the buffer gets small, the counter is incremented less, so we drift back towards delayed ACKs the closer we get to overwhelming the buffer. When the buffer grows, the number of quick ACKs grows too, filling the available space as quickly as possible.
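To put some rough, purely illustrative numbers on it: with a 64 KiB receive window and a 1460-byte MSS, that's 65536 / (2 × 1460) ≈ 22 quick ACKs; shrink the window to 8 KiB and the increment drops to 8192 / 2920 ≈ 2.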
/* net/ipv4/tcp_input.c (abridged) */
static void tcp_enter_quickack_mode(struct sock *sk, unsigned int max_quickacks)
{
	struct inet_connection_sock *icsk = inet_csk(sk);

	tcp_incr_quickack(sk, max_quickacks);
	inet_csk_exit_pingpong_mode(sk);
	icsk->icsk_ack.ato = TCP_ATO_MIN;
}
/* Send ACKs quickly, if "quick" count is not exhausted
* and the session is not interactive.
*/
static bool tcp_in_quickack_mode(struct sock *sk)
{
	const struct inet_connection_sock *icsk = inet_csk(sk);
	const struct dst_entry *dst = __sk_dst_get(sk);

	return (dst && dst_metric(dst, RTAX_QUICKACK)) ||
		(icsk->icsk_ack.quick && !inet_csk_in_pingpong_mode(sk));
}
Just a few lines below the incr_quickack() function we see the two functions shown above. The enter_quickack() function calls the incrementing function, then exits ping-pong mode and resets the ACK delay timer, for much the same reason dec_quickack() does. Perhaps one of these instances is redundant? Much like incr_quickack(), a maximum number of allowed quick ACKs is provided.
The in_quickack_mode() function is a helper to be called when determining if QUICKACK mode is enabled. Two conditions are evaluated and returned, the latter of which should make sense to us.
icsk->icsk_ack.quick && !inet_csk_in_pingpong_mode(sk)
If the counter is non-zero and we're not in an interactive flow pattern, then we are considered to be in TCP_QUICKACK mode. The first condition, however, is completely foreign; we haven't seen anything like it in our short exploration, but a little digging reveals something very interesting and very helpful.
dst && dst_metric(dst, RTAX_QUICKACK)
This blog series goes into it in a lot more detail but, to summarise, the dst_metric() function retrieves attributes about routing. In the in_quickack_mode() function, the attribute being looked up is RTAX_QUICKACK, which is used for setting quickack in the routing table based on the socket's destination, as opposed to doing it directly on the socket. What makes this revelation deserve the "very" modifier is that this condition checks neither the counter nor whether the connection is interactive. At first glance, this looks to me like a way to permanently set TCP_QUICKACK.
One perusal of the Linux kernel mailing list later and we see that this is the case. In 2015 there was a discussion and patch which moved the above condition inside the in_quickack_mode() function in order to make sure every ACK performed is quick, regardless of what the socket thinks should actually be happening.
So, if we want to permanently disable delayed ACKs (permanent until we restart the machine at least) we can just add the attribute to the routing table for our interface.
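On Linux that's done through iproute2, which exposes the attribute as quickack. Something along these lines should work - the gateway and device are placeholders, so copy them from whatever ip route show reports for the route you care about:

> ip route change default via 192.168.1.1 dev eth0 quickack 1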
Making a Patch
This does beg the question: why is there no way to do this at the socket level? It doesn't seem right that we have to modify our routing table - which requires superuser permissions - in order to get the behaviour that most users of TCP_QUICKACK wanted in the first place. There are many places on the internet suggesting you call setsockopt() after every write() to ensure the value stays set, and even that might not be effective.
I think that if it's okay to force the option using a routing table, it should be okay - and possible - to do the same using setsockopt(). So... let's do it! I've always wanted to submit a kernel patch.
Rather than change the behaviour of the existing flag, we'll create a new one called TCP_PERMA_QUICKACK which uses the same mechanism as RTAX_QUICKACK to override the protocol's wishes. It's only a few lines: we define the flag, add the handling for it in setsockopt() and add a check for it to the in_quickack_mode() function. Super simple. Here I was thinking Linux maintainers had difficult jobs...
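The patch itself isn't reproduced here, but a rough illustration of its shape - the option number and the perma_quickack field below are placeholders, not the real names - looks something like this:

/* Illustrative sketch only, not the actual patch. */

/* include/uapi/linux/tcp.h: define the new socket option
 * (the real value would be the next free TCP option number) */
#define TCP_PERMA_QUICKACK ...

/* net/ipv4/tcp.c, in do_tcp_setsockopt(): remember the user's choice */
case TCP_PERMA_QUICKACK:
	inet_csk(sk)->icsk_ack.perma_quickack = !!val;	/* hypothetical field */
	break;

/* net/ipv4/tcp_input.c: honour it alongside the routing-table override */
static bool tcp_in_quickack_mode(struct sock *sk)
{
	const struct inet_connection_sock *icsk = inet_csk(sk);
	const struct dst_entry *dst = __sk_dst_get(sk);

	return (dst && dst_metric(dst, RTAX_QUICKACK)) ||
		icsk->icsk_ack.perma_quickack ||	/* new check */
		(icsk->icsk_ack.quick && !inet_csk_in_pingpong_mode(sk));
}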
Submitting the patch was nearly as difficult as creating it, to be completely honest, but it didn't take long for it to be accepted, so socket-level, permanently-enabled TCP_QUICKACK should be coming to a kernel near you. Thank me later.
Why does TCP_QUICKACK toggle?
As for what actually triggers TCP_QUICKACK to toggle - since we got sidetracked making a kernel patch - there are only a few places in the codebase where it can happen, and all of them are in functions that handle window sizes and dropped packets.
To understand why, we need to look at how TCP connections evolve after the handshake. Even in the ESTABLISHED state, a connection transitions through several flow-control states that influence when a quick ACK makes sense. As a general rule of thumb: if packets are dropped we want them re-sent ASAP, so TCP_QUICKACK is enabled; if the window size changes, we either want to disable it to stop the buffer overflowing and causing dropped packets, or enable it to ensure the buffer is filled as fast as possible. In that latter case, the number of allowed quick ACKs is exactly the number it would take to fill the buffer, if every ACKed segment were maximally full.
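If you want to find these places yourself, grepping a kernel source tree for the enter function lists every call site:

> grep -rn "tcp_enter_quickack_mode" net/ipv4/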
I submitted a patch to better document this behaviour too.
Delayed ACKs or Nagle's Algorithm?
So, with an understanding of delayed acknowledgements and their actual behaviour, let's examine how these options affect modern traffic on a production proxy server...