
Introduction

At Ping we lease IPs to our customers and tell them they can use them as much as they want. No concurrency limits, no bandwidth limits; we'll support pretty much any scale.

This means, in some extreme cases, a client will try to do silly things. Maybe they'll make ten thousand connections to ten thousand sites. Maybe they'll make ten thousand connections to the same site. And, sometimes, maybe, they'll make ten thousand connections to the same site, per second.

It's that last one that causes issues.

If you try to handle that many requests, you start running into the limits of what is possible: you encounter undocumented behaviour, and the common mental model of how sockets work breaks down.

This blog post will teach you how socket address binding really works, starting with a high-level mental model and working our way down each time a new example breaks that model. It's okay to cry.

Before we begin, to help you understand how complicated it truly is, you should know that Cloudflare has two blog posts on the topic, the second of which is a correction of the first and is named "The Quantum State of a TCP Port". "Quantum" should be ringing alarm bells.

Let's begin with an overview of the Socket API.

Note: Why is this necessary?

You might be wondering whether anyone would ever need that many connections to a single site.

Well, in April, one of our customers had all their proxy providers but us go offline. They routed all their traffic through us and generated around 300 million requests in two hours, all to the same website. It came to around 5,000 requests per proxy, per minute. Without the information in this article, their requests would have started failing after a few minutes.

The Socket API

The Berkeley Software Distribution (BSD) socket API is pretty simple at first glance. You get five important functions: socket(), setsockopt(), bind(), connect() and listen().

  • socket() creates your socket. It doesn't bind, it doesn't connect, it doesn't listen. It simply instantiates a file descriptor.

  • setsockopt() allows you to configure that socket's options. If you're going to use it for TCP, maybe you want to set the size of the receiving window. Or maybe you want to enable TCP_QUICKACK - blog post on that here.

  • bind() binds the socket to an IP and port combination. If you provide a wildcard port (0) it will bind a random ephemeral port. If you bind a wildcard IP (0.0.0.0 for IPv4), the behaviour of this call is determined by the next function call you make.

  • connect() connects to the target address. If the socket hasn't been bound, connect will bind to a random address for you. If you did bind a socket, and the IP address was a wildcard then a single random IP on the machine will be used.

  • listen() listens for incoming connections. If the socket hasn't been bound, listen will bind one for you. If you did bind a socket, and the IP address was a wildcard it will listen on all IPs on the machine.

A bind() before listen() is rather common but a bind() before connect() is pretty rare. Typically, when connect()ing to a target you don't care about your egress address. In our case as a proxy server, it lets us ensure that the ingress IP address is the same as the egress IP address. If a request comes in to 192.126.1.123 then the onward connection will bind and send data out of 192.126.1.123.
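
To make the five calls concrete, here is a minimal Python sketch that runs entirely on the loopback interface. The 127.0.0.1 address and the variable names are our own choices for illustration; on a real proxy the bound IP would be the ingress address. It shows bind() before listen() on the server side and bind() before connect() on the client side, then checks that the address the server sees is the one the client bound.

```python
import socket

# Server side: socket() -> setsockopt() -> bind() -> listen()
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
server.listen()
target = server.getsockname()   # the (ip, port) the kernel actually chose

# Client side: socket() -> bind() -> connect()
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.bind(("127.0.0.1", 0))   # pin the egress IP, leave the port to the kernel
client.connect(target)

accepted, peer = server.accept()

# The address the server sees is the one the client bound to.
egress = client.getsockname()
print("egress address:", egress)
print("seen by server:", peer)

for s in (accepted, client, server):
    s.close()
```

Binding the client to port 0 keeps the example portable: we only care about fixing the source IP here, not the port.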

The Quad

Sockets can be identified by a tuple of four values - the quad - made up of two two-element tuples - the bind-tuple and the connect-tuple. The bind-tuple contains the source IP and the source port. The connect-tuple contains the destination IP and destination port.

|-------------- QUAD --------------|
(SRC.ip, SRC.port, DST.ip, DST.port)
 |---- BIND ----|  |--- CONNECT --|

The kernel ensures that all TCP sockets have a unique quad since that guarantees there will be no "cross-contamination" of data.

  • When connect()ing a socket, the OS will see the full quad. As long as the quad we have provided does not match that of any quad already in use, the connection is allowed to happen.

  • When bind()ing a socket before connect()ing, the only values available at the time of the bind() call are the source values. Despite this, the OS performs the same quad check with a quad made up of the provided bind-tuple and a wildcard connect-tuple which matches on anything. As such, bind() before connect() is much more limited in the number of connections that can be made.

bind() before connect() can form as many connections as there are ephemeral ports, regardless of the number of targets (per local IP). A raw connect() can form as many connections as there are ephemeral ports to each target (per local IP).
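
If you want to look at a real quad, the sketch below builds one from getsockname() and getpeername(). It is only an illustration on loopback; the quad() helper is a name we made up for this example. Two raw connect()s to the same listener share a connect-tuple but end up with different bind-tuples.

```python
import socket

def quad(sock):
    """Return (SRC.ip, SRC.port, DST.ip, DST.port) for a connected IPv4 socket."""
    return sock.getsockname() + sock.getpeername()

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen()
target = listener.getsockname()

# Two raw connect()s to the same target: the kernel picks the bind-tuple.
c1 = socket.create_connection(target)
c2 = socket.create_connection(target)

q1, q2 = quad(c1), quad(c2)
print(q1)
print(q2)

for s in (c1, c2, listener):
    s.close()
```

The printed ports will differ from run to run, since the kernel chooses them.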

Let's run through some examples to see how things work in practice. All examples will be run on a machine with two IPs and two ephemeral ports.

All targets will accept incoming connections and hold onto them until we initiate close.

Raw connect()

# Local IPs: 1.1.1.1, 2.2.2.2
# Ephemeral Ports: 50_000, 50_001

c1 = connect(3.3.3.3, 1)
c2 = connect(3.3.3.3, 1)
c3 = connect(3.3.3.3, 1)
c4 = connect(3.3.3.3, 1)


Four connect()s to the same target work. All have the same connect-tuple but they all have different bind-tuples. We don't need to specify the bind-tuple via a bind() because we know that a raw connect() will bind any free IP and ephemeral port. This forms a unique quad for each connect() and so no errors occur.

Now, if we try to issue a fifth connect() call without dropping any of the previous connections, what will happen?

# Local IPs: 1.1.1.1, 2.2.2.2
# Ephemeral Ports: 50_000, 50_001

c1 = connect(3.3.3.3, 1)
c2 = connect(3.3.3.3, 1)
c3 = connect(3.3.3.3, 1)
c4 = connect(3.3.3.3, 1)

# ---

connect(3.3.3.3, 1)


The quad of the fifth call cannot be unique as all possible quads have been formed by the previous four connections. This call will fail.

What if we tried to connect to a different target instead?

# ---

connect(4.4.4.4, 1)


This works because this fifth call creates a unique quad. On its own, the bind-tuple is not unique but it's combined with a different connect-tuple. Now, what happens if we drop a connection before we try the fifth connect() call to the same target as the others?

# ---

c1.close()
connect(3.3.3.3, 1)


The intuitive answer from our current mental model would suggest that this code should work... but it won't. When you initiate the close of a TCP connection, the connection enters the TIME_WAIT state. This state exists so that any data sent by the remote and not yet received by the client has time to arrive - usually, we wait for double the maximum segment lifetime (typically 60 to 120 seconds). If we were not to wait, then a subsequent connection from the same source address to the same target could receive data from the old connection.

As a proxy company, this issue of TIME_WAIT is a little annoying since we have clients trying to repeatedly form new connections, so we need to be able to reuse those quads as quickly as possible. The solution is the SO_REUSEADDR option, which allows us to rebind sockets in the TIME_WAIT state. You can think of it as telling the OS not to check the list of TIME_WAIT sockets when looking for unique quads.

# ---

c1.close()

s = socket()
s.reuse_addr()
s.connect(3.3.3.3, 1)
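
In real code the s.reuse_addr() step is a setsockopt() call. The sketch below is a loopback-only illustration, assuming Linux; bound_socket() is a helper name we invented. It sets SO_REUSEADDR on both the old and the new socket, because as far as we can tell the kernel also looks at the flag carried by the lingering socket. We close the first connection from our side and then rebind its exact source address straight away.

```python
import socket

def bound_socket(addr, reuse):
    """Create a TCP socket, optionally set SO_REUSEADDR, then bind it to addr."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if reuse:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(addr)
    return s

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen()
target = listener.getsockname()

# First connection: we initiate the close, so our side lingers in the kernel.
first = bound_socket(("127.0.0.1", 0), reuse=True)
first.connect(target)
source = first.getsockname()
first.close()

# Rebind the exact same source IP and port without waiting.
second = bound_socket(source, reuse=True)
rebound = second.getsockname()
reuse_flag = second.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)
print("rebound", rebound, "with SO_REUSEADDR =", reuse_flag)

second.close()
listener.close()
```

Note that this only demonstrates the rebind. Connecting the second socket to the same target again would recreate the old quad, which is a separate check.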

Explicit bind() Before connect()

We have a good understanding of how connect() works, so let's throw bind() into the mix...

# Local IPs: 1.1.1.1, 2.2.2.2
# Ephemeral Ports: 50_000, 50_001

s = bind(1.1.1.1, 50_000)
c1 = s.connect(3.3.3.3, 1)

s = bind(1.1.1.1, 50_001)
c2 = s.connect(3.3.3.3, 1)

s = bind(2.2.2.2, 50_000)
c3 = s.connect(3.3.3.3, 1)

s = bind(2.2.2.2, 50_001)
c4 = s.connect(3.3.3.3, 1)


Binding with explicit IPs and ports works exactly as you would expect. We're guaranteeing that the quads are unique by rotating through the available IPs and ports. What happens if we add a fifth bind() and connect()?

# ---

s = bind(1.1.1.1, 50_000)
s.connect(3.3.3.3, 1)


Upon binding, only the bind-tuple is available and it is not found to be unique. As such, this example throws an error.

Does anything change if we connect() to a different target than the other four sockets?

# ---

s = bind(1.1.1.1, 50_000)
s.connect(4.4.4.4, 1)


No. While connect()ing to a new target would provide us with a unique quad at that stage, we're failing at bind() because there are no unique bind-tuples available.

If, as with the connect() only examples, we drop a connection, will we be able to make our fifth bind() and connect() calls?

# ---

c1.close()

s = socket()
s.reuse_addr()

s.bind(1.1.1.1, 50_000)
s.connect(4.4.4.4, 1)


Yes, but only if we set the SO_REUSEADDR socket option. Just as with connect() only, sockets in TIME_WAIT can't be bound unless we explicitly say we are okay with doing so. The above code works regardless of using the same or a different target in the connect() call. We could have gone to 3.3.3.3:1 and it would have succeeded too.

It's pretty clear that bind() before connect() is much more limited than simply doing raw connect()s, and the reason why should be obvious. With only the SRC.ip and SRC.port available, there are fewer combinations available to make a unique tuple with.

Note: Maximum Number of Connections

bind() before connect() only allows you to form as many connections as there are ephemeral ports per IP address.

Raw connect()s however, allow you to make as many connections as there are ephemeral ports per IP, per target.
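
As a quick sanity check on those two limits, here is the arithmetic as a small Python sketch. The function names are ours, and the port range in the last lines is the usual Linux default for net.ipv4.ip_local_port_range (32768 to 60999), which is an assumption about your machine rather than something taken from our servers.

```python
def max_bind_before_connect(local_ips: int, ephemeral_ports: int) -> int:
    """Upper bound when every socket is explicitly bound before connecting."""
    return local_ips * ephemeral_ports

def max_raw_connect(local_ips: int, ephemeral_ports: int, targets: int) -> int:
    """Upper bound for raw connect()s: the same budget again for each target."""
    return local_ips * ephemeral_ports * targets

# The toy machine from the examples: two IPs, two ephemeral ports.
toy_bind = max_bind_before_connect(2, 2)
toy_raw_one_target = max_raw_connect(2, 2, targets=1)
toy_raw_two_targets = max_raw_connect(2, 2, targets=2)
print(toy_bind, toy_raw_one_target, toy_raw_two_targets)

# Assumed default Linux ephemeral range, single local IP.
linux_default_ports = 60_999 - 32_768 + 1
print(max_bind_before_connect(1, linux_default_ports))
print(max_raw_connect(1, linux_default_ports, targets=1_000))
```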

Wildcard bind() Before connect()

todo!()

Mix and Match bind() and connect()s

We've mentioned previously that bind()s are more limited than raw connect()s and we have seen examples of this. If we make four raw connect()s to target A and then make a connect() to target B, everything works fine, but doing the same with bind() and connect() fails.

We understand why this happens but it is a frustrating behaviour. This would limit each proxy on our servers to sixty-five thousand connections, regardless of the number of targets. That's a surprisingly low number and would cause issues for a lot of our customers.

TODO: Grafana screenshot showing 100k+ concurrent tunnels?

So, how can we get around that? Consider the code below. The four bind() and connect() calls will succeed, as we know, but what will happen when we make a raw connect() to a different target afterwards?

# Local IPs: 1.1.1.1, 2.2.2.2
# Ephemeral Ports: 50_000, 50_001

s = bind(1.1.1.1, 50_000)
c1 = s.connect(3.3.3.3, 1)

s = bind(1.1.1.1, 50_001)
c2 = s.connect(3.3.3.3, 1)

s = bind(2.2.2.2, 50_000)
c3 = s.connect(3.3.3.3, 1)

s = bind(2.2.2.2, 50_001)
c4 = s.connect(3.3.3.3, 1)

# ---

c5 = connect(4.4.4.4, 1)


If you've been paying attention, you should have been able to reason out that this would succeed. The final connect() call will have a unique quad and be allowed to succeed. As shown above, if we were to explicitly bind() a socket, it would have failed.

Skipping the bind() check

Ideally, to allow us to explicitly bind() the socket and set our outgoing IP, we want the ability to skip the check of the bind-tuple and only perform the quad-check during the connect() call. Fortunately, there is a socket option for exactly this: IP_BIND_ADDRESS_NO_PORT. Using this socket option lets us set the IP we want to bind whilst leaving the port binding and the bind-tuple check until the connect() call.

# ---

s = socket()
s.bind_address_no_port()

s.bind(1.1.1.1, 50_000)
c5 = s.connect(4.4.4.4, 1)


This succeeds. IP_BIND_ADDRESS_NO_PORT allows us to skip what would otherwise be a failing bind() call and instead only perform the full quad check during the connect() call.
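
For reference, this is roughly what setting the option looks like from Python on Linux. Treat it as a sketch under assumptions: not every Python build exposes the constant, so we fall back to 24, its value in the Linux headers, and the example binds with port 0 on loopback so the kernel is free to choose the port. When the option takes effect, the socket reports port 0 after bind() and only gets a real port during connect().

```python
import socket

# Linux-only option. Fall back to 24 (the value in <linux/in.h>) when this
# Python build does not expose the constant.
IP_BIND_ADDRESS_NO_PORT = getattr(socket, "IP_BIND_ADDRESS_NO_PORT", 24)

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen()
target = listener.getsockname()

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s.setsockopt(socket.IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
    option_set = True
except OSError:
    option_set = False            # kernel or platform without support

s.bind(("127.0.0.1", 0))          # choose the source IP only
port_before = s.getsockname()[1]  # 0 if the port choice was deferred
s.connect(target)                 # the port is picked here, with the full quad known
src_ip, port_after = s.getsockname()
print(option_set, port_before, port_after)

s.close()
listener.close()
```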

Now, think about what happens if that final connect() call were to be to 3.3.3.3:1 instead of 4.4.4.4:1.

# ---

s = socket()
s.bind_address_no_port()

s.bind(1.1.1.1, 50_000)
c5 = s.connect(3.3.3.3, 1)


If you truly understand this idea of the quad, bind-tuple and connect-tuple, you should have concluded that the above would not be possible. The code fails on the connect() call when the full quad is checked and found to not be unique. Recalling previous examples, you might also have concluded that we can close one of the sockets and use SO_REUSEADDR to allow the example to succeed.

So, let's do that...

# ---

c1.close()

# ---

s = socket()
s.reuse_addr()
s.bind_address_no_port()

s.bind(1.1.1.1, 50_000)
c5 = s.connect(3.3.3.3, 1)


If you were to run this code, it would unfortunately ruin your day and throw an error on the bind() operation. It turns out, despite all the clear and logical rules we have built in our mental model, this doesn't work. The reason for this is why Cloudflare's blog has "quantum" in the title and why this one has "schrodinger". Socket binding is just fucked.

To understand this behaviour, to be able to predict this behaviour, we need to go deeper.

The Quad: In Review

Before we do that, however, let's review what we've learnt so far, and please, understand that this mental model we've built is more than enough to get you through a long career without issue. It may not be an accurate representation of how things truly work, but it's good enough to suffice. Nobody should be doing the stuff we're going to talk about later.

  1. The OS ensures that every connection has a unique quad made up of a "bind-tuple" (source IP and port) and a "connect-tuple" (destination IP and destination port).

  2. Connections where you initiate the shutdown enter the TIME_WAIT state and are considered to be in-use, and therefore considered when checking for unique quads. This behaviour can be ignored using the SO_REUSEADDR socket option which removes TIME_WAIT sockets from the list of in-use quads.

  3. Raw connect() calls let you establish as many connections as you have ephemeral ports per target. bind() before connect() only lets you establish as many connections as you have ephemeral ports because it only looks at the bind-tuple.

  4. You can use IP_BIND_ADDRESS_NO_PORT to skip the check of the bind-tuple at the point of the bind() call and instead have it happen during the connect() call. To use this option, you have to be happy with a randomly assigned ephemeral port; you can't choose which port to bind.

  5. Wildcard bind() before connect()...
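
Rules 1 to 4 are small enough to write down as a toy model. The class below is our own illustration of the mental model, not how the kernel works: ToyKernel, bind_then_connect() and fails() are invented names, and the model deliberately knows nothing about the case that breaks it. It replays the two-IP, two-port examples from earlier.

```python
class ToyKernel:
    """Toy model of the quad rules in the list above; not real kernel logic."""

    def __init__(self, ips, ports):
        self.ips, self.ports = list(ips), list(ports)
        self.live = set()       # quads of open connections
        self.time_wait = set()  # quads of connections we closed ourselves

    def _in_use(self, reuse_addr):
        # Rule 2: SO_REUSEADDR hides TIME_WAIT quads from the uniqueness check.
        return set(self.live) if reuse_addr else self.live | self.time_wait

    def connect(self, dst, src_ip=None, reuse_addr=False):
        """Raw connect(), or connect() after an IP-only bind (rule 4)."""
        ips = [src_ip] if src_ip else self.ips
        in_use = self._in_use(reuse_addr)
        for ip in ips:
            for port in self.ports:
                quad = (ip, port) + dst
                if quad not in in_use:  # rule 1: the full quad must be unique
                    self.live.add(quad)
                    self.time_wait.discard(quad)
                    return quad
        raise OSError("no unique quad available")

    def bind_then_connect(self, src, dst, reuse_addr=False):
        """Explicit bind(ip, port) followed by connect()."""
        # Rule 3: at bind() time the connect-tuple is a wildcard, so any
        # in-use quad with the same bind-tuple is a conflict.
        if any(quad[:2] == src for quad in self._in_use(reuse_addr)):
            raise OSError("address already in use")
        quad = src + dst
        self.live.add(quad)
        self.time_wait.discard(quad)
        return quad

    def close(self, quad):
        self.live.remove(quad)
        self.time_wait.add(quad)


def fails(call):
    """Return True if calling `call` raises OSError."""
    try:
        call()
    except OSError:
        return True
    return False


k = ToyKernel(ips=["1.1.1.1", "2.2.2.2"], ports=[50_000, 50_001])
A, B = ("3.3.3.3", 1), ("4.4.4.4", 1)

conns = [k.connect(A) for _ in range(4)]  # four raw connect()s succeed
fifth_same_target_fails = fails(lambda: k.connect(A))
fifth_bind_fails = fails(lambda: k.bind_then_connect(("1.1.1.1", 50_000), B))
ip_only = k.connect(B, src_ip="1.1.1.1")  # rule 4: IP pinned, port deferred

k.close(conns[0])
time_wait_blocks = fails(lambda: k.connect(A))
reused = k.connect(A, reuse_addr=True)
```

If the model and a real machine disagree, trust the machine.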

Feel free to stop reading here. If you choose to cross the line below, know that I am sorry.


Fast Reuse

To understand why combining IP_BIND_ADDRESS_NO_PORT and SO_REUSEADDR breaks the quad-based mental model, you need to understand that the representation of a socket in the kernel is different to the APIs used to create, interact with, and modify the socket.

The majority of this behaviour is mapped into a single field, fast_reuse, which determines when a socket can be reused and what checks need to be performed by the bind() and connect() calls.

The fast_reuse field can be one of three values: -1, 0 or 1. Each value provides a different level of freedom in how you can use the socket.

todo!()

From this, you can see that the most desirable state to be in is state 1 as it allows you the most freedom with what you bind() and where you connect(). Cloudflare provide a "helpful" graph in their blog which visualises how different actions manipulate this value.

todo!()

Unfortunately, this graph doesn't display the SO_REUSEADDR and IP_BIND_ADDRESS_NO_PORT options. If we try to reason about the failing example above, even with this new understanding, we still cannot understand why it would fail. We need to understand how those options affect fast_reuse so I have created an even more cursed graph. Enjoy.

todo!()

...

Buckets and Kernel Code

This understanding of fast_reuse explains the behaviour we saw that broke our previous mental model, but it is so nebulous and hand-wavey that it doesn't help us to build a new mental model. We don't know why these behaviours change the value in the way that they do.

To do that, we need to dig deeper and understand the structures the kernel uses to store and retrieve sockets. To build a new mental model that accurately predicts all behaviour we need to understand these structures intimately...

...

Hold Your Horses

TODO: We put all of this understanding into Ancelotti - our in-house proxy server - and yet, things still didn't work. Why? Because we were trying to reuse sockets faster than the Linux kernel could release them. The solution, horribly, was to retry the binding three times, waiting 10ms each time.
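
A retry loop of that shape might look like the sketch below. This is not Ancelotti's code: bind_with_retry() is a hypothetical helper, and retrying only on EADDRINUSE is our assumption about which error is worth waiting out.

```python
import errno
import socket
import time

def bind_with_retry(sock, addr, attempts=3, delay=0.010):
    """Try sock.bind(addr) up to `attempts` times, sleeping `delay` seconds
    between tries. Returns the attempt number that succeeded."""
    for attempt in range(1, attempts + 1):
        try:
            sock.bind(addr)
            return attempt
        except OSError as exc:
            # Only "address in use" is treated as transient (an assumption).
            if exc.errno != errno.EADDRINUSE or attempt == attempts:
                raise
            time.sleep(delay)

# Happy path: a free address binds on the first attempt.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tries = bind_with_retry(s, ("127.0.0.1", 0))
print("bound on attempt", tries)
s.close()
```

The three attempts and 10ms delay match the numbers above; both are parameters so they can be tuned.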