iptables TCPMSS target limitation

Introduction

I was recently making some tests regarding the behavior of a given transparent TCP proxy where IPv6 is in use. Specifically, I wanted to test the scenario in which some network link in the destination path after the proxy contained an unusually small MTU.

State of the art

First of all, let's remember that IPv6 doesn't support fragmentation inside middle-boxes, but it is explicitly stated that fragmentation should be done by the endpoints. Usually that means that endpoints should have some kind of PMTU Discovery algorithm to try to guess the best MSS which can fill the MTU as much as possible without exceeding its maximum size.

In case of TCP, an MSS option is available (see rfc6691) to tell your peer at connection establishment time (SYN, SYN/ACK) which is the maximum MSS you expect to receive according to your knowledge of the network surrounding you. This way, your peer knows how much data can he store in every packet in order to be able to avoid it being dropped by some router in the way. Then, ideally, smart middle-boxes routing this kind of packets should modify its TCP MSS option value, lowering it in case the next hop is known to contain a smaller MTU. Following this procedure, when the packet reaches the other endpoint, the MSS value is correctly set to match the value needed for the lowest MTU on the path.

Sounds good, right? This kinda works… until a dumb middle-box appears in the path! Let's imagine a router with an output interface whose link is the minimal one supported by IPv6, that is 1280 bytes. And the SYN packet passes with a nice MSS=1440 option set in it, which of course our dumb friend is not going to update. Then, the other peer is going to receive this erroneous big value and it is going to use it, sending packets which are going to exceed the 1280 bytes, and they will be dropped by the router as there is no fragmentation in IPv6. Conclusion: TCP MSS option is easy and efficient, but doesn't work in all scenarios.

Fortunately for us, IPv6 provides its own mechanisms to notify the sender endpoint that packets too big are being sent. Quickly explained, when a middle-box drops a packet too big to be forwarded over the MTU of the router link, it also sends an ICMPv6 "Packet too big" message to the sender (see rfc4443), providing information on the size needed. This way, the TCP stack of the sender can learn about the new MSS size required and re-send the information using smaller packets. As you can see, this procedure would take a long time if it had to be done for each new hop, that's why TCP MSS option is usually there to help us, avoiding lots of possible extra round-trip times until first data packet reaches the receiver.

Adding the proxy

Let's imagine we are using a simple proxy which creates 2 sockets for each connection created by a client. The first socket is used to read what the client sends, and then a second socket is created to communicate server-side. Then, the proxy basically copies data from one socket to the other, reading/modifying payload to achieve some kind of objective (we are using a proxy for some reason, right?). It can be seen that the proxy actually cuts the connection in two, buffering both end. That is, whatever data is sent from the client is ACKed by the proxy, and at some point later on, it is sent to the server.

Now, let's imagine the problematic network we talked about before: a network path to the server with a small MTU and an incorrectly updated MSS TCP option size, and the client sending some data to the server. As we discussed, we will receive an ICMPv6 "Packet too big" packet from some middle-box, so… where's the problem? well, there's a problem. We cannot actually forward back the ICMPv6 packet, because we are "transparent", and well… we already ACKed that piece of data when we read it from our client-side socket before sending it over the server-side socket, which means the client probably doesn't have the data saved anymore and it won't be able to re-send it (after all, from client point of view, it supposedly reached the destination, didn't it?).

OK, we cannot forward back the issue to the client and we have a packet too big to be routed forward…. the best solution so far is to stop forwarding the icmpv6 "Packet too big" packet and redirect it to our server-side socket, to let the TCP stack lower the MSS to fit into the offending MTU size. That means, of course, that we are going to be the ones issuing the fragmentation on behalf of the client. This is far from optimal, but fortunately it will only be needed for some specific network paths for which MSS option is not calculated correctly.

Testing

As I was explaining a long time ago, at the start of the post, I wanted to see if the proxy is behaving correctly in this kind of scenarios.

I have a setup with different hosts set-up in line with of course the Proxy in place. Then, I needed to have somehow this faulty network environment to test it (aka MSS received by the client being bigger than the real MTU being used).

TCPMSS iptables target

This was my first idea, which actually didn't work due to a limitation in TCPMSS iptables target (the original reason to write this post). The setup consists mainly on two parts:

Set the MTU in the server to 1280 bytes, the minimum one supported by IPv6: ip link set dev eth0 mtu 1280
Modify the outgoing SYN/ACK packet which is initially set to some valid value under 1280 bytes to a bigger invalid value that will force packet drop and ICMPv6 message generation, let's say 1440: ip6tables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,ACK SYN,ACK -o eth0 -j TCPMSS --set-mss 1440

Result: Using wireshark, I see the MSS option value in the SYN/ACK packet is left untouched, with its original value. On the other hand, ip6tables -L -v -n -t mangle shows one packet hit the rule… what's going on in here?

From my experience, the best to do in this cases is reading the netfilter kernel module implementing the iptables target. It is usually quite fast, just check which kernel config is needed to enable the feature, then see which file it adds to the build and go see that specific file.

Quick grep in the kernel shows the required information on the location (net/netfilter/xt_TCPMSS.c):

$ grep -r TCPMSS
...
net/netfilter/Makefile:obj-$(CONFIG_NETFILTER_XT_TARGET_TCPMSS) += xt_TCPMSS.o
...
`</pre>

Then, in that file, you can find the following snippet of code:

<pre>`/* Never increase MSS, even when setting it, as
 * doing so results in problems for hosts that rely
 * on MSS being set correctly.
 */
if (oldmss >= newmss)
    return 0;

So, basically checks in the code are actually preventing me to do exactly what I want to do. I cannot change the MSS to a bigger value than the original one. This is done to prevent probable malfunctioning or at least sub-optimal use of the network, but that's exactly the faulty scenario I want to test the proxy against!

Easier solution

This one works, and is simpler than the first one, and on top of that, no TCPMSS iptables rule is needed:

Just set the MTU=1280 in interface of middle-box connected to the server, instead of lowering the MTU in the server itself. This way, the server will still think the MTU is 1500 (ethernet default) and the MSS calculated on the SYN/ACK will be based on that instead of 1280. On top of that, the middle-box is not going to change the MSS option by default unless you configure it to do so with an iptables rule (you would probably use TCPMSS module for that!). Then, when a data packet reaches this middle-box and it tries to forward it to the server using the lowered MTU interface, it will drop the packet and it will send the ICMPv6 message back to the client (and will be actually catched by the proxy).

Conclusion

TCPMSS iptables module seems to currently have an undocumented limitation, that is, you cannot increase the existing original MSS value, only decreasing is allowed.

To add into my TODO list: It may be interesting to add an --allow-increase option to iptables TCPMSS module, which should be disabled by default. However, this would still be a useful-to-have feature for testing (or any scenario which requires it, for instance handling packets from an incredibly broken middle-box or TCP stack).