Analyzing infinite loops without a debugger

I recently ran into an issue regarding an embedded system already deployed in the field at several customers.

Those systems were running a daemon of ours which, at some point after a few hours or days, was reported to have stopped doing the task it was expected to do.

A more fine-grained analysis showed that the daemon process was still running, but it was taking 100% of the CPU, which was not a good sign, even less so in scenarios in which that daemon should have been doing almost no work at all.

The system reported that all the CPU load from the process was being spent in userspace, which looked even stranger, because that daemon was expected to offload most of its work to the kernel. That quickly pointed to some kind of tight infinite loop in userspace issuing no system calls, probably due to some data structure memory corruption which we were unable to trigger easily during our test phase. A quick check with strace, showing no syscalls being issued by the thread taking 100% of the CPU, confirmed the suspicion.
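For reference, identifying the busy thread and checking its syscall activity can be done with standard tools when they happen to be available on the target; a rough sketch (our_daemon is just a placeholder name):

top -H -p "$(pidof our_daemon)"   # -H lists individual threads; note the busy TID
strace -p $TID                    # attach to that thread; no output at all means no syscalls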

We knew what kind of issue we had in front of us, but how to find out where it was occurring? If we had known how to trigger it, we could have reproduced it in our test environment, where we have plenty of tools to analyze what's going on: modifying the binary to print more information, using a debugger, etc. However, so far we had only been able to see it in production environments, and that meant we needed to find a way to get at least some minimal information to know where to start looking. The problem with the production environment, which is used by customers, is that it contains no analysis or debugging tools, which makes it quite difficult to find out where the program is actually stuck.

Luckily for us, the Linux kernel provides some information about running processes which can be really useful in this scenario. I am talking about the /proc/$PID/task/$TID/stat file here. This file contains information related to a specific thread $TID which is part of process $PID. You can find more details on the kind of information this file provides by looking at the function fs/proc/array.c:proc_tid_stat in the kernel. More specifically, we are interested in the 30th value appearing in that file (the kstkeip field in proc(5) terms), which gives us the EIP value of the thread specified in the file's path. The EIP holds the address of the instruction currently being executed by the thread, which means that if we print that value from time to time we should be able to follow which part of our source code the process is executing, provided of course that we are able to translate those in-memory process addresses into source code file & line references.
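As a minimal sketch (assuming the thread name contains no spaces, so that kstkeip really is the 30th whitespace-separated field), the current value can be printed in hex like this:

# $PID and $TID are the process and offending thread ids. The stat file
# reports the value in decimal; printf converts it to hex for addr2line.
printf '0x%x\n' "$(awk '{print $30}' /proc/$PID/task/$TID/stat)"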

And of course we can do that too; the best part of this approach is that we don't need to install specific tools on the production system. We can do all the conversion from our own PC, taking just a little information from the running system for later post-processing. We basically need two things from the running production system:

  • List of EIP values from the offending thread, obtained as already explained above.
  • Content of the file /proc/$PID/maps from the process owning the offending thread. We will explain later why this content is sometimes needed. (A small collection sketch follows this list.)
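A rough collection sketch to be run on the production system follows; the file names, sample count and interval are arbitrary, and a busybox sleep may only accept whole seconds:

#!/bin/sh
# Collect material for offline analysis: a copy of the maps file plus a
# list of EIP samples from the spinning thread.
PID=$1   # process id of the daemon
TID=$2   # id of the thread eating the CPU

cp /proc/$PID/maps maps.txt

for i in $(seq 1 1000); do
    awk '{printf "0x%x\n", $30}' /proc/$PID/task/$TID/stat
    sleep 0.1
done > eip-samples.txt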

On our development machine, we will mainly need:

  • addr2line tool (part of binutils)
  • the binary of the offending process compiled with debug symbols (the -g flag).

Now, what's the process to convert a recorded in-memory address into file:line information we can use to look at the corresponding code? In general, it's really easy; just call addr2line like this:

addr2line -a -p -C -f -i -e "$EXEFILE" "$ADDRESS"

For instance, let's say our offending process comes from a binary called infinite and we found it running at some point at address 400597:

$ addr2line -a -p -C -f -i -e infinite 400597
0x0000000000400597: infinite_function at infinite.c:19 (discriminator 1)

Cool! We now know the scope of the infinite loop includes infinite.c line 19. We can pass a few more addresses to get a better idea of which code is being executed the most.
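If the samples were collected into a file as in the earlier sketch (eip-samples.txt being the hypothetical name used there), ranking and resolving them in one go could look like this:

# Rank the most frequently sampled addresses...
sort eip-samples.txt | uniq -c | sort -rn | head

# ...and resolve the distinct ones in a single call; addr2line accepts
# several addresses on the command line.
addr2line -a -p -C -f -i -e infinite $(sort -u eip-samples.txt)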

And now you may be wondering… why the hell do I need that /proc/$PID/maps file?! Well… because unfortunately it is not as easy as shown here in all scenarios. This simple case only covers the situation in which the process is running code from the main binary. However, if the process is running code inside some shared library at the time we record the EIP value, a bit more work must be done.

First of all, we need to find out which module that in-memory code address belongs to. That's the kind of information the maps file provides. Let's have a look at an example with the infinite binary running:

$ cat /proc/$(pidof infinite)/maps
00400000-00401000 r-xp 00000000 08:03 3546387                            /infinite
00600000-00601000 r--p 00000000 08:03 3546387                            /infinite
00601000-00602000 rw-p 00001000 08:03 3546387                            /infinite
020ff000-02120000 rw-p 00000000 00:00 0                                  [heap]
7f334b793000-7f334b928000 r-xp 00000000 08:03 658652                     /usr/lib/libc-2.24.so
7f334b928000-7f334bb27000 ---p 00195000 08:03 658652                     /usr/lib/libc-2.24.so
7f334bb27000-7f334bb2b000 r--p 00194000 08:03 658652                     /usr/lib/libc-2.24.so
7f334bb2b000-7f334bb2d000 rw-p 00198000 08:03 658652                     /usr/lib/libc-2.24.so
7f334bb2d000-7f334bb31000 rw-p 00000000 00:00 0
7f334bb31000-7f334bb54000 r-xp 00000000 08:03 658651                     /usr/lib/ld-2.24.so
7f334bd1b000-7f334bd1d000 rw-p 00000000 00:00 0
7f334bd53000-7f334bd54000 r--p 00022000 08:03 658651                     /usr/lib/ld-2.24.so
7f334bd54000-7f334bd55000 rw-p 00023000 08:03 658651                     /usr/lib/ld-2.24.so
7f334bd55000-7f334bd56000 rw-p 00000000 00:00 0
7ffc57345000-7ffc57366000 rw-p 00000000 00:00 0                          [stack]
7ffc573c5000-7ffc573c7000 r--p 00000000 00:00 0                          [vvar]
7ffc573c7000-7ffc573c9000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Let's now imagine we record the address 7f334b84afe0. The maps file shows that the code section of libc-2.24.so (the code section can be identified because it has the x executable flag) is mapped into the process address space from 7f334b793000 up to 7f334b928000. As our recorded address falls between those two values, it is inside the code section of libc.
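This range check is easy to automate against the saved copy of the maps file; here is a minimal sketch, assuming bash (for the base-16 arithmetic), GNU awk (for strtonum) and maps.txt as the hypothetical saved copy:

ADDR=7f334b84afe0   # recorded EIP, hex without the 0x prefix
awk -v addr="$((16#$ADDR))" '
{
    # each maps line starts with "start-end"; print the mapping that
    # contains the address together with the offset inside that module
    split($1, range, "-");
    start = strtonum("0x" range[1]);
    end = strtonum("0x" range[2]);
    if (addr >= start && addr < end)
        printf "%s %s offset=0x%x\n", $1, $NF, addr - start;
}' maps.txt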

Once we know the code comes from libc and not from our main binary, we know we need to inspect that file instead of the main binary when using addr2line or other tools like objdump or nm.

When using those tools, we need to remember to use the offset from the start of that module rather than the memory address we recorded on the production system. For instance, using the last example, instead of 0x7f334b84afe0 we want to look at offset 0xB7FE0 inside that module (0x7f334b84afe0 - 0x7f334b793000). Let's see the difference in this scenario:

$ addr2line -a -p -C -f -i -e ./infinite 7f334b84afe0
0x00007f334b84afe0: ?? ??:0

$ addr2line -a -p -C -f -i -e /usr/lib/libc-2.24.so 7f334b84afe0
0x00007f334b84afe0: ?? ??:0

$  addr2line -a -p -C -f -i -e /usr/lib/libc-2.24.so B7FE0
0x00000000000b7fe0: __nanosleep_nocancel at :?

We can see that the last one, using the offset from the start of the libc code section, works correctly, as we were actually calling usleep() in the code. There is no source information (:? is displayed instead of filename & line) because the libc I used for the experiment does not contain debug information (hence we see here too that having a binary compiled with -g on your PC is important to get all the information).

You can find a testbed environment to play with what is discussed here in the following repository I built up: https://github.com/pespin/trace-eip

iptables TCPMSS target limitation

Introduction

I was recently running some tests on the behavior of a given transparent TCP proxy when IPv6 is in use. Specifically, I wanted to test the scenario in which some network link on the destination path after the proxy has an unusually small MTU.

State of the art

First of all, let's remember that IPv6 doesn't support fragmentation in middle-boxes; it is explicitly stated that fragmentation must be done by the endpoints. Usually that means that endpoints need some kind of Path MTU Discovery algorithm to find the largest MSS that fills the MTU as much as possible without exceeding it.

In the case of TCP, an MSS option is available (see RFC 6691) to tell your peer at connection establishment time (SYN, SYN/ACK) the maximum MSS you expect to receive according to your knowledge of the surrounding network. This way, your peer knows how much data it can put in each packet to avoid it being dropped by some router along the way. Ideally, smart middle-boxes routing these packets should modify the TCP MSS option value, lowering it when the next hop is known to have a smaller MTU. Following this procedure, when the packet reaches the other endpoint, the MSS value correctly matches the value needed for the lowest MTU on the path.
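On a Linux router, that MSS rewriting is typically done with the very same TCPMSS iptables target this post is about; a common clamping rule looks roughly like this (shown only as an illustration, it is not part of my setup):

ip6tables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu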

Sounds good, right? This kinda works… until a dumb middle-box appears in the path! Let's imagine a router whose output interface has the minimum MTU supported by IPv6, that is 1280 bytes, and a SYN packet passes through with a nice MSS=1440 option set in it, which of course our dumb friend is not going to update. The other peer receives this erroneously large value and uses it, sending packets which exceed 1280 bytes and which will be dropped by the router, since there is no fragmentation in IPv6. Conclusion: the TCP MSS option is easy and efficient, but doesn't work in all scenarios.

Fortunately for us, IPv6 provides its own mechanism to notify the sending endpoint that its packets are too big. Quickly explained, when a middle-box drops a packet too big to be forwarded over the MTU of its link, it also sends an ICMPv6 "Packet too big" message back to the sender (see RFC 4443), indicating the size it can handle. This way, the sender's TCP stack can learn the new MSS required and re-send the data using smaller packets. As you can see, this procedure can take a long time if it has to be repeated for several hops; that's why the TCP MSS option is usually there to help us, avoiding lots of possible extra round trips before the first data packet reaches the receiver.
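On Linux, the value learned this way ends up in the sender's per-destination PMTU cache, which can be inspected; a quick check could look like this (2001:db8::1 is just a placeholder address):

# After a "Packet too big" has been received for this destination, the
# cached route exception should show the lowered mtu (e.g. "mtu 1280").
ip -6 route get 2001:db8::1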

Adding the proxy

Let's imagine we are using a simple proxy which creates two sockets for each connection created by a client. The first socket is used to read what the client sends, and a second socket is then created to communicate with the server side. The proxy basically copies data from one socket to the other, reading/modifying the payload to achieve some kind of objective (we are using a proxy for some reason, right?). In effect, the proxy cuts the connection in two, buffering both ends. That is, whatever data is sent by the client is ACKed by the proxy, and at some point later on it is sent to the server.

Now, let's imagine the problematic network we talked about before: a network path to the server with a small MTU and an incorrectly updated TCP MSS option, and the client sending some data to the server. As we discussed, we will receive an ICMPv6 "Packet too big" packet from some middle-box, so… where's the problem? Well, there is a problem: we cannot simply forward the ICMPv6 packet back to the client, because we are "transparent", and well… we already ACKed that piece of data when we read it from our client-side socket, before sending it over the server-side socket. That means the client probably doesn't have the data saved anymore and won't be able to re-send it (after all, from the client's point of view it supposedly reached the destination, didn't it?).

OK, so we cannot push the issue back to the client and we have a packet too big to be routed forward… The best solution so far is to stop forwarding the ICMPv6 "Packet too big" packet and redirect it to our server-side socket, letting the TCP stack lower the MSS to fit the offending MTU. That means, of course, that we are going to be the ones splitting the data into smaller packets on behalf of the client. This is far from optimal, but fortunately it is only needed for the specific network paths on which the MSS option is not calculated correctly.

Testing

As I explained at the start of the post, I wanted to see whether the proxy behaves correctly in this kind of scenario.

I have a setup with several hosts connected in line, with of course the proxy in place. Then, I needed to somehow recreate this faulty network environment to test it (i.e. the MSS received by the client being bigger than the real MTU in use).

TCPMSS iptables target

This was my first idea, which actually didn't work due to a limitation in the TCPMSS iptables target (the original reason for writing this post). The setup consists mainly of two parts:

  • Set the MTU in the server to 1280 bytes, the minimum one supported by IPv6: ip link set dev eth0 mtu 1280
  • Modify the MSS of the outgoing SYN/ACK packet, which is initially set to some valid value under 1280 bytes, to a bigger, invalid value that will force packet drops and ICMPv6 message generation, let's say 1440: ip6tables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,ACK SYN,ACK -o eth0 -j TCPMSS --set-mss 1440

Result: using Wireshark, I see the MSS option value in the SYN/ACK packet is left untouched, with its original value. On the other hand, ip6tables -L -v -n -t mangle shows one packet hit the rule… what's going on here?

In my experience, the best thing to do in these cases is to read the netfilter kernel module implementing the iptables target. It is usually quite fast: just check which kernel config option is needed to enable the feature, then see which file it adds to the build and read that specific file.

A quick grep in the kernel tree shows where to look (net/netfilter/xt_TCPMSS.c):

$ grep -r TCPMSS
...
net/netfilter/Makefile:obj-$(CONFIG_NETFILTER_XT_TARGET_TCPMSS) += xt_TCPMSS.o
...

Then, in that file, you can find the following snippet of code:

/* Never increase MSS, even when setting it, as
 * doing so results in problems for hosts that rely
 * on MSS being set correctly.
 */
if (oldmss >= newmss)
    return 0;

So, the checks in the code are preventing me from doing exactly what I want to do: I cannot change the MSS to a value bigger than the original one. This is done to prevent probable malfunctioning, or at least sub-optimal use of the network, but that's exactly the faulty scenario I want to test the proxy against!

Easier solution

This one works, is simpler than the first one, and on top of that needs no TCPMSS iptables rule:

  • Just set MTU=1280 on the interface of the middle-box connected to the server, instead of lowering the MTU on the server itself (see the command sketch below). This way, the server will still think the MTU is 1500 (the Ethernet default) and the MSS calculated for the SYN/ACK will be based on that instead of 1280. On top of that, the middle-box is not going to change the MSS option by default unless you configure it to do so with an iptables rule (you would probably use the TCPMSS module for that!). Then, when a data packet reaches this middle-box and it tries to forward it to the server over the lowered-MTU interface, it will drop the packet and send the ICMPv6 message back to the client (where it will actually be caught by the proxy).
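The equivalent command on the middle-box would be something like the following, where eth1 is just an assumed name for its server-facing interface:

# Lower the MTU only on the server-facing link of the middle-box; the
# server itself keeps the 1500-byte default and advertises an MSS based on it.
ip link set dev eth1 mtu 1280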

Conclusion

The TCPMSS iptables module currently seems to have an undocumented limitation: you cannot increase the existing original MSS value, only decreasing is allowed.

To add to my TODO list: it may be interesting to add an --allow-increase option to the iptables TCPMSS module, disabled by default. This would be a useful feature for testing (or any scenario which requires it, for instance handling packets from an incredibly broken middle-box or TCP stack).