Using strace to Debug Stuck Celery Tasks

Celery is a great tool for background task processing in Django. We use it in a lot of the custom web apps we build at Caktus, and it's quickly becoming the standard for all variety of task scheduling work loads, from simple to highly complex.

Although rarely, sometimes a Celery worker may stop processing tasks and appear completely hung. In other words, issuing a restart command (through Supervisor) or kill (from the command line) doesn't immediately restart or shutdown the process. This can particularly be an issue in cases where you have a queue setup with only one worker (e.g., to avoid processing any of the tasks in this queue simultaneously), because then none of the new tasks in the queue will get processed. In these cases you may find yourself resorting to manually calling kill -9 <pid> on the process to get the queue started back up again.

We recently ran into this issue at Caktus, and in our case the stuck worker process wasn't processing any new tasks and wasn't showing any CPU activity in top. That seemed a bit odd, so I thought I'd make an attempt to discover what that process was actually doing at the time that it became non-responsive. Enter strace.

strace is a powerful command-line tool for inspecting running processes to determine what "system calls" they're making. System calls are low level calls to the operating system kernel that might involve accessing hard disks, the network, creating new processes, or other such operations. First, let's find the PID of the celery process we're interested in:

ps auxww|grep celery

You'll find the PID in the second column. For the purposes of this post let's assume that's 1234. You can inspect the full command in the process list to make sure you've identified the right celery worker.

Next, run strace on that PID as follows:

sudo strace -p 1234 -s 100000

The -p flag specifies the PID, and the -s flag specifies the size of the output. By default it's limited to 32 characters, which we found isn't very helpful if the system call being made includes a long string as an argument. You might see something like this:

Process 1234 attached - interrupt to quit
write(5, "ion id='89273' responses='12'>\n     ...", 37628

In our case, the task was writing what looked like some XML to file descriptor "5". The XML was much longer and at the end included what looked like a few attributes of a pickled Python object, but I've shortened it here for clarity's sake. You can see what "5" corresponds to by looking at the output of lsof:

sudo lsof|grep 1234

The file descriptor shows up in the "FD" column; in our version of strace, that happens to be the 4th column from the left. You'll see a bunch of files that you don't care about, and then down near the bottom, the list of open file descriptors:

python    1234   myuser    0r     FIFO                0,8      0t0    6593806 pipe
python    1234   myuser    1w     FIFO                0,8      0t0    6593807 pipe
python    1234   myuser    2w     FIFO                0,8      0t0    6593807 pipe
python    1234   myuser    3u     0000                0,9        0       4738 anon_inode
python    1234   myuser    4r     FIFO                0,8      0t0    6593847 pipe
python    1234   myuser    5w     FIFO                0,8      0t0    6593847 pipe
python    1234   myuser    6r      CHR                1,9      0t0       4768 /dev/urandom
python    1234   myuser    7r     FIFO                0,8      0t0    6593850 pipe
python    1234   myuser    8w     FIFO                0,8      0t0    6593850 pipe
python    1234   myuser    9r     FIFO                0,8      0t0    6593916 pipe
python    1234   myuser   10u     IPv4            6593855      0t0        TCP ip-10-142-126-212.ec2.internal:33589->ip-10-112-43-181.ec2.internal:amqp (ESTABLISHED)

You can see "5" corresponds to a pipe, which at least in theory ends up with a TCP connection to the amqp port on another EC2 server (host names are fictional).

RabbitMQ was operating properly and not reporting any errors, so our attention turned to the Celery task in question. Upon further examination, an object we were passing to the task included a long XML string as an attribute, which was being pickled and passed to RabbitMQ. Issues have been reported with long argument sizes in Celery before, and while it appears they should be supported, an easy workaround for us (and Celery's recommended approach) was to pass an ID for this object rather than the object itself, greatly reducing the size of the task's arguments and avoiding the risk of overwriting any object attributes.

While there may have been other ways to fix the underlying issue, strace and lsof were crucial in helping us figure out the problem. One might be able to accomplish the same thing with a lot of logging, but if your code is stuck in a system call and doesn't appear to be showing any noticeable CPU usage in top, strace can take you immediately to the root of the problem.

New Call-to-action
blog comments powered by Disqus



You're already subscribed