How a print statement hangs the server

Load testing is one of the most exciting parts of programming: it takes little time to write, it breaks things without anyone taking the blame, and it reveals bugs that only occur under high traffic.

I was load testing a machine learning model exposed as a RESTful API built with Flask. The Docker image is started manually with a docker run command, then a load test is run against it. Everything was working fine until the server stopped responding after processing 3 requests...
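
The load test itself was nothing elaborate. Below is a minimal sketch of the kind of script involved; the /predict endpoint, port and payload shape are stand-ins I made up, not the real API's interface:

    # Minimal load-test sketch: fire concurrent POST requests at the API.
    # The endpoint, port and payload are hypothetical placeholders.
    import concurrent.futures

    import requests

    URL = 'http://localhost:5000/predict'        # assumed endpoint
    PAYLOAD = {'text': 'some input sentence'}    # assumed payload shape

    def one_request(i):
        resp = requests.post(URL, json=PAYLOAD, timeout=30)
        return i, resp.status_code

    with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
        for i, status in pool.map(one_request, range(200)):
            print(i, status)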

Deep dive

I immediately checked the CPU usage with docker stats:

CONTAINER ID   NAME            CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O         PIDS
709b1f63ed67   api             3.23%     10.71GiB / 15.58GiB   68.78%    15.7MB / 924kB   2.54GB / 5.68MB   79

The container is idle?

I suspected some kind of deadlock inside the container, so I ran docker exec to get a /bin/bash shell in it, followed by lslocks. It returned nothing!

Not admitting defeat, I looked for the Python PIDs:

root@709b1f63ed67:/app# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   5720     0 pts/0    Ss+  May19   0:00 bash api.sh
root           6  0.0 24.9 6164624 4079964 pts/0 Sl+  May19   0:55 python /app/api.py --model_path /
root          16  0.0  0.0  14144  5736 pts/0    S+   May19   0:00 /usr/local/bin/python -c from multiprocessing.sem
root          24  4.4 42.8 12129852 7006196 pts/0 Sl+ May19 323:04 /usr/local/bin/python /app/api.py
root          34  0.0  0.0  14144  9204 pts/0    S+   May19   0:00 /usr/local/bin/python -c from multiprocessing.sem
root        5159  0.2  0.0   5984  3600 pts/1    Ss   08:46   0:00 /bin/bash
root        5166  0.0  0.0   8584  3192 pts/1    R+   08:47   0:00 ps aux

PIDs 6, 16, 24 and 34 are the Python processes (16 and 34 look like multiprocessing helper processes); let's look into them.

Figuring out the target process

Running lsof -p <PID> on each of those PIDs produces a lot of output, but one set of lines caught my attention:

root@709b1f63ed67:/app # lsof -p 24 | grep '/api'
python   24 root    9u   REG       8,4       8200   1714599 /api/output/1653362182_810.txt
python   24 root   12u   REG       8,4          0   1714579 /api/output/output.test
python   24 root   20u   REG       8,4          0   1714579 /api/output/output.test
python   24 root   23u   REG       8,4          0   1714579 /api/output/output.test
python   24 root   49u   REG       8,4          0   1714579 /api/output/output.test
python   24 root   51u   REG       8,4          0   1714579 /api/output/output.test
python   24 root   53u   REG       8,4          0   1714579 /api/output/output.test
python   24 root   55u   REG       8,4          0   1714579 /api/output/output.test
python   24 root   57u   REG       8,4          0   1714579 /api/output/output.test
python   24 root   59u   REG       8,4          0   1714607 /api/output/output.test2
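
(Side note: if lsof is not available inside a slim image, the same information can be read straight from /proc with a few lines of Python; a minimal sketch, with the PID and path filter adjusted to this case:)

    import os

    PID = 24  # the suspicious worker from the ps output
    fd_dir = f'/proc/{PID}/fd'

    # Each entry in /proc/<pid>/fd is a symlink to the open file.
    for fd in sorted(os.listdir(fd_dir), key=int):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # the fd may have been closed in the meantime
        if '/api/output' in target:
            print(fd, '->', target)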

PID 24 has opened several output files and is still holding on to them. Looking at the Python script, I suspected this piece of code was hanging:

    with open(output_path, 'w+') as f:
        for sent in sents:
            f.write(cleanup(sent) + '\n')

This looks perfectly fine, so the problem must lie in the cleanup() function:

    def cleanup(text):
        print(f'original text: {text}')
        # ...
        # string manipulations
        #
        print(f'processed text: {text}')
        return text

This looks fine too... At this point I was stuck. Why would it hang inside the cleanup function?

Pyrasite to the rescue

I wanted to insert some debug statements into the code. However, this bug does not occur often, and I was afraid that if the process got restarted we might never get another chance to squash it.

After some googling I found this Stack Overflow answer: Pyrasite can attach an interactive Python shell to a live Python process. It seemed a perfect fit for our case.

root@709b1f63ed67:/app/nn_model/gector# pyrasite-shell 24
Pyrasite Shell 2.0
Connected to '/usr/local/bin/python /app/api.py '
Python 3.7.13 (default, Apr 20 2022, 19:06:16)
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
(DistantInteractiveConsole)

>>>

Great, let's list the running frames:

>>> import sys
>>> sys._current_frames()
{140193134786304: <frame at 0x55c1dbe3cee0, file '<console>', line 1, code <module>>, 140193143179008: <frame at 0x7f819bd33650, file '/usr/local/lib/python3.7/traceback.py', line 105, code print_exception>, 140193151571712: <frame at 0x7f8192501050, file '/app/pre_prediction.py', line 6, code cleanup>, 140193159964416: <frame at 0x7f81924fa810, file '/app/pre_prediction.py', line 6, code cleanup> ...

So it is stuck in the cleanup function. Let's get more details:

>>> import traceback
>>> traceback.print_stack(list(sys._current_frames().values())[2])
  ...
  File "/app/pre_prediction.py", line 42, in preprocessing
    f.write(cleanup(sent) + '\n')
  File "/app/pre_prediction.py", line 6, in cleanup
    print(f'original text: {text}')

It is stuck at the print function!?
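
Incidentally, instead of fishing individual frames out of the dict by index, you can dump the stack of every thread in one go from the same Pyrasite shell; a small sketch using only the standard library:

    import sys
    import traceback

    # Print a stack trace for every thread currently running in the process.
    for thread_id, frame in sys._current_frames().items():
        print(f'--- thread {thread_id} ---')
        traceback.print_stack(frame)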

Docker bug encountered

After some research it turned out to be a Docker bug: "Logging long lines breaks container".

The problem is the -t option, which allocates a pseudo-TTY for the container. A bug in its implementation makes it stop accepting stdout after a long enough line is received, so the next write to stdout blocks forever.
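
The straightforward fix is to run the container without -t. As an extra safeguard on the application side, you can also cap the length of anything printed to stdout; a minimal sketch (safe_print and the 1000-character limit are my own illustration, not part of the original code):

    MAX_LINE_LEN = 1000  # arbitrary cap, chosen for illustration

    def safe_print(text):
        # Truncate very long lines so a single print cannot wedge the TTY.
        if len(text) > MAX_LINE_LEN:
            text = text[:MAX_LINE_LEN] + ' ...[truncated]'
        print(text)

    def cleanup(text):
        safe_print(f'original text: {text}')
        # ...
        # string manipulations
        #
        safe_print(f'processed text: {text}')
        return text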

Summary

Debugging this issue involved a lot of sweat and a lot of fun. I suspected the blocking had a completely different cause, but the real world is always full of surprises. Once again, load testing is your friend :)