Load testing is the most exciting part of programming: it takes little time to write, it breaks things without anyone being blamed, and it reveals bugs that only occur under high traffic.
I was load testing a machine learning model exposed as a RESTful API in Flask. The container is started manually with a docker command, then a load test is run against it. Everything worked fine until the API stopped responding after processing three requests...
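A load-test harness does not need much. Here is a minimal sketch in Python (the function name and parameters are mine, not the original harness): it fires concurrent GET requests and counts the successes, so a hang like the one described below shows up as requests that time out or never return.

```python
# Minimal load-test sketch (hypothetical harness; the real test targeted
# a Flask API running in a Docker container).
import concurrent.futures
import urllib.request

def load_test(url, n_requests=50, n_workers=8, timeout=5):
    """Fire n_requests concurrent GETs at url; return the number of 200s."""
    def hit(_):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except Exception:
            return False  # timeouts and connection errors count as failures
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(hit, range(n_requests)))
```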
Deep dive
I immediately checked the CPU usage:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
709b1f63ed67 api 3.23% 10.71GiB / 15.58GiB 68.78% 15.7MB / 924kB 2.54GB / 5.68MB 79
The container is idle?
I suspected a deadlock inside the container, so I ran docker exec -it api /bin/bash to get a shell, followed by lslocks. It returned nothing!
Not admitting defeat, I searched for the Python PIDs:
root@709b1f63ed67:/app# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 5720 0 pts/0 Ss+ May19 0:00 bash api.sh
root 6 0.0 24.9 6164624 4079964 pts/0 Sl+ May19 0:55 python /app/api.py --model_path /
root 16 0.0 0.0 14144 5736 pts/0 S+ May19 0:00 /usr/local/bin/python -c from multiprocessing.sem
root 24 4.4 42.8 12129852 7006196 pts/0 Sl+ May19 323:04 /usr/local/bin/python /app/api.py
root 34 0.0 0.0 14144 9204 pts/0 S+ May19 0:00 /usr/local/bin/python -c from multiprocessing.sem
root 5159 0.2 0.0 5984 3600 pts/1 Ss 08:46 0:00 /bin/bash
root 5166 0.0 0.0 8584 3192 pts/1 R+ 08:47 0:00 ps aux
PIDs 6, 16, 24, and 34 belong to Python; let's look into them.
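For the record, this PID hunt can also be scripted. A sketch of my own (Linux-only, not part of the original setup) that reads /proc directly, roughly what ps aux piped through grep python does:

```python
# Sketch: enumerate processes whose command line mentions "python" by
# reading /proc directly (Linux only).
import os

def python_pids():
    """Return the PIDs of all processes with 'python' in their cmdline."""
    pids = []
    for entry in os.listdir('/proc'):
        if not entry.isdigit():
            continue
        try:
            with open(f'/proc/{entry}/cmdline', 'rb') as f:
                cmdline = f.read().replace(b'\0', b' ').decode()
        except OSError:
            continue  # process exited while we were scanning
        if 'python' in cmdline:
            pids.append(int(entry))
    return pids
```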
Figure out the target process
Running lsof -p <PID> on each of these PIDs produces a lot of output, but one set of entries caught my attention:
root@709b1f63ed67:/app # lsof -p 24 | grep '/api'
python 24 root 9u REG 8,4 8200 1714599 /api/output/1653362182_810.txt
python 24 root 12u REG 8,4 0 1714579 /api/output/output.test
python 24 root 20u REG 8,4 0 1714579 /api/output/output.test
python 24 root 23u REG 8,4 0 1714579 /api/output/output.test
python 24 root 49u REG 8,4 0 1714579 /api/output/output.test
python 24 root 51u REG 8,4 0 1714579 /api/output/output.test
python 24 root 53u REG 8,4 0 1714579 /api/output/output.test
python 24 root 55u REG 8,4 0 1714579 /api/output/output.test
python 24 root 57u REG 8,4 0 1714579 /api/output/output.test
python 24 root 59u REG 8,4 0 1714607 /api/output/output.test2
PID 24 has opened several output files and is still holding them.
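The FD listing that lsof shows can also be recovered from /proc. A small helper of my own (Linux-only, not from the original code) that resolves the symlinks under /proc/&lt;pid&gt;/fd:

```python
# Sketch: map a process's open file descriptors to paths by resolving
# the symlinks in /proc/<pid>/fd -- roughly what lsof -p <pid> reports.
import os

def open_files(pid):
    """Return {fd number: path} for every fd the process currently holds."""
    fd_dir = f'/proc/{pid}/fd'
    files = {}
    for fd in os.listdir(fd_dir):
        try:
            files[int(fd)] = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd was closed between listdir() and readlink()
    return files
```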
Looking at the Python script, I suspected this piece of code was hanging:
with open(output_path, 'w+') as f:
    for sent in sents:
        f.write(cleanup(sent) + '\n')
This looks perfectly fine, so the problem must lie in the cleanup() function:
def cleanup(text):
    print(f'original text: {text}')
    # ...
    # string manipulations
    # ...
    print(f'processed text: {text}')
    return text
This looks fine too... At this point I was stuck. Why does it hang in the cleanup function?
Pyrasite comes to the rescue
I wanted to insert some debug statements into the code. However, this bug does not occur frequently, and I was afraid that if the process got restarted, we would never get the chance to squash the bug again.
After some googling I found this Stack Overflow thread. Pyrasite can attach a Python interpreter to a live Python process, which seemed a perfect fit for our case.
root@709b1f63ed67:/app/nn_model/gector# pyrasite-shell 24
Pyrasite Shell 2.0
Connected to '/usr/local/bin/python /app/api.py '
Python 3.7.13 (default, Apr 20 2022, 19:06:16)
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
(DistantInteractiveConsole)
>>>
Great, let's list the running frames:
>>> import sys
>>> sys._current_frames()
{140193134786304: <frame at 0x55c1dbe3cee0, file '<console>', line 1, code <module>>, 140193143179008: <frame at 0x7f819bd33650, file '/usr/local/lib/python3.7/traceback.py', line 105, code print_exception>, 140193151571712: <frame at 0x7f8192501050, file '/app/pre_prediction.py', line 6, code cleanup>, 140193159964416: <frame at 0x7f81924fa810, file '/app/pre_prediction.py', line 6, code cleanup> ...
So it is stuck in the cleanup function. Let's get more details:
>>> import traceback
>>> traceback.print_stack(list(sys._current_frames().values())[2])
...
File "/app/pre_prediction.py", line 42, in preprocessing
f.write(cleanup(sent) + '\n')
File "/app/pre_prediction.py", line 6, in cleanup
print(f'original text: {text}')
It is stuck in the print function!?
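The same sys._current_frames() plus traceback combination works as an in-process helper too. A sketch (the function name is mine) that formats the current stack of every live thread:

```python
# Sketch: dump the stack of every running thread, using the same
# sys._current_frames() + traceback trick shown in the pyrasite shell.
import sys
import traceback

def dump_all_stacks():
    """Return one formatted traceback per live thread, as a single string."""
    chunks = []
    for thread_id, frame in sys._current_frames().items():
        chunks.append(f'Thread {thread_id}:\n')
        chunks.extend(traceback.format_stack(frame))
    return ''.join(chunks)
```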
Docker bug encountered
After some research, it turned out to be a Docker bug: Logging long lines breaks container.
The problem is the -t option, which allocates a pseudo-terminal (TTY) for the container. A bug in its implementation stops the TTY from accepting stdout after a sufficiently long line is received, so any process writing to stdout blocks.
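The mechanism is easy to demonstrate without Docker: stdout is just a file descriptor, and once whatever reads the other end stops draining it, writes block as soon as the kernel buffer fills. A sketch using a plain pipe (the write end is made non-blocking so the demo raises instead of hanging):

```python
# Sketch: why print() can block. Write into a pipe that nobody reads;
# once the kernel buffer is full, a blocking write would stall forever --
# which is what happened to the container's broken TTY here.
import os

r, w = os.pipe()
os.set_blocking(w, False)  # non-blocking so this demo doesn't hang

written = 0
try:
    while True:
        written += os.write(w, b'x' * 4096)
except BlockingIOError:
    pass  # pipe buffer is full; a print() to this fd would now hang
```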
Summary
Had a lot of sweat and fun debugging this issue. I suspected the blocking had a different cause, but the real world is always full of surprises. Again, load testing is your friend :)