In the previous post, I mentioned two great books about debugging. In this article, I want to share a real case that we solved with a few of those tricks. Last week, one of my colleagues came to me with a question: a Python script is running, but "kill process_pid" cannot kill it. Only "kill -9 process_pid" works. Why?
My first guess was that the script installs a signal handler that does not exit. My colleague promised that the handler does exit. OK, let's check. As a first step, we look at the process's status with ps:
shit 17392 99.7 0.2 244216 23300 ? Rl Aug10 9149:48 python /shit.py /shit.config
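A quick way to confirm that a process really has a handler installed for SIGTERM is to look at the SigCgt bitmask in /proc/<pid>/status. The sketch below assumes Linux, and the helper names caught_signals and catches_sigterm are made up for illustration; it is not what we ran at the time.

```python
import signal


def caught_signals(status_text):
    """Return the signal numbers set in the SigCgt mask of /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("SigCgt:"):
            mask = int(line.split()[1], 16)
            # Bit (n - 1) of the mask is set when signal n has a handler.
            return {n for n in range(1, 65) if mask & (1 << (n - 1))}
    raise ValueError("no SigCgt line found")


def catches_sigterm(pid):
    """Assumes a Linux /proc filesystem; pid is the process to inspect."""
    with open(f"/proc/{pid}/status") as f:
        return int(signal.SIGTERM) in caught_signals(f.read())
```

If catches_sigterm(17392) returns True, a plain "kill" only runs the handler, and the process exits only if the handler gets that far.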
Hey, it is running (state R) and using almost all of the CPU time. This suggests it is stuck in an infinite loop. But there is no loop in the handler. Hmm, is a bug hidden at a lower level? How do we find it?
We use gdb to attach to the process and see where it is:
(gdb) attach 17392
(gdb) where
#0 adns_forallqueries_next (ads=0xb5d6500, context_r=0x7fff2ffe9b68) at ../src/setup.c:706
#1 0x00002ac062aa3d92 in PyDict_New () from /usr/lib64/python2.6/site-packages/adns.so
#2 0x0000003b5ead8e19 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0
...
Since adns.so was compiled without debug info, we step at the assembly level with "si". The instruction pointer never leaves the address range of adns_forallqueries_next(), so the process is stuck inside that function. We found the bug! Here is the source code from the adns library (adns-1.2/src/setup.c):
    adns_query adns_forallqueries_next(adns_state ads, void **context_r) {
      adns_query qu, nqu;
      adns__consistency(ads,0,cc_entex);
      nqu= ads->forallnext;
      for (;;) {
        qu= nqu;
        if (!qu) return 0;
        if (qu->next) {
          nqu= qu->next;
        } else if (qu == ads->udpw.tail) {
          nqu=
            ads->tcpw.head ? ads->tcpw.head :
            ads->childw.head ? ads->childw.head :
            ads->output.head;
        } else if (qu == ads->tcpw.tail) {
          nqu=
            ads->childw.head ? ads->childw.head :
            ads->output.head;
        } else if (qu == ads->childw.tail) {
          nqu= ads->output.head;
        } else {
          nqu= 0;
        }
        if (!qu->parent) break;
      }
      ads->forallnext= nqu;
      if (context_r) *context_r= qu->ctx.ext;
      return qu;
    }
Now we know the process is stuck in this loop. But why? The reason is simple: in the signal handler, my colleague deletes an object, and the objects it contains invoke this C function when they are destroyed. However, a signal can arrive at any time. If some code is in the middle of modifying the lists above when the signal arrives, the handler sees the lists in an inconsistent state, which results in this strange behavior.
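To make the "inconsistent state" idea concrete, here is a toy model. It is not the adns data structure and not the exact failure we hit: Node, Queue, and the interrupt callback are hypothetical, and the signal is simulated by calling the callback directly at the unlucky point so that the result is reproducible.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.next = None


class Queue:
    """Toy singly linked queue with head and tail pointers (hypothetical)."""

    def __init__(self):
        self.head = None
        self.tail = None

    def append(self, node, interrupt=None):
        # Step 1: link the new node behind the current tail.
        if self.tail is None:
            self.head = node
        else:
            self.tail.next = node
        # A signal arriving here sees step 1 done but step 2 not yet done.
        if interrupt is not None:
            interrupt(self)
        # Step 2: advance the tail pointer.
        self.tail = node


def last_reachable(queue):
    """Follow next pointers from head and return the final node (or None)."""
    node = queue.head
    while node is not None and node.next is not None:
        node = node.next
    return node


def is_consistent(queue):
    """The queue is consistent when the last reachable node is the tail."""
    return last_reachable(queue) is queue.tail


def demo():
    observed = []
    q = Queue()
    q.append(Node("a"))
    q.append(Node("b"))
    # Simulated handler: records what it sees in the middle of the update.
    q.append(Node("c"), interrupt=lambda queue: observed.append(is_consistent(queue)))
    return observed, is_consistent(q)
```

In this model the simulated handler sees a queue whose tail pointer lags behind the last linked node, while the same queue is consistent again once append() finishes. Any traversal that relies on comparisons like "qu == tail", as the C function above does, can go wrong when it runs in that window.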
So how do we solve this? It is easy: do not delete the object explicitly, and let the OS reclaim the resources when the process exits. And remind your colleagues that every function called in a signal handler, whether explicitly or implicitly, must be async-signal-safe. :-)
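If some cleanup really is needed, one common alternative (not what we did here) is to keep the handler trivial: it only sets a flag, and the main loop notices the flag and cleans up at a point where no data structure is half-updated. The names below (stop_requested, request_stop, run, and the do_work and cleanup callbacks) are placeholders for illustration. Python runs its handlers in the main thread between bytecode instructions, so the handler can still interrupt Python code in mid-operation, which is why it should touch as little as possible.

```python
import os
import signal

stop_requested = False


def request_stop(signum, frame):
    """Handler does the minimum: record that a shutdown was requested."""
    global stop_requested
    stop_requested = True


def run(do_work, cleanup, max_iterations=None):
    """Call do_work() until SIGTERM arrives, then call cleanup() once."""
    signal.signal(signal.SIGTERM, request_stop)
    done = 0
    while not stop_requested and (max_iterations is None or done < max_iterations):
        do_work()
        done += 1
    # We are between work items here, so no update is half-finished.
    cleanup()
    return done
```

With this shape, a plain "kill" lets the current work item finish, runs the cleanup in normal context, and then the script can exit on its own.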