A Case Study of Debugging

In the previous post, I mentioned two great books about debugging. In this article, I want to share a real case that was solved with a few of those tricks. Last week, one of my colleagues came to me with a question: a Python script is running, but "kill process_pid" cannot kill it. Only "kill -9 process_pid" works. Why?
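Some background on the question: a plain "kill" sends SIGTERM, which a process is allowed to catch, while "kill -9" sends SIGKILL, which cannot be caught. The sketch below reproduces the symptom on a POSIX system under stated assumptions: the child script and the demo() helper are made up for illustration and are not my colleague's script.

```python
import signal
import subprocess
import sys
import time

# Hypothetical child script: its SIGTERM handler returns without exiting.
CHILD = r"""
import signal, time
signal.signal(signal.SIGTERM, lambda signum, frame: None)
print("ready", flush=True)
while True:
    time.sleep(0.1)
"""


def demo():
    """Return (survived SIGTERM?, final return code) for the child process."""
    with subprocess.Popen([sys.executable, "-c", CHILD],
                          stdout=subprocess.PIPE) as child:
        child.stdout.readline()            # wait until the handler is installed
        child.send_signal(signal.SIGTERM)  # what a plain `kill pid` sends
        time.sleep(0.5)
        survived_term = child.poll() is None
        child.kill()                       # SIGKILL, what `kill -9 pid` sends
        child.wait()
        return survived_term, child.returncode


if __name__ == "__main__":
    print(demo())
```

On POSIX, subprocess reports a child killed by a signal with a negative return code, so the second value tells you which signal finally ended the process.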

My first guess was that the script installs a signal handler that does not exit. My colleague assured me that the handler did exit. OK, let's check. As a first step, we look at the process's status with ps:

shit 17392 99.7 0.2 244216 23300 ? Rl Aug10 9149:48 python /shit.py /shit.config

Hey, it is running (state R) and using almost all of the CPU time (99.7%). This suggests it is trapped in an infinite loop. But there is no loop in the handler. Hmmm, is a bug hidden at a lower level? How do we identify it?

We use gdb to attach to the process and see where it is:

(gdb) attach 17392

(gdb) where

#0 adns_forallqueries_next (ads=0xb5d6500, context_r=0x7fff2ffe9b68) at ../src/setup.c:706

#1 0x00002ac062aa3d92 in PyDict_New () from /usr/lib64/python2.6/site-packages/adns.so

#2 0x0000003b5ead8e19 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0

...

Then we use "si" to step at the assembly level, since adns.so was compiled without debug info. We confirm that the instruction pointer (IP) never leaves the address range of adns_forallqueries_next(). We got the bug!! Let's check the source code of the adns library (adns-1.2/src/setup.c):

adns_query adns_forallqueries_next(adns_state ads, void **context_r) {
  adns_query qu, nqu;
  adns__consistency(ads,0,cc_entex);
  nqu= ads->forallnext;
  for (;;) {
    qu= nqu;
    if (!qu) return 0;
    if (qu->next) {
      nqu= qu->next;
    } else if (qu == ads->udpw.tail) {
      nqu=
        ads->tcpw.head ? ads->tcpw.head :
        ads->childw.head ? ads->childw.head :
        ads->output.head;
    } else if (qu == ads->tcpw.tail) {
      nqu=
        ads->childw.head ? ads->childw.head :
        ads->output.head;
    } else if (qu == ads->childw.tail) {
      nqu= ads->output.head;
    } else {
      nqu= 0;
    }
    if (!qu->parent) break;
  }
  ads->forallnext= nqu;
  if (context_r) *context_r= qu->ctx.ext;
  return qu;
}

We know the process is stuck in this loop. But why? The reason is simple: in the signal handler, my colleague deletes an object that contains other objects, and destroying those objects invokes this C function. However, a signal can arrive at any time. If it arrives while some code is in the middle of updating the list above, the handler sees the list in an inconsistent state, which results in this strange behavior.
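To make the failure mode concrete, here is a toy model in Python. It is only a sketch: the Query class and next_top_level() are hypothetical stand-ins, not the real adns structures, and the exact state the real list was in is not known. It mimics just the shape of the for(;;) loop above, which keeps walking while the current query has a parent. If a half-finished update leaves the "next" links pointing in a circle among child queries, the walk never reaches a top-level query or the end of the list.

```python
class Query:
    """Toy stand-in for an adns query node (hypothetical, not the real struct)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.next = None


def next_top_level(start, max_steps=1000):
    """Skip child queries and return the first top-level one, or None.

    max_steps exists only so this demo terminates; the C loop has no cap.
    """
    qu = start
    for _ in range(max_steps):
        if qu is None:
            return None          # end of list, like `if (!qu) return 0;`
        if qu.parent is None:
            return qu            # like `if (!qu->parent) break;`
        qu = qu.next
    raise RuntimeError("traversal did not terminate")


def build_list(corrupt):
    """Build child_a -> child_b -> top, or a child_a <-> child_b cycle."""
    parent = Query("parent")
    child_a = Query("child_a", parent)
    child_b = Query("child_b", parent)
    top = Query("top")
    child_a.next = child_b
    child_b.next = child_a if corrupt else top
    return child_a, top


if __name__ == "__main__":
    head, top = build_list(corrupt=False)
    print(next_top_level(head).name)
    head, _ = build_list(corrupt=True)
    try:
        next_top_level(head)
    except RuntimeError as exc:
        print(exc)
```

In the real process there is no step limit, so the equivalent situation shows up as a process pinned at nearly 100% CPU inside one function.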

So how do we solve this? It is easy: do not delete the object explicitly in the handler. Letting the OS reclaim the resources when the process exits is enough. And remind your colleague that every function called in a signal handler, whether explicitly or implicitly, must be async-signal-safe. :-)

軟體學徒forever
