[Click] [PATCH 2/2] Task: Kill process_pending dead lock

Joonwoo Park joonwpark81 at gmail.com
Wed Nov 5 03:32:19 EST 2008


Hi Eddie,

Thank you very much for you explanation.
I investigated it again and I'm still thinking dead lock problem exists.
Moreover I managed to make soft-lock-up problem.
Here is my analysis for the current task blocker.
I hope these could be help you.

---------
* no one is holding lock, on the same thread
* access clickfs
CPU 0                         CPU 1
_task_blocker_waiting     _task_blocker
                         schedule_block_task                  0  ->  1
                       0
driver_lock_task                                                   1
->  1                        0  ->  -1
                         top of  block_task(true)
1                                -1                       * spinning &
scheduling
driver_unlock_task                                               1
                           -1  ->  0
driver_lock_task                                                   1
                             0    ->  -1
driver_unlock_task                                               1
                           -1  ->  0
                         bottom of block_task(true)           1  ->  0
                      0   ->  1                    * exiting spinning
block_tasks
   0                           1  ->  *2*
unblock_tasks
 0                           2  ->  *1*
driver_lock_tasks
0                           1  -> never reach     * hanging
---------

I'm attaching as a file in case of hard to read due to someone's email client.

Thanks!
Joonwoo

On Tue, Nov 4, 2008 at 10:55 AM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
> Yes, that's right.  ScheduleLinux() causes the current RouterThread::driver
> to yield to Linux.  No other kernel thread will start that
> RouterThread::driver.
>
> Eddie
>
>
> Joonwoo Park wrote:
>>
>> Hi Eddie,
>>
>> Even though a situation which is using ScheduleLinux() element?
>>
>> Joonwoo
>>
>> On Tue, Nov 4, 2008 at 9:12 AM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>>>
>>> Joonwoo,
>>>
>>> Multiple routerthreads will NEVER call driver_lock_tasks() on THE SAME
>>> ROUTERTHREAD OBJECT at the same time.
>>>
>>> Each routerthread has exactly ONE kernel thread, and only that kernel
>>> thread
>>> ever calls driver_lock_tasks().
>>>
>>> Does this help?
>>> Eddie
>>>
>>>
>>> Joonwoo Park wrote:
>>>>
>>>> Hi Eddie,
>>>>
>>>> Thanks for your work and I am very happy to help click project even
>>>> though it's just a little bit.
>>>>
>>>> I have a quick question about your work.  I think it could fix the
>>>> problem between block_tasks() and others.
>>>> However still it seems to have a problem between driver_lock_tasks()
>>>> and driver_lock_tasks().
>>>>
>>>> What I'm concerning is like this:
>>>> access clickfs
>>>>     multiple _task_blocker_waitings became 1
>>>>     schedule
>>>>     multiple routerthreads call driver_lock_tasks() at the same time
>>>>     dead lock at code : while (!_task_blocker.compare_and_swap(0, -1))
>>>>
>>>> How do you think?
>>>> Please correct me if I'm wrong and I'm sorry that I can't be help a lot.
>>>>
>>>> Thanks,
>>>> Joonwoo
>>>>
>>>> On Mon, Nov 3, 2008 at 6:32 PM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>>>>>
>>>>> Hi Joonwoo,
>>>>>
>>>>> I appreciate all your work.  Thanks for the time you have spent!
>>>>>
>>>>> After some poking around and a bunch of rebooting, I have a different
>>>>> analysis of the problem, and have checked in a patch.  It is here:
>>>>>
>>>>>
>>>>>
>>>>> http://www.read.cs.ucla.edu/gitweb?p=click;a=commit;h=7312a95decddc7c4f5043d29d622dc9efb99a547
>>>>>
>>>>> Does this make sense?  And if and when you get a chance, does it work
>>>>> for
>>>>> you?
>>>>>
>>>>> Eddie
>>>>>
>>>>>
>>>>> Joonwoo Park wrote:
>>>>>>
>>>>>> Hello Eddie,
>>>>>>
>>>>>> Thank you for your reviewing.  I cannot take a look at the code, I'll
>>>>>> check my patch soon again as soon as I have a chance.
>>>>>> I am not using click for work nowadays.  So it's pretty hard to spend
>>>>>> enough time for it.
>>>>>>
>>>>>> Anyhow, I have been turning on the kassert.  However I couldn't see
>>>>>> assertion failure (both before & after patching)
>>>>>> It it make sense?
>>>>>>
>>>>>> Thanks,
>>>>>> Joonwoo
>>>>>>
>>>>>> On Mon, Nov 3, 2008 at 2:59 PM, Eddie Kohler <kohler at cs.ucla.edu>
>>>>>> wrote:
>>>>>>>
>>>>>>> Joonwoo,
>>>>>>>
>>>>>>> I don't think this patch has any affect on the correctness of the
>>>>>>> code.
>>>>>>>  It
>>>>>>> just slows things down.
>>>>>>>
>>>>>>> There are also bugs in the patch, including setting
>>>>>>> _task_blocker_owner
>>>>>>> in
>>>>>>> RouterThread::attempt_lock_tasks but not resetting it if the attempt
>>>>>>> fails.
>>>>>>>
>>>>>>> Have you run after having configured with --enable-kassert?  If so,
>>>>>>> do
>>>>>>> you
>>>>>>> see any assertions?  If not, could you please?
>>>>>>>
>>>>>>> I'd like to track this down, but this patch is not the way.
>>>>>>> Eddie
>>>>>>>
>>>>>>>
>>>>>>> Joonwoo Park wrote:
>>>>>>>>
>>>>>>>> Hello Eddie,
>>>>>>>>
>>>>>>>> I tried to fix task blocker to support nested locking and attached a
>>>>>>>> patch.
>>>>>>>> Can you please take a look at this?  I've tested minimally.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Joonwoo
>>>>>>>>
>>>>>>>> On Tue, Sep 16, 2008 at 9:26 AM, Joonwoo Park
>>>>>>>> <joonwpark81 at gmail.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I am folking 3 threads.
>>>>>>>>>
>>>>>>>>> Joonwoo
>>>>>>>>>
>>>>>>>>> 2008/9/16 Eddie Kohler <kohler at cs.ucla.edu>:
>>>>>>>>>>
>>>>>>>>>> And how many threads?
>>>>>>>>>>
>>>>>>>>>> Eddie
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Joonwoo Park wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Eddie,
>>>>>>>>>>>
>>>>>>>>>>> I guess so that you intended to they are recursive. :-)
>>>>>>>>>>> Here is the config can cause lock up without device elements.
>>>>>>>>>>>
>>>>>>>>>>> ----
>>>>>>>>>>> s0::RatedSource(DATASIZE 128) -> EtherEncap(0x0800,
>>>>>>>>>>> FF:FF:FF:FF:FF:FF,
>>>>>>>>>>> FF:FF:FF:FF:FF:FF) -> Discard
>>>>>>>>>>> s1::InfiniteSource(DATASIZE 128) -> EtherEncap(0x0800,
>>>>>>>>>>> FF:FF:FF:FF:FF:FF, FF:FF:FF:FF:FF:FF) -> Discard
>>>>>>>>>>>
>>>>>>>>>>> sched::BalancedThreadSched(100);
>>>>>>>>>>> ----
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>> Joonwoo
>>>>>>>>>>>
>>>>>>>>>>> 2008/9/16 Eddie Kohler <kohler at cs.ucla.edu>:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Joonwoo,
>>>>>>>>>>>>
>>>>>>>>>>>> I intended block_tasks() and driver_lock_tasks() to be
>>>>>>>>>>>> recursive.
>>>>>>>>>>>>  I
>>>>>>>>>>>> could
>>>>>>>>>>>> certainly have failed!  Can you tell me more about the
>>>>>>>>>>>> configuration
>>>>>>>>>>>> you're
>>>>>>>>>>>> running?  Can you cause a soft lockup even without device
>>>>>>>>>>>> elements
>>>>>>>>>>>> (such
>>>>>>>>>>>> as
>>>>>>>>>>>> with InfiniteSources)?
>>>>>>>>>>>>
>>>>>>>>>>>> Eddie
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Joonwoo Park wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Eddie,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree with your blocking task execution as a solution.
>>>>>>>>>>>>> However I got a following soft lock up problem with your patch.
>>>>>>>>>>>>> With a quick review, it's seems to block_tasks() and
>>>>>>>>>>>>> driver_tasks()
>>>>>>>>>>>>> doesn't support recursive lock. (please correct me if I am
>>>>>>>>>>>>> wrong)
>>>>>>>>>>>>> So when BalancedThreadSched's run_timer try to lock the tasks,
>>>>>>>>>>>>> it
>>>>>>>>>>>>> looks like goes hang.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here is my oops message and gdb output.  I used my 2.6.24
>>>>>>>>>>>>> patched
>>>>>>>>>>>>> kernel. I'm sorry for that.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Joonwoo
>>>>>>>>>>>>>
>>>>>>>>>>>>> joonwpark at joonwpark-desktop-64:~/SRC5/click/linuxmodule$ BUG:
>>>>>>>>>>>>> soft
>>>>>>>>>>>>> lockup - CPU#0 stuck for 11s! [kclick:3116]
>>>>>>>>>>>>> SysRq : Changing Loglevel
>>>>>>>>>>>>> Loglevel set to 9
>>>>>>>>>>>>> BUG: soft lockup - CPU#0 stuck for 11s! [kclick:3116]
>>>>>>>>>>>>> CPU 0:
>>>>>>>>>>>>> Modules linked in: click proclikefs e1000 iptable_filter
>>>>>>>>>>>>> ip_tables
>>>>>>>>>>>>> x_tables parport_pc lp parport ipv6 floppy pcspkr forcedeth
>>>>>>>>>>>>> ext3
>>>>>>>>>>>>> jbd
>>>>>>>>>>>>> Pid: 3116, comm: kclick Not tainted 2.6.24.7-joonwpark #3
>>>>>>>>>>>>> RIP: 0010:[<ffffffff881f818a>]  [<ffffffff881f818a>]
>>>>>>>>>>>>> :click:_ZN19BalancedThreadSched9run_timerEP5Timer+0x58a/0x630
>>>>>>>>>>>>> RSP: 0018:ffff8100370d7d30  EFLAGS: 00000286
>>>>>>>>>>>>> RAX: ffff8100370d4000 RBX: ffff8100370d7dc0 RCX:
>>>>>>>>>>>>> ffff810037892430
>>>>>>>>>>>>> RDX: 00000000ffffffff RSI: ffff81003792fcd0 RDI:
>>>>>>>>>>>>> ffff81003792fc60
>>>>>>>>>>>>> RBP: ffffffff806b7b10 R08: 0000000000000000 R09:
>>>>>>>>>>>>> 0000000000000000
>>>>>>>>>>>>> R10: 0000000000000000 R11: 0000000000000005 R12:
>>>>>>>>>>>>> 0000000000000001
>>>>>>>>>>>>> R13: ffff810080643000 R14: ffff8100370d6000 R15:
>>>>>>>>>>>>> 0000000000000001
>>>>>>>>>>>>> FS:  00002acdb07f76e0(0000) GS:ffffffff806ae000(0000)
>>>>>>>>>>>>> knlGS:0000000000000000
>>>>>>>>>>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>>>>>>>>>>> CR2: 00000000007ad008 CR3: 000000006bdf2000 CR4:
>>>>>>>>>>>>> 00000000000006e0
>>>>>>>>>>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>>>>>>>>>>> 0000000000000000
>>>>>>>>>>>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
>>>>>>>>>>>>> 0000000000000400
>>>>>>>>>>>>>
>>>>>>>>>>>>> Call Trace:
>>>>>>>>>>>>>  [<ffffffff88166803>]
>>>>>>>>>>>>> :click:_Z12element_hookP5TimerPv+0x13/0x20
>>>>>>>>>>>>>  [<ffffffff8818ebc8>]
>>>>>>>>>>>>> :click:_ZN6Master10run_timersEv+0x178/0x320
>>>>>>>>>>>>>  [<ffffffff88183349>]
>>>>>>>>>>>>> :click:_ZN12RouterThread6driverEv+0x5b9/0x6f0
>>>>>>>>>>>>>  [<ffffffff881f9ffe>] :click:_Z11click_schedPv+0xfe/0x260
>>>>>>>>>>>>>  [<ffffffff804e4fef>] _spin_unlock_irq+0x2b/0x30
>>>>>>>>>>>>>  [<ffffffff8022e0b6>] finish_task_switch+0x57/0x94
>>>>>>>>>>>>>  [<ffffffff8020cfe8>] child_rip+0xa/0x12
>>>>>>>>>>>>>  [<ffffffff8022e0b6>] finish_task_switch+0x57/0x94
>>>>>>>>>>>>>  [<ffffffff8020c6ff>] restore_args+0x0/0x30
>>>>>>>>>>>>>  [<ffffffff881f9f00>] :click:_Z11click_schedPv+0x0/0x260
>>>>>>>>>>>>>  [<ffffffff8020cfde>] child_rip+0x0/0x12
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> joonwpark at joonwpark-desktop-64:~/SRC5/click/linuxmodule$ gdb
>>>>>>>>>>>>> click.ko
>>>>>>>>>>>>> GNU gdb 6.8-debian
>>>>>>>>>>>>> Copyright (C) 2008 Free Software Foundation, Inc.
>>>>>>>>>>>>> License GPLv3+: GNU GPL version 3 or later
>>>>>>>>>>>>> <http://gnu.org/licenses/gpl.html>
>>>>>>>>>>>>> This is free software: you are free to change and redistribute
>>>>>>>>>>>>> it.
>>>>>>>>>>>>> There is NO WARRANTY, to the extent permitted by law.  Type
>>>>>>>>>>>>> "show
>>>>>>>>>>>>> copying"
>>>>>>>>>>>>> and "show warranty" for details.
>>>>>>>>>>>>> This GDB was configured as "x86_64-linux-gnu"...
>>>>>>>>>>>>> info line *(gdb) info line
>>>>>>>>>>>>> *_ZN19BalancedThreadSched9run_timerEP5Timer+0x58a
>>>>>>>>>>>>> Line 311 of
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> "/home/joonwpark/SRC5/click/linuxmodule/../include/click/routerthread.hh"
>>>>>>>>>>>>>  starts at address 0x9c1ba
>>>>>>>>>>>>> <_ZN19BalancedThreadSched9run_timerEP5Timer+1418>
>>>>>>>>>>>>>  and ends at 0x9c1be
>>>>>>>>>>>>> <_ZN19BalancedThreadSched9run_timerEP5Timer+1422>.
>>>>>>>>>>>>> (gdb) l
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> "/home/joonwpark/SRC5/click/linuxmodule/../include/click/routerthread.hh:311
>>>>>>>>>>>>> 306         assert(!current_thread_is_running());
>>>>>>>>>>>>> 307         if (!scheduled)
>>>>>>>>>>>>> 308             ++_task_blocker_waiting;
>>>>>>>>>>>>> 309         while (1) {
>>>>>>>>>>>>> 310             int32_t blocker = _task_blocker.value();
>>>>>>>>>>>>> 311             if (blocker >= 0
>>>>>>>>>>>>> 312                 && _task_blocker.compare_and_swap(blocker,
>>>>>>>>>>>>> blocker +
>>>>>>>>>>>>> 1))
>>>>>>>>>>>>> 313                 break;
>>>>>>>>>>>>> 314             if (nice) {
>>>>>>>>>>>>> 315     #if CLICK_LINUXMODULE
>>>>>>>>>>>>> (gdb)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2008/9/15 Eddie Kohler <kohler at cs.ucla.edu>:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Joonwoo,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I took look into this lock up issue and I think I found
>>>>>>>>>>>>>>> something.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> RoutherThread::driver() calls run_tasks() with locked tasks.
>>>>>>>>>>>>>>> But after calling run_tasks(), current processor can be
>>>>>>>>>>>>>>> changed
>>>>>>>>>>>>>>> since
>>>>>>>>>>>>>>> schedule() might be called (eg. ScheduleLinux element)
>>>>>>>>>>>>>>> So I think that's problem.  How do you think?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I totally agree that this could be a problem.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It looks like EXCLUSIVE handlers never really worked before.
>>>>>>>>>>>>>> :(
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So my current analysis is this.  It is not appropriate for a
>>>>>>>>>>>>>> thread
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>> call
>>>>>>>>>>>>>> blocking functions and/or schedule() when that thread has
>>>>>>>>>>>>>> prevented
>>>>>>>>>>>>>> preemption via get_cpu().  My prior patches prevented
>>>>>>>>>>>>>> preemption.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The solution is to separate "locking the task list" from
>>>>>>>>>>>>>> "blocking
>>>>>>>>>>>>>> task
>>>>>>>>>>>>>> execution."  Clickfs, when executing an exclusive handler,
>>>>>>>>>>>>>> "blocks
>>>>>>>>>>>>>> task
>>>>>>>>>>>>>> execution."  A thread that wants to examine the task list
>>>>>>>>>>>>>> "locks"
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> list.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This commit:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://www.read.cs.ucla.edu/gitweb?p=click;a=commit;h=ede0c6b0a1cface05e8d8e2e3496ee7fcd5ee143
>>>>>>>>>>>>>> introduces separate APIs for locking the list and blocking
>>>>>>>>>>>>>> task
>>>>>>>>>>>>>> execution.
>>>>>>>>>>>>>>  Exclusive handlers block task execution, but do not lock the
>>>>>>>>>>>>>> task
>>>>>>>>>>>>>> list.
>>>>>>>>>>>>>>  I
>>>>>>>>>>>>>> believe that task execution, in this patch, does not prevent
>>>>>>>>>>>>>> preemption.
>>>>>>>>>>>>>>  I
>>>>>>>>>>>>>> believe the locking works out too.  User-level multithreading
>>>>>>>>>>>>>> tests
>>>>>>>>>>>>>> appear
>>>>>>>>>>>>>> OK.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Any willing stresstesters?  Pretty please? :)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Eddie
>>>>>>>>>>>>>>
>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: click-threads-blocker-analysis.txt
Url: http://amsterdam.lcs.mit.edu/pipermail/click/attachments/20081105/156cb74e/attachment-0002.txt 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: hanging.txt
Url: http://amsterdam.lcs.mit.edu/pipermail/click/attachments/20081105/156cb74e/attachment-0003.txt 


More information about the click mailing list