[Click] [PATCH 2/2] Task: Kill process_pending dead lock

Eddie Kohler kohler at cs.ucla.edu
Sun Nov 9 02:21:54 EST 2008


Hi Joonwoo,

Your analysis helped me diagnose this.  What's missing from your analysis is 
that CPU 1 should always call unblock_task() eventually.  The only reason it 
wouldn't is that CPU 0 is monopolizing the CPU.  Just as in my last checkin, 
CPU 0 needs to call schedule() while waiting, rather than spinning.
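
The fix, roughly, looks like this (a paraphrased sketch of the idea, not
the exact committed code -- see the commit below for the real thing):

    // While CPU 0 waits for the task blocker to clear, yield the
    // processor instead of busy-waiting, so that CPU 1's kernel
    // thread can run and eventually call unblock_task():
    while (!_task_blocker.compare_and_swap(0, -1)) {
    #if CLICK_LINUXMODULE
        schedule();    // give up the CPU rather than spin
    #endif
    }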

Please see 
http://www.read.cs.ucla.edu/gitweb?p=click;a=commit;h=939e4b0853ee82d59bf43b7e4ad25fe5b117c81c

Does this work for you?
Eddie


Joonwoo Park wrote:
> Hi Eddie,
> 
> Thank you very much for your explanation.
> I investigated it again, and I still think the deadlock problem exists.
> Moreover, I managed to reproduce the soft-lockup problem.
> Here is my analysis of the current task blocker.
> I hope it helps you.
> 
> ---------
> * initially, no one is holding the lock; both CPUs operate on the same thread
> * clickfs is being accessed
>
> CPU 0                CPU 1                        _task_blocker_waiting   _task_blocker
>                      schedule_block_task                 0 -> 1                0
> driver_lock_task                                           1                0 -> -1
>                      top of block_task(true)               1                  -1       * spinning & scheduling
> driver_unlock_task                                         1               -1 -> 0
> driver_lock_task                                           1                0 -> -1
> driver_unlock_task                                         1               -1 -> 0
>                      bottom of block_task(true)          1 -> 0             0 -> 1     * exits the spin
> block_tasks                                                0                1 -> *2*
> unblock_tasks                                              0                2 -> *1*
> driver_lock_tasks                                          0                1 -> never reaches 0   * hanging
> ---------
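>
> For reference, here is roughly how I understand the primitives in the
> trace to behave (my own simplified C++ sketch of the semantics, not
> the actual Click source):
>
>     // _task_blocker ==  0  : unlocked
>     // _task_blocker ==  n>0: n threads have blocked task execution
>     // _task_blocker == -1  : the driver holds the task-list lock
>     void RouterThread::block_tasks(bool scheduled) {
>         if (!scheduled)
>             ++_task_blocker_waiting;   // schedule_block_task already did this if scheduled
>         while (1) {
>             int32_t blocker = _task_blocker.value();
>             if (blocker >= 0
>                 && _task_blocker.compare_and_swap(blocker, blocker + 1))
>                 break;                 // now one more blocker
>             // otherwise spin (possibly scheduling) until the driver unlocks
>         }
>         --_task_blocker_waiting;       // no longer waiting
>     }
>
>     void RouterThread::driver_lock_tasks() {
>         while (!_task_blocker.compare_and_swap(0, -1))
>             /* spin -- this is where CPU 0 hangs above */;
>     }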
> 
> I'm also attaching this as a file, in case an email client makes it hard to read.
> 
> Thanks!
> Joonwoo
> 
> On Tue, Nov 4, 2008 at 10:55 AM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>> Yes, that's right.  ScheduleLinux() causes the current RouterThread::driver
>> to yield to Linux.  No other kernel thread will start that
>> RouterThread::driver.
>>
>> Eddie
>>
>>
>> Joonwoo Park wrote:
>>> Hi Eddie,
>>>
>>> Even in a configuration that uses the ScheduleLinux() element?
>>>
>>> Joonwoo
>>>
>>> On Tue, Nov 4, 2008 at 9:12 AM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>>>> Joonwoo,
>>>>
>>>> Multiple routerthreads will NEVER call driver_lock_tasks() on THE
>>>> SAME ROUTERTHREAD OBJECT at the same time.
>>>>
>>>> Each routerthread has exactly ONE kernel thread, and only that
>>>> kernel thread ever calls driver_lock_tasks().
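>>>>
>>>> In other words (a hypothetical assertion with a made-up member name,
>>>> just to make the invariant concrete):
>>>>
>>>>     // each RouterThread is driven by exactly one kernel thread, so
>>>>     // driver_lock_tasks() never races with itself on one object:
>>>>     assert(current == this->_linux_task);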
>>>>
>>>> Does this help?
>>>> Eddie
>>>>
>>>>
>>>> Joonwoo Park wrote:
>>>>> Hi Eddie,
>>>>>
>>>>> Thanks for your work; I am very happy to help the Click project,
>>>>> even if only a little bit.
>>>>>
>>>>> I have a quick question about your work.  I think it could fix the
>>>>> problem between block_tasks() and the other functions.
>>>>> However, it still seems to have a problem between two concurrent
>>>>> calls to driver_lock_tasks().
>>>>>
>>>>> What I'm concerned about is this:
>>>>> access clickfs
>>>>>     multiple _task_blocker_waiting counters become 1
>>>>>     schedule
>>>>>     multiple routerthreads call driver_lock_tasks() at the same time
>>>>>     deadlock at: while (!_task_blocker.compare_and_swap(0, -1))
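>>>>>
>>>>> To illustrate, the interleaving I am worried about is this
>>>>> (hypothetical, assuming two kernel threads could reach the same
>>>>> loop on one RouterThread):
>>>>>
>>>>>     // thread A: compare_and_swap(0, -1) succeeds; _task_blocker == -1
>>>>>     // thread B: compare_and_swap(0, -1) keeps failing (the value is
>>>>>     //           -1, not 0), so B spins here until A unlocks -- and
>>>>>     //           if A is itself spinning somewhere, neither progresses:
>>>>>     while (!_task_blocker.compare_and_swap(0, -1))
>>>>>         /* spin */;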
>>>>>
>>>>> What do you think?
>>>>> Please correct me if I'm wrong, and I'm sorry that I can't be of more help.
>>>>>
>>>>> Thanks,
>>>>> Joonwoo
>>>>>
>>>>> On Mon, Nov 3, 2008 at 6:32 PM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>>>>>> Hi Joonwoo,
>>>>>>
>>>>>> I appreciate all your work.  Thanks for the time you have spent!
>>>>>>
>>>>>> After some poking around and a bunch of rebooting, I have a different
>>>>>> analysis of the problem, and have checked in a patch.  It is here:
>>>>>>
>>>>>> http://www.read.cs.ucla.edu/gitweb?p=click;a=commit;h=7312a95decddc7c4f5043d29d622dc9efb99a547
>>>>>>
>>>>>> Does this make sense?  And if and when you get a chance, does it
>>>>>> work for you?
>>>>>>
>>>>>> Eddie
>>>>>>
>>>>>>
>>>>>> Joonwoo Park wrote:
>>>>>>> Hello Eddie,
>>>>>>>
>>>>>>> Thank you for your review.  I cannot take a look at the code right
>>>>>>> now; I'll check my patch again as soon as I have a chance.
>>>>>>> I am not using Click for work nowadays, so it's pretty hard to
>>>>>>> spend enough time on it.
>>>>>>>
>>>>>>> Anyhow, I have been running with kassert turned on.  However, I
>>>>>>> couldn't see any assertion failures (both before and after
>>>>>>> patching).  Does that make sense?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Joonwoo
>>>>>>>
>>>>>>> On Mon, Nov 3, 2008 at 2:59 PM, Eddie Kohler <kohler at cs.ucla.edu>
>>>>>>> wrote:
>>>>>>>> Joonwoo,
>>>>>>>>
>>>>>>>> I don't think this patch has any effect on the correctness of the
>>>>>>>> code.  It just slows things down.
>>>>>>>>
>>>>>>>> There are also bugs in the patch, including setting
>>>>>>>> _task_blocker_owner in RouterThread::attempt_lock_tasks but not
>>>>>>>> resetting it if the attempt fails.
>>>>>>>>
>>>>>>>> Have you run after configuring with --enable-kassert?  If so, do
>>>>>>>> you see any assertion failures?  If not, could you please try?
>>>>>>>>
>>>>>>>> I'd like to track this down, but this patch is not the way.
>>>>>>>> Eddie
>>>>>>>>
>>>>>>>>
>>>>>>>> Joonwoo Park wrote:
>>>>>>>>> Hello Eddie,
>>>>>>>>>
>>>>>>>>> I tried to fix the task blocker to support nested locking, and
>>>>>>>>> attached a patch.
>>>>>>>>> Can you please take a look at this?  I've tested minimally.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>> Joonwoo
>>>>>>>>>
>>>>>>>>> On Tue, Sep 16, 2008 at 9:26 AM, Joonwoo Park
>>>>>>>>> <joonwpark81 at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>> I am forking 3 threads.
>>>>>>>>>>
>>>>>>>>>> Joonwoo
>>>>>>>>>>
>>>>>>>>>> 2008/9/16 Eddie Kohler <kohler at cs.ucla.edu>:
>>>>>>>>>>> And how many threads?
>>>>>>>>>>>
>>>>>>>>>>> Eddie
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Joonwoo Park wrote:
>>>>>>>>>>>> Hi Eddie,
>>>>>>>>>>>>
>>>>>>>>>>>> I guessed that you intended them to be recursive. :-)
>>>>>>>>>>>> Here is a config that can cause the lockup without device elements.
>>>>>>>>>>>>
>>>>>>>>>>>> ----
>>>>>>>>>>>> s0::RatedSource(DATASIZE 128) -> EtherEncap(0x0800, FF:FF:FF:FF:FF:FF, FF:FF:FF:FF:FF:FF) -> Discard
>>>>>>>>>>>> s1::InfiniteSource(DATASIZE 128) -> EtherEncap(0x0800, FF:FF:FF:FF:FF:FF, FF:FF:FF:FF:FF:FF) -> Discard
>>>>>>>>>>>>
>>>>>>>>>>>> sched::BalancedThreadSched(100);
>>>>>>>>>>>> ----
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>> Joonwoo
>>>>>>>>>>>>
>>>>>>>>>>>> 2008/9/16 Eddie Kohler <kohler at cs.ucla.edu>:
>>>>>>>>>>>>> Hi Joonwoo,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I intended block_tasks() and driver_lock_tasks() to be
>>>>>>>>>>>>> recursive.  I could certainly have failed!  Can you tell me more
>>>>>>>>>>>>> about the configuration you're running?  Can you cause a soft
>>>>>>>>>>>>> lockup even without device elements (such as with
>>>>>>>>>>>>> InfiniteSources)?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Eddie
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Joonwoo Park wrote:
>>>>>>>>>>>>>> Hi Eddie,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I agree with blocking task execution as a solution.
>>>>>>>>>>>>>> However, I got the following soft-lockup problem with your patch.
>>>>>>>>>>>>>> From a quick review, it seems that block_tasks() and
>>>>>>>>>>>>>> driver_lock_tasks() don't support recursive locking (please
>>>>>>>>>>>>>> correct me if I am wrong).  So when BalancedThreadSched's
>>>>>>>>>>>>>> run_timer tries to lock the tasks, it appears to hang.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here are my oops message and gdb output.  I used my patched
>>>>>>>>>>>>>> 2.6.24 kernel; I'm sorry about that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Joonwoo
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> joonwpark at joonwpark-desktop-64:~/SRC5/click/linuxmodule$ BUG: soft lockup - CPU#0 stuck for 11s! [kclick:3116]
>>>>>>>>>>>>>> SysRq : Changing Loglevel
>>>>>>>>>>>>>> Loglevel set to 9
>>>>>>>>>>>>>> BUG: soft lockup - CPU#0 stuck for 11s! [kclick:3116]
>>>>>>>>>>>>>> CPU 0:
>>>>>>>>>>>>>> Modules linked in: click proclikefs e1000 iptable_filter
>>>>>>>>>>>>>> ip_tables
>>>>>>>>>>>>>> x_tables parport_pc lp parport ipv6 floppy pcspkr forcedeth
>>>>>>>>>>>>>> ext3
>>>>>>>>>>>>>> jbd
>>>>>>>>>>>>>> Pid: 3116, comm: kclick Not tainted 2.6.24.7-joonwpark #3
>>>>>>>>>>>>>> RIP: 0010:[<ffffffff881f818a>]  [<ffffffff881f818a>] :click:_ZN19BalancedThreadSched9run_timerEP5Timer+0x58a/0x630
>>>>>>>>>>>>>> RSP: 0018:ffff8100370d7d30  EFLAGS: 00000286
>>>>>>>>>>>>>> RAX: ffff8100370d4000 RBX: ffff8100370d7dc0 RCX: ffff810037892430
>>>>>>>>>>>>>> RDX: 00000000ffffffff RSI: ffff81003792fcd0 RDI: ffff81003792fc60
>>>>>>>>>>>>>> RBP: ffffffff806b7b10 R08: 0000000000000000 R09: 0000000000000000
>>>>>>>>>>>>>> R10: 0000000000000000 R11: 0000000000000005 R12: 0000000000000001
>>>>>>>>>>>>>> R13: ffff810080643000 R14: ffff8100370d6000 R15: 0000000000000001
>>>>>>>>>>>>>> FS:  00002acdb07f76e0(0000) GS:ffffffff806ae000(0000) knlGS:0000000000000000
>>>>>>>>>>>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>>>>>>>>>>>> CR2: 00000000007ad008 CR3: 000000006bdf2000 CR4: 00000000000006e0
>>>>>>>>>>>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>>>>>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Call Trace:
>>>>>>>>>>>>>>  [<ffffffff88166803>] :click:_Z12element_hookP5TimerPv+0x13/0x20
>>>>>>>>>>>>>>  [<ffffffff8818ebc8>] :click:_ZN6Master10run_timersEv+0x178/0x320
>>>>>>>>>>>>>>  [<ffffffff88183349>] :click:_ZN12RouterThread6driverEv+0x5b9/0x6f0
>>>>>>>>>>>>>>  [<ffffffff881f9ffe>] :click:_Z11click_schedPv+0xfe/0x260
>>>>>>>>>>>>>>  [<ffffffff804e4fef>] _spin_unlock_irq+0x2b/0x30
>>>>>>>>>>>>>>  [<ffffffff8022e0b6>] finish_task_switch+0x57/0x94
>>>>>>>>>>>>>>  [<ffffffff8020cfe8>] child_rip+0xa/0x12
>>>>>>>>>>>>>>  [<ffffffff8022e0b6>] finish_task_switch+0x57/0x94
>>>>>>>>>>>>>>  [<ffffffff8020c6ff>] restore_args+0x0/0x30
>>>>>>>>>>>>>>  [<ffffffff881f9f00>] :click:_Z11click_schedPv+0x0/0x260
>>>>>>>>>>>>>>  [<ffffffff8020cfde>] child_rip+0x0/0x12
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> joonwpark at joonwpark-desktop-64:~/SRC5/click/linuxmodule$ gdb click.ko
>>>>>>>>>>>>>> GNU gdb 6.8-debian
>>>>>>>>>>>>>> Copyright (C) 2008 Free Software Foundation, Inc.
>>>>>>>>>>>>>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>>>>>>>>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>>>>>>>>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
>>>>>>>>>>>>>> and "show warranty" for details.
>>>>>>>>>>>>>> This GDB was configured as "x86_64-linux-gnu"...
>>>>>>>>>>>>>> (gdb) info line *_ZN19BalancedThreadSched9run_timerEP5Timer+0x58a
>>>>>>>>>>>>>> Line 311 of "/home/joonwpark/SRC5/click/linuxmodule/../include/click/routerthread.hh"
>>>>>>>>>>>>>>    starts at address 0x9c1ba <_ZN19BalancedThreadSched9run_timerEP5Timer+1418>
>>>>>>>>>>>>>>    and ends at 0x9c1be <_ZN19BalancedThreadSched9run_timerEP5Timer+1422>.
>>>>>>>>>>>>>> (gdb) l
>>>>>>>>>>>>>> "/home/joonwpark/SRC5/click/linuxmodule/../include/click/routerthread.hh":311
>>>>>>>>>>>>>> 306         assert(!current_thread_is_running());
>>>>>>>>>>>>>> 307         if (!scheduled)
>>>>>>>>>>>>>> 308             ++_task_blocker_waiting;
>>>>>>>>>>>>>> 309         while (1) {
>>>>>>>>>>>>>> 310             int32_t blocker = _task_blocker.value();
>>>>>>>>>>>>>> 311             if (blocker >= 0
>>>>>>>>>>>>>> 312                 && _task_blocker.compare_and_swap(blocker, blocker + 1))
>>>>>>>>>>>>>> 313                 break;
>>>>>>>>>>>>>> 314             if (nice) {
>>>>>>>>>>>>>> 315     #if CLICK_LINUXMODULE
>>>>>>>>>>>>>> (gdb)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2008/9/15 Eddie Kohler <kohler at cs.ucla.edu>:
>>>>>>>>>>>>>>> Joonwoo,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I took a look into this lockup issue and I think I found
>>>>>>>>>>>>>>>> something.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> RouterThread::driver() calls run_tasks() with the tasks locked.
>>>>>>>>>>>>>>>> But after calling run_tasks(), the current processor can have
>>>>>>>>>>>>>>>> changed, since schedule() might have been called (e.g. by the
>>>>>>>>>>>>>>>> ScheduleLinux element).
>>>>>>>>>>>>>>>> So I think that's the problem.  What do you think?
>>>>>>>>>>>>>>> I totally agree that this could be a problem.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It looks like EXCLUSIVE handlers never really worked before.
>>>>>>>>>>>>>>> :(
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So my current analysis is this.  It is not appropriate for a
>>>>>>>>>>>>>>> thread to call blocking functions and/or schedule() when that
>>>>>>>>>>>>>>> thread has prevented preemption via get_cpu().  My prior
>>>>>>>>>>>>>>> patches prevented preemption.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The solution is to separate "locking the task list" from
>>>>>>>>>>>>>>> "blocking task execution."  Clickfs, when executing an
>>>>>>>>>>>>>>> exclusive handler, "blocks task execution."  A thread that
>>>>>>>>>>>>>>> wants to examine the task list "locks" the list.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This commit:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> http://www.read.cs.ucla.edu/gitweb?p=click;a=commit;h=ede0c6b0a1cface05e8d8e2e3496ee7fcd5ee143
>>>>>>>>>>>>>>> introduces separate APIs for locking the list and blocking
>>>>>>>>>>>>>>> task execution.  Exclusive handlers block task execution, but
>>>>>>>>>>>>>>> do not lock the task list.  I believe that task execution, in
>>>>>>>>>>>>>>> this patch, does not prevent preemption.  I believe the locking
>>>>>>>>>>>>>>> works out too.  User-level multithreading tests appear OK.
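>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Schematically, call sites should end up looking like this (a
>>>>>>>>>>>>>>> hand-written sketch; the lock_tasks()/unlock_tasks() names are
>>>>>>>>>>>>>>> placeholders -- see the commit for the real API):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     // clickfs, around an exclusive handler: block task
>>>>>>>>>>>>>>>     // execution, but do not take the task-list lock
>>>>>>>>>>>>>>>     rt->block_tasks(false);
>>>>>>>>>>>>>>>     /* ... run the exclusive handler ... */
>>>>>>>>>>>>>>>     rt->unblock_tasks();
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     // a thread that only wants to examine the task list
>>>>>>>>>>>>>>>     // locks the list instead:
>>>>>>>>>>>>>>>     rt->lock_tasks();
>>>>>>>>>>>>>>>     /* ... walk the task list ... */
>>>>>>>>>>>>>>>     rt->unlock_tasks();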
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Any willing stresstesters?  Pretty please? :)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Eddie
>>>>>>>>>>>>>>>

