[Click] [PATCH 2/2] Task: Kill process_pending dead lock

Eddie Kohler kohler at cs.ucla.edu
Mon Nov 3 21:32:47 EST 2008


Hi Joonwoo,

I appreciate all your work.  Thanks for the time you have spent!

After some poking around and a bunch of rebooting, I have a different analysis 
of the problem, and have checked in a patch.  It is here:

http://www.read.cs.ucla.edu/gitweb?p=click;a=commit;h=7312a95decddc7c4f5043d29d622dc9efb99a547

Does this make sense?  And if and when you get a chance, does it work for you?

Eddie


Joonwoo Park wrote:
> Hello Eddie,
> 
> Thank you for your reviewing.  I cannot take a look at the code, I'll
> check my patch soon again as soon as I have a chance.
> I am not using click for work nowadays.  So it's pretty hard to spend
> enough time for it.
> 
> Anyhow, I have been turning on the kassert.  However I couldn't see
> assertion failure (both before & after patching)
> It it make sense?
> 
> Thanks,
> Joonwoo
> 
> On Mon, Nov 3, 2008 at 2:59 PM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>> Joonwoo,
>>
>> I don't think this patch has any affect on the correctness of the code.  It
>> just slows things down.
>>
>> There are also bugs in the patch, including setting _task_blocker_owner in
>> RouterThread::attempt_lock_tasks but not resetting it if the attempt fails.
>>
>> Have you run after having configured with --enable-kassert?  If so, do you
>> see any assertions?  If not, could you please?
>>
>> I'd like to track this down, but this patch is not the way.
>> Eddie
>>
>>
>> Joonwoo Park wrote:
>>> Hello Eddie,
>>>
>>> I tried to fix task blocker to support nested locking and attached a
>>> patch.
>>> Can you please take a look at this?  I've tested minimally.
>>>
>>> Thanks!
>>> Joonwoo
>>>
>>> On Tue, Sep 16, 2008 at 9:26 AM, Joonwoo Park <joonwpark81 at gmail.com>
>>> wrote:
>>>> I am folking 3 threads.
>>>>
>>>> Joonwoo
>>>>
>>>> 2008/9/16 Eddie Kohler <kohler at cs.ucla.edu>:
>>>>> And how many threads?
>>>>>
>>>>> Eddie
>>>>>
>>>>>
>>>>> Joonwoo Park wrote:
>>>>>> Hi Eddie,
>>>>>>
>>>>>> I guess so that you intended to they are recursive. :-)
>>>>>> Here is the config can cause lock up without device elements.
>>>>>>
>>>>>> ----
>>>>>> s0::RatedSource(DATASIZE 128) -> EtherEncap(0x0800, FF:FF:FF:FF:FF:FF,
>>>>>> FF:FF:FF:FF:FF:FF) -> Discard
>>>>>> s1::InfiniteSource(DATASIZE 128) -> EtherEncap(0x0800,
>>>>>> FF:FF:FF:FF:FF:FF, FF:FF:FF:FF:FF:FF) -> Discard
>>>>>>
>>>>>> sched::BalancedThreadSched(100);
>>>>>> ----
>>>>>>
>>>>>> Thanks!
>>>>>> Joonwoo
>>>>>>
>>>>>> 2008/9/16 Eddie Kohler <kohler at cs.ucla.edu>:
>>>>>>> Hi Joonwoo,
>>>>>>>
>>>>>>> I intended block_tasks() and driver_lock_tasks() to be recursive.  I
>>>>>>> could
>>>>>>> certainly have failed!  Can you tell me more about the configuration
>>>>>>> you're
>>>>>>> running?  Can you cause a soft lockup even without device elements
>>>>>>> (such
>>>>>>> as
>>>>>>> with InfiniteSources)?
>>>>>>>
>>>>>>> Eddie
>>>>>>>
>>>>>>>
>>>>>>> Joonwoo Park wrote:
>>>>>>>> Hi Eddie,
>>>>>>>>
>>>>>>>> I agree with your blocking task execution as a solution.
>>>>>>>> However I got a following soft lock up problem with your patch.
>>>>>>>> With a quick review, it's seems to block_tasks() and driver_tasks()
>>>>>>>> doesn't support recursive lock. (please correct me if I am wrong)
>>>>>>>> So when BalancedThreadSched's run_timer try to lock the tasks, it
>>>>>>>> looks like goes hang.
>>>>>>>>
>>>>>>>> Here is my oops message and gdb output.  I used my 2.6.24 patched
>>>>>>>> kernel. I'm sorry for that.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Joonwoo
>>>>>>>>
>>>>>>>> joonwpark at joonwpark-desktop-64:~/SRC5/click/linuxmodule$ BUG: soft
>>>>>>>> lockup - CPU#0 stuck for 11s! [kclick:3116]
>>>>>>>> SysRq : Changing Loglevel
>>>>>>>> Loglevel set to 9
>>>>>>>> BUG: soft lockup - CPU#0 stuck for 11s! [kclick:3116]
>>>>>>>> CPU 0:
>>>>>>>> Modules linked in: click proclikefs e1000 iptable_filter ip_tables
>>>>>>>> x_tables parport_pc lp parport ipv6 floppy pcspkr forcedeth ext3 jbd
>>>>>>>> Pid: 3116, comm: kclick Not tainted 2.6.24.7-joonwpark #3
>>>>>>>> RIP: 0010:[<ffffffff881f818a>]  [<ffffffff881f818a>]
>>>>>>>> :click:_ZN19BalancedThreadSched9run_timerEP5Timer+0x58a/0x630
>>>>>>>> RSP: 0018:ffff8100370d7d30  EFLAGS: 00000286
>>>>>>>> RAX: ffff8100370d4000 RBX: ffff8100370d7dc0 RCX: ffff810037892430
>>>>>>>> RDX: 00000000ffffffff RSI: ffff81003792fcd0 RDI: ffff81003792fc60
>>>>>>>> RBP: ffffffff806b7b10 R08: 0000000000000000 R09: 0000000000000000
>>>>>>>> R10: 0000000000000000 R11: 0000000000000005 R12: 0000000000000001
>>>>>>>> R13: ffff810080643000 R14: ffff8100370d6000 R15: 0000000000000001
>>>>>>>> FS:  00002acdb07f76e0(0000) GS:ffffffff806ae000(0000)
>>>>>>>> knlGS:0000000000000000
>>>>>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>>>>>> CR2: 00000000007ad008 CR3: 000000006bdf2000 CR4: 00000000000006e0
>>>>>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>>>>>>>
>>>>>>>> Call Trace:
>>>>>>>>  [<ffffffff88166803>] :click:_Z12element_hookP5TimerPv+0x13/0x20
>>>>>>>>  [<ffffffff8818ebc8>] :click:_ZN6Master10run_timersEv+0x178/0x320
>>>>>>>>  [<ffffffff88183349>] :click:_ZN12RouterThread6driverEv+0x5b9/0x6f0
>>>>>>>>  [<ffffffff881f9ffe>] :click:_Z11click_schedPv+0xfe/0x260
>>>>>>>>  [<ffffffff804e4fef>] _spin_unlock_irq+0x2b/0x30
>>>>>>>>  [<ffffffff8022e0b6>] finish_task_switch+0x57/0x94
>>>>>>>>  [<ffffffff8020cfe8>] child_rip+0xa/0x12
>>>>>>>>  [<ffffffff8022e0b6>] finish_task_switch+0x57/0x94
>>>>>>>>  [<ffffffff8020c6ff>] restore_args+0x0/0x30
>>>>>>>>  [<ffffffff881f9f00>] :click:_Z11click_schedPv+0x0/0x260
>>>>>>>>  [<ffffffff8020cfde>] child_rip+0x0/0x12
>>>>>>>>
>>>>>>>>
>>>>>>>> joonwpark at joonwpark-desktop-64:~/SRC5/click/linuxmodule$ gdb click.ko
>>>>>>>> GNU gdb 6.8-debian
>>>>>>>> Copyright (C) 2008 Free Software Foundation, Inc.
>>>>>>>> License GPLv3+: GNU GPL version 3 or later
>>>>>>>> <http://gnu.org/licenses/gpl.html>
>>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>>> There is NO WARRANTY, to the extent permitted by law.  Type "show
>>>>>>>> copying"
>>>>>>>> and "show warranty" for details.
>>>>>>>> This GDB was configured as "x86_64-linux-gnu"...
>>>>>>>> info line *(gdb) info line
>>>>>>>> *_ZN19BalancedThreadSched9run_timerEP5Timer+0x58a
>>>>>>>> Line 311 of
>>>>>>>>
>>>>>>>>
>>>>>>>> "/home/joonwpark/SRC5/click/linuxmodule/../include/click/routerthread.hh"
>>>>>>>>  starts at address 0x9c1ba
>>>>>>>> <_ZN19BalancedThreadSched9run_timerEP5Timer+1418>
>>>>>>>>  and ends at 0x9c1be
>>>>>>>> <_ZN19BalancedThreadSched9run_timerEP5Timer+1422>.
>>>>>>>> (gdb) l
>>>>>>>>
>>>>>>>>
>>>>>>>> "/home/joonwpark/SRC5/click/linuxmodule/../include/click/routerthread.hh:311
>>>>>>>> 306         assert(!current_thread_is_running());
>>>>>>>> 307         if (!scheduled)
>>>>>>>> 308             ++_task_blocker_waiting;
>>>>>>>> 309         while (1) {
>>>>>>>> 310             int32_t blocker = _task_blocker.value();
>>>>>>>> 311             if (blocker >= 0
>>>>>>>> 312                 && _task_blocker.compare_and_swap(blocker,
>>>>>>>> blocker +
>>>>>>>> 1))
>>>>>>>> 313                 break;
>>>>>>>> 314             if (nice) {
>>>>>>>> 315     #if CLICK_LINUXMODULE
>>>>>>>> (gdb)
>>>>>>>>
>>>>>>>>
>>>>>>>> 2008/9/15 Eddie Kohler <kohler at cs.ucla.edu>:
>>>>>>>>> Joonwoo,
>>>>>>>>>
>>>>>>>>>> I took look into this lock up issue and I think I found something.
>>>>>>>>>>
>>>>>>>>>> RoutherThread::driver() calls run_tasks() with locked tasks.
>>>>>>>>>> But after calling run_tasks(), current processor can be changed
>>>>>>>>>> since
>>>>>>>>>> schedule() might be called (eg. ScheduleLinux element)
>>>>>>>>>> So I think that's problem.  How do you think?
>>>>>>>>> I totally agree that this could be a problem.
>>>>>>>>>
>>>>>>>>> It looks like EXCLUSIVE handlers never really worked before. :(
>>>>>>>>>
>>>>>>>>> So my current analysis is this.  It is not appropriate for a thread
>>>>>>>>> to
>>>>>>>>> call
>>>>>>>>> blocking functions and/or schedule() when that thread has prevented
>>>>>>>>> preemption via get_cpu().  My prior patches prevented preemption.
>>>>>>>>>
>>>>>>>>> The solution is to separate "locking the task list" from "blocking
>>>>>>>>> task
>>>>>>>>> execution."  Clickfs, when executing an exclusive handler, "blocks
>>>>>>>>> task
>>>>>>>>> execution."  A thread that wants to examine the task list "locks"
>>>>>>>>> the
>>>>>>>>> list.
>>>>>>>>>
>>>>>>>>> This commit:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> http://www.read.cs.ucla.edu/gitweb?p=click;a=commit;h=ede0c6b0a1cface05e8d8e2e3496ee7fcd5ee143
>>>>>>>>> introduces separate APIs for locking the list and blocking task
>>>>>>>>> execution.
>>>>>>>>>  Exclusive handlers block task execution, but do not lock the task
>>>>>>>>> list.
>>>>>>>>>  I
>>>>>>>>> believe that task execution, in this patch, does not prevent
>>>>>>>>> preemption.
>>>>>>>>>  I
>>>>>>>>> believe the locking works out too.  User-level multithreading tests
>>>>>>>>> appear
>>>>>>>>> OK.
>>>>>>>>>
>>>>>>>>> Any willing stresstesters?  Pretty please? :)
>>>>>>>>>
>>>>>>>>> Eddie
>>>>>>>>>


More information about the click mailing list