[Click] [PATCH 2/2] Task: Kill process_pending dead lock

Eddie Kohler kohler at cs.ucla.edu
Tue Nov 4 13:55:58 EST 2008


Yes, that's right.  ScheduleLinux() causes the current RouterThread::driver to 
yield to Linux.  No other kernel thread will start that RouterThread::driver.

Eddie


Joonwoo Park wrote:
> Hi Eddie,
> 
> Even though a situation which is using ScheduleLinux() element?
> 
> Joonwoo
> 
> On Tue, Nov 4, 2008 at 9:12 AM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>> Joonwoo,
>>
>> Multiple routerthreads will NEVER call driver_lock_tasks() on THE SAME
>> ROUTERTHREAD OBJECT at the same time.
>>
>> Each routerthread has exactly ONE kernel thread, and only that kernel thread
>> ever calls driver_lock_tasks().
>>
>> Does this help?
>> Eddie
>>
>>
>> Joonwoo Park wrote:
>>> Hi Eddie,
>>>
>>> Thanks for your work and I am very happy to help click project even
>>> though it's just a little bit.
>>>
>>> I have a quick question about your work.  I think it could fix the
>>> problem between block_tasks() and others.
>>> However still it seems to have a problem between driver_lock_tasks()
>>> and driver_lock_tasks().
>>>
>>> What I'm concerning is like this:
>>> access clickfs
>>>      multiple _task_blocker_waitings became 1
>>>      schedule
>>>      multiple routerthreads call driver_lock_tasks() at the same time
>>>      dead lock at code : while (!_task_blocker.compare_and_swap(0, -1))
>>>
>>> How do you think?
>>> Please correct me if I'm wrong and I'm sorry that I can't be help a lot.
>>>
>>> Thanks,
>>> Joonwoo
>>>
>>> On Mon, Nov 3, 2008 at 6:32 PM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>>>> Hi Joonwoo,
>>>>
>>>> I appreciate all your work.  Thanks for the time you have spent!
>>>>
>>>> After some poking around and a bunch of rebooting, I have a different
>>>> analysis of the problem, and have checked in a patch.  It is here:
>>>>
>>>>
>>>> http://www.read.cs.ucla.edu/gitweb?p=click;a=commit;h=7312a95decddc7c4f5043d29d622dc9efb99a547
>>>>
>>>> Does this make sense?  And if and when you get a chance, does it work for
>>>> you?
>>>>
>>>> Eddie
>>>>
>>>>
>>>> Joonwoo Park wrote:
>>>>> Hello Eddie,
>>>>>
>>>>> Thank you for your reviewing.  I cannot take a look at the code, I'll
>>>>> check my patch soon again as soon as I have a chance.
>>>>> I am not using click for work nowadays.  So it's pretty hard to spend
>>>>> enough time for it.
>>>>>
>>>>> Anyhow, I have been turning on the kassert.  However I couldn't see
>>>>> assertion failure (both before & after patching)
>>>>> It it make sense?
>>>>>
>>>>> Thanks,
>>>>> Joonwoo
>>>>>
>>>>> On Mon, Nov 3, 2008 at 2:59 PM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>>>>>> Joonwoo,
>>>>>>
>>>>>> I don't think this patch has any affect on the correctness of the code.
>>>>>>  It
>>>>>> just slows things down.
>>>>>>
>>>>>> There are also bugs in the patch, including setting _task_blocker_owner
>>>>>> in
>>>>>> RouterThread::attempt_lock_tasks but not resetting it if the attempt
>>>>>> fails.
>>>>>>
>>>>>> Have you run after having configured with --enable-kassert?  If so, do
>>>>>> you
>>>>>> see any assertions?  If not, could you please?
>>>>>>
>>>>>> I'd like to track this down, but this patch is not the way.
>>>>>> Eddie
>>>>>>
>>>>>>
>>>>>> Joonwoo Park wrote:
>>>>>>> Hello Eddie,
>>>>>>>
>>>>>>> I tried to fix task blocker to support nested locking and attached a
>>>>>>> patch.
>>>>>>> Can you please take a look at this?  I've tested minimally.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Joonwoo
>>>>>>>
>>>>>>> On Tue, Sep 16, 2008 at 9:26 AM, Joonwoo Park <joonwpark81 at gmail.com>
>>>>>>> wrote:
>>>>>>>> I am folking 3 threads.
>>>>>>>>
>>>>>>>> Joonwoo
>>>>>>>>
>>>>>>>> 2008/9/16 Eddie Kohler <kohler at cs.ucla.edu>:
>>>>>>>>> And how many threads?
>>>>>>>>>
>>>>>>>>> Eddie
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Joonwoo Park wrote:
>>>>>>>>>> Hi Eddie,
>>>>>>>>>>
>>>>>>>>>> I guess so that you intended to they are recursive. :-)
>>>>>>>>>> Here is the config can cause lock up without device elements.
>>>>>>>>>>
>>>>>>>>>> ----
>>>>>>>>>> s0::RatedSource(DATASIZE 128) -> EtherEncap(0x0800,
>>>>>>>>>> FF:FF:FF:FF:FF:FF,
>>>>>>>>>> FF:FF:FF:FF:FF:FF) -> Discard
>>>>>>>>>> s1::InfiniteSource(DATASIZE 128) -> EtherEncap(0x0800,
>>>>>>>>>> FF:FF:FF:FF:FF:FF, FF:FF:FF:FF:FF:FF) -> Discard
>>>>>>>>>>
>>>>>>>>>> sched::BalancedThreadSched(100);
>>>>>>>>>> ----
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> Joonwoo
>>>>>>>>>>
>>>>>>>>>> 2008/9/16 Eddie Kohler <kohler at cs.ucla.edu>:
>>>>>>>>>>> Hi Joonwoo,
>>>>>>>>>>>
>>>>>>>>>>> I intended block_tasks() and driver_lock_tasks() to be recursive.
>>>>>>>>>>>  I
>>>>>>>>>>> could
>>>>>>>>>>> certainly have failed!  Can you tell me more about the
>>>>>>>>>>> configuration
>>>>>>>>>>> you're
>>>>>>>>>>> running?  Can you cause a soft lockup even without device elements
>>>>>>>>>>> (such
>>>>>>>>>>> as
>>>>>>>>>>> with InfiniteSources)?
>>>>>>>>>>>
>>>>>>>>>>> Eddie
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Joonwoo Park wrote:
>>>>>>>>>>>> Hi Eddie,
>>>>>>>>>>>>
>>>>>>>>>>>> I agree with your blocking task execution as a solution.
>>>>>>>>>>>> However I got a following soft lock up problem with your patch.
>>>>>>>>>>>> With a quick review, it's seems to block_tasks() and
>>>>>>>>>>>> driver_tasks()
>>>>>>>>>>>> doesn't support recursive lock. (please correct me if I am wrong)
>>>>>>>>>>>> So when BalancedThreadSched's run_timer try to lock the tasks, it
>>>>>>>>>>>> looks like goes hang.
>>>>>>>>>>>>
>>>>>>>>>>>> Here is my oops message and gdb output.  I used my 2.6.24 patched
>>>>>>>>>>>> kernel. I'm sorry for that.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Joonwoo
>>>>>>>>>>>>
>>>>>>>>>>>> joonwpark at joonwpark-desktop-64:~/SRC5/click/linuxmodule$ BUG:
>>>>>>>>>>>> soft
>>>>>>>>>>>> lockup - CPU#0 stuck for 11s! [kclick:3116]
>>>>>>>>>>>> SysRq : Changing Loglevel
>>>>>>>>>>>> Loglevel set to 9
>>>>>>>>>>>> BUG: soft lockup - CPU#0 stuck for 11s! [kclick:3116]
>>>>>>>>>>>> CPU 0:
>>>>>>>>>>>> Modules linked in: click proclikefs e1000 iptable_filter
>>>>>>>>>>>> ip_tables
>>>>>>>>>>>> x_tables parport_pc lp parport ipv6 floppy pcspkr forcedeth ext3
>>>>>>>>>>>> jbd
>>>>>>>>>>>> Pid: 3116, comm: kclick Not tainted 2.6.24.7-joonwpark #3
>>>>>>>>>>>> RIP: 0010:[<ffffffff881f818a>]  [<ffffffff881f818a>]
>>>>>>>>>>>> :click:_ZN19BalancedThreadSched9run_timerEP5Timer+0x58a/0x630
>>>>>>>>>>>> RSP: 0018:ffff8100370d7d30  EFLAGS: 00000286
>>>>>>>>>>>> RAX: ffff8100370d4000 RBX: ffff8100370d7dc0 RCX: ffff810037892430
>>>>>>>>>>>> RDX: 00000000ffffffff RSI: ffff81003792fcd0 RDI: ffff81003792fc60
>>>>>>>>>>>> RBP: ffffffff806b7b10 R08: 0000000000000000 R09: 0000000000000000
>>>>>>>>>>>> R10: 0000000000000000 R11: 0000000000000005 R12: 0000000000000001
>>>>>>>>>>>> R13: ffff810080643000 R14: ffff8100370d6000 R15: 0000000000000001
>>>>>>>>>>>> FS:  00002acdb07f76e0(0000) GS:ffffffff806ae000(0000)
>>>>>>>>>>>> knlGS:0000000000000000
>>>>>>>>>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>>>>>>>>>> CR2: 00000000007ad008 CR3: 000000006bdf2000 CR4: 00000000000006e0
>>>>>>>>>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>>>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>>>>>>>>>>>
>>>>>>>>>>>> Call Trace:
>>>>>>>>>>>>  [<ffffffff88166803>] :click:_Z12element_hookP5TimerPv+0x13/0x20
>>>>>>>>>>>>  [<ffffffff8818ebc8>] :click:_ZN6Master10run_timersEv+0x178/0x320
>>>>>>>>>>>>  [<ffffffff88183349>]
>>>>>>>>>>>> :click:_ZN12RouterThread6driverEv+0x5b9/0x6f0
>>>>>>>>>>>>  [<ffffffff881f9ffe>] :click:_Z11click_schedPv+0xfe/0x260
>>>>>>>>>>>>  [<ffffffff804e4fef>] _spin_unlock_irq+0x2b/0x30
>>>>>>>>>>>>  [<ffffffff8022e0b6>] finish_task_switch+0x57/0x94
>>>>>>>>>>>>  [<ffffffff8020cfe8>] child_rip+0xa/0x12
>>>>>>>>>>>>  [<ffffffff8022e0b6>] finish_task_switch+0x57/0x94
>>>>>>>>>>>>  [<ffffffff8020c6ff>] restore_args+0x0/0x30
>>>>>>>>>>>>  [<ffffffff881f9f00>] :click:_Z11click_schedPv+0x0/0x260
>>>>>>>>>>>>  [<ffffffff8020cfde>] child_rip+0x0/0x12
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> joonwpark at joonwpark-desktop-64:~/SRC5/click/linuxmodule$ gdb
>>>>>>>>>>>> click.ko
>>>>>>>>>>>> GNU gdb 6.8-debian
>>>>>>>>>>>> Copyright (C) 2008 Free Software Foundation, Inc.
>>>>>>>>>>>> License GPLv3+: GNU GPL version 3 or later
>>>>>>>>>>>> <http://gnu.org/licenses/gpl.html>
>>>>>>>>>>>> This is free software: you are free to change and redistribute
>>>>>>>>>>>> it.
>>>>>>>>>>>> There is NO WARRANTY, to the extent permitted by law.  Type "show
>>>>>>>>>>>> copying"
>>>>>>>>>>>> and "show warranty" for details.
>>>>>>>>>>>> This GDB was configured as "x86_64-linux-gnu"...
>>>>>>>>>>>> info line *(gdb) info line
>>>>>>>>>>>> *_ZN19BalancedThreadSched9run_timerEP5Timer+0x58a
>>>>>>>>>>>> Line 311 of
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> "/home/joonwpark/SRC5/click/linuxmodule/../include/click/routerthread.hh"
>>>>>>>>>>>>  starts at address 0x9c1ba
>>>>>>>>>>>> <_ZN19BalancedThreadSched9run_timerEP5Timer+1418>
>>>>>>>>>>>>  and ends at 0x9c1be
>>>>>>>>>>>> <_ZN19BalancedThreadSched9run_timerEP5Timer+1422>.
>>>>>>>>>>>> (gdb) l
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> "/home/joonwpark/SRC5/click/linuxmodule/../include/click/routerthread.hh:311
>>>>>>>>>>>> 306         assert(!current_thread_is_running());
>>>>>>>>>>>> 307         if (!scheduled)
>>>>>>>>>>>> 308             ++_task_blocker_waiting;
>>>>>>>>>>>> 309         while (1) {
>>>>>>>>>>>> 310             int32_t blocker = _task_blocker.value();
>>>>>>>>>>>> 311             if (blocker >= 0
>>>>>>>>>>>> 312                 && _task_blocker.compare_and_swap(blocker,
>>>>>>>>>>>> blocker +
>>>>>>>>>>>> 1))
>>>>>>>>>>>> 313                 break;
>>>>>>>>>>>> 314             if (nice) {
>>>>>>>>>>>> 315     #if CLICK_LINUXMODULE
>>>>>>>>>>>> (gdb)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2008/9/15 Eddie Kohler <kohler at cs.ucla.edu>:
>>>>>>>>>>>>> Joonwoo,
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I took look into this lock up issue and I think I found
>>>>>>>>>>>>>> something.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> RoutherThread::driver() calls run_tasks() with locked tasks.
>>>>>>>>>>>>>> But after calling run_tasks(), current processor can be changed
>>>>>>>>>>>>>> since
>>>>>>>>>>>>>> schedule() might be called (eg. ScheduleLinux element)
>>>>>>>>>>>>>> So I think that's problem.  How do you think?
>>>>>>>>>>>>> I totally agree that this could be a problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It looks like EXCLUSIVE handlers never really worked before. :(
>>>>>>>>>>>>>
>>>>>>>>>>>>> So my current analysis is this.  It is not appropriate for a
>>>>>>>>>>>>> thread
>>>>>>>>>>>>> to
>>>>>>>>>>>>> call
>>>>>>>>>>>>> blocking functions and/or schedule() when that thread has
>>>>>>>>>>>>> prevented
>>>>>>>>>>>>> preemption via get_cpu().  My prior patches prevented
>>>>>>>>>>>>> preemption.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The solution is to separate "locking the task list" from
>>>>>>>>>>>>> "blocking
>>>>>>>>>>>>> task
>>>>>>>>>>>>> execution."  Clickfs, when executing an exclusive handler,
>>>>>>>>>>>>> "blocks
>>>>>>>>>>>>> task
>>>>>>>>>>>>> execution."  A thread that wants to examine the task list
>>>>>>>>>>>>> "locks"
>>>>>>>>>>>>> the
>>>>>>>>>>>>> list.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This commit:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://www.read.cs.ucla.edu/gitweb?p=click;a=commit;h=ede0c6b0a1cface05e8d8e2e3496ee7fcd5ee143
>>>>>>>>>>>>> introduces separate APIs for locking the list and blocking task
>>>>>>>>>>>>> execution.
>>>>>>>>>>>>>  Exclusive handlers block task execution, but do not lock the
>>>>>>>>>>>>> task
>>>>>>>>>>>>> list.
>>>>>>>>>>>>>  I
>>>>>>>>>>>>> believe that task execution, in this patch, does not prevent
>>>>>>>>>>>>> preemption.
>>>>>>>>>>>>>  I
>>>>>>>>>>>>> believe the locking works out too.  User-level multithreading
>>>>>>>>>>>>> tests
>>>>>>>>>>>>> appear
>>>>>>>>>>>>> OK.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any willing stresstesters?  Pretty please? :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Eddie
>>>>>>>>>>>>>


More information about the click mailing list