[Click] [PATCH 2/2] Task: Kill process_pending dead lock

Joonwoo Park joonwpark81 at gmail.com
Wed Nov 12 11:03:52 EST 2008


Hi Eddie,

It seems to work well now.
Thank you for your work!

Joonwoo

On Sat, Nov 8, 2008 at 11:21 PM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
> Hi Joonwoo,
>
> Your analysis helped me diagnose this.  What's missing from your analysis is
> that CPU 1 should always call unblock_task() eventually.  The only reason it
> wouldn't is that CPU 0 is monopolizing the CPU.  Just as in my last checkin,
> CPU 0 needs to call schedule() while waiting, rather than spinning.
>
> Please see
> http://www.read.cs.ucla.edu/gitweb?p=click;a=commit;h=939e4b0853ee82d59bf43b7e4ad25fe5b117c81c
>
> Does this work for you?
> Eddie
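
The fix Eddie describes above — calling schedule() while waiting, rather than spinning — can be sketched as a minimal user-level model.  This is illustrative only: std::atomic and std::this_thread::yield() stand in for Click's kernel-mode atomics and the kernel's schedule(); the names mirror the thread's discussion but are not Click's actual code.

```cpp
#include <atomic>
#include <thread>

// Model of _task_blocker: >= 0 means "N blockers hold the tasks blocked",
// -1 means "the driver holds the task lock".
std::atomic<int> task_blocker{0};

// Acquire the driver lock, yielding the CPU between failed attempts so the
// thread we are waiting for can run (the analogue of calling schedule()
// in-kernel instead of monopolizing the CPU).
void driver_lock_tasks_yielding() {
    int expected = 0;
    while (!task_blocker.compare_exchange_weak(expected, -1)) {
        expected = 0;               // compare_exchange rewrote it on failure
        std::this_thread::yield();  // stand-in for schedule(); avoids starving CPU 1
    }
}

void driver_unlock_tasks() {
    task_blocker.store(0);
}
```

With a pure spin loop, a waiting thread can starve the thread that must eventually call unblock_task(); yielding lets that thread run and drain the blocker count.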
>
>
> Joonwoo Park wrote:
>>
>> Hi Eddie,
>>
>> Thank you very much for your explanation.
>> I investigated it again, and I still think the deadlock problem exists.
>> Moreover, I managed to reproduce the soft-lockup problem.
>> Here is my analysis of the current task blocker.
>> I hope it helps you.
>>
>> ---------
>> * No one is holding the lock; both sides operate on the same RouterThread.
>> * CPU 1 is accessing clickfs.
>>
>> CPU 0               CPU 1                        _task_blocker_waiting  _task_blocker
>>                     schedule_block_task               0 -> 1            0
>> driver_lock_task                                      1                 0 -> -1
>>                     top of block_task(true)           1                -1           * spinning & scheduling
>> driver_unlock_task                                    1                -1 -> 0
>> driver_lock_task                                      1                 0 -> -1
>> driver_unlock_task                                    1                -1 -> 0
>>                     bottom of block_task(true)        1 -> 0            0 -> 1      * exiting spin
>> block_tasks                                           0                 1 -> *2*
>> unblock_tasks                                         0                 2 -> *1*
>> driver_lock_tasks                                     0                 1 -> never reaches -1  * hanging
>> ---------
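
The hang at the end of this trace can be replayed with a tiny single-threaded model.  This is a toy for illustration — a plain counter with the same convention the trace uses (_task_blocker >= 0 counts blockers, -1 is the driver lock), not Click's code; the bounded retry count stands in for a spin that would never exit.

```cpp
#include <cstdint>

// Toy compare-and-swap on a plain int32_t, mirroring the shape of
// _task_blocker.compare_and_swap(expect, desired).
static bool cas(int32_t& v, int32_t expect, int32_t desired) {
    if (v == expect) { v = desired; return true; }
    return false;
}

// driver_lock_tasks() with a bounded number of attempts, so the model
// terminates where the real code would spin forever.
static bool try_driver_lock(int32_t& blocker, int attempts) {
    while (attempts-- > 0)
        if (cas(blocker, 0, -1))
            return true;
    return false;  // the "never reaches -1" state in the trace
}

// Replay the tail of the interleaving: CPU 1 leaves the blocker at 1,
// so CPU 0's block/unblock pair nets out to 1 and the driver lock hangs.
static bool replay_hangs() {
    int32_t blocker = 0;
    blocker = 1;                             // CPU 1: bottom of block_task(true), 0 -> 1
    cas(blocker, 1, 2);                      // CPU 0: block_tasks,   1 -> 2
    cas(blocker, 2, 1);                      // CPU 0: unblock_tasks, 2 -> 1
    return !try_driver_lock(blocker, 1000);  // CPU 0: driver_lock_tasks never succeeds
}
```

The model shows why CAS(0, -1) can never succeed once the count is stuck at 1: nothing on CPU 0's side ever brings it back to 0.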
>>
>> I'm attaching it as a file as well, in case it's hard to read in some
>> email clients.
>>
>> Thanks!
>> Joonwoo
>>
>> On Tue, Nov 4, 2008 at 10:55 AM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>>>
>>> Yes, that's right.  ScheduleLinux() causes the current
>>> RouterThread::driver
>>> to yield to Linux.  No other kernel thread will start that
>>> RouterThread::driver.
>>>
>>> Eddie
>>>
>>>
>>> Joonwoo Park wrote:
>>>>
>>>> Hi Eddie,
>>>>
>>>> Does that hold even in a configuration that uses the ScheduleLinux() element?
>>>>
>>>> Joonwoo
>>>>
>>>> On Tue, Nov 4, 2008 at 9:12 AM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>>>>>
>>>>> Joonwoo,
>>>>>
>>>>> Multiple routerthreads will NEVER call driver_lock_tasks() on THE SAME
>>>>> ROUTERTHREAD OBJECT at the same time.
>>>>>
>>>>> Each routerthread has exactly ONE kernel thread, and only that kernel
>>>>> thread ever calls driver_lock_tasks().
>>>>>
>>>>> Does this help?
>>>>> Eddie
>>>>>
>>>>>
>>>>> Joonwoo Park wrote:
>>>>>>
>>>>>> Hi Eddie,
>>>>>>
>>>>>> Thanks for your work; I am very happy to help the Click project, even
>>>>>> if just a little bit.
>>>>>>
>>>>>> I have a quick question about your work.  I think it fixes the problem
>>>>>> between block_tasks() and the other functions.
>>>>>> However, there still seems to be a problem between two concurrent
>>>>>> calls to driver_lock_tasks().
>>>>>>
>>>>>> What concerns me is a sequence like this:
>>>>>>    access clickfs
>>>>>>    multiple _task_blocker_waitings become 1
>>>>>>    schedule
>>>>>>    multiple routerthreads call driver_lock_tasks() at the same time
>>>>>>    deadlock at: while (!_task_blocker.compare_and_swap(0, -1))
>>>>>>
>>>>>> What do you think?
>>>>>> Please correct me if I'm wrong; I'm sorry that I can't help more.
>>>>>>
>>>>>> Thanks,
>>>>>> Joonwoo
>>>>>>
>>>>>> On Mon, Nov 3, 2008 at 6:32 PM, Eddie Kohler <kohler at cs.ucla.edu>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Joonwoo,
>>>>>>>
>>>>>>> I appreciate all your work.  Thanks for the time you have spent!
>>>>>>>
>>>>>>> After some poking around and a bunch of rebooting, I have a different
>>>>>>> analysis of the problem, and have checked in a patch.  It is here:
>>>>>>>
>>>>>>> http://www.read.cs.ucla.edu/gitweb?p=click;a=commit;h=7312a95decddc7c4f5043d29d622dc9efb99a547
>>>>>>>
>>>>>>> Does this make sense?  And if and when you get a chance, does it work
>>>>>>> for you?
>>>>>>>
>>>>>>> Eddie
>>>>>>>
>>>>>>>
>>>>>>> Joonwoo Park wrote:
>>>>>>>>
>>>>>>>> Hello Eddie,
>>>>>>>>
>>>>>>>> Thank you for your review.  I cannot take a look at the code right
>>>>>>>> now; I'll check my patch again as soon as I have a chance.
>>>>>>>> I am not using Click for work nowadays, so it's pretty hard to spend
>>>>>>>> enough time on it.
>>>>>>>>
>>>>>>>> Anyhow, I have been running with kassert turned on.  However, I
>>>>>>>> couldn't see any assertion failures (both before and after patching).
>>>>>>>> Does that make sense?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Joonwoo
>>>>>>>>
>>>>>>>> On Mon, Nov 3, 2008 at 2:59 PM, Eddie Kohler <kohler at cs.ucla.edu>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Joonwoo,
>>>>>>>>>
>>>>>>>>> I don't think this patch has any effect on the correctness of the
>>>>>>>>> code.  It just slows things down.
>>>>>>>>>
>>>>>>>>> There are also bugs in the patch, including setting
>>>>>>>>> _task_blocker_owner in RouterThread::attempt_lock_tasks but not
>>>>>>>>> resetting it if the attempt fails.
>>>>>>>>>
>>>>>>>>> Have you run after having configured with --enable-kassert?  If so,
>>>>>>>>> do you see any assertions?  If not, could you please?
>>>>>>>>>
>>>>>>>>> I'd like to track this down, but this patch is not the way.
>>>>>>>>> Eddie
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Joonwoo Park wrote:
>>>>>>>>>>
>>>>>>>>>> Hello Eddie,
>>>>>>>>>>
>>>>>>>>>> I tried to fix the task blocker to support nested locking and
>>>>>>>>>> attached a patch.
>>>>>>>>>> Can you please take a look at it?  I've only tested it minimally.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> Joonwoo
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 16, 2008 at 9:26 AM, Joonwoo Park
>>>>>>>>>> <joonwpark81 at gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I am forking 3 threads.
>>>>>>>>>>>
>>>>>>>>>>> Joonwoo
>>>>>>>>>>>
>>>>>>>>>>> 2008/9/16 Eddie Kohler <kohler at cs.ucla.edu>:
>>>>>>>>>>>>
>>>>>>>>>>>> And how many threads?
>>>>>>>>>>>>
>>>>>>>>>>>> Eddie
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Joonwoo Park wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Eddie,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I guessed that you intended them to be recursive. :-)
>>>>>>>>>>>>> Here is a config that can cause the lockup without device elements.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ----
>>>>>>>>>>>>> s0::RatedSource(DATASIZE 128)
>>>>>>>>>>>>>     -> EtherEncap(0x0800, FF:FF:FF:FF:FF:FF, FF:FF:FF:FF:FF:FF)
>>>>>>>>>>>>>     -> Discard
>>>>>>>>>>>>> s1::InfiniteSource(DATASIZE 128)
>>>>>>>>>>>>>     -> EtherEncap(0x0800, FF:FF:FF:FF:FF:FF, FF:FF:FF:FF:FF:FF)
>>>>>>>>>>>>>     -> Discard
>>>>>>>>>>>>>
>>>>>>>>>>>>> sched::BalancedThreadSched(100);
>>>>>>>>>>>>> ----
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>> Joonwoo
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2008/9/16 Eddie Kohler <kohler at cs.ucla.edu>:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Joonwoo,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I intended block_tasks() and driver_lock_tasks() to be
>>>>>>>>>>>>>> recursive.  I could certainly have failed!  Can you tell me
>>>>>>>>>>>>>> more about the configuration you're running?  Can you cause a
>>>>>>>>>>>>>> soft lockup even without device elements (such as with
>>>>>>>>>>>>>> InfiniteSources)?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Eddie
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Joonwoo Park wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Eddie,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I agree with blocking task execution as a solution.
>>>>>>>>>>>>>>> However, I got the following soft-lockup problem with your
>>>>>>>>>>>>>>> patch.  From a quick review, it seems that block_tasks() and
>>>>>>>>>>>>>>> driver_tasks() don't support recursive locking (please correct
>>>>>>>>>>>>>>> me if I am wrong).  So when BalancedThreadSched's run_timer
>>>>>>>>>>>>>>> tries to lock the tasks, it appears to hang.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here is my oops message and gdb output.  I used my patched
>>>>>>>>>>>>>>> 2.6.24 kernel; I'm sorry for that.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Joonwoo
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> joonwpark at joonwpark-desktop-64:~/SRC5/click/linuxmodule$
>>>>>>>>>>>>>>> BUG: soft lockup - CPU#0 stuck for 11s! [kclick:3116]
>>>>>>>>>>>>>>> SysRq : Changing Loglevel
>>>>>>>>>>>>>>> Loglevel set to 9
>>>>>>>>>>>>>>> BUG: soft lockup - CPU#0 stuck for 11s! [kclick:3116]
>>>>>>>>>>>>>>> CPU 0:
>>>>>>>>>>>>>>> Modules linked in: click proclikefs e1000 iptable_filter
>>>>>>>>>>>>>>> ip_tables x_tables parport_pc lp parport ipv6 floppy pcspkr
>>>>>>>>>>>>>>> forcedeth ext3 jbd
>>>>>>>>>>>>>>> Pid: 3116, comm: kclick Not tainted 2.6.24.7-joonwpark #3
>>>>>>>>>>>>>>> RIP: 0010:[<ffffffff881f818a>]  [<ffffffff881f818a>]
>>>>>>>>>>>>>>> :click:_ZN19BalancedThreadSched9run_timerEP5Timer+0x58a/0x630
>>>>>>>>>>>>>>> RSP: 0018:ffff8100370d7d30  EFLAGS: 00000286
>>>>>>>>>>>>>>> RAX: ffff8100370d4000 RBX: ffff8100370d7dc0 RCX: ffff810037892430
>>>>>>>>>>>>>>> RDX: 00000000ffffffff RSI: ffff81003792fcd0 RDI: ffff81003792fc60
>>>>>>>>>>>>>>> RBP: ffffffff806b7b10 R08: 0000000000000000 R09: 0000000000000000
>>>>>>>>>>>>>>> R10: 0000000000000000 R11: 0000000000000005 R12: 0000000000000001
>>>>>>>>>>>>>>> R13: ffff810080643000 R14: ffff8100370d6000 R15: 0000000000000001
>>>>>>>>>>>>>>> FS:  00002acdb07f76e0(0000) GS:ffffffff806ae000(0000) knlGS:0000000000000000
>>>>>>>>>>>>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>>>>>>>>>>>>> CR2: 00000000007ad008 CR3: 000000006bdf2000 CR4: 00000000000006e0
>>>>>>>>>>>>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>>>>>>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Call Trace:
>>>>>>>>>>>>>>>  [<ffffffff88166803>] :click:_Z12element_hookP5TimerPv+0x13/0x20
>>>>>>>>>>>>>>>  [<ffffffff8818ebc8>] :click:_ZN6Master10run_timersEv+0x178/0x320
>>>>>>>>>>>>>>>  [<ffffffff88183349>] :click:_ZN12RouterThread6driverEv+0x5b9/0x6f0
>>>>>>>>>>>>>>>  [<ffffffff881f9ffe>] :click:_Z11click_schedPv+0xfe/0x260
>>>>>>>>>>>>>>>  [<ffffffff804e4fef>] _spin_unlock_irq+0x2b/0x30
>>>>>>>>>>>>>>>  [<ffffffff8022e0b6>] finish_task_switch+0x57/0x94
>>>>>>>>>>>>>>>  [<ffffffff8020cfe8>] child_rip+0xa/0x12
>>>>>>>>>>>>>>>  [<ffffffff8022e0b6>] finish_task_switch+0x57/0x94
>>>>>>>>>>>>>>>  [<ffffffff8020c6ff>] restore_args+0x0/0x30
>>>>>>>>>>>>>>>  [<ffffffff881f9f00>] :click:_Z11click_schedPv+0x0/0x260
>>>>>>>>>>>>>>>  [<ffffffff8020cfde>] child_rip+0x0/0x12
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> joonwpark at joonwpark-desktop-64:~/SRC5/click/linuxmodule$ gdb click.ko
>>>>>>>>>>>>>>> GNU gdb 6.8-debian
>>>>>>>>>>>>>>> Copyright (C) 2008 Free Software Foundation, Inc.
>>>>>>>>>>>>>>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>>>>>>>>>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>>>>>>>>>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
>>>>>>>>>>>>>>> and "show warranty" for details.
>>>>>>>>>>>>>>> This GDB was configured as "x86_64-linux-gnu"...
>>>>>>>>>>>>>>> (gdb) info line *_ZN19BalancedThreadSched9run_timerEP5Timer+0x58a
>>>>>>>>>>>>>>> Line 311 of "/home/joonwpark/SRC5/click/linuxmodule/../include/click/routerthread.hh"
>>>>>>>>>>>>>>>    starts at address 0x9c1ba <_ZN19BalancedThreadSched9run_timerEP5Timer+1418>
>>>>>>>>>>>>>>>    and ends at 0x9c1be <_ZN19BalancedThreadSched9run_timerEP5Timer+1422>.
>>>>>>>>>>>>>>> (gdb) l
>>>>>>>>>>>>>>> "/home/joonwpark/SRC5/click/linuxmodule/../include/click/routerthread.hh:311
>>>>>>>>>>>>>>> 306         assert(!current_thread_is_running());
>>>>>>>>>>>>>>> 307         if (!scheduled)
>>>>>>>>>>>>>>> 308             ++_task_blocker_waiting;
>>>>>>>>>>>>>>> 309         while (1) {
>>>>>>>>>>>>>>> 310             int32_t blocker = _task_blocker.value();
>>>>>>>>>>>>>>> 311             if (blocker >= 0
>>>>>>>>>>>>>>> 312                 && _task_blocker.compare_and_swap(blocker, blocker + 1))
>>>>>>>>>>>>>>> 313                 break;
>>>>>>>>>>>>>>> 314             if (nice) {
>>>>>>>>>>>>>>> 315     #if CLICK_LINUXMODULE
>>>>>>>>>>>>>>> (gdb)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2008/9/15 Eddie Kohler <kohler at cs.ucla.edu>:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Joonwoo,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I took a look into this lockup issue and I think I found
>>>>>>>>>>>>>>>>> something.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> RouterThread::driver() calls run_tasks() with the tasks
>>>>>>>>>>>>>>>>> locked.  But after calling run_tasks(), the current
>>>>>>>>>>>>>>>>> processor can change, since schedule() might be called
>>>>>>>>>>>>>>>>> (e.g. by the ScheduleLinux element).
>>>>>>>>>>>>>>>>> So I think that's the problem.  What do you think?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I totally agree that this could be a problem.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It looks like EXCLUSIVE handlers never really worked before. :(
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So my current analysis is this.  It is not appropriate for a
>>>>>>>>>>>>>>>> thread to call blocking functions and/or schedule() when that
>>>>>>>>>>>>>>>> thread has prevented preemption via get_cpu().  My prior
>>>>>>>>>>>>>>>> patches prevented preemption.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The solution is to separate "locking the task list" from
>>>>>>>>>>>>>>>> "blocking task execution."  Clickfs, when executing an
>>>>>>>>>>>>>>>> exclusive handler, "blocks task execution."  A thread that
>>>>>>>>>>>>>>>> wants to examine the task list "locks" the list.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This commit:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> http://www.read.cs.ucla.edu/gitweb?p=click;a=commit;h=ede0c6b0a1cface05e8d8e2e3496ee7fcd5ee143
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> introduces separate APIs for locking the list and blocking
>>>>>>>>>>>>>>>> task execution.  Exclusive handlers block task execution, but
>>>>>>>>>>>>>>>> do not lock the task list.  I believe that task execution, in
>>>>>>>>>>>>>>>> this patch, does not prevent preemption.  I believe the
>>>>>>>>>>>>>>>> locking works out too.  User-level multithreading tests
>>>>>>>>>>>>>>>> appear OK.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Any willing stresstesters?  Pretty please? :)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Eddie
>>>>>>>>>>>>>>>>
>
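
The separation the thread converges on — "blocking task execution" versus "locking the task list" — can be sketched as a minimal user-level model.  The structure and names here are assumptions for illustration, not the actual commit: a mutex guards the list, an atomic count blocks execution, and the driver skips running tasks while anyone holds the block.

```cpp
#include <atomic>
#include <functional>
#include <mutex>
#include <vector>

// A model RouterThread where list access and execution blocking are distinct:
// an exclusive handler only bumps the execution block (so the driver stops
// running tasks), while threads that modify the task list take only the mutex.
struct ThreadModel {
    std::mutex task_list_lock;            // "locking the task list"
    std::atomic<int> task_blocker{0};     // "blocking task execution"
    std::vector<std::function<void()>> tasks;

    void block_tasks()   { ++task_blocker; }
    void unblock_tasks() { --task_blocker; }

    // Needs only the list lock, even while execution is blocked.
    void add_task(std::function<void()> t) {
        std::lock_guard<std::mutex> guard(task_list_lock);
        tasks.push_back(std::move(t));
    }

    // One driver pass: skip execution entirely while anyone blocks it.
    // Returns the number of tasks that ran.
    int run_once() {
        if (task_blocker.load() > 0)
            return 0;
        std::lock_guard<std::mutex> guard(task_list_lock);
        int ran = 0;
        for (auto& t : tasks) { t(); ++ran; }
        return ran;
    }
};
```

Because the two concerns never nest through the same counter, a blocked driver cannot deadlock a thread that only wants to edit the list — which is the property the commits in this thread aim for.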

