[Click] [PATCH 2/2] Task: Kill process_pending dead lock
Joonwoo Park
joonwpark81 at gmail.com
Tue Nov 4 10:45:32 EST 2008
Hi Eddie,
Thanks for your work and I am very happy to help click project even
though it's just a little bit.
I have a quick question about your work. I think it could fix the
problem between block_tasks() and others.
However still it seems to have a problem between driver_lock_tasks()
and driver_lock_tasks().
What I'm concerning is like this:
access clickfs
multiple _task_blocker_waitings became 1
schedule
multiple routerthreads call driver_lock_tasks() at the same time
dead lock at code : while (!_task_blocker.compare_and_swap(0, -1))
How do you think?
Please correct me if I'm wrong and I'm sorry that I can't be help a lot.
Thanks,
Joonwoo
On Mon, Nov 3, 2008 at 6:32 PM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
> Hi Joonwoo,
>
> I appreciate all your work. Thanks for the time you have spent!
>
> After some poking around and a bunch of rebooting, I have a different
> analysis of the problem, and have checked in a patch. It is here:
>
> http://www.read.cs.ucla.edu/gitweb?p=click;a=commit;h=7312a95decddc7c4f5043d29d622dc9efb99a547
>
> Does this make sense? And if and when you get a chance, does it work for
> you?
>
> Eddie
>
>
> Joonwoo Park wrote:
>>
>> Hello Eddie,
>>
>> Thank you for your reviewing. I cannot take a look at the code, I'll
>> check my patch soon again as soon as I have a chance.
>> I am not using click for work nowadays. So it's pretty hard to spend
>> enough time for it.
>>
>> Anyhow, I have been turning on the kassert. However I couldn't see
>> assertion failure (both before & after patching)
>> It it make sense?
>>
>> Thanks,
>> Joonwoo
>>
>> On Mon, Nov 3, 2008 at 2:59 PM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>>>
>>> Joonwoo,
>>>
>>> I don't think this patch has any affect on the correctness of the code.
>>> It
>>> just slows things down.
>>>
>>> There are also bugs in the patch, including setting _task_blocker_owner
>>> in
>>> RouterThread::attempt_lock_tasks but not resetting it if the attempt
>>> fails.
>>>
>>> Have you run after having configured with --enable-kassert? If so, do
>>> you
>>> see any assertions? If not, could you please?
>>>
>>> I'd like to track this down, but this patch is not the way.
>>> Eddie
>>>
>>>
>>> Joonwoo Park wrote:
>>>>
>>>> Hello Eddie,
>>>>
>>>> I tried to fix task blocker to support nested locking and attached a
>>>> patch.
>>>> Can you please take a look at this? I've tested minimally.
>>>>
>>>> Thanks!
>>>> Joonwoo
>>>>
>>>> On Tue, Sep 16, 2008 at 9:26 AM, Joonwoo Park <joonwpark81 at gmail.com>
>>>> wrote:
>>>>>
>>>>> I am folking 3 threads.
>>>>>
>>>>> Joonwoo
>>>>>
>>>>> 2008/9/16 Eddie Kohler <kohler at cs.ucla.edu>:
>>>>>>
>>>>>> And how many threads?
>>>>>>
>>>>>> Eddie
>>>>>>
>>>>>>
>>>>>> Joonwoo Park wrote:
>>>>>>>
>>>>>>> Hi Eddie,
>>>>>>>
>>>>>>> I guess so that you intended to they are recursive. :-)
>>>>>>> Here is the config can cause lock up without device elements.
>>>>>>>
>>>>>>> ----
>>>>>>> s0::RatedSource(DATASIZE 128) -> EtherEncap(0x0800,
>>>>>>> FF:FF:FF:FF:FF:FF,
>>>>>>> FF:FF:FF:FF:FF:FF) -> Discard
>>>>>>> s1::InfiniteSource(DATASIZE 128) -> EtherEncap(0x0800,
>>>>>>> FF:FF:FF:FF:FF:FF, FF:FF:FF:FF:FF:FF) -> Discard
>>>>>>>
>>>>>>> sched::BalancedThreadSched(100);
>>>>>>> ----
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Joonwoo
>>>>>>>
>>>>>>> 2008/9/16 Eddie Kohler <kohler at cs.ucla.edu>:
>>>>>>>>
>>>>>>>> Hi Joonwoo,
>>>>>>>>
>>>>>>>> I intended block_tasks() and driver_lock_tasks() to be recursive. I
>>>>>>>> could
>>>>>>>> certainly have failed! Can you tell me more about the configuration
>>>>>>>> you're
>>>>>>>> running? Can you cause a soft lockup even without device elements
>>>>>>>> (such
>>>>>>>> as
>>>>>>>> with InfiniteSources)?
>>>>>>>>
>>>>>>>> Eddie
>>>>>>>>
>>>>>>>>
>>>>>>>> Joonwoo Park wrote:
>>>>>>>>>
>>>>>>>>> Hi Eddie,
>>>>>>>>>
>>>>>>>>> I agree with your blocking task execution as a solution.
>>>>>>>>> However I got a following soft lock up problem with your patch.
>>>>>>>>> With a quick review, it's seems to block_tasks() and driver_tasks()
>>>>>>>>> doesn't support recursive lock. (please correct me if I am wrong)
>>>>>>>>> So when BalancedThreadSched's run_timer try to lock the tasks, it
>>>>>>>>> looks like goes hang.
>>>>>>>>>
>>>>>>>>> Here is my oops message and gdb output. I used my 2.6.24 patched
>>>>>>>>> kernel. I'm sorry for that.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Joonwoo
>>>>>>>>>
>>>>>>>>> joonwpark at joonwpark-desktop-64:~/SRC5/click/linuxmodule$ BUG: soft
>>>>>>>>> lockup - CPU#0 stuck for 11s! [kclick:3116]
>>>>>>>>> SysRq : Changing Loglevel
>>>>>>>>> Loglevel set to 9
>>>>>>>>> BUG: soft lockup - CPU#0 stuck for 11s! [kclick:3116]
>>>>>>>>> CPU 0:
>>>>>>>>> Modules linked in: click proclikefs e1000 iptable_filter ip_tables
>>>>>>>>> x_tables parport_pc lp parport ipv6 floppy pcspkr forcedeth ext3
>>>>>>>>> jbd
>>>>>>>>> Pid: 3116, comm: kclick Not tainted 2.6.24.7-joonwpark #3
>>>>>>>>> RIP: 0010:[<ffffffff881f818a>] [<ffffffff881f818a>]
>>>>>>>>> :click:_ZN19BalancedThreadSched9run_timerEP5Timer+0x58a/0x630
>>>>>>>>> RSP: 0018:ffff8100370d7d30 EFLAGS: 00000286
>>>>>>>>> RAX: ffff8100370d4000 RBX: ffff8100370d7dc0 RCX: ffff810037892430
>>>>>>>>> RDX: 00000000ffffffff RSI: ffff81003792fcd0 RDI: ffff81003792fc60
>>>>>>>>> RBP: ffffffff806b7b10 R08: 0000000000000000 R09: 0000000000000000
>>>>>>>>> R10: 0000000000000000 R11: 0000000000000005 R12: 0000000000000001
>>>>>>>>> R13: ffff810080643000 R14: ffff8100370d6000 R15: 0000000000000001
>>>>>>>>> FS: 00002acdb07f76e0(0000) GS:ffffffff806ae000(0000)
>>>>>>>>> knlGS:0000000000000000
>>>>>>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>>>>>>> CR2: 00000000007ad008 CR3: 000000006bdf2000 CR4: 00000000000006e0
>>>>>>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>>>>>>>>
>>>>>>>>> Call Trace:
>>>>>>>>> [<ffffffff88166803>] :click:_Z12element_hookP5TimerPv+0x13/0x20
>>>>>>>>> [<ffffffff8818ebc8>] :click:_ZN6Master10run_timersEv+0x178/0x320
>>>>>>>>> [<ffffffff88183349>] :click:_ZN12RouterThread6driverEv+0x5b9/0x6f0
>>>>>>>>> [<ffffffff881f9ffe>] :click:_Z11click_schedPv+0xfe/0x260
>>>>>>>>> [<ffffffff804e4fef>] _spin_unlock_irq+0x2b/0x30
>>>>>>>>> [<ffffffff8022e0b6>] finish_task_switch+0x57/0x94
>>>>>>>>> [<ffffffff8020cfe8>] child_rip+0xa/0x12
>>>>>>>>> [<ffffffff8022e0b6>] finish_task_switch+0x57/0x94
>>>>>>>>> [<ffffffff8020c6ff>] restore_args+0x0/0x30
>>>>>>>>> [<ffffffff881f9f00>] :click:_Z11click_schedPv+0x0/0x260
>>>>>>>>> [<ffffffff8020cfde>] child_rip+0x0/0x12
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> joonwpark at joonwpark-desktop-64:~/SRC5/click/linuxmodule$ gdb
>>>>>>>>> click.ko
>>>>>>>>> GNU gdb 6.8-debian
>>>>>>>>> Copyright (C) 2008 Free Software Foundation, Inc.
>>>>>>>>> License GPLv3+: GNU GPL version 3 or later
>>>>>>>>> <http://gnu.org/licenses/gpl.html>
>>>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
>>>>>>>>> copying"
>>>>>>>>> and "show warranty" for details.
>>>>>>>>> This GDB was configured as "x86_64-linux-gnu"...
>>>>>>>>> info line *(gdb) info line
>>>>>>>>> *_ZN19BalancedThreadSched9run_timerEP5Timer+0x58a
>>>>>>>>> Line 311 of
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> "/home/joonwpark/SRC5/click/linuxmodule/../include/click/routerthread.hh"
>>>>>>>>> starts at address 0x9c1ba
>>>>>>>>> <_ZN19BalancedThreadSched9run_timerEP5Timer+1418>
>>>>>>>>> and ends at 0x9c1be
>>>>>>>>> <_ZN19BalancedThreadSched9run_timerEP5Timer+1422>.
>>>>>>>>> (gdb) l
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> "/home/joonwpark/SRC5/click/linuxmodule/../include/click/routerthread.hh:311
>>>>>>>>> 306 assert(!current_thread_is_running());
>>>>>>>>> 307 if (!scheduled)
>>>>>>>>> 308 ++_task_blocker_waiting;
>>>>>>>>> 309 while (1) {
>>>>>>>>> 310 int32_t blocker = _task_blocker.value();
>>>>>>>>> 311 if (blocker >= 0
>>>>>>>>> 312 && _task_blocker.compare_and_swap(blocker,
>>>>>>>>> blocker +
>>>>>>>>> 1))
>>>>>>>>> 313 break;
>>>>>>>>> 314 if (nice) {
>>>>>>>>> 315 #if CLICK_LINUXMODULE
>>>>>>>>> (gdb)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2008/9/15 Eddie Kohler <kohler at cs.ucla.edu>:
>>>>>>>>>>
>>>>>>>>>> Joonwoo,
>>>>>>>>>>
>>>>>>>>>>> I took look into this lock up issue and I think I found
>>>>>>>>>>> something.
>>>>>>>>>>>
>>>>>>>>>>> RoutherThread::driver() calls run_tasks() with locked tasks.
>>>>>>>>>>> But after calling run_tasks(), current processor can be changed
>>>>>>>>>>> since
>>>>>>>>>>> schedule() might be called (eg. ScheduleLinux element)
>>>>>>>>>>> So I think that's problem. How do you think?
>>>>>>>>>>
>>>>>>>>>> I totally agree that this could be a problem.
>>>>>>>>>>
>>>>>>>>>> It looks like EXCLUSIVE handlers never really worked before. :(
>>>>>>>>>>
>>>>>>>>>> So my current analysis is this. It is not appropriate for a
>>>>>>>>>> thread
>>>>>>>>>> to
>>>>>>>>>> call
>>>>>>>>>> blocking functions and/or schedule() when that thread has
>>>>>>>>>> prevented
>>>>>>>>>> preemption via get_cpu(). My prior patches prevented preemption.
>>>>>>>>>>
>>>>>>>>>> The solution is to separate "locking the task list" from "blocking
>>>>>>>>>> task
>>>>>>>>>> execution." Clickfs, when executing an exclusive handler, "blocks
>>>>>>>>>> task
>>>>>>>>>> execution." A thread that wants to examine the task list "locks"
>>>>>>>>>> the
>>>>>>>>>> list.
>>>>>>>>>>
>>>>>>>>>> This commit:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> http://www.read.cs.ucla.edu/gitweb?p=click;a=commit;h=ede0c6b0a1cface05e8d8e2e3496ee7fcd5ee143
>>>>>>>>>> introduces separate APIs for locking the list and blocking task
>>>>>>>>>> execution.
>>>>>>>>>> Exclusive handlers block task execution, but do not lock the task
>>>>>>>>>> list.
>>>>>>>>>> I
>>>>>>>>>> believe that task execution, in this patch, does not prevent
>>>>>>>>>> preemption.
>>>>>>>>>> I
>>>>>>>>>> believe the locking works out too. User-level multithreading
>>>>>>>>>> tests
>>>>>>>>>> appear
>>>>>>>>>> OK.
>>>>>>>>>>
>>>>>>>>>> Any willing stresstesters? Pretty please? :)
>>>>>>>>>>
>>>>>>>>>> Eddie
>>>>>>>>>>
>
More information about the click
mailing list