[Click] Spinlock Problems

Eddie Kohler kohler at cs.ucla.edu
Sat Sep 6 08:03:33 EDT 2008


Guys,

Thank you so much for finding this issue!

Kevin, I've checked in your patch.  But I wonder whether something else would 
be better.  I forget, now, why the Linux version of RouterThread uses a 
non-recursive spinlock_t rather than a recursive Spinlock.  Could someone 
who's got a performance testing setup check out whether the system is 
appreciably slower with a Spinlock?  Something like the attached (untested)?

Eddie


springbo at cs.wisc.edu wrote:
> Perfect Joonwoo!
> 
> commit 381b5f8ff5fe2a1d7f26ac39b5591b99c4365bae (Write handlers become
> EXCLUSIVE by default) made all write handlers exclusive by default. When
> click-uninstall writes to /click/config to clear the configuration it is
> done with a now exclusive write handler (requiring all threads to be
> locked). The action of the write handler then attempts to lock all tasks
> again routerthread.cc:633. The write handler action is attempting to grab
> a spinlock it already holds -> Deadlock.
> 
> The attached patch simply marks the write handler as NONEXCLUSIVE. The
> proper fix would be to remove lock_tasks() from
> RouterThread::unschedule_router_tasks,  if you can guarantee all routes
> into the function have already locked the tasks.
> 
> Thanks!
> Kevin
> 
> 
> 
>> Hi all,
>>
>> I bisected tree for this issue, and found following commit is the
>> first bad commit.
>>
>>  commit a7dd5c93a51b3b5b8cd4a177ef379492c27859bf
>>  click-combine documentation: Mention that RouterLink is fake.
>>
>> Any idea?
>>
>> Thanks,
>> Joonwoo
>>
>> 2008/9/3  <springbo at cs.wisc.edu>:
>>> Hi Eddie,
>>>
>>> Sorry for the wait. Today I pulled the latest sources and rebuilt a
>>> vanilla click kernel on a 64-bit 4 core Xeon machine (with the click
>>> e1000
>>> driver installed). The locking seems to be working fine, but
>>> 'click-uninstall' is resulting in a lockup. See the attached dumps
>>> (click
>>> was started with click-install -t 3 in both cases). I still see the
>>> lockup
>>> with -t 1.
>>>
>>> Click uninstalls without problem with our version from a couple of weeks
>>> back.
>>>
>>> Unrelated, but I also saw:
>>> ../include/click/timestamp.hh: In constructor
>>> 'Timestamp::Timestamp(double)':
>>> ../include/click/timestamp.hh:938: warning: converting to 'int64_t' from
>>> 'double'
>>> all over when building userlevel click.
>>>
>>> Sorry I don't have the time to dive into this further right now.
>>>
>>> Thanks!
>>> Kevin
>>>
>>>
>>>
>>>
>>>> Kevin,
>>>>
>>>> Is there any chance you'd be able to try the newer Spinlock code any
>>>> time
>>>> soon?
>>>>
>>>> Eddie
>>>>
>>>>
>>>> springbo at cs.wisc.edu wrote:
>>>>> Hi,
>>>>>
>>>>> I think there is a concurrency issue with Spinlocks in linuxmodule
>>>>> multi-threaded click (running a 2.6.19.2 patched kernel, the
>>>>> e1000-NAPI
>>>>> driver and today's trunk). I've put together a temporary patch, but
>>>>> there
>>>>> might be further issues. Sorry, I don't have the time to investigate
>>>>> more.
>>>>>
>>>>>
>>>>> Problems:
>>>>> The series of events are:
>>>>> -Thread A is running on CPU 0 and thread B is running on CPU 1.
>>>>> -A acquires the Spinlock and executes code in the critical section.
>>>>> -Linux schedules Thread A to run on CPU 1 and thread B to run on CPU
>>>>> 0.
>>>>> -B acquires the Spinlock (works because CPU 0 is the owner of the
>>>>> lock,
>>>>> not the thread)
>>>>> -A and B are both operating in the critical section
>>>>> -(if you're lucky) A releases the Spinlock and generates a "releasing
>>>>> someone else's lock" message
>>>>> -(if you're unlucky) Oops
>>>>>
>>>>> Attached is a test element and a configuration to elicit the behavior.
>>>>> I
>>>>> could only replicate the problem when using a polling input source
>>>>> under
>>>>> heavy load.
>>>>>
>>>>> After fixing the above problem I was still seeing two threads inside
>>>>> the
>>>>> CS. For some reason the atomic_uint32_t::swap was not working
>>>>> atomically
>>>>> as expected. Changing this to atomic_uint32::compare_and_swap solved
>>>>> the
>>>>> problem. We're running x86_64 quad core Xeon CPUs.
>>>>>
>>>>>
>>>>> Solution:
>>>>> In the patch attached I added a new function click_current_thread()
>>>>> which
>>>>> returns a thread_info pointer. Inside of Spinlock I replace
>>>>> click_current_processor() uses with click_current_thread(). This way
>>>>> even
>>>>> if the thread is assigned to another core it will still have the same
>>>>> thread_info pointer.
>>>>>
>>>>> The problem with this solution is that it does not work as an index
>>>>> into
>>>>> an array nicely and cannot simply replace all uses of
>>>>> click_current_processor(). There may be other places in the code which
>>>>> make the same incorrect use of click_current_processor(), but a better
>>>>> approach will be needed if the value is used to index into an array
>>>>> (such
>>>>> as in ReadWriteLock).
>>>>>
>>>>>
>>>>> Thanks!
>>>>> Kevin Springborn
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>> _______________________________________________
>>>>> click mailing list
>>>>> click at amsterdam.lcs.mit.edu
>>>>> https://amsterdam.lcs.mit.edu/mailman/listinfo/click
>>> _______________________________________________
>>> click mailing list
>>> click at amsterdam.lcs.mit.edu
>>> https://amsterdam.lcs.mit.edu/mailman/listinfo/click
>>>
>>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: linuxmodule-uses-Spinlock.diff
Type: text/x-patch
Size: 2693 bytes
Desc: not available
Url : http://amsterdam.lcs.mit.edu/pipermail/click/attachments/20080906/46199eec/attachment-0001.bin 


More information about the click mailing list