[Click] Race condition with FullNoteQueue signaling in multithreading
Beyers Cronje
bcronje at gmail.com
Wed Oct 20 06:41:58 EDT 2010
Thanks for the heads up, I will update.
On Wed, Oct 20, 2010 at 8:16 AM, Cliff Frey <cliff at meraki.com> wrote:
> Beyers,
>
> Eddie and I discovered and fixed a few other (very evil) memory barrier
> bugs, so you should probably update.
>
> Cliff
>
> On Fri, Oct 15, 2010 at 7:40 AM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>
>> Great, glad it's fixed!
>> E
>>
>>
>>
>> On 10/15/2010 06:35 AM, Beyers Cronje wrote:
>>
>>> Hi Eddie,
>>>
>>> I'm happy to report that I've been unable to reproduce the bug using the
>>> latest commit. So as far as I can tell this fixed the issue I was having.
>>> Eddie, Cliff, you guys rock, thanks so much for your assistance on this.
>>> Kind regards
>>> Beyers Cronje
>>>
>>> On Fri, Oct 15, 2010 at 6:24 AM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>>>
>>> Beyers,
>>>
>>> I believe I've found the problem... or A problem. What does this
>>> commit do for you?
>>>
>>>
>>> http://github.com/kohler/click/commit/96471bf1c0873447de007969f5a1e49b1a947071
>>>
>>> Eddie
>>>
>>>
>>>
>>> On 10/13/2010 05:01 PM, Beyers Cronje wrote:
>>>
>>> Hi Eddie,
>>>
>>>     Thank you for your response and looking into this. I keep
>>>     forgetting about your excellent Script element; during my tests I
>>>     used a much more rudimentary method of verifying the issue :) I
>>>     used a Counter element before and after the queue and manually
>>>     read the counters' rate and count handlers. After a while I
>>>     noticed that both counters' rates dropped to 0 and both counters'
>>>     counts stopped increasing, *permanently*. From what I can see the
>>>     tasks NEVER re-schedule themselves.
>>>     The only way I found to reschedule them was to manually issue a
>>>     reset on the queue, which seems to re-activate the notifier
>>>     signals and in turn schedule the tasks.
>>>     One thing I forgot to mention: I set SLEEPINESS_TRIGGER to 1,
>>>     which seems to trigger the condition quicker. Also, strangely, if
>>>     I set the first InfiniteSource LIMIT to 1000 instead of -1 (as
>>>     per the original config posted) it also seems to trigger the
>>>     condition much quicker. In fact I couldn't reproduce the issue
>>>     with both InfiniteSource elements set to -1 during a 10 min test
>>>     run, whereas with LIMIT 1000 it always happens within 10 min of
>>>     testing time.
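[Editor's note: the SLEEPINESS_TRIGGER mentioned above tunes how quickly a task gives up its schedule slot. A hedged sketch of that heuristic in C++ (invented names and structure, not Click's actual implementation): a task counts consecutive unproductive runs and unschedules itself only once the count reaches the trigger.

```cpp
#include <cassert>

// Sketch of a "sleepiness" counter.  'trigger' plays the role of
// SLEEPINESS_TRIGGER; this is an illustrative model, not Click code.
struct SleepyTask {
    int trigger;            // runs of idleness tolerated before sleeping
    int sleepiness = 0;     // consecutive unproductive runs so far
    bool scheduled = true;

    explicit SleepyTask(int t) : trigger(t) {}

    // Called once per task run; 'productive' means a packet was moved.
    void run(bool productive) {
        if (productive) {
            sleepiness = 0;             // useful work: stay awake
        } else if (++sleepiness >= trigger) {
            scheduled = false;          // give up; wait for the notifier
        }
    }
};
```

With the trigger set to 1, a single idle run puts the task to sleep, so the sleep/wake handoff runs as often as possible, which plausibly explains why that setting reproduces the bug faster.]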
>>> I'm out of the office now, but will test with both your script and
>>> Cliff's script tomorrow and post back my results.
>>> Beyers
>>>
>>> On Wed, Oct 13, 2010 at 5:06 PM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>>>
>>> Hi Beyers!
>>>
>>> I am not sure that I agree with your analysis about the queue
>>> getting "stuck."
>>>
>>> Here is the configuration I used, a tweak of yours:
>>>
>>> //////////////////////////////////////////////////////////////
>>>
>>> is1::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>>>     LIMIT -1, STOP false) -> q1::FullNoteQueue -> uq1::Unqueue -> Discard;
>>>
>>> is2::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>>>     LIMIT -1, STOP false) -> q2::FullNoteQueue -> uq2::Unqueue -> Discard;
>>>
>>>
>>> StaticThreadSched(is1 0, uq1 1, is2 2, uq2 3);
>>> Script(TYPE ACTIVE,
>>>        init x 0,
>>>        set x $(add $x 1),
>>>        goto foo $(or $(is1.scheduled) $(uq1.scheduled)),
>>>        print "$x: is1.scheduled false, uq1.scheduled false, q1.length $(q1.length)",
>>>        label foo,
>>>        goto bar $(or $(is2.scheduled) $(uq2.scheduled)),
>>>        print "$x: is2.scheduled false, uq2.scheduled false, q2.length $(q2.length)",
>>>        label bar,
>>>        wait 0.5s,
>>>        loop);
>>> //////////////////////////////////////////////////////////////
>>>
>>> This should print a message whenever neither source nor sink is
>>> scheduled.
>>>
>>> I also tried this with "FullNoteQueue" replaced with
>>> "ThreadSafeQueue."
>>>
>>> If a queue was truly getting 'stuck,' we would expect to see
>>> something like
>>>
>>> 745: is1.scheduled false, uq1.scheduled false
>>> 746: is1.scheduled false, uq1.scheduled false
>>> 747: is1.scheduled false, uq1.scheduled false
>>> 748: is1.scheduled false, uq1.scheduled false
>>> 749: is1.scheduled false, uq1.scheduled false....
>>>
>>> with no gaps, because something would go unscheduled *and stay that
>>> way*.
>>>
>>> But that's not what I see. I see this:
>>>
>>> 1: is1.scheduled false, uq1.scheduled false, q1.length 17
>>> 1: is2.scheduled false, uq2.scheduled false, q2.length 0
>>> 20: is1.scheduled false, uq1.scheduled false, q1.length 14
>>> 22: is1.scheduled false, uq1.scheduled false, q1.length 0
>>> 28: is1.scheduled false, uq1.scheduled false, q1.length 14
>>> 41: is1.scheduled false, uq1.scheduled false, q1.length 0
>>> 47: is2.scheduled false, uq2.scheduled false, q2.length 0
>>> 55: is2.scheduled false, uq2.scheduled false, q2.length 0
>>> 56: is2.scheduled false, uq2.scheduled false, q2.length 0
>>> 76: is2.scheduled false, uq2.scheduled false, q2.length 0
>>>
>>> *** Now this, on the other hand, I would expect. In a multithreaded
>>> environment, notification will cause some delay between the upstream
>>> & downstream sides of a queue. This is because the upstream side
>>> must wake up the downstream side, or vice versa. The cross-thread
>>> communication required for this can happen a bit slowly, leaving a
>>> moment (maybe more than a moment) where neither task appears to be
>>> scheduled. (In fact, at least one of the tasks will be
>>> "pre-scheduled.")
>>>
>>> So my questions for you are these. Did you really mean "stuck"?
>>> Have you verified that the tasks NEVER re-schedule themselves? Do
>>> you agree with my analysis? Or are you saying that the "pauses" are
>>> too long, even though the tasks do reschedule themselves?
>>>
>>> Eddie
>>>
>>>
>>>
>>> On 10/12/2010 05:23 PM, Beyers Cronje wrote:
>>>
>>> Hi Eddie, list,
>>>
>>> I've come across a multithreading bug in queue signal handling which
>>> I can reproduce fairly consistently with userlevel multithreading
>>> Click. The symptom is that either the upstream or the downstream
>>> task of the queue is unscheduled even though the notifier signal is
>>> active; this occurs when the queue becomes either full or empty. To
>>> illustrate, here is the config I use and some debug handler outputs:
>>>
>>> //******************** Config ************************
>>> is1::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>>>     LIMIT 1000, STOP false) -> ThreadSafeQueue -> uq1::Unqueue -> Discard;
>>> is2::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>>>     LIMIT -1, STOP false) -> q::ThreadSafeQueue -> uq2::Unqueue -> Discard;
>>>
>>> StaticThreadSched(is1 0, uq1 1, is2 2, uq2 3);
>>>
>>>
>>> //******************** Debug Handler Output when upstream push task is stuck ************************
>>> read q.length
>>> 200 Read handler 'q.length' OK
>>> DATA 1
>>> 0
>>>
>>> read q.fullnote_state
>>> 200 Read handler 'q.fullnote_state' OK
>>> DATA 131
>>> empty notifier off
>>> task 0x25b8830 [uq2 :: Unqueue] unscheduled
>>> full notifier on
>>> task 0x25b8350 [is2 :: InfiniteSource] unscheduled
>>>
>>> //******************** Debug Handler Output when downstream pull task is stuck ************************
>>> read q.length
>>> 200 Read handler 'q.length' OK
>>> DATA 4
>>> 1000
>>>
>>> read q.fullnote_state
>>> 200 Read handler 'q.fullnote_state' OK
>>> DATA 131
>>> empty notifier on
>>> task 0x1c6f830 [uq2 :: Unqueue] unscheduled
>>> full notifier off
>>> task 0x1c6f350 [is2 :: InfiniteSource] unscheduled
>>>
>>>
>>>
>>> //*****************************************************************************************************************
>>>
>>> Clearly the notifier states are correct, but somehow the relevant
>>> task is not rescheduled. The above config uses ThreadSafeQueue, but
>>> I verified that the same issue occurs when using FullNoteQueue.
>>> The obvious places to look are ActiveNotifier::set_active and
>>> FullNoteQueue::push_success/push_failure/pull_success/pull_failure,
>>> but so far I haven't spotted anything wrong with the relevant code;
>>> clearly I'm overlooking something.
>>>
>>> Have you or anyone else on the list got any suggestions?
>>>
>>> If it helps, I'm running Click source from a couple of weeks back,
>>> on a default 64-bit Fedora Core 13 kernel with preemption enabled
>>> (2.6.33.3-85.fc13.x86_64), Intel Dual-Core CPU.
>>> I start userlevel Click with the following command:
>>> 'click --threads=4 conf/threadtest.click -p 777'
>>>
>>> Cheers
>>>
>>> Beyers
>>> _______________________________________________
>>> click mailing list
>>> click at amsterdam.lcs.mit.edu
>>>
>>> https://amsterdam.lcs.mit.edu/mailman/listinfo/click
>>>
>>>
>>>
>>>
>
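[Editor's note: for readers finding this thread later, the "notifier on, task unscheduled" state in the handler dumps is the signature of a lost wakeup between a producer and a sleeping consumer. A minimal model in C++ atomics (invented names; not Click's actual FullNoteQueue/ActiveNotifier code) shows both the race and the publish-then-recheck pattern that closes it:

```cpp
#include <atomic>
#include <cassert>

// Minimal model of the sleep/wake handoff around a queue.  The producer
// pushes and wakes the consumer if it is asleep; the consumer goes to
// sleep when the queue looks empty.
struct Queue {
    std::atomic<int> length{0};
    std::atomic<bool> consumer_asleep{false};

    void push() {
        length.fetch_add(1, std::memory_order_seq_cst);
        if (consumer_asleep.load(std::memory_order_seq_cst)) {
            consumer_asleep.store(false, std::memory_order_seq_cst);
            // ...reschedule the consumer task here...
        }
    }

    // BUGGY: check-then-sleep.  If push() runs between the length check
    // and the asleep store, the producer sees asleep == false and skips
    // the wakeup; the consumer then sleeps and nothing ever reschedules
    // it -- the permanently stuck state reported in this thread.
    bool sleep_buggy() {
        if (length.load(std::memory_order_seq_cst) == 0) {
            consumer_asleep.store(true, std::memory_order_seq_cst);
            return true;    // went to sleep
        }
        return false;
    }

    // FIXED: publish "asleep" first, then recheck the queue.  With
    // sequentially consistent ordering, either the consumer sees the
    // pushed packet or the producer sees asleep == true (or both), so
    // the wakeup can no longer fall through the crack.
    bool sleep_fixed() {
        consumer_asleep.store(true, std::memory_order_seq_cst);
        if (length.load(std::memory_order_seq_cst) != 0) {
            consumer_asleep.store(false, std::memory_order_seq_cst);
            return false;   // stay scheduled and drain the queue
        }
        return true;        // genuinely empty; notifier will wake us
    }
};
```

This model also matches why a manual queue reset un-sticks the tasks: the reset fires the notifier again, replaying the wakeup that was lost.]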