[Click] Race condition with FullNoteQueue signaling in multithreading
Cliff Frey
cliff at meraki.com
Wed Oct 20 02:16:17 EDT 2010
Beyers,
Eddie and I discovered and fixed a few other (very evil) memory barrier
bugs, so you should probably update.
Cliff
On Fri, Oct 15, 2010 at 7:40 AM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
> Great, glad it's fixed!
> E
>
>
>
> On 10/15/2010 06:35 AM, Beyers Cronje wrote:
>
>> Hi Eddie,
>>
>> I'm happy to report that I've been unable to reproduce the bug using the
>> latest commit. So as far as I can tell this fixed the issue I was having.
>> Eddie, Cliff, you guys rock, thanks so much for your assistance on this.
>> Kind regards
>> Beyers Cronje
>>
>> On Fri, Oct 15, 2010 at 6:24 AM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>>
>> Beyers,
>>
>> I believe I've found the problem... or A problem. What does this
>> commit do for you?
>>
>>
>> http://github.com/kohler/click/commit/96471bf1c0873447de007969f5a1e49b1a947071
>>
>> Eddie
>>
>>
>>
>> On 10/13/2010 05:01 PM, Beyers Cronje wrote:
>>
>> Hi Eddie,
>>
>> Thank you for your response and looking into this. I keep forgetting
>> about your excellent Script element; during my tests I used a much
>> more rudimentary method of verifying the issue :) I used a Counter
>> element before and after the queue and manually read the counters'
>> rate and count handlers. After a while I noticed that both counters'
>> rates dropped to 0 and both counters' counts stopped increasing,
>> *permanently*. From what I can see the tasks NEVER re-schedule
>> themselves.
>> The only way I found to reschedule them was to manually issue a reset
>> on the queue, which seems to re-activate the notifier signals and in
>> turn schedule the tasks.
>> One thing I forgot to mention: I set SLEEPINESS_TRIGGER to 1, which
>> seems to trigger the condition quicker. Also, strangely, if I set the
>> first InfiniteSource LIMIT to 1000 instead of -1 (as per the original
>> config posted) it also seems to trigger the condition much quicker; in
>> fact, I couldn't reproduce the issue with both InfiniteSource elements
>> set to -1 during a 10-minute test run, whereas with LIMIT 1000 it
>> always happens within 10 minutes of testing time.
>> I'm out of the office now, but will test with both your script and
>> Cliff's script tomorrow and post back my results.
>> Beyers
>>
>> On Wed, Oct 13, 2010 at 5:06 PM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>>
>> Hi Beyers!
>>
>> I am not sure that I agree with your analysis about the queue
>> getting "stuck."
>>
>> Here is the configuration I used, a tweak of yours:
>>
>> //////////////////////////////////////////////////////////////
>>
>> is1::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>> LIMIT -1, STOP false) -> q1::FullNoteQueue -> uq1::Unqueue -> Discard;
>>
>> is2::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>> LIMIT -1, STOP false) -> q2::FullNoteQueue -> uq2::Unqueue -> Discard;
>>
>>
>> StaticThreadSched(is1 0, uq1 1, is2 2, uq2 3);
>> Script(TYPE ACTIVE,
>>     init x 0,
>>     set x $(add $x 1),
>>     goto foo $(or $(is1.scheduled) $(uq1.scheduled)),
>>     print "$x: is1.scheduled false, uq1.scheduled false, q1.length $(q1.length)",
>>     label foo,
>>     goto bar $(or $(is2.scheduled) $(uq2.scheduled)),
>>     print "$x: is2.scheduled false, uq2.scheduled false, q2.length $(q2.length)",
>>     label bar,
>>     wait 0.5s,
>>     loop);
>> //////////////////////////////////////////////////////////////
>>
>> This should print a message whenever neither source nor sink is
>> scheduled.
>>
>> I also tried this with "FullNoteQueue" replaced with
>> "ThreadSafeQueue."
>>
>> If a queue was truly getting 'stuck,' we would expect to see
>> something like
>>
>> 745: is1.scheduled false, uq1.scheduled false
>> 746: is1.scheduled false, uq1.scheduled false
>> 747: is1.scheduled false, uq1.scheduled false
>> 748: is1.scheduled false, uq1.scheduled false
>> 749: is1.scheduled false, uq1.scheduled false....
>>
>> with no gaps, because something would go unscheduled *and stay that way*.
>>
>> But that's not what I see. I see this:
>>
>> 1: is1.scheduled false, uq1.scheduled false, q1.length 17
>> 1: is2.scheduled false, uq2.scheduled false, q2.length 0
>> 20: is1.scheduled false, uq1.scheduled false, q1.length 14
>> 22: is1.scheduled false, uq1.scheduled false, q1.length 0
>> 28: is1.scheduled false, uq1.scheduled false, q1.length 14
>> 41: is1.scheduled false, uq1.scheduled false, q1.length 0
>> 47: is2.scheduled false, uq2.scheduled false, q2.length 0
>> 55: is2.scheduled false, uq2.scheduled false, q2.length 0
>> 56: is2.scheduled false, uq2.scheduled false, q2.length 0
>> 76: is2.scheduled false, uq2.scheduled false, q2.length 0
>>
>> *** Now this, on the other hand, I would expect. In a multithreaded
>> environment, notification will cause some delay between the upstream
>> & downstream sides of a queue. This is because the upstream side
>> must wake up the downstream side, or vice versa. The cross-thread
>> communication required for this can happen a bit slowly, leaving a
>> moment (maybe more than a moment) where neither task appears to be
>> scheduled. (In fact, at least one of the tasks will be
>> "pre-scheduled.")
>>
>> So my questions for you are these. Did you really mean "stuck"? Have
>> you verified that the tasks NEVER re-schedule themselves? Do you
>> agree with my analysis? Or are you saying that the "pauses" are too
>> long, even though the tasks do reschedule themselves?
>>
>> Eddie
>>
>>
>>
>> On 10/12/2010 05:23 PM, Beyers Cronje wrote:
>>
>> Hi Eddie, list,
>>
>> I've come across a multithreading bug in queue signal handling which
>> I can reproduce fairly consistently with userlevel multithreaded
>> Click. The symptom is that either the upstream or the downstream task
>> of the queue is left unscheduled even though the notifier signal is
>> active; this occurs when the queue becomes either full or empty. To
>> illustrate, here is the config I use and some debug handler outputs:
>>
>> //******************** Config ************************
>> is1::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>> LIMIT 1000, STOP false) -> ThreadSafeQueue -> uq1::Unqueue -> Discard;
>> is2::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>> LIMIT -1, STOP false) -> q::ThreadSafeQueue -> uq2::Unqueue -> Discard;
>>
>> StaticThreadSched(is1 0, uq1 1, is2 2, uq2 3);
>>
>>
>> //******************** Debug Handler Output when upstream push task is stuck ************************
>> read q.length
>> 200 Read handler 'q.length' OK
>> DATA 1
>> 0
>>
>> read q.fullnote_state
>> 200 Read handler 'q.fullnote_state' OK
>> DATA 131
>> empty notifier off
>> task 0x25b8830 [uq2 :: Unqueue] unscheduled
>> full notifier on
>> task 0x25b8350 [is2 :: InfiniteSource] unscheduled
>>
>> //******************** Debug Handler Output when downstream pull task is stuck ************************
>> read q.length
>> 200 Read handler 'q.length' OK
>> DATA 4
>> 1000
>>
>> read q.fullnote_state
>> 200 Read handler 'q.fullnote_state' OK
>> DATA 131
>> empty notifier on
>> task 0x1c6f830 [uq2 :: Unqueue] unscheduled
>> full notifier off
>> task 0x1c6f350 [is2 :: InfiniteSource] unscheduled
>>
>>
>>
>> //*****************************************************************************************************************
>>
>> Clearly the notifier states are correct, but somehow the relevant
>> task is not rescheduled. The above config uses ThreadSafeQueue, but
>> I verified that the same issue occurs when using FullNoteQueue.
>> The obvious places to look are ActiveNotifier::set_active and
>> FullNoteQueue::push_success/push_failure/pull_success/pull_failure,
>> but so far I haven't spotted anything wrong with the relevant code;
>> clearly I'm overlooking something.
>>
>> Have you or anyone else on the list got any suggestions?
>>
>> If it helps, I'm running Click source from a couple of weeks back, on
>> a default 64-bit Fedora Core 13 kernel with preemption enabled
>> (2.6.33.3-85.fc13.x86_64), Intel Dual-Core CPU.
>> I start userlevel Click with the following command: 'click
>> --threads=4 conf/threadtest.click -p 777'
>>
>> Cheers
>>
>> Beyers
>> _______________________________________________
>> click mailing list
>> click at amsterdam.lcs.mit.edu
>> https://amsterdam.lcs.mit.edu/mailman/listinfo/click
>>
>>
>>