[Click] Race condition with FullNoteQueue signaling in multithreading

Beyers Cronje bcronje at gmail.com
Fri Oct 15 09:35:27 EDT 2010


Hi Eddie,

I'm happy to report that I've been unable to reproduce the bug using the
latest commit. So, as far as I can tell, this fixed the issue I was having.

Eddie, Cliff, you guys rock; thanks so much for your assistance on this.

Kind regards

Beyers Cronje

On Fri, Oct 15, 2010 at 6:24 AM, Eddie Kohler <kohler at cs.ucla.edu> wrote:

> Beyers,
>
> I believe I've found the problem... or A problem.  What does this commit do
> for you?
>
>
> http://github.com/kohler/click/commit/96471bf1c0873447de007969f5a1e49b1a947071
>
> Eddie
>
>
>
> On 10/13/2010 05:01 PM, Beyers Cronje wrote:
>
>> Hi Eddie,
>>
>> Thank you for your response and for looking into this. I keep forgetting
>> about your excellent Script element; during my tests I used a much more
>> rudimentary method of verifying the issue :) I used a Counter element
>> before and after the queue and manually read the counters' rate and
>> count handlers. After a while I noticed that both counters' rates dropped
>> to 0 and both counters' counts stopped increasing, *permanently*. From
>> what I can see, the tasks NEVER re-schedule themselves.
>> The only way I found to reschedule them was to manually issue a reset on
>> the queue, which seems to re-activate the notifier signals and in turn
>> schedule the tasks.
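>> For reference, the monitoring setup was roughly along these lines
>> (sketched from memory; element names here are illustrative):
>>
>> is::InfiniteSource(LIMIT -1, STOP false)
>>     -> c_in::Counter
>>     -> q::FullNoteQueue
>>     -> c_out::Counter
>>     -> uq::Unqueue
>>     -> Discard;
>> StaticThreadSched(is 0, uq 1);
>> // In the stuck state, both c_in.rate and c_out.rate read 0 and both
>> // counts stop increasing; 'write q.reset' clears the condition.
>>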
>> One thing I forgot to mention: I set SLEEPINESS_TRIGGER to 1, which seems
>> to trigger the condition more quickly. Also, strangely, if I set the first
>> InfiniteSource LIMIT to 1000 instead of -1 (as per the original config
>> posted), it also seems to trigger the condition much more quickly. In fact,
>> I couldn't reproduce the issue with both InfiniteSource elements set to -1
>> during a 10min test run, whereas with LIMIT 1000 it always happens within
>> 10min of testing time.
>> I'm out of the office now, but will test with both your script and
>> Cliff's script tomorrow and post back my results.
>> Beyers
>>
>> On Wed, Oct 13, 2010 at 5:06 PM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>>
>>    Hi Beyers!
>>
>>    I am not sure that I agree with your analysis about the queue
>>    getting "stuck."
>>
>>    Here is the configuration I used, a tweak of yours:
>>
>>    //////////////////////////////////////////////////////////////
>>
>>    is1::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>>    LIMIT -1, STOP false) -> q1::FullNoteQueue -> uq1::Unqueue -> Discard;
>>
>>    is2::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>>    LIMIT -1, STOP false) -> q2::FullNoteQueue -> uq2::Unqueue -> Discard;
>>
>>
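>>    // pin each of the four tasks to its own thread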
>>    StaticThreadSched(is1 0, uq1 1, is2 2, uq2 3);
>>    Script(TYPE ACTIVE,
>>            init x 0,
>>            set x $(add $x 1),
>>            goto foo $(or $(is1.scheduled) $(uq1.scheduled)),
>>            print "$x: is1.scheduled false, uq1.scheduled false,
>>    q1.length $(q1.length)",
>>            label foo,
>>            goto bar $(or $(is2.scheduled) $(uq2.scheduled)),
>>            print "$x: is2.scheduled false, uq2.scheduled false,
>>    q2.length $(q2.length)",
>>            label bar,
>>            wait 0.5s,
>>            loop);
>>    //////////////////////////////////////////////////////////////
>>
>>    This should print a message whenever neither source nor sink is
>>    scheduled.
>>
>>    I also tried this with "FullNoteQueue" replaced with "ThreadSafeQueue."
>>
>>    If a queue were truly getting 'stuck,' we would expect to see
>>    something like
>>
>>    745: is1.scheduled false, uq1.scheduled false
>>    746: is1.scheduled false, uq1.scheduled false
>>    747: is1.scheduled false, uq1.scheduled false
>>    748: is1.scheduled false, uq1.scheduled false
>>    749: is1.scheduled false, uq1.scheduled false....
>>
>>    with no gaps, because something would go unscheduled *and stay that
>>    way*.
>>
>>    But that's not what I see.  I see this:
>>
>>    1: is1.scheduled false, uq1.scheduled false, q1.length 17
>>    1: is2.scheduled false, uq2.scheduled false, q2.length 0
>>    20: is1.scheduled false, uq1.scheduled false, q1.length 14
>>    22: is1.scheduled false, uq1.scheduled false, q1.length 0
>>    28: is1.scheduled false, uq1.scheduled false, q1.length 14
>>    41: is1.scheduled false, uq1.scheduled false, q1.length 0
>>    47: is2.scheduled false, uq2.scheduled false, q2.length 0
>>    55: is2.scheduled false, uq2.scheduled false, q2.length 0
>>    56: is2.scheduled false, uq2.scheduled false, q2.length 0
>>    76: is2.scheduled false, uq2.scheduled false, q2.length 0
>>
>>    *** Now this, on the other hand, I would expect.  In a multithreaded
>>    environment, notification will cause some delay between the upstream
>>    & downstream sides of a queue.  This is because the upstream side
>>    must wake up the downstream side, or vice versa.  The cross-thread
>>    communication required for this can happen a bit slowly, leaving a
>>    moment (maybe more than a moment) where neither task appears to be
>>    scheduled.  (In fact, at least one of the tasks will be
>>    "pre-scheduled.")
>>
>>    So my questions for you are these.  Did you really mean "stuck"?
>>    Have you verified that the tasks NEVER re-schedule themselves?  Do
>>    you agree with my analysis?  Or are you saying that the "pauses"
>>    are too long, even though the tasks do reschedule themselves?
>>
>>    Eddie
>>
>>
>>
>>    On 10/12/2010 05:23 PM, Beyers Cronje wrote:
>>
>>        Hi Eddie, list,
>>
>>        I've come across a multithreading bug in queue signal handling
>>        which I can reproduce fairly consistently with userlevel
>>        multithreaded Click. The symptom is that either the upstream or
>>        downstream task of the queue is unscheduled even though the
>>        notifier signal is active; this occurs when the queue becomes
>>        either full or empty. To illustrate, here is the config I use
>>        and some debug handler outputs:
>>
>>        //******************** Config ************************
>>        is1::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>>        LIMIT 1000, STOP false) -> ThreadSafeQueue -> uq1::Unqueue -> Discard;
>>        is2::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>>        LIMIT -1, STOP false) -> q::ThreadSafeQueue -> uq2::Unqueue -> Discard;
>>
>>        StaticThreadSched(is1 0, uq1 1, is2 2, uq2 3);
>>
>>
>>        //********* Debug Handler Output when upstream push task is stuck *********
>>        read q.length
>>        200 Read handler 'q.length' OK
>>        DATA 1
>>        0
>>
>>        read q.fullnote_state
>>        200 Read handler 'q.fullnote_state' OK
>>        DATA 131
>>        empty notifier off
>>        task 0x25b8830 [uq2 :: Unqueue] unscheduled
>>        full notifier on
>>        task 0x25b8350 [is2 :: InfiniteSource] unscheduled
>>
>>        //******** Debug Handler Output when downstream pull task is stuck ********
>>        read q.length
>>        200 Read handler 'q.length' OK
>>        DATA 4
>>        1000
>>
>>        read q.fullnote_state
>>        200 Read handler 'q.fullnote_state' OK
>>        DATA 131
>>        empty notifier on
>>        task 0x1c6f830 [uq2 :: Unqueue] unscheduled
>>        full notifier off
>>        task 0x1c6f350 [is2 :: InfiniteSource] unscheduled
>>
>>
>>        //**************************************************************************
>>
>>        Clearly the notifier states are correct, but somehow the
>>        relevant task is not rescheduled. The above config uses
>>        ThreadSafeQueue, but I verified that the same issue occurs when
>>        using FullNoteQueue.
>>        The obvious places to look are ActiveNotifier::set_active and
>>        FullNoteQueue::push_success/push_failure/pull_success/pull_failure,
>>        but so far I haven't spotted anything wrong with the relevant
>>        code; clearly I'm overlooking something.
>>
>>        Have you or anyone else on the list got any suggestions?
>>
>>        If it helps, I'm running Click source from a couple of weeks
>>        back, on the default 64-bit Fedora Core 13 kernel with
>>        preemption enabled (2.6.33.3-85.fc13.x86_64), on an Intel
>>        dual-core CPU.
>>        I start userlevel Click with the following command:
>>        'click --threads=4 conf/threadtest.click -p 777'
>>
>>        Cheers
>>
>>        Beyers
>>        _______________________________________________
>>        click mailing list
>>        click at amsterdam.lcs.mit.edu
>>
>>        https://amsterdam.lcs.mit.edu/mailman/listinfo/click
>>
>>
>>

