[Click] Race condition with FullNoteQueue signaling in multithreading

Cliff Frey cliff at meraki.com
Wed Oct 20 02:16:17 EDT 2010


Beyers,

Eddie and I discovered and fixed a few other (very evil) memory barrier
bugs, so you should probably update.
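For anyone tracking this class of bug: the hazard is the classic lost wakeup between a queue's producer and a sleeping consumer task. Below is a minimal Python sketch of the *correct* pattern (illustrative only; the names and structure are invented for this example, not Click's actual FullNoteQueue code). The consumer must advertise that it is asleep and re-check the queue under the same synchronization, and the producer must observe that advertisement before deciding to skip the wakeup. Splitting those steps without an ordering guarantee (e.g. a missing memory barrier) is exactly how a task can end up permanently unscheduled while its notifier is on.

```python
# Illustrative sketch of the lost-wakeup hazard behind notifier/task races.
# Not Click's code: names ("NotifyingQueue", "_consumer_asleep") are invented.
import threading
from collections import deque

class NotifyingQueue:
    def __init__(self):
        self._q = deque()
        self._cv = threading.Condition()
        self._consumer_asleep = False

    def push(self, pkt):
        with self._cv:
            self._q.append(pkt)
            # Producer-side "notifier": wake the consumer only if it has
            # advertised that it went to sleep. The lock guarantees we see
            # that advertisement; without such ordering, the wakeup is lost.
            if self._consumer_asleep:
                self._consumer_asleep = False
                self._cv.notify()

    def pull(self):
        with self._cv:
            while not self._q:
                # Advertise sleep, then wait -- atomically with respect to
                # push(), because both hold the same lock. Doing these two
                # steps without an ordering guarantee is the buggy pattern.
                self._consumer_asleep = True
                self._cv.wait()
            return self._q.popleft()

def run(n=1000):
    q = NotifyingQueue()
    got = []
    consumer = threading.Thread(
        target=lambda: [got.append(q.pull()) for _ in range(n)])
    consumer.start()
    for i in range(n):
        q.push(i)
    consumer.join()
    return got

print(len(run()))  # prints 1000: everything drains, no lost wakeup
```

With the re-check inside the lock, the run always drains; drop that ordering and the consumer can sleep forever with packets queued, which is the symptom reported below.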

Cliff

On Fri, Oct 15, 2010 at 7:40 AM, Eddie Kohler <kohler at cs.ucla.edu> wrote:

> Great, glad it's fixed!
> E
>
>
>
> On 10/15/2010 06:35 AM, Beyers Cronje wrote:
>
>> Hi Eddie,
>>
>> I'm happy to report that I've been unable to reproduce the bug using the
>> latest commit. So as far as I can tell this fixed the issue I was having.
>> Eddie, Cliff, you guys rock, thanks so much for your assistance on this.
>> Kind regards
>> Beyers Cronje
>>
>> On Fri, Oct 15, 2010 at 6:24 AM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>>
>>    Beyers,
>>
>>    I believe I've found the problem... or A problem.  What does this
>>    commit do for you?
>>
>>
>> http://github.com/kohler/click/commit/96471bf1c0873447de007969f5a1e49b1a947071
>>
>>    Eddie
>>
>>
>>
>>    On 10/13/2010 05:01 PM, Beyers Cronje wrote:
>>
>>        Hi Eddie,
>>
>>        Thank you for your response and for looking into this. I keep
>>        forgetting about your excellent Script element; during my tests I
>>        used a much more rudimentary method of verifying the issue :) I
>>        used a Counter element before and after the queue and manually read
>>        the counters' rate and count handlers. After a while I noticed that
>>        both counters' rates dropped to 0 and both counters' counts stopped
>>        increasing, *permanently*. From what I can see the tasks NEVER
>>        re-schedule themselves.
>>        The only way I found to reschedule them was to manually issue a
>>        reset on the queue, which seems to re-activate the notifier signals
>>        and in turn schedule the tasks.
>>        One thing I forgot to mention: I set SLEEPINESS_TRIGGER to 1, which
>>        seems to trigger the condition more quickly. Also, strangely, if I
>>        set the first InfiniteSource LIMIT to 1000 instead of -1 (as per
>>        the original config posted) it also seems to trigger the condition
>>        much more quickly. In fact, I couldn't reproduce the issue with
>>        both InfiniteSource elements set to -1 during a 10min test run,
>>        whereas with LIMIT 1000 it always happens within 10min of testing
>>        time.
>>        I'm out of the office now, but will test with both your script and
>>        Cliff's script tomorrow and post back my results.
>>        Beyers
>>
>>        On Wed, Oct 13, 2010 at 5:06 PM, Eddie Kohler
>>        <kohler at cs.ucla.edu> wrote:
>>
>>            Hi Beyers!
>>
>>            I am not sure that I agree with your analysis about the queue
>>            getting "stuck."
>>
>>            Here is the configuration I used, a tweak of yours:
>>
>>            //////////////////////////////////////////////////////////////
>>
>>            is1::InfiniteSource(DATA \<00 00 c0 ae 67 ef  00 00 00 00 00 00  08 00>,
>>                LIMIT -1, STOP false)
>>                -> q1::FullNoteQueue -> uq1::Unqueue -> Discard;
>>
>>            is2::InfiniteSource(DATA \<00 00 c0 ae 67 ef  00 00 00 00 00 00  08 00>,
>>                LIMIT -1, STOP false)
>>                -> q2::FullNoteQueue -> uq2::Unqueue -> Discard;
>>
>>
>>            StaticThreadSched(is1 0, uq1 1, is2 2, uq2 3);
>>            Script(TYPE ACTIVE,
>>                    init x 0,
>>                    set x $(add $x 1),
>>                    goto foo $(or $(is1.scheduled) $(uq1.scheduled)),
>>                    print "$x: is1.scheduled false, uq1.scheduled false,
>>            q1.length $(q1.length)",
>>                    label foo,
>>                    goto bar $(or $(is2.scheduled) $(uq2.scheduled)),
>>                    print "$x: is2.scheduled false, uq2.scheduled false,
>>            q2.length $(q2.length)",
>>                    label bar,
>>                    wait 0.5s,
>>                    loop);
>>            //////////////////////////////////////////////////////////////
>>
>>            This should print a message whenever neither source nor sink is
>>            scheduled.
>>
>>            I also tried this with "FullNoteQueue" replaced with
>>        "ThreadSafeQueue."
>>
>>            If a queue were truly getting 'stuck,' we would expect to see
>>            something like
>>
>>            745: is1.scheduled false, uq1.scheduled false
>>            746: is1.scheduled false, uq1.scheduled false
>>            747: is1.scheduled false, uq1.scheduled false
>>            748: is1.scheduled false, uq1.scheduled false
>>            749: is1.scheduled false, uq1.scheduled false ...
>>
>>            with no gaps, because something would go unscheduled *and stay
>>            that way*.
>>
>>            But that's not what I see.  I see this:
>>
>>            1: is1.scheduled false, uq1.scheduled false, q1.length 17
>>            1: is2.scheduled false, uq2.scheduled false, q2.length 0
>>            20: is1.scheduled false, uq1.scheduled false, q1.length 14
>>            22: is1.scheduled false, uq1.scheduled false, q1.length 0
>>            28: is1.scheduled false, uq1.scheduled false, q1.length 14
>>            41: is1.scheduled false, uq1.scheduled false, q1.length 0
>>            47: is2.scheduled false, uq2.scheduled false, q2.length 0
>>            55: is2.scheduled false, uq2.scheduled false, q2.length 0
>>            56: is2.scheduled false, uq2.scheduled false, q2.length 0
>>            76: is2.scheduled false, uq2.scheduled false, q2.length 0
>>
>>            *** Now this, on the other hand, I would expect.  In a
>>            multithreaded environment, notification will cause some delay
>>            between the upstream & downstream sides of a queue.  This is
>>            because the upstream side must wake up the downstream side, or
>>            vice versa.  The cross-thread communication required for this
>>            can happen a bit slowly, leaving a moment (maybe more than a
>>            moment) where neither task appears to be scheduled.  (In fact,
>>            at least one of the tasks will be "pre-scheduled.")
>>
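[Editorial aside: the "expected delay" described here can be made concrete with a toy timeline. This is illustrative Python with invented names; it models the observable states, not Click's scheduler. The producer unschedules itself when the queue fills and posts a cross-thread wakeup; until the consumer's thread processes that wakeup, a polling observer like the Script element sees *neither* task scheduled, yet nothing is stuck.]

```python
# Toy timeline of scheduled-state observations around a cross-thread wakeup.
# Invented names; models what a polling observer would record, not Click code.
def timeline():
    observations = []
    producer_scheduled = True
    consumer_scheduled = False

    def sample(label):
        # What a polling observer (like the Script element) records.
        observations.append((label, producer_scheduled, consumer_scheduled))

    sample("producer filling queue")
    # Queue becomes full: the producer unschedules itself and posts a
    # wakeup message to the consumer's thread.
    producer_scheduled = False
    sample("wakeup in flight")          # neither task scheduled -- not stuck
    # The consumer's thread processes the wakeup and schedules the task.
    consumer_scheduled = True
    sample("consumer draining queue")
    return observations

for obs in timeline():
    print(obs)
```

The middle observation, ("wakeup in flight", False, False), is the transient "neither scheduled" state; a genuinely stuck queue would stay in it forever, which is the distinction the questions below are probing.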
>>            So my questions for you are these.  Did you really mean
>>            "stuck"?  Have you verified that the tasks NEVER re-schedule
>>            themselves?  Do you agree with my analysis?  Or are you saying
>>            that the "pauses" are too long, even though the tasks do
>>            reschedule themselves?
>>
>>            Eddie
>>
>>
>>
>>            On 10/12/2010 05:23 PM, Beyers Cronje wrote:
>>
>>                Hi Eddie, list,
>>
>>                I've come across a multithreading bug in queue signal
>>                handling which I can reproduce fairly consistently with
>>                userlevel multithreaded Click. The symptom is that either
>>                the upstream or the downstream task of the queue is
>>                unscheduled even though the notifier signal is active; this
>>                occurs when the queue becomes either full or empty. To
>>                illustrate, here is the config I use and some debug handler
>>                outputs:
>>
>>                //******************** Config ************************
>>                is1::InfiniteSource(DATA \<00 00 c0 ae 67 ef  00 00 00 00 00 00  08 00>,
>>                    LIMIT 1000, STOP false)
>>                    -> ThreadSafeQueue -> uq1::Unqueue -> Discard;
>>                is2::InfiniteSource(DATA \<00 00 c0 ae 67 ef  00 00 00 00 00 00  08 00>,
>>                    LIMIT -1, STOP false)
>>                    -> q::ThreadSafeQueue -> uq2::Unqueue -> Discard;
>>
>>                StaticThreadSched(is1 0, uq1 1, is2 2, uq2 3);
>>
>>
>>                //******************** Debug Handler Output when upstream
>>                push task is stuck ************************
>>                read q.length
>>                200 Read handler 'q.length' OK
>>                DATA 1
>>                0
>>
>>                read q.fullnote_state
>>                200 Read handler 'q.fullnote_state' OK
>>                DATA 131
>>                empty notifier off
>>                task 0x25b8830 [uq2 :: Unqueue] unscheduled
>>                full notifier on
>>                task 0x25b8350 [is2 :: InfiniteSource] unscheduled
>>
>>                //******************** Debug Handler Output when downstream
>>                pull task is stuck ************************
>>                read q.length
>>                200 Read handler 'q.length' OK
>>                DATA 4
>>                1000
>>
>>                read q.fullnote_state
>>                200 Read handler 'q.fullnote_state' OK
>>                DATA 131
>>                empty notifier on
>>                task 0x1c6f830 [uq2 :: Unqueue] unscheduled
>>                full notifier off
>>                task 0x1c6f350 [is2 :: InfiniteSource] unscheduled
>>
>>
>>
>>  //*****************************************************************************************************************
>>
>>                Clearly the notifier states are correct, but somehow the
>>                relevant task is not rescheduled. The above config uses
>>                ThreadSafeQueue, but I verified that the same issue occurs
>>                when using FullNoteQueue.
>>                The obvious places to look are ActiveNotifier::set_active
>>                and FullNoteQueue::push_success/push_failure/
>>                pull_success/pull_failure, but so far I haven't spotted
>>                anything wrong with the relevant code; clearly I'm
>>                overlooking something.
>>
>>                Have you or anyone else on the list got any suggestions?
>>
>>                If it helps, I'm running Click source from a couple of
>>                weeks back, default 64-bit Fedora Core 13 kernel with
>>                preemption enabled (2.6.33.3-85.fc13.x86_64), Intel
>>                Dual-Core CPU. I start userlevel Click with the following
>>                command: 'click --threads=4 conf/threadtest.click -p 777'
>>
>>                Cheers
>>
>>                Beyers
>>                _______________________________________________
>>                click mailing list
>>                click at amsterdam.lcs.mit.edu
>>                https://amsterdam.lcs.mit.edu/mailman/listinfo/click
>>
>>
>>
>>

