[Click] Race condition with FullNoteQueue signaling in multithreading
Beyers Cronje
bcronje at gmail.com
Wed Oct 20 06:41:58 EDT 2010
Thanks for the heads up, I will update.
On Wed, Oct 20, 2010 at 8:16 AM, Cliff Frey <cliff at meraki.com> wrote:
> Beyers,
>
> Eddie and I discovered and fixed a few other (very evil) memory barrier
> bugs, so you should probably update.
>
> Cliff
>
> On Fri, Oct 15, 2010 at 7:40 AM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>
>> Great, glad it's fixed!
>> E
>>
>>
>>
>> On 10/15/2010 06:35 AM, Beyers Cronje wrote:
>>
>>> Hi Eddie,
>>>
>>> I'm happy to report that I've been unable to reproduce the bug using the
>>> latest commit. So as far as I can tell this fixed the issue I was having.
>>> Eddie, Cliff, you guys rock, thanks so much for your assistance on this.
>>> Kind regards
>>> Beyers Cronje
>>>
>>> On Fri, Oct 15, 2010 at 6:24 AM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>>>
>>> Beyers,
>>>
>>> I believe I've found the problem... or A problem. What does this
>>> commit do for you?
>>>
>>>
>>> http://github.com/kohler/click/commit/96471bf1c0873447de007969f5a1e49b1a947071
>>>
>>> Eddie
>>>
>>>
>>>
>>> On 10/13/2010 05:01 PM, Beyers Cronje wrote:
>>>
>>> Hi Eddie,
>>>
>>>     Thank you for your response and looking into this. I keep
>>>     forgetting about your excellent Script element; during my tests I
>>>     used a much more rudimentary method of verifying the issue :) I
>>>     used a Counter element before and after the queue and manually
>>>     read the counters' rate and count handlers. After a while I
>>>     noticed that both counters' rates dropped to 0 and both counters'
>>>     counts stopped increasing, *permanently*. From what I can see the
>>>     tasks NEVER re-schedule themselves.
>>>     The only way I found to reschedule them was to manually issue a
>>>     reset on the queue, which seems to re-activate the notifier
>>>     signals and in turn schedule the tasks.
>>>     One thing I forgot to mention: I set SLEEPINESS_TRIGGER to 1,
>>>     which seems to trigger the condition quicker. Also, strangely, if
>>>     I set the first InfiniteSource LIMIT to 1000 instead of -1 (as
>>>     per the original config posted) it also seems to trigger the
>>>     condition much quicker. In fact I couldn't reproduce the issue
>>>     with both InfiniteSource elements set to -1 during a 10 min test
>>>     run, whereas with LIMIT 1000 it always happens within 10 min of
>>>     testing time.
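[Editor's note: the SLEEPINESS_TRIGGER mentioned above tunes how quickly a task gives up its schedule slot. A hedged sketch of that heuristic in C++ (invented names and structure, not Click's actual implementation): a task counts consecutive unproductive runs and unschedules itself only once the count reaches the trigger.

```cpp
#include <cassert>

// Sketch of a "sleepiness" counter.  'trigger' plays the role of
// SLEEPINESS_TRIGGER; this is an illustrative model, not Click code.
struct SleepyTask {
    int trigger;            // runs of idleness tolerated before sleeping
    int sleepiness = 0;     // consecutive unproductive runs so far
    bool scheduled = true;

    explicit SleepyTask(int t) : trigger(t) {}

    // Called once per task run; 'productive' means a packet was moved.
    void run(bool productive) {
        if (productive) {
            sleepiness = 0;             // useful work: stay awake
        } else if (++sleepiness >= trigger) {
            scheduled = false;          // give up; wait for the notifier
        }
    }
};
```

With the trigger set to 1, a single idle run puts the task to sleep, so the sleep/wake handoff runs as often as possible, which plausibly explains why that setting reproduces the bug faster.]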
>>> I'm out of the office now, but will test with both your script and
>>> Cliff's script tomorrow and post back my results.
>>> Beyers
>>>
>>> On Wed, Oct 13, 2010 at 5:06 PM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>>>
>>> Hi Beyers!
>>>
>>> I am not sure that I agree with your analysis about the queue
>>> getting "stuck."
>>>
>>> Here is the configuration I used, a tweak of yours:
>>>
>>> //////////////////////////////////////////////////////////////
>>>
>>> is1::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>>>     LIMIT -1, STOP false) -> q1::FullNoteQueue -> uq1::Unqueue -> Discard;
>>>
>>> is2::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>>>     LIMIT -1, STOP false) -> q2::FullNoteQueue -> uq2::Unqueue -> Discard;
>>>
>>>
>>> StaticThreadSched(is1 0, uq1 1, is2 2, uq2 3);
>>> Script(TYPE ACTIVE,
>>>        init x 0,
>>>        set x $(add $x 1),
>>>        goto foo $(or $(is1.scheduled) $(uq1.scheduled)),
>>>        print "$x: is1.scheduled false, uq1.scheduled false, q1.length $(q1.length)",
>>>        label foo,
>>>        goto bar $(or $(is2.scheduled) $(uq2.scheduled)),
>>>        print "$x: is2.scheduled false, uq2.scheduled false, q2.length $(q2.length)",
>>>        label bar,
>>>        wait 0.5s,
>>>        loop);
>>> //////////////////////////////////////////////////////////////
>>>
>>> This should print a message whenever neither source nor sink is
>>> scheduled.
>>>
>>> I also tried this with "FullNoteQueue" replaced with
>>> "ThreadSafeQueue."
>>>
>>> If a queue was truly getting 'stuck,' we would expect to see
>>> something like
>>>
>>> 745: is1.scheduled false, uq1.scheduled false
>>> 746: is1.scheduled false, uq1.scheduled false
>>> 747: is1.scheduled false, uq1.scheduled false
>>> 748: is1.scheduled false, uq1.scheduled false
>>> 749: is1.scheduled false, uq1.scheduled false....
>>>
>>> with no gaps, because something would go unscheduled *and stay that
>>> way*.
>>>
>>> But that's not what I see. I see this:
>>>
>>> 1: is1.scheduled false, uq1.scheduled false, q1.length 17
>>> 1: is2.scheduled false, uq2.scheduled false, q2.length 0
>>> 20: is1.scheduled false, uq1.scheduled false, q1.length 14
>>> 22: is1.scheduled false, uq1.scheduled false, q1.length 0
>>> 28: is1.scheduled false, uq1.scheduled false, q1.length 14
>>> 41: is1.scheduled false, uq1.scheduled false, q1.length 0
>>> 47: is2.scheduled false, uq2.scheduled false, q2.length 0
>>> 55: is2.scheduled false, uq2.scheduled false, q2.length 0
>>> 56: is2.scheduled false, uq2.scheduled false, q2.length 0
>>> 76: is2.scheduled false, uq2.scheduled false, q2.length 0
>>>
>>> *** Now this, on the other hand, I would expect. In a multithreaded
>>> environment, notification will cause some delay between the upstream
>>> & downstream sides of a queue. This is because the upstream side
>>> must wake up the downstream side, or vice versa. The cross-thread
>>> communication required for this can happen a bit slowly, leaving a
>>> moment (maybe more than a moment) where neither task appears to be
>>> scheduled. (In fact, at least one of the tasks will be
>>> "pre-scheduled.")
>>>
>>> So my questions for you are these. Did you really mean "stuck"?
>>> Have you verified that the tasks NEVER re-schedule themselves? Do
>>> you agree with my analysis? Or are you saying that the "pauses" are
>>> too long, even though the tasks do reschedule themselves?
>>>
>>> Eddie
>>>
>>>
>>>
>>> On 10/12/2010 05:23 PM, Beyers Cronje wrote:
>>>
>>> Hi Eddie, list,
>>>
>>> I've come across a multithreading bug in queue signal handling which
>>> I can reproduce fairly consistently with userlevel multithreading
>>> Click. The symptom is that either the upstream or the downstream
>>> task of the queue is unscheduled even though the notifier signal is
>>> active; this occurs when the queue becomes either full or empty. To
>>> illustrate, here is the config I use and some debug handler outputs:
>>>
>>> //******************** Config ************************
>>> is1::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>>>     LIMIT 1000, STOP false) -> ThreadSafeQueue -> uq1::Unqueue -> Discard;
>>> is2::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>>>     LIMIT -1, STOP false) -> q::ThreadSafeQueue -> uq2::Unqueue -> Discard;
>>>
>>> StaticThreadSched(is1 0, uq1 1, is2 2, uq2 3);
>>>
>>>
>>> //******************** Debug Handler Output when upstream push task is stuck ************************
>>> read q.length
>>> 200 Read handler 'q.length' OK
>>> DATA 1
>>> 0
>>>
>>> read q.fullnote_state
>>> 200 Read handler 'q.fullnote_state' OK
>>> DATA 131
>>> empty notifier off
>>> task 0x25b8830 [uq2 :: Unqueue] unscheduled
>>> full notifier on
>>> task 0x25b8350 [is2 :: InfiniteSource] unscheduled
>>>
>>> //******************** Debug Handler Output when downstream pull task is stuck ************************
>>> read q.length
>>> 200 Read handler 'q.length' OK
>>> DATA 4
>>> 1000
>>>
>>> read q.fullnote_state
>>> 200 Read handler 'q.fullnote_state' OK
>>> DATA 131
>>> empty notifier on
>>> task 0x1c6f830 [uq2 :: Unqueue] unscheduled
>>> full notifier off
>>> task 0x1c6f350 [is2 :: InfiniteSource] unscheduled
>>>
>>>
>>>
>>> //*****************************************************************************************************************
>>>
>>> Clearly the notifier states are correct, but somehow the relevant
>>> task is not rescheduled. The above config uses ThreadSafeQueue, but
>>> I verified that the same issue occurs when using FullNoteQueue.
>>> The obvious places to look are ActiveNotifier::set_active and
>>> FullNoteQueue::push_success/push_failure/pull_success/pull_failure,
>>> but so far I haven't spotted anything wrong with the relevant code;
>>> clearly I'm overlooking something.
>>>
>>> Have you or anyone else on the list got any suggestions?
>>>
>>> If it helps, I'm running Click source from a couple of weeks back,
>>> on a default 64-bit Fedora Core 13 kernel with preemption enabled
>>> (2.6.33.3-85.fc13.x86_64), Intel Dual-Core CPU.
>>> I start userlevel Click with the following command:
>>> 'click --threads=4 conf/threadtest.click -p 777'
>>>
>>> Cheers
>>>
>>> Beyers
>>> _______________________________________________
>>> click mailing list
>>> click at amsterdam.lcs.mit.edu
>>>
>>> https://amsterdam.lcs.mit.edu/mailman/listinfo/click
>>>
>>>
>>>
>>>
>
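[Editor's note: for readers finding this thread later, the "notifier on, task unscheduled" state in the handler dumps is the signature of a lost wakeup between a producer and a sleeping consumer. A minimal model in C++ atomics (invented names; not Click's actual FullNoteQueue/ActiveNotifier code) shows both the race and the publish-then-recheck pattern that closes it:

```cpp
#include <atomic>
#include <cassert>

// Minimal model of the sleep/wake handoff around a queue.  The producer
// pushes and wakes the consumer if it is asleep; the consumer goes to
// sleep when the queue looks empty.
struct Queue {
    std::atomic<int> length{0};
    std::atomic<bool> consumer_asleep{false};

    void push() {
        length.fetch_add(1, std::memory_order_seq_cst);
        if (consumer_asleep.load(std::memory_order_seq_cst)) {
            consumer_asleep.store(false, std::memory_order_seq_cst);
            // ...reschedule the consumer task here...
        }
    }

    // BUGGY: check-then-sleep.  If push() runs between the length check
    // and the asleep store, the producer sees asleep == false and skips
    // the wakeup; the consumer then sleeps and nothing ever reschedules
    // it -- the permanently stuck state reported in this thread.
    bool sleep_buggy() {
        if (length.load(std::memory_order_seq_cst) == 0) {
            consumer_asleep.store(true, std::memory_order_seq_cst);
            return true;    // went to sleep
        }
        return false;
    }

    // FIXED: publish "asleep" first, then recheck the queue.  With
    // sequentially consistent ordering, either the consumer sees the
    // pushed packet or the producer sees asleep == true (or both), so
    // the wakeup can no longer fall through the crack.
    bool sleep_fixed() {
        consumer_asleep.store(true, std::memory_order_seq_cst);
        if (length.load(std::memory_order_seq_cst) != 0) {
            consumer_asleep.store(false, std::memory_order_seq_cst);
            return false;   // stay scheduled and drain the queue
        }
        return true;        // genuinely empty; notifier will wake us
    }
};
```

This model also matches why a manual queue reset un-sticks the tasks: the reset fires the notifier again, replaying the wakeup that was lost.]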