[Click] Race condition with FullNoteQueue signaling in multithreading

Eddie Kohler kohler at cs.ucla.edu
Fri Oct 15 00:24:54 EDT 2010


Beyers,

I believe I've found the problem... or A problem.  What does this commit do 
for you?

http://github.com/kohler/click/commit/96471bf1c0873447de007969f5a1e49b1a947071

Eddie


On 10/13/2010 05:01 PM, Beyers Cronje wrote:
> Hi Eddie,
> Thank you for your response and looking into this. I keep forgetting
> about your excellent Script element, during my tests I used a much more
> rudimentary method of verifying the issue :) I used a Counter element
> before and after the queue and manually read the counter's rate and
> count handlers. After a while I noticed that both counters rate dropped
> to 0 and and both counters count stopped increasing, *permanently*. From
> what I can see the tasks NEVER re-schedule themselves.
> The only way I found to reschedule them was to manually issue a reset on
> the queue, which seems to re-activate the notifier signals and in turn
> schedule the tasks.
> One thing I forgot to mention, I set SLEEPINESS_TRIGGER to 1, this seems
> to trigger the condition quicker. Also, strangely if I set the first
> InfiniteSoure LIMIT to 1000 instead of -1 (as per original config
> posted) it also seems to trigger the condition much quicker, in fact I
> couldn't reproduce the issue with both InfiniteSource elements set to -1
> during a 10min test run where with LIMIT 1000 it always happens within
> 10min of testing time.
> I'm out of the office now, but will test with both your script and
> Cliff's script tomorrow and post back my results.
> Beyers
>
> On Wed, Oct 13, 2010 at 5:06 PM, Eddie Kohler <kohler at cs.ucla.edu
> <mailto:kohler at cs.ucla.edu>> wrote:
>
>     Hi Beyers!
>
>     I am not sure that I agree with your analysis about the queue
>     getting "stuck."
>
>     Here is the configuration I used, a tweak of yours:
>
>     //////////////////////////////////////////////////////////////
>
>     is1::InfiniteSource(DATA \<00 00 c0 ae 67 ef  00 00 00 00 00 00  08 00>,
>     LIMIT -1, STOP false) -> q1::FullNoteQueue -> uq1::Unqueue -> Discard;
>
>     is2::InfiniteSource(DATA \<00 00 c0 ae 67 ef  00 00 00 00 00 00  08 00>,
>     LIMIT -1, STOP false) -> q2::FullNoteQueue -> uq2::Unqueue -> Discard;
>
>
>     StaticThreadSched(is1 0, uq1 1, is2 2, uq2 3);
>     Script(TYPE ACTIVE,
>             init x 0,
>             set x $(add $x 1),
>             goto foo $(or $(is1.scheduled) $(uq1.scheduled)),
>             print "$x: is1.scheduled false, uq1.scheduled false,
>     q1.length $(q1.length)",
>             label foo,
>             goto bar $(or $(is2.scheduled) $(uq2.scheduled)),
>             print "$x: is2.scheduled false, uq2.scheduled false,
>     q2.length $(q2.length)",
>             label bar,
>             wait 0.5s,
>             loop);
>     //////////////////////////////////////////////////////////////
>
>     This should print a message whenever neither source nor sink is
>     scheduled.
>
>     I also tried this with "FullNoteQueue" replaced with "ThreadSafeQueue."
>
>     If a queue was truly getting 'stuck,' we would expect to see
>     something like
>
>     745: is1.scheduled false, uq1.scheduled false
>     746: is1.scheduled false, uq1.scheduled false
>     747: is1.scheduled false, uq1.scheduled false
>     748: is1.scheduled false, uq1.scheduled false
>     749: is1.scheduled false, uq1.scheduled false....
>
>     with no gaps, because something would go unscheduled *and stay that
>     way*.
>
>     But that's not what I see.  I see this:
>
>     1: is1.scheduled false, uq1.scheduled false, q1.length 17
>     1: is2.scheduled false, uq2.scheduled false, q2.length 0
>     20: is1.scheduled false, uq1.scheduled false, q1.length 14
>     22: is1.scheduled false, uq1.scheduled false, q1.length 0
>     28: is1.scheduled false, uq1.scheduled false, q1.length 14
>     41: is1.scheduled false, uq1.scheduled false, q1.length 0
>     47: is2.scheduled false, uq2.scheduled false, q2.length 0
>     55: is2.scheduled false, uq2.scheduled false, q2.length 0
>     56: is2.scheduled false, uq2.scheduled false, q2.length 0
>     76: is2.scheduled false, uq2.scheduled false, q2.length 0
>
>     *** Now this, on the other hand, I would expect.  In a multithreaded
>     environment, notification will cause some delay between the upstream
>     & downstream sides of a queue.  This is because the upstream side
>     must wake up the downstream side, or vice versa.  The cross-thread
>     communication required for this can happen a bit slowly, leaving a
>     moment (maybe more than a moment) where neither task appears to be
>     scheduled.  (In fact, at least one of the tasks will be
>     "pre-scheduled.")
>
>     So my questions for you are this.  Did you really mean "stuck"?
>       Have you verified that the tasks NEVER re-schedule themselves?  Do
>     you agree with my analysis?  Or, are you saying that the "pauses"
>     are too long, even though the tasks do reschedule themselves?
>
>     Eddie
>
>
>
>     On 10/12/2010 05:23 PM, Beyers Cronje wrote:
>
>         Hi Eddie, list,
>
>         I've come across a multithreading bug in queue signal handling
>         which I can
>         reproduce fairly consistently with userlevel multithreading
>         Click. The
>         symptoms are that either the upstream or downstream task from
>         the queue are
>         unscheduled even though the notifier signal is active, this
>         occurs when the
>         queue either becomes full or empty. To illustrate here is the
>         config I use
>         and some debug handler outputs:
>
>         //******************** Config ************************
>         is1::InfiniteSource(DATA \<00 00 c0 ae 67 ef  00 00 00 00 00 00
>           08 00>,
>         LIMIT 1000, STOP false) ->  ThreadSafeQueue ->  uq1::Unqueue ->
>           Discard;
>         is2::InfiniteSource(DATA \<00 00 c0 ae 67 ef  00 00 00 00 00 00
>           08 00>,
>         LIMIT -1, STOP false) ->  q::ThreadSafeQueue ->  uq2::Unqueue ->
>           Discard;
>
>         StaticThreadSched(is1 0, uq1 1, is2 2, uq2 3);
>
>
>         //******************** Debug Handler Output when upstream push
>         task is stuck
>         ************************
>         read q.length
>         200 Read handler 'q.length' OK
>         DATA 1
>         0
>
>         read q.fullnote_state
>         200 Read handler 'q.fullnote_state' OK
>         DATA 131
>         empty notifier off
>         task 0x25b8830 [uq2 :: Unqueue] unscheduled
>         full notifier on
>         task 0x25b8350 [is2 :: InfiniteSource] unscheduled
>
>         //******************** Debug Handler Output when downstream pull
>         task is
>         stuck ************************
>         read q.length
>         200 Read handler 'q.length' OK
>         DATA 4
>         1000
>
>         read q.fullnote_state
>         200 Read handler 'q.fullnote_state' OK
>         DATA 131
>         empty notifier on
>         task 0x1c6f830 [uq2 :: Unqueue] unscheduled
>         full notifier off
>         task 0x1c6f350 [is2 :: InfiniteSource] unscheduled
>
>         //*****************************************************************************************************************
>
>         Clearly the notifier states are correct, but somehow the
>         relevant task is
>         not rescheduled. The above config uses ThreadSafeQueue but I
>         verified that
>         the same issue occurs when using FullNoteQueue.
>         The obvious places to look are
>         ActiveNotifier::set_active,FullNoteQueue::push_success/push_failure/pull_success/pull_failure
>         but so far I haven't spotted anything wrong with the relevant
>         code, clearly
>         I'm overlooking something.
>
>         Have you or anyone else on the list got any suggestions?
>
>         If it helps, I'm running click source from a couple of weeks
>         back, default
>         64bit Fedora Core 13 kernel with preemption enabled
>         (2.6.33.3-85.fc13.x86_64 ), Intel Dual-Core CPU.
>         I start userlevel click with the following command: 'click
>         --threads=4
>         conf/threadtest.click -p 777'
>
>         Cheers
>
>         Beyers
>         _______________________________________________
>         click mailing list
>         click at amsterdam.lcs.mit.edu <mailto:click at amsterdam.lcs.mit.edu>
>         https://amsterdam.lcs.mit.edu/mailman/listinfo/click
>
>


More information about the click mailing list