[Click] Race condition with FullNoteQueue signaling in multithreading

Eddie Kohler kohler at cs.ucla.edu
Fri Oct 15 10:40:47 EDT 2010


Great, glad it's fixed!
E


On 10/15/2010 06:35 AM, Beyers Cronje wrote:
> Hi Eddie,
> I'm happy to report that I've been unable to reproduce the bug using the
> latest commit. So as far as I can tell this fixed the issue I was having.
> Eddie, Cliff, you guys rock, thanks so much for your assistance on this.
> Kind regards
> Beyers Cronje
>
> On Fri, Oct 15, 2010 at 6:24 AM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>
>     Beyers,
>
>     I believe I've found the problem... or A problem.  What does this
>     commit do for you?
>
>     http://github.com/kohler/click/commit/96471bf1c0873447de007969f5a1e49b1a947071
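>
>     For the archives: the general shape of fixes in this area is to
>     re-check the queue state after putting a notifier to sleep, because
>     the other thread may push or pull in the window between the state
>     check and the sleep.  A rough sketch with toy stand-in types --
>     illustrative only, see the commit itself for what actually changed:
>
>     //////////////////////////////////////////////////////////////
>     // Toy stand-ins, not Click's ActiveNotifier/Packet classes:
>     struct Note { bool active = false;
>                   void wake() { active = true; }
>                   void sleep() { active = false; } };
>     struct Packet {};
>
>     struct ToyQueue {
>         int len = 0, cap = 1000;
>         Note empty_note, full_note;
>         void enqueue(Packet *) { ++len; }
>         int size() const { return len; }
>         int capacity() const { return cap; }
>
>         void push_success(Packet *p) {
>             enqueue(p);             // make the packet visible first
>             empty_note.wake();      // then tell the consumer there is data
>             if (size() == capacity()) {
>                 full_note.sleep();  // queue looks full: stop the producer
>                 // The consumer may have pulled in the window between
>                 // size() and sleep(); without this re-check that wakeup
>                 // is lost and the producer never runs again.
>                 if (size() < capacity())
>                     full_note.wake();
>             }
>         }
>     };
>     //////////////////////////////////////////////////////////////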
>
>     Eddie
>
>
>
>     On 10/13/2010 05:01 PM, Beyers Cronje wrote:
>
>         Hi Eddie,
>
>         Thank you for your response and for looking into this. I keep
>         forgetting about your excellent Script element; during my tests I
>         used a much more rudimentary method of verifying the issue :) I put
>         a Counter element before and after the queue and manually read each
>         counter's rate and count handlers. After a while I noticed that
>         both counters' rates dropped to 0 and both counters' counts stopped
>         increasing, *permanently*. From what I can see the tasks NEVER
>         re-schedule themselves.
>         The only way I found to reschedule them was to manually issue a
>         reset on the queue, which seems to re-activate the notifier signals
>         and in turn schedule the tasks.
>         One thing I forgot to mention: I set SLEEPINESS_TRIGGER to 1, which
>         seems to trigger the condition quicker. Also, strangely, if I set
>         the first InfiniteSource LIMIT to 1000 instead of -1 (as per the
>         original config posted) it also seems to trigger the condition much
>         quicker. In fact, I couldn't reproduce the issue with both
>         InfiniteSource elements set to -1 during a 10min test run, whereas
>         with LIMIT 1000 it always happens within 10min of testing time.
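>         For context, my reading of the sleepiness logic -- paraphrased
>         from memory, not quoted from fullnotequeue.cc -- is that the queue
>         only puts its empty-notifier to sleep after several consecutive
>         failed pulls, so SLEEPINESS_TRIGGER 1 makes the consumer give up
>         on the first failure, which presumably widens the race window:
>
>         void pull_failure() {
>             if (_sleepiness >= SLEEPINESS_TRIGGER)
>                 _empty_note.sleep();  // consumer idles until a push wakes it
>             else
>                 ++_sleepiness;        // back off gradually before sleeping
>         }
>         // (A successful pull resets _sleepiness, if I'm reading it right.)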
>         I'm out of the office now, but will test with both your script and
>         Cliff's script tomorrow and post back my results.
>         Beyers
>
>         On Wed, Oct 13, 2010 at 5:06 PM, Eddie Kohler
>         <kohler at cs.ucla.edu> wrote:
>
>             Hi Beyers!
>
>             I am not sure that I agree with your analysis about the queue
>             getting "stuck."
>
>             Here is the configuration I used, a tweak of yours:
>
>             //////////////////////////////////////////////////////////////
>
>             is1::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>                 LIMIT -1, STOP false)
>                 -> q1::FullNoteQueue -> uq1::Unqueue -> Discard;
>
>             is2::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>                 LIMIT -1, STOP false)
>                 -> q2::FullNoteQueue -> uq2::Unqueue -> Discard;
>
>
>             StaticThreadSched(is1 0, uq1 1, is2 2, uq2 3);
>             Script(TYPE ACTIVE,
>                    init x 0,
>                    set x $(add $x 1),
>                    goto foo $(or $(is1.scheduled) $(uq1.scheduled)),
>                    print "$x: is1.scheduled false, uq1.scheduled false, q1.length $(q1.length)",
>                    label foo,
>                    goto bar $(or $(is2.scheduled) $(uq2.scheduled)),
>                    print "$x: is2.scheduled false, uq2.scheduled false, q2.length $(q2.length)",
>                    label bar,
>                    wait 0.5s,
>                    loop);
>             //////////////////////////////////////////////////////////////
>
>             This should print a message whenever neither source nor sink is
>             scheduled.
>
>             I also tried this with "FullNoteQueue" replaced with
>             "ThreadSafeQueue."
>
>             If a queue were truly getting 'stuck,' we would expect to see
>             something like
>
>             745: is1.scheduled false, uq1.scheduled false
>             746: is1.scheduled false, uq1.scheduled false
>             747: is1.scheduled false, uq1.scheduled false
>             748: is1.scheduled false, uq1.scheduled false
>             749: is1.scheduled false, uq1.scheduled false....
>
>             with no gaps, because something would go unscheduled *and stay
>             that way*.
>
>             But that's not what I see.  I see this:
>
>             1: is1.scheduled false, uq1.scheduled false, q1.length 17
>             1: is2.scheduled false, uq2.scheduled false, q2.length 0
>             20: is1.scheduled false, uq1.scheduled false, q1.length 14
>             22: is1.scheduled false, uq1.scheduled false, q1.length 0
>             28: is1.scheduled false, uq1.scheduled false, q1.length 14
>             41: is1.scheduled false, uq1.scheduled false, q1.length 0
>             47: is2.scheduled false, uq2.scheduled false, q2.length 0
>             55: is2.scheduled false, uq2.scheduled false, q2.length 0
>             56: is2.scheduled false, uq2.scheduled false, q2.length 0
>             76: is2.scheduled false, uq2.scheduled false, q2.length 0
>
>             *** Now this, on the other hand, I would expect.  In a
>             multithreaded environment, notification will cause some delay
>             between the upstream & downstream sides of a queue.  This is
>             because the upstream side must wake up the downstream side, or
>             vice versa.  The cross-thread communication required for this
>             can happen a bit slowly, leaving a moment (maybe more than a
>             moment) where neither task appears to be scheduled.  (In fact,
>             at least one of the tasks will be "pre-scheduled.")
>
>             So my questions for you are these.  Did you really mean "stuck"?
>             Have you verified that the tasks NEVER re-schedule themselves?
>             Do you agree with my analysis?  Or are you saying that the
>             "pauses" are too long, even though the tasks do reschedule
>             themselves?
>
>             Eddie
>
>
>
>             On 10/12/2010 05:23 PM, Beyers Cronje wrote:
>
>                 Hi Eddie, list,
>
>                 I've come across a multithreading bug in queue signal
>                 handling which I can reproduce fairly consistently with
>                 userlevel multithreaded Click. The symptom is that either
>                 the upstream or the downstream task of the queue is
>                 unscheduled even though the notifier signal is active; this
>                 occurs when the queue becomes either full or empty. To
>                 illustrate, here is the config I use and some debug handler
>                 outputs:
>
>                 //******************** Config ************************
>                 is1::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>                     LIMIT 1000, STOP false)
>                     -> ThreadSafeQueue -> uq1::Unqueue -> Discard;
>                 is2::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>                     LIMIT -1, STOP false)
>                     -> q::ThreadSafeQueue -> uq2::Unqueue -> Discard;
>
>                 StaticThreadSched(is1 0, uq1 1, is2 2, uq2 3);
>
>
>                 //******************** Debug Handler Output when upstream
>                 push task is stuck ************************
>                 read q.length
>                 200 Read handler 'q.length' OK
>                 DATA 1
>                 0
>
>                 read q.fullnote_state
>                 200 Read handler 'q.fullnote_state' OK
>                 DATA 131
>                 empty notifier off
>                 task 0x25b8830 [uq2 :: Unqueue] unscheduled
>                 full notifier on
>                 task 0x25b8350 [is2 :: InfiniteSource] unscheduled
>
>                 //******************** Debug Handler Output when downstream
>                 pull task is stuck ************************
>                 read q.length
>                 200 Read handler 'q.length' OK
>                 DATA 4
>                 1000
>
>                 read q.fullnote_state
>                 200 Read handler 'q.fullnote_state' OK
>                 DATA 131
>                 empty notifier on
>                 task 0x1c6f830 [uq2 :: Unqueue] unscheduled
>                 full notifier off
>                 task 0x1c6f350 [is2 :: InfiniteSource] unscheduled
>
>
>                 //************************************************************
>
>                 Clearly the notifier states are correct, but somehow the
>                 relevant task is not rescheduled. The above config uses
>                 ThreadSafeQueue, but I verified that the same issue occurs
>                 when using FullNoteQueue.
>                 The obvious places to look are ActiveNotifier::set_active and
>                 FullNoteQueue::push_success/push_failure/pull_success/pull_failure,
>                 but so far I haven't spotted anything wrong with the
>                 relevant code, so clearly I'm overlooking something.
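>                 To make the suspicion concrete, the interleaving I am
>                 imagining is something like this (a hand-drawn sketch with
>                 simplified names, not a claim about the actual code path):
>
>                 // consumer thread (uq2)        producer thread (is2)
>                 // ----------------------       ---------------------
>                 // pull fails, queue empty
>                 // begins to go idle
>                 // (unschedule in flight)
>                 //                              push packet
>                 //                              _empty_note.wake():
>                 //                                signal -> on
>                 //                                cross-thread reschedule
>                 //                                of uq2 issued
>                 // idle transition completes,
>                 // clobbering the in-flight
>                 // reschedule
>                 //
>                 // End state: signal on, uq2 unscheduled, and nothing left
>                 // to wake it -- which matches the handler output above.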
>
>                 Have you or anyone else on the list got any suggestions?
>
>                 If it helps, I'm running Click source from a couple of weeks
>                 back, on a default 64-bit Fedora Core 13 kernel with
>                 preemption enabled (2.6.33.3-85.fc13.x86_64), on an Intel
>                 dual-core CPU. I start userlevel Click with the following
>                 command: 'click --threads=4 conf/threadtest.click -p 777'
>
>                 Cheers
>
>                 Beyers
>                 _______________________________________________
>                 click mailing list
>                 click at amsterdam.lcs.mit.edu
>                 https://amsterdam.lcs.mit.edu/mailman/listinfo/click
>
>

