[Click] Race condition with FullNoteQueue signaling in multithreading

Eddie Kohler kohler at cs.ucla.edu
Wed Oct 13 14:20:40 EDT 2010


This script is very useful.  Perhaps (unshockingly :| god damn it) there is 
some problem with waking up a sleeping thread at userlevel; you may recall a 
lot of retarded drama around that issue.  The Script may help by cauisng 
threads to sleep for limited amounts of time.  I will look.

Eddie


On 10/13/2010 09:01 AM, Cliff Frey wrote:
> I am getting the stuck behavior that Beyers describes.  However if I add
> the script, the problem for the most part goes away.
>
> Specifically, if I use this config:
>
> //////////////////////////////////////////////////////////////
> is1::InfiniteSource(DATA \<00 00 c0 ae 67 ef  00 00 00 00 00 00  08 00>,
> LIMIT 1000, STOP false)
>   -> q1::ThreadSafeQueue
>   -> uq1::Unqueue
>   -> Discard;
>
> is2::InfiniteSource(DATA \<00 00 c0 ae 67 ef  00 00 00 00 00 00  08 00>,
> LIMIT -1, STOP false)
>   -> q2::ThreadSafeQueue
>   -> uq2::Unqueue
>   -> Discard;
>
> StaticThreadSched(is1 0, uq1 1, is2 2, uq2 3);
>
> Script(TYPE ACTIVE,
>         init x 0,
>         set x $(add $x 1),
>         read uq2.count,
>         write uq2.reset_counts,
>         wait 5s,
>         loop);
> //////////////////////////////////////////////////////////////
>
> Then when I run with --threads=4, I get highly variable (and low) output:
> uq2.count:
> 11973
> uq2.count:
> 332166
> uq2.count:
> 86556
> uq2.count:
> 2884924
> uq2.count:
> 419946
>
> compared to if I run with threads=1
> uq2.count:
> 12183593
> uq2.count:
> 12186308
> uq2.count:
> 12091170
>
> This makes me think that it is a problem with the thread scheduling
> logic, not with ThreadSafeQueue, because the scheduling of the Script
> task seems to kick things correctly.  I can look into this more, but
> I'll wait for some feedback from Eddie first...
>
> Cliff
>
> On Wed, Oct 13, 2010 at 8:06 AM, Eddie Kohler <kohler at cs.ucla.edu
> <mailto:kohler at cs.ucla.edu>> wrote:
>
>     Hi Beyers!
>
>     I am not sure that I agree with your analysis about the queue
>     getting "stuck."
>
>     Here is the configuration I used, a tweak of yours:
>
>     //////////////////////////////////////////////////////////////
>     is1::InfiniteSource(DATA \<00 00 c0 ae 67 ef  00 00 00 00 00 00  08 00>,
>     LIMIT -1, STOP false) -> q1::FullNoteQueue -> uq1::Unqueue -> Discard;
>     is2::InfiniteSource(DATA \<00 00 c0 ae 67 ef  00 00 00 00 00 00  08 00>,
>     LIMIT -1, STOP false) -> q2::FullNoteQueue -> uq2::Unqueue -> Discard;
>
>     StaticThreadSched(is1 0, uq1 1, is2 2, uq2 3);
>     Script(TYPE ACTIVE,
>             init x 0,
>             set x $(add $x 1),
>             goto foo $(or $(is1.scheduled) $(uq1.scheduled)),
>             print "$x: is1.scheduled false, uq1.scheduled false,
>     q1.length $(q1.length)",
>             label foo,
>             goto bar $(or $(is2.scheduled) $(uq2.scheduled)),
>             print "$x: is2.scheduled false, uq2.scheduled false,
>     q2.length $(q2.length)",
>             label bar,
>             wait 0.5s,
>             loop);
>     //////////////////////////////////////////////////////////////
>
>     This should print a message whenever neither source nor sink is
>     scheduled.
>
>     I also tried this with "FullNoteQueue" replaced with "ThreadSafeQueue."
>
>     If a queue was truly getting 'stuck,' we would expect to see
>     something like
>
>     745: is1.scheduled false, uq1.scheduled false
>     746: is1.scheduled false, uq1.scheduled false
>     747: is1.scheduled false, uq1.scheduled false
>     748: is1.scheduled false, uq1.scheduled false
>     749: is1.scheduled false, uq1.scheduled false....
>
>     with no gaps, because something would go unscheduled *and stay that
>     way*.
>
>     But that's not what I see.  I see this:
>
>     1: is1.scheduled false, uq1.scheduled false, q1.length 17
>     1: is2.scheduled false, uq2.scheduled false, q2.length 0
>     20: is1.scheduled false, uq1.scheduled false, q1.length 14
>     22: is1.scheduled false, uq1.scheduled false, q1.length 0
>     28: is1.scheduled false, uq1.scheduled false, q1.length 14
>     41: is1.scheduled false, uq1.scheduled false, q1.length 0
>     47: is2.scheduled false, uq2.scheduled false, q2.length 0
>     55: is2.scheduled false, uq2.scheduled false, q2.length 0
>     56: is2.scheduled false, uq2.scheduled false, q2.length 0
>     76: is2.scheduled false, uq2.scheduled false, q2.length 0
>
>     *** Now this, on the other hand, I would expect.  In a multithreaded
>     environment, notification will cause some delay between the upstream &
>     downstream sides of a queue.  This is because the upstream side must
>     wake up
>     the downstream side, or vice versa.  The cross-thread communication
>     required
>     for this can happen a bit slowly, leaving a moment (maybe more than
>     a moment)
>     where neither task appears to be scheduled.  (In fact, at least one
>     of the
>     tasks will be "pre-scheduled.")
>
>     So my questions for you are this.  Did you really mean "stuck"?
>       Have you
>     verified that the tasks NEVER re-schedule themselves?  Do you agree
>     with my
>     analysis?  Or, are you saying that the "pauses" are too long, even
>     though the
>     tasks do reschedule themselves?
>
>     Eddie
>
>
>     On 10/12/2010 05:23 PM, Beyers Cronje wrote:
>      > Hi Eddie, list,
>      >
>      > I've come across a multithreading bug in queue signal handling
>     which I can
>      > reproduce fairly consistently with userlevel multithreading
>     Click. The
>      > symptoms are that either the upstream or downstream task from the
>     queue are
>      > unscheduled even though the notifier signal is active, this
>     occurs when the
>      > queue either becomes full or empty. To illustrate here is the
>     config I use
>      > and some debug handler outputs:
>      >
>      > //******************** Config ************************
>      > is1::InfiniteSource(DATA \<00 00 c0 ae 67 ef  00 00 00 00 00 00
>       08 00>,
>      > LIMIT 1000, STOP false) ->  ThreadSafeQueue ->  uq1::Unqueue ->
>       Discard;
>      > is2::InfiniteSource(DATA \<00 00 c0 ae 67 ef  00 00 00 00 00 00
>       08 00>,
>      > LIMIT -1, STOP false) ->  q::ThreadSafeQueue ->  uq2::Unqueue ->
>       Discard;
>      >
>      > StaticThreadSched(is1 0, uq1 1, is2 2, uq2 3);
>      >
>      >
>      > //******************** Debug Handler Output when upstream push
>     task is stuck
>      > ************************
>      > read q.length
>      > 200 Read handler 'q.length' OK
>      > DATA 1
>      > 0
>      >
>      > read q.fullnote_state
>      > 200 Read handler 'q.fullnote_state' OK
>      > DATA 131
>      > empty notifier off
>      > task 0x25b8830 [uq2 :: Unqueue] unscheduled
>      > full notifier on
>      > task 0x25b8350 [is2 :: InfiniteSource] unscheduled
>      >
>      > //******************** Debug Handler Output when downstream pull
>     task is
>      > stuck ************************
>      > read q.length
>      > 200 Read handler 'q.length' OK
>      > DATA 4
>      > 1000
>      >
>      > read q.fullnote_state
>      > 200 Read handler 'q.fullnote_state' OK
>      > DATA 131
>      > empty notifier on
>      > task 0x1c6f830 [uq2 :: Unqueue] unscheduled
>      > full notifier off
>      > task 0x1c6f350 [is2 :: InfiniteSource] unscheduled
>      >
>      >
>     //*****************************************************************************************************************
>      >
>      > Clearly the notifier states are correct, but somehow the relevant
>     task is
>      > not rescheduled. The above config uses ThreadSafeQueue but I
>     verified that
>      > the same issue occurs when using FullNoteQueue.
>      > The obvious places to look are
>      >
>     ActiveNotifier::set_active,FullNoteQueue::push_success/push_failure/pull_success/pull_failure
>      > but so far I haven't spotted anything wrong with the relevant
>     code, clearly
>      > I'm overlooking something.
>      >
>      > Have you or anyone else on the list got any suggestions?
>      >
>      > If it helps, I'm running click source from a couple of weeks
>     back, default
>      > 64bit Fedora Core 13 kernel with preemption enabled
>      > (2.6.33.3-85.fc13.x86_64 ), Intel Dual-Core CPU.
>      > I start userlevel click with the following command: 'click
>     --threads=4
>      > conf/threadtest.click -p 777'
>      >
>      > Cheers
>      >
>      > Beyers
>      > _______________________________________________
>      > click mailing list
>      > click at amsterdam.lcs.mit.edu <mailto:click at amsterdam.lcs.mit.edu>
>      > https://amsterdam.lcs.mit.edu/mailman/listinfo/click
>     _______________________________________________
>     click mailing list
>     click at amsterdam.lcs.mit.edu <mailto:click at amsterdam.lcs.mit.edu>
>     https://amsterdam.lcs.mit.edu/mailman/listinfo/click
>
>


More information about the click mailing list