[Click] Race condition with FullNoteQueue signaling in multithreading
Eddie Kohler
kohler at cs.ucla.edu
Fri Oct 15 10:40:47 EDT 2010
Great, glad it's fixed!
E
On 10/15/2010 06:35 AM, Beyers Cronje wrote:
> Hi Eddie,
> I'm happy to report that I've been unable to reproduce the bug using the
> latest commit. So as far as I can tell this fixed the issue I was having.
> Eddie, Cliff, you guys rock, thanks so much for your assistance on this.
> Kind regards
> Beyers Cronje
>
>     On Fri, Oct 15, 2010 at 6:24 AM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>
> Beyers,
>
> I believe I've found the problem... or A problem. What does this
> commit do for you?
>
> http://github.com/kohler/click/commit/96471bf1c0873447de007969f5a1e49b1a947071
>
> Eddie
>
>
>
> On 10/13/2010 05:01 PM, Beyers Cronje wrote:
>
> Hi Eddie,
>
>         Thank you for your response and for looking into this. I keep
>         forgetting about your excellent Script element; during my tests
>         I used a much more rudimentary method of verifying the issue :)
>         I used a Counter element before and after the queue and manually
>         read the counters' rate and count handlers. After a while I
>         noticed that both counters' rates dropped to 0 and both
>         counters' counts stopped increasing, *permanently*. From what I
>         can see the tasks NEVER reschedule themselves.
>         The only way I found to reschedule them was to manually issue a
>         reset on the queue, which seems to re-activate the notifier
>         signals and in turn schedule the tasks.
>         One thing I forgot to mention: I set SLEEPINESS_TRIGGER to 1,
>         which seems to trigger the condition more quickly. Also,
>         strangely, if I set the first InfiniteSource's LIMIT to 1000
>         instead of -1 (as per the original config posted), it also
>         triggers the condition much more quickly; in fact, I couldn't
>         reproduce the issue with both InfiniteSource elements set to -1
>         during a 10-minute test run, whereas with LIMIT 1000 it always
>         happened within 10 minutes of testing time.
> I'm out of the office now, but will test with both your script and
> Cliff's script tomorrow and post back my results.
> Beyers
>
>         On Wed, Oct 13, 2010 at 5:06 PM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>
> Hi Beyers!
>
> I am not sure that I agree with your analysis about the queue
> getting "stuck."
>
> Here is the configuration I used, a tweak of yours:
>
> //////////////////////////////////////////////////////////////
>
>             is1::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>                 LIMIT -1, STOP false) -> q1::FullNoteQueue -> uq1::Unqueue -> Discard;
>
>             is2::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>                 LIMIT -1, STOP false) -> q2::FullNoteQueue -> uq2::Unqueue -> Discard;
>
>
> StaticThreadSched(is1 0, uq1 1, is2 2, uq2 3);
>             Script(TYPE ACTIVE,
>                 init x 0,
>                 set x $(add $x 1),
>                 goto foo $(or $(is1.scheduled) $(uq1.scheduled)),
>                 print "$x: is1.scheduled false, uq1.scheduled false, q1.length $(q1.length)",
>                 label foo,
>                 goto bar $(or $(is2.scheduled) $(uq2.scheduled)),
>                 print "$x: is2.scheduled false, uq2.scheduled false, q2.length $(q2.length)",
>                 label bar,
>                 wait 0.5s,
>                 loop);
> //////////////////////////////////////////////////////////////
>
>             This should print a message whenever neither source nor sink
>             is scheduled.
>
>             I also tried this with "FullNoteQueue" replaced with
>             "ThreadSafeQueue."
>
>             If a queue were truly getting 'stuck,' we would expect to
>             see something like
> 745: is1.scheduled false, uq1.scheduled false
> 746: is1.scheduled false, uq1.scheduled false
> 747: is1.scheduled false, uq1.scheduled false
> 748: is1.scheduled false, uq1.scheduled false
> 749: is1.scheduled false, uq1.scheduled false....
>
>             with no gaps, because something would go unscheduled *and
>             stay that way*.
>
> But that's not what I see. I see this:
>
> 1: is1.scheduled false, uq1.scheduled false, q1.length 17
> 1: is2.scheduled false, uq2.scheduled false, q2.length 0
> 20: is1.scheduled false, uq1.scheduled false, q1.length 14
> 22: is1.scheduled false, uq1.scheduled false, q1.length 0
> 28: is1.scheduled false, uq1.scheduled false, q1.length 14
> 41: is1.scheduled false, uq1.scheduled false, q1.length 0
> 47: is2.scheduled false, uq2.scheduled false, q2.length 0
> 55: is2.scheduled false, uq2.scheduled false, q2.length 0
> 56: is2.scheduled false, uq2.scheduled false, q2.length 0
> 76: is2.scheduled false, uq2.scheduled false, q2.length 0
>
>             *** Now this, on the other hand, I would expect. In a
>             multithreaded environment, notification causes some delay
>             between the upstream and downstream sides of a queue,
>             because the upstream side must wake up the downstream side,
>             or vice versa. The cross-thread communication required for
>             this can happen a bit slowly, leaving a moment (maybe more
>             than a moment) where neither task appears to be scheduled.
>             (In fact, at least one of the tasks will be
>             "pre-scheduled.")
>
>             So my questions for you are these: Did you really mean
>             "stuck"? Have you verified that the tasks NEVER reschedule
>             themselves? Do you agree with my analysis? Or are you saying
>             that the "pauses" are too long, even though the tasks do
>             reschedule themselves?
>
> Eddie
>
>
>
> On 10/12/2010 05:23 PM, Beyers Cronje wrote:
>
> Hi Eddie, list,
>
>                 I've come across a multithreading bug in queue signal
>                 handling which I can reproduce fairly consistently with
>                 userlevel multithreaded Click. The symptom is that
>                 either the upstream or the downstream task of the queue
>                 is unscheduled even though the corresponding notifier
>                 signal is active; this occurs when the queue becomes
>                 either full or empty. To illustrate, here is the config
>                 I use and some debug handler outputs:
>
>                 //******************** Config ************************
>                 is1::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>                     LIMIT 1000, STOP false) -> ThreadSafeQueue -> uq1::Unqueue -> Discard;
>                 is2::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>                     LIMIT -1, STOP false) -> q::ThreadSafeQueue -> uq2::Unqueue -> Discard;
>
> StaticThreadSched(is1 0, uq1 1, is2 2, uq2 3);
>
>
>                 //******************** Debug Handler Output when upstream push task is stuck ************************
> read q.length
> 200 Read handler 'q.length' OK
> DATA 1
> 0
>
> read q.fullnote_state
> 200 Read handler 'q.fullnote_state' OK
> DATA 131
> empty notifier off
> task 0x25b8830 [uq2 :: Unqueue] unscheduled
> full notifier on
> task 0x25b8350 [is2 :: InfiniteSource] unscheduled
>
>                 //******************** Debug Handler Output when downstream pull task is stuck ************************
> read q.length
> 200 Read handler 'q.length' OK
> DATA 4
> 1000
>
> read q.fullnote_state
> 200 Read handler 'q.fullnote_state' OK
> DATA 131
> empty notifier on
> task 0x1c6f830 [uq2 :: Unqueue] unscheduled
> full notifier off
> task 0x1c6f350 [is2 :: InfiniteSource] unscheduled
>
>
> //*****************************************************************************************************************
>
>                 Clearly the notifier states are correct, but somehow the
>                 relevant task is not rescheduled. The above config uses
>                 ThreadSafeQueue, but I verified that the same issue
>                 occurs when using FullNoteQueue.
>                 The obvious places to look are
>                 ActiveNotifier::set_active and
>                 FullNoteQueue::push_success/push_failure/pull_success/pull_failure,
>                 but so far I haven't spotted anything wrong with the
>                 relevant code; clearly I'm overlooking something.
>
> Have you or anyone else on the list got any suggestions?
>
>                 If it helps, I'm running Click source from a couple of
>                 weeks back, on a default 64-bit Fedora Core 13 kernel
>                 with preemption enabled (2.6.33.3-85.fc13.x86_64),
>                 Intel dual-core CPU.
>                 I start userlevel Click with the following command:
>                 'click --threads=4 conf/threadtest.click -p 777'
>
> Cheers
>
> Beyers
> _______________________________________________
> click mailing list
> click at amsterdam.lcs.mit.edu
>
> https://amsterdam.lcs.mit.edu/mailman/listinfo/click
>
>
>