[Click] Core performance checkins

Cliff Frey cliff at meraki.com
Wed Feb 9 12:49:01 EST 2011


I believe that I can reproduce this, give or take.  If I use this click
config:

InfiniteSource
-> q1 :: ThreadSafeQueue(CAPACITY 10)
-> rs :: RatedUnqueue(RATE 1000)
-> q :: ThreadSafeQueue(CAPACITY 10000)
-> uq :: Unqueue
-> c :: Counter
-> d :: Discard;

Script(wait 3, print $(c.count) $(uq.scheduled) $(q.highwater_length),
       write c.reset_counts, loop);
StaticThreadSched(rs 0, uq 1);


and then run multiple instances of click, I get very weird results.  I have
an 8-core machine, but even running a total of 4 instances of click with
this config causes it to get stuck for up to 3 seconds at a time (which
corresponds to the "wait 3" in the config; if I change that wait, it gets
stuck for longer).

If I run with --threads=1, I never see the behavior, even with 20+ click
processes running.

I believe that I also see the same behavior with normal Queue instead of
ThreadSafeQueue.

I won't have much more time to spend on this for a while, though... the
last few bugs along these lines were quite hairy and required adding memory
barrier instructions (see commits 8f521a2d and 01f456ac for an idea).  This
bug could simply be that the new code is faster, so an existing race
condition is now more likely to show up, or it could be that a new race
condition was introduced.  If the test procedure that worked for me also
works for you, perhaps you can track it down to the commit that caused this
regression?
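
For what it's worth, the general shape of those earlier fixes was a
publish/consume ordering problem roughly like the sketch below.  This is
only an illustrative C++ sketch with made-up names (not the code from those
commits); the point is just that without the fences the data store and the
flag store can be reordered across cores:

struct Slot {
    int data;
    volatile bool ready;
};

void produce(Slot &s, int value) {
    s.data = value;
    __sync_synchronize();   // full barrier: publish the data...
    s.ready = true;         // ...before the flag another core will test
}

bool consume(Slot &s, int &out) {
    if (!s.ready)
        return false;       // nothing published yet
    __sync_synchronize();   // pair the barrier on the reading side
    out = s.data;
    s.ready = false;
    return true;
}

Without the barriers, a consumer on another core can observe ready == true
while still reading a stale value of data, which is exactly the kind of bug
that only shows up rarely and under load.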

Cliff

bug:~/co/click/build2$ ./userlevel/click -p 1234 --threads=2 ./foo.click
3020 false 19
2999 false 19
3000 false 19
3000 false 19
3000 false 19
3001 false 19
3000 false 19
-> now I run 3 more processes in another terminal
651 true 2326
5326 false 2328
3001 false 2328
3000 false 2328
868 true 2328
5127 false 2328
3001 false 2328
3000 false 2328
312 true 2688
2885 true 2803
5803 false 2805
214 true 2805
5766 false 2805
3001 false 2805
3001 false 2805


On Wed, Feb 9, 2011 at 5:22 AM, Beyers Cronje <bcronje at gmail.com> wrote:

> Hi Eddie,
>
> Some info that I'm sure will help debugging. I configured click
> with --enable-schedule-debugging=extra and also enabled
> NOTIFIERQUEUE_DEBUG.
>
> It seems Unqueue gets stuck in the pending list. See the output below:
>
> read q.length
> 1000
>
> read q.notifier_state
> notifier on
> task 0x19387b0 [uq :: Unqueue] scheduled
>
> read uq.scheduled
> true /* but pending */
>
> read uq.notifier
> empty.0/1:1*
>
>
> Unqueue stays in this scheduled-but-pending state for an undetermined
> period of time, sometimes up to minutes.
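>
> For reference, my understanding of the scheduling pattern involved is
> roughly the sketch below: a notifier-driven pull element returns false
> from run_task() whenever the upstream queue says it is empty, and then
> relies entirely on the queue's notifier to reschedule the task later.
> This is only a minimal illustration built on the standard Task/Notifier
> API (the element name is made up), not the actual Unqueue code:
>
> #include <click/config.h>
> #include <click/element.hh>
> #include <click/task.hh>
> #include <click/notifier.hh>
> #include <click/standard/scheduleinfo.hh>
>
> class MiniUnqueue : public Element { public:
>     MiniUnqueue() : _task(this) {}
>     const char *class_name() const { return "MiniUnqueue"; }
>     const char *port_count() const { return "1/1"; }
>     const char *processing() const { return PULL_TO_PUSH; }
>
>     int initialize(ErrorHandler *errh) {
>         // sleep whenever every upstream queue reports empty
>         _signal = Notifier::upstream_empty_signal(this, 0, &_task);
>         ScheduleInfo::initialize_task(this, &_task, errh);
>         return 0;
>     }
>
>     bool run_task(Task *) {
>         if (Packet *p = input(0).pull()) {
>             output(0).push(p);
>             _task.fast_reschedule();   // more work may be waiting
>             return true;
>         }
>         if (_signal)                   // queue claims non-empty: retry
>             _task.fast_reschedule();
>         return false;                  // otherwise sleep; the queue must
>     }                                  // wake the task via the notifier
>
>   private:
>     Task _task;
>     NotifierSignal _signal;
> };
>
> So if the cross-thread reschedule from the queue's notifier only ever
> lands on the pending list and is never moved onto the other thread's run
> queue, the element looks exactly like what I'm seeing: "scheduled"
> forever while the queue overflows.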
>
> Any idea where I can start on fixing this bug?
>
> Beyers
>
>
> On Wed, Feb 9, 2011 at 2:52 AM, Beyers Cronje <bcronje at gmail.com> wrote:
>
> > Update - Strange, after typing my previous email I checked again, and
> > all of a sudden Unqueue was pulling packets again. Not sure if me
> > breaking into Click with gdb kick-started it again :) or if it's an
> > intermittent issue.
> >
> >
> > On Wed, Feb 9, 2011 at 2:41 AM, Beyers Cronje <bcronje at gmail.com> wrote:
> >
> >> Hi Eddie,
> >>
> >> Since running this merge I've been experiencing issues with usermode
> >> multithreading. I'm using commit
> >> 9419098acbdc20837e37f3033c40661809431f8d.
> >> I do believe the issues are related to the changes in this merge, as I
> >> was running the same config on pre-coreperformance-merge code without
> >> any issues.
> >>
> >> *Simplified Config used:*
> >>
> >> fd::FromDAG(/dev/dag0) -> cl1::Classifier(12/0800) -> MarkIPHeader(14)
> >>   -> ipc1::IPClassifier(udp port 1646);
> >> sfp::SFP();
> >> q::ThreadSafeQueue;
> >> log::Logger(sfp, 1800);
> >>
> >> ipc1 -> RadAccounting -> q;
> >> q -> uq::Unqueue -> aupcc::Counter -> sfp;
> >>
> >> StaticThreadSched(fd 0, uq 1, log 2);
> >>
> >> *Problem Description:*
> >> The config uses three threads: FromDAG and Unqueue do what their names
> >> indicate, and Logger has a timer that schedules its task once every 30
> >> minutes. Everything runs fine initially, and then somewhere along the
> >> line Unqueue stops pulling packets from the queue, which leads to a
> >> constant queue overflow. When this happens I can see that FromDAG is
> >> still working, as q.drops increases constantly, and Logger also fires
> >> away every 30 minutes. Note that the push rate from FromDAG is quite
> >> high, but what RadAccounting pushes into the queue is very low,
> >> ~100pps, which means the queue is empty most of the time and Unqueue
> >> is usually not scheduled.
> >>
> >> Unfortunately I didn't configure debug scheduling, which would've
> >> helped. I did notice that Unqueue.scheduled always returned true, even
> >> though it never actually ran. Not sure if it will help, but gdb shows
> >> the following:
> >>
> >> (gdb) info threads
> >>   3 Thread 0x7fa422559700 (LWP 27205)  0x000000392cedb0b3 in poll ()
> >> from /lib64/libc.so.6
> >>   2 Thread 0x7fa421d58700 (LWP 27206)  0x000000392cedcee3 in select ()
> >> from /lib64/libc.so.6
> >> * 1 Thread 0x7fa432cfb740 (LWP 27197)  FromDAG::process_packet
> >> (this=0x1afc570, erf_record=0x7fa423258bf0, rlen=<value optimized out>)
> >> at ../elements/local/fromdag.cc:193
> >> (gdb) thread 2
> >> [Switching to thread 2 (Thread 0x7fa421d58700 (LWP 27206))]#0
> >>  0x000000392cedb0b3 in poll () from /lib64/libc.so.6
> >> (gdb) bt
> >> #0  0x000000392cedb0b3 in poll () from /lib64/libc.so.6
> >> #1  0x000000000058bc36 in Master::run_selects_poll (this=0x1afacc0,
> >> thread=0x1afb8b0, more_tasks=false) at ../lib/master.cc:862
> >> #2  0x000000000058c4ec in Master::run_selects (this=0x1afacc0,
> >> thread=0x1afb8b0) at ../lib/master.cc:1050
> >> #3  0x000000000057ddf1 in run_os (this=0x1afb8b0) at
> >> ../lib/routerthread.cc:447
> >> #4  RouterThread::driver (this=0x1afb8b0) at ../lib/routerthread.cc:568
> >> #5  0x0000000000556dc9 in thread_driver (user_data=<value optimized out>)
> >> at click.cc:414
> >> #6  0x000000392d206d5b in start_thread () from /lib64/libpthread.so.0
> >> #7  0x000000392cee4aad in clone () from /lib64/libc.so.6
> >> (gdb) thread 3
> >> [Switching to thread 3 (Thread 0x7fa422559700 (LWP 27205))]#0
> >>  0x000000392cedcee3 in select () from /lib64/libc.so.6
> >> (gdb) bt
> >> #0  0x000000392cedcee3 in select () from /lib64/libc.so.6
> >> #1  0x000000000058c4bf in Master::run_selects (this=0x1afacc0,
> >> thread=0x1afb7e0) at ../lib/master.cc:1015
> >> #2  0x000000000057ddf1 in run_os (this=0x1afb7e0) at
> >> ../lib/routerthread.cc:447
> >> #3  RouterThread::driver (this=0x1afb7e0) at ../lib/routerthread.cc:568
> >> #4  0x0000000000556dc9 in thread_driver (user_data=<value optimized out>)
> >> at click.cc:414
> >> #5  0x000000392d206d5b in start_thread () from /lib64/libpthread.so.0
> >> #6  0x000000392cee4aad in clone () from /lib64/libc.so.6
> >> (gdb) thread 1
> >> [Switching to thread 1 (Thread 0x7fa432cfb740 (LWP 27197))]#0
> >>  0x000000392d20ebfd in nanosleep () from /lib64/libpthread.so.0
> >> (gdb) bt
> >> #0  FromDAG::process_packet (this=0x1afc570, erf_record=0x7fa423258bf0,
> >> rlen=<value optimized out>) at ../elements/local/fromdag.cc:193
> >> #1  0x00000000004d6402 in FromDAG::run_task (this=0x1afc570) at
> >> ../elements/local/fromdag.cc:150
> >> #2  0x000000000057dbe6 in fire (this=0x1afb710) at
> >> ../include/click/task.hh:612
> >> #3  run_tasks (this=0x1afb710) at ../lib/routerthread.cc:410
> >> #4  RouterThread::driver (this=0x1afb710) at ../lib/routerthread.cc:600
> >> #5  0x0000000000558513 in main (argc=<value optimized out>, argv=<value
> >> optimized out>) at click.cc:639
> >>
> >> One thing to note: for various reasons I'm doing something very
> >> un-Click-like with FromDAG, where I allow it to block for up to 10ms.
> >> For my specific requirements this is not a problem, but it's worth
> >> noting in case it affects the way the new task handling operates.
> >>
> >> Beyers
> >>
> >> On Sun, Feb 6, 2011 at 1:23 AM, Eddie Kohler <kohler at cs.ucla.edu> wrote:
> >>
> >>> Hi all,
> >>>
> >>> This is just a note to say that I've merged the "coreperformance"
> >>> branch with master.  There are several changes that may speed up
> >>> particularly simple configurations, and that (more importantly) may
> >>> make it easier to experiment with different multithreading setups.
> >>> (For instance, I believe switching a task from one thread to another
> >>> is faster now.)  Let me know if you experience any problems.
> >>>
> >>> Eddie
> >>> _______________________________________________
> >>> click mailing list
> >>> click at amsterdam.lcs.mit.edu
> >>> https://amsterdam.lcs.mit.edu/mailman/listinfo/click
> >>>
> >>
> >>
> >

