[Click] NotifierQueue on multi-threaded Click problem.

Eddie Kohler kohler at cs.ucla.edu
Fri Sep 7 16:40:21 EDT 2007


Rant:

For a good time, take a look at the list of authors on the SMP Click paper. 
Oh gee, I'm not there.  And Mazu does not use SMP Click.  All I do is for some 
godforsaken unknown reason maintain the detritus of that paper and improve it 
gradually over time.  And it WAS detritus; making it halfway sensible took 
forever, and happened after the paper was published.  (Tags 
click_smp_merge_[1-10] are visible in the git repository.)

My role is to prevent sensible, i.e., non-SMP, configurations from starting to 
suck because people want to vaguely hack non-sensible, i.e., SMP, 
configurations with expensive locks and atomic operations and god knows what 
until they work for them on one test.  It's going to improve, but it's going 
to improve in the right way.

Feel free to contact Robert or Benjie and ask them why they did not do this 
right in the first place.  Feel free also to send me patches to make 
SMP/user-multithread Click work better -- but expect discussion.  Feel free to 
start a public git repository called click-fewer-synchronization-bugs.  And 
feel free to run Click S I N G L E - T H R E A D E D.



Cliff Frey wrote:
> I think that bighashmap_arena and string need atomic stuff added as 
> well... but maybe this is already known.
> 
> I guess I've always looked at it as the 20% performance hit just being 
> the price of being able to do other things in parallel...  That's why I 
> would have expected simplequeue to use some atomic operations on 
> head/tail and support many parallel pushers/pullers.  Otherwise, it seems 
> as though many, many people will continue to have issues here.
> 
> Cliff
> 
> On 9/7/07, Eddie Kohler <kohler at cs.ucla.edu> wrote:
> 
>     Beyers, Cliff,
> 
>     Cliff's diagnosis is at most partially right.  On x86 architectures a
>     write memory barrier is not necessary if the relevant variables are
>     "volatile".  Furthermore, I implemented a test element that exercises
>     a Queue with pushing and pulling, and even *with* a memory barrier, I
>     got an assertion failure.
> 
>     The assertion failure in my test happened because Packet reference
>     counts are not atomic at user level.  Packet::clone() incremented a
>     reference count, and Packet::kill() decremented that count, but
>     non-atomically.  This seemed to lead to major problems.  After fixing
>     this problem, I could run my test for several hundred million
>     iterations.
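> 
>     (For illustration, the race and the fix look roughly like this -- a
>     sketch with hypothetical names and GCC __sync builtins, not the
>     actual Packet code or Click's atomic_uint32_t:)
> 
>         #include <stdint.h>
> 
>         struct CountedPacket {
>             volatile uint32_t _use_count;
>             CountedPacket() : _use_count(1) {}
> 
>             // Non-atomic: compiles to load, add, store.  Two threads can
>             // both load the same old value, so one update is lost and
>             // the use count ends up wrong.
>             void ref_racy()   { _use_count++; }
> 
>             // Atomic read-modify-write variant for user level.
>             void ref()        { __sync_add_and_fetch(&_use_count, 1); }
>             // Returns true when the caller dropped the last reference.
>             bool unref_last() { return __sync_sub_and_fetch(&_use_count, 1) == 0; }
>         };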
> 
>     Here is part of the commit message for the change that fixes this
>     problem when --enable-user-multithread is supplied to "configure".
>     (An earlier commit added the "volatile"s.)
> 
>     =============
>     (1) Packet use counts must be updated atomically.  However, making
>     packet use counts atomic_uint32_t reduces performance on a
>     clone-heavy test by roughly 20% (before it took ~7.38s on my laptop
>     to clone and kill 40M packets; after it took ~8.85s).
> 
>     (2) Router _running and _state must be volatile.
> 
>     Implement these fixes.  However, because of the performance
>     degradation, packet use counts are atomic at userlevel only if
>     --enable-user-multithread was supplied to configure.
>     (--enable-user-multithread also provides the "umultithread"
>     requirement, which lets you compile elements conditionally based on
>     user-level multithreading support.)
> 
>     Lots of aspects of user-level multithreading still need work.  For
>     example, one thread can't reschedule tasks on another, since the
>     relevant atomic operations are unimplemented.  But this is a start.
>     =============
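> 
>     (As a sketch of how the "umultithread" requirement mentioned above
>     can be used -- the element name here is hypothetical:)
> 
>         // At the bottom of a (hypothetical) element's .cc file: only
>         // build and load this element when user-level multithreading
>         // support was configured in.
>         CLICK_ENDDECLS
>         ELEMENT_REQUIRES(umultithread)
>         EXPORT_ELEMENT(MyThreadSafeElement)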
> 
>     Eddie
> 
> 
> 
>     Beyers Cronje wrote:
>      > Hi Cliff,
>      >
>      > On 9/7/07, Cliff Frey <cliff at meraki.net> wrote:
>      >
>      >     Do _head/_tail need to be declared volatile?  I'm not clear
>      >     on the rules for C++ reordering operations...  It seems like
>      >     adding some sort of barrier between these lines might be a
>      >     good idea. (And the similar set for deq, but actually the
>      >     assert() statement probably does that as a side effect...)
>      >
>      >         _q[_tail] = p;
>      >         _tail = next;
>      >
>      >
>      >
>      > You were right, a memory barrier in enq() fixed the problem!
>      >
>      > So the enq() code:
>      >     _q[_tail] = p;
>      >     _tail = next;
>      > was actually executed as
>      >     _tail = next;
>      >     _q[_tail] = p;
>      > Meanwhile, the second thread used _tail through deq() in between
>      > the above two instructions, before _q[_tail] = p had executed.
>      >
>      > So the final enq() piece now looks like:
>      >     _q[_tail] = p;
>      >     // where is wmb() defined in userland??
>      >     __asm__ __volatile__ ("" : : : "memory");
>      >     _tail = next;
>      >
>      > The above works on my i386 architecture; a more portable fix would
>      > be a proper wmb() call.
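>      >
>      > (One possible portable definition, as a sketch -- it assumes GCC
>      > >= 4.1 builtins, and it is not an official Click macro:)
>      >
>      >     #if defined(__i386__) || defined(__x86_64__)
>      >     /* x86 keeps ordinary stores in program order, so a compiler
>      >        barrier is enough to keep _q[_tail] = p ahead of _tail = next. */
>      >     # define wmb() __asm__ __volatile__("" : : : "memory")
>      >     #else
>      >     /* elsewhere, fall back to a full hardware barrier */
>      >     # define wmb() __sync_synchronize()
>      >     #endif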
>      >
>      >     Also, I think that I'm confused about the difference between
>      >     enq/deq and push/pull, but it might be worthwhile to add
>      >     assert(p); to the push method.
>      >
>      >
>      > I'm referencing enq/deq directly as my element is not in the normal
>      > Queue push/pull path. To relate them, pull calls deq() while push
>      > does exactly what enq does. Not sure why push doesn't just call enq?
>      >
>      >     And a side note that might be outside the scope of this
>      >     discussion.
>      >
>      >     If we wanted to get rid of the one-thread enq/deq constraint,
>      >     couldn't we do it without locks if we used cmpxchg and some
>      >     other indicator that an operation had actually completed?
>      >
>      >
>      > You can. There are a few methods to do this. From limited online
>      > research the following algorithm seems to scale well:
>      >
>      > http://delivery.acm.org/10.1145/380000/378611/p134-tsigas.pdf?key1=378611&key2=4422219811&coll=portal&dl=ACM&CFID=28858073&CFTOKEN=82193380
>      >
>      > For interest's sake, using mutexes I got ~270k pps throughput.
>      > Using the now-working non-locking queue I'm getting ~340k pps.
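>      >
>      > (To illustrate the cmpxchg idea at user level, here is a sketch of
>      > just the slot-claiming step, using GCC builtins.  A complete
>      > lock-free queue, like the one in the paper above, also has to
>      > handle publishing the stored pointer and empty/full detection:)
>      >
>      >     #include <stdint.h>
>      >
>      >     // Several producers race to advance 'tail'; compare-and-swap
>      >     // guarantees each winner gets a distinct slot index.
>      >     static inline uint32_t claim_slot(volatile uint32_t *tail,
>      >                                       uint32_t capacity)
>      >     {
>      >         while (1) {
>      >             uint32_t t = *tail;
>      >             uint32_t next = (t + 1) % capacity;
>      >             if (__sync_bool_compare_and_swap(tail, t, next))
>      >                 return t;   // this thread now owns slot t
>      >             // another producer won the race; retry with the new tail
>      >         }
>      >     }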
>      >
>      > Beyers
> 
> 

