[Click] NotifierQueue on multi-threaded Click problem.

Eddie Kohler kohler at cs.ucla.edu
Sat Sep 8 11:45:34 EDT 2007


Thanks!

E


Beyers Cronje wrote:
> Hi Eddie,
> 
> Thank you so much for this! I just tested it and it's working 100% so 
> far. I did pick up a small bug in one of your recent updates, patch 
> provided below.
> 
> Kind regards
> 
> Beyers
> 
> --- /home/bcronje/string.cc     2007-09-08 01:21:40.000000000 +0200
> +++ lib/string.cc       2007-09-08 01:19:39.000000000 +0200
> @@ -87,7 +87,7 @@
>  String::Memo::Memo(int dirty, int capacity)
>    : _capacity(capacity), _real_data((char *) CLICK_LALLOC(capacity))
>  {
> -    assert(_capacity >= _dirty);
> +    assert(_capacity >= dirty);
>      _refcount = 1;
>      _dirty = dirty;
>  }
> 
> 
> 
> 
> 
> On 9/7/07, Eddie Kohler <kohler at cs.ucla.edu> wrote:
> 
>     Beyers, Cliff,
> 
>     Cliff's diagnosis is at most partially right.  On x86 architectures a
>     write memory barrier is not necessary if the relevant variables are
>     "volatile".  Furthermore, I implemented a test element that exercises
>     a Queue with pushing and pulling, and even *with* a memory barrier, I
>     got an assertion failure.
> 
>     The assertion failure in my test happened because Packet reference
>     counts are not atomic at user level.  Packet::clone() incremented a
>     reference count, and Packet::kill() decremented that count, but
>     non-atomically.  This seemed to lead to major problems.  Fixing this
>     problem, I could run my test for several hundred million iterations.
> 
>     Here is part of the commit message that fixes this problem, if
>     --enable-multithread is supplied to "configure".  (An earlier
>     commit added "volatile"s.)
> 
>     =============
>     (1) Packet use counts must be updated atomically.  However, making
>     packet use counts atomic_uint32_t reduces performance on a clone-heavy
>     test by roughly 20% (before it took ~7.38s on my laptop to clone and
>     kill 40M packets; after it took ~8.85s).
> 
>     (2) Router _running and _state must be volatile.
> 
>     Implement these fixes.  However, because of the performance
>     degradation, packet use counts are atomic at userlevel only if
>     --enable-user-multithread was supplied to configure.
>     (--enable-user-multithread also provides the "umultithread"
>     requirement, which lets you compile elements conditionally based on
>     user-level multithreading support.)
> 
>     Lots of aspects of user-level multithreading still need work.  For
>     example, one thread can't reschedule tasks on another, since the
>     relevant atomic operations are unimplemented.  But this is a start.
>     =============
> 
>     Eddie
> 
> 
> 
>     Beyers Cronje wrote:
>      > Hi Cliff,
>      >
>      > On 9/7/07, Cliff Frey <cliff at meraki.net> wrote:
>      >
>      >     Do _head/_tail need to be declared volatile?  I'm not clear
>      >     on C++'s rules for reordering operations...  It seems like
>      >     adding some sort of barrier between these lines might be a
>      >     good idea (and similarly for the deq pair, but actually the
>      >     assert() statement probably does that as a side effect...)
>      >
>      >         _q[_tail] = p;
>      >         _tail = next;
>      >
>      >
>      >
>      > You were right, a memory barrier in enq() fixed the problem!
>      >
>      > So the enq() code:
>      >     _q[_tail] = p;
>      >     _tail = next;
>      > was actually executed as
>      >     _tail = next;
>      >     _q[_tail] = p;
>      > The second thread then read _tail through deq() in between the
>      > above two instructions, before _q[_tail] = p had executed.
>      >
>      > So the final enq() piece now looks like:
>      >     _q[_tail] = p;
>      >     __asm__ __volatile__ ("" : : : "memory");  // where is wmb() defined in userland??
>      >     _tail = next;
>      >
>      > The above works on my i386 architecture; a more portable fix
>      > would be a proper wmb() call.
>      >
>      >     also, I think that I'm confused about the difference between
>      >     enq/deq and push/pull, but it might be worthwhile to add
>      >     assert(p); to the push method.
>      >
>      >
>      > I'm referencing enq/deq directly, as my element is not in the
>      > normal Queue push/pull path.  To relate them: pull() calls deq(),
>      > while push() does exactly what enq() does.  Not sure why push()
>      > doesn't just call enq()?
>      >
>      >     And a side note that might be outside the scope of this
>      >     discussion.
>      >
>      >     If we wanted to get rid of the one-thread enq/deq constraint,
>      >     couldn't we do it without locks, using cmpxchg plus some other
>      >     mechanism to indicate that an operation had actually completed?
>      >
>      >
>      > You can.  There are a few methods to do this.  From limited
>      > online research, the following algorithm seems to scale well:
>      >
>      > http://delivery.acm.org/10.1145/380000/378611/p134-tsigas.pdf?key1=378611&key2=4422219811&coll=portal&dl=ACM&CFID=28858073&CFTOKEN=82193380
>      >
>      > For interest's sake: using mutexes I got ~270k pps throughput;
>      > using the now-working non-locking queue I'm getting ~340k pps.
>      >
>      > Beyers
> 
> 


