[Click] NotifierQueue on multi-threaded Click problem.

Cliff Frey cliff at meraki.net
Fri Sep 7 16:12:44 EDT 2007


I think that bighashmap_arena and string need atomic operations added as
well... but maybe this is already known.

I guess I've always looked at it as the 20% performance hit being the
price of being able to do other things in parallel...  That's why I would
have expected simplequeue to use atomic operations on head/tail and
support many parallel pushers/pullers.  Otherwise it seems many people
will continue to have issues here.

Cliff

On 9/7/07, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>
> Beyers, Cliff,
>
> Cliff's diagnosis is at most partially right.  On x86 architectures a write
> memory barrier is not necessary if the relevant variables are "volatile".
> Furthermore, I implemented a test element that exercises a Queue with
> pushing
> and pulling, and even *with* a memory barrier, I got an assertion failure.
>
> The assertion failure in my test happened because Packet reference counts
> are
> not atomic at user level.  Packet::clone() incremented a reference count,
> and
> Packet::kill() decremented that count, but non-atomically.  This seemed to
> lead to major problems.  Fixing this problem, I could run my test for
> several
> hundred million iterations.
>
> Here is part of the commit message that fixes this problem, if
> --enable-multithread is supplied to "configure".  (An earlier commit added
> "volatile"s.)
>
> =============
> (1) Packet use counts must be updated atomically.  However, making packet
> use counts atomic_uint32_t reduces performance on a clone-heavy test by
> roughly 20% (before it took ~7.38s on my laptop to clone and kill 40M
> packets; after it took ~8.85s).
>
> (2) Router _running and _state must be volatile.
>
> Implement these fixes.  However, because of the performance degradation,
> packet use counts are atomic at userlevel only if
> --enable-user-multithread
> was supplied to configure.  (--enable-user-multithread also provides the
> "umultithread" requirement, which lets you compile elements conditionally
> based on user-level multithreading support.)
>
> Lots of aspects of user-level multithreading still need work.  For example
> one thread can't reschedule tasks on another, since the relevant atomic
> operations are unimplemented.  But this is a start.
> =============
>
> Eddie
>
>
>
> Beyers Cronje wrote:
> > Hi Cliff,
> >
> > On 9/7/07, *Cliff Frey* <cliff at meraki.net> wrote:
> >
> >     Do _head/_tail need to be declared volatile?  I'm not clear on the
> >     rules for C++ reordering operations...  It seems like adding some
> >     sort of barrier between these lines might be a good idea. (and the
> >     similar set for deq, but actually the assert() statement probably
> >     does that as a side effect...)
> >
> >         _q[_tail] = p;
> >         _tail = next;
> >
> >
> >
> > You were right, a memory barrier in enq() fixed the problem!
> >
> > So the enq() code:
> >     _q[_tail] = p;
> >     _tail = next;
> > was actually executed as
> >     _tail = next;
> >     _q[_tail] = p;
> > Meanwhile the second thread read _tail through deq() between those two
> > instructions, before _q[_tail] = p had executed.
> >
> > So the final enq() piece now looks like:
> >     _q[_tail] = p;
> >     __asm__ __volatile__ ("": : :"memory");  // Where is wmb() defined
> > in userland??
> >     _tail = next;
> >
> > The above works on my i386 architecture; a more portable fix would be a
> > proper wmb() call.
> >
> >     also, I think that I'm confused about the difference between enq/deq
> >     and push/pull, but it might be worthwhile to add assert(p); to the
> >     push method.
> >
> >
> > I'm referencing enq/deq directly as my element is not in the normal
> > Queue push/pull path. To relate them, pull calls deq() while push does
> > exactly what enq does. Not sure why push doesn't simply call enq?
> >
> >     And a side note that might be outside the scope of this discussion.
> >
> >     If we wanted to get rid of the one-thread enq/deq constraint,
> >     couldn't we do it without locks if we used cmpxchg and some other
> >     flag to indicate that threads had actually completed their
> >     operation?
> >
> >
> > You can. There are a few methods to do this. From limited online
> > research the following algorithm seems to scale well:
> >
> > http://delivery.acm.org/10.1145/380000/378611/p134-tsigas.pdf?key1=378611&key2=4422219811&coll=portal&dl=ACM&CFID=28858073&CFTOKEN=82193380
> >
> > For interest's sake, using mutexes I got ~270k pps throughput. Using the
> > now working non-locking queue I'm getting ~340k pps.
> >
> > Beyers
>

