[Click] NotifierQueue on multi-threaded Click problem.

Eddie Kohler kohler at cs.ucla.edu
Fri Sep 7 14:50:50 EDT 2007


Beyers, Cliff,

Cliff's diagnosis is at most partially right.  On x86 architectures a write 
memory barrier is not necessary if the relevant variables are "volatile". 
Furthermore, I implemented a test element that exercises a Queue with pushing 
and pulling, and even *with* a memory barrier, I got an assertion failure.
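
To make the x86 point concrete, here is a minimal single-producer/single-consumer
sketch.  The names echo the Queue internals discussed in the quoted thread below,
but this is a simplified stand-in, not the actual Click code:

    #include <stdint.h>

    class Packet;

    // Both the index variables and the slot array are volatile, so the
    // compiler may not reorder the two stores in enq() relative to each
    // other; x86 hardware does not reorder stores either, so no explicit
    // wmb() is needed between them.
    static const uint32_t CAPACITY = 1024;
    static Packet *volatile _q[CAPACITY];
    static volatile uint32_t _head = 0, _tail = 0;

    static bool enq(Packet *p)
    {
        uint32_t next = (_tail + 1) % CAPACITY;
        if (next == _head)          // full: the consumer has not caught up
            return false;
        _q[_tail] = p;              // 1. publish the packet
        _tail = next;               // 2. only then advance the producer index
        return true;
    }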

The assertion failure in my test happened because Packet reference counts are 
not atomic at user level.  Packet::clone() incremented a reference count, and 
Packet::kill() decremented that count, but neither operation was atomic.  This 
seemed to lead to major problems.  After fixing it, I could run my test for 
several hundred million iterations.
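
For concreteness, here is a minimal sketch of that difference, using a
hypothetical PacketSketch stand-in and GCC's __sync builtins (Click's
atomic_uint32_t wraps equivalent architecture-specific operations):

    #include <stdint.h>

    struct PacketSketch {
        volatile uint32_t _use_count;

        // Non-atomic: two threads running clone()/kill() concurrently can
        // both read the same value, and one update is silently lost,
        // corrupting the reference count.
        void ref_plain()    { _use_count = _use_count + 1; }

        // Atomic read-modify-write: the increment cannot be lost.
        void ref_atomic()   { __sync_fetch_and_add(&_use_count, 1); }

        // Returns true when the last reference is gone and the buffer
        // can actually be freed.
        bool deref_atomic() { return __sync_sub_and_fetch(&_use_count, 1) == 0; }
    };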

Here is part of the commit message that fixes this problem, if 
--enable-user-multithread is supplied to "configure".  (An earlier commit 
added "volatile"s.)

=============
(1) Packet use counts must be updated atomically.  However, making packet
use counts atomic_uint32_t reduces performance on a clone-heavy test by
roughly 20% (before it took ~7.38s on my laptop to clone and kill 40M
packets; after it took ~8.85s).

(2) Router _running and _state must be volatile.

Implement these fixes.  However, because of the performance degradation,
packet use counts are atomic at userlevel only if --enable-user-multithread
was supplied to configure.  (--enable-user-multithread also provides the
"umultithread" requirement, which lets you compile elements conditionally
based on user-level multithreading support.)

Lots of aspects of user-level multithreading still need work.  For example
one thread can't reschedule tasks on another, since the relevant atomic
operations are unimplemented.  But this is a start.
=============
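
A rough sketch of the shape of that conditional fix follows.  The macro name
HAVE_USER_MULTITHREAD and the typedef are guesses for illustration; check the
generated config header for the actual symbol that --enable-user-multithread
defines.

    #include <click/atomic.hh>      // Click's atomic_uint32_t (path assumed)

    // Hypothetical illustration only: pick an atomic counter when user-level
    // multithreading was configured, and a plain counter otherwise, so the
    // ~20% clone/kill overhead is only paid when it is needed.
    #if defined(CLICK_USERLEVEL) && defined(HAVE_USER_MULTITHREAD)
    typedef atomic_uint32_t packet_use_count_t;
    #else
    typedef uint32_t packet_use_count_t;
    #endif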

Eddie



Beyers Cronje wrote:
> Hi Cliff,
> 
> On 9/7/07, Cliff Frey <cliff at meraki.net> wrote:
> 
>     Do _head/_tail need to be declared volatile?  I'm not clear on the
>     rules for C++ reordering operations...  It seems like adding some
>     sort of barrier between these lines might be a good idea (and the
>     similar set for deq, but actually the assert() statement probably
>     does that as a side effect...)
> 
>         _q[_tail] = p;
>         _tail = next;
> 
> 
> 
> You were right, a memory barrier in enq() fixed the problem!
> 
> So the enq() code:
>     _q[_tail] = p;
>     _tail = next;
> was actually executed as
>     _tail = next;
>     _q[_tail] = p;
> Meanwhile the second thread read _tail through deq() in between those two 
> stores, before _q[_tail] = p had taken effect.
> 
> So the final enq() piece now looks like:
>     _q[_tail] = p;
>     __asm__ __volatile__ ("": : :"memory");  // Where is wmb() defined in userland??
>     _tail = next;
> 
> The above works on my i386 architecture; a more portable fix would be a 
> proper wmb() call.
> 
>     also, I think that I'm confused about the difference between enq/deq
>     and push/pull, but it might be worthwhile to add assert(p); to the
>     push method. 
> 
> 
> I'm referencing enq/deq directly as my element is not in the normal 
> Queue push/pull path. To relate them, pull calls deq() while push does 
> exactly what enq does. Not sure why push doesn't just call enq?
> 
>     And a side note that might be outside the scope of this discussion.
> 
>     If we wanted to get rid of the one-thread enq/deq constraint,
>     couldn't we do it without locks if we used cmpxchg and some other
>     signal that an operation had actually completed?
> 
> 
> You can. There are a few methods to do this. From limited online 
> research the following algorithm seems to scale well:
> http://delivery.acm.org/10.1145/380000/378611/p134-tsigas.pdf?key1=378611&key2=4422219811&coll=portal&dl=ACM&CFID=28858073&CFTOKEN=82193380
> 
> For interest's sake, using mutexes I got ~270k pps throughput. Using the 
> now-working non-locking queue I'm getting ~340k pps.
> 
> Beyers

