[Click] NotifierQueue on multi-threaded Click problem.
Cliff Frey
cliff at meraki.net
Fri Sep 7 16:12:44 EDT 2007
I think that bighashmap_arena and string need atomic stuff added as well...
but maybe this is already known.
I guess I've always looked at it as: the 20% performance hit is just the
price of being able to do other things in parallel... That's why I would
have expected simplequeue to use some atomic operations on head/tail and
support many parallel pushers/pullers. Otherwise it seems many people
will continue to run into issues here.
Cliff
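
The single-producer/single-consumer queue behavior discussed in this thread can be sketched with C++11 atomics (which postdate this 2007 thread; Click itself used volatile members and explicit barriers instead). The class and member names below are illustrative, not Click's actual SimpleQueue code:

```cpp
#include <atomic>
#include <cstddef>

// Minimal single-producer/single-consumer ring buffer sketch.
// _head is written by only the consumer and _tail by only the producer,
// so release/acquire ordering on them is enough; no locks or CAS needed.
template <typename T, size_t N>
class SpscQueue {
    T _q[N];
    std::atomic<size_t> _head{0};  // read index, written by consumer
    std::atomic<size_t> _tail{0};  // write index, written by producer
public:
    bool enq(T p) {
        size_t t = _tail.load(std::memory_order_relaxed);
        size_t next = (t + 1) % N;
        if (next == _head.load(std::memory_order_acquire))
            return false;                  // full (capacity is N - 1)
        _q[t] = p;                         // store the element first...
        _tail.store(next, std::memory_order_release);  // ...then publish it
        return true;
    }
    bool deq(T &p) {
        size_t h = _head.load(std::memory_order_relaxed);
        if (h == _tail.load(std::memory_order_acquire))
            return false;                  // empty
        p = _q[h];
        _head.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};
```

The release store to _tail in enq() plays exactly the role of the wmb() discussed later in the thread: it guarantees the element store is visible before the index update.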
On 9/7/07, Eddie Kohler <kohler at cs.ucla.edu> wrote:
>
> Beyers, Cliff,
>
> Cliff's diagnosis is at most partially right. On x86 architectures a write
> memory barrier is not necessary if the relevant variables are "volatile".
> Furthermore, I implemented a test element that exercises a Queue with
> pushing and pulling, and even *with* a memory barrier, I got an assertion
> failure.
>
> The assertion failure in my test happened because Packet reference counts
> are not atomic at user level. Packet::clone() incremented a reference
> count, and Packet::kill() decremented that count, but non-atomically.
> This seemed to lead to major problems. Fixing this problem, I could run
> my test for several hundred million iterations.
>
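
The refcount race Eddie describes can be sketched as a stand-alone fragment. This uses std::atomic rather than Click's atomic_uint32_t, and the names (UseCount, clone_ref, kill_ref) are hypothetical, chosen only to mirror the clone()/kill() pairing in the text:

```cpp
#include <atomic>

// Sketch of the race: clone() increments a shared use count and kill()
// decrements it. With a plain unsigned, the read-modify-write is not
// atomic and two threads can race; std::atomic fetch_add/fetch_sub
// (or Click's atomic_uint32_t) makes each update indivisible.
struct UseCount {
    std::atomic<unsigned> _use_count{1};

    void clone_ref() {
        _use_count.fetch_add(1, std::memory_order_relaxed);
    }

    // Returns true when the last reference is dropped, i.e. the caller
    // should free the packet. acq_rel ordering ensures the final
    // destruction is ordered after all prior accesses to the packet.
    bool kill_ref() {
        return _use_count.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};
```

This also illustrates the cost Eddie measured: every clone/kill becomes a locked read-modify-write instead of a plain increment, which is where the ~20% slowdown on a clone-heavy workload comes from.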
> Here is part of the commit message that fixes this problem, if
> --enable-multithread is supplied to "configure". (An earlier commit added
> "volatile"s.)
>
> =============
> (1) Packet use counts must be updated atomically. However, making packet
> use counts atomic_uint32_t reduces performance on a clone-heavy test by
> roughly 20% (before it took ~7.38s on my laptop to clone and kill 40M
> packets; after it took ~8.85s).
>
> (2) Router _running and _state must be volatile.
>
> Implement these fixes. However, because of the performance degradation,
> packet use counts are atomic at userlevel only if
> --enable-user-multithread
> was supplied to configure. (--enable-user-multithread also provides the
> "umultithread" requirement, which lets you compile elements conditionally
> based on user-level multithreading support.)
>
> Lots of aspects of user-level multithreading still need work. For example
> one thread can't reschedule tasks on another, since the relevant atomic
> operations are unimplemented. But this is a start.
> =============
>
> Eddie
>
>
>
> Beyers Cronje wrote:
> > Hi Cliff,
> >
> > On 9/7/07, *Cliff Frey* <cliff at meraki.net <mailto:cliff at meraki.net>>
> wrote:
> >
> > Do _head/_tail need to be declared volatile? I'm not clear on the
> > rules for C++ operation reordering... It seems like adding some sort
> > of barrier between these lines might be a good idea (and similarly
> > for the deq code, though there the assert() statement probably has
> > that effect as a side effect...):
> >
> > _q[_tail] = p;
> > _tail = next;
> >
> >
> >
> > You were right, a memory barrier in enq() fixed the problem!
> >
> > So the enq() code:
> > _q[_tail] = p;
> > _tail = next;
> > was actually executed as
> > _tail = next;
> > _q[_tail] = p;
> > Meanwhile the second thread read _tail through deq() in between those
> > two instructions, before _q[_tail] = p was executed.
> >
> > So the final enq() piece now looks like:
> > _q[_tail] = p;
> > __asm__ __volatile__ ("": : :"memory"); // Where is wmb() defined
> > in userland??
> > _tail = next;
> >
> > The above works on my i386 architecture; a more portable fix would
> > be a proper wmb() call.
> >
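
As an aside on the portable wmb() question: the inline asm above is only a GCC compiler barrier, which happens to suffice on x86 because x86 does not reorder stores with other stores. A portable equivalent can be sketched today with a C++11 release fence (not available when this thread was written; Click provided its own barrier macros instead):

```cpp
#include <atomic>

// Portable write-barrier sketch: a release fence prevents prior stores
// from being reordered past later stores, on any architecture, and also
// acts as a compiler barrier.
inline void wmb() {
    std::atomic_thread_fence(std::memory_order_release);
}

// Usage in enq() would then look like:
//   _q[_tail] = p;
//   wmb();          // ensure the store to _q is visible before _tail
//   _tail = next;
// The reader side needs a matching acquire fence (or acquire load of
// _tail) for the guarantee to hold portably.
```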
> > also, I think that I'm confused about the difference between enq/deq
> > and push/pull, but it might be worthwhile to add assert(p); to the
> > push method.
> >
> >
> > I'm referencing enq/deq directly as my element is not in the normal
> > Queue push/pull path. To relate them, pull calls deq() while push does
> > exactly what enq does. Not sure why push doesn't just call enq?
> >
> > And a side note that might be outside the scope of this discussion.
> >
> > If we wanted to get rid of the one-thread enq/deq constraint,
> > couldn't we do it without locks by using cmpxchg plus some other
> > mechanism to indicate when an operation has actually completed?
> >
> >
> > You can. There are a few methods to do this. From limited online
> > research the following algorithm seems to scale well:
> >
> > http://delivery.acm.org/10.1145/380000/378611/p134-tsigas.pdf?key1=378611&key2=4422219811&coll=portal&dl=ACM&CFID=28858073&CFTOKEN=82193380
> >
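
A very rough sketch of the cmpxchg idea follows. This is not the Tsigas-Zhang algorithm from the linked paper (which additionally handles multiple consumers and ABA problems); all names here are hypothetical. Each producer claims a slot with a compare-and-swap on _tail, then fills it:

```cpp
#include <atomic>
#include <cstddef>

// Multi-producer enqueue sketch via compare-exchange. A real algorithm
// must do more: a consumer can observe _tail advanced before the slot
// store lands, which is why published designs have consumers wait for
// the slot to become non-null and store nullptr back into emptied slots.
template <typename T, size_t N>
struct MpQueueSketch {
    std::atomic<T*> _q[N];
    std::atomic<size_t> _head{0};
    std::atomic<size_t> _tail{0};

    MpQueueSketch() {
        for (size_t i = 0; i < N; ++i)
            _q[i].store(nullptr);
    }

    bool enq(T *p) {
        size_t t = _tail.load();
        for (;;) {
            size_t next = (t + 1) % N;
            if (next == _head.load())
                return false;                        // full
            // On success, _tail moves t -> next and slot t is ours.
            // On failure, t is reloaded and we retry.
            if (_tail.compare_exchange_weak(t, next))
                break;
        }
        _q[t].store(p);                              // publish the element
        return true;
    }
};
```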
> > For interest's sake: using mutexes I got ~270k pps throughput. Using the
> > now working non-locking queue I'm getting ~340k pps.
> >
> > Beyers
>
More information about the click mailing list