[Click] NotifierQueue on multi-threaded Click problem.
Eddie Kohler
kohler at cs.ucla.edu
Sat Sep 8 11:45:34 EDT 2007
Thanks!
E
Beyers Cronje wrote:
> Hi Eddie,
>
> Thank you so much for this! I just tested it and it's working 100% so
> far. I did pick up a small bug in one of your recent updates, patch
> provided below.
>
> Kind regards
>
> Beyers
>
> --- /home/bcronje/string.cc 2007-09-08 01:21: 40.000000000 +0200
> +++ lib/string.cc 2007-09-08 01:19:39.000000000 +0200
> @@ -87,7 +87,7 @@
> String::Memo::Memo(int dirty, int capacity)
> : _capacity(capacity), _real_data((char *) CLICK_LALLOC(capacity))
> {
> - assert(_capacity >= _dirty);
> + assert(_capacity >= dirty);
> _refcount = 1;
> _dirty = dirty;
> }
>
>
>
>
>
> On 9/7/07, * Eddie Kohler* <kohler at cs.ucla.edu
> <mailto:kohler at cs.ucla.edu>> wrote:
>
> Beyers, Cliff,
>
> Cliff's diagnosis is at most partially right. On x86 architectures
> a write
> memory barrier is not necessary, if the relevant variables are
> "volatile".
> Furthermore, I implemented a test element that exercises a Queue
> with pushing
> and pulling, and even *with* a memory barrier, I got an assertion
> failure.
>
> The assertion failure in my test happened because Packet reference
> counts are
> not atomic at user level. Packet::clone() incremented a reference
> count, and
> Packet::kill() decremented that count, but non-atomically. This
> seemed to
> lead to major problems. Fixing this problem, I could run my test
> for several
> hundred million iterations.
>
> Here is part of the commit message that fixes this problem, if
> --enable-multithread is supplied to "configure". (An earlier commit
> added
> "volatile"s.)
>
> =============
> (1) Packet use counts must be updated atomically. However, making
> packet
> use counts atomic_uint32_t reduces performance on a clone-heavy test by
> roughly 20% (before it took ~7.38s on my laptop to clone and kill 40M
> packets; after it took ~8.85s).
>
> (2) Router _running and _state must be volatile.
>
> Implement these fixes. However, because of the performance
> degradation,
> packet use counts are atomic at userlevel only if
> --enable-user-multithread
> was supplied to configure. (--enable-user-multithread also provides the
> "umultithread" requirement, which lets you compile elements
> conditionally
> based on user-level multithreading support.)
>
> Lots of aspects of user-level multithreading still need work. For
> example
> one thread can't reschedule tasks on another, since the relevant atomic
> operations are unimplemented. But this is a start.
> =============
>
> Eddie
>
>
>
> Beyers Cronje wrote:
> > Hi Cliff,
> >
> > On 9/7/07, *Cliff Frey* <cliff at meraki.net
> <mailto:cliff at meraki.net> <mailto: cliff at meraki.net
> <mailto:cliff at meraki.net>>> wrote:
> >
> > Do _head/_tail need to be declared volatile? I'm not clear
> on the
> > rules for c++ reordering operations... It like adding some
> sort of
> > barrier between these lines might be a good idea. (and the
> similar
> > set for deq, but actually the assert() statement probably
> does that
> > as a side effect...)
> >
> > _q[_tail] = p;
> > _tail = next;
> >
> >
> >
> > You were right, a memory barrier in enq() fixed the problem!
> >
> > So the enq() code:
> > _q[_tail] = p;
> > _tail = next;
> > was actually executed as
> > _tail = next;
> > _q[_tail] = p;
> > While the second thread used _tail through deq() in between the
> above
> > two instructions, before _q[_tail] = p.
> >
> > So the final enq() piece now looks like:
> > _q[_tail] = p;
> > __asm__ __volatile__ ("": : :"memory"); // Where is wmb()
> defined
> > in userland??
> > _tail = next;
> >
> > The above works on my i386 architecture, a more portable fix will
> be a
> > proper wmb() call.
> >
> > also, I think that I'm confused about the difference between
> enq/deq
> > and push/pull, but it might be worthwhile to add assert(p);
> to the
> > push method.
> >
> >
> > I'm referencing enq/deq directly as my element is not in the normal
> > Queue push/pull path. To relate them, pull calls deq() while
> push does
> > exactly what enq does. Not sure why push just doesnt just call enq ?
> >
> > And a side note that might be outside the scope of this
> discussion.
> >
> > If we wanted to get rid of the one thread enq/deq constraint,
> > couldn't we do it without locks if we used cmpxchg and some other
> > factor to dictate that things were actually done their operation.
> >
> >
> > You can. There are a few methods to do this. From limited online
> > research the following algorithm seems to scale well:
> >
> http://delivery.acm.org/10.1145/380000/378611/p134-tsigas.pdf?key1=378611&key2=4422219811&coll=portal&dl=ACM&CFID=28858073&CFTOKEN=82193380
> <http://delivery.acm.org/10.1145/380000/378611/p134-tsigas.pdf?key1=378611&key2=4422219811&coll=portal&dl=ACM&CFID=28858073&CFTOKEN=82193380>
> > <
> http://delivery.acm.org/10.1145/380000/378611/p134-tsigas.pdf?key1=378611&key2=4422219811&coll=portal&dl=ACM&CFID=28858073&CFTOKEN=82193380
> <http://delivery.acm.org/10.1145/380000/378611/p134-tsigas.pdf?key1=378611&key2=4422219811&coll=portal&dl=ACM&CFID=28858073&CFTOKEN=82193380>>
> >
> > For interest sake, using mutexes I got ~270k pps throughput.
> Using the
> > now working non-locking queue I'm getting ~340k pps.
> >
> > Beyers
>
>
More information about the click
mailing list