[Click] click userland patches for large speed improvements

Sat Jul 2 05:48:30 EDT 2011

On Fri, Jul 01, 2011 at 11:13:45AM -0700, Eddie Kohler wrote:
> Hi Roman,
> 
> I believe Luigi generally tests on a FreeBSD version, and the FreeBSD malloc 
> appears particularly slow compared to current Linux versions.  But Luigi's 
> patches are AWESOME and will be integrated soon, with adjustments.  Any help 
> appreciated!!  THANKS LUIGI!!!
> 
> Luigi, have you done github?

correct, i am developing/testing on FreeBSD and click 1.8.0.

Will try github next week.

The 0.5Mpps figure is dominated by timestamps, which, with the default
configuration on freebsd, are incredibly expensive (800ns to run gettimeofday;
clock_gettime is faster, around 400ns, but click does not use it anyways).
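
For anyone who wants to check these per-call costs on their own box, a
rough sketch of a microbenchmark follows. It is not part of the patches;
the helper names (ns_per_call, cost_gettimeofday, cost_clock_gettime) are
made up for illustration, and the results depend heavily on the OS and on
which timecounter is configured.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <sys/time.h>
#include <time.h>

// Average cost in nanoseconds of one call to `fn`, measured over `iters` calls.
// Hypothetical helper, for illustration only.
template <typename Fn>
double ns_per_call(Fn fn, std::uint64_t iters) {
    assert(iters > 0);
    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iters; ++i)
        fn();
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iters);
}

// Per-call cost of gettimeofday().
inline double cost_gettimeofday(std::uint64_t iters) {
    return ns_per_call([] {
        struct timeval tv;
        gettimeofday(&tv, nullptr);
    }, iters);
}

// Per-call cost of clock_gettime() with CLOCK_REALTIME.
inline double cost_clock_gettime(std::uint64_t iters) {
    return ns_per_call([] {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
    }, iters);
}
```

Calling cost_gettimeofday(1000000) and cost_clock_gettime(1000000) and
printing the two values should be enough to see whether timestamps are
the bottleneck on a given system.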

I don't know how cheap access to timestamps is in linux, but certainly it
would make sense to have a run-time option to use, say, TSC or jiffies
as the internal representation (accessible without a syscall)
and convert to sec/usec/nsec only when the value is actually used.

Of course to use TSC you need constant_tsc, and for jiffies you need
an OS that exports the value in a readonly page to all userland threads.
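
A minimal sketch of the "store raw ticks, convert late" idea, assuming an
x86 CPU with constant_tsc. LazyTimestamp and read_ticks are hypothetical
names, not existing Click code, and the ticks_per_sec value has to come
from a calibration step that is not shown here.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

// Read a raw tick counter without a syscall where possible.
// On x86 this is the TSC; elsewhere it falls back to steady_clock
// (an assumption made only to keep the sketch portable).
inline std::uint64_t read_ticks() {
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Hypothetical type: keeps the raw tick count and defers the conversion
// to seconds/microseconds until a caller actually asks for it.
struct LazyTimestamp {
    std::uint64_t ticks;

    // Whole seconds; `ticks_per_sec` is the calibrated counter frequency.
    std::uint64_t sec(std::uint64_t ticks_per_sec) const {
        assert(ticks_per_sec > 0);
        return ticks / ticks_per_sec;
    }

    // Microseconds within the current second.
    std::uint64_t usec(std::uint64_t ticks_per_sec) const {
        assert(ticks_per_sec > 0);
        std::uint64_t rem = ticks % ticks_per_sec;
        return rem * 1000000ULL / ticks_per_sec;
    }
};
```

An element would then do something like LazyTimestamp t{read_ticks()} on
the fast path and call t.sec()/t.usec() only if the value is printed or
compared against wall-clock time.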

cheers
luigi

> Eddie
> 
> 
> On 07/01/2011 11:01 AM, Roman Chertov wrote:
> > For curiosity's sake, I just ran the script shown below on my machine and got
> > 3.34 Mpps.  I am using the latest Click from git (as of two days ago), and
> > running the following CPUs: Intel(R) Xeon(R) CPU  W3520  @ 2.67GHz
> >
> >
> > It seems that 0.5Mpps is pretty low for an i7-870 CPU, but it does appear that
> > the patches improved the performance significantly.
> >
> >
> > InfiniteSource ->  ctr::AverageCounter ->  Queue ->  Discard;
> >
> >
> > Script(
> >      wait 60,
> >      print ctr.count,
> >      print ctr.byte_count,
> > );
> >
> > Roman
> >
> > On Fri, 1 Jul 2011 19:47:13 +0200 Luigi Rizzo<rizzo at iet.unipi.it>  wrote
> >
> > >> If someone is interested in the performance of userland click, i'd suggest
> > >> the following two patches and looking at netmap (i already discussed
> > >> what follows with Eddie, and i am hoping someone more fluent than
> > >> me in C++ can polish the code and add support for thread-local lists).
> >>
> >> To get an idea of what you can get on a single core i7-870 CPU with
> >> the stock version and with these patches:
> >>
> >> 					1.8.0		With patches
> >>      InfiniteSource ->  Discard		515Kpps		18.56Mpps
> >>      InfiniteSource ->  Queue ->  Discard	500Kpps		13.41Mpps
> >>
> >> 					pcap		netmap
> >>      FromDevice->Queue->ToDevice		420Kpps		3.97 Mpps
> >>
> >>
> >> Click userland performance was never a priority given the high cost
> >> (until now) of packet I/O. But once packet i/o has become quite fast,
> > >> it turns out that there are two other big offenders:
> >> - the C++ memory allocator is quite expensive, and replacing it with
> >>    thread-local freelists (Packet objects and data buffers can be made
> >>    all with the same size) gives huge savings -- 100ns per packet or more
> >>    even on a fast machine;
> >>
> > >> - every time an element wants a timestamp, it makes a syscall (gettimeofday()
> >>    or similar) which consumes another 400-800ns per call. There are many
> >>    elements (e.g. InfiniteSource, Counter, etc.) which timestamp packets.
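
A rough sketch of the freelist idea from the first item above, assuming
all buffers have the same size. FixedFreeList is a hypothetical class
written for illustration, not the code in the attached patch, and like
the patch it is not thread safe on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical fixed-size buffer pool; not Click's actual allocator.
// Freed buffers are kept on a singly linked list and handed back on the
// next request, so the general-purpose allocator is hit only on a miss.
class FixedFreeList {
    struct Node { Node* next; };

public:
    explicit FixedFreeList(std::size_t buf_size)
        : buf_size_(buf_size < sizeof(Node) ? sizeof(Node) : buf_size),
          head_(nullptr) {}

    FixedFreeList(const FixedFreeList&) = delete;
    FixedFreeList& operator=(const FixedFreeList&) = delete;

    // Release every cached buffer back to the system allocator.
    ~FixedFreeList() {
        while (head_) {
            Node* next = head_->next;
            std::free(head_);
            head_ = next;
        }
    }

    // Pop a cached buffer if there is one, otherwise fall back to malloc.
    void* get() {
        if (head_) {
            Node* n = head_;
            head_ = n->next;
            return n;
        }
        return std::malloc(buf_size_);
    }

    // Return a buffer obtained from get(); it becomes the new list head.
    void put(void* p) {
        assert(p != nullptr);
        head_ = new (p) Node{head_};
    }

private:
    std::size_t buf_size_;
    Node* head_;
};
```

Declaring one pool per thread (for example thread_local FixedFreeList
pool(2048); where 2048 is just a placeholder size) avoids locking, as
long as a buffer is returned on the thread that took it.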
> >>
> >> Attached there are a couple of patches which address these problems:
> >>
> >> - patch-pcap	makes FromDevice and ToDevice use libpcap properly,
> >> 		supporting I/O in bursts to amortize the syscall overhead.
> >> 		This has been tested on FreeBSD.
> >>
> >> - patch-more
> > >>     + introduces a NOTS option for InfiniteSource to remove timestamps.
> > >>       This gives a 10x performance improvement in simple apps using
> > >>       InfiniteSource.
> >>
> > >>     + replaces the allocator for Packet and data buffers with local freelists;
> > >>       not thread safe, but thread safety is easy to add. This gives another
> > >>       1.5-2x speed improvement after the 10x gained removing timestamps;
> >>
> >>     + enables BURST operation in Discard, giving another 2x speed improvement
> >>
> > >> Using netmap instead of pcap is another big win: as you can see, the
> > >> forwarding performance of a simple FromDevice->Queue->ToDevice chain
> > >> goes up by 10x.
> >> You can find netmap at http://info.iet.unipi.it/~luigi/netmap/
> >>
> >> cheers
> >> luigi
> >
> >
> > _______________________________________________
> > click mailing list
> > click at amsterdam.lcs.mit.edu
> > https://amsterdam.lcs.mit.edu/mailman/listinfo/click