[Click] click userland patches for large speed improvements

Fri Jul 1 14:13:45 EDT 2011

Hi Roman,

I believe Luigi generally tests on a FreeBSD version, and the FreeBSD malloc 
appears particularly slow compared to current Linux versions.  But Luigi's 
patches are AWESOME and will be integrated soon, with adjustments.  Any help 
appreciated!!  THANKS LUIGI!!!

Luigi, have you done github?

Eddie

On 07/01/2011 11:01 AM, Roman Chertov wrote:
> For curiosity's sake, I just ran the script shown below on my machine and got
> 3.34 Mpps.  I am using the latest Click from git (as of two days ago), and
> running the following CPU: Intel(R) Xeon(R) CPU  W3520  @ 2.67GHz
>
>
> It seems that 0.5Mpps is pretty low for an i7-870 CPU, but it does appear that
> the patches improved the performance significantly.
>
>
> InfiniteSource ->  ctr::AverageCounter ->  Queue ->  Discard;
>
>
> Script(
>      wait 60,
>      print ctr.count,
>      print ctr.byte_count,
> );
>
> Roman
>
> On Fri, 1 Jul 2011 19:47:13 +0200 Luigi Rizzo<rizzo at iet.unipi.it>  wrote
>
>> If someone is interested in the performance of userland click, i'd suggest
>> the following two patches and looking at netmap (i already discussed
>> what follows with Eddie, and i am hoping someone more fluent than
>> me in C++ can polish the code and add support for thread-local lists).
>>
>> To get an idea of what you can get on a single core i7-870 CPU with
>> the stock version and with these patches:
>>
>>                                           1.8.0      With patches
>>      InfiniteSource -> Discard            515 Kpps   18.56 Mpps
>>      InfiniteSource -> Queue -> Discard   500 Kpps   13.41 Mpps
>>
>>                                           pcap       netmap
>>      FromDevice -> Queue -> ToDevice      420 Kpps   3.97 Mpps
>>
>>
>> Click userland performance was never a priority given the high cost
>> (until now) of packet I/O. But once packet I/O has become quite fast,
>> it turns out that there are two other big offenders:
>> - the C++ memory allocator is quite expensive, and replacing it with
>>    thread-local freelists (Packet objects and data buffers can be made
>>    all with the same size) gives huge savings -- 100ns per packet or more
>>    even on a fast machine;
>>
>> - every time an element wants a timestamp, it makes a syscall (gettimeofday()
>>    or similar) which consumes another 400-800ns per call. There are many
>>    elements (e.g. InfiniteSource, Counter, etc.) which timestamp packets.
>>
>> Attached are a couple of patches which address these problems:
>>
>> - patch-pcap	makes FromDevice and ToDevice use libpcap properly,
>> 		supporting I/O in bursts to amortize the syscall overhead.
>> 		This has been tested on FreeBSD.
>>
>> - patch-more
>>     + introduces a NOTS option for InfiniteSource to remove timestamps.
>>       This gives a 10x performance improvement in simple apps using
>>       InfiniteSource;
>>
>>     + replaces the allocator for Packet and data buffers with local freelists;
>>       not thread safe, but thread safety is easy to add. This gives another
>>       1.5-2x speed improvement after the 10x gained removing timestamps;
>>
>>     + enables BURST operation in Discard, giving another 2x speed improvement.
>>
>> Using netmap instead of pcap is another big win: as you can see, the
>> forwarding performance of a simple FromDevice->Queue->ToDevice chain goes
>> up by 10x. You can find netmap at http://info.iet.unipi.it/~luigi/netmap/
>>
>> cheers
>> luigi
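
For readers following the thread, here is a minimal sketch of the fixed-size
freelist idea described above. It is not the code from patch-more: the class
name BufferPool, the 2048-byte buffer size, and the alloc/release/cached
names are all hypothetical, chosen only for illustration. It assumes a C++11
compiler and uses a thread_local list head, so each thread recycles its own
buffers without locking.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical fixed-size buffer pool (illustrative names, not Click's API).
// Freed buffers are chained into a per-thread singly linked list and handed
// back on the next alloc(), so the common path never calls malloc().
class BufferPool {
  public:
    // Assumed size: every buffer is the same size, as the message suggests.
    static constexpr std::size_t BUF_SIZE = 2048;

    static void *alloc() {
        Node *n = head();
        if (n) {                          // fast path: pop from the freelist
            head() = n->next;
            --count();
            return n;
        }
        void *p = std::malloc(BUF_SIZE);  // slow path: list is empty
        if (!p)
            throw std::bad_alloc();
        return p;
    }

    static void release(void *p) {
        assert(p != nullptr);
        // Reuse the first bytes of the dead buffer as the list link.
        Node *n = static_cast<Node *>(p);
        n->next = head();
        head() = n;
        ++count();
    }

    // Number of buffers currently cached by the calling thread.
    static std::size_t cached() { return count(); }

  private:
    struct Node { Node *next; };

    // Function-local thread_local statics: one list head and one counter
    // per thread, created on first use.
    static Node *&head() { thread_local Node *h = nullptr; return h; }
    static std::size_t &count() { thread_local std::size_t c = 0; return c; }
};
```

A buffer released on one thread lands in that thread's list, so a
configuration that allocates on one thread and frees on another would grow
one list without bound; real code would need a cap on the list length or
some way to hand buffers back.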