[Click] Click-git: Kernel crash w/ Queue element overflow?

Nuutti Varis nvaris at cc.hut.fi
Thu Feb 11 10:44:48 EST 2010


Hey,

As predicted, the commit does not change anything in this case: 2.6.31.12 warns and 2.6.24.7 crashes, with traces identical to before. The ToDevice test cases do not cause any crashes. The best I could get were warnings on 2.6.31.12 (attached) and nothing out of the ordinary on 2.6.24.7 (manual patch). These results are from the "InfiniteSource -> [Queue ->] ToDevice" case (single device).
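
For reference, here is a rough sketch of what the single-device test with a Queue looks like once the data placeholder is filled in. The payload bytes are an illustrative assumption (a small 54-byte UDP-over-IP Ethernet frame with made-up addresses), not necessarily the data used in the runs above, and eth0 is likewise assumed:

```click
// Sketch only: the frame contents and the device name are assumptions.
// 14-byte Ethernet header, 20-byte IP header, 8-byte UDP header,
// 12-byte payload ("UDP packet!\n").
InfiniteSource(DATA \<00 00 c0 ae 67 ef  00 00 00 00 00 00  08 00
45 00 00 28  00 00 00 00  40 11 77 c3  01 00 00 01
02 00 00 02  13 69 13 69  00 14 d6 41  55 44 50 20
70 61 63 6b  65 74 21 0a>)
  -> Queue
  -> ToDevice(eth0);
```

Dropping the "-> Queue" line gives the variant without a Queue.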

However, I did find something interesting. Apparently, the newer e1000e driver in 2.6.31.12 and the e1000e-0.4.1.7 driver both use NAPI by default. To rule that out, I recompiled the 0.4.1.7 driver for a manually patched 2.6.24.7 kernel. The result: no kernel crash in the original setup (endhost - switch - switch - endhost). In hindsight, I should have tried this first :)

On Feb 10, 2010, at 10:36 PM, Eddie Kohler wrote:

> Hi Nuutti,
> 
> There is a small chance this commit may fix your issue:
> 
> http://www.read.cs.ucla.edu/gitweb?p=click;a=commit;h=01c8f4e084036338e83a6bff7a8e74dc49caa014
> 
> If it does not, I think we need more input from you to narrow it down...
> 
> Thanks so much,
> Eddie
> 
> 
> Eddie Kohler wrote:
>> Nuutti,
>> Thanks very much for these dumps and this config.  Pretty informative.
>> Here are some debugging suggestions.
>> (0) This distinctly looks like memory corruption, possibly within ToDevice.  I will look at Queue itself, as well, but this seems like an unlikely source of problems, since your Click is not installed with --enable-multithread.
>> (1) Perhaps the problem is with EtherSwitch, whose internal hash table may be causing problems in SMP settings.  Can you try again, replacing the EtherSwitch element with a Hub element?  This will do the same job, but without a table.  My expectation is this will also fail.
>> (2) To narrow down the problem, we can try very simple ToDevice and Queue configs.  This would involve:
>> - ia32
>> - either patch or fixincludes
>> - SMP kernel
>> - The following configs:
>> InfiniteSource(DATA \<plausible-data-for-an-ethernet-packet>)
>> -> ToDevice(eth0);
>> -*- OR
>> InfiniteSource(DATA \<plausible-data-for-an-ethernet-packet>)
>> -> Queue
>> -> ToDevice(eth0);
>> -*- OR
>> InfiniteSource(DATA \<plausible-data-for-an-ethernet-packet>)
>> -> ToDevice(eth0);
>> InfiniteSource(DATA \<plausible-data-for-an-ethernet-packet>)
>> -> ToDevice(eth1);
>> -*- OR
>> InfiniteSource(DATA \<plausible-data-for-an-ethernet-packet>)
>> -> Queue
>> -> ToDevice(eth0);
>> InfiniteSource(DATA \<plausible-data-for-an-ethernet-packet>)
>> -> Queue
>> -> ToDevice(eth1);
>> ------
>> These configs test ToDevice with and without Queues, and with and without accessing two devices.
>> We'll look in parallel, but I'm interested in what you see.
>> Eddie
>> Nuutti Varis wrote:
>>> Hey, 
>>> While trying to run throughput measurements with Click in the kernel, running a simple EtherSwitch configuration (attached as etherswitch.click) in a topology of:
>>> 
>>> EndHostA::ethI0 <==> ethI0::EtherSwitch1::ethI1 <==> ethI1::EtherSwitch2::ethI0 <==> ethI0::EndHostB
>>> 192.168.2.1 ---------------------------------------------------------------------------> 192.168.2.2
>>> FastUDPSrc w/ 64B packets, 300 kpps
>>> 
>>> I stumbled upon a kernel crash, seemingly when the Queue elements started dropping packets due to overflow. I tried this with two different kernel versions (2.6.31.12 and 2.6.24.7), and with either the 2.6.24.7 manual patch or --enable-fixincludes. Interestingly, the kernel crash does not happen when I disable SMP in the kernel. Additionally, normal Linux bridging does not crash the kernel on overflows. Partial/full crash dumps from various days of testing are attached.
>>> 
>>> Configuration stuff of the EtherSwitch{1,2}:
>>> - Dumps arch indicated in the filename, either amd64 or ia32
>>> - MTU of ethI1 is 1540 (tried with 1500 as well, no difference)
>>> - Click is configured with --enable-linuxmodule --enable-userlevel --enable-etherswitch [--enable-fixincludes]
>>> - Kernel does not have any pre-empting enabled.
>>> - Both e1000e poll-patched and vanilla cause the problem
>>> - e1000e versions 0.4.1.7 and 1.0.2-k2 (comes with 2.6.31.12) cause the problem
>>> 
>>> --
>>> Nuutti Varis (nvaris at cc.hut.fi)
>>> PhD Student, Aalto University School of Science and Technology
>>> Department of Communications and Networking
>>> 
>>> _______________________________________________
>>> click mailing list
>>> click at amsterdam.lcs.mit.edu
>>> https://amsterdam.lcs.mit.edu/mailman/listinfo/click

-------------- next part --------------
A non-text attachment was scrubbed...
Name: kernel_warn.100210.linux-2.6.31.12.ia32.enable_fixincludes.dump
Type: application/octet-stream
Size: 1246 bytes
Desc: not available
Url : http://amsterdam.lcs.mit.edu/pipermail/click/attachments/20100211/6dbabc88/attachment-0002.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: kernel_warn.100210.linux-2.6.31.12.ia32.enable_fixincludes.2.dump
Type: application/octet-stream
Size: 1982 bytes
Desc: not available
Url : http://amsterdam.lcs.mit.edu/pipermail/click/attachments/20100211/6dbabc88/attachment-0003.obj 
-------------- next part --------------

--
Nuutti Varis (nvaris at cc.hut.fi)
PhD Student, Aalto University School of Science and Technology
Department of Communications and Networking




