[Click] Causing Kernel Panics Via Click

dmoore7@nd.edu dmoore7 at nd.edu
Thu Sep 13 13:30:33 EDT 2007


Hello Everyone,
I have encountered a way of bringing down one of my click machines, though it
does not reproduce every time.  Below is an example of how it happens.

> Log in to a freshly booted machine
> Successfully load a locally stored click file
> Ping another machine via the click router
> Use w3c's webbot to successfully download a webpage from another host
> Do so again, again with proper results
> Execute click-uninstall
> Reload the same kernel config I had loaded earlier

> Meanwhile, in another ssh window to the webserver I have tcpdump running, and
have been watching the traffic coming in and out.

> I execute the same webbot command as before, with the following output
(copy-pasted from the uninstall onward):

[root@filius ~]# click-uninstall
[root@filius ~]# click-install ./Filius-1-50.click
[root@filius ~]# webbot -I spoof_eth2_0 -q -n -saveimg http://192.168.10.4/index3.html
Message from syslogd@filius at Thu Sep 13 12:59:52 2007 ...
filius kernel: BUG: spinlock wrong CPU on CPU#2, click-uninstall/4890

Message from syslogd@filius at Thu Sep 13 12:59:52 2007 ...
filius kernel:  lock: f8c79c44, .magic: dead4ead, .owner: click-uninstall/4890, .owner_cpu: 3

<after about 5 minutes>
Read from remote host filius.cse.nd.edu: Connection timed out
Connection to filius.cse.nd.edu closed.

> During this I have been watching tcpdump on the http server.  The transfer
proceeds normally until the server suddenly stops receiving ACKs.
> Thus it appears the click module began malfunctioning mid-use, and not simply
when it was loaded.
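
As a reading aid for the two kernel messages pasted above, here is a minimal
sketch in Python that pulls out the fields and compares them.  The regular
expressions assume the message layout exactly as it appears in the paste, and
the names (parse_spinlock_report, BUG_RE, LOCK_RE) are made up for this
illustration; nothing here comes from click or the kernel sources.

```python
import re

# Sample text copied from the console paste in this message
# (the wrapped second line has been joined back together).
LOG = """\
filius kernel: BUG: spinlock wrong CPU on CPU#2, click-uninstall/4890
filius kernel:  lock: f8c79c44, .magic: dead4ead, .owner: click-uninstall/4890, .owner_cpu: 3
"""

# Assumed layout of the two lines, taken from the paste above.
BUG_RE = re.compile(
    r"BUG: spinlock (?P<reason>.+?) on CPU#(?P<cpu>\d+), "
    r"(?P<task>\S+)/(?P<pid>\d+)"
)
LOCK_RE = re.compile(
    r"lock: (?P<addr>[0-9a-f]+), \.magic: (?P<magic>[0-9a-f]+), "
    r"\.owner: (?P<owner>\S+)/(?P<owner_pid>\d+), "
    r"\.owner_cpu: (?P<owner_cpu>-?\d+)"
)


def parse_spinlock_report(text):
    """Return the interesting fields of a spinlock BUG report, or None."""
    bug = BUG_RE.search(text)
    lock = LOCK_RE.search(text)
    if bug is None or lock is None:
        return None
    reporting_cpu = int(bug.group("cpu"))
    owner_cpu = int(lock.group("owner_cpu"))
    return {
        "reason": bug.group("reason"),
        "reporting_cpu": reporting_cpu,
        "owner_cpu": owner_cpu,
        # True when the CPU that printed the BUG is not the recorded owner CPU.
        "cpu_mismatch": reporting_cpu != owner_cpu,
        # True when the reporting task is also the recorded lock owner.
        "same_task": (bug.group("task"), bug.group("pid"))
        == (lock.group("owner"), lock.group("owner_pid")),
        # 0xdead4ead is the magic value of an initialized spinlock when
        # spinlock debugging is enabled, so the lock itself looks intact.
        "magic_ok": lock.group("magic") == "dead4ead",
    }


report = parse_spinlock_report(LOG)
for key, value in report.items():
    print(f"{key}: {value}")
```

For the paste above, the report shows the same task (click-uninstall, pid 4890)
as both reporter and lock owner, with the lock recorded as owned on CPU 3 while
the BUG was raised on CPU 2.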

Background information:
 - There are 3 machines with click loaded on them.  The config files I used can
be found here:
Filius (the one that crashed): http://cse.nd.edu/~dmoore7/Filius-1-50.click
Sybill (the forwarder): http://cse.nd.edu/~dmoore7/Sybill-1-50.click
Hagrid (the http host): http://cse.nd.edu/~dmoore7/Hagrid-1-50.click
 - For a general idea of what these router configs do, see this diagram:
http://cse.nd.edu/~dmoore7/myrouter.jpg
 - The webbot I used is a slightly hacked version of w3c's webbot (modified to
allow forcing of a particular device).  It exhibits no failures without click.
 - Filius is running a version of click checked out from CVS about 3 weeks ago.
The others are running 1.5.0, downloaded from the click website in the middle
of this past summer.  I can update filius if that would be helpful.
 - The systems in question have 4 processors each; I believe technically 2 Core
Duos each.
 - Kernel is a vanilla 2.6.16.13 from kernel.org w/ click patch applied
 - This appears to be related to an earlier problem I reported here, which at
first appeared to be resolved: 
https://pdos.csail.mit.edu/pipermail/click/2007-August/006206.html

Any input is appreciated.  I will try to get crash dumps and such when I regain
access to the machine; as it is not local, I must wait for someone else to
reboot it for me.
Thanks,
 - David



