[Click] Unable to handle kernel paging request at virtualaddress
Eddie Kohler
kohler at cs.ucla.edu
Mon Feb 5 18:57:59 EST 2007
Hi Thomas,
Documenting the Spinlock objects may have revealed the problem. Click
spinlocks, which are hand built -- I don't know why, ask Benjie Chen or Robert
Morris -- are only useful when Click is compiled with --enable-multithread.
In particular, a Click Spinlock WILL NOT prevent interactions between a Click
thread and a handler.
I think we should just fix this, by making Spinlock shorthand for a regular
old linux spinlock_t of some kind. This will probably slow down Click, but it
seems worth it.
All: Speak now or forever hold your peace.
Eddie
Paine, Thomas Asa wrote:
> Ok, so back to this dang problem. I was able to cut down on these crashes, but not able to eliminate them. I did finally notice something very strange. The "virtual address" it is referencing in this dump is actually valid map data!! 0x40e21c89 is hex value of a network byte ordered IP address which was in the map at the time of the crash (0x40E21C89 = 1088560265 = 64.226.28.137 network byte order). I'm assuming it came from register edx from the crash below.
> I've provided the disassembly for _ZN7HashMapIjjE6insertERKjS2_. These crashes are induced by pounding on Spinlocks(ReadWriteLocks) acquires and releases from different threads. Is it possible Spinlock's assembly is causing GCC to clobber some registers? I don't know how else map data ended up as a memory reference. Should I be adding something to the CXXFLAGS or doing something special with gcc? Very frustrated at this point.
>
> gcc is version 4.1.0
> CXXFLAGS=-g
> Kernel is 2.6.16.13
> Click is CVS
>
> ************** the crash *******************
>
> Unable to handle kernel paging request at virtual address 40e21c89
> printing eip:
> f88b2387
> *pde = 00000000
> Oops: 0000 [#1]
> SMP
> Modules linked in: tms click proclikefs e1000
> CPU: 0
> EIP: 0060:[<f88b2387>] Tainted: P VLI
> EFLAGS: 00010202 (2.6.16.13-tms #6)
> EIP is at _ZN7HashMapIjjE6insertERKjS2_+0x57/0xe2 [tms]
> eax: 0000c331 ebx: 00007fff ecx: c19f0e80 edx: 40e21c89
> esi: f61f2480 edi: 00001b9d ebp: f7f65cb0 esp: f7f65c4c
> ds: 007b es: 007b ss: 0068
> Process bash (pid: 1046, threadinfo=f7f64000 task=c19fba90)
> Stack: <0>f7f65c54 00000000 00000002 00000040 cdc20016 1600c2cd f88a2615 f61f2480
> f7f65cb0 f7f65cd0 00000001 00000000 cdc20016 f7f64000 f7f64000 f61f3880
> 1600ffff f6819760 00000001 f7d4bd80 00000000 00000000 00000000 0000ffff
> Call Trace:
> [<f88a2615>] _ZN16BlacklistElement14addToList_cidrER6Stringj+0x235/0x2ae [tms]
> [<f88a1768>] _ZN16BlacklistElement13write_handlerERK6StringP7ElementPvP12ErrorHandler+0x10a/0x252 [tms]
> [<f8c5d50a>] _ZNK7Handler10call_writeERK6StringP7ElementbP12ErrorHandler+0x144/0x1ca [click]
> [<f8c67497>] _ZN11HandlerCall6assignEP7ElementRK6StringS4_iP12ErrorHandler+0x27/0x1e8 [click]
> [<f8c68177>] _ZN11HandlerCall10call_writeEP7ElementRK6StringS4_P12ErrorHandler+0x93/0xcc [click]
> [<f88afc74>] _ZN13TMSController9list_cidrE6StringS0_+0x1b8/0x2a8 [tms]
> [<f88b0e90>] _ZN13TMSController13write_handlerERK6StringP7ElementPvP12ErrorHandler+0x1f4/0x458 [tms]
> [<c0157cc8>] cache_alloc_refill+0x80/0x511
> [<f8c3c4a2>] _ZN6String4MemoC1Eii+0x24/0x2c [click]
> [<f8c5d50a>] _ZNK7Handler10call_writeERK6StringP7ElementbP12ErrorHandler+0x144/0x1ca [click]
> [<f8c438e4>] click_lalloc+0x3c/0x3e [click]
> [<f8c3ce8d>] _ZN6String6appendEPKci+0x2f/0x76 [click]
> [<f8cc26e7>] handler_flush+0x577/0x57e [click]
> [<c0159929>] filp_close+0x28/0x6d
> [<c016cd46>] sys_dup2+0xc5/0xee
> [<c0102c9d>] syscall_call+0x7/0xb
> Code: d2 74 17 3b 0a 75 f5 8b 4c 24 24 8b 01 89 42 04 31 c0 83 c4 08 5b 5e 5f 5d c3 8b 46 0c 3b 46 10 73 41 8b 4e 14 8b 11 85 d2 74 5a <8b> 02 89 01 89 d3 8b 45 00 89 03 8b 4c 24 24 8b 01 89 43 04 8d
>
>
>
> (gdb) disassemble _ZN7HashMapIjjE6insertERKjS2_
> Dump of assembler code for function HashMap:
> 0x00000000 <hatred_of_rebecca+0>: sub $0x10,%esp
> 0x00000003 <hatred_of_rebecca+3>: movl $0x0,0xc(%esp)
> 0x0000000b <hatred_of_rebecca+11>: lea 0xc(%esp),%eax
> 0x0000000f <hatred_of_rebecca+15>: mov %eax,0x8(%esp)
> 0x00000013 <hatred_of_rebecca+19>: movl $0x0,0x4(%esp)
> 0x0000001b <hatred_of_rebecca+27>: mov 0x18(%esp),%eax
> 0x0000001f <hatred_of_rebecca+31>: mov %eax,(%esp)
> 0x00000022 <hatred_of_rebecca+34>: call 0x23 <hatred_of_rebecca+35>
> 0x00000027 <hatred_of_rebecca+39>: test %al,%al
> 0x00000029 <hatred_of_rebecca+41>: je 0x3a <__module_depends+14>
> 0x0000002b <hatred_of_rebecca+43>: mov 0xc(%esp),%edx
> 0x0000002f <__module_depends+3>: test %edx,%edx
> End of assembler dump.
> (gdb)
>
>
>
> 40E21C89 -> 1088560265 -> 64.226.28.137
>
>
> Thanks,
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Thomas Paine (paineta at uwec.edu)
> University of Wisconsin - Eau Claire
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> -----Original Message-----
> From: click-bounces at pdos.csail.mit.edu [mailto:click-bounces at pdos.csail.mit.edu] On Behalf Of Paine, Thomas Asa
> Sent: Thursday, October 26, 2006 12:13 PM
> To: Beyers Cronje
> Cc: click at pdos.csail.mit.edu
> Subject: Re: [Click] Unable to handle kernel paging request at virtualaddress
>
> Beyers,
> No, my maps are small. I had set up some very tight testing conditions and the map sizes were very small and appear to have no bearing on the crashing. It only appears to tank when there is SMP involved. It runs fine without SMP or with SMP with 0 packet load. You remember my post back in September...
> https://amsterdam.lcs.mit.edu/pipermail/click/2006-September/005176.html
> , well I don't think I'm out of the woods with that issue. I don't think SMP, P4 Hyperthreading, and Spinlock (click/include/click/sync.hh) are working right together. My system can run like a rock star, however when I call the write_handlers from /proc (which then causes 1000's of handler calls between elements) do I see the issues. My hypothesis is that my sched_setaffinity() hack job handled the race condition with the click's main thread (and only one because of my prior post), however the system call into click is still making the locks vulnerable, thus the locks not working.
> I complied click and my package with the -g compiler option (via CXXFLAGS), produced the oops, located the exact line in source code (via gdb/info line/disassemble) and was able to verify the assembly and registers are correct at the time of crashes. It would also explain why it seems to always pick on a couple of my elements, they are the ones handling the packets and have the most lock exposure.
> I must admit, the whole oops/gdb experience is worth its weight in gold, however, I'd much rather be improving the my software rather than wrestling with it. These are painful times...
> Can anyone confirm or deny any issues with Spinlock and P4 Hyperthreading w/ SMP?
>
> Thanks,
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Thomas Paine (paineta at uwec.edu)
> University of Wisconsin - Eau Claire
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>
>
> ________________________________
>
> From: Beyers Cronje [mailto:bcronje at gmail.com]
> Sent: Wednesday, October 25, 2006 5:50 PM
> To: Paine, Thomas Asa
> Cc: click at pdos.csail.mit.edu
> Subject: Re: [Click] Unable to handle kernel paging request at virtual address
>
>
> Hi Thomas,
>
> How big is your hashmap at the point of the crash? Hashmap only supports
> a maximum of 32k entries due to kmalloc limitation. If this is your
> issue, you can work around this by creating multiple hashmaps and hash
> inserts across them.
>
> Maybe hashmap should also be a CLICK_LALLOC candidate.
>
> Cheers
>
> Beyers
>
>
> On 10/25/06, Paine, Thomas Asa < PAINETA at uwec.edu
> <mailto:PAINETA at uwec.edu> > wrote:
>
> Ok, I started a long email but soon realized it would be
> confusing as
> hell, so let me try the quick and dirty post first :)
>
> I'm getting oops, with EIP always pointing to
> HashMap.insert.
> However, the call trace can differ (somewhat) and the top of the
> call
> trace is not always in a method using HashMap.insert (.find
> though).
> Eliminating calls seen in the trace seems to have no affect on
> preventing an oops. Are there any open issues/changes with
> HashMap? Is
> this a memory allocation error? Possible incorrect kernel
> setting?
> Anyone have troubleshooting recommendations for this error?
> I've
> attached my current kernel config for 2.6.16.13. I'm not 100%
> sure this
> has anything to do specifically with hashmap, or if it is just
> hashmap
> manifesting the error, for whatever the real reason might be.
>
> Looking for some guidance, thanks.
>
>
> Unable to handle kernel paging request at virtual address
> f9291c89
> printing eip:
> e08eea67
> *pde = 00000000
> Oops: 0000 [#1]
> SMP
> Modules linked in: tms click proclikefs e1000
> CPU: 1
> EIP: 0060:[<e08eea67>] Tainted: P VLI
> EFLAGS: 00010282 (2.6.16.13-tms #3)
> EIP is at _ZN7HashMapIjjE6insertERKjS2_+0x57/0xe2 [tms]
> eax: 00003fff ebx: 00007fff ecx: dff152c0 edx: f9291c89
> esi: dfa23520 edi: 000066de ebp: df703c88 esp: df703c48
> ds: 007b es: 007b ss: 0068
> Process bash (pid: 1030, threadinfo=df702000 task=c1b76a50)
> Stack: <0>e08e135f dfa23520 dfa93a00 dfb67180 dfa939c0 df703c88
> e08e1706
> dfa23520
> df703c88 df703cac dfb67180 00000001 00000004 00000000
> 00000000
> df703cb0
> 652a1c89 dfa939c0 00000000 df703cb0 dfa939c0 dfe69760
> e08e181e
> dfa939c0
> Call Trace:
> [<e08e135f>] _ZN16BlacklistElement14is_whitelistedEm+0x5f/0xc6
> [tms]
> [<e08e1706>]
> _ZN16BlacklistElement9addToListER6Stringj+0x172/0x1c8
> [tms]
> [<e08e181e>]
>
> _ZN16BlacklistElement13write_handlerERK6StringP7ElementPvP12ErrorHandler
>
> +0xc2/0x144 [tms]
> [<e0c5d50a>]
>
> _ZNK7Handler10call_writeERK6StringP7ElementbP12ErrorHandler+0x144/0x1ca
> [click]
> [<e0c67497>]
>
> _ZN11HandlerCall6assignEP7ElementRK6StringS4_iP12ErrorHandler+0x27/0x1e8
>
> [click]
> [<e0c68177>]
>
> _ZN11HandlerCall10call_writeEP7ElementRK6StringS4_P12ErrorHandler+0x93/0
> xcc [click]
> [<e0c68177>]
>
> _ZN11HandlerCall10call_writeEP7ElementRK6StringS4_P12ErrorHandler+0x93/0
>
> xcc [click]
> [<e08e9c48>] _ZN13TMSController4listEm6String+0x6a/0x1de [tms]
> [<e0c438e4>] click_lalloc+0x3c/0x3e [click]
> [<e0c3c4a2>] _ZN6String4MemoC1Eii+0x24/0x2c [click]
> [<e08ec238>]
> _ZN13TMSController14whitelist_cidrE6String+0xa8/0xde [tms]
> [<e08ecf40>]
>
> _ZN13TMSController13write_handlerERK6StringP7ElementPvP12ErrorHandler+0x
> 21a/0x2e2 [tms]
> [<e0c438e4>] click_lalloc+0x3c/0x3e [click]
> [<e0c3c4a2>] _ZN6String4MemoC1Eii+0x24/0x2c [click]
> [<e0c3c57a>] _ZN6String14append_garbageEi+0xa4/0x150 [click]
> [<e0c5d50a>]
>
> _ZNK7Handler10call_writeERK6StringP7ElementbP12ErrorHandler+0x144/0x1ca
> [click]
> [<e0c438e4>] click_lalloc+0x3c/0x3e [click]
> [<e0c3ce8d>] _ZN6String6appendEPKci+0x2f/0x76 [click]
> [<e0cc26e7>] handler_flush+0x577/0x57e [click]
> [<c0159c49>] filp_close+0x28/0x6d
> [<c016d1c6>] sys_dup2+0xc5/0xee
> [<c0102cb9>] syscall_call+0x7/0xb
> Code: d2 74 17 3b 0a 75 f5 8b 4c 24 24 8b 01 89 42 04 31 c0 83
> c4 08 5b
> 5e 5f 5d c3 8b 46 0c 3b 46 10 73 41 8b 4e 14 8b 11 85 d2 74 5a
> <8b> 02
> 89 01 89 d3 8b 45 00 89 03 8b 4c 24 24 8b 01 89 43 04 8d
>
>
>
> Thanks,
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Thomas Paine (paineta at uwec.edu)
> University of Wisconsin - Eau Claire
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>
>
> _______________________________________________
> click mailing list
> click at amsterdam.lcs.mit.edu
> https://amsterdam.lcs.mit.edu/mailman/listinfo/click
>
>
>
>
>
>
>
> _______________________________________________
> click mailing list
> click at amsterdam.lcs.mit.edu
> https://amsterdam.lcs.mit.edu/mailman/listinfo/click
>
> _______________________________________________
> click mailing list
> click at amsterdam.lcs.mit.edu
> https://amsterdam.lcs.mit.edu/mailman/listinfo/click
More information about the click
mailing list