[Click] e1000 driver timeout with 2.6.x

Thu Jul 20 12:46:58 EDT 2006

I can reproduce this with the following setup:

comp9 <-> comp11 <-> comp12

comp9:  netperf TCP source
comp11: Click router with Max's patched e1000 driver (config attached)
comp12: netserver

Error:

e1000: eth3: e1000_clean_tx_irq: Detected Tx Unit Hang
  TDH                  <3b>
  TDT                  <3b>
  next_to_use          <3b>
  next_to_clean        <3a>
buffer_info[next_to_clean]
  dma                  <3db93040>
  time_stamp           <0>
  next_to_watch        <0>
  jiffies              <1001b9088>
  next_to_watch.status <0>
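
For anyone comparing these dumps across machines, here is a minimal Python sketch that pulls the fields out of the Tx Unit Hang report above and lists a few observations. The ring size of 256 is an assumption (the log does not show it), and the names parse_hang_dump, pending_descriptors and observations are made up for this illustration. The notes only restate what the numbers show; they are not a diagnosis of the driver.

```python
import re

# Tx Unit Hang report copied from the log above.
DUMP = """\
e1000: eth3: e1000_clean_tx_irq: Detected Tx Unit Hang
  TDH                  <3b>
  TDT                  <3b>
  next_to_use          <3b>
  next_to_clean        <3a>
buffer_info[next_to_clean]
  dma                  <3db93040>
  time_stamp           <0>
  next_to_watch        <0>
  jiffies              <1001b9088>
  next_to_watch.status <0>
"""

# Matches lines of the form "  name   <hex>".
FIELD = re.compile(r"^\s*([\w.]+)\s+<([0-9a-fA-F]+)>\s*$")


def parse_hang_dump(text):
    """Return {field_name: int} for every 'name <hex>' line in the dump."""
    fields = {}
    for line in text.splitlines():
        m = FIELD.match(line)
        if m:
            fields[m.group(1)] = int(m.group(2), 16)
    return fields


def pending_descriptors(fields, ring_size=256):
    """Descriptors queued by the driver but not yet cleaned.

    ring_size=256 is an ASSUMPTION; the real value is not in the log.
    The modulo handles the case where next_to_use has wrapped around.
    """
    return (fields["next_to_use"] - fields["next_to_clean"]) % ring_size


def observations(fields, ring_size=256):
    """Collect plain observations about the dump (hints only)."""
    notes = []
    if fields["TDH"] == fields["TDT"]:
        notes.append("TDH == TDT: hardware head has caught up with the tail")
    notes.append("%d descriptor(s) not yet cleaned by the driver"
                 % pending_descriptors(fields, ring_size))
    if fields["time_stamp"] == 0:
        notes.append("time_stamp of the buffer at next_to_clean is 0")
    return notes


if __name__ == "__main__":
    for note in observations(parse_hang_dump(DUMP)):
        print(note)
```

Under the same 256-entry assumption, the second dump in Todd's mail below (next_to_use 14, next_to_clean d6) works out to 62 uncleaned descriptors, because next_to_use has wrapped.
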
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
<ffffffff880b2cd8>{:click:_ZN10PollDevice8run_taskEv+216}
PGD 3e851067 PUD 3dc29067 PMD 0
Oops: 0000 [1]
CPU 0
Modules linked in: click e1000 proclikefs
Pid: 8416, comm: kclick Not tainted 2.6.16.13 #4
RIP: 0010:[<ffffffff880b2cd8>]
<ffffffff880b2cd8>{:click:_ZN10PollDevice8run_taskEv+216}
RSP: 0018:ffff81003c2e9e68  EFLAGS: 00010297
RAX: 000000000000003f RBX: 0000000000000000 RCX: 0000000000000001
RDX: 0000000000000001 RSI: ffff81003d83a0c0 RDI: ffff81003d83a0c0
RBP: ffff81003df0b580 R08: ffffffff8836b418 R09: ffffffff8836b418
R10: 0000000000000000 R11: 0000000000000610 R12: ffff81003e864800
R13: 0000000000000000 R14: 0000000000000000 R15: ffff81003c2e9e78
FS:  000000000073cae0(0000) GS:ffffffff8057a000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000003e200000 CR4: 00000000000006e0
Process kclick (pid: 8416, threadinfo ffff81003c2e8000, task ffff81003f868890)
Stack: ffff81003d83a0c0 00000001000000fc ffff81003e864000 0000000000000246
       0000000000000246 0000000000000031 ffff81003df0b580 ffff81003f868890
       000000000000aba2 ffff81003df0bca8
Call Trace: <ffffffff88042b41>{:click:_ZN12RouterThread6driverEv+625}
       <ffffffff88029b3c>{:click:_Znam+28}
<ffffffff88107564>{:click:_Z11click_schedPv+148}
       <ffffffff8010ae82>{child_rip+8}
<ffffffff881138f0>{:click:_ZNK16BaseErrorHandler7nerrorsEv+0}
       <ffffffff881074d0>{:click:_Z11click_schedPv+0}
<ffffffff8010ae7a>{child_rip+0}

Code: 4d 8b 36 48 8b 83 b8 00 00 00 83 83 88 00 00 00 0e 48 c7 03
RIP <ffffffff880b2cd8>{:click:_ZN10PollDevice8run_taskEv+216} RSP
<ffff81003c2e9e68>
CR2: 0000000000000000
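
The mangled C++ names in the trace are easier to read once demangled. c++filt from binutils does this properly; the Python sketch below is a hand-written stand-in that handles only the few Itanium ABI forms appearing in this oops (nested names, plain function names, operator new/delete, and the parameter types void, int, char, unsigned long and pointers to them). The name demangle and the helper names are made up for this illustration.

```python
# Only the builtin types and operators needed for this trace (a small subset).
BUILTIN = {"v": "void", "i": "int", "c": "char", "m": "unsigned long"}
OPERATORS = {"nw": "operator new", "na": "operator new[]",
             "dl": "operator delete", "da": "operator delete[]"}


def _source_name(sym, i):
    """Read a length-prefixed identifier such as '10PollDevice'."""
    j = i
    while sym[j].isdigit():
        j += 1
    length = int(sym[i:j])
    return sym[j:j + length], j + length


def _type(sym, i):
    """Read one parameter type: a builtin, or 'P' (pointer) plus a type."""
    if sym[i] == "P":
        inner, i = _type(sym, i + 1)
        return inner + "*", i
    return BUILTIN[sym[i]], i + 1


def demangle(sym):
    """Demangle the simple symbol forms seen in the oops above."""
    if not sym.startswith("_Z"):
        return sym  # plain C symbol such as child_rip
    i, is_const, parts = 2, False, []
    if sym[i] == "N":                      # nested name: N [K] name+ E
        i += 1
        if sym[i] == "K":                  # const member function
            is_const, i = True, i + 1
        while sym[i] != "E":
            name, i = _source_name(sym, i)
            parts.append(name)
        i += 1
    elif sym[i].isdigit():                 # free function
        name, i = _source_name(sym, i)
        parts.append(name)
    else:                                  # two-letter operator code
        parts.append(OPERATORS[sym[i:i + 2]])
        i += 2
    params = []
    while i < len(sym):
        ptype, i = _type(sym, i)
        params.append(ptype)
    if params == ["void"]:                 # 'v' alone means no parameters
        params = []
    text = "::".join(parts) + "(" + ", ".join(params) + ")"
    return text + " const" if is_const else text


# Symbols copied from the call trace above.
TRACE = ["_ZN10PollDevice8run_taskEv", "_ZN12RouterThread6driverEv",
         "_Znam", "_Z11click_schedPv", "_ZNK16BaseErrorHandler7nerrorsEv",
         "child_rip"]

if __name__ == "__main__":
    for symbol in TRACE:
        print(symbol, "->", demangle(symbol))
```

Read this way, the faulting instruction is in PollDevice::run_task(), called from RouterThread::driver(), which runs under click_sched(void*) in the kclick thread.
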

Kernel:

Linux computer11 2.6.16.13 #4 Thu Jun 8 17:06:31 BST 2006 x86_64 AMD Opteron(tm) Processor 250 AuthenticAMD GNU/Linux

I'll keep looking into this.

Adam

On 7/19/06, todd lewis <tgl2 at yahoo.com> wrote:
> Recompiling the kernel, modules, driver and Click for UP instead of SMP did allow it to work, but
> rather than my normal 700 Mbps (which I can sustain even with netfilter queueing to userspace), I
> got only 190 kbps and these errors:
>
> ****************************
> [17195356.700000] e1000: eth2: e1000_watchdog_1: NIC Link is Up 1000 Mbps Full Duplex
> [17195379.828000] e1000: eth3: e1000_clean_tx_irq: Detected Tx Unit Hang
> [17195379.828000]   TDH                  <92>
> [17195379.828000]   TDT                  <92>
> [17195379.828000]   next_to_use          <92>
> [17195379.828000]   next_to_clean        <91>
> [17195379.828000] buffer_info[next_to_clean]
> [17195379.828000]   dma                  <7c3c7840>
> [17195379.828000]   time_stamp           <0>
> [17195379.828000]   next_to_watch        <0>
> [17195379.828000]   jiffies              <3b2b1d>
> [17195379.828000]   next_to_watch.status <0>
> [17195430.836000] e1000: eth3: e1000_clean_tx_irq: Detected Tx Unit Hang
> [17195430.836000]   TDH                  <14>
> [17195430.836000]   TDT                  <14>
> [17195430.836000]   next_to_use          <14>
> [17195430.836000]   next_to_clean        <d6>
> [17195430.836000] buffer_info[next_to_clean]
> [17195430.836000]   dma                  <7c230c40>
> [17195430.836000]   time_stamp           <0>
> [17195430.836000]   next_to_watch        <0>
> [17195430.836000]   jiffies              <3b5ced>
> [17195430.836000]   next_to_watch.status <0>
> (..., lots of these)
> ****************************
>
> I have a dual-port pcie e1000 card.  I plan to try that, and then to try one port from each card.
> If anyone has any other experiments they'd like run with my setup, then please let me know.
>
> --- Adam Greenhalgh <a.greenhalgh at cs.ucl.ac.uk> wrote:
>
> > Max,
> >
> > Are you running an SMP kernel on the polling boxes, and which Intel
> > card are you using? I've seen numerous SMP-related reports on the
> > e1000 / netdev lists and just noticed that Todd is using an SMP
> > kernel.
> >
> > Adam
> >
> > On 7/6/06, Massimiliano Poletto <maxp at mazunetworks.com> wrote:
> > > Hi Srivas and Beyers, I've spent some time looking at drivers again recently.
> > >
> > > What works best for me at present is a patched version of the 6.1.16.2
> > > Intel driver (not the 6.3.9-k4 driver that comes with linux
> > > 2.6.16.13).  I attach the driver sources and patch to this email.  I'm
> > > using a 2.6.16.22 kernel, but I don't see why .13 should work any less
> > > well with the driver.
> > >
> > > Performance is good, and it is stable across hundreds of
> > > installs/uninstalls and many hours of testing at full line rate
> > > offered load.  I sometimes see messages similar to yours (below is an
> > > example), but they only seem to happen during stress tests when click
> > > is repeatedly installed/uninstalled at very short intervals:
> > > e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
> > >   TDH                  <b0>
> > >   TDT                  <b0>
> > >   next_to_use          <b0>
> > >   next_to_clean        <9e>
> > > buffer_info[next_to_clean]
> > >   dma                  <2c66c040>
> > >   time_stamp           <0>
> > >   next_to_watch        <0>
> > >   jiffies              <82ddae8>
> > >   next_to_watch.status <0>
> > >
> > > I'm trying to get a sourceforge 7.x driver to work, but for now this
> > > seems at least workable.
> > >
> > > Please let me know if you have problems with this driver, or if you
> > > make other progress yourselves.
> > >
> > > Regards,
> > > max
> > >
> > >
> > > On 7/3/06, Srivas Chennu <chennu at hhi.fhg.de> wrote:
> > > > Hello Beyers,
> > > >
> > > > Thanks a lot for your speedy response. To answer your question regarding
> > > > the e1000 driver, I've downloaded and tested my configuration with the
> > > > latest stable release (7.1.9) from sourceforge, and the timeout
> > > > stubbornly continues to occur even with the TSO option disabled.
> > > >
> > > > For your reference, the possibly relevant snippet of my click
> > > > configuration is attached. It uses a click element (onuagent) that I've
> > > > written to emulate the protocol being tested, which receives and
> > > > forwards packets between 3 interfaces via customized priority
> > > > schedulers.
> > > >
> > > > ...
> > > > FromDevice($rp0, PROMISC true) -> [0]onuagent;
> > > > onuagent[0] -> priosched0 -> ToDevice($rp0);
> > > > FromDevice($rp1, PROMISC true) -> [1]onuagent;
> > > > onuagent[1] -> priosched1 -> ToDevice($rp1);
> > > > FromDevice($lp, PROMISC true) -> [2]onuagent;
> > > > onuagent[2] -> priosched2 -> ToDevice($lp);
> > > > ...
> > > >
> > > > I'm currently attempting to find a combination of a kernel (2.4.x or
> > > > 2.6.x) and a stable e1000 driver version with which I can reliably use
> > > > FromDevice/PollDevice. Any details of a setup that has worked for you in
> > > > this regard would be helpful.
> > > >
> > > > Thanks in advance,
> > > > Srivas.
> > > >
> > > > On Jul 03, 2006 05:18 PM, Beyers Cronje wrote:
> > > >
> > > >
> > > > >
> > > > > Hi Srivas,
> > > > >
> > > > > > This is a problem that Adam, a few others and I have been struggling
> > > > > > with. It is strange that FromDevice gives you the TX hang, as on my
> > > > > > system it only happens when using PollDevice in certain configurations.
> > > > > > If possible, can you post the Click config you are using to reproduce
> > > > > > the hang?
> > > > >
> > > > > > Adam pointed me to the e1000 dev mailing list on SourceForge, and the
> > > > > > TX hang issue seems to pop up on standard Linux (non-Click) systems as
> > > > > > well. One possible workaround is to disable TCP segmentation
> > > > > > offloading (TSO), which you can do via 'ethtool -K eth0 tso off', but
> > > > > > it seems to work only sometimes ...
> > > > >
> > > > > > What e1000 driver version are you using? Since you are only using
> > > > > > FromDevice, have you tried the latest e1000 driver?
> > > > >
> > > > > Anyone else also having this problem?
> > > > >
> > > > > Beyers
> > > > >
> > > > >
> > > > > On 7/3/06, Srivas Chennu <chennu at hhi.fhg.de> wrote:
> > > > > > Hello all,
> > > > > >
> > > > > > > I'm a relatively new click user trying to build and test a link
> > > > > > > layer protocol using Click. My test runs used the click kernel
> > > > > > > module built from the latest CVS sources. On a patched 2.6.16.13
> > > > > > > kernel with an original Intel PRO/1000 MT dual port GbE NIC for a
> > > > > > > click configuration using FromDevice, the driver abruptly times out
> > > > > > > during Tx and resets with messages like those below:
> > > > > >
> > > > > > e1000: eth1: e1000_clean_tx_irq: Detected Tx Unit Hang
> > > > > > Tx Queue <0>
> > > > > > TDH <97>
> > > > > > TDT <9a>
> > > > > > next_to_use <9a>
> > > > > > next_to_clean <95>
> > > > > > buffer_info[next_to_clean]
> > > > > > time_stamp
> > > > > > next_to_watch <97>
> > > > > > jiffies
> > > > > > next_to_watch.status <0>
> > > > > > ....
> > > > > > ....
> > > > > > Eventually I see in the log file:
> > > > > >
> > > > > > NETDEV WATCHDOG: eth1: transmit timed out
> > > > > > e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
> > > > > >
> > > > > > > Interestingly, this timeout-and-reset problem does not occur when
> > > > > > > running my click configuration at the userlevel, but reproduces
> > > > > > > quite easily with the kernel module, even when the NIC is working
> > > > > > > at low packet Rx rates. All configuration parameters to the e1000
> > > > > > > modules are at their defaults, and my attempts with parameters
> > > > > > > suggested in a previous post
> > > > > > > (https://amsterdam.lcs.mit.edu/pipermail/click/2006-March/004690.html)
> > > > > > > for similar problems didn't help.
> > > > > >
> > > > > >
> > > > > > Any pointers to solving this problem are appreciated,
> > > > > >
> > > > > > Thanks in advance,
> > > > > > Srivas.
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Meet us at
> > > > > >
> > > > > > OC&I 2006 and NOC 2006: 10.-13.7.06 at HHI Berlin, Germany
> > > > > > IFA: 1.-6.9.06, Berlin, Germany
> > > > > > _______________________________________________
> > > > > > click mailing list
> > > > > > click at amsterdam.lcs.mit.edu
> > > > > > https://amsterdam.lcs.mit.edu/mailman/listinfo/click
> > > > > >
> > > >
> > > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> > >
> > >
> >
>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: routercomputer11-aligned-polldevice.click
Type: application/octet-stream
Size: 5438 bytes
Desc: not available
Url : https://amsterdam.lcs.mit.edu/pipermail/click/attachments/20060720/654aa336/routercomputer11-aligned-polldevice.obj