[Click] e1000 driver timeout with 2.6.x

Fri Jul 21 05:04:30 EDT 2006

I fixed lib/packet.cc, and it seems OK.
At least for me, the e1000 driver no longer reports TX timeouts.
Could anyone test with the attached patch?
Please let me know the results.

I'm running Click 1.5 (CVS) on Linux 2.6.22 SMP with an Intel 82571EB (rev 06)
and the e1000-7.1.9 driver from e1000.sf.net.

Thanks in advance.

-----Original Message-----
From: click-bounces at pdos.csail.mit.edu
[mailto:click-bounces at pdos.csail.mit.edu] On Behalf Of Adam Greenhalgh
Sent: Friday, July 21, 2006 1:47 AM
To: todd lewis
Cc: Srivas Chennu; click at pdos.csail.mit.edu
Subject: Re: [Click] e1000 driver timeout with 2.6.x

I can reproduce this with the following setup:

Comp9 <-> Comp11 <-> Comp12

comp9 : netperf tcp source
comp11 : click router with max's driver (config attached)
comp12 : netserver

error

e1000: eth3: e1000_clean_tx_irq: Detected Tx Unit Hang
  TDH                  <3b>
  TDT                  <3b>
  next_to_use          <3b>
  next_to_clean        <3a>
buffer_info[next_to_clean]
  dma                  <3db93040>
  time_stamp           <0>
  next_to_watch        <0>
  jiffies              <1001b9088>
  next_to_watch.status <0>
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
<ffffffff880b2cd8>{:click:_ZN10PollDevice8run_taskEv+216}
PGD 3e851067 PUD 3dc29067 PMD 0
Oops: 0000 [1]
CPU 0
Modules linked in: click e1000 proclikefs
Pid: 8416, comm: kclick Not tainted 2.6.16.13 #4
RIP: 0010:[<ffffffff880b2cd8>]
<ffffffff880b2cd8>{:click:_ZN10PollDevice8run_taskEv+216}
RSP: 0018:ffff81003c2e9e68  EFLAGS: 00010297
RAX: 000000000000003f RBX: 0000000000000000 RCX: 0000000000000001
RDX: 0000000000000001 RSI: ffff81003d83a0c0 RDI: ffff81003d83a0c0
RBP: ffff81003df0b580 R08: ffffffff8836b418 R09: ffffffff8836b418
R10: 0000000000000000 R11: 0000000000000610 R12: ffff81003e864800
R13: 0000000000000000 R14: 0000000000000000 R15: ffff81003c2e9e78
FS:  000000000073cae0(0000) GS:ffffffff8057a000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000003e200000 CR4: 00000000000006e0
Process kclick (pid: 8416, threadinfo ffff81003c2e8000, task ffff81003f868890)
Stack: ffff81003d83a0c0 00000001000000fc ffff81003e864000 0000000000000246
       0000000000000246 0000000000000031 ffff81003df0b580 ffff81003f868890
       000000000000aba2 ffff81003df0bca8
Call Trace: <ffffffff88042b41>{:click:_ZN12RouterThread6driverEv+625}
       <ffffffff88029b3c>{:click:_Znam+28}
<ffffffff88107564>{:click:_Z11click_schedPv+148}
       <ffffffff8010ae82>{child_rip+8}
<ffffffff881138f0>{:click:_ZNK16BaseErrorHandler7nerrorsEv+0}
       <ffffffff881074d0>{:click:_Z11click_schedPv+0}
<ffffffff8010ae7a>{child_rip+0}

Code: 4d 8b 36 48 8b 83 b8 00 00 00 83 83 88 00 00 00 0e 48 c7 03
RIP <ffffffff880b2cd8>{:click:_ZN10PollDevice8run_taskEv+216} RSP <ffff81003c2e9e68>
CR2: 0000000000000000

kernel

Linux computer11 2.6.16.13 #4 Thu Jun 8 17:06:31 BST 2006 x86_64 AMD Opteron(tm) Processor 250 AuthenticAMD GNU/Linux

I'll keep looking into this.

Adam

On 7/19/06, todd lewis <tgl2 at yahoo.com> wrote:
> Recompiling the kernel, modules, driver and click for UP instead of SMP
> did allow it to work, but instead of my normal 700mbps (which I can
> sustain even with netfilter queueing to userspace), I got 190kbps and
> these errors:
>
> ****************************
> [17195356.700000] e1000: eth2: e1000_watchdog_1: NIC Link is Up 1000 Mbps Full Duplex
> [17195379.828000] e1000: eth3: e1000_clean_tx_irq: Detected Tx Unit Hang
> [17195379.828000]   TDH                  <92>
> [17195379.828000]   TDT                  <92>
> [17195379.828000]   next_to_use          <92>
> [17195379.828000]   next_to_clean        <91>
> [17195379.828000] buffer_info[next_to_clean]
> [17195379.828000]   dma                  <7c3c7840>
> [17195379.828000]   time_stamp           <0>
> [17195379.828000]   next_to_watch        <0>
> [17195379.828000]   jiffies              <3b2b1d>
> [17195379.828000]   next_to_watch.status <0>
> [17195430.836000] e1000: eth3: e1000_clean_tx_irq: Detected Tx Unit Hang
> [17195430.836000]   TDH                  <14>
> [17195430.836000]   TDT                  <14>
> [17195430.836000]   next_to_use          <14>
> [17195430.836000]   next_to_clean        <d6>
> [17195430.836000] buffer_info[next_to_clean]
> [17195430.836000]   dma                  <7c230c40>
> [17195430.836000]   time_stamp           <0>
> [17195430.836000]   next_to_watch        <0>
> [17195430.836000]   jiffies              <3b5ced>
> [17195430.836000]   next_to_watch.status <0>
> (..., lots of these)
> ****************************
>
> I have a dual-port pcie e1000 card.  I plan to try that, and then to try
> one port from each card.
> If anyone has any other experiments they'd like run with my setup, then
> please let me know.
>
> --- Adam Greenhalgh <a.greenhalgh at cs.ucl.ac.uk> wrote:
>
> > Max,
> >
> > Are you running an SMP kernel on the polling boxes, and which Intel
> > card are you using? I've seen numerous SMP-related reports on the
> > e1000 / netdev lists and just noticed that todd is using an SMP
> > kernel.
> >
> > Adam
> >
> > On 7/6/06, Massimiliano Poletto <maxp at mazunetworks.com> wrote:
> > > Hi Srivas and Beyers, I've spent some time looking at drivers again
> > > recently.
> > >
> > > What works best for me at present is a patched version of the 6.1.16.2
> > > Intel driver (not the 6.3.9-k4 driver that comes with linux
> > > 2.6.16.13).  I attach the driver sources and patch to this email.  I'm
> > > using a 2.6.16.22 kernel, but I don't see why .13 should work any less
> > > well with the driver.
> > >
> > > Performance is good, and it is stable across hundreds of
> > > installs/uninstalls and many hours of testing at full line rate
> > > offered load.  I sometimes see messages similar to yours (below is an
> > > example), but they only seem to happen during stress tests when click
> > > is repeatedly installed/uninstalled at very short intervals:
> > > e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
> > >   TDH                  <b0>
> > >   TDT                  <b0>
> > >   next_to_use          <b0>
> > >   next_to_clean        <9e>
> > > buffer_info[next_to_clean]
> > >   dma                  <2c66c040>
> > >   time_stamp           <0>
> > >   next_to_watch        <0>
> > >   jiffies              <82ddae8>
> > >   next_to_watch.status <0>
> > >
> > > I'm trying to get a sourceforge 7.x driver to work, but for now this
> > > seems at least workable.
> > >
> > > Please let me know if you have problems with this driver, or if you
> > > make other progress yourselves.
> > >
> > > Regards,
> > > max
> > >
> > >
> > > On 7/3/06, Srivas Chennu <chennu at hhi.fhg.de> wrote:
> > > > Hello Beyers,
> > > >
> > > > Thanks a lot for your speedy response. To answer your question
> > > > regarding the e1000 driver, I've downloaded and tested my
> > > > configuration with the latest stable release (7.1.9) from
> > > > sourceforge, and the timeout stubbornly continues to occur with the
> > > > TSO option disabled.
> > > >
> > > > For your reference, the possibly relevant snippet of my click
> > > > configuration is attached. It uses a click element (onuagent) that
> > > > I've written to emulate the protocol being tested, which receives
> > > > and forwards packets between 3 interfaces via customized priority
> > > > schedulers.
> > > >
> > > > ...
> > > > FromDevice($rp0, PROMISC true) -> [0]onuagent;
> > > > onuagent[0] -> priosched0 -> ToDevice($rp0);
> > > > FromDevice($rp1, PROMISC true) -> [1]onuagent;
> > > > onuagent[1] -> priosched1 -> ToDevice($rp1);
> > > > FromDevice($lp, PROMISC true) -> [2]onuagent;
> > > > onuagent[2] -> priosched2 -> ToDevice($lp);
> > > > ...
> > > >
> > > > I'm currently attempting to find a combination of a kernel (2.4.x or
> > > > 2.6.x) and a stable e1000 driver version with which I can reliably
> > > > use FromDevice/PollDevice. Any details of a setup that has worked
> > > > for you in this regard would be helpful.
> > > >
> > > > Thanks in advance,
> > > > Srivas.
> > > >
> > > > On Jul 03, 2006 05:18 PM, Beyers Cronje wrote:
> > > >
> > > >
> > > > >
> > > > > Hi Srivas,
> > > > >
> > > > > This is a problem that Adam, a few others and I have been
> > > > > struggling with. It is strange that FromDevice gives you the TX
> > > > > hang, as on my system it only happens when using PollDevice in
> > > > > certain configurations. If possible, can you post the Click config
> > > > > you are using to reproduce the hang?
> > > > >
> > > > > Adam pointed me to the E1000 dev mailing list on SourceForge, and
> > > > > the TX hang issue seems to pop up on standard Linux (non-Click)
> > > > > systems as well. One possible workaround seems to be to disable TCP
> > > > > segmentation offloading (TSO). You can do this via
> > > > > 'ethtool -K eth0 tso off', but it seems to work only sometimes ...
> > > > >
> > > > > What e1000 driver version are you using? Since you are only using
> > > > > FromDevice, have you tried the latest e1000 driver?
> > > > >
> > > > > Anyone else also having this problem?
> > > > >
> > > > > Beyers
> > > > >
> > > > >
> > > > > On 7/3/06, Srivas Chennu <chennu at hhi.fhg.de> wrote:
> > > > > > Hello all,
> > > > > >
> > > > > > I'm a relatively new click user trying to build and test a link
> > > > > > layer protocol using Click. My test runs used the click kernel
> > > > > > module built from the latest CVS sources. On a patched 2.6.16.13
> > > > > > kernel with an original Intel PRO/1000 MT dual port GbE NIC, for
> > > > > > a click configuration using FromDevice, the driver abruptly times
> > > > > > out during Tx and resets with messages like those below:
> > > > > >
> > > > > > e1000: eth1: e1000_clean_tx_irq: Detected Tx Unit Hang
> > > > > > Tx Queue <0>
> > > > > > TDH <97>
> > > > > > TDT <9a>
> > > > > > next_to_use <9a>
> > > > > > next_to_clean <95>
> > > > > > buffer_info[next_to_clean]
> > > > > > time_stamp
> > > > > > next_to_watch <97>
> > > > > > jiffies
> > > > > > next_to_watch.status <0>
> > > > > > ....
> > > > > > ....
> > > > > > Eventually I see in the log file:
> > > > > >
> > > > > > NETDEV WATCHDOG: eth1: transmit timed out
> > > > > > e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
> > > > > >
> > > > > > Interestingly, this timeout-and-reset problem does not occur when
> > > > > > running my click configuration at the userlevel, but reproduces
> > > > > > quite easily with the kernel module, even when the NIC is working
> > > > > > at low packet Rx rates. All configuration parameters to the e1000
> > > > > > modules are at their defaults, and my attempts with parameters
> > > > > > suggested in a previous post
> > > > > > (https://amsterdam.lcs.mit.edu/pipermail/click/2006-March/004690.html)
> > > > > > for similar problems didn't help.
> > > > > >
> > > > > >
> > > > > > Any pointers to solving this problem are appreciated,
> > > > > >
> > > > > > Thanks in advance,
> > > > > > Srivas.
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Meet us at
> > > > > >
> > > > > > OC&I 2006 and NOC 2006: 10.-13.7.06 at HHI Berlin, Germany
> > > > > > IFA: 1.-6.9.06, Berlin, Germany
> > > > > > _______________________________________________
> > > > > > click mailing list
> > > > > > click at amsterdam.lcs.mit.edu
> > > > > > https://amsterdam.lcs.mit.edu/mailman/listinfo/click
> > > > > >
> > >
> > >
> >
>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: packet_clone.patch
Type: application/octet-stream
Size: 534 bytes
Desc: not available
Url : https://amsterdam.lcs.mit.edu/pipermail/click/attachments/20060721/ccf018c6/packet_clone-0001.obj