[Click] e1000 driver timeout with 2.6.x

Adam Greenhalgh a.greenhalgh at cs.ucl.ac.uk
Tue Jul 25 06:02:43 EDT 2006


I am seeing timeouts and a segfault in run_task in polldevice.cc. I
am investigating the segfault at the moment.

Adam
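
For readers following the oops quoted further down: the faulting symbol is printed in mangled form (_ZN10PollDevice8run_taskEv), which is how run_task in polldevice.cc shows up in a kernel trace. Below is a minimal Python sketch, not part of the original thread, that decodes only the simplest Itanium C++ ABI case (a nested name with no arguments). The helper names demangle_nested and parse_frame are hypothetical; for anything beyond this narrow case a real tool such as c++filt is the better choice.

```python
import re

def demangle_nested(symbol):
    """Decode _ZN[K]<len><name>...Ev into Class::method(); return others unchanged.

    Only nested names with an empty argument list are handled (assumption:
    that is enough for the member functions seen in the quoted trace).
    """
    if not symbol.startswith("_ZN"):
        return symbol
    pos = 3
    is_const = symbol[pos:pos + 1] == "K"   # 'K' marks a const member function
    if is_const:
        pos += 1
    parts = []
    while pos < len(symbol) and symbol[pos].isdigit():
        end = pos
        while end < len(symbol) and symbol[end].isdigit():
            end += 1
        length = int(symbol[pos:end])        # each component is length-prefixed
        name = symbol[end:end + length]
        if len(name) != length:
            return symbol
        parts.append(name)
        pos = end + length
    if not parts or symbol[pos:] != "Ev":    # 'E' closes the name, 'v' = no args
        return symbol
    return "::".join(parts) + "()" + (" const" if is_const else "")

# Matches frames such as {:click:_ZN10PollDevice8run_taskEv+216}
FRAME = re.compile(r"\{:(\w+):(\w+)\+(\d+)\}")

def parse_frame(text):
    """Return (module, readable symbol, offset string) or None if no frame found."""
    m = FRAME.search(text)
    if m is None:
        return None
    return m.group(1), demangle_nested(m.group(2)), m.group(3)

if __name__ == "__main__":
    print(parse_frame("<ffffffff880b2cd8>{:click:_ZN10PollDevice8run_taskEv+216}"))
```

Symbols outside the handled form, such as _Z11click_schedPv, are passed through unchanged rather than guessed at.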

On 7/25/06, Jason Park <jason at geninetworks.com> wrote:
> Hi Beyers,
>
> I agree with you; it cannot be the definitive solution.
> But in my tests, performance was not bad compared to skb_clone().
>
> The main reason I changed skb_clone() to skb_copy() was a memory leak.
> Please refer to this thread:
> https://amsterdam.lcs.mit.edu/pipermail/click/2006-June/005021.html
>
> With skb_copy() the memory leak was resolved too.
> I think there is a problem in how Click destroys cloned packets.
>
> By the way, does the patch fix the TX timeout for you?
>
>   _____
>
> From: Beyers Cronje [mailto:bcronje at gmail.com]
> Sent: Tuesday, July 25, 2006 8:24 AM
> To: Jason Park
> Cc: click at pdos.csail.mit.edu; Volter Yen
> Subject: Re: [Click] e1000 driver timeout with 2.6.x
>
>
>
> Hi Jason,
>
> Doesn't skb_copy() defeat the purpose of Packet::clone()?
>
> Beyers
>
> On 7/21/06, Jason Park <jason at geninetworks.com> wrote:
>
> I patched lib/packet.cc and it seems OK.
> At least for me, the TX timeout from the e1000 driver no longer appears.
> Can anyone test with this patch?
> Please let me know the result.
>
> I'm running click 1.5 (CVS) on linux 2.6.22 SMP with Intel 82571EB (rev 06)
> with e1000-7.1.9 driver from http://e1000.sf.net
>
> Thanks in advance.
>
> -----Original Message-----
> From: click-bounces at pdos.csail.mit.edu
> [mailto:click-bounces at pdos.csail.mit.edu] On Behalf Of Adam Greenhalgh
> Sent: Friday, July 21, 2006 1:47 AM
> To: todd lewis
> Cc: Srivas Chennu; click at pdos.csail.mit.edu
> Subject: Re: [Click] e1000 driver timeout with 2.6.x
>
> I can reproduce this with the following setup:
>
> Comp9 <-> Comp11 <-> Comp12
>
> comp9 : netperf tcp source
> comp11 : click router with max's driver (config attached)
> comp12 : netserver
>
> error
>
> e1000: eth3: e1000_clean_tx_irq: Detected Tx Unit Hang
>   TDH                  <3b>
>   TDT                  <3b>
>   next_to_use          <3b>
>   next_to_clean        <3a>
> buffer_info[next_to_clean]
>   dma                  <3db93040>
>   time_stamp           <0>
>   next_to_watch        <0>
>   jiffies              <1001b9088>
>   next_to_watch.status <0>
> Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
> <ffffffff880b2cd8>{:click:_ZN10PollDevice8run_taskEv+216}
> PGD 3e851067 PUD 3dc29067 PMD 0
> Oops: 0000 [1]
> CPU 0
> Modules linked in: click e1000 proclikefs
> Pid: 8416, comm: kclick Not tainted 2.6.16.13 #4
> RIP: 0010:[<ffffffff880b2cd8>]
> <ffffffff880b2cd8>{:click:_ZN10PollDevice8run_taskEv+216}
> RSP: 0018:ffff81003c2e9e68  EFLAGS: 00010297
> RAX: 000000000000003f RBX: 0000000000000000 RCX: 0000000000000001
> RDX: 0000000000000001 RSI: ffff81003d83a0c0 RDI: ffff81003d83a0c0
> RBP: ffff81003df0b580 R08: ffffffff8836b418 R09: ffffffff8836b418
> R10: 0000000000000000 R11: 0000000000000610 R12: ffff81003e864800
> R13: 0000000000000000 R14: 0000000000000000 R15: ffff81003c2e9e78
> FS:  000000000073cae0(0000) GS:ffffffff8057a000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000000 CR3: 000000003e200000 CR4: 00000000000006e0
> Process kclick (pid: 8416, threadinfo ffff81003c2e8000, task ffff81003f868890)
> Stack: ffff81003d83a0c0 00000001000000fc ffff81003e864000 0000000000000246
>        0000000000000246 0000000000000031 ffff81003df0b580 ffff81003f868890
>        000000000000aba2 ffff81003df0bca8
> Call Trace: <ffffffff88042b41>{:click:_ZN12RouterThread6driverEv+625}
>        <ffffffff88029b3c>{:click:_Znam+28}
>        <ffffffff88107564>{:click:_Z11click_schedPv+148}
>        <ffffffff8010ae82>{child_rip+8}
>        <ffffffff881138f0>{:click:_ZNK16BaseErrorHandler7nerrorsEv+0}
>        <ffffffff881074d0>{:click:_Z11click_schedPv+0}
>        <ffffffff8010ae7a>{child_rip+0}
>
> Code: 4d 8b 36 48 8b 83 b8 00 00 00 83 83 88 00 00 00 0e 48 c7 03
> RIP <ffffffff880b2cd8>{:click:_ZN10PollDevice8run_taskEv+216} RSP <ffff81003c2e9e68>
> CR2: 0000000000000000
>
> kernel
>
> Linux computer11 2.6.16.13 #4 Thu Jun 8 17:06:31 BST 2006 x86_64 AMD
> Opteron(tm) Processor 250 AuthenticAMD GNU/Linux
>
> I'll keep looking into this.
>
> Adam
>
> On 7/19/06, todd lewis <tgl2 at yahoo.com> wrote:
> > Recompiling the kernel, modules, driver and click for UP instead of SMP
> > did allow it to work, but instead of my normal 700 Mbps (which I can
> > sustain even with netfilter queueing to userspace), I got 190 kbps and
> > these errors:
> >
> > ****************************
> > [17195356.700000] e1000: eth2: e1000_watchdog_1: NIC Link is Up 1000 Mbps Full Duplex
> > [17195379.828000] e1000: eth3: e1000_clean_tx_irq: Detected Tx Unit Hang
> > [17195379.828000]   TDH                  <92>
> > [17195379.828000]   TDT                  <92>
> > [17195379.828000]   next_to_use          <92>
> > [17195379.828000]   next_to_clean        <91>
> > [17195379.828000] buffer_info[next_to_clean]
> > [17195379.828000]   dma                  <7c3c7840>
> > [17195379.828000]   time_stamp           <0>
> > [17195379.828000]   next_to_watch        <0>
> > [17195379.828000]   jiffies              <3b2b1d>
> > [17195379.828000]   next_to_watch.status <0>
> > [17195430.836000] e1000: eth3: e1000_clean_tx_irq: Detected Tx Unit Hang
> > [17195430.836000]   TDH                  <14>
> > [17195430.836000]   TDT                  <14>
> > [17195430.836000]   next_to_use          <14>
> > [17195430.836000]   next_to_clean        <d6>
> > [17195430.836000] buffer_info[next_to_clean]
> > [17195430.836000]   dma                  <7c230c40>
> > [17195430.836000]   time_stamp           <0>
> > [17195430.836000]   next_to_watch        <0>
> > [17195430.836000]   jiffies              <3b5ced>
> > [17195430.836000]   next_to_watch.status <0>
> > (..., lots of these)
> > ****************************
> >
> > I have a dual-port pcie e1000 card.  I plan to try that, and then to try
> > one port from each card.
> > If anyone has any other experiments they'd like run with my setup, then
> > please let me know.
> >
> > --- Adam Greenhalgh <a.greenhalgh at cs.ucl.ac.uk> wrote:
> >
> > > Max,
> > >
> > > Are you running an SMP kernel on the polling boxes, and which Intel
> > > card are you using? I've seen numerous SMP-related reports on the
> > > e1000 / netdev lists and just noticed that Todd is using an SMP
> > > kernel.
> > >
> > > Adam
> > >
> > > On 7/6/06, Massimiliano Poletto <maxp at mazunetworks.com> wrote:
> > > > Hi Srivas and Beyers, I've spent some time looking at drivers again
> > > > recently.
> > > >
> > > > What works best for me at present is a patched version of the 6.1.16.2
> > > > Intel driver (not the 6.3.9-k4 driver that comes with linux
> > > > 2.6.16.13).  I attach the driver sources and patch to this email.  I'm
> > > > using a 2.6.16.22 kernel, but I don't see why .13 should work any less
> > > > well with the driver.
> > > >
> > > > Performance is good, and it is stable across hundreds of
> > > > installs/uninstalls and many hours of testing at full line rate
> > > > offered load.  I sometimes see messages similar to yours (below is an
> > > > example), but they only seem to happen during stress tests when click
> > > > is repeatedly installed/uninstalled at very short intervals:
> > > > e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
> > > >   TDH                  <b0>
> > > >   TDT                  <b0>
> > > >   next_to_use          <b0>
> > > >   next_to_clean        <9e>
> > > > buffer_info[next_to_clean]
> > > >   dma                  <2c66c040>
> > > >   time_stamp           <0>
> > > >   next_to_watch        <0>
> > > >   jiffies              <82ddae8>
> > > >   next_to_watch.status <0>
> > > >
> > > > I'm trying to get a sourceforge 7.x driver to work, but for now this
> > > > seems at least workable.
> > > >
> > > > Please let me know if you have problems with this driver, or if you
> > > > make other progress yourselves.
> > > >
> > > > Regards,
> > > > max
> > > >
> > > >
> > > > On 7/3/06, Srivas Chennu <chennu at hhi.fhg.de> wrote:
> > > > > Hello Beyers,
> > > > >
> > > > > Thanks a lot for your speedy response. To answer your question
> > > > > regarding the e1000 driver, I've downloaded and tested my
> > > > > configuration with the latest stable release (7.1.9) from
> > > > > sourceforge, and the timeout stubbornly continues to occur with
> > > > > the TSO option disabled.
> > > > >
> > > > > For your reference, the possibly relevant snippet of my click
> > > > > configuration is attached. It uses a click element (onuagent) that
> > > > > I've written to emulate the protocol being tested, which receives
> > > > > and forwards packets between 3 interfaces via customized priority
> > > > > schedulers.
> > > > >
> > > > > ...
> > > > > FromDevice($rp0, PROMISC true) -> [0]onuagent;
> > > > > onuagent[0] -> priosched0 -> ToDevice($rp0);
> > > > > FromDevice($rp1, PROMISC true) -> [1]onuagent;
> > > > > onuagent[1] -> priosched1 -> ToDevice($rp1);
> > > > > FromDevice($lp, PROMISC true) -> [2]onuagent;
> > > > > onuagent[2] -> priosched2 -> ToDevice($lp);
> > > > > ...
> > > > >
> > > > > I'm currently attempting to find a combination of a kernel (2.4.x or
> > > > > 2.6.x) and a stable e1000 driver version with which I can reliably
> > > > > use FromDevice/PollDevice. Any details of a setup that has worked
> > > > > for you in this regard would be helpful.
> > > > >
> > > > > Thanks in advance,
> > > > > Srivas.
> > > > >
> > > > > On Jul 03, 2006 05:18 PM, Beyers Cronje wrote:
> > > > >
> > > > >
> > > > > >
> > > > > > Hi Srivas,
> > > > > >
> > > > > > This is a problem that Adam, a few others and I have been
> > > > > > struggling with. It is strange that FromDevice gives you the TX
> > > > > > hang, as on my system it only happens when using PollDevice in
> > > > > > certain configurations. If possible, can you post the Click config
> > > > > > you are using to reproduce the hang?
> > > > > >
> > > > > > Adam pointed me to the E1000 dev mailing list on SourceForge, and
> > > > > > the TX hang issue seems to pop up on standard Linux (non-Click)
> > > > > > systems as well. One possible workaround seems to be to disable TCP
> > > > > > segmentation offloading (TSO). You can do this via
> > > > > > 'ethtool -K eth0 tso off', but it seems to work only sometimes ...
> > > > > >
> > > > > > What e1000 driver version are you using? Since you are only using
> > > > > > FromDevice, have you tried the latest e1000 driver?
> > > > > >
> > > > > > Anyone else also having this problem?
> > > > > >
> > > > > > Beyers
> > > > > >
> > > > > >
> > > > > > On 7/3/06, Srivas Chennu <chennu at hhi.fhg.de> wrote:
> > > > > > > Hello all,
> > > > > > >
> > > > > > > I'm a relatively new click user trying to build and test a link
> > > > > > > layer protocol using Click. My test runs used the click kernel
> > > > > > > module built from the latest CVS sources. On a patched 2.6.16.13
> > > > > > > kernel with an original Intel PRO/1000 MT dual port GbE NIC, for
> > > > > > > a click configuration using FromDevice, the driver abruptly times
> > > > > > > out during Tx and resets with messages like those below:
> > > > > > >
> > > > > > > e1000: eth1: e1000_clean_tx_irq: Detected Tx Unit Hang
> > > > > > > Tx Queue <0>
> > > > > > > TDH <97>
> > > > > > > TDT <9a>
> > > > > > > next_to_use <9a>
> > > > > > > next_to_clean <95>
> > > > > > > buffer_info[next_to_clean]
> > > > > > > time_stamp
> > > > > > > next_to_watch <97>
> > > > > > > jiffies
> > > > > > > next_to_watch.status <0>
> > > > > > > ....
> > > > > > > ....
> > > > > > > Eventually I see in the log file:
> > > > > > >
> > > > > > > NETDEV WATCHDOG: eth1: transmit timed out
> > > > > > > e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
> > > > > > >
> > > > > > > Interestingly, this timeout-and-reset problem does not occur when
> > > > > > > running my click configuration at the userlevel, but reproduces
> > > > > > > quite easily with the kernel module, even when the NIC is working
> > > > > > > at low packet Rx rates. All configuration parameters to the e1000
> > > > > > > modules are at their defaults, and my attempts with parameters
> > > > > > > suggested in a previous post
> > > > > > > (https://amsterdam.lcs.mit.edu/pipermail/click/2006-March/004690.html)
> > > > > > > for similar problems didn't help.
> > > > > > >
> > > > > > >
> > > > > > > Any pointers to solving this problem are appreciated,
> > > > > > >
> > > > > > > Thanks in advance,
> > > > > > > Srivas.
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Meet us at
> > > > > > >
> > > > > > > OC&I 2006 and NOC 2006: 10.-13.7.06 at HHI Berlin, Germany
> > > > > > > IFA: 1.-6.9.06, Berlin, Germany
> > > > > > > _______________________________________________
> > > > > > > click mailing list
> > > > > > > click at amsterdam.lcs.mit.edu
> > > > > > > https://amsterdam.lcs.mit.edu/mailman/listinfo/click
> > > > > > >
> > > > >
> > > >
>
>
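
A note on reading the "Detected Tx Unit Hang" dumps quoted in this thread: TDH and TDT are the hardware head and tail of the transmit ring, while next_to_use and next_to_clean are the driver's own indices, all printed in hex. The Python sketch below, which is not from the original posters, pulls those four fields out of a pasted dump and estimates how many descriptors the driver still has to clean. The function names and the ring size of 256 are assumptions (256 is believed to be the driver's default TxDescriptors value; adjust it if the module was loaded with a different setting).

```python
import re

# Matches e.g. "  TDH                  <3b>", with or without quote
# markers or printk timestamps in front of it.
FIELD = re.compile(r"\b(TDH|TDT|next_to_use|next_to_clean)\s+<([0-9a-fA-F]+)>")

def parse_hang(text):
    """Return the four ring indices from one hang dump as integers."""
    fields = {}
    for line in text.splitlines():
        m = FIELD.search(line)
        if m:
            fields[m.group(1)] = int(m.group(2), 16)
    return fields

def ring_backlog(fields, ring_size=256):
    """Descriptors queued by the driver but not yet cleaned.

    ring_size=256 is an assumption, not a value taken from the thread.
    """
    return (fields["next_to_use"] - fields["next_to_clean"]) % ring_size

def hardware_caught_up(fields):
    """True when head == tail, i.e. the NIC has nothing left to fetch."""
    return fields["TDH"] == fields["TDT"]

if __name__ == "__main__":
    dump = """
    > e1000: eth3: e1000_clean_tx_irq: Detected Tx Unit Hang
    >   TDH                  <3b>
    >   TDT                  <3b>
    >   next_to_use          <3b>
    >   next_to_clean        <3a>
    > buffer_info[next_to_clean]
    >   next_to_watch.status <0>
    """
    f = parse_hang(dump)
    print(f, ring_backlog(f), hardware_caught_up(f))
```

Under that ring-size assumption, the dump from Adam's reproduction shows the hardware caught up (TDH equals TDT) with one descriptor left uncleaned, whereas the first dump in the thread (TDH 97, TDT 9a) shows the hardware still behind the tail.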

