[Click] e1000 driver timeout with 2.6.x

Mon Sep 18 21:08:18 EDT 2006

Hi Max, Adam,

I'm in the process of importing Max's 6.x e1000 driver into the Click 
repository.  I agree with Adam, we need got-- here.  But I am not a big fan of 
this interface between Click and the polling driver anyway: why must "got" and 
the actual list of skbs remain in sync?  Why not just use the list?
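
To make the bookkeeping issue concrete, here is a minimal user-space sketch,
not the driver code itself. It assumes a poll loop shaped like the snippet
Adam quotes below; struct pkt, poll_rx, list_length, and free_list are
hypothetical names (struct pkt stands in for an skb), and the bad[] array
simulates descriptors that fail the EOP/error check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an skb: just a singly linked node. */
struct pkt {
    struct pkt *next;
};

/*
 * Simulated receive poll over n descriptors, returning at most `want`
 * packets. bad[i] != 0 marks descriptor i as a frame that must be dropped.
 * The loop counter is stored in *got_out. With fix == 0 dropped frames are
 * still counted (the suspected bug); with fix != 0 the counter is
 * decremented on drop, as in the proposed one-line patch.
 */
static struct pkt *poll_rx(const int *bad, int n, int want, int fix,
                           int *got_out)
{
    struct pkt *head = NULL;
    struct pkt **tail = &head;
    int got, i;

    for (got = 0, i = 0; got < want && i < n; got++, i++) {
        *tail = malloc(sizeof(**tail));
        assert(*tail != NULL);
        (*tail)->next = NULL;

        if (bad[i]) {
            free(*tail);      /* drop the frame ... */
            *tail = NULL;     /* ... and unlink it from the list */
            if (fix)
                got--;        /* cancels the got++ that `continue` triggers */
            continue;
        }
        tail = &(*tail)->next;
    }
    *got_out = got;
    return head;
}

/* Alternative: derive the count from the list, so nothing can drift. */
static int list_length(const struct pkt *p)
{
    int len = 0;
    for (; p != NULL; p = p->next)
        len++;
    return len;
}

/* Release every node still on the list. */
static void free_list(struct pkt *p)
{
    while (p != NULL) {
        struct pkt *next = p->next;
        free(p);
        p = next;
    }
}
```

With five descriptors of which two are bad, the unpatched loop reports 5
while the list holds 3 packets; with the decrement both agree on 3. A count
taken from the list itself cannot disagree with it, which is the argument
for dropping the separate counter.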

Eddie

Adam Greenhalgh wrote:
> Max,
> 
> I think I might have found a bug in the driver: line 4639 in e1000_main.c
> needs got--, I think. Could you please validate this for me?
> 
>     if(!(rx_desc->status & E1000_RXD_STAT_EOP) ||
>        (rx_desc->errors & E1000_RXD_ERR_FRAME_ERR_MASK)) {
>       rx_desc->status = 0;
>       dev_kfree_skb(*skb);
>       *skb = NULL;
> +      got--;
>       continue;
>     }
> 
> I am also seeing the same tx timeouts when the load is high, but I am
> not swapping Click in and out. I have seen many messages on the e1000
> list about this, so I think there is a problem with the underlying
> driver. It might be good to apply the Click patches to the latest 7.x
> driver and see how that is shaping up. I'm happy to test and code if
> you want me to.
> 
> Adam
> 
> On 7/6/06, Massimiliano Poletto <maxp at mazunetworks.com> wrote:
>> Hi Srivas and Beyers, I've spent some time looking at drivers again recently.
>>
>> What works best for me at present is a patched version of the 6.1.16.2
>> Intel driver (not the 6.3.9-k4 driver that comes with linux
>> 2.6.16.13).  I attach the driver sources and patch to this email.  I'm
>> using a 2.6.16.22 kernel, but I don't see why .13 should work any less
>> well with the driver.
>>
>> Performance is good, and it is stable across hundreds of
>> installs/uninstalls and many hours of testing at full line rate
>> offered load.  I sometimes see messages similar to yours (below is an
>> example), but they only seem to happen during stress tests when click
>> is repeatedly installed/uninstalled at very short intervals:
>> e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
>>   TDH                  <b0>
>>   TDT                  <b0>
>>   next_to_use          <b0>
>>   next_to_clean        <9e>
>> buffer_info[next_to_clean]
>>   dma                  <2c66c040>
>>   time_stamp           <0>
>>   next_to_watch        <0>
>>   jiffies              <82ddae8>
>>   next_to_watch.status <0>
>>
>> I'm trying to get a sourceforge 7.x driver to work, but for now this
>> seems at least workable.
>>
>> Please let me know if you have problems with this driver, or if you
>> make other progress yourselves.
>>
>> Regards,
>> max
>>
>>
>> On 7/3/06, Srivas Chennu <chennu at hhi.fhg.de> wrote:
>>> Hello Beyers,
>>>
>>> Thanks a lot for your speedy response. To answer your question regarding
>>> the e1000 driver, I've downloaded and tested my configuration with the
>>> latest stable release (7.1.9) from sourceforge, and the timeout
>>> stubbornly continues to occur with the TSO option disabled.
>>>
>>> For your reference, the possibly relevant snippet of my click
>>> configuration is attached. It uses a click element (onuagent) that I've
>>> written to emulate the protocol being tested, which receives and
>>> forwards packets between 3 interfaces via customized priority
>>> schedulers.
>>>
>>> ...
>>> FromDevice($rp0, PROMISC true) -> [0]onuagent;
>>> onuagent[0] -> priosched0 -> ToDevice($rp0);
>>> FromDevice($rp1, PROMISC true) -> [1]onuagent;
>>> onuagent[1] -> priosched1 -> ToDevice($rp1);
>>> FromDevice($lp, PROMISC true) -> [2]onuagent;
>>> onuagent[2] -> priosched2 -> ToDevice($lp);
>>> ...
>>>
>>> I'm currently attempting to find a combination of a kernel (2.4.x or
>>> 2.6.x) and a stable e1000 driver version with which I can reliably use
>>> FromDevice/PollDevice. Any details of a setup that has worked for you in
>>> this regard would be helpful.
>>>
>>> Thanks in advance,
>>> Srivas.
>>>
>>> On Jul 03, 2006 05:18 PM, Beyers Cronje wrote:
>>>
>>>
>>>> Hi Srivas,
>>>>
>>>> This is a problem that Adam, a few others, and I have been struggling
>>>> with. It is strange that FromDevice gives you the TX hang, as on my
>>>> system it only happens when using PollDevice in certain configurations.
>>>> If possible, can you post the Click config you are using to reproduce
>>>> the hang?
>>>>
>>>> Adam pointed me to the E1000 dev mailing list on SourceForge, and the
>>>> TX hang issue seems to pop up on standard Linux (non-Click) systems as
>>>> well. One possible workaround is to disable TCP segmentation
>>>> offloading (TSO) via 'ethtool -K eth0 tso off', but that
>>>> seems to work only sometimes ...
>>>>
>>>> What e1000 driver version are you using? Since you are only using
>>>> FromDevice, have you tried the latest e1000 driver?
>>>>
>>>> Anyone else also having this problem?
>>>>
>>>> Beyers
>>>>
>>>>
>>>> On 7/3/06, Srivas Chennu <chennu at hhi.fhg.de> wrote:
>>>>> Hello all,
>>>>>
>>>>> I'm a relatively new Click user trying to build and test a link layer
>>>>> protocol using Click. My test runs used the Click kernel module built
>>>>> from the latest CVS sources. On a patched 2.6.16.13 kernel with an
>>>>> original Intel PRO/1000 MT dual port GbE NIC, running a Click
>>>>> configuration that uses FromDevice, the driver abruptly times out
>>>>> during Tx and resets with messages like those below:
>>>>>
>>>>> e1000: eth1: e1000_clean_tx_irq: Detected Tx Unit Hang
>>>>> Tx Queue <0>
>>>>> TDH <97>
>>>>> TDT <9a>
>>>>> next_to_use <9a>
>>>>> next_to_clean <95>
>>>>> buffer_info[next_to_clean]
>>>>> time_stamp
>>>>> next_to_watch <97>
>>>>> jiffies
>>>>> next_to_watch.status <0>
>>>>> ....
>>>>> ....
>>>>> Eventually I see in the log file:
>>>>>
>>>>> NETDEV WATCHDOG: eth1: transmit timed out
>>>>> e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
>>>>>
>>>>> Interestingly, this timeout-and-reset problem does not occur when
>>>>> running my Click configuration at user level, but reproduces quite
>>>>> easily with the kernel module, even when the NIC is working at low
>>>>> packet Rx rates. All configuration parameters to the e1000 module are
>>>>> at their defaults, and my attempts with parameters suggested in a
>>>>> previous post
>>>>> (https://amsterdam.lcs.mit.edu/pipermail/click/2006-March/004690.html)
>>>>> for similar problems didn't help.
>>>>>
>>>>>
>>>>> Any pointers to solving this problem are appreciated,
>>>>>
>>>>> Thanks in advance,
>>>>> Srivas.
>>>>>
>>>>>
>>>>> --
>>>>> Meet us at
>>>>>
>>>>> OC&I 2006 and NOC 2006: 10.-13.7.06 at HHI Berlin, Germany
>>>>> IFA: 1.-6.9.06, Berlin, Germany
>>>>> _______________________________________________
>>>>> click mailing list
>>>>> click at amsterdam.lcs.mit.edu
>>>>> https://amsterdam.lcs.mit.edu/mailman/listinfo/click
>>>>>