[Click] e1000 polling tx_ring lockup patch

springbo at cs.wisc.edu springbo at cs.wisc.edu
Thu Jan 10 18:43:34 EST 2008


Hi,

I think I've discovered a bug in e1000_poll_on() in Click's Intel
e1000-7.x driver. The bug only shows up when polling is enabled in the
linuxmodule. It should be present on uniprocessor machines, but is more
likely to trigger on multiprocessor machines.

Symptom:
About out of every 20 installs of the linuxmodule I would not see any
packets transmitted and would see a message similar to the following in
the log:
e1000: eth3: e1000_clean_tx_irq: Detected Tx Unit Hang
   Tx Queue             <0>
   TDH                  <0>
   TDT                  <2a>
   next_to_use          <2a>
   next_to_clean        <0>
 buffer_info[next_to_clean]
   time_stamp           <104941e6f>
   next_to_watch        <0>
   jiffies              <104960aec>
   next_to_watch.status <0>
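
For readers unfamiliar with the dump: TDH is the transmit descriptor head,
which the hardware advances, and TDT is the tail, which the driver advances.
A minimal sketch of the arithmetic follows. The helper names and the ring
size of 256 used in the usage note are assumptions for illustration; the
ring size is not shown in the log.

```c
#include <stdint.h>

/* Hypothetical helper: descriptors queued by the driver (tail) that the
 * hardware (head) has not yet consumed, allowing for ring wrap-around. */
uint32_t pending_tx_descriptors(uint32_t tdh, uint32_t tdt, uint32_t ring_size)
{
    return (tdt + ring_size - tdh) % ring_size;
}

/* Hypothetical helper: how long the oldest buffer has been waiting,
 * in jiffies (the wall-clock value depends on the kernel's HZ). */
uint64_t jiffies_since_stamp(uint64_t jiffies, uint64_t time_stamp)
{
    return jiffies - time_stamp;
}
```

With the values above, pending_tx_descriptors(0x0, 0x2a, 256) gives 42 and
jiffies_since_stamp(0x104960aec, 0x104941e6f) gives 126077: the driver has
queued 42 descriptors and the hardware has not consumed any of them, which
is consistent with transmission never having been enabled.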


Cause:
The problem is that transmitting on the adapter is never enabled if the
link state comes back up before the watchdog first runs.


Explanation:
The bug is seldom seen because a sequence of recovery steps in the
watchdog usually enables transmission.

In the auto-recovery case, the link state (E1000_READ_REG(&adapter->hw,
STATUS) & E1000_STATUS_LU) remains down through the first run of the
watchdog (e1000_watchdog_1()). The watchdog detects an inconsistent state
in which the link is down but netif_carrier is ok. It then takes the
correct steps to resolve the inconsistency, including enabling
transmission.

In the case where the tx_ring locks up, the link state comes back up before
the watchdog runs. The link is then up and netif_carrier is ok => nothing
is inconsistent => no recovery code is run => transmission is not enabled.
As a result, no packets are transmitted and the tx ring fills up. Apply
replication.patch to replicate the problem.
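
The two orderings can be sketched with a toy model. This is not the driver's
code: the struct, the function names, and the assumption that turning
polling on leaves the link down, netif_carrier ok and tx disabled are all
made up to mirror the description above, and the recovery path is collapsed
into a single step.

```c
#include <stdbool.h>

/* Toy model only; every name here is hypothetical. */
struct nic_model {
    bool link_up;     /* stands in for the STATUS link-up bit  */
    bool carrier_ok;  /* stands in for netif_carrier being ok  */
    bool tx_enabled;  /* stands in for the hardware tx enable  */
};

/* Assumed state right after polling is turned on (unpatched driver). */
void model_poll_on(struct nic_model *n)
{
    n->link_up = false;
    n->carrier_ok = true;
    n->tx_enabled = false;
}

/* One watchdog run: recovery happens only when the state is inconsistent. */
void model_watchdog(struct nic_model *n)
{
    if (!n->link_up && n->carrier_ok) {
        /* Inconsistent state: recovery runs and, as a side effect,
         * transmission ends up enabled (collapsed into one step). */
        n->link_up = true;
        n->tx_enabled = true;
    }
    /* Link up and carrier ok: nothing to fix, so nothing is done. */
}

/* Returns whether tx is enabled after the first watchdog run. */
bool tx_enabled_after_first_watchdog(bool link_up_before_watchdog)
{
    struct nic_model n;
    model_poll_on(&n);
    if (link_up_before_watchdog)
        n.link_up = true;   /* link recovered before the watchdog ran */
    model_watchdog(&n);
    return n.tx_enabled;
}
```

In this model tx_enabled_after_first_watchdog(false) is true (the
auto-recovery case) and tx_enabled_after_first_watchdog(true) is false (the
lockup case).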


Solution:
The solution is to enable transmission when polling is turned on. This
behaves correctly in both of the cases described above. If the link status
is still 'down' on the first watchdog run, the recovery code runs and the
register bit that enables tx is simply set a second time. If the link
status is already 'up' on the first watchdog run, the recovery code is
not run, and it is no longer needed because tx is already enabled.
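
Setting the enable bit twice is harmless because it is a plain OR into a
control register. The sketch below uses a mock register to show this; it is
not the contents of polling.patch, and the bit value, struct and function
names are hypothetical stand-ins.

```c
#include <stdint.h>

/* Hypothetical bit position; stands in for the adapter's tx-enable bit. */
#define MOCK_TX_ENABLE_BIT 0x2u

/* Mock of the adapter's transmit control register. */
struct mock_hw {
    uint32_t tx_ctrl;
};

/* Read-modify-write that sets only the enable bit; safe to repeat. */
void mock_enable_tx(struct mock_hw *hw)
{
    hw->tx_ctrl |= MOCK_TX_ENABLE_BIT;
}

/* Proposed behaviour: turning polling on also enables transmission. */
void mock_poll_on(struct mock_hw *hw)
{
    /* other poll-on setup is omitted from this sketch */
    mock_enable_tx(hw);
}

/* The watchdog's recovery path enables transmission as before. */
void mock_watchdog_recovery(struct mock_hw *hw)
{
    mock_enable_tx(hw);
}

int mock_tx_is_enabled(const struct mock_hw *hw)
{
    return (hw->tx_ctrl & MOCK_TX_ENABLE_BIT) != 0;
}
```

Calling mock_poll_on() and then mock_watchdog_recovery() leaves the register
in the same state as calling mock_poll_on() alone, and the other bits in
tx_ctrl are untouched.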


The polling.patch implements the outlined solution and also adds some
debugging output to help identify lockups.



~Kevin Springborn
-------------- next part --------------
A non-text attachment was scrubbed...
Name: polling.patch
Type: application/octet-stream
Size: 1462 bytes
Desc: not available
Url : https://pdos.csail.mit.edu/pipermail/click/attachments/20080110/83eb0078/attachment.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: replication.patch
Type: application/octet-stream
Size: 434 bytes
Desc: not available
Url : https://pdos.csail.mit.edu/pipermail/click/attachments/20080110/83eb0078/attachment-0001.obj 

