[OmniOS-discuss] The ixgbe driver, Lindsay Lohan, and the Greek economy

Chris Siebenmann cks at cs.toronto.edu
Sat Feb 21 03:25:59 UTC 2015


> After installation and configuration, I observed all kinds of bad behavior
> in the network traffic between the hosts and the server. All of this bad
> behavior is traced to the ixgbe driver on the storage server. Without going
> into the full troubleshooting process, here are my takeaways:
[...]

 For what it's worth, we managed to achieve much better line rates on
copper 10G ixgbe hardware of various descriptions between OmniOS
and CentOS 7 (I don't think we ever tested OmniOS to OmniOS). I don't
believe OmniOS could do TCP at full line rate but I think we managed 700+
Mbytes/sec on both transmit and receive and we got basically disk-limited
speeds with iSCSI (across multiple disks on multi-disk mirrored pools,
OmniOS iSCSI initiator, Linux iSCSI targets).
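
For context on how far 700+ Mbytes/sec is from line rate, here is a rough
back-of-the-envelope sketch of the TCP payload ceiling on 10G Ethernet.
It assumes standard Ethernet framing overhead, IPv4 and TCP headers with
no options, and decimal megabytes. The function name is purely
illustrative and the numbers are arithmetic, not measurements.

```python
# Rough ceiling on TCP payload throughput over 10G Ethernet.
# Assumptions (not from the original message): IPv4 and TCP headers with
# no options, standard Ethernet framing, and 1 Mbyte = 10**6 bytes.

LINK_BITS_PER_SEC = 10_000_000_000
ETH_OVERHEAD = 8 + 14 + 4 + 12   # preamble+SFD, header, FCS, inter-frame gap
IP_TCP_HEADERS = 20 + 20         # IPv4 header + TCP header, no options


def tcp_payload_ceiling(mtu):
    """Return the maximum TCP payload rate in bytes/sec for a given MTU."""
    payload = mtu - IP_TCP_HEADERS   # TCP payload bytes carried per frame
    on_wire = mtu + ETH_OVERHEAD     # bytes of link time used per frame
    return LINK_BITS_PER_SEC / 8 * payload / on_wire


if __name__ == "__main__":
    observed = 700e6  # the "700+ Mbytes/sec" figure, taken as a lower bound
    for mtu in (1500, 9000):
        ceiling = tcp_payload_ceiling(mtu)
        print(f"MTU {mtu}: ceiling {ceiling / 1e6:.0f} Mbytes/sec, "
              f"700 Mbytes/sec is {observed / ceiling:.0%} of it")
```

Under those assumptions, 700 Mbytes/sec is a bit under 60% of the ceiling
at either MTU, so the gap to full line rate is real but not enormous.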

 I don't believe we did any specific kernel tuning (and in fact some of
our attempts to fiddle with ixgbe driver parameters blew up in our faces).
We did tune iSCSI connection parameters to increase various buffer
sizes so that ZFS could do even large single operations in single iSCSI
transactions. (More details available if people are interested.)

> 10: At the wire level, the speed problems are clearly due to pauses in
> response time by omnios. At 9000 byte frame sizes, I see a good number
> of duplicate ACKs and fast retransmits during read operations (when
> omnios is transmitting). But below about a 4100-byte MTU on omnios
> (which seems to correlate to 4096-byte iSCSI block transfers), the
> transmission errors fade away and we only see the transmission pause
> problem.

 This is what really attracted my attention. In our OmniOS setup, our
specific Intel hardware had ixgbe driver issues that could cause
activity stalls during once-a-second link heartbeat checks. This
obviously had an effect at the TCP and iSCSI layers. My initial message
to illumos-developer sparked a potentially interesting discussion:

http://www.listbox.com/member/archive/182179/2014/10/sort/time_rev/page/16/entry/6:405/20141003125035:6357079A-4B1D-11E4-A39C-D534381BA44D/

If you think this is a possibility in your setup, I've put the DTrace
script I used to hunt for this up on the web:

	http://www.cs.toronto.edu/~cks/src/omnios-ixgbe/ixgbe_delay.d

This isn't the only potential source of driver stalls by any means; it's
just the one I found. You may also want to look at lockstat in general,
since the information it reported is what led us to look specifically at
the ixgbe code here.

(If you suspect kernel/driver issues, lockstat combined with kernel
source is a really excellent resource.)

	- cks

