[OmniOS-discuss] [developer] Re: The ixgbe driver, Lindsay Lohan, and the Greek economy
Mark
mark0x01 at gmail.com
Mon Mar 2 08:12:12 UTC 2015
LACP does work (I have used it on HP ProCurve), but the settings are fussy and
usually different from what EtherChannel uses.
(http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1004048)
Did you try changing the virtual switch settings?
On 2/03/2015 6:11 p.m., Garrett D'Amore wrote:
> I’m not sure I’ve followed properly. You have *two* interfaces. You
> are not trying to provision these in an aggr are you? As far as I’m
> aware, VMware does not support 802.3ad link aggregations. (It’s possible
> that you can make it work with ESXi if you give the entire NIC to the
> guest, but I’m skeptical.) The problem is that if you try to use link
> aggregation, some packets (up to half!) will be lost. TCP and other
> protocols fare poorly in this situation.
>
> It’s possible I’ve totally misunderstood what you’re trying to do, in
> which case I apologize.
>
> The idle thing is a red herring: the CPU is waiting for work to do,
> probably because packets haven’t arrived (or were dropped by the
> hypervisor!). I wouldn’t read too much into that except that your
> network stack is in trouble. I’d look a bit more closely at the kstats
> for tcp; I suspect you’ll see retransmits or out-of-order values that
> are unusually high. If so, this may help validate my theory above.
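A minimal Python sketch of that check, assuming the text printed by `kstat -p -m tcp` (a `module:instance:name:statistic` field and a value on each line). The counter names `outDataSegs`, `retransSegs`, `inDataInorderSegs` and `inDataUnorderSegs` are assumptions based on the illumos TCP MIB-II kstats; confirm them against your own output. The counters are cumulative since boot, so the sketch compares a snapshot taken before a test with one taken after it, and the ratios are rough indicators rather than exact loss rates.

```python
import subprocess


def snapshot():
    """Capture `kstat -p -m tcp` output (illumos only; not called below)."""
    return subprocess.run(["kstat", "-p", "-m", "tcp"],
                          capture_output=True, text=True, check=True).stdout


def parse_kstat_p(text):
    """Parse `kstat -p` lines like 'tcp:0:tcp:retransSegs<TAB>1234'
    into {statistic_name: int}, skipping non-integer values."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        name = parts[0].rsplit(":", 1)[-1]  # keep only the statistic name
        try:
            stats[name] = int(parts[1])
        except ValueError:
            continue  # e.g. crtime/snaptime may be floating-point
    return stats


def delta(before, after):
    """Per-counter change between two snapshots."""
    return {k: v - before.get(k, 0) for k, v in after.items()}


def tcp_health(d):
    """Approximate retransmit and out-of-order ratios from a counter delta.
    ASSUMPTION: illumos TCP kstat names; adjust if yours differ."""
    sent = d.get("outDataSegs", 0)
    retrans = d.get("retransSegs", 0)
    in_order = d.get("inDataInorderSegs", 0)
    unordered = d.get("inDataUnorderSegs", 0)
    received = in_order + unordered
    return {
        "retransmit_ratio": retrans / sent if sent else 0.0,
        "out_of_order_ratio": unordered / received if received else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical numbers standing in for two real snapshots.
    before = parse_kstat_p("tcp:0:tcp:outDataSegs\t1000\n"
                           "tcp:0:tcp:retransSegs\t10\n")
    after = parse_kstat_p("tcp:0:tcp:outDataSegs\t3000\n"
                          "tcp:0:tcp:retransSegs\t110\n")
    print(tcp_health(delta(before, after)))
```

On the server, `snapshot()` before and after a read test, fed through `parse_kstat_p`, gives the real delta; a noticeably higher retransmit ratio during slow reads than during fast writes would support the theory above.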
>
> - Garrett
>
>> On Mar 1, 2015, at 9:03 PM, W Verb via illumos-developer
>> <developer at lists.illumos.org <mailto:developer at lists.illumos.org>> wrote:
>>
>> Hello all,
>>
>>
>> Well, I no longer blame the ixgbe driver for the problems I'm seeing.
>>
>>
>> I tried Joerg's updated driver, which didn't improve the issue. So I
>> went back to the drawing board and rebuilt the server from scratch.
>>
>> What I noted is that if I have only a single 1-gig physical interface
>> active on the ESXi host, everything works as expected. As soon as I
>> enable two interfaces, I start seeing the performance problems I've
>> described.
>>
>> Response pauses from the server that I see in tcpdump captures are
>> still leading me to believe the problem is delay on the server side,
>> so I ran a series of kernel DTraces and produced some flamegraphs.
>>
>>
>> This was taken during a read operation with two active 10G interfaces
>> on the server, with a single target being shared by two TPGs, one TPG
>> for each 10G physical port. The host device has two 1G ports enabled,
>> with VLANs separating the active ports into 10G/1G pairs. ESXi is set
>> to multipath using both VLANs with a round-robin IO interval of 1.
>>
>> https://drive.google.com/file/d/0BwyUMjibonYQd3ZYOGh4d2pteGs/view?usp=sharing
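In a simplified model of ESXi's Round Robin path selection with an IO interval (IOPS limit) of 1, the host moves to the next active path after every single I/O, so consecutive I/Os alternate between the two VLAN paths. The Python sketch below shows only that rotation; it ignores path states, failover, and everything else VMware's plugin does, and the path names are hypothetical.

```python
from itertools import islice


def round_robin(paths, ios_per_path=1):
    """Yield the path used for each successive I/O, switching to the next
    path after `ios_per_path` I/Os (a simplified IOPS-based round robin)."""
    if not paths:
        raise ValueError("need at least one path")
    if ios_per_path < 1:
        raise ValueError("ios_per_path must be >= 1")
    while True:
        for p in paths:
            for _ in range(ios_per_path):
                yield p


if __name__ == "__main__":
    paths = ["vlanA-10G/1G", "vlanB-10G/1G"]  # hypothetical path names
    print(list(islice(round_robin(paths, 1), 6)))
```

Raising `ios_per_path` produces longer runs on each path before switching, which is the knob the ESXi IO interval setting controls.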
>>
>>
>> This was taken during a write operation:
>>
>> https://drive.google.com/file/d/0BwyUMjibonYQMnBtU1Q2SXM2ams/view?usp=sharing
>>
>>
>> I then rebooted the server and disabled C-states, ACPI T-states, and
>> general EIST (SpeedStep/Turbo Boost) functionality in the CPU.
>>
>> When I attempted to boot my guest VM, the iSCSI transfer gradually
>> ground to a halt during the boot loading process, and the guest OS
>> never completed its boot process.
>>
>> Here is a flamegraph taken while iSCSI is slowly dying:
>>
>> https://drive.google.com/file/d/0BwyUMjibonYQM21JeFZPX3dZWTg/view?usp=sharing
>>
>>
>> I edited out cpu_idle_adaptive from the dtrace output and regenerated
>> the slowdown graph:
>>
>> https://drive.google.com/file/d/0BwyUMjibonYQbTVwV3NvXzlPS1E/view?usp=sharing
>>
>>
>> I then edited cpu_idle_adaptive out of the speedy write operation and
>> regenerated that graph:
>>
>> https://drive.google.com/file/d/0BwyUMjibonYQeWFYM0pCMDZ1X2s/view?usp=sharing
>>
>>
>> I have zero experience with interpreting flamegraphs, but the most
>> significant difference I see between the slow read example and the
>> fast write example is in unix`thread_start --> unix`idle. There's a
>> good chunk of "unix`i86_mwait" in the read example that is not present
>> in the write example at all.
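Both steps, dropping idle stacks and measuring how much of the profile sits under `unix`i86_mwait`, can be done on the collapsed ("folded") stack file that the FlameGraph scripts (`stackcollapse.pl`) produce before `flamegraph.pl` renders it: one line per unique stack, frames joined by `;`, then a space and a sample count. The Python sketch below assumes that format and matches frames by substring. It treats "editing out" a frame as dropping every stack that contains it (the equivalent of `grep -v` on the folded file), which is one reasonable reading. The example stacks are made up, not taken from the captures linked above.

```python
def parse_folded(lines):
    """Yield (frames, count) pairs from folded lines such as 'a;b;c 42'."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        stack, _, count = line.rpartition(" ")  # count is the last field
        yield stack.split(";"), int(count)


def has_frame(frames, needle):
    """True if any frame name contains `needle`."""
    return any(needle in f for f in frames)


def drop_stacks_with(lines, needle):
    """Keep only stacks that do not contain `needle` (like grep -v)."""
    return [";".join(fr) + " " + str(n)
            for fr, n in parse_folded(lines) if not has_frame(fr, needle)]


def share_of(lines, needle):
    """Fraction of all samples whose stack contains `needle`."""
    total = hit = 0
    for fr, n in parse_folded(lines):
        total += n
        if has_frame(fr, needle):
            hit += n
    return hit / total if total else 0.0


if __name__ == "__main__":
    # Made-up stacks for illustration only.
    folded = [
        "unix`thread_start;unix`idle;unix`cpu_idle_adaptive;unix`i86_mwait 700",
        "unix`thread_start;example`worker 300",
    ]
    print(share_of(folded, "i86_mwait"))  # 0.7
    print(drop_stacks_with(folded, "cpu_idle_adaptive"))
```

Running `share_of(..., "i86_mwait")` on the slow-read and fast-write captures turns the visual difference described above into a pair of numbers that can be compared directly.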
>>
>> Disabling the L2ARC cache device didn't make a difference, and I had
>> to re-enable EIST support on the CPU to get my VMs to boot.
>>
>> I am seeing a variety of bug reports going back to 2010 regarding
>> excessive mwait operations, with the suggested solutions usually being
>> to set "cpupm enable poll-mode" in power.conf. That change also had no
>> effect on speed.
>>
>> -Warren V
>>
>>
>>
>>
>> -----Original Message-----
>>
>> From: Chris Siebenmann [mailto:cks at cs.toronto.edu]
>>
>> Sent: Monday, February 23, 2015 8:30 AM
>>
>> To: W Verb
>>
>> Cc: omnios-discuss at lists.omniti.com
>> <mailto:omnios-discuss at lists.omniti.com>; cks at cs.toronto.edu
>> <mailto:cks at cs.toronto.edu>
>>
>> Subject: Re: [OmniOS-discuss] The ixgbe driver, Lindsay Lohan, and the
>> Greek economy
>>
>>
>> > Chris, thanks for your specific details. I'd appreciate it if you
>> > could tell me which copper NIC you tried, as well as to pass on the
>> > iSCSI tuning parameters.
>>
>>
>> Our copper NIC experience is with onboard X540-AT2 ports on SuperMicro
>> hardware (which have the guaranteed 10-20 msec lock hold) and
>> dual-port 82599EB TN cards (which have some sort of driver/hardware
>> failure under load that eventually leads to 2-second lock holds). I
>> can't recommend either with the current driver; we had to revert to 1G
>> networking in order to get stable servers.
>>
>>
>> The iSCSI parameter modifications we do, across both initiators and
>> targets, are:
>>
>> initialr2t         no
>> firstburstlength   128k
>> maxrecvdataseglen  128k   [only on Linux backends]
>> maxxmitdataseglen  128k   [only on Linux backends]
>>
>>
>> The OmniOS initiator doesn't need tuning for more than the first two
>> parameters; on the Linux backends we tune up all four. My extended
>> thoughts on these tuning parameters and why we touch them can be found
>> here:
>>
>>
>> http://utcc.utoronto.ca/~cks/space/blog/tech/UnderstandingiSCSIProtocol
>>
>> http://utcc.utoronto.ca/~cks/space/blog/tech/LikelyISCSITuning
>>
>>
>> The short version is that these parameters probably only make a small
>> difference, but their overall goal is to do 128KB ZFS reads and writes
>> in single iSCSI operations (although they will be fragmented at the TCP
>> layer) and to do iSCSI writes without a back-and-forth delay between
>> initiator and target (that's 'initialr2t no').
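A rough sketch of the TCP-layer fragmentation mentioned above: how many TCP segments a single 128 KiB iSCSI data segment occupies at a few MTUs. It assumes the whole 128 KiB travels in one PDU (the point of raising the data segment lengths to 128k), IPv4 and TCP headers of 20 bytes each with no options (real connections often carry a 12-byte timestamp option), and one 48-byte iSCSI Basic Header Segment with no additional headers or digests.

```python
import math

IP_HDR = 20     # IPv4 header without options (assumption)
TCP_HDR = 20    # TCP header without options such as timestamps (assumption)
ISCSI_BHS = 48  # iSCSI Basic Header Segment, no AHS or digests (assumption)


def segments_for(payload_bytes, mtu):
    """Approximate number of TCP segments needed to carry one iSCSI PDU
    whose data segment is `payload_bytes` long."""
    mss = mtu - IP_HDR - TCP_HDR
    return math.ceil((payload_bytes + ISCSI_BHS) / mss)


if __name__ == "__main__":
    for mtu in (1500, 3000, 9000):
        print(mtu, segments_for(128 * 1024, mtu))
```

Under these assumptions the script prints 90, 45 and 15 segments for MTUs of 1500, 3000 and 9000.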
>>
>>
>> I think basically everyone should use InitialR2T set to no, and in fact
>> that it should be the software default. These days only unusually
>> limited iSCSI targets should need it to be otherwise, and they can
>> change their setting for it (initiator and target must both agree to
>> it being 'no', so either can veto it).
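RFC 3720 gives each Boolean login key a result function. InitialR2T uses OR, so R2T stays in force unless both initiator and target offer `No`; ImmediateData uses AND. A minimal Python model of just those two rules (it does not model the rest of login negotiation):

```python
# Result functions for two Boolean iSCSI login keys, per RFC 3720.
RESULT_FUNCTION = {
    "InitialR2T": "OR",      # Yes if either side offers Yes
    "ImmediateData": "AND",  # Yes only if both sides offer Yes
}


def negotiate(key, initiator, target):
    """Combine the initiator's and target's 'Yes'/'No' offers for `key`."""
    a, b = initiator == "Yes", target == "Yes"
    if RESULT_FUNCTION[key] == "OR":
        result = a or b
    else:
        result = a and b
    return "Yes" if result else "No"


if __name__ == "__main__":
    print(negotiate("InitialR2T", "No", "Yes"))  # target insists on R2T
    print(negotiate("InitialR2T", "No", "No"))   # both agree to skip it
```

For example, `negotiate("InitialR2T", "No", "Yes")` returns `"Yes"`: a side that insists on R2T wins.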
>>
>>
>> - cks
>>
>>
>>
>> On Mon, Feb 23, 2015 at 8:21 AM, Joerg Goltermann <jg at osn.de
>> <mailto:jg at osn.de>> wrote:
>>
>> Hi,
>>
>> I think your problem is caused by your link properties or your
>> switch settings. In general the standard ixgbe seems to perform
>> well.
>>
>> I had trouble after changing the default flow control settings to "bi"
>> and this was my motivation to update the ixgbe driver a long time ago.
>> Since I updated our systems to ixgbe 2.5.8, I haven't had any
>> problems.
>>
>> Make sure your switch supports jumbo frames and that you use the
>> same MTU on all ports; otherwise the smallest will be used.
>>
>> What switch do you use? I can tell you nice horror stories about
>> different vendors....
>>
>> - Joerg
>>
>> On 23.02.2015 10:31, W Verb wrote:
>>
>> Thank you Joerg,
>>
>> I've downloaded the package and will try it tomorrow.
>>
>> The only thing I can add at this point is that upon review of my
>> testing, I may have performed my "pkg -u" between the initial
>> quad-gig performance test and installing the 10G NIC. So this may be
>> a new problem introduced in the latest updates.
>>
>> Those of you who are running 10G and have not upgraded to the latest
>> kernel, etc., might want to do some additional testing before running
>> the update.
>>
>> -Warren V
>>
>> On Mon, Feb 23, 2015 at 1:15 AM, Joerg Goltermann <jg at osn.de> wrote:
>>
>> Hi,
>>
>> I remember there was a problem with the flow control settings in the
>> ixgbe driver, so I updated it a long time ago for our internal servers
>> to 2.5.8. Last weekend I integrated the latest changes from the FreeBSD
>> driver to bring the illumos ixgbe to 2.5.25, but I had no time to test
>> it, so it's completely untested!
>>
>>
>> If you would like to give the latest driver a try, you can fetch the
>> kernel modules from https://cloud.osn.de/index.php/s/Fb4so9RsNnXA7r9
>>
>> Clone your boot environment, place the modules in the new environment,
>> and update the boot archive of the new BE.
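Those three steps map onto the standard illumos `beadm` and `bootadm` commands. The Python sketch below only builds and prints the command list; it runs nothing. The BE name, the module file name `./ixgbe`, and the copy destination are hypothetical, because the contents of the downloaded archive are not described here; `/kernel/drv/amd64` is where 64-bit x86 driver binaries normally live on illumos.

```python
import shlex


def be_update_plan(be_name, module_files, mountpoint="/mnt",
                   drv_dir="kernel/drv/amd64"):
    """Return shell command strings for installing kernel modules into a
    cloned boot environment. Nothing is executed. Names are hypothetical."""
    target = f"{mountpoint}/{drv_dir}/"
    cmds = [
        ["beadm", "create", be_name],             # clone the active BE
        ["beadm", "mount", be_name, mountpoint],  # mount the clone
    ]
    cmds += [["cp", f, target] for f in module_files]  # copy new modules in
    cmds += [
        ["bootadm", "update-archive", "-R", mountpoint],  # rebuild its boot archive
        ["beadm", "umount", be_name],
        ["beadm", "activate", be_name],           # boot into it next time
    ]
    return [shlex.join(c) for c in cmds]


if __name__ == "__main__":
    for line in be_update_plan("ixgbe-2.5.25", ["./ixgbe"]):
        print(line)
```

Because the change lives in its own BE, the previous BE stays selectable at boot if the untested driver misbehaves.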
>>
>> - Joerg
>>
>>
>>
>>
>>
>> On 23.02.2015 02:54, W Verb wrote:
>>
>> By the way, to those of you who have working setups: please send me
>> your pool/volume settings, interface linkprops, and any kernel tuning
>> parameters you may have set.
>>
>> Thanks,
>> Warren V
>>
>> On Sat, Feb 21, 2015 at 7:59 AM, Schweiss, Chip <chip at innovates.com> wrote:
>>
>> I can't say I totally agree with your performance assessment. I run
>> Intel X520 in all my OmniOS boxes.
>>
>> Here is a capture of nfssvrtop I made while running many storage
>> vMotions between two OmniOS boxes hosting NFS datastores. This is a
>> 10-host VMware cluster. Both OmniOS boxes are dual 10G connected with
>> copper twin-ax to the in-rack Nexus 5010.
>>
>> VMware does 100% sync writes; I use ZeusRAM SSDs for log devices.
>>
>> -Chip
>>
>> 2014 Apr 24 08:05:51, load: 12.64, read: 17330243 KB, swrite: 15985 KB, awrite: 1875455 KB
>>
>> Ver Client        NFSOPS Reads SWrites AWrites Commits   Rd_bw SWr_bw AWr_bw Rd_t SWr_t AWr_t Com_t Align%
>> 4   10.28.17.105       0     0       0       0       0       0      0      0    0     0     0     0      0
>> 4   10.28.17.215       0     0       0       0       0       0      0      0    0     0     0     0      0
>> 4   10.28.17.213       0     0       0       0       0       0      0      0    0     0     0     0      0
>> 4   10.28.16.151       0     0       0       0       0       0      0      0    0     0     0     0      0
>> 4   all                1     0       0       0       0       0      0      0    0     0     0     0      0
>> 3   10.28.16.175       3     0       3       0       0       1     11      0 4806    48     0     0     85
>> 3   10.28.16.183       6     0       6       0       0       3    162      0  549   124     0     0     73
>> 3   10.28.16.180      11     0      10       0       0       3     27      0  776    89     0     0     67
>> 3   10.28.16.176      28     2      26       0       0      10    405      0 2572   198     0     0    100
>> 3   10.28.16.178    4606  4602       4       0       0  294534      3      0  723    49     0     0     99
>> 3   10.28.16.179    4905  4879      26       0       0  312208    311      0  735   271     0     0     99
>> 3   10.28.16.181    5515  5502      13       0       0  352107     77      0   89    87     0     0     99
>> 3   10.28.16.184   12095 12059      10       0       0  763014     39      0  249   147     0     0     99
>> 3   10.28.58.1     15401  6040     116    6354      53  191605    474 202346  192    96   144    83     99
>> 3   all            42574 33086     217    6354      53 1913488   1582 202300  348   138   153   105     99
>>
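Chip's aggregate read figure can be put in line-rate terms. Assuming nfssvrtop's Rd_bw column is in KB per second (an assumption; the capture does not state the unit), the `all` row's 1913488 converts as follows, shown for both 1 KB = 1024 bytes and 1 KB = 1000 bytes:

```python
def kb_per_s_to_gbit_per_s(kb_per_s, kb_bytes=1024):
    """Convert a KB/s figure to gigabits per second (1 Gbit = 10**9 bits)."""
    return kb_per_s * kb_bytes * 8 / 1e9


if __name__ == "__main__":
    rd_bw_all = 1913488  # NFSv3 'all' row, Rd_bw column of the capture
    print(f"{kb_per_s_to_gbit_per_s(rd_bw_all):.1f} Gbit/s (1 KB = 1024 B)")
    print(f"{kb_per_s_to_gbit_per_s(rd_bw_all, 1000):.1f} Gbit/s (1 KB = 1000 B)")
```

If the unit assumption holds, that is roughly 15 Gbit/s of NFS reads either way, a large share of the 20 Gbit/s that a dual 10G connection offers.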
>>
>>
>>
>>
>> On Fri, Feb 20, 2015 at 11:46 PM, W Verb <wverb73 at gmail.com> wrote:
>>
>>
>> Hello All,
>>
>> Thank you for your replies.
>> I tried a few things, and found the following:
>>
>> 1: Disabling hyperthreading support in the BIOS drops performance
>> overall by a factor of 4.
>> 2: Disabling VT support also seems to have some effect, although it
>> appears to be minor. But this has the amusing side effect of fixing
>> the hangs I've been experiencing with fast reboot, probably by
>> disabling KVM.
>> 3: The performance tests are a bit tricky to quantify because of
>> caching effects. In fact, I'm not entirely sure what is happening
>> here. It's best to just describe what I'm seeing:
>>
>> The commands I'm using to test are
>> dd if=/dev/zero of=./test.dd bs=2M count=5000
>> dd of=/dev/null if=./test.dd bs=2M count=5000
>> The host VM is running CentOS 6.6 and has the latest vmtools
>> installed. There is also a host cache in place on an SSD local to the
>> host. Disabling the host cache didn't immediately have an effect as
>> far as I could see.
>>
>> The host MTU was set to 3000 on all iSCSI interfaces for all tests.
>>
>> Test 1: Right after reboot, with an ixgbe MTU of 9000, the write test
>> yields an average speed over three tests of 137MB/s. The read test
>> yields an average over three tests of 5MB/s.
>>
>> Test 2: After setting "ifconfig ixgbe0 mtu 3000", the write tests
>> yield 140MB/s, and the read tests yield 53MB/s. It's important to note
>> here that if I cut the read test short at only 2-3GB, I get results
>> upwards of 350MB/s, which I assume is local cache-related distortion.
>>
>> Test 3: MTU of 1500. Read tests are up to 156MB/s. Write tests yield
>> about 142MB/s.
>> Test 4: MTU of 1000: Read test at 182MB/s.
>> Test 5: MTU of 900: Read test at 130MB/s.
>> Test 6: MTU of 1000: Read test at 160MB/s. Write tests are now
>> consistently at about 300MB/s.
>> Test 7: MTU of 1200: Read test at 124MB/s.
>> Test 8: MTU of 1000: Read test at 161MB/s. Write at 261MB/s.
>>
>> A few final notes:
>> L1ARC grabs about 10GB of RAM during the tests, so there's definitely
>> some read caching going on.
>> The write operations are easier to observe with iostat, and I'm seeing
>> IO rates that closely correlate with the network write speeds.
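The test file size explains part of the caching effect. GNU dd (the guest runs CentOS) reads `bs=2M` as 2 MiB, so `count=5000` moves 10,000 MiB. The Python arithmetic below compares that with the roughly 10GB the ARC was seen to take; reading that "10GB" as GiB is an assumption.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def dd_total_bytes(bs_bytes, count):
    """Bytes transferred by dd with the given block size and block count."""
    return bs_bytes * count


if __name__ == "__main__":
    total = dd_total_bytes(2 * MIB, 5000)  # bs=2M count=5000
    arc_gib = 10                           # observed ARC growth, assumed GiB
    print(f"test file: {total / GIB:.2f} GiB")
    print(f"file size / ARC growth: {total / GIB / arc_gib:.2f}")
```

Since the test file (about 9.77 GiB) is roughly the size of the observed ARC growth, much of a read pass may be served from the server's memory rather than its disks.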
>>
>>
>> Chris, thanks for your specific details. I'd appreciate it if you
>> could tell me which copper NIC you tried, as well as to pass on the
>> iSCSI tuning parameters.
>>
>> I've ordered an Intel EXPX9502AFXSR, which uses the 82598 chip
>> instead of the 82599 in the X520. If I get similar results with my
>> fiber transceivers, I'll see if I can get hold of copper ones.
>>
>> But I should mention that I did indeed look at PHY/MAC error rates,
>> and they are nil.
>>
>> -Warren V
>>
>> On Fri, Feb 20, 2015 at 7:25 PM, Chris Siebenmann <cks at cs.toronto.edu> wrote:
>>
>>
>> After installation and configuration, I observed all kinds of bad
>> behavior in the network traffic between the hosts and the server. All
>> of this bad behavior is traced to the ixgbe driver on the storage
>> server. Without going into the full troubleshooting process, here are
>> my takeaways:
>>
>> [...]
>>
>> For what it's worth, we managed to achieve much better line rates on
>> copper 10G ixgbe hardware of various descriptions between OmniOS and
>> CentOS 7 (I don't think we ever tested OmniOS to OmniOS). I don't
>> believe OmniOS could do TCP at full line rate, but I think we managed
>> 700+ Mbytes/sec on both transmit and receive, and we got basically
>> disk-limited speeds with iSCSI (across multiple disks on multi-disk
>> mirrored pools, OmniOS iSCSI initiator, Linux iSCSI targets).
>>
>> I don't believe we did any specific kernel tuning (and in fact some of
>> our attempts to fiddle ixgbe driver parameters blew up in our face).
>> We did tune iSCSI connection parameters to increase various buffer
>> sizes so that ZFS could do even large single operations in single
>> iSCSI transactions. (More details available if people are interested.)
>>
>> 10: At the wire level, the speed problems are clearly due to pauses in
>> response time by OmniOS. At 9000-byte frame sizes, I see a good number
>> of duplicate ACKs and fast retransmits during read operations (when
>> OmniOS is transmitting). But below about a 4100-byte MTU on OmniOS
>> (which seems to correlate to 4096-byte iSCSI block transfers), the
>> transmission errors fade away and we only see the transmission pause
>> problem.
>>
>>
>> This is what really attracted my attention. In our OmniOS setup, our
>> specific Intel hardware had ixgbe driver issues that could cause
>> activity stalls during once-a-second link heartbeat checks. This
>> obviously had an effect at the TCP and iSCSI layers. My initial
>> message to illumos-developer sparked a potentially interesting
>> discussion:
>>
>>
>> http://www.listbox.com/member/archive/182179/2014/10/sort/time_rev/page/16/entry/6:405/20141003125035:6357079A-4B1D-11E4-A39C-D534381BA44D/
>>
>> If you think this is a possibility in your setup, I've put the DTrace
>> script I used to hunt for this up on the web:
>>
>> http://www.cs.toronto.edu/~cks/src/omnios-ixgbe/ixgbe_delay.d
>>
>> This isn't the only potential source of driver stalls by any means;
>> it's just the one I found. You may also want to look at lockstat in
>> general, as information it reported is what led us to look
>> specifically at the ixgbe code here.
>>
>> (If you suspect kernel/driver issues, lockstat combined with kernel
>> source is a really excellent resource.)
>>
>> - cks
>>
>>
>>
>>
>>
>> _______________________________________________
>> OmniOS-discuss mailing list
>> OmniOS-discuss at lists.omniti.com
>> http://lists.omniti.com/mailman/listinfo/omnios-discuss
>>
>>
>>
>>
>> --
>> OSN Online Service Nuernberg GmbH, Bucher Str. 78, 90408 Nuernberg
>> Tel: +49 911 39905-0 - Fax: +49 911 39905-55 - http://www.osn.de
>> HRB 15022 Nuernberg, USt-Id: DE189301263, GF: Joerg Goltermann
>>
>>
>>
>>
>>
>>
>
>
>
>