[OmniOS-discuss] [developer] Re: The ixgbe driver, Lindsay Lohan, and the Greek economy
W Verb
wverb73 at gmail.com
Mon Mar 2 19:07:45 UTC 2015
Hello all,
I am not using layer 2 flow control. The switch carries line-rate 10G
traffic without error.
I think I have found the issue via lockstat. The first lockstat is taken
during a multipath read:
lockstat -kWP sleep 30
Adaptive mutex spin: 21331 events in 30.020 seconds (711 events/sec)
Count indv cuml rcnt nsec Hottest Lock Caller
-------------------------------------------------------------------------------
9306 44% 44% 0.00 1557 htable_mutex+0x370 htable_release
6307 23% 68% 0.00 1207 htable_mutex+0x108 htable_lookup
596 7% 75% 0.00 4100 0xffffff0931705188 cv_wait
349 5% 80% 0.00 4437 0xffffff0931705188 taskq_thread
704 2% 82% 0.00 995 0xffffff0935de3c50 dbuf_create
The hash table being read here is, I would guess, the TCP connection
hash table.
When lockstat is run during a multipath write operation, I get:
Adaptive mutex spin: 1097341 events in 30.016 seconds (36558 events/sec)
Count indv cuml rcnt nsec Hottest Lock Caller
-------------------------------------------------------------------------------
210752 28% 28% 0.00 4781 0xffffff0931705188 taskq_thread
174471 22% 50% 0.00 4476 0xffffff0931705188 cv_wait
127183 10% 61% 0.00 2871 0xffffff096f29b510 zio_notify_parent
176066 10% 70% 0.00 1922 0xffffff0931705188 taskq_dispatch_ent
105134 9% 80% 0.00 3110 0xffffff096ffdbf10 zio_remove_child
67512 4% 83% 0.00 1938 0xffffff096f3db4b0 zio_add_child
45736 3% 86% 0.00 2239 0xffffff0935de3c50 dbuf_destroy
27781 3% 89% 0.00 3416 0xffffff0935de3c50 dbuf_create
38536 2% 91% 0.00 2122 0xffffff0935de3b70 dnode_rele
27841 2% 93% 0.00 2423 0xffffff0935de3b70 dnode_diduse_space
19020 2% 95% 0.00 3046 0xffffff09d9e305e0 dbuf_rele
14627 1% 96% 0.00 3632 dbuf_hash_table+0x4f8 dbuf_find
Writes are not performing htable lookups, while reads are.
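
To pin down who is taking those htable locks, lockstat can record
stack traces, and a quick fbt aggregation can show the callers (a
sketch, untested):

lockstat -kWP -s 8 sleep 30
dtrace -n 'fbt::htable_lookup:entry { @[stack()] = count(); }'
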
-Warren V
On Mon, Mar 2, 2015 at 3:14 AM, Joerg Goltermann <jg at osn.de> wrote:
> Hi,
>
> I would try *one* TPG which includes both interface addresses
> and I would double check for packet drops on the Catalyst.
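>
> A single TPG spanning both portals can be created with something like
> this (itadm syntax; addresses and IQN are placeholders):
>
>     itadm create-tpg tpg0 192.168.10.1 192.168.11.1
>     itadm modify-target -t tpg0 iqn.2010-09.org.example:target0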
>
> The 3560 supports only receive flow control, which means that a
> sending 10Gbit port can easily overload a 1Gbit port.
> Do you have flow control enabled?
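>
> (On the Catalyst, something like "show flowcontrol interface gi0/1"
> and "show interfaces gi0/1 counters errors" will show the negotiated
> flow control state and any drop counters; the interface name here is
> a placeholder.)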
>
> - Joerg
>
>
> On 02.03.2015 09:22, W Verb via illumos-developer wrote:
>
>> Hello Garrett,
>>
>> No, no 802.3ad going on in this config.
>>
>> Here is a basic schematic:
>>
>> https://drive.google.com/file/d/0BwyUMjibonYQVkVqcE5OQUJyUUU/view?usp=sharing
>>
>> Here is the Nexenta MPIO iSCSI Setup Document that I used as a guide:
>>
>> https://drive.google.com/file/d/0BwyUMjibonYQbjEyUTBjN2tTNWM/view?usp=sharing
>>
>> Note that I am using an MTU of 3000 on both the 10G and 1G NICs. The
>> switch is set to allow 9148-byte frames, and I'm not seeing any
>> errors/buffer overruns on the switch.
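>>
>> (To double-check both ends: "dladm show-linkprop -p mtu ixgbe0"
>> shows the effective MTU on the OmniOS side, and "show system mtu"
>> shows the configured jumbo MTU on the 3560.)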
>>
>> Here is a screenshot of a packet capture from a read operation on the
>> guest OS (from its local drive, which is actually a VMDK file on the
>> storage server). In this example, only a single 1G ESXi kernel
>> interface (vmk1) is bound to the software iSCSI initiator.
>>
>> https://drive.google.com/file/d/0BwyUMjibonYQa2NYdXhpZkpkbU0/view?usp=sharing
>>
>> Note that there's a nice, well-behaved window-sizing process taking
>> place. The ESXi host decreases the scaled window by 11 or 12 for
>> each ACK, then bumps it back up to 512.
>>
>> Here is a similar screenshot of a single-interface write operation:
>>
>> https://drive.google.com/file/d/0BwyUMjibonYQbU1RZHRnakxDSFU/view?usp=sharing
>>
>> There are no pauses or gaps in the transmission rate in the
>> single-interface transfers.
>>
>>
>> In the next screenshots, I have enabled an additional 1G interface on
>> the ESXi host, and bound it to the iSCSI initiator. The new interface is
>> bound to a separate physical port, uses a different VLAN on the switch,
>> and talks to a different 10G port on the storage server.
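>>
>> (Port binding of this sort is typically done with something like
>> "esxcli iscsi networkportal add --adapter=vmhba33 --nic=vmk2"; the
>> vmhba/vmk names here are placeholders.)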
>>
>> First, let's look at a write operation on the guest OS, which happily
>> pumps data at near-line-rate to the storage server.
>>
>> Here is a sequence number trace diagram. Note how the transfer has a
>> nice, smooth increment rate over the entire transfer.
>>
>> https://drive.google.com/file/d/0BwyUMjibonYQWHNIa0drWnNxMmM/view?usp=sharing
>>
>> Here are screenshots from packet captures on both 1G interfaces:
>>
>> https://drive.google.com/file/d/0BwyUMjibonYQRWhyVVQ4djNaU3c/view?usp=sharing
>> https://drive.google.com/file/d/0BwyUMjibonYQaTVjTEtTRloyR2c/view?usp=sharing
>>
>> Note how we again see nice, smooth window adjustment, and no gaps in
>> transmission.
>>
>>
>> But now, let's look at the problematic two-interface Read operation.
>> First, the sequence graph:
>>
>> https://drive.google.com/file/d/0BwyUMjibonYQTzdFVWdQMWZ6LUU/view?usp=sharing
>>
>> As you can see, there are gaps and jumps in the transmission throughout
>> the transfer.
>> It is very illustrative to look at captures of the gaps, which are
>> occurring on both interfaces:
>>
>> https://drive.google.com/file/d/0BwyUMjibonYQc0VISXN6eVFwQzg/view?usp=sharing
>> https://drive.google.com/file/d/0BwyUMjibonYQVFREUHp3TGFiUU0/view?usp=sharing
>>
>> As you can see, there are ~0.4-second pauses in transmission from the
>> storage server, which kill the transfer rate.
>> It's clear that the ESXi box ACKs the prior iSCSI operation to
>> completion, then makes a new LUN request, which the storage server
>> immediately replies to. The ESXi ACKs the response packet from the
>> storage server, then waits...and waits....and waits... until eventually
>> the storage server starts transmitting again.
>>
>> Because the pause happens while the ESXi client is waiting for a packet
>> from the storage server, that tells me that the gaps are not an artifact
>> of traffic being switched between both active interfaces, but are
>> actually indicative of short hangs occurring on the server.
>>
>> Having a pause or two in transmission is no big deal, but in my case, it
>> is happening constantly, and dropping my overall read transfer rate down
>> to 20-60MB/s, which is slower than the single interface transfer rate
>> (~90-100MB/s).
>>
>> Decreasing the MTU makes the pauses shorter; increasing it makes them
>> longer.
>>
>> Another interesting thing is that if I set the multipath io interval to
>> 3 operations instead of 1, I get better throughput. In other words, the
>> less frequently I swap IP addresses on my iSCSI requests from the ESXi
>> unit, the fewer pauses I see.
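>>
>> (The interval here is the ESXi round-robin IOPS setting, adjustable
>> with something like "esxcli storage nmp psp roundrobin deviceconfig
>> set --device=naa.xxxx --type=iops --iops=3", device ID elided.)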
>>
>> Basically, COMSTAR seems to choke each time an iSCSI request from a new
>> IP arrives.
>>
>> Because the single interface transfer is near line rate, that tells me
>> that the storage system (mpt_sas, zfs, etc) is working fine. It's only
>> when multiple paths are attempted that iSCSI falls on its face during
>> reads.
>>
>> All of these captures were taken without a cache device being attached
>> to the storage zpool, so this isn't looking like some kind of ZFS ARC
>> problem. As mentioned previously, local transfers to/from the zpool are
>> showing ~300-500 MB/s rates over long transfers (10G+).
>>
>> -Warren V
>>
>> On Sun, Mar 1, 2015 at 9:11 PM, Garrett D'Amore <garrett at damore.org
>> <mailto:garrett at damore.org>> wrote:
>>
>> I'm not sure I've followed properly. You have *two* interfaces.
>> You are not trying to provision these in an aggr, are you? As far as
>> I'm aware, VMware does not support 802.3ad link aggregation. (It's
>> possible that you can make it work with ESXi if you give the entire
>> NIC to the guest, but I'm skeptical.) The problem is that if you
>> try to use link aggregation, some packets (up to half!) will be
>> lost. TCP and other protocols fare poorly in this situation.
>>
>> It's possible I've totally misunderstood what you're trying to do, in
>> which case I apologize.
>>
>> The idle thing is a red herring: the CPU is waiting for work to do,
>> probably because packets haven't arrived (or were dropped by the
>> hypervisor!). I wouldn't read too much into that except that your
>> network stack is in trouble. I'd look a bit more closely at the
>> kstats for TCP; I suspect you'll see retransmit or out-of-order
>> values that are unusually high, and if so this may help validate my
>> theory above.
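>>
>> (Something like "kstat -p tcp:0:tcp | egrep -i 'retrans|dup'", or
>> just "netstat -s -P tcp", will dump those counters.)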
>>
>> - Garrett
>>
>> On Mar 1, 2015, at 9:03 PM, W Verb via illumos-developer
>>> <developer at lists.illumos.org> wrote:
>>>
>>> Hello all,
>>>
>>>
>>> Well, I no longer blame the ixgbe driver for the problems I'm seeing.
>>>
>>>
>>> I tried Joerg's updated driver, which didn't improve the issue. So
>>> I went back to the drawing board and rebuilt the server from scratch.
>>>
>>> What I noted is that if I have only a single 1-gig physical
>>> interface active on the ESXi host, everything works as expected.
>>> As soon as I enable two interfaces, I start seeing the performance
>>> problems I've described.
>>>
>>> Response pauses from the server that I see in tcpdumps are still
>>> leading me to believe the problem is delay on the server side, so
>>> I ran a series of kernel dtraces and produced some flamegraphs.
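>>>
>>> The graphs were generated with roughly the standard FlameGraph
>>> recipe (a sketch; 30-second kernel sample, tool paths assumed):
>>>
>>>     dtrace -x stackframes=100 \
>>>       -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-30s { exit(0); }' \
>>>       -o out.stacks
>>>     stackcollapse.pl out.stacks | flamegraph.pl > graph.svg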
>>>
>>>
>>> This was taken during a read operation with two active 10G
>>> interfaces on the server, with a single target being shared by two
>>> TPGs, one TPG for each 10G physical port. The host device has two
>>> 1G ports enabled, with VLANs separating the active ports into
>>> 10G/1G pairs. ESXi is set to multipath using both VLANs with a
>>> round-robin IO interval of 1.
>>>
>>> https://drive.google.com/file/d/0BwyUMjibonYQd3ZYOGh4d2pteGs/view?usp=sharing
>>>
>>>
>>> This was taken during a write operation:
>>>
>>> https://drive.google.com/file/d/0BwyUMjibonYQMnBtU1Q2SXM2ams/view?usp=sharing
>>>
>>>
>>> I then rebooted the server and disabled C-State, ACPI T-State, and
>>> general EIST (Turbo Boost) functionality in the CPU.
>>>
>>> When I attempted to boot my guest VM, the iSCSI transfer gradually
>>> ground to a halt during the boot-loading process, and the guest OS
>>> never did complete its boot process.
>>>
>>> Here is a flamegraph taken while iSCSI is slowly dying:
>>>
>>> https://drive.google.com/file/d/0BwyUMjibonYQM21JeFZPX3dZWTg/view?usp=sharing
>>>
>>>
>>> I edited out cpu_idle_adaptive from the dtrace output and
>>> regenerated the slowdown graph:
>>>
>>> https://drive.google.com/file/d/0BwyUMjibonYQbTVwV3NvXzlPS1E/view?usp=sharing
>>>
>>>
>>> I then edited cpu_idle_adaptive out of the speedy write operation
>>> and regenerated that graph:
>>>
>>> https://drive.google.com/file/d/0BwyUMjibonYQeWFYM0pCMDZ1X2s/view?usp=sharing
>>>
>>>
>>> I have zero experience with interpreting flamegraphs, but the most
>>> significant difference I see between the slow read example and the
>>> fast write example is in unix`thread_start --> unix`idle. There's
>>> a good chunk of "unix`i86_mwait" in the read example that is not
>>> present in the write example at all.
>>>
>>> Disabling the L2ARC cache device didn't make a difference, and I
>>> had to re-enable EIST support on the CPU to get my VMs to boot.
>>>
>>> I am seeing a variety of bug reports going back to 2010 regarding
>>> excessive mwait operations, with the suggested solutions usually
>>> being to set "cpupm enable poll-mode" in power.conf. That change
>>> also had no effect on speed.
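>>>
>>> For the record, that change amounts to (assuming no existing cpupm
>>> line in power.conf):
>>>
>>>     echo "cpupm enable poll-mode" >> /etc/power.conf
>>>     pmconfig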
>>>
>>> -Warren V
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Chris Siebenmann [mailto:cks at cs.toronto.edu]
>>> Sent: Monday, February 23, 2015 8:30 AM
>>> To: W Verb
>>> Cc: omnios-discuss at lists.omniti.com; cks at cs.toronto.edu
>>> Subject: Re: [OmniOS-discuss] The ixgbe driver, Lindsay Lohan, and
>>> the Greek economy
>>>
>>> > Chris, thanks for your specific details. I'd appreciate it if you
>>> > could tell me which copper NIC you tried, as well as to pass on the
>>> > iSCSI tuning parameters.
>>>
>>> Our copper NIC experience is with onboard X540-AT2 ports on
>>> SuperMicro hardware (which have the guaranteed 10-20 msec lock
>>> hold) and dual-port 82599EB TN cards (which have some sort of
>>> driver/hardware failure under load that eventually leads to
>>> 2-second lock holds). I can't recommend either with the current
>>> driver; we had to revert to 1G networking in order to get stable
>>> servers.
>>>
>>>
>>> The iSCSI parameter modifications we do, across both initiators
>>> and targets, are:
>>>
>>>     initialr2t            no
>>>     firstburstlength      128k
>>>     maxrecvdataseglen     128k    [only on Linux backends]
>>>     maxxmitdataseglen     128k    [only on Linux backends]
>>>
>>>
>>> The OmniOS initiator doesn't need tuning for more than the first
>>> two parameters; on the Linux backends we tune up all four. My
>>> extended thoughts on these tuning parameters and why we touch them
>>> can be found here:
>>>
>>> http://utcc.utoronto.ca/~cks/space/blog/tech/UnderstandingiSCSIProtocol
>>> http://utcc.utoronto.ca/~cks/space/blog/tech/LikelyISCSITuning
>>>
>>>
>>> The short version is that these parameters probably only make a
>>> small difference, but their overall goal is to do 128KB ZFS reads
>>> and writes in single iSCSI operations (although they will be
>>> fragmented at the TCP layer) and to do iSCSI writes without a
>>> back-and-forth delay between initiator and target (that's
>>> 'initialr2t no').
>>>
>>>
>>> I think basically everyone should use InitialR2T set to no; in
>>> fact, it should be the software default. These days only unusually
>>> limited iSCSI targets should need it to be otherwise, and they can
>>> change their setting for it (initiator and target must both agree
>>> to it being 'yes', so either can veto it).
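>>>
>>> For an IET-style Linux backend, the equivalent target stanza might
>>> look roughly like this (target name and LUN path are placeholders):
>>>
>>>     Target iqn.2015-02.com.example:tgt0
>>>         Lun 0 Path=/dev/sdb,Type=blockio
>>>         InitialR2T No
>>>         FirstBurstLength 131072
>>>         MaxRecvDataSegmentLength 131072
>>>         MaxXmitDataSegmentLength 131072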
>>>
>>>
>>> - cks
>>>
>>>
>>>
>>> On Mon, Feb 23, 2015 at 8:21 AM, Joerg Goltermann <jg at osn.de> wrote:
>>>
>>> Hi,
>>>
>>> I think your problem is caused by your link properties or your
>>> switch settings. In general the standard ixgbe seems to perform
>>> well.
>>>
>>> I had trouble after changing the default flow control settings to
>>> "bi", and this was my motivation to update the ixgbe driver a long
>>> time ago. After I updated our systems to ixgbe 2.5.8 I never had
>>> any problems.
>>>
>>> Make sure your switch has support for jumbo frames and that you use
>>> the same MTU on all ports; otherwise the smallest will be used.
>>>
>>> What switch do you use? I can tell you nice horror stories about
>>> different vendors....
>>>
>>> - Joerg
>>>
>>> On 23.02.2015 10:31, W Verb wrote:
>>>
>>> Thank you Joerg,
>>>
>>> I've downloaded the package and will try it tomorrow.
>>>
>>> The only thing I can add at this point is that upon review of my
>>> testing, I may have performed my "pkg -u" between the initial
>>> quad-gig performance test and installing the 10G NIC. So this may
>>> be a new problem introduced in the latest updates.
>>>
>>> Those of you who are running 10G and have not upgraded to the
>>> latest kernel, etc., might want to do some additional testing
>>> before running the update.
>>>
>>> -Warren V
>>>
>>> On Mon, Feb 23, 2015 at 1:15 AM, Joerg Goltermann <jg at osn.de> wrote:
>>>
>>> Hi,
>>>
>>> I remember there was a problem with the flow control settings in
>>> the ixgbe driver, so I updated it a long time ago for our internal
>>> servers to 2.5.8. Last weekend I integrated the latest changes from
>>> the FreeBSD driver to bring the illumos ixgbe to 2.5.25, but I had
>>> no time to test it, so it's completely untested!
>>>
>>> If you would like to give the latest driver a try, you can fetch
>>> the kernel modules from
>>> https://cloud.osn.de/index.php/s/Fb4so9RsNnXA7r9
>>>
>>> Clone your boot environment, place the modules in the new
>>> environment, and update the boot archive of the new BE.
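>>>
>>> A minimal sketch of those steps (BE name assumed; the 64-bit
>>> module goes under /kernel/drv/amd64):
>>>
>>>     beadm create ixgbe-test
>>>     beadm mount ixgbe-test /mnt
>>>     cp ixgbe /mnt/kernel/drv/amd64/ixgbe
>>>     bootadm update-archive -R /mnt
>>>     beadm activate ixgbe-test
>>>     init 6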
>>>
>>> - Joerg
>>>
>>> On 23.02.2015 02:54, W Verb wrote:
>>>
>>> By the way, to those of you who have working setups: please send
>>> me your pool/volume settings, interface linkprops, and any kernel
>>> tuning parameters you may have set.
>>>
>>> Thanks,
>>> Warren V
>>>
>>> On Sat, Feb 21, 2015 at 7:59 AM, Schweiss, Chip
>>> <chip at innovates.com> wrote:
>>>
>>> I can't say I totally agree with your performance assessment. I run
>>> Intel X520 in all my OmniOS boxes.
>>>
>>> Here is a capture of nfssvrtop I made while running many storage
>>> vMotions between two OmniOS boxes hosting NFS datastores. This is a
>>> 10-host VMware cluster. Both OmniOS boxes are dual 10G connected
>>> with copper twin-ax to the in-rack Nexus 5010.
>>>
>>> VMware does 100% sync writes; I use ZeusRAM SSDs for log devices.
>>>
>>> -Chip
>>>
>>> 2014 Apr 24 08:05:51, load: 12.64, read: 17330243 KB, swrite: 15985 KB, awrite: 1875455 KB
>>>
>>> Ver Client         NFSOPS  Reads SWrites AWrites Commits   Rd_bw SWr_bw  AWr_bw  Rd_t SWr_t AWr_t Com_t Align%
>>> 4   10.28.17.105        0      0       0       0       0       0      0       0     0     0     0     0      0
>>> 4   10.28.17.215        0      0       0       0       0       0      0       0     0     0     0     0      0
>>> 4   10.28.17.213        0      0       0       0       0       0      0       0     0     0     0     0      0
>>> 4   10.28.16.151        0      0       0       0       0       0      0       0     0     0     0     0      0
>>> 4   all                 1      0       0       0       0       0      0       0     0     0     0     0      0
>>> 3   10.28.16.175        3      0       3       0       0       1     11       0  4806    48     0     0     85
>>> 3   10.28.16.183        6      0       6       0       0       3    162       0   549   124     0     0     73
>>> 3   10.28.16.180       11      0      10       0       0       3     27       0   776    89     0     0     67
>>> 3   10.28.16.176       28      2      26       0       0      10    405       0  2572   198     0     0    100
>>> 3   10.28.16.178     4606   4602       4       0       0  294534      3       0   723    49     0     0     99
>>> 3   10.28.16.179     4905   4879      26       0       0  312208    311       0   735   271     0     0     99
>>> 3   10.28.16.181     5515   5502      13       0       0  352107     77       0    89    87     0     0     99
>>> 3   10.28.16.184    12095  12059      10       0       0  763014     39       0   249   147     0     0     99
>>> 3   10.28.58.1      15401   6040     116    6354      53  191605    474  202346   192    96   144    83     99
>>> 3   all             42574  33086     217    6354      53 1913488   1582  202300   348   138   153   105     99
>>>
>>> On Fri, Feb 20, 2015 at 11:46 PM, W Verb <wverb73 at gmail.com> wrote:
>>>
>>>
>>> Hello All,
>>>
>>> Thank you for your replies. I tried a few things, and found the
>>> following:
>>>
>>> 1: Disabling hyperthreading support in the BIOS drops performance
>>> overall by a factor of 4.
>>> 2: Disabling VT support also seems to have some effect, although it
>>> appears to be minor. But this has the amusing side effect of fixing
>>> the hangs I've been experiencing with fast reboot, probably by
>>> disabling kvm.
>>> 3: The performance tests are a bit tricky to quantify because of
>>> caching effects. In fact, I'm not entirely sure what is happening
>>> here. It's just best to describe what I'm seeing:
>>>
>>> The commands I'm using to test are:
>>>
>>>     dd if=/dev/zero of=./test.dd bs=2M count=5000
>>>     dd of=/dev/null if=./test.dd bs=2M count=5000
>>>
>>> The guest VM is running CentOS 6.6 and has the latest vmtools
>>> installed. There is a host cache on an SSD local to the ESXi host
>>> that is also in place. Disabling the host cache didn't immediately
>>> have an effect as far as I could see.
>>>
>>> The host MTU was set to 3000 on all iSCSI interfaces for all tests.
>>>
>>> Test 1: Right after reboot, with an ixgbe MTU of 9000, the write
>>> test yields an average speed over three tests of 137 MB/s. The read
>>> test yields an average over three tests of 5 MB/s.
>>>
>>> Test 2: After setting "ifconfig ixgbe0 mtu 3000", the write tests
>>> yield 140 MB/s, and the read tests yield 53 MB/s. It's important to
>>> note here that if I cut the read test short at only 2-3 GB, I get
>>> results upwards of 350 MB/s, which I assume is local cache-related
>>> distortion.
>>>
>>> Test 3: MTU of 1500. Read tests are up to 156 MB/s. Write tests
>>> yield about 142 MB/s.
>>> Test 4: MTU of 1000: Read test at 182 MB/s.
>>> Test 5: MTU of 900: Read test at 130 MB/s.
>>> Test 6: MTU of 1000: Read test at 160 MB/s. Write tests are now
>>> consistently at about 300 MB/s.
>>> Test 7: MTU of 1200: Read test at 124 MB/s.
>>> Test 8: MTU of 1000: Read test at 161 MB/s. Write at 261 MB/s.
>>>
>>> A few final notes:
>>> L1ARC grabs about 10 GB of RAM during the tests, so there's
>>> definitely some read caching going on.
>>> The write operations are easier to observe with iostat, and I'm
>>> seeing IO rates that closely correlate with the network write
>>> speeds.
>>>
>>>
>>> Chris, thanks for your specific details. I'd appreciate it if you
>>> could tell me which copper NIC you tried, as well as to pass on the
>>> iSCSI tuning parameters.
>>>
>>> I've ordered an Intel EXPX9502AFXSR, which uses the 82598 chip
>>> instead of the 82599 in the X520. If I get similar results with my
>>> fiber transceivers, I'll see if I can get a hold of copper ones.
>>>
>>> But I should mention that I did indeed look at PHY/MAC error rates,
>>> and they are nil.
>>>
>>> -Warren V
>>>
>>> On Fri, Feb 20, 2015 at 7:25 PM, Chris Siebenmann
>>> <cks at cs.toronto.edu> wrote:
>>>
>>>
>>> > After installation and configuration, I observed all kinds of
>>> > bad behavior in the network traffic between the hosts and the
>>> > server. All of this bad behavior is traced to the ixgbe driver
>>> > on the storage server. Without going into the full
>>> > troubleshooting process, here are my takeaways:
>>>
>>> [...]
>>>
>>> For what it's worth, we managed to achieve much better line rates
>>> on copper 10G ixgbe hardware of various descriptions between OmniOS
>>> and CentOS 7 (I don't think we ever tested OmniOS to OmniOS). I
>>> don't believe OmniOS could do TCP at full line rate, but I think we
>>> managed 700+ Mbytes/sec on both transmit and receive, and we got
>>> basically disk-limited speeds with iSCSI (across multiple disks on
>>> multi-disk mirrored pools, OmniOS iSCSI initiator, Linux iSCSI
>>> targets).
>>>
>>> I don't believe we did any specific kernel tuning (and in fact some
>>> of our attempts to fiddle ixgbe driver parameters blew up in our
>>> face). We did tune iSCSI connection parameters to increase various
>>> buffer sizes so that ZFS could do even large single operations in
>>> single iSCSI transactions. (More details available if people are
>>> interested.)
>>>
>>> > 10: At the wire level, the speed problems are clearly due to
>>> > pauses in response time by omnios. At 9000-byte frame sizes, I
>>> > see a good number of duplicate ACKs and fast retransmits during
>>> > read operations (when omnios is transmitting). But below about a
>>> > 4100-byte MTU on omnios (which seems to correlate to 4096-byte
>>> > iSCSI block transfers), the transmission errors fade away and we
>>> > only see the transmission pause problem.
>>>
>>>
>>> This is what really attracted my attention. In our OmniOS setup,
>>> our specific Intel hardware had ixgbe driver issues that could
>>> cause activity stalls during once-a-second link heartbeat checks.
>>> This obviously had an effect at the TCP and iSCSI layers. My
>>> initial message to illumos-developer sparked a potentially
>>> interesting discussion:
>>>
>>> http://www.listbox.com/member/archive/182179/2014/10/sort/time_rev/page/16/entry/6:405/20141003125035:6357079A-4B1D-11E4-A39C-D534381BA44D/
>>>
>>> If you think this is a possibility in your setup, I've put the
>>> DTrace script I used to hunt for this up on the web:
>>>
>>> http://www.cs.toronto.edu/~cks/src/omnios-ixgbe/ixgbe_delay.d
>>>
>>> This isn't the only potential source of driver stalls by any
>>> means; it's just the one I found. You may also want to look at
>>> lockstat in general, as information it reported is what led us to
>>> look specifically at the ixgbe code here.
>>>
>>> (If you suspect kernel/driver issues, lockstat combined with kernel
>>> source is a really excellent resource.)
>>>
>>> - cks
>>>
>>>
> --
> OSN Online Service Nuernberg GmbH, Bucher Str. 78, 90408 Nuernberg
> Tel: +49 911 39905-0 - Fax: +49 911 39905-55 - http://www.osn.de
> HRB 15022 Nuernberg, USt-Id: DE189301263, GF: Joerg Goltermann
>