[OmniOS-discuss] [developer] Re: The ixgbe driver, Lindsay Lohan, and the Greek economy
W Verb
wverb73 at gmail.com
Mon Mar 2 20:19:44 UTC 2015
Hello,
vmstat seems pretty boring. Certainly nothing going to swap.
root at sanbox:/root# vmstat
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr po ro s0 s2 in sy cs us sy id
0 0 0 34631632 30728068 175 215 0 0 0 0 963 275 4 6 140 3301 796 6681 0 1 99
Here is the "taskq_dispatch_ent" output from "lockstat -s 5 -kWP sleep 30"
during the "fast" write operation.
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec Hottest Lock Caller
50934 3% 79% 0.00 3437 0xffffff093145ba40 taskq_dispatch_ent
nsec ------ Time Distribution ------ count Stack
128 | 7 spa_taskq_dispatch_ent
256 |@@ 4333 zio_taskq_dispatch
512 |@@ 3863 zio_issue_async
1024 |@@@@@ 9717 zio_execute
2048 |@@@@@@@@@ 15904
4096 |@@@@ 7595
8192 |@@ 4498
16384 |@ 2662
32768 |@ 1886
65536 | 434
131072 | 34
262144 | 1
-------------------------------------------------------------------------------
However, the truly "broken" function is a read operation:
Top lock 1st try:
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec Hottest Lock Caller
474 15% 15% 0.00 7031 0xffffff093145b6f8 cv_wait
nsec ------ Time Distribution ------ count Stack
256 |@ 29 taskq_thread_wait
512 |@@@@@@ 100 taskq_thread
1024 |@@@@ 72 thread_start
2048 |@@@@ 69
4096 |@@@ 51
8192 |@@ 47
16384 |@@ 44
32768 |@@ 32
65536 |@ 25
131072 | 5
-------------------------------------------------------------------------------
Top lock 2nd try:
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec Hottest Lock Caller
174 39% 39% 0.00 103909 0xffffff0943f116a0 dmu_zfetch_find
nsec ------ Time Distribution ------ count Stack
2048 | 2 dmu_zfetch
4096 | 3 dbuf_read
8192 | 4 dmu_buf_hold_array_by_dnode
16384 | 3 dmu_buf_hold_array
32768 |@ 7
65536 |@@ 14
131072 |@@@@@@@@@@@@@@@@@@@@ 116
262144 |@@@ 19
524288 | 4
1048576 | 2
-------------------------------------------------------------------------------
Top lock 3rd try:
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec Hottest Lock Caller
283 55% 55% 0.00 94602 0xffffff0943ff5a68 dmu_zfetch_find
nsec ------ Time Distribution ------ count Stack
512 | 1 dmu_zfetch
1024 | 1 dbuf_read
2048 | 0 dmu_buf_hold_array_by_dnode
4096 | 5 dmu_buf_hold_array
8192 | 2
16384 | 7
32768 | 4
65536 |@@@ 33
131072 |@@@@@@@@@@@@@@@@@@@@ 198
262144 |@@ 27
524288 | 2
1048576 | 3
-------------------------------------------------------------------------------
As for the MTU question: setting the MTU to 9000 makes read operations
grind almost to a halt, at a 5MB/s transfer rate.
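(For reference, the MTU change for that test was made the same way as in the
earlier tests, i.e. something along the lines of

ifconfig ixgbe0 mtu 9000

on the storage server's 10G interfaces; the interface names are specific to
this setup.)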
-Warren V
On Mon, Mar 2, 2015 at 11:30 AM, Garrett D'Amore <garrett at damore.org> wrote:
> Here’s a theory. You are using small (relatively) MTUs (3000 is less than
> the smallest ZFS block size.) So, when you go multipathing this way, might
> a single upper layer transaction (ZFS block transfer request, or for that
> matter COMSTAR block request) get routed over different paths? This sounds
> like a potentially pathological condition to me.
>
> What happens if you increase the MTU to 9000? Have you tried it? I’m
> sort of thinking that this will permit each transaction to be issued in a
> single IP frame, which may alleviate certain tragic code paths. (That
> said, I’m not sure how aware COMSTAR is of the IP MTU. If it is ignorant,
> then it shouldn’t matter *that* much, since TCP should do the right thing
> here and a single TCP stream should stick to a single underlying NIC. But
> if COMSTAR is aware of the MTU, it may do some really screwball things as
> it tries to break requests up into single frames.)
>
> Your read spin really looks like only about 22 msec of wait out of a total
> run of 30 sec. (That’s not *great*, but neither does it sound tragic.)
> Your write is interesting because that looks like it is going a wildly
> different path. You should be aware that the locks you see are *not*
> necessarily related in call order, but rather are ordered by instance
> count. The write code path hitting the taskq_thread as hard as it does is
> really, really weird. Something is pounding on a taskq lock super hard.
> The number of taskq_dispatch_ent calls is interesting here. I’m starting
> to wonder if it’s something as stupid as a spin where if the taskq is
> “full” (max size reached), a caller is just spinning, trying to dispatch
> jobs to the taskq.
>
> The taskq_dispatch_ent code is super simple, and it should be almost
> impossible to have contention on that lock — barring a thread spinning hard
> on taskq_dispatch (or taskq_dispatch_ent as I think is happening here).
> Looking at the various call sites, there are places in both COMSTAR
> (iscsit) and in ZFS where this could be coming from. To know which, we
> really need to have the back trace associated.
>
> lockstat can give this — try passing “-s 5” to get a short backtrace;
> that will probably give us a little more info about the guilty
> caller. :-)
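> (Concretely, something like
>
> lockstat -s 5 -kWP sleep 30
>
> run while the slow transfer is in progress should show the stacks behind
> the hot taskq lock.)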
>
> - Garrett
>
> On Mar 2, 2015, at 11:07 AM, W Verb via illumos-developer <
> developer at lists.illumos.org> wrote:
>
> Hello all,
> I am not using layer 2 flow control. The switch carries line-rate 10G
> traffic without error.
>
> I think I have found the issue via lockstat. The first lockstat is taken
> during a multipath read:
>
>
> lockstat -kWP sleep 30
>
> Adaptive mutex spin: 21331 events in 30.020 seconds (711 events/sec)
>
> Count indv cuml rcnt nsec Hottest Lock Caller
>
> -------------------------------------------------------------------------------
> 9306 44% 44% 0.00 1557 htable_mutex+0x370 htable_release
> 6307 23% 68% 0.00 1207 htable_mutex+0x108 htable_lookup
> 596 7% 75% 0.00 4100 0xffffff0931705188 cv_wait
> 349 5% 80% 0.00 4437 0xffffff0931705188 taskq_thread
> 704 2% 82% 0.00 995 0xffffff0935de3c50 dbuf_create
>
> The hash table being read here I would guess is the tcp connection hash
> table.
>
> When lockstat is run during a multipath write operation, I get:
>
> Adaptive mutex spin: 1097341 events in 30.016 seconds (36558 events/sec)
>
> Count indv cuml rcnt nsec Hottest Lock Caller
>
> -------------------------------------------------------------------------------
> 210752 28% 28% 0.00 4781 0xffffff0931705188 taskq_thread
> 174471 22% 50% 0.00 4476 0xffffff0931705188 cv_wait
> 127183 10% 61% 0.00 2871 0xffffff096f29b510 zio_notify_parent
> 176066 10% 70% 0.00 1922 0xffffff0931705188 taskq_dispatch_ent
> 105134 9% 80% 0.00 3110 0xffffff096ffdbf10 zio_remove_child
> 67512 4% 83% 0.00 1938 0xffffff096f3db4b0 zio_add_child
> 45736 3% 86% 0.00 2239 0xffffff0935de3c50 dbuf_destroy
> 27781 3% 89% 0.00 3416 0xffffff0935de3c50 dbuf_create
> 38536 2% 91% 0.00 2122 0xffffff0935de3b70 dnode_rele
> 27841 2% 93% 0.00 2423 0xffffff0935de3b70 dnode_diduse_space
> 19020 2% 95% 0.00 3046 0xffffff09d9e305e0 dbuf_rele
> 14627 1% 96% 0.00 3632 dbuf_hash_table+0x4f8 dbuf_find
>
>
>
> Writes are not performing htable lookups, while reads are.
>
> -Warren V
>
>
>
>
>
>
> On Mon, Mar 2, 2015 at 3:14 AM, Joerg Goltermann <jg at osn.de> wrote:
>
>> Hi,
>>
>> I would try *one* TPG which includes both interface addresses
>> and I would double check for packet drops on the Catalyst.
>>
>> The 3560 supports only receive flow control, which means that
>> a sending 10Gbit port can easily overload a 1Gbit port.
>> Do you have flow control enabled?
>>
>> - Joerg
>>
>>
>> On 02.03.2015 09:22, W Verb via illumos-developer wrote:
>>
>>> Hello Garrett,
>>>
>>> No, no 802.3ad going on in this config.
>>>
>>> Here is a basic schematic:
>>>
>>> https://drive.google.com/file/d/0BwyUMjibonYQVkVqcE5OQUJyUUU/view?usp=sharing
>>>
>>> Here is the Nexenta MPIO iSCSI Setup Document that I used as a guide:
>>>
>>> https://drive.google.com/file/d/0BwyUMjibonYQbjEyUTBjN2tTNWM/view?usp=sharing
>>>
>>> Note that I am using an MTU of 3000 on both the 10G and 1G NICs. The
>>> switch is set to allow 9148-byte frames, and I'm not seeing any
>>> errors/buffer overruns on the switch.
>>>
>>> Here is a screenshot of a packet capture from a read operation on the
>>> guest OS (from its local drive, which is actually a VMDK file on the
>>> storage server). In this example, only a single 1G ESXi kernel interface
>>> (vmk1) is bound to the software iSCSI initiator.
>>>
>>> https://drive.google.com/file/d/0BwyUMjibonYQa2NYdXhpZkpkbU0/view?usp=sharing
>>>
>>> Note that there's a nice, well-behaved window sizing process taking
>>> place. The ESXi decreases the scaled window by 11 or 12 for each ACK,
>>> then bumps it back up to 512.
>>>
>>> Here is a similar screenshot of a single-interface write operation:
>>>
>>> https://drive.google.com/file/d/0BwyUMjibonYQbU1RZHRnakxDSFU/view?usp=sharing
>>>
>>> There are no pauses or gaps in the transmission rate in the
>>> single-interface transfers.
>>>
>>>
>>> In the next screenshots, I have enabled an additional 1G interface on
>>> the ESXi host, and bound it to the iSCSI initiator. The new interface is
>>> bound to a separate physical port, uses a different VLAN on the switch,
>>> and talks to a different 10G port on the storage server.
>>>
>>> First, let's look at a write operation on the guest OS, which happily
>>> pumps data at near-line-rate to the storage server.
>>>
>>> Here is a sequence number trace diagram. Note how the transfer has a
>>> nice, smooth increment rate over the entire transfer.
>>>
>>> https://drive.google.com/file/d/0BwyUMjibonYQWHNIa0drWnNxMmM/view?usp=sharing
>>>
>>> Here are screenshots from packet captures on both 1G interfaces:
>>>
>>> https://drive.google.com/file/d/0BwyUMjibonYQRWhyVVQ4djNaU3c/view?usp=sharing
>>> https://drive.google.com/file/d/0BwyUMjibonYQaTVjTEtTRloyR2c/view?usp=sharing
>>>
>>> Note how we again see nice, smooth window adjustment, and no gaps in
>>> transmission.
>>>
>>>
>>> But now, let's look at the problematic two-interface Read operation.
>>> First, the sequence graph:
>>>
>>> https://drive.google.com/file/d/0BwyUMjibonYQTzdFVWdQMWZ6LUU/view?usp=sharing
>>>
>>> As you can see, there are gaps and jumps in the transmission throughout
>>> the transfer.
>>> It is very illustrative to look at captures of the gaps, which are
>>> occurring on both interfaces:
>>>
>>> https://drive.google.com/file/d/0BwyUMjibonYQc0VISXN6eVFwQzg/view?usp=sharing
>>> https://drive.google.com/file/d/0BwyUMjibonYQVFREUHp3TGFiUU0/view?usp=sharing
>>>
>>> As you can see, there are ~0.4 second pauses in transmission from the
>>> storage server, which kills the transfer rate.
>>> It's clear that the ESXi box ACKs the prior iSCSI operation to
>>> completion, then makes a new LUN request, which the storage server
>>> immediately replies to. The ESXi ACKs the response packet from the
>>> storage server, then waits...and waits....and waits... until eventually
>>> the storage server starts transmitting again.
>>>
>>> Because the pause happens while the ESXi client is waiting for a packet
>>> from the storage server, that tells me that the gaps are not an artifact
>>> of traffic being switched between both active interfaces, but are
>>> actually indicative of short hangs occurring on the server.
>>>
>>> Having a pause or two in transmission is no big deal, but in my case, it
>>> is happening constantly, and dropping my overall read transfer rate down
>>> to 20-60MB/s, which is slower than the single interface transfer rate
>>> (~90-100MB/s).
>>>
>>> Decreasing the MTU makes the pauses shorter; increasing it makes the
>>> pauses longer.
>>>
>>> Another interesting thing is that if I set the multipath io interval to
>>> 3 operations instead of 1, I get better throughput. In other words, the
>>> less frequently I swap IP addresses on my iSCSI requests from the ESXi
>>> unit, the fewer pauses I see.
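>>> (For reference, the interval in question is the round-robin IOPS setting
>>> on the ESXi path selection policy; roughly, with a placeholder device ID:
>>>
>>> esxcli storage nmp psp roundrobin deviceconfig set --type=iops --iops=3 --device=naa.xxxx
>>>
>>> applied per datastore device.)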
>>>
>>> Basically, COMSTAR seems to choke each time an iSCSI request from a new
>>> IP arrives.
>>>
>>> Because the single interface transfer is near line rate, that tells me
>>> that the storage system (mpt_sas, zfs, etc) is working fine. It's only
>>> when multiple paths are attempted that iSCSI falls on its face during
>>> reads.
>>>
>>> All of these captures were taken without a cache device being attached
>>> to the storage zpool, so this isn't looking like some kind of ZFS ARC
>>> problem. As mentioned previously, local transfers to/from the zpool are
>>> showing ~300-500 MB/s rates over long transfers (10G+).
>>>
>>> -Warren V
>>>
>>> On Sun, Mar 1, 2015 at 9:11 PM, Garrett D'Amore <garrett at damore.org> wrote:
>>>
>>> I’m not sure I’ve followed properly. You have *two* interfaces.
>>> You are not trying to provision these in an aggr are you? As far as
>>> I’m aware, VMware does not support 802.3ad link aggregations. (It’s
>>> possible that you can make it work with ESXi if you give the entire
>>> NIC to the guest — but I’m skeptical.) The problem is that if you
>>> try to use link aggregation, some packets (up to half!) will be
>>> lost. TCP and other protocols fare poorly in this situation.
>>>
>>> It’s possible I’ve totally misunderstood what you’re trying to do, in
>>> which case I apologize.
>>>
>>> The idle thing is a red herring — the cpu is waiting for work to do,
>>> probably because packets haven’t arrived (or were dropped by the
>>> hypervisor!) I wouldn’t read too much into that except that your
>>> network stack is in trouble. I’d look a bit more closely at the
>>> kstats for tcp — I suspect you’ll see retransmits or out of order
>>> values that are unusually high — if so this may help validate my
>>> theory above.
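>>> (A quick way to eyeball those counters is something like
>>> "kstat -m tcp | egrep -i 'retrans|unorder'" -- the exact statistic names
>>> vary a bit between releases.)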
>>>
>>> - Garrett
>>>
>>> On Mar 1, 2015, at 9:03 PM, W Verb via illumos-developer <developer at lists.illumos.org> wrote:
>>>>
>>>> Hello all,
>>>>
>>>>
>>>> Well, I no longer blame the ixgbe driver for the problems I'm
>>>> seeing.
>>>>
>>>>
>>>> I tried Joerg's updated driver, which didn't improve the issue. So
>>>> I went back to the drawing board and rebuilt the server from
>>>> scratch.
>>>>
>>>> What I noted is that if I have only a single 1-gig physical
>>>> interface active on the ESXi host, everything works as expected.
>>>> As soon as I enable two interfaces, I start seeing the performance
>>>> problems I've described.
>>>>
>>>> Response pauses from the server that I see in TCPdumps are still
>>>> leading me to believe the problem is delay on the server side, so
>>>> I ran a series of kernel dtraces and produced some flamegraphs.
>>>>
>>>>
>>>> This was taken during a read operation with two active 10G
>>>> interfaces on the server, with a single target being shared by two
>>>> tpgs- one tpg for each 10G physical port. The host device has two
>>>> 1G ports enabled, with VLANs separating the active ports into
>>>> 10G/1G pairs. ESXi is set to multipath using both VLANS with a
>>>> round-robin IO interval of 1.
>>>>
>>>> https://drive.google.com/file/d/0BwyUMjibonYQd3ZYOGh4d2pteGs/view?usp=sharing
>>>>
>>>>
>>>> This was taken during a write operation:
>>>>
>>>> https://drive.google.com/file/d/0BwyUMjibonYQMnBtU1Q2SXM2ams/view?usp=sharing
>>>>
>>>>
>>>> I then rebooted the server and disabled C-State, ACPI T-State, and
>>>> general EIST (Turbo boost) functionality in the CPU.
>>>>
>>>> When I attempted to boot my guest VM, the iSCSI transfer
>>>> gradually ground to a halt during the boot loading process, and
>>>> the guest OS never did complete its boot process.
>>>>
>>>> Here is a flamegraph taken while iSCSI is slowly dying:
>>>>
>>>> https://drive.google.com/file/d/0BwyUMjibonYQM21JeFZPX3dZWTg/view?usp=sharing
>>>>
>>>>
>>>> I edited out cpu_idle_adaptive from the dtrace output and
>>>> regenerated the slowdown graph:
>>>>
>>>> https://drive.google.com/file/d/0BwyUMjibonYQbTVwV3NvXzlPS1E/view?usp=sharing
>>>>
>>>>
>>>> I then edited cpu_idle_adaptive out of the speedy write operation
>>>> and regenerated that graph:
>>>>
>>>> https://drive.google.com/file/d/0BwyUMjibonYQeWFYM0pCMDZ1X2s/view?usp=sharing
>>>>
>>>>
>>>> I have zero experience with interpreting flamegraphs, but the most
>>>> significant difference I see between the slow read example and the
>>>> fast write example is in unix`thread_start --> unix`idle. There's
>>>> a good chunk of "unix`i86_mwait" in the read example that is not
>>>> present in the write example at all.
>>>>
>>>> Disabling the l2arc cache device didn't make a difference, and I
>>>> had to reenable EIST support on the CPU to get my VMs to boot.
>>>>
>>>> I am seeing a variety of bug reports going back to 2010 regarding
>>>> excessive mwait operations, with the suggested solutions usually
>>>> being to set "cpupm enable poll-mode" in power.conf. That change
>>>> also had no effect on speed.
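>>>> (That is, /etc/power.conf gets the line
>>>>
>>>> cpupm enable poll-mode
>>>>
>>>> and pmconfig is run afterwards to apply it.)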
>>>>
>>>> -Warren V
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>>
>>>> From: Chris Siebenmann [mailto:cks at cs.toronto.edu]
>>>>
>>>> Sent: Monday, February 23, 2015 8:30 AM
>>>>
>>>> To: W Verb
>>>>
>>>> Cc: omnios-discuss at lists.omniti.com; cks at cs.toronto.edu
>>>>
>>>> Subject: Re: [OmniOS-discuss] The ixgbe driver, Lindsay Lohan, and
>>>> the Greek economy
>>>>
>>>>
>>>> > Chris, thanks for your specific details. I'd appreciate it if you
>>>>
>>>> > could tell me which copper NIC you tried, as well as to pass on
>>>> the
>>>>
>>>> > iSCSI tuning parameters.
>>>>
>>>>
>>>> Our copper NIC experience is with onboard X540-AT2 ports on
>>>> SuperMicro hardware (which have the guaranteed 10-20 msec lock
>>>> hold) and dual-port 82599EB TN cards (which have some sort of
>>>> driver/hardware failure under load that eventually leads to
>>>> 2-second lock holds). I can't recommend either with the current
>>>> driver; we had to revert to 1G networking in order to get stable
>>>> servers.
>>>>
>>>>
>>>> The iSCSI parameter modifications we do, across both initiators
>>>> and targets, are:
>>>>
>>>>
>>>> initialr2t            no
>>>> firstburstlength      128k
>>>> maxrecvdataseglen     128k    [only on Linux backends]
>>>> maxxmitdataseglen     128k    [only on Linux backends]
>>>>
>>>>
>>>> The OmniOS initiator doesn't need tuning for more than the first
>>>> two parameters; on the Linux backends we tune up all four. My
>>>> extended thoughts on these tuning parameters and why we touch them
>>>> can be found
>>>>
>>>> here:
>>>>
>>>>
>>>> http://utcc.utoronto.ca/~cks/space/blog/tech/UnderstandingiSCSIProtocol
>>>>
>>>> http://utcc.utoronto.ca/~cks/space/blog/tech/LikelyISCSITuning
>>>>
>>>>
>>>> The short version is that these parameters probably only make a
>>>> small difference but their overall goal is to do 128KB ZFS reads
>>>> and writes in single iSCSI operations (although they will be
>>>> fragmented at the TCP
>>>>
>>>> layer) and to do iSCSI writes without a back-and-forth delay
>>>> between initiator and target (that's 'initialr2t no').
>>>>
>>>>
>>>> I think basically everyone should use InitialR2T set to no and in
>>>> fact that it should be the software default. These days only
>>>> unusually limited iSCSI targets should need it to be otherwise and
>>>> they can change their setting for it (initiator and target must
>>>> both agree to it being 'yes', so either can veto it).
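>>>> (As a concrete sketch of the target-side settings -- assuming an IET-style
>>>> ietd.conf on the Linux backends, which may not match the stack actually in
>>>> use -- the per-target stanza would look roughly like:
>>>>
>>>> Target iqn.2015-02.example:store0
>>>>     InitialR2T                No
>>>>     FirstBurstLength          131072
>>>>     MaxRecvDataSegmentLength  131072
>>>>     MaxXmitDataSegmentLength  131072
>>>> )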
>>>>
>>>>
>>>> - cks
>>>>
>>>>
>>>>
>>>> On Mon, Feb 23, 2015 at 8:21 AM, Joerg Goltermann <jg at osn.de> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I think your problem is caused by your link properties or your
>>>> switch settings. In general the standard ixgbe seems to perform
>>>> well.
>>>>
>>>> I had trouble after changing the default flow control settings
>>>> to "bi"
>>>> and this was my motivation to update the ixgbe driver a long
>>>> time ago.
>>>> After I have updated our systems to ixgbe 2.5.8 I never had any
>>>> problems ....
>>>>
>>>> Make sure your switch has support for jumbo frames and you use
>>>> the same mtu on all ports, otherwise the smallest will be used.
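>>>> (A quick way to confirm what is actually in effect on the OmniOS side,
>>>> with the link name adjusted to your installation:
>>>>
>>>> dladm show-linkprop -p mtu ixgbe0
>>>> )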
>>>>
>>>> What switch do you use? I can tell you nice horror stories about
>>>> different vendors....
>>>>
>>>> - Joerg
>>>>
>>>> On 23.02.2015 10:31, W Verb wrote:
>>>>
>>>> Thank you Joerg,
>>>>
>>>> I've downloaded the package and will try it tomorrow.
>>>>
>>>> The only thing I can add at this point is that upon review
>>>> of my
>>>> testing, I may have performed my "pkg -u" between the
>>>> initial quad-gig
>>>> performance test and installing the 10G NIC. So this may
>>>> be a new
>>>> problem introduced in the latest updates.
>>>>
>>>> Those of you who are running 10G and have not upgraded to
>>>> the latest
>>>> kernel, etc, might want to do some additional testing
>>>> before running the
>>>> update.
>>>>
>>>> -Warren V
>>>>
>>>> On Mon, Feb 23, 2015 at 1:15 AM, Joerg Goltermann <jg at osn.de> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I remember there was a problem with the flow control
>>>> settings in the
>>>> ixgbe
>>>> driver, so I updated it a long time ago for our
>>>> internal servers to
>>>> 2.5.8.
>>>> Last weekend I integrated the latest changes from the
>>>> FreeBSD driver
>>>> to bring
>>>> the illumos ixgbe to 2.5.25 but I had no time to test
>>>> it, so it's
>>>> completely
>>>> untested!
>>>>
>>>>
>>>> If you would like to give the latest driver a try you
>>>> can fetch the
>>>> kernel modules from
>>>> https://cloud.osn.de/index.php/s/Fb4so9RsNnXA7r9
>>>>
>>>> Clone your boot environment, place the modules in the
>>>> new environment
>>>> and update the boot-archive of the new BE.
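>>>> (Roughly, and from memory -- double-check the paths against your install:
>>>>
>>>> beadm create ixgbe-test
>>>> beadm mount ixgbe-test /mnt
>>>> cp ixgbe /mnt/kernel/drv/amd64/ixgbe
>>>> bootadm update-archive -R /mnt
>>>> beadm activate ixgbe-test
>>>> )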
>>>>
>>>> - Joerg
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 23.02.2015 02:54, W Verb wrote:
>>>>
>>>> By the way, to those of you who have working
>>>> setups: please send me
>>>> your pool/volume settings, interface linkprops,
>>>> and any kernel
>>>> tuning
>>>> parameters you may have set.
>>>>
>>>> Thanks,
>>>> Warren V
>>>>
>>>> On Sat, Feb 21, 2015 at 7:59 AM, Schweiss, Chip <chip at innovates.com> wrote:
>>>>
>>>> I can't say I totally agree with your
>>>> performance
>>>> assessment. I run Intel
>>>> X520 in all my OmniOS boxes.
>>>>
>>>> Here is a capture of nfssvrtop I made while
>>>> running many
>>>> storage vMotions
>>>> between two OmniOS boxes hosting NFS
>>>> datastores. This is a
>>>> 10 host VMware
>>>> cluster. Both OmniOS boxes are dual 10G
>>>> connected with
>>>> copper twin-ax to
>>>> the in rack Nexus 5010.
>>>>
>>>> VMware does 100% sync writes, I use ZeusRAM
>>>> SSDs for log
>>>> devices.
>>>>
>>>> -Chip
>>>>
>>>> 2014 Apr 24 08:05:51, load: 12.64, read: 17330243 KB, swrite: 15985 KB, awrite: 1875455 KB
>>>>
>>>> Ver  Client         NFSOPS  Reads  SWrites  AWrites  Commits    Rd_bw  SWr_bw  AWr_bw  Rd_t  SWr_t  AWr_t  Com_t  Align%
>>>>   4  10.28.17.105        0      0        0        0        0        0       0       0     0      0      0      0       0
>>>>   4  10.28.17.215        0      0        0        0        0        0       0       0     0      0      0      0       0
>>>>   4  10.28.17.213        0      0        0        0        0        0       0       0     0      0      0      0       0
>>>>   4  10.28.16.151        0      0        0        0        0        0       0       0     0      0      0      0       0
>>>>   4  all                 1      0        0        0        0        0       0       0     0      0      0      0       0
>>>>   3  10.28.16.175        3      0        3        0        0        1      11       0  4806     48      0      0      85
>>>>   3  10.28.16.183        6      0        6        0        0        3     162       0   549    124      0      0      73
>>>>   3  10.28.16.180       11      0       10        0        0        3      27       0   776     89      0      0      67
>>>>   3  10.28.16.176       28      2       26        0        0       10     405       0  2572    198      0      0     100
>>>>   3  10.28.16.178     4606   4602        4        0        0   294534       3       0   723     49      0      0      99
>>>>   3  10.28.16.179     4905   4879       26        0        0   312208     311       0   735    271      0      0      99
>>>>   3  10.28.16.181     5515   5502       13        0        0   352107      77       0    89     87      0      0      99
>>>>   3  10.28.16.184    12095  12059       10        0        0   763014      39       0   249    147      0      0      99
>>>>   3  10.28.58.1      15401   6040      116     6354       53   191605     474  202346   192     96    144     83      99
>>>>   3  all             42574  33086      217     6354       53  1913488    1582  202300   348    138    153    105      99
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Feb 20, 2015 at 11:46 PM, W Verb <wverb73 at gmail.com> wrote:
>>>>
>>>>
>>>> Hello All,
>>>>
>>>> Thank you for your replies.
>>>> I tried a few things, and found the
>>>> following:
>>>>
>>>> 1: Disabling hyperthreading support in the
>>>> BIOS drops
>>>> performance overall
>>>> by a factor of 4.
>>>> 2: Disabling VT support also seems to have
>>>> some effect,
>>>> although it
>>>> appears to be minor. But this has the
>>>> amusing side
>>>> effect of fixing the
>>>> hangs I've been experiencing with fast
>>>> reboot. Probably
>>>> by disabling kvm.
>>>> 3: The performance tests are a bit tricky
>>>> to quantify
>>>> because of caching
>>>> effects. In fact, I'm not entirely sure
>>>> what is
>>>> happening here. It's just
>>>> best to describe what I'm seeing:
>>>>
>>>> The commands I'm using to test are
>>>> dd if=/dev/zero of=./test.dd bs=2M
>>>> count=5000
>>>> dd of=/dev/null if=./test.dd bs=2M
>>>> count=5000
>>>> The host vm is running Centos 6.6, and has
>>>> the latest
>>>> vmtools installed.
>>>> There is a host cache on an SSD local to
>>>> the host that
>>>> is also in place.
>>>> Disabling the host cache didn't
>>>> immediately have an
>>>> effect as far as I could
>>>> see.
>>>>
>>>> The host MTU was set to 3000 on all iSCSI
>>>> interfaces for all
>>>> tests.
>>>>
>>>> Test 1: Right after reboot, with an ixgbe
>>>> MTU of 9000,
>>>> the write test
>>>> yields an average speed over three tests
>>>> of 137MB/s. The
>>>> read test yields an
>>>> average over three tests of 5MB/s.
>>>>
>>>> Test 2: After setting "ifconfig ixgbe0 mtu
>>>> 3000", the
>>>> write tests yield
>>>> 140MB/s, and the read tests yield 53MB/s.
>>>> It's important
>>>> to note here that
>>>> if I cut the read test short at only
>>>> 2-3GB, I get
>>>> results upwards of
>>>> 350MB/s, which I assume is local
>>>> cache-related distortion.
>>>>
>>>> Test 3: MTU of 1500. Read tests are up to
>>>> 156 MB/s.
>>>> Write tests yield
>>>> about 142MB/s.
>>>> Test 4: MTU of 1000: Read test at 182MB/s.
>>>> Test 5: MTU of 900: Read test at 130 MB/s.
>>>> Test 6: MTU of 1000: Read test at 160MB/s.
>>>> Write tests
>>>> are now
>>>> consistently at about 300MB/s.
>>>> Test 7: MTU of 1200: Read test at 124MB/s.
>>>> Test 8: MTU of 1000: Read test at 161MB/s.
>>>> Write at 261MB/s.
>>>>
>>>> A few final notes:
>>>> L1ARC grabs about 10GB of RAM during the
>>>> tests, so
>>>> there's definitely some
>>>> read caching going on.
>>>> The write operations are easier to observe
>>>> with iostat,
>>>> and I'm seeing io
>>>> rates that closely correlate with the
>>>> network write speeds.
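>>>> (Roughly:
>>>>
>>>> iostat -xn 5
>>>>
>>>> running alongside the transfer, for anyone reproducing the comparison.)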
>>>>
>>>>
>>>> Chris, thanks for your specific details.
>>>> I'd appreciate
>>>> it if you could
>>>> tell me which copper NIC you tried, as
>>>> well as to pass
>>>> on the iSCSI tuning
>>>> parameters.
>>>>
>>>> I've ordered an Intel EXPX9502AFXSR, which
>>>> uses the
>>>> 82598 chip instead of
>>>> the 82599 in the X520. If I get similar
>>>> results with my
>>>> fiber transceivers,
>>>> I'll see if I can get a hold of copper ones.
>>>>
>>>> But I should mention that I did indeed
>>>> look at PHY/MAC
>>>> error rates, and
>>>> they are nil.
>>>>
>>>> -Warren V
>>>>
>>>> On Fri, Feb 20, 2015 at 7:25 PM, Chris Siebenmann <cks at cs.toronto.edu> wrote:
>>>>
>>>>
>>>> After installation and
>>>> configuration, I observed
>>>> all kinds of bad
>>>> behavior
>>>> in the network traffic between the
>>>> hosts and the
>>>> server. All of this
>>>> bad
>>>> behavior is traced to the ixgbe
>>>> driver on the
>>>> storage server. Without
>>>> going
>>>> into the full troubleshooting
>>>> process, here are
>>>> my takeaways:
>>>>
>>>> [...]
>>>>
>>>> For what it's worth, we managed to
>>>> achieve much
>>>> better line rates on
>>>> copper 10G ixgbe hardware of various
>>>> descriptions
>>>> between OmniOS
>>>> and CentOS 7 (I don't think we ever
>>>> tested OmniOS to
>>>> OmniOS). I don't
>>>> believe OmniOS could do TCP at full
>>>> line rate but I
>>>> think we managed 700+
>>>> Mbytes/sec on both transmit and
>>>> receive and we got
>>>> basically disk-limited
>>>> speeds with iSCSI (across multiple
>>>> disks on
>>>> multi-disk mirrored pools,
>>>> OmniOS iSCSI initiator, Linux iSCSI
>>>> targets).
>>>>
>>>> I don't believe we did any specific
>>>> kernel tuning
>>>> (and in fact some of
>>>> our attempts to fiddle ixgbe driver
>>>> parameters blew
>>>> up in our face).
>>>> We did tune iSCSI connection
>>>> parameters to increase
>>>> various buffer
>>>> sizes so that ZFS could do even large
>>>> single
>>>> operations in single iSCSI
>>>> transactions. (More details available
>>>> if people are
>>>> interested.)
>>>>
>>>> 10: At the wire level, the speed
>>>> problems are
>>>> clearly due to pauses in
>>>> response time by omnios. At 9000
>>>> byte frame
>>>> sizes, I see a good number
>>>> of duplicate ACKs and fast
>>>> retransmits during
>>>> read operations (when
>>>> omnios is transmitting). But below
>>>> about a
>>>> 4100-byte MTU on omnios
>>>> (which seems to correlate to
>>>> 4096-byte iSCSI
>>>> block transfers), the
>>>> transmission errors fade away and
>>>> we only see
>>>> the transmission pause
>>>> problem.
>>>>
>>>>
>>>> This is what really attracted my
>>>> attention. In
>>>> our OmniOS setup, our
>>>> specific Intel hardware had ixgbe
>>>> driver issues that
>>>> could cause
>>>> activity stalls during once-a-second
>>>> link heartbeat
>>>> checks. This
>>>> obviously had an effect at the TCP and
>>>> iSCSI layers.
>>>> My initial message
>>>> to illumos-developer sparked a
>>>> potentially
>>>> interesting discussion:
>>>>
>>>>
>>>> http://www.listbox.com/member/archive/182179/2014/10/sort/time_rev/page/16/entry/6:405/20141003125035:6357079A-4B1D-11E4-A39C-D534381BA44D/
>>>>
>>>> If you think this is a possibility in
>>>> your setup,
>>>> I've put the DTrace
>>>> script I used to hunt for this up on
>>>> the web:
>>>>
>>>> http://www.cs.toronto.edu/~cks/src/omnios-ixgbe/ixgbe_delay.d
>>>>
>>>> This isn't the only potential source
>>>> of driver
>>>> stalls by any means, it's
>>>> just the one I found. You may also
>>>> want to look at
>>>> lockstat in general,
>>>> as information it reported is what led
>>>> us to look
>>>> specifically at the
>>>> ixgbe code here.
>>>>
>>>> (If you suspect kernel/driver issues,
>>>> lockstat
>>>> combined with kernel
>>>> source is a really excellent resource.)
>>>>
>>>> - cks
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> OmniOS-discuss mailing list
>>>> OmniOS-discuss at lists.omniti.com
>>>> http://lists.omniti.com/mailman/listinfo/omnios-discuss
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> OSN Online Service Nuernberg GmbH, Bucher Str. 78, 90408 Nuernberg
>>>> Tel: +49 911 39905-0 - Fax: +49 911 39905-55 - http://www.osn.de
>>>> HRB 15022 Nuernberg, USt-Id: DE189301263, GF: Joerg Goltermann
>>>>
>>>>
>>>>
>>>> --
>>>> OSN Online Service Nuernberg GmbH, Bucher Str. 78, 90408 Nuernberg
>>>> Tel: +49 911 39905-0 - Fax: +49 911 39905-55 - http://www.osn.de
>>>> HRB 15022 Nuernberg, USt-Id: DE189301263, GF: Joerg Goltermann
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>> --
>> OSN Online Service Nuernberg GmbH, Bucher Str. 78, 90408 Nuernberg
>> Tel: +49 911 39905-0 - Fax: +49 911 39905-55 - http://www.osn.de
>> HRB 15022 Nuernberg, USt-Id: DE189301263, GF: Joerg Goltermann
>>
>
>
>
>