[OmniOS-discuss] [developer] Re: The ixgbe driver, Lindsay Lohan, and the Greek economy

W Verb wverb73 at gmail.com
Sun Mar 8 22:58:01 UTC 2015


Hello,

I was able to perform my last round of testing last night. The tests were
done with a single host, enabling either one or two 1G ports.

                 1 port (Read)      2 ports (Read)
Baseline:        130MB/s            30MB/s
Disable LRO:     90MB/s             27MB/s
Disable LRO/LSO: 88MB/s             27MB/s

LRO/LSO enabled, TCP window size varies
Default iscsid (256k max)

64k Window     96MB/s               28MB/s
32k Window     72MB/s               22MB/s
16k Window     61MB/s               17MB/s
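
(For reference: the window-size tests above were apparently driven from the
initiator side. A rough server-side equivalent for the read direction is to
cap the TCP send buffer with the illumos ipadm protocol properties; a sketch,
values in bytes:)

  # limit unacknowledged in-flight data per connection from the OmniOS side
  ipadm set-prop -p send_buf=65536 tcp
  # revert to the default afterwards
  ipadm reset-prop -p send_buf tcp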

I then set everything back to the defaults and captured exactly what happens
when I start a single-port transfer and then enable a second port in the
middle. It's pretty illustrative. The server chokes a bit, then strangely
sends TDS protocol packets saying "exception occurred". I didn't know TDS
had anything to do with iSCSI.

Captures from both interfaces here:
https://drive.google.com/open?id=0BwyUMjibonYQMG8zZnNWbk40Ymc&authuser=0

So it seems that window size isn't the limiting factor here.

I am in the middle of implementing infiniband now. I can highly recommend
the Silverstorm (QLogic) 9024CU 20G 4096MTU (with latest firmware) switch
for the lab. The fans run very quietly at normal temps, and they are very
inexpensive ($250 on eBay). It supports ethernet out-of-band management, as
well as a subnet manager web app hosted from the switch itself. Creating
the serial console cable was mildly irritating.

The latest firmware can be retrieved from the QLogic site via a Google
search; you won't find a link on their support front page.

I'll report back once I have iSER / SRP results.


-Warren V

On Wed, Mar 4, 2015 at 9:14 AM, Mallory, Rob <rmallory at qualcomm.com> wrote:

>  Hi Warren,
>
> [ …no objections here if you want to take this thread off-line to a
> smaller group… I wanted to post this to the larger groups for the benefit
> of others, and maybe if you find success in the end you can post back to
> the larger groups with a summary ]
>
>
>
>    Your recent success case going to 10GbE end to end seems to back up my
> theory of overload.
>
>     I noticed a couple of things yesterday: in the packet capture of the
> sender, it is apparent that large send offload (LSO) is being used (notice
> the packet sizes up to 64k). Among other things, this makes it a bit
> harder to tune and understand what is happening from packet captures on the
> hosts. It also lets the NIC “tightly pack” the outgoing packet stream
> without much gap between the MTU sized packets. You need to get a 3rd-party
> snoop on the wire to see what the wire sees. Same thing on the ESXi
> server.  I suspect it is using LRO or RSC, also ganging up the packets, and
> making it difficult to diagnose from a tcpdump on a VM.
>
>
>
> I still stand by my original inclination, and the data you have shown
> also seems to back this (smaller MTU = fewer drops/pauses, and then the
> latest: equal-size pipes on send and receive make it work fine). I was
> recommending that you decrease the rwin on the client side (limiting it
> to 17KB, absurdly small in general but not for this case), on the server,
> or both.
>
>
>
> It makes more sense to control the window size on the server (to tune
> just for these small-bandwidth clients) and use a host route with -sendpipe
> (or is it -ssthresh?), because then the server will limit (and hopefully set
> some timers on the LSO part of the ixgbe) the amount of data in flight so that
> the last switch hop does not drop those important initial packets during
> TCP slow start.
>
>
>
> So my original recommendation still stands:  limit the rwin on the client
> to 17K, if you want to continue to use a 1GbE interface on it.
>
> (your BDP @ 150us * 1GbE is about 24k, I’d pick a max receive window size
> smaller than this to be more conservative in the case of two interfaces)
>
> (and yes,  only 2 x 9K jumbo frames fit in that BDP, so those 3k jumbos
> had less loss for a good reason, 1500MTU is probably best in your case)
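>
> (A quick back-of-the-envelope check of that BDP, using the 150us RTT
> estimate from earlier in the thread; a sketch:
>
>   awk 'BEGIN { print 125000000 * 0.000150 }'   # 1 Gbit/s = 125 MB/s; prints 18750 bytes
>
> which is roughly two 9K jumbo frames and close to the 17K rwin suggested
> above; the "about 24k" figure presumably assumes a slightly longer RTT.)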
>
> And two other things to help identify/understand the situation (in the
> mode you had it before, with the quad-1GbE in the client):
>
> You can turn off LSO (in ixgbe.conf) and also on the ESX side, and on the
> client VM side you can turn off LRO or RSC.
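>
> For the client-VM side, a minimal sketch of what disabling receive
> coalescing looks like (the interface name and the ESXi advanced-option
> path are assumptions; double-check them against your versions):
>
>   # inside the CentOS guest: turn off LRO/GRO
>   ethtool -K eth0 lro off gro off
>
>   # on the ESXi host: disable LRO in the vmkernel TCP/IP stack
>   esxcli system settings advanced set -o /Net/TcpipDefLROEnabled -i 0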
>
>
>
> Note: there was (and maybe still is) a long-standing bug in the S10 and S11
> TCP stack, which I know of only second-hand from a reliable source.
>
> It involves very similar conditions to what you describe here, including
> high-bandwidth servers, low-bandwidth clients, and multiple hops.
>
> The workaround is to disable RSC on the Linux clients.
>
>
>
> Good to hear that you can configure the system end-to-end 10GbE.  That's
> the obvious best case if you stick with Ethernet, and you don't have to go
> to extremes like the above.   Note that the lossless fabric of InfiniBand will
> completely hide these effects of TCP loss.  Also, you can use a much more
> efficient transport such as SRP (RDMA), which I think would fit really well
> if you can afford the additional complexity.   (I've done this, and it's
> really not that hard at small scale).
>
>
>
> Cheers,   Rob
>
>
>
>
>
> *From:* W Verb [mailto:wverb73 at gmail.com]
> *Sent:* Tuesday, March 03, 2015 9:22 PM
> *To:* Mallory, Rob; illumos-dev
> *Cc:* Garrett D'Amore; Joerg Goltermann; omnios-discuss at lists.omniti.com
>
> *Subject:* Re: [developer] Re: [OmniOS-discuss] The ixgbe driver, Lindsay
> Lohan, and the Greek economy
>
>
>
> Hello all,
>
> This is probably the last message in this thread.
>
> I pulled the quad-gig NIC out of one of my hosts, and installed an X520. I
> then set a single 10G port on the server to be on the same VLAN as the
> host, and defined a vswitch, vmknic, etc on the host.
>
> I set the MTU to be 9000 on both sides, then ran my tests.
>
> Read:  130 MB/s.
>
> Write:  156 MB/s.
>
> Additionally, at higher MTUs, the NIC would periodically lock up until I
> performed an "ipadm disable-if -t ixgbe0" and re-enabled it. I tried your
> updated driver, Joerg, but unfortunately it failed quite often.
>
>
>
> I then disabled stmf, enabled NFS (v3 only) on the server, and shared a
> dataset on the zpool with "share -f nfs /ppool/testy".
> I then mounted the server dataset on the host via NFS, and copied my test
> VM from the iSCSI zvol to the NFS dataset. I also removed the binding of
> the 10G port on the host from the sw iscsi interface.
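>
> For reference, a sketch of the equivalent server- and host-side commands
> (the server address below is a placeholder):
>
>   # OmniOS: enable the NFS server and share the dataset
>   svcadm enable -r nfs/server
>   zfs set sharenfs=on ppool/testy
>
>   # ESXi: mount the share as an NFS datastore
>   esxcfg-nas -a -o 10.0.0.10 -s /ppool/testy testy-nfs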
>
> Running the same tests on the VM over NFSv3 yielded:
>
> Read: 650MB/s
>
> Write: 306MB/s
>
> This is getting within 10% of the throughput I consistently get from dd
> operations run locally on the server, so I'm pretty happy that I'm getting as
> good as I'm going to get until I add more drives. Additionally, I haven't
> experienced any NIC hangs.
>
>
>
> I tried varying the settings in ixgbe.conf, the MTU, and disabling LSO on
> the host and server, but nothing really made that much of a difference
> (except reducing the MTU made things about 20-30% slower).
>
> mpstat during both NFS and iSCSI transfers showed all processors as
> getting roughly the same number of interrupts, etc., although I did see a
> varying number of spins on reader/writer locks during the iSCSI transfers.
> The NFS transfers showed no srw spins at all.
>
> Here is a pretty representative example of a 1s mpstat during an iSCSI
> transfer:
>
> CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl set
>   0    0   0    0  3246 2690 8739    6  772 5967    2     0    0  11   0  89   0
>   1    0   0    0  2366 2249 7910    8  988 5563    2   302    0   9   0  91   0
>   2    0   0    0  2455 2344 5584    5  687 5656    3    66    0   9   0  91   0
>   3    0   0   25   248   12 6210    1  885 5679    2     0    0   9   0  91   0
>   4    0   0    0   284    7 5450    2  861 5751    1     0    0   8   0  92   0
>   5    0   0    0   232    3 4513    0  547 5733    3     0    0   7   0  93   0
>   6    0   0    0   322    8 6084    1  836 6295    2     0    0   8   0  92   0
>   7    0   0    0  3114 2848 8229    4  648 4966    2     0    0  10   0  90   0
>
>
>
> So, it seems that it's COMSTAR/iSCSI that's broke as hell, not ixgbe. My
> apologies to anyone I may have offended with my pre-judgement.
>
> The consequences of this performance issue are significant:
>
> 1: Instead of being able to utilize the existing quad-port NICs I have in
> my hosts, I must use dual 10G cards for redundancy purposes.
>
> 2: I must build out a full 10G switching infrastructure.
>
> 3: The network traffic is inherently less secure, as it is essentially
> impossible to do real security with the NFSv3 variant that ESXi supports.
>
> In the short run, I have already ordered some relatively cheap 20G
> infiniband gear that will hopefully push up the cost/performance ratio.
> However, I have received all sorts of advice about how painful it can be to
> build and maintain infiniband, and if iSCSI over 10G ethernet is this
> painful, I'm not hopeful that infiniband will "just work".
>
> The last option, of course, is to bail out of the Solaris derivatives and
> move to ZoL or ZoBSD. The drawbacks of this are:
>
> 1: ZoL doesn't easily support booting off of mirrored USB flash drives,
> let alone running the root filesystem and swap on them. FreeNAS, by way of
> comparison, puts a 2G swap partition on each zdev, which (strangely enough)
> causes it to often crash when a zdev experiences a failure under load.
>
> 2: Neither ZoL nor FreeNAS has a good, stable, kernel-based iSCSI
> implementation. FreeNAS is indeed testing istgt, but it proved unstable
> for my purposes in recent builds. Unfortunately, stmf hasn't proved itself
> any better.
>
>
>
> There are other minor differences, but these are the ones that brought me
> to OmniOS in the first place. We'll just have to wait and see how well the
> infiniband stuff works.
>
>   Hopefully this exercise will help prevent others from going down the
> same rabbit-hole that I did.
>
> -Warren V
>
>
>
>
>
>
>
>
>
> On Tue, Mar 3, 2015 at 3:45 PM, W Verb <wverb73 at gmail.com> wrote:
>
>       Hello Rob et al,
>
> Thank you for taking the time to look at this problem with me. I
> completely understand your inclination to look at the network as the most
> probable source of my issue, but I believe that this is a pretty clear-cut
> case of server-side issues.
>
> 1: I did run ping RTT tests during both read and write operations with
> multiple interfaces enabled, and the RTT stayed at ~.2ms regardless of
> whether traffic was actively being transmitted/received or not.
>
> 2: I am not seeing the TCP window size bouncing around, and I am certainly
> not seeing starvation and delay in my packet captures. It is true that I do
> see delayed ACKs and retransmissions when I bump the MTU to 9000 on both
> sides, but I stopped testing with high MTU as soon as I saw it happening
> because I have a good understanding of incast. All of my recent testing has
> been with MTUs between 1000 and 3000 bytes.
>
> 3: When testing with MTUs between 1000 and 3000 bytes, I do not see lost
> packets and retransmission in captures on either the server or client side.
> I only see staggered transmission delays on the part of the server.
>
> 4: The client is consistently advertising a large window size (20k+), so
> the TCP throttling mechanism does not appear to play into this.
>
> 5: As mentioned previously, layer 2 flow control is not enabled anywhere
> in the network, so there are no lower-level mechanisms at work.
>
> 6: Upon checking buffer and queue sizes (and doing the appropriate
> research into documentation on the C3560E's buffer sizes), I do not see
> large numbers of frames being dropped by the switch. It does happen at
> larger MTUs, but not very often (and not consistently) during transfers at
> 1000-3000 byte MTUs. I do not have QoS, policing, or rate-shaping enabled.
>
> 7: Network interface stats on both the server and the ESXi client show no
> errors of any kind. This is via netstat on the server, and esxcli / Vsphere
> client on the ESXi box.
>
> 8: When looking at captures taken simultaneously on the server and client
> side, the server-side transmission pauses are consistently seen and
> reproducible, even after multiple server rebuilds, ESXi rebuilds, vSphere
> reinstallations (down to wiping the SQL db), various COMSTAR configuration
> variations, multiple 10G NICs with different NIC chipsets, multiple
> switches (I tried both a 48-port and 24-port C3560E), multiple IOS
> revisions (12.2 and 15.0), OmniOS versions (r151012 and previous) multiple
> cables, transceivers, etc etc etc etc etc
>
>
> For your review, I have uploaded the actual packet captures to Google
> Drive:
>
>
> https://drive.google.com/file/d/0BwyUMjibonYQZS03dHJ2ZjJvTEE/view?usp=sharing
> 2 int write - ESXi vmk5
>
> https://drive.google.com/file/d/0BwyUMjibonYQTlNTQ2M5bjlxZ00/view?usp=sharing
> 2 int write - ESXi vmk1
>
> https://drive.google.com/file/d/0BwyUMjibonYQUEFsSVJCYXBVX3c/view?usp=sharing
> 2 int read -  server ixgbe0
>
> https://drive.google.com/file/d/0BwyUMjibonYQT3FBbElnNFpJTzQ/view?usp=sharing
> 2 int read - ESXi vmk5
>
> https://drive.google.com/file/d/0BwyUMjibonYQU1hXdFRLM2cxSTA/view?usp=sharing
> 2 int read - ESXi vmk1
>
> https://drive.google.com/file/d/0BwyUMjibonYQNEFZSHVZdFNweDA/view?usp=sharing
> 1 int write - ESXi vmk1
>
> https://drive.google.com/file/d/0BwyUMjibonYQM3FpTmloQm5iMGc/view?usp=sharing
> 1 int read - ESXi vmk1
>
> Regards,
>
> Warren V
>
>
>
> On Mon, Mar 2, 2015 at 1:11 PM, Mallory, Rob <rmallory at qualcomm.com>
> wrote:
>
>   Just an EWAG,   and forgive me for not following closely, I just saw
> this in my inbox, and looked at it and the screenshots for 2 minutes.
>
>
>
>   But this looks like the typical incast problem (see
> http://www.pdl.cmu.edu/Incast/),
>
> where your storage servers (there are effectively two with iSCSI/MPIO if
> round-robin is working) have networks which are 20:1 oversubscribed to your
> 1GbE host interfaces (although one of the tcpdumps shows only one server,
> so it may be choked out completely).
>
>
>
> What is your BDP?  I'm guessing .150ms * 1GbE.  For a single link that gets
> you to a window of 18700 bytes or so.
>
>
>
> On your 1GbE-connected clients, leave the MTU at 9k, set the following in
> sysctl.conf, and reboot:
>
>
>
> net.ipv4.tcp_rmem = 4096 8938 17876
>
>
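> (The three values are the min, default, and max receive buffer in bytes;
> to test without a reboot, the same ceiling can also be applied on the fly:)
>
>   sysctl -w net.ipv4.tcp_rmem='4096 8938 17876'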
>
> If MPIO from the server is indeed round-robining properly, this will “make
> things fit” much better.
>
>
>
> Note that your tcp_wmem can and should stay high, since you are not
> oversubscribed going from client to server; you only need to tweak the TCP
> receive window size.
>
>
>
> I've not done it in quite some time, but IIRC, you can also set these from
> the server side with:
>
> route add -sendpipe 8930   (or -ssthresh)
>
>
>
> And I think you can see the hash-table with computed BDP per client with
> ndd.
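>
> A minimal sketch of what that host route might look like on the OmniOS
> side (the client address and gateway are placeholders; -sendpipe takes bytes):
>
>   # cap the send window toward one 1GbE initiator at roughly its BDP
>   route add -host 10.0.0.50 10.0.0.1 -sendpipe 8930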
>
>
>
> I would try playing with those before delving deep into potential bugs in
> the TCP stack, NIC driver, ZFS, or VM layer.
>
> -Rob
>
>
>
> *From:* W Verb via illumos-developer [mailto:developer at lists.illumos.org]
> *Sent:* Monday, March 02, 2015 12:20 PM
> *To:* Garrett D'Amore
> *Cc:* Joerg Goltermann; illumos-dev; omnios-discuss at lists.omniti.com
> *Subject:* Re: [developer] Re: [OmniOS-discuss] The ixgbe driver, Lindsay
> Lohan, and the Greek economy
>
>
>
> Hello,
>
> vmstat seems pretty boring. Certainly nothing going to swap.
>
> root@sanbox:/root# vmstat
>  kthr      memory            page            disk          faults      cpu
>  r b w   swap  free  re  mf pi po fr de sr po ro s0 s2   in   sy   cs us sy id
>  0 0 0 34631632 30728068 175 215 0 0 0 0 963 275 4 6 140 3301 796 6681 0 1 99
>
> Here is the "taskq_dispatch_ent" output from "lockstat -s 5 -kWP sleep 30"
> during the "fast" write operation.
>
>
> -------------------------------------------------------------------------------
> Count indv cuml rcnt     nsec Hottest Lock           Caller
> 50934   3%  79% 0.00     3437 0xffffff093145ba40     taskq_dispatch_ent
>
>       nsec ------ Time Distribution ------ count     Stack
>        128 |                               7         spa_taskq_dispatch_ent
>        256 |@@                             4333      zio_taskq_dispatch
>        512 |@@                             3863      zio_issue_async
>       1024 |@@@@@                          9717      zio_execute
>       2048 |@@@@@@@@@                      15904
>       4096 |@@@@                           7595
>       8192 |@@                             4498
>      16384 |@                              2662
>      32768 |@                              1886
>      65536 |                               434
>     131072 |                               34
>     262144 |                               1
>
> -------------------------------------------------------------------------------
>
>   However, the truly "broken" function is a read operation:
>
> Top lock 1st try:
>
> -------------------------------------------------------------------------------
> Count indv cuml rcnt     nsec Hottest Lock           Caller
>   474  15%  15% 0.00     7031 0xffffff093145b6f8     cv_wait
>
>       nsec ------ Time Distribution ------ count     Stack
>        256 |@                              29        taskq_thread_wait
>        512 |@@@@@@                         100       taskq_thread
>       1024 |@@@@                           72        thread_start
>       2048 |@@@@                           69
>       4096 |@@@                            51
>       8192 |@@                             47
>      16384 |@@                             44
>      32768 |@@                             32
>      65536 |@                              25
>     131072 |                               5
>
> -------------------------------------------------------------------------------
>
> Top lock 2nd try:
>
>
> -------------------------------------------------------------------------------
> Count indv cuml rcnt     nsec Hottest Lock           Caller
>   174  39%  39% 0.00   103909 0xffffff0943f116a0     dmu_zfetch_find
>
>       nsec ------ Time Distribution ------ count     Stack
>       2048 |                               2         dmu_zfetch
>       4096 |                               3         dbuf_read
>       8192 |                               4         dmu_buf_hold_array_by_dnode
>      16384 |                               3         dmu_buf_hold_array
>      32768 |@                              7
>      65536 |@@                             14
>     131072 |@@@@@@@@@@@@@@@@@@@@           116
>     262144 |@@@                            19
>     524288 |                               4
>    1048576 |                               2
>
> -------------------------------------------------------------------------------
>
> Top lock 3rd try:
>
>
> -------------------------------------------------------------------------------
> Count indv cuml rcnt     nsec Hottest Lock           Caller
>   283  55%  55% 0.00    94602 0xffffff0943ff5a68     dmu_zfetch_find
>
>       nsec ------ Time Distribution ------ count     Stack
>        512 |                               1         dmu_zfetch
>       1024 |                               1         dbuf_read
>       2048 |                               0         dmu_buf_hold_array_by_dnode
>       4096 |                               5         dmu_buf_hold_array
>       8192 |                               2
>      16384 |                               7
>      32768 |                               4
>      65536 |@@@                            33
>     131072 |@@@@@@@@@@@@@@@@@@@@           198
>     262144 |@@                             27
>     524288 |                               2
>    1048576 |                               3
>
> -------------------------------------------------------------------------------
>
>
>
> As for the MTU question: setting the MTU to 9000 makes read operations
> grind almost to a halt, at a 5MB/s transfer rate.
>
> -Warren V
>
>
>
> On Mon, Mar 2, 2015 at 11:30 AM, Garrett D'Amore <garrett at damore.org>
> wrote:
>
>  Here's a theory.  You are using small (relatively) MTUs (3000 is less
> than the smallest ZFS block size.)  So, when you go multipathing this way,
> might a single upper-layer transaction (a ZFS block transfer request, or for
> that matter a COMSTAR block request) get routed over different paths?  This
> sounds like a potentially pathological condition to me.
>
>
>
> What happens if you increase the MTU to 9000?  Have you tried it?  I’m
> sort of thinking that this will permit each transaction to be issued in a
> single IP frame, which may alleviate certain tragic code paths.  (That
> said, I’m not sure how aware COMSTAR is of the IP MTU.  If it is ignorant,
> then it shouldn’t matter *that* much, since TCP should do the right thing
> here and a single TCP stream should stick to a single underlying NIC.  But
> if COMSTAR is aware of the MTU, it may do some really screwball things as
> it tries to break requests up into single frames.)
>
>
>
> Your read spin really looks like only about 22 msec of wait out of a total
> run of 30 sec.  (That’s not *great*, but neither does it sound tragic.)
>  Your write  is interesting because that looks like it is going a wildly
> different path.  You should be aware that the locks you see are *not*
> necessarily related in call order, but rather are ordered by instance
> count.  The write code path hitting taskq_thread as hard as it does is
> really, really weird.  Something is pounding on a taskq lock super hard.
> The number of taskq_dispatch_ent calls is interesting here.  I’m starting
> to wonder if it’s something as stupid as a spin where if the taskq is
> “full” (max size reached), a caller just is spinning trying to dispatch
> jobs to the taskq.
>
>
>
> The taskq_dispatch_ent code is super simple, and it should be almost
> impossible to have contention on that lock — barring a thread spinning hard
> on taskq_dispatch (or taskq_dispatch_ent as I think is happening here).
> Looking at the various call sites, there are places in both COMSTAR
> (iscsit) and in ZFS where this could be coming from.  To know which, we
> really need to have the back trace associated.
>
>
>
> lockstat can give this — try giving “-s 5” to give a short backtrace from
> this, that will probably give us a little more info about the guilty
> caller. :-)
>
>
>
> - Garrett
>
>
>
>   On Mar 2, 2015, at 11:07 AM, W Verb via illumos-developer <
> developer at lists.illumos.org> wrote:
>
>
>
> Hello all,
>
> I am not using layer 2 flow control. The switch carries line-rate 10G
> traffic without error.
>
> I think I have found the issue via lockstat. The first lockstat is taken
> during a multipath read:
>
> lockstat -kWP sleep 30
>
>
> Adaptive mutex spin: 21331 events in 30.020 seconds (711 events/sec)
>
> Count indv cuml rcnt     nsec Hottest Lock           Caller
>
> -------------------------------------------------------------------------------
>  9306  44%  44% 0.00     1557 htable_mutex+0x370     htable_release
>  6307  23%  68% 0.00     1207 htable_mutex+0x108     htable_lookup
>   596   7%  75% 0.00     4100 0xffffff0931705188     cv_wait
>   349   5%  80% 0.00     4437 0xffffff0931705188     taskq_thread
>   704   2%  82% 0.00      995 0xffffff0935de3c50     dbuf_create
>
> The hash table being read here I would guess is the tcp connection hash
> table.
>
>
>
> When lockstat is run during a multipath write operation, I get:
>
> Adaptive mutex spin: 1097341 events in 30.016 seconds (36558 events/sec)
>
> Count indv cuml rcnt     nsec Hottest Lock           Caller
>
> -------------------------------------------------------------------------------
> 210752  28%  28% 0.00     4781 0xffffff0931705188     taskq_thread
> 174471  22%  50% 0.00     4476 0xffffff0931705188     cv_wait
> 127183  10%  61% 0.00     2871 0xffffff096f29b510     zio_notify_parent
> 176066  10%  70% 0.00     1922 0xffffff0931705188     taskq_dispatch_ent
> 105134   9%  80% 0.00     3110 0xffffff096ffdbf10     zio_remove_child
> 67512   4%  83% 0.00     1938 0xffffff096f3db4b0     zio_add_child
> 45736   3%  86% 0.00     2239 0xffffff0935de3c50     dbuf_destroy
> 27781   3%  89% 0.00     3416 0xffffff0935de3c50     dbuf_create
> 38536   2%  91% 0.00     2122 0xffffff0935de3b70     dnode_rele
> 27841   2%  93% 0.00     2423 0xffffff0935de3b70     dnode_diduse_space
> 19020   2%  95% 0.00     3046 0xffffff09d9e305e0     dbuf_rele
> 14627   1%  96% 0.00     3632 dbuf_hash_table+0x4f8  dbuf_find
>
>   Writes are not performing htable lookups, while reads are.
>
> -Warren V
>
>
>
>
>
>
> On Mon, Mar 2, 2015 at 3:14 AM, Joerg Goltermann <jg at osn.de> wrote:
>
>  Hi,
>
> I would try *one* TPG which includes both interface addresses
> and I would double check for packet drops on the Catalyst.
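>
> A minimal sketch of that with itadm (the addresses and target IQN are
> placeholders):
>
>   itadm create-tpg tpg-both 10.10.1.10 10.10.2.10
>   itadm modify-target -t tpg-both iqn.2010-09.org.example:target0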
>
> The 3560 supports only receive flow control, which means that
> a sending 10Gbit port can easily overload a 1Gbit port.
> Do you have flow control enabled?
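>
> A quick way to check on the OmniOS side (flowctrl is the illumos link
> property; possible values are no, tx, rx, and bi):
>
>   dladm show-linkprop -p flowctrl ixgbe0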
>
>  - Joerg
>
>
>
> On 02.03.2015 09:22, W Verb via illumos-developer wrote:
>
>   Hello Garrett,
>
> No, no 802.3ad going on in this config.
>
> Here is a basic schematic:
>
>
> https://drive.google.com/file/d/0BwyUMjibonYQVkVqcE5OQUJyUUU/view?usp=sharing
>
> Here is the Nexenta MPIO iSCSI Setup Document that I used as a guide:
>
>
> https://drive.google.com/file/d/0BwyUMjibonYQbjEyUTBjN2tTNWM/view?usp=sharing
>
> Note that I am using an MTU of 3000 on both the 10G and 1G NICs. The
> switch is set to allow 9148-byte frames, and I'm not seeing any
> errors/buffer overruns on the switch.
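>
> For reference, a sketch of how that MTU is set on the server side with
> dladm (the IP interface may need to be unplumbed first):
>
>   dladm set-linkprop -p mtu=3000 ixgbe0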
>
> Here is a screenshot of a packet capture from a read operation on the
> guest OS (from its local drive, which is actually a VMDK file on the
> storage server). In this example, only a single 1G ESXi kernel interface
> (vmk1) is bound to the software iSCSI initiator.
>
>
> https://drive.google.com/file/d/0BwyUMjibonYQa2NYdXhpZkpkbU0/view?usp=sharing
>
> Note that there's a nice, well-behaved window sizing process taking
> place. The ESXi decreases the scaled window by 11 or 12 for each ACK,
> then bumps it back up to 512.
>
> Here is a similar screenshot of a single-interface write operation:
>
>
> https://drive.google.com/file/d/0BwyUMjibonYQbU1RZHRnakxDSFU/view?usp=sharing
>
> There are no pauses or gaps in the transmission rate in the
> single-interface transfers.
>
>
> In the next screenshots, I have enabled an additional 1G interface on
> the ESXi host, and bound it to the iSCSI initiator. The new interface is
> bound to a separate physical port, uses a different VLAN on the switch,
> and talks to a different 10G port on the storage server.
>
> First, let's look at a write operation on the guest OS, which happily
> pumps data at near-line-rate to the storage server.
>
> Here is a sequence number trace diagram. Note how the transfer has a
> nice, smooth increment rate over the entire transfer.
>
>
> https://drive.google.com/file/d/0BwyUMjibonYQWHNIa0drWnNxMmM/view?usp=sharing
>
> Here are screenshots from packet captures on both 1G interfaces:
>
>
> https://drive.google.com/file/d/0BwyUMjibonYQRWhyVVQ4djNaU3c/view?usp=sharing
>
> https://drive.google.com/file/d/0BwyUMjibonYQaTVjTEtTRloyR2c/view?usp=sharing
>
> Note how we again see nice, smooth window adjustment, and no gaps in
> transmission.
>
>
> But now, let's look at the problematic two-interface Read operation.
> First, the sequence graph:
>
>
> https://drive.google.com/file/d/0BwyUMjibonYQTzdFVWdQMWZ6LUU/view?usp=sharing
>
> As you can see, there are gaps and jumps in the transmission throughout
> the transfer.
> It is very illustrative to look at captures of the gaps, which are
> occurring on both interfaces:
>
>
> https://drive.google.com/file/d/0BwyUMjibonYQc0VISXN6eVFwQzg/view?usp=sharing
>
> https://drive.google.com/file/d/0BwyUMjibonYQVFREUHp3TGFiUU0/view?usp=sharing
>
> As you can see, there are ~0.4 second pauses in transmission from the
> storage server, which kill the transfer rate.
> It's clear that the ESXi box ACKs the prior iSCSI operation to
> completion, then makes a new LUN request, which the storage server
> immediately replies to. The ESXi ACKs the response packet from the
> storage server, then waits...and waits....and waits... until eventually
> the storage server starts transmitting again.
>
> Because the pause happens while the ESXi client is waiting for a packet
> from the storage server, that tells me that the gaps are not an artifact
> of traffic being switched between both active interfaces, but are
> actually indicative of short hangs occurring on the server.
>
> Having a pause or two in transmission is no big deal, but in my case, it
> is happening constantly, and dropping my overall read transfer rate down
> to 20-60MB/s, which is slower than the single interface transfer rate
> (~90-100MB/s).
>
> Decreasing the MTU makes the pauses shorter; increasing it makes the
> pauses longer.
>
> Another interesting thing is that if I set the multipath io interval to
> 3 operations instead of 1, I get better throughput. In other words, the
> less frequently I swap IP addresses on my iSCSI requests from the ESXi
> unit, the fewer pauses I see.
>
> Basically, COMSTAR seems to choke each time an iSCSI request from a new
> IP arrives.
>
> Because the single interface transfer is near line rate, that tells me
> that the storage system (mpt_sas, zfs, etc) is working fine. It's only
> when multiple paths are attempted that iSCSI falls on its face during
> reads.
>
> All of these captures were taken without a cache device being attached
> to the storage zpool, so this isn't looking like some kind of ZFS ARC
> problem. As mentioned previously, local transfers to/from the zpool are
> showing ~300-500 MB/s rates over long transfers (10G+).
>
> -Warren V
>
> On Sun, Mar 1, 2015 at 9:11 PM, Garrett D'Amore <garrett at damore.org> wrote:
>
>     I’m not sure I’ve followed properly.  You have *two* interfaces.
>     You are not trying to provision these in an aggr, are you? As far as
>     I'm aware, VMware does not support 802.3ad link aggregations.  (It's
>     possible that you can make it work with ESXi if you give the entire
>     NIC to the guest — but I’m skeptical.)  The problem is that if you
>     try to use link aggregation, some packets (up to half!) will be
>     lost.  TCP and other protocols fare poorly in this situation.
>
>     It's possible I've totally misunderstood what you're trying to do, in
>     which case I apologize.
>
>     The idle thing is a red herring — the CPU is waiting for work to do,
>     probably because packets haven't arrived (or were dropped by the
>     hypervisor!)  I wouldn’t read too much into that except that your
>     network stack is in trouble.  I’d look a bit more closely at the
>     kstats for tcp — I suspect you’ll see retransmits or out of order
>     values that are unusually high — if so this may help validate my
>     theory above.
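>
>     For reference, one way to pull those counters (assuming the stock
>     illumos tcp kstats; exact statistic names can vary by release):
>
>       kstat -m tcp | egrep -i 'retrans|unorder'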
>
>     - Garrett
>
>     On Mar 1, 2015, at 9:03 PM, W Verb via illumos-developer
>     <developer at lists.illumos.org> wrote:
>
>     Hello all,
>
>
>     Well, I no longer blame the ixgbe driver for the problems I'm seeing.
>
>
>     I tried Joerg's updated driver, which didn't improve the issue. So
>     I went back to the drawing board and rebuilt the server from scratch.
>
>     What I noted is that if I have only a single 1-gig physical
>     interface active on the ESXi host, everything works as expected.
>     As soon as I enable two interfaces, I start seeing the performance
>     problems I've described.
>
>     Response pauses from the server that I see in TCPdumps are still
>     leading me to believe the problem is delay on the server side, so
>     I ran a series of kernel dtraces and produced some flamegraphs.
>
>
>     This was taken during a read operation with two active 10G
>     interfaces on the server, with a single target being shared by two
>     tpgs- one tpg for each 10G physical port. The host device has two
>     1G ports enabled, with VLANs separating the active ports into
>     10G/1G pairs. ESXi is set to multipath using both VLANS with a
>     round-robin IO interval of 1.
>
>
> https://drive.google.com/file/d/0BwyUMjibonYQd3ZYOGh4d2pteGs/view?usp=sharing
>
>
>     This was taken during a write operation:
>
>
> https://drive.google.com/file/d/0BwyUMjibonYQMnBtU1Q2SXM2ams/view?usp=sharing
>
>
>     I then rebooted the server and disabled C-State, ACPI T-State, and
>     general EIST (Turbo boost) functionality in the CPU.
>
>     When I attempted to boot my guest VM, the iSCSI transfer
>     gradually ground to a halt during the boot loading process, and
>     the guest OS never did complete its boot process.
>
>     Here is a flamegraph taken while iSCSI is slowly dying:
>
>
> https://drive.google.com/file/d/0BwyUMjibonYQM21JeFZPX3dZWTg/view?usp=sharing
>
>
>     I edited out cpu_idle_adaptive from the dtrace output and
>     regenerated the slowdown graph:
>
>
> https://drive.google.com/file/d/0BwyUMjibonYQbTVwV3NvXzlPS1E/view?usp=sharing
>
>
>     I then edited cpu_idle_adaptive out of the speedy write operation
>     and regenerated that graph:
>
>
> https://drive.google.com/file/d/0BwyUMjibonYQeWFYM0pCMDZ1X2s/view?usp=sharing
>
>
>     I have zero experience with interpreting flamegraphs, but the most
>     significant difference I see between the slow read example and the
>     fast write example is in unix`thread_start --> unix`idle. There's
>     a good chunk of "unix`i86_mwait" in the read example that is not
>     present in the write example at all.
>
>     Disabling the l2arc cache device didn't make a difference, and I
>     had to reenable EIST support on the CPU to get my VMs to boot.
>
>     I am seeing a variety of bug reports going back to 2010 regarding
>     excessive mwait operations, with the suggested solutions usually
>     being to set "cpupm enable poll-mode" in power.conf. That change
>     also had no effect on speed.
>
>     -Warren V
>
>
>
>
>     -----Original Message-----
>
>     From: Chris Siebenmann [mailto:cks at cs.toronto.edu]
>
>     Sent: Monday, February 23, 2015 8:30 AM
>
>     To: W Verb
>
>     Cc: omnios-discuss at lists.omniti.com; cks at cs.toronto.edu
>
>     Subject: Re: [OmniOS-discuss] The ixgbe driver, Lindsay Lohan, and
>     the Greek economy
>
>
>     > Chris, thanks for your specific details. I'd appreciate it if you
>
>     > could tell me which copper NIC you tried, as well as to pass on the
>
>     > iSCSI tuning parameters.
>
>
>     Our copper NIC experience is with onboard X540-AT2 ports on
>     SuperMicro hardware (which have the guaranteed 10-20 msec lock
>     hold) and dual-port 82599EB TN cards (which have some sort of
>     driver/hardware failure under load that eventually leads to
>     2-second lock holds). I can't recommend either with the current
>     driver; we had to revert to 1G networking in order to get stable
>     servers.
>
>
>     The iSCSI parameter modifications we do, across both initiators
>     and targets, are:
>
>
>     initialr2t          no
>     firstburstlength    128k
>     maxrecvdataseglen   128k    [only on Linux backends]
>     maxxmitdataseglen   128k    [only on Linux backends]
>
>
>     The OmniOS initiator doesn't need tuning for more than the first
>     two parameters; on the Linux backends we tune up all four. My
>     extended thoughts on these tuning parameters and why we touch them
>     can be found
>
>     here:
>
>
>
> http://utcc.utoronto.ca/~cks/space/blog/tech/UnderstandingiSCSIProtocol
>
>     http://utcc.utoronto.ca/~cks/space/blog/tech/LikelyISCSITuning
>
>
>     The short version is that these parameters probably only make a
>     small difference but their overall goal is to do 128KB ZFS reads
>     and writes in single iSCSI operations (although they will be
>     fragmented at the TCP
>
>     layer) and to do iSCSI writes without a back-and-forth delay
>     between initiator and target (that's 'initialr2t no').
>
>
>     I think basically everyone should use InitialR2T set to no and in
>     fact that it should be the software default. These days only
>     unusually limited iSCSI targets should need it to be otherwise and
>     they can change their setting for it (initiator and target must
>     both agree to it being 'yes', so either can veto it).
>
>
>     - cks
>
>
>
>     On Mon, Feb 23, 2015 at 8:21 AM, Joerg Goltermann <jg at osn.de> wrote:
>
>         Hi,
>
>         I think your problem is caused by your link properties or your
>         switch settings. In general the standard ixgbe seems to perform
>         well.
>
>         I had trouble after changing the default flow control settings
>         to "bi"
>         and this was my motivation to update the ixgbe driver a long
>         time ago.
>         After I updated our systems to ixgbe 2.5.8 I never had any
>         problems ....
>
>         Make sure your switch has support for jumbo frames and you use
>         the same mtu on all ports, otherwise the smallest will be used.
>
>         What switch do you use? I can tell you nice horror stories about
>         different vendors....
>
>          - Joerg
>
>         On 23.02.2015 10:31, W Verb wrote:
>
>             Thank you Joerg,
>
>             I've downloaded the package and will try it tomorrow.
>
>             The only thing I can add at this point is that upon review
>             of my
>             testing, I may have performed my "pkg -u" between the
>             initial quad-gig
>             performance test and installing the 10G NIC. So this may
>             be a new
>             problem introduced in the latest updates.
>
>             Those of you who are running 10G and have not upgraded to
>             the latest
>             kernel, etc, might want to do some additional testing
>             before running the
>             update.
>
>             -Warren V
>
>             On Mon, Feb 23, 2015 at 1:15 AM, Joerg Goltermann
>             <jg at osn.de> wrote:
>
>                 Hi,
>
>                 I remember there was a problem with the flow control
>             settings in the
>                 ixgbe
>                 driver, so I updated it a long time ago for our
>             internal servers to
>                 2.5.8.
>                 Last weekend I integrated the latest changes from the
>             FreeBSD driver
>                 to bring
>                 the illumos ixgbe to 2.5.25 but I had no time to test
>             it, so it's
>                 completely
>                 untested!
>
>
>                 If you would like to give the latest driver a try you
>             can fetch the
>                 kernel modules from
>                 https://cloud.osn.de/index.php/s/Fb4so9RsNnXA7r9
>
>                 Clone your boot environment, place the modules in the
>             new environment
>                 and update the boot-archive of the new BE.
>
>                   - Joerg
>
>
>
>
>
>                 On 23.02.2015 02:54, W Verb wrote:
>
>                     By the way, to those of you who have working
>             setups: please send me
>                     your pool/volume settings, interface linkprops,
>             and any kernel
>                     tuning
>                     parameters you may have set.
>
>                     Thanks,
>                     Warren V
>
>                     On Sat, Feb 21, 2015 at 7:59 AM, Schweiss, Chip
>                     <chip at innovates.com> wrote:
>
>                         I can't say I totally agree with your performance
>                         assessment.   I run Intel
>                         X520 in all my OmniOS boxes.
>
>                         Here is a capture of nfssvrtop I made while
>             running many
>                         storage vMotions
>                         between two OmniOS boxes hosting NFS
>             datastores.   This is a
>                         10 host VMware
>                         cluster.  Both OmniOS boxes are dual 10G
>             connected with
>                         copper twin-ax to
>                         the in rack Nexus 5010.
>
>                         VMware does 100% sync writes, I use ZeusRAM
>             SSDs for log
>                         devices.
>
>                         -Chip
>
> 2014 Apr 24 08:05:51, load: 12.64, read: 17330243 KB, swrite: 15985 KB, awrite: 1875455 KB
>
> Ver  Client         NFSOPS   Reads  SWrites  AWrites  Commits    Rd_bw  SWr_bw  AWr_bw   Rd_t  SWr_t  AWr_t  Com_t  Align%
> 4    10.28.17.105        0       0        0        0        0        0       0       0      0      0      0      0       0
> 4    10.28.17.215        0       0        0        0        0        0       0       0      0      0      0      0       0
> 4    10.28.17.213        0       0        0        0        0        0       0       0      0      0      0      0       0
> 4    10.28.16.151        0       0        0        0        0        0       0       0      0      0      0      0       0
> 4    all                 1       0        0        0        0        0       0       0      0      0      0      0       0
> 3    10.28.16.175        3       0        3        0        0        1      11       0   4806     48      0      0      85
> 3    10.28.16.183        6       0        6        0        0        3     162       0    549    124      0      0      73
> 3    10.28.16.180       11       0       10        0        0        3      27       0    776     89      0      0      67
> 3    10.28.16.176       28       2       26        0        0       10     405       0   2572    198      0      0     100
> 3    10.28.16.178     4606    4602        4        0        0   294534       3       0    723     49      0      0      99
> 3    10.28.16.179     4905    4879       26        0        0   312208     311       0    735    271      0      0      99
> 3    10.28.16.181     5515    5502       13        0        0   352107      77       0     89     87      0      0      99
> 3    10.28.16.184    12095   12059       10        0        0   763014      39       0    249    147      0      0      99
> 3    10.28.58.1      15401    6040      116     6354       53   191605     474  202346    192     96    144     83      99
> 3    all             42574   33086      217     6354       53  1913488    1582  202300    348    138    153    105      99
>
>
>
>
>
>                         On Fri, Feb 20, 2015 at 11:46 PM, W Verb
>                         <wverb73 at gmail.com> wrote:
>
>
>                             Hello All,
>
>                             Thank you for your replies.
>                             I tried a few things, and found the following:
>
>                             1: Disabling hyperthreading support in the
>             BIOS drops
>                             performance overall
>                             by a factor of 4.
>                             2: Disabling VT support also seems to have
>             some effect,
>                             although it
>                             appears to be minor. But this has the
>             amusing side
>                             effect of fixing the
>                             hangs I've been experiencing with fast
>             reboot. Probably
>                             by disabling kvm.
>                             3: The performance tests are a bit tricky
>             to quantify
>                             because of caching
>                             effects. In fact, I'm not entirely sure
>             what is
>                             happening here. It's just
>                             best to describe what I'm seeing:
>
>                             The commands I'm using to test are
>                             dd if=/dev/zero of=./test.dd bs=2M count=5000
>                             dd of=/dev/null if=./test.dd bs=2M count=5000
>                             The host vm is running Centos 6.6, and has
>             the latest
>                             vmtools installed.
>                             There is a host cache on an SSD local to
>             the host that
>                             is also in place.
>                             Disabling the host cache didn't
>             immediately have an
>                             effect as far as I could
>                             see.
>
>                             The host MTU set to 3000 on all iSCSI
>             interfaces for all
>                             tests.
>
>                             Test 1: Right after reboot, with an ixgbe
>             MTU of 9000,
>                             the write test
>                             yields an average speed over three tests
>             of 137MB/s. The
>                             read test yields an
>                             average over three tests of 5MB/s.
>
>                             Test 2: After setting "ifconfig ixgbe0 mtu
>             3000", the
>                             write tests yield
>                             140MB/s, and the read tests yield 53MB/s.
>             It's important
>                             to note here that
>                             if I cut the read test short at only
>             2-3GB, I get
>                             results upwards of
>                             350MB/s, which I assume is local
>             cache-related distortion.
>
>                             Test 3: MTU of 1500. Read tests are up to
>             156 MB/s.
>                             Write tests yield
>                             about 142MB/s.
>                             Test 4: MTU of 1000: Read test at 182MB/s.
>                             Test 5: MTU of 900: Read test at 130 MB/s.
>                             Test 6: MTU of 1000: Read test at 160MB/s.
>             Write tests
>                             are now
>                             consistently at about 300MB/s.
>                             Test 7: MTU of 1200: Read test at 124MB/s.
>                             Test 8: MTU of 1000: Read test at 161MB/s.
>             Write at 261MB/s.
>
>                             A few final notes:
>                             L1ARC grabs about 10GB of RAM during the
>             tests, so
>                             there's definitely some
>                             read cachi
>
> ...
>
> [Message clipped]

