[OmniOS-discuss] [developer] Re: The ixgbe driver, Lindsay Lohan, and the Greek economy
W Verb
wverb73 at gmail.com
Mon Mar 2 20:19:44 UTC 2015
Hello,
vmstat seems pretty boring. Certainly nothing going to swap.
root at sanbox:/root# vmstat
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr po ro s0 s2 in sy cs us sy id
0 0 0 34631632 30728068 175 215 0 0 0 0 963 275 4 6 140 3301 796 6681 0 1 99
Here is the "taskq_dispatch_ent" output from "lockstat -s 5 -kWP sleep 30"
during the "fast" write operation.
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec Hottest Lock Caller
50934 3% 79% 0.00 3437 0xffffff093145ba40 taskq_dispatch_ent
nsec ------ Time Distribution ------ count Stack
128 | 7 spa_taskq_dispatch_ent
256 |@@ 4333 zio_taskq_dispatch
512 |@@ 3863 zio_issue_async
1024 |@@@@@ 9717 zio_execute
2048 |@@@@@@@@@ 15904
4096 |@@@@ 7595
8192 |@@ 4498
16384 |@ 2662
32768 |@ 1886
65536 | 434
131072 | 34
262144 | 1
-------------------------------------------------------------------------------
However, the truly "broken" function is a read operation:
Top lock 1st try:
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec Hottest Lock Caller
474 15% 15% 0.00 7031 0xffffff093145b6f8 cv_wait
nsec ------ Time Distribution ------ count Stack
256 |@ 29 taskq_thread_wait
512 |@@@@@@ 100 taskq_thread
1024 |@@@@ 72 thread_start
2048 |@@@@ 69
4096 |@@@ 51
8192 |@@ 47
16384 |@@ 44
32768 |@@ 32
65536 |@ 25
131072 | 5
-------------------------------------------------------------------------------
Top lock 2nd try:
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec Hottest Lock Caller
174 39% 39% 0.00 103909 0xffffff0943f116a0 dmu_zfetch_find
nsec ------ Time Distribution ------ count Stack
2048 | 2 dmu_zfetch
4096 | 3 dbuf_read
8192 | 4 dmu_buf_hold_array_by_dnode
16384 | 3 dmu_buf_hold_array
32768 |@ 7
65536 |@@ 14
131072 |@@@@@@@@@@@@@@@@@@@@ 116
262144 |@@@ 19
524288 | 4
1048576 | 2
-------------------------------------------------------------------------------
Top lock 3rd try:
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec Hottest Lock Caller
283 55% 55% 0.00 94602 0xffffff0943ff5a68 dmu_zfetch_find
nsec ------ Time Distribution ------ count Stack
512 | 1 dmu_zfetch
1024 | 1 dbuf_read
2048 | 0 dmu_buf_hold_array_by_dnode
4096 | 5 dmu_buf_hold_array
8192 | 2
16384 | 7
32768 | 4
65536 |@@@ 33
131072 |@@@@@@@@@@@@@@@@@@@@ 198
262144 |@@ 27
524288 | 2
1048576 | 3
-------------------------------------------------------------------------------
As for the MTU question: setting the MTU to 9000 makes read operations
grind almost to a halt, at a 5MB/s transfer rate.
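(For reference, the MTU change for that test was made the same way as in the
earlier tests, i.e. something along the lines of

ifconfig ixgbe0 mtu 9000

on the storage server's 10G interfaces; the interface names are specific to
this setup.)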
-Warren V
On Mon, Mar 2, 2015 at 11:30 AM, Garrett D'Amore <garrett at damore.org> wrote:
> Here’s a theory. You are using small (relatively) MTUs (3000 is less than
> the smallest ZFS block size.) So, when you go multipathing this way, might
> a single upper layer transaction (ZFS block transfer request, or for that
> matter COMSTAR block request) get routed over different paths? This sounds
> like a potentially pathological condition to me.
>
> What happens if you increase the MTU to 9000? Have you tried it? I’m
> sort of thinking that this will permit each transaction to be issued in a
> single IP frame, which may alleviate certain tragic code paths. (That
> said, I’m not sure how aware COMSTAR is of the IP MTU. If it is ignorant,
> then it shouldn’t matter *that* much, since TCP should do the right thing
> here and a single TCP stream should stick to a single underlying NIC. But
> if COMSTAR is aware of the MTU, it may do some really screwball things as
> it tries to break requests up into single frames.)
>
> Your read spin really looks like only about 22 msec of wait out of a total
> run of 30 sec. (That’s not *great*, but neither does it sound tragic.)
> Your write is interesting because that looks like it is going a wildly
> different path. You should be aware that the locks you see are *not*
> necessarily related in call order, but rather are ordered by instance
> count. The write code path hitting the taskq_thread as hard as it does is
> really, really weird. Something is pounding on a taskq lock super hard.
> The number of taskq_dispatch_ent calls is interesting here. I’m starting
> to wonder if it’s something as stupid as a spin where if the taskq is
> “full” (max size reached), a caller is just spinning, trying to dispatch
> jobs to the taskq.
>
> The taskq_dispatch_ent code is super simple, and it should be almost
> impossible to have contention on that lock — barring a thread spinning hard
> on taskq_dispatch (or taskq_dispatch_ent as I think is happening here).
> Looking at the various call sites, there are places in both COMSTAR
> (iscsit) and in ZFS where this could be coming from. To know which, we
> really need to have the back trace associated.
>
> lockstat can give this — try passing “-s 5” to get a short backtrace;
> that will probably give us a little more info about the guilty
> caller. :-)
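> (Concretely, something like
>
> lockstat -s 5 -kWP sleep 30
>
> run while the slow transfer is in progress should show the stacks behind
> the hot taskq lock.)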
>
> - Garrett
>
> On Mar 2, 2015, at 11:07 AM, W Verb via illumos-developer <
> developer at lists.illumos.org> wrote:
>
> Hello all,
> I am not using layer 2 flow control. The switch carries line-rate 10G
> traffic without error.
>
> I think I have found the issue via lockstat. The first lockstat is taken
> during a multipath read:
>
>
> lockstat -kWP sleep 30
>
> Adaptive mutex spin: 21331 events in 30.020 seconds (711 events/sec)
>
> Count indv cuml rcnt nsec Hottest Lock Caller
>
> -------------------------------------------------------------------------------
> 9306 44% 44% 0.00 1557 htable_mutex+0x370 htable_release
> 6307 23% 68% 0.00 1207 htable_mutex+0x108 htable_lookup
> 596 7% 75% 0.00 4100 0xffffff0931705188 cv_wait
> 349 5% 80% 0.00 4437 0xffffff0931705188 taskq_thread
> 704 2% 82% 0.00 995 0xffffff0935de3c50 dbuf_create
>
> The hash table being read here I would guess is the tcp connection hash
> table.
>
> When lockstat is run during a multipath write operation, I get:
>
> Adaptive mutex spin: 1097341 events in 30.016 seconds (36558 events/sec)
>
> Count indv cuml rcnt nsec Hottest Lock Caller
>
> -------------------------------------------------------------------------------
> 210752 28% 28% 0.00 4781 0xffffff0931705188 taskq_thread
> 174471 22% 50% 0.00 4476 0xffffff0931705188 cv_wait
> 127183 10% 61% 0.00 2871 0xffffff096f29b510 zio_notify_parent
> 176066 10% 70% 0.00 1922 0xffffff0931705188 taskq_dispatch_ent
> 105134 9% 80% 0.00 3110 0xffffff096ffdbf10 zio_remove_child
> 67512 4% 83% 0.00 1938 0xffffff096f3db4b0 zio_add_child
> 45736 3% 86% 0.00 2239 0xffffff0935de3c50 dbuf_destroy
> 27781 3% 89% 0.00 3416 0xffffff0935de3c50 dbuf_create
> 38536 2% 91% 0.00 2122 0xffffff0935de3b70 dnode_rele
> 27841 2% 93% 0.00 2423 0xffffff0935de3b70 dnode_diduse_space
> 19020 2% 95% 0.00 3046 0xffffff09d9e305e0 dbuf_rele
> 14627 1% 96% 0.00 3632 dbuf_hash_table+0x4f8 dbuf_find
>
>
>
> Writes are not performing htable lookups, while reads are.
>
> -Warren V
>
>
>
>
>
>
> On Mon, Mar 2, 2015 at 3:14 AM, Joerg Goltermann <jg at osn.de> wrote:
>
>> Hi,
>>
>> I would try *one* TPG which includes both interface addresses
>> and I would double check for packet drops on the Catalyst.
>>
>> The 3560 supports only receive flow control, which means that
>> a sending 10Gbit port can easily overload a 1Gbit port.
>> Do you have flow control enabled?
>>
>> - Joerg
>>
>>
>> On 02.03.2015 09:22, W Verb via illumos-developer wrote:
>>
>>> Hello Garrett,
>>>
>>> No, no 802.3ad going on in this config.
>>>
>>> Here is a basic schematic:
>>>
>>> https://drive.google.com/file/d/0BwyUMjibonYQVkVqcE5OQUJyUUU/view?usp=sharing
>>>
>>> Here is the Nexenta MPIO iSCSI Setup Document that I used as a guide:
>>>
>>> https://drive.google.com/file/d/0BwyUMjibonYQbjEyUTBjN2tTNWM/view?usp=sharing
>>>
>>> Note that I am using an MTU of 3000 on both the 10G and 1G NICs. The
>>> switch is set to allow 9148-byte frames, and I'm not seeing any
>>> errors/buffer overruns on the switch.
>>>
>>> Here is a screenshot of a packet capture from a read operation on the
>>> guest OS (from its local drive, which is actually a VMDK file on the
>>> storage server). In this example, only a single 1G ESXi kernel interface
>>> (vmk1) is bound to the software iSCSI initiator.
>>>
>>> https://drive.google.com/file/d/0BwyUMjibonYQa2NYdXhpZkpkbU0/view?usp=sharing
>>>
>>> Note that there's a nice, well-behaved window sizing process taking
>>> place. The ESXi decreases the scaled window by 11 or 12 for each ACK,
>>> then bumps it back up to 512.
>>>
>>> Here is a similar screenshot of a single-interface write operation:
>>>
>>> https://drive.google.com/file/d/0BwyUMjibonYQbU1RZHRnakxDSFU/view?usp=sharing
>>>
>>> There are no pauses or gaps in the transmission rate in the
>>> single-interface transfers.
>>>
>>>
>>> In the next screenshots, I have enabled an additional 1G interface on
>>> the ESXi host, and bound it to the iSCSI initiator. The new interface is
>>> bound to a separate physical port, uses a different VLAN on the switch,
>>> and talks to a different 10G port on the storage server.
>>>
>>> First, let's look at a write operation on the guest OS, which happily
>>> pumps data at near-line-rate to the storage server.
>>>
>>> Here is a sequence number trace diagram. Note how the transfer has a
>>> nice, smooth increment rate over the entire transfer.
>>>
>>> https://drive.google.com/file/d/0BwyUMjibonYQWHNIa0drWnNxMmM/view?usp=sharing
>>>
>>> Here are screenshots from packet captures on both 1G interfaces:
>>>
>>> https://drive.google.com/file/d/0BwyUMjibonYQRWhyVVQ4djNaU3c/view?usp=sharing
>>> https://drive.google.com/file/d/0BwyUMjibonYQaTVjTEtTRloyR2c/view?usp=sharing
>>>
>>> Note how we again see nice, smooth window adjustment, and no gaps in
>>> transmission.
>>>
>>>
>>> But now, let's look at the problematic two-interface Read operation.
>>> First, the sequence graph:
>>>
>>> https://drive.google.com/file/d/0BwyUMjibonYQTzdFVWdQMWZ6LUU/view?usp=sharing
>>>
>>> As you can see, there are gaps and jumps in the transmission throughout
>>> the transfer.
>>> It is very illustrative to look at captures of the gaps, which are
>>> occurring on both interfaces:
>>>
>>> https://drive.google.com/file/d/0BwyUMjibonYQc0VISXN6eVFwQzg/view?usp=sharing
>>> https://drive.google.com/file/d/0BwyUMjibonYQVFREUHp3TGFiUU0/view?usp=sharing
>>>
>>> As you can see, there are ~0.4 second pauses in transmission from the
>>> storage server, which kills the transfer rate.
>>> It's clear that the ESXi box ACKs the prior iSCSI operation to
>>> completion, then makes a new LUN request, which the storage server
>>> immediately replies to. The ESXi ACKs the response packet from the
>>> storage server, then waits...and waits....and waits... until eventually
>>> the storage server starts transmitting again.
>>>
>>> Because the pause happens while the ESXi client is waiting for a packet
>>> from the storage server, that tells me that the gaps are not an artifact
>>> of traffic being switched between both active interfaces, but are
>>> actually indicative of short hangs occurring on the server.
>>>
>>> Having a pause or two in transmission is no big deal, but in my case, it
>>> is happening constantly, and dropping my overall read transfer rate down
>>> to 20-60MB/s, which is slower than the single interface transfer rate
>>> (~90-100MB/s).
>>>
>>> Decreasing the MTU makes the pauses shorter; increasing it makes the
>>> pauses longer.
>>>
>>> Another interesting thing is that if I set the multipath io interval to
>>> 3 operations instead of 1, I get better throughput. In other words, the
>>> less frequently I swap IP addresses on my iSCSI requests from the ESXi
>>> unit, the fewer pauses I see.
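>>> (For reference, the interval in question is the round-robin IOPS setting
>>> on the ESXi path selection policy; roughly, with a placeholder device ID:
>>>
>>> esxcli storage nmp psp roundrobin deviceconfig set --type=iops --iops=3 --device=naa.xxxx
>>>
>>> applied per datastore device.)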
>>>
>>> Basically, COMSTAR seems to choke each time an iSCSI request from a new
>>> IP arrives.
>>>
>>> Because the single interface transfer is near line rate, that tells me
>>> that the storage system (mpt_sas, zfs, etc) is working fine. It's only
>>> when multiple paths are attempted that iSCSI falls on its face during
>>> reads.
>>>
>>> All of these captures were taken without a cache device being attached
>>> to the storage zpool, so this isn't looking like some kind of ZFS ARC
>>> problem. As mentioned previously, local transfers to/from the zpool are
>>> showing ~300-500 MB/s rates over long transfers (10G+).
>>>
>>> -Warren V
>>>
>>> On Sun, Mar 1, 2015 at 9:11 PM, Garrett D'Amore <garrett at damore.org> wrote:
>>>
>>> I’m not sure I’ve followed properly. You have *two* interfaces.
>>> You are not trying to provision these in an aggr are you? As far as
>>> I’m aware, VMware does not support 802.3ad link aggregations. (It’s
>>> possible that you can make it work with ESXi if you give the entire
>>> NIC to the guest — but I’m skeptical.) The problem is that if you
>>> try to use link aggregation, some packets (up to half!) will be
>>> lost. TCP and other protocols fare poorly in this situation.
>>>
>>> It’s possible I’ve totally misunderstood what you’re trying to do, in
>>> which case I apologize.
>>>
>>> The idle thing is a red herring — the cpu is waiting for work to do,
>>> probably because packets haven’t arrived (or were dropped by the
>>> hypervisor!) I wouldn’t read too much into that except that your
>>> network stack is in trouble. I’d look a bit more closely at the
>>> kstats for tcp — I suspect you’ll see retransmits or out of order
>>> values that are unusually high — if so this may help validate my
>>> theory above.
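>>> (A quick way to eyeball those counters is something like
>>> "kstat -m tcp | egrep -i 'retrans|unorder'" -- the exact statistic names
>>> vary a bit between releases.)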
>>>
>>> - Garrett
>>>
>>> On Mar 1, 2015, at 9:03 PM, W Verb via illumos-developer <developer at lists.illumos.org> wrote:
>>>>
>>>> Hello all,
>>>>
>>>>
>>>> Well, I no longer blame the ixgbe driver for the problems I'm
>>>> seeing.
>>>>
>>>>
>>>> I tried Joerg's updated driver, which didn't improve the issue. So
>>>> I went back to the drawing board and rebuilt the server from
>>>> scratch.
>>>>
>>>> What I noted is that if I have only a single 1-gig physical
>>>> interface active on the ESXi host, everything works as expected.
>>>> As soon as I enable two interfaces, I start seeing the performance
>>>> problems I've described.
>>>>
>>>> Response pauses from the server that I see in TCPdumps are still
>>>> leading me to believe the problem is delay on the server side, so
>>>> I ran a series of kernel dtraces and produced some flamegraphs.
>>>>
>>>>
>>>> This was taken during a read operation with two active 10G
>>>> interfaces on the server, with a single target being shared by two
>>>> tpgs- one tpg for each 10G physical port. The host device has two
>>>> 1G ports enabled, with VLANs separating the active ports into
>>>> 10G/1G pairs. ESXi is set to multipath using both VLANS with a
>>>> round-robin IO interval of 1.
>>>>
>>>> https://drive.google.com/file/d/0BwyUMjibonYQd3ZYOGh4d2pteGs/view?usp=sharing
>>>>
>>>>
>>>> This was taken during a write operation:
>>>>
>>>> https://drive.google.com/file/d/0BwyUMjibonYQMnBtU1Q2SXM2ams/view?usp=sharing
>>>>
>>>>
>>>> I then rebooted the server and disabled C-State, ACPI T-State, and
>>>> general EIST (Turbo boost) functionality in the CPU.
>>>>
>>>> When I attempted to boot my guest VM, the iSCSI transfer
>>>> gradually ground to a halt during the boot loading process, and
>>>> the guest OS never did complete its boot process.
>>>>
>>>> Here is a flamegraph taken while iSCSI is slowly dying:
>>>>
>>>> https://drive.google.com/file/d/0BwyUMjibonYQM21JeFZPX3dZWTg/view?usp=sharing
>>>>
>>>>
>>>> I edited out cpu_idle_adaptive from the dtrace output and
>>>> regenerated the slowdown graph:
>>>>
>>>> https://drive.google.com/file/d/0BwyUMjibonYQbTVwV3NvXzlPS1E/view?usp=sharing
>>>>
>>>>
>>>> I then edited cpu_idle_adaptive out of the speedy write operation
>>>> and regenerated that graph:
>>>>
>>>> https://drive.google.com/file/d/0BwyUMjibonYQeWFYM0pCMDZ1X2s/view?usp=sharing
>>>>
>>>>
>>>> I have zero experience with interpreting flamegraphs, but the most
>>>> significant difference I see between the slow read example and the
>>>> fast write example is in unix`thread_start --> unix`idle. There's
>>>> a good chunk of "unix`i86_mwait" in the read example that is not
>>>> present in the write example at all.
>>>>
>>>> Disabling the l2arc cache device didn't make a difference, and I
>>>> had to reenable EIST support on the CPU to get my VMs to boot.
>>>>
>>>> I am seeing a variety of bug reports going back to 2010 regarding
>>>> excessive mwait operations, with the suggested solutions usually
>>>> being to set "cpupm enable poll-mode" in power.conf. That change
>>>> also had no effect on speed.
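>>>> (That is, /etc/power.conf gets the line
>>>>
>>>> cpupm enable poll-mode
>>>>
>>>> and pmconfig is run afterwards to apply it.)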
>>>>
>>>> -Warren V
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>>
>>>> From: Chris Siebenmann [mailto:cks at cs.toronto.edu]
>>>>
>>>> Sent: Monday, February 23, 2015 8:30 AM
>>>>
>>>> To: W Verb
>>>>
>>>> Cc: omnios-discuss at lists.omniti.com; cks at cs.toronto.edu
>>>>
>>>> Subject: Re: [OmniOS-discuss] The ixgbe driver, Lindsay Lohan, and
>>>> the Greek economy
>>>>
>>>>
>>>> > Chris, thanks for your specific details. I'd appreciate it if you
>>>>
>>>> > could tell me which copper NIC you tried, as well as to pass on
>>>> the
>>>>
>>>> > iSCSI tuning parameters.
>>>>
>>>>
>>>> Our copper NIC experience is with onboard X540-AT2 ports on
>>>> SuperMicro hardware (which have the guaranteed 10-20 msec lock
>>>> hold) and dual-port 82599EB TN cards (which have some sort of
>>>> driver/hardware failure under load that eventually leads to
>>>> 2-second lock holds). I can't recommend either with the current
>>>> driver; we had to revert to 1G networking in order to get stable
>>>> servers.
>>>>
>>>>
>>>> The iSCSI parameter modifications we do, across both initiators
>>>> and targets, are:
>>>>
>>>>
>>>> initialr2t            no
>>>> firstburstlength      128k
>>>> maxrecvdataseglen     128k    [only on Linux backends]
>>>> maxxmitdataseglen     128k    [only on Linux backends]
>>>>
>>>>
>>>> The OmniOS initiator doesn't need tuning for more than the first
>>>> two parameters; on the Linux backends we tune up all four. My
>>>> extended thoughts on these tuning parameters and why we touch them
>>>> can be found
>>>>
>>>> here:
>>>>
>>>>
>>>> http://utcc.utoronto.ca/~cks/space/blog/tech/UnderstandingiSCSIProtocol
>>>>
>>>> http://utcc.utoronto.ca/~cks/space/blog/tech/LikelyISCSITuning
>>>>
>>>>
>>>> The short version is that these parameters probably only make a
>>>> small difference but their overall goal is to do 128KB ZFS reads
>>>> and writes in single iSCSI operations (although they will be
>>>> fragmented at the TCP
>>>>
>>>> layer) and to do iSCSI writes without a back-and-forth delay
>>>> between initiator and target (that's 'initialr2t no').
>>>>
>>>>
>>>> I think basically everyone should use InitialR2T set to no and in
>>>> fact that it should be the software default. These days only
>>>> unusually limited iSCSI targets should need it to be otherwise and
>>>> they can change their setting for it (initiator and target must
>>>> both agree to it being 'yes', so either can veto it).
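>>>> (As a concrete sketch of the target-side settings -- assuming an IET-style
>>>> ietd.conf on the Linux backends, which may not match the stack actually in
>>>> use -- the per-target stanza would look roughly like:
>>>>
>>>> Target iqn.2015-02.example:store0
>>>>     InitialR2T                No
>>>>     FirstBurstLength          131072
>>>>     MaxRecvDataSegmentLength  131072
>>>>     MaxXmitDataSegmentLength  131072
>>>> )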
>>>>
>>>>
>>>> - cks
>>>>
>>>>
>>>>
>>>> On Mon, Feb 23, 2015 at 8:21 AM, Joerg Goltermann <jg at osn.de> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I think your problem is caused by your link properties or your
>>>> switch settings. In general the standard ixgbe seems to perform
>>>> well.
>>>>
>>>> I had trouble after changing the default flow control settings
>>>> to "bi"
>>>> and this was my motivation to update the ixgbe driver a long
>>>> time ago.
>>>> After I have updated our systems to ixgbe 2.5.8 I never had any
>>>> problems ....
>>>>
>>>> Make sure your switch has support for jumbo frames and you use
>>>> the same mtu on all ports, otherwise the smallest will be used.
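>>>> (A quick way to confirm what is actually in effect on the OmniOS side,
>>>> with the link name adjusted to your installation:
>>>>
>>>> dladm show-linkprop -p mtu ixgbe0
>>>> )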
>>>>
>>>> What switch do you use? I can tell you nice horror stories about
>>>> different vendors....
>>>>
>>>> - Joerg
>>>>
>>>> On 23.02.2015 10:31, W Verb wrote:
>>>>
>>>> Thank you Joerg,
>>>>
>>>> I've downloaded the package and will try it tomorrow.
>>>>
>>>> The only thing I can add at this point is that upon review
>>>> of my
>>>> testing, I may have performed my "pkg -u" between the
>>>> initial quad-gig
>>>> performance test and installing the 10G NIC. So this may
>>>> be a new
>>>> problem introduced in the latest updates.
>>>>
>>>> Those of you who are running 10G and have not upgraded to
>>>> the latest
>>>> kernel, etc, might want to do some additional testing
>>>> before running the
>>>> update.
>>>>
>>>> -Warren V
>>>>
>>>> On Mon, Feb 23, 2015 at 1:15 AM, Joerg Goltermann <jg at osn.de> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I remember there was a problem with the flow control
>>>> settings in the
>>>> ixgbe
>>>> driver, so I updated it a long time ago for our
>>>> internal servers to
>>>> 2.5.8.
>>>> Last weekend I integrated the latest changes from the
>>>> FreeBSD driver
>>>> to bring
>>>> the illumos ixgbe to 2.5.25 but I had no time to test
>>>> it, so it's
>>>> completely
>>>> untested!
>>>>
>>>>
>>>> If you would like to give the latest driver a try you
>>>> can fetch the
>>>> kernel modules from
>>>> https://cloud.osn.de/index.php/s/Fb4so9RsNnXA7r9
>>>>
>>>> Clone your boot environment, place the modules in the
>>>> new environment
>>>> and update the boot-archive of the new BE.
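>>>> (Roughly, and from memory -- double-check the paths against your install:
>>>>
>>>> beadm create ixgbe-test
>>>> beadm mount ixgbe-test /mnt
>>>> cp ixgbe /mnt/kernel/drv/amd64/ixgbe
>>>> bootadm update-archive -R /mnt
>>>> beadm activate ixgbe-test
>>>> )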
>>>>
>>>> - Joerg
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 23.02.2015 02:54, W Verb wrote:
>>>>
>>>> By the way, to those of you who have working
>>>> setups: please send me
>>>> your pool/volume settings, interface linkprops,
>>>> and any kernel
>>>> tuning
>>>> parameters you may have set.
>>>>
>>>> Thanks,
>>>> Warren V
>>>>
>>>> On Sat, Feb 21, 2015 at 7:59 AM, Schweiss, Chip <chip at innovates.com> wrote:
>>>>
>>>> I can't say I totally agree with your
>>>> performance
>>>> assessment. I run Intel
>>>> X520 in all my OmniOS boxes.
>>>>
>>>> Here is a capture of nfssvrtop I made while
>>>> running many
>>>> storage vMotions
>>>> between two OmniOS boxes hosting NFS
>>>> datastores. This is a
>>>> 10 host VMware
>>>> cluster. Both OmniOS boxes are dual 10G
>>>> connected with
>>>> copper twin-ax to
>>>> the in rack Nexus 5010.
>>>>
>>>> VMware does 100% sync writes, I use ZeusRAM
>>>> SSDs for log
>>>> devices.
>>>>
>>>> -Chip
>>>>
>>>> 2014 Apr 24 08:05:51, load: 12.64, read: 17330243 KB, swrite: 15985 KB, awrite: 1875455 KB
>>>>
>>>> Ver  Client         NFSOPS  Reads  SWrites  AWrites  Commits    Rd_bw  SWr_bw  AWr_bw  Rd_t  SWr_t  AWr_t  Com_t  Align%
>>>>   4  10.28.17.105        0      0        0        0        0        0       0       0     0      0      0      0       0
>>>>   4  10.28.17.215        0      0        0        0        0        0       0       0     0      0      0      0       0
>>>>   4  10.28.17.213        0      0        0        0        0        0       0       0     0      0      0      0       0
>>>>   4  10.28.16.151        0      0        0        0        0        0       0       0     0      0      0      0       0
>>>>   4  all                 1      0        0        0        0        0       0       0     0      0      0      0       0
>>>>   3  10.28.16.175        3      0        3        0        0        1      11       0  4806     48      0      0      85
>>>>   3  10.28.16.183        6      0        6        0        0        3     162       0   549    124      0      0      73
>>>>   3  10.28.16.180       11      0       10        0        0        3      27       0   776     89      0      0      67
>>>>   3  10.28.16.176       28      2       26        0        0       10     405       0  2572    198      0      0     100
>>>>   3  10.28.16.178     4606   4602        4        0        0   294534       3       0   723     49      0      0      99
>>>>   3  10.28.16.179     4905   4879       26        0        0   312208     311       0   735    271      0      0      99
>>>>   3  10.28.16.181     5515   5502       13        0        0   352107      77       0    89     87      0      0      99
>>>>   3  10.28.16.184    12095  12059       10        0        0   763014      39       0   249    147      0      0      99
>>>>   3  10.28.58.1      15401   6040      116     6354       53   191605     474  202346   192     96    144     83      99
>>>>   3  all             42574  33086      217     6354       53  1913488    1582  202300   348    138    153    105      99
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Feb 20, 2015 at 11:46 PM, W Verb <wverb73 at gmail.com> wrote:
>>>>
>>>>
>>>> Hello All,
>>>>
>>>> Thank you for your replies.
>>>> I tried a few things, and found the
>>>> following:
>>>>
>>>> 1: Disabling hyperthreading support in the
>>>> BIOS drops
>>>> performance overall
>>>> by a factor of 4.
>>>> 2: Disabling VT support also seems to have
>>>> some effect,
>>>> although it
>>>> appears to be minor. But this has the
>>>> amusing side
>>>> effect of fixing the
>>>> hangs I've been experiencing with fast
>>>> reboot. Probably
>>>> by disabling kvm.
>>>> 3: The performance tests are a bit tricky
>>>> to quantify
>>>> because of caching
>>>> effects. In fact, I'm not entirely sure
>>>> what is
>>>> happening here. It's just
>>>> best to describe what I'm seeing:
>>>>
>>>> The commands I'm using to test are
>>>> dd if=/dev/zero of=./test.dd bs=2M
>>>> count=5000
>>>> dd of=/dev/null if=./test.dd bs=2M
>>>> count=5000
>>>> The host vm is running Centos 6.6, and has
>>>> the latest
>>>> vmtools installed.
>>>> There is a host cache on an SSD local to
>>>> the host that
>>>> is also in place.
>>>> Disabling the host cache didn't
>>>> immediately have an
>>>> effect as far as I could
>>>> see.
>>>>
>>>> The host MTU was set to 3000 on all iSCSI
>>>> interfaces for all
>>>> tests.
>>>>
>>>> Test 1: Right after reboot, with an ixgbe
>>>> MTU of 9000,
>>>> the write test
>>>> yields an average speed over three tests
>>>> of 137MB/s. The
>>>> read test yields an
>>>> average over three tests of 5MB/s.
>>>>
>>>> Test 2: After setting "ifconfig ixgbe0 mtu
>>>> 3000", the
>>>> write tests yield
>>>> 140MB/s, and the read tests yield 53MB/s.
>>>> It's important
>>>> to note here that
>>>> if I cut the read test short at only
>>>> 2-3GB, I get
>>>> results upwards of
>>>> 350MB/s, which I assume is local
>>>> cache-related distortion.
>>>>
>>>> Test 3: MTU of 1500. Read tests are up to
>>>> 156 MB/s.
>>>> Write tests yield
>>>> about 142MB/s.
>>>> Test 4: MTU of 1000: Read test at 182MB/s.
>>>> Test 5: MTU of 900: Read test at 130 MB/s.
>>>> Test 6: MTU of 1000: Read test at 160MB/s.
>>>> Write tests
>>>> are now
>>>> consistently at about 300MB/s.
>>>> Test 7: MTU of 1200: Read test at 124MB/s.
>>>> Test 8: MTU of 1000: Read test at 161MB/s.
>>>> Write at 261MB/s.
>>>>
>>>> A few final notes:
>>>> L1ARC grabs about 10GB of RAM during the
>>>> tests, so
>>>> there's definitely some
>>>> read caching going on.
>>>> The write operations are easier to observe
>>>> with iostat,
>>>> and I'm seeing io
>>>> rates that closely correlate with the
>>>> network write speeds.
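>>>> (Roughly:
>>>>
>>>> iostat -xn 5
>>>>
>>>> running alongside the transfer, for anyone reproducing the comparison.)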
>>>>
>>>>
>>>> Chris, thanks for your specific details.
>>>> I'd appreciate
>>>> it if you could
>>>> tell me which copper NIC you tried, as
>>>> well as to pass
>>>> on the iSCSI tuning
>>>> parameters.
>>>>
>>>> I've ordered an Intel EXPX9502AFXSR, which
>>>> uses the
>>>> 82598 chip instead of
>>>> the 82599 in the X520. If I get similar
>>>> results with my
>>>> fiber transceivers,
>>>> I'll see if I can get a hold of copper ones.
>>>>
>>>> But I should mention that I did indeed
>>>> look at PHY/MAC
>>>> error rates, and
>>>> they are nil.
>>>>
>>>> -Warren V
>>>>
>>>> On Fri, Feb 20, 2015 at 7:25 PM, Chris Siebenmann <cks at cs.toronto.edu> wrote:
>>>>
>>>>
>>>> After installation and
>>>> configuration, I observed
>>>> all kinds of bad
>>>> behavior
>>>> in the network traffic between the
>>>> hosts and the
>>>> server. All of this
>>>> bad
>>>> behavior is traced to the ixgbe
>>>> driver on the
>>>> storage server. Without
>>>> going
>>>> into the full troubleshooting
>>>> process, here are
>>>> my takeaways:
>>>>
>>>> [...]
>>>>
>>>> For what it's worth, we managed to
>>>> achieve much
>>>> better line rates on
>>>> copper 10G ixgbe hardware of various
>>>> descriptions
>>>> between OmniOS
>>>> and CentOS 7 (I don't think we ever
>>>> tested OmniOS to
>>>> OmniOS). I don't
>>>> believe OmniOS could do TCP at full
>>>> line rate but I
>>>> think we managed 700+
>>>> Mbytes/sec on both transmit and
>>>> receive and we got
>>>> basically disk-limited
>>>> speeds with iSCSI (across multiple
>>>> disks on
>>>> multi-disk mirrored pools,
>>>> OmniOS iSCSI initiator, Linux iSCSI
>>>> targets).
>>>>
>>>> I don't believe we did any specific
>>>> kernel tuning
>>>> (and in fact some of
>>>> our attempts to fiddle ixgbe driver
>>>> parameters blew
>>>> up in our face).
>>>> We did tune iSCSI connection
>>>> parameters to increase
>>>> various buffer
>>>> sizes so that ZFS could do even large
>>>> single
>>>> operations in single iSCSI
>>>> transactions. (More details available
>>>> if people are
>>>> interested.)
>>>>
>>>> 10: At the wire level, the speed
>>>> problems are
>>>> clearly due to pauses in
>>>> response time by omnios. At 9000
>>>> byte frame
>>>> sizes, I see a good number
>>>> of duplicate ACKs and fast
>>>> retransmits during
>>>> read operations (when
>>>> omnios is transmitting). But below
>>>> about a
>>>> 4100-byte MTU on omnios
>>>> (which seems to correlate to
>>>> 4096-byte iSCSI
>>>> block transfers), the
>>>> transmission errors fade away and
>>>> we only see
>>>> the transmission pause
>>>> problem.
>>>>
>>>>
>>>> This is what really attracted my
>>>> attention. In
>>>> our OmniOS setup, our
>>>> specific Intel hardware had ixgbe
>>>> driver issues that
>>>> could cause
>>>> activity stalls during once-a-second
>>>> link heartbeat
>>>> checks. This
>>>> obviously had an effect at the TCP and
>>>> iSCSI layers.
>>>> My initial message
>>>> to illumos-developer sparked a
>>>> potentially
>>>> interesting discussion:
>>>>
>>>>
>>>> http://www.listbox.com/member/archive/182179/2014/10/sort/time_rev/page/16/entry/6:405/20141003125035:6357079A-4B1D-11E4-A39C-D534381BA44D/
>>>>
>>>> If you think this is a possibility in
>>>> your setup,
>>>> I've put the DTrace
>>>> script I used to hunt for this up on
>>>> the web:
>>>>
>>>> http://www.cs.toronto.edu/~cks/src/omnios-ixgbe/ixgbe_delay.d
>>>>
>>>> This isn't the only potential source
>>>> of driver
>>>> stalls by any means, it's
>>>> just the one I found. You may also
>>>> want to look at
>>>> lockstat in general,
>>>> as information it reported is what led
>>>> us to look
>>>> specifically at the
>>>> ixgbe code here.
>>>>
>>>> (If you suspect kernel/driver issues,
>>>> lockstat
>>>> combined with kernel
>>>> source is a really excellent resource.)
>>>>
>>>> - cks
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> OmniOS-discuss mailing list
>>>> OmniOS-discuss at lists.omniti.com
>>>> http://lists.omniti.com/mailman/listinfo/omnios-discuss
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> OSN Online Service Nuernberg GmbH, Bucher Str. 78, 90408 Nuernberg
>>>> Tel: +49 911 39905-0 - Fax: +49 911 39905-55 - http://www.osn.de
>>>> HRB 15022 Nuernberg, USt-Id: DE189301263, GF: Joerg Goltermann
>>>>
>>>>
>>>>
>>>> --
>>>> OSN Online Service Nuernberg GmbH, Bucher Str. 78, 90408 Nuernberg
>>>> Tel: +49 911 39905-0 - Fax: +49 911 39905-55 - http://www.osn.de
>>>> HRB 15022 Nuernberg, USt-Id: DE189301263, GF: Joerg Goltermann
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>> --
>> OSN Online Service Nuernberg GmbH, Bucher Str. 78, 90408 Nuernberg
>> Tel: +49 911 39905-0 - Fax: +49 911 39905-55 - http://www.osn.de
>> HRB 15022 Nuernberg, USt-Id: DE189301263, GF: Joerg Goltermann
>>
>
>
>
>