[OmniOS-discuss] [developer] Re: The ixgbe driver, Lindsay Lohan, and the Greek economy

W Verb wverb73 at gmail.com
Mon Mar 2 19:07:45 UTC 2015


Hello all,
I am not using layer 2 flow control. The switch carries line-rate 10G
traffic without error.

I think I have found the issue via lockstat. The first lockstat is taken
during a multipath read:


lockstat -kWP sleep 30

Adaptive mutex spin: 21331 events in 30.020 seconds (711 events/sec)

Count indv cuml rcnt     nsec Hottest Lock           Caller
-------------------------------------------------------------------------------
 9306  44%  44% 0.00     1557 htable_mutex+0x370     htable_release
 6307  23%  68% 0.00     1207 htable_mutex+0x108     htable_lookup
  596   7%  75% 0.00     4100 0xffffff0931705188     cv_wait
  349   5%  80% 0.00     4437 0xffffff0931705188     taskq_thread
  704   2%  82% 0.00      995 0xffffff0935de3c50     dbuf_create

The hash table being read here is, I would guess, the TCP connection
hash table.
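
(To check that guess, a DTrace aggregation on the contended function will show
whose stacks are doing the lookups. A minimal sketch, assuming fbt entry probes
exist for htable_lookup on this kernel:

  # count kernel stacks entering htable_lookup over a 30-second window
  dtrace -n 'fbt::htable_lookup:entry { @[stack(10)] = count(); } tick-30s { exit(0); }'

If the hot stacks are HAT/page-fault related rather than TCP, the contention is
on the MMU hash table instead of the connection hash.)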

When lockstat is run during a multipath write operation, I get:

Adaptive mutex spin: 1097341 events in 30.016 seconds (36558 events/sec)

Count indv cuml rcnt     nsec Hottest Lock           Caller
-------------------------------------------------------------------------------
210752  28%  28% 0.00     4781 0xffffff0931705188     taskq_thread
174471  22%  50% 0.00     4476 0xffffff0931705188     cv_wait
127183  10%  61% 0.00     2871 0xffffff096f29b510     zio_notify_parent
176066  10%  70% 0.00     1922 0xffffff0931705188     taskq_dispatch_ent
105134   9%  80% 0.00     3110 0xffffff096ffdbf10     zio_remove_child
67512   4%  83% 0.00     1938 0xffffff096f3db4b0     zio_add_child
45736   3%  86% 0.00     2239 0xffffff0935de3c50     dbuf_destroy
27781   3%  89% 0.00     3416 0xffffff0935de3c50     dbuf_create
38536   2%  91% 0.00     2122 0xffffff0935de3b70     dnode_rele
27841   2%  93% 0.00     2423 0xffffff0935de3b70     dnode_diduse_space
19020   2%  95% 0.00     3046 0xffffff09d9e305e0     dbuf_rele
14627   1%  96% 0.00     3632 dbuf_hash_table+0x4f8  dbuf_find



Writes are not performing htable lookups, while reads are.
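
(To see the full stacks behind those callers, lockstat can record stack traces
as well. A sketch, using the same 30-second window:

  # same profile as above, but keep up to 10 stack frames per event
  lockstat -kWP -s 10 sleep 30

That should make it obvious which code path is grabbing the htable mutexes
during reads but not during writes.)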

-Warren V






On Mon, Mar 2, 2015 at 3:14 AM, Joerg Goltermann <jg at osn.de> wrote:

> Hi,
>
> I would try *one* TPG which includes both interface addresses
> and I would double check for packet drops on the Catalyst.
>
> The 3560 supports only receive flow control which means, that
> a sending 10Gbit port can easily overload a 1Gbit port.
> Do you have flow control enabled?
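>
> (For reference, a quick way to check this on the OmniOS side, assuming the
> links are named ixgbe0/ixgbe1:
>
>   # show the current flow-control setting (no / tx / rx / bi)
>   dladm show-linkprop -p flowctrl ixgbe0 ixgbe1
>
>   # disable it entirely for a test
>   dladm set-linkprop -p flowctrl=no ixgbe0
>
> and "show flowcontrol" plus the interface error counters on the Catalyst
> should confirm whether pause frames or drops are actually in play.)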
>
>  - Joerg
>
>
> On 02.03.2015 09:22, W Verb via illumos-developer wrote:
>
>> Hello Garrett,
>>
>> No, no 802.3ad going on in this config.
>>
>> Here is a basic schematic:
>>
>> https://drive.google.com/file/d/0BwyUMjibonYQVkVqcE5OQUJyUUU/view?usp=sharing
>>
>> Here is the Nexenta MPIO iSCSI Setup Document that I used as a guide:
>>
>> https://drive.google.com/file/d/0BwyUMjibonYQbjEyUTBjN2tTNWM/view?usp=sharing
>>
>> Note that I am using an MTU of 3000 on both the 10G and 1G NICs. The
>> switch is set to allow 9148-byte frames, and I'm not seeing any
>> errors/buffer overruns on the switch.
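>>
>> (A quick sanity check on the server side: the MTU actually in effect on each
>> datalink can be read back with dladm, assuming the 10G links are ixgbe0 and
>> ixgbe1:
>>
>>   dladm show-linkprop -p mtu ixgbe0 ixgbe1
>>
>> The VALUE column should show 3000, since a mismatch anywhere in the path
>> means the smallest MTU wins.)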
>>
>> Here is a screenshot of a packet capture from a read operation on the
>> guest OS (from its local drive, which is actually a VMDK file on the
>> storage server). In this example, only a single 1G ESXi kernel interface
>> (vmk1) is bound to the software iSCSI initiator.
>>
>> https://drive.google.com/file/d/0BwyUMjibonYQa2NYdXhpZkpkbU0/view?usp=sharing
>>
>> Note that there's a nice, well-behaved window sizing process taking
>> place. The ESXi decreases the scaled window by 11 or 12 for each ACK,
>> then bumps it back up to 512.
>>
>> Here is a similar screenshot of a single-interface write operation:
>>
>> https://drive.google.com/file/d/0BwyUMjibonYQbU1RZHRnakxDSFU/view?usp=sharing
>>
>> There are no pauses or gaps in the transmission rate in the
>> single-interface transfers.
>>
>>
>> In the next screenshots, I have enabled an additional 1G interface on
>> the ESXi host, and bound it to the iSCSI initiator. The new interface is
>> bound to a separate physical port, uses a different VLAN on the switch,
>> and talks to a different 10G port on the storage server.
>>
>> First, let's look at a write operation on the guest OS, which happily
>> pumps data at near-line-rate to the storage server.
>>
>> Here is a sequence number trace diagram. Note how the transfer has a
>> nice, smooth increment rate over the entire transfer.
>>
>> https://drive.google.com/file/d/0BwyUMjibonYQWHNIa0drWnNxMmM/view?usp=sharing
>>
>> Here are screenshots from packet captures on both 1G interfaces:
>>
>> https://drive.google.com/file/d/0BwyUMjibonYQRWhyVVQ4djNaU3c/view?usp=sharing
>> https://drive.google.com/file/d/0BwyUMjibonYQaTVjTEtTRloyR2c/view?usp=sharing
>>
>> Note how we again see nice, smooth window adjustment, and no gaps in
>> transmission.
>>
>>
>> But now, let's look at the problematic two-interface Read operation.
>> First, the sequence graph:
>>
>> https://drive.google.com/file/d/0BwyUMjibonYQTzdFVWdQMWZ6LUU/view?usp=sharing
>>
>> As you can see, there are gaps and jumps in the transmission throughout
>> the transfer.
>> It is very illustrative to look at captures of the gaps, which are
>> occurring on both interfaces:
>>
>> https://drive.google.com/file/d/0BwyUMjibonYQc0VISXN6eVFwQzg/view?usp=sharing
>> https://drive.google.com/file/d/0BwyUMjibonYQVFREUHp3TGFiUU0/view?usp=sharing
>>
>> As you can see, there are ~0.4-second pauses in transmission from the
>> storage server, which kill the transfer rate.
>> It's clear that the ESXi box ACKs the prior iSCSI operation to
>> completion, then makes a new LUN request, which the storage server
>> immediately replies to. The ESXi ACKs the response packet from the
>> storage server, then waits...and waits....and waits... until eventually
>> the storage server starts transmitting again.
>>
>> Because the pause happens while the ESXi client is waiting for a packet
>> from the storage server, that tells me that the gaps are not an artifact
>> of traffic being switched between both active interfaces, but are
>> actually indicative of short hangs occurring on the server.
>>
>> Having a pause or two in transmission is no big deal, but in my case, it
>> is happening constantly, and dropping my overall read transfer rate down
>> to 20-60MB/s, which is slower than the single interface transfer rate
>> (~90-100MB/s).
>>
>> Decreasing the MTU makes the pauses shorter; increasing it makes the
>> pauses longer.
>>
>> Another interesting thing is that if I set the multipath io interval to
>> 3 operations instead of 1, I get better throughput. In other words, the
>> less frequently I swap IP addresses on my iSCSI requests from the ESXi
>> unit, the fewer pauses I see.
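>>
>> (For anyone reproducing this: the round-robin IO interval is set per device
>> on the ESXi side with esxcli. A sketch, with the naa ID below standing in
>> for the actual LUN:
>>
>>   esxcli storage nmp psp roundrobin deviceconfig set \
>>       --device=naa.XXXXXXXXXXXXXXXX --type=iops --iops=3
>>
>> An interval of 1 is the worst case for the pauses described above.)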
>>
>> Basically, COMSTAR seems to choke each time an iSCSI request from a new
>> IP arrives.
>>
>> Because the single interface transfer is near line rate, that tells me
>> that the storage system (mpt_sas, zfs, etc) is working fine. It's only
>> when multiple paths are attempted that iSCSI falls on its face during
>> reads.
>>
>> All of these captures were taken without a cache device being attached
>> to the storage zpool, so this isn't looking like some kind of ZFS ARC
>> problem. As mentioned previously, local transfers to/from the zpool are
>> showing ~300-500 MB/s rates over long transfers (10G+).
>>
>> -Warren V
>>
>> On Sun, Mar 1, 2015 at 9:11 PM, Garrett D'Amore <garrett at damore.org> wrote:
>>
>>     I’m not sure I’ve followed properly.  You have *two* interfaces.
>>     You are not trying to provision these in an aggr are you? As far as
>>     I’m aware, VMware does not support 802.3ad link aggregations.  (It’s
>>     possible that you can make it work with ESXi if you give the entire
>>     NIC to the guest — but I’m skeptical.)  The problem is that if you
>>     try to use link aggregation, some packets (up to half!) will be
>>     lost.  TCP and other protocols fare poorly in this situation.
>>
>>     It’s possible I’ve totally misunderstood what you’re trying to do, in
>>     which case I apologize.
>>
>>     The idle thing is a red herring — the CPU is waiting for work to do,
>>     probably because packets haven’t arrived (or were dropped by the
>>     hypervisor!)  I wouldn’t read too much into that except that your
>>     network stack is in trouble.  I’d look a bit more closely at the
>>     kstats for tcp — I suspect you’ll see retransmits or out of order
>>     values that are unusually high — if so this may help validate my
>>     theory above.
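>>
>>     (Concretely, something along these lines, with counter names as found
>>     in the illumos tcp kstat module (verify with "kstat -m tcp" on the box):
>>
>>         kstat -p tcp:0:tcp:tcpRetransSegs tcp:0:tcp:tcpRetransBytes
>>         kstat -p tcp:0:tcp:tcpInDupAck tcp:0:tcp:tcpInDataDupSegs
>>
>>     Sampling before and after a slow read and comparing the deltas should
>>     show whether retransmits line up with the pauses.)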
>>
>>     - Garrett
>>
>>     On Mar 1, 2015, at 9:03 PM, W Verb via illumos-developer
>>>     <developer at lists.illumos.org> wrote:
>>>
>>>     Hello all,
>>>
>>>
>>>     Well, I no longer blame the ixgbe driver for the problems I'm seeing.
>>>
>>>
>>>     I tried Joerg's updated driver, which didn't improve the issue. So
>>>     I went back to the drawing board and rebuilt the server from scratch.
>>>
>>>     What I noted is that if I have only a single 1-gig physical
>>>     interface active on the ESXi host, everything works as expected.
>>>     As soon as I enable two interfaces, I start seeing the performance
>>>     problems I've described.
>>>
>>>     Response pauses from the server that I see in TCPdumps are still
>>>     leading me to believe the problem is delay on the server side, so
>>>     I ran a series of kernel dtraces and produced some flamegraphs.
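>>>
>>>     (For anyone who wants to reproduce these, a typical capture/render
>>>     pipeline, assuming Brendan Gregg's stackcollapse.pl/flamegraph.pl
>>>     scripts are on hand:
>>>
>>>         # sample on-CPU kernel stacks at ~997 Hz for 60 seconds
>>>         dtrace -x stackframes=100 -n 'profile-997 /arg0/ {
>>>             @[stack()] = count(); } tick-60s { exit(0); }' -o out.stacks
>>>
>>>         stackcollapse.pl out.stacks > out.folded
>>>         flamegraph.pl out.folded > flamegraph.svg
>>>     )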
>>>
>>>
>>>     This was taken during a read operation with two active 10G
>>>     interfaces on the server, with a single target being shared by two
>>>     TPGs - one TPG for each 10G physical port. The host device has two
>>>     1G ports enabled, with VLANs separating the active ports into
>>>     10G/1G pairs. ESXi is set to multipath using both VLANS with a
>>>     round-robin IO interval of 1.
>>>
>>>     https://drive.google.com/file/d/0BwyUMjibonYQd3ZYOGh4d2pteGs/view?usp=sharing
>>>
>>>
>>>     This was taken during a write operation:
>>>
>>>     https://drive.google.com/file/d/0BwyUMjibonYQMnBtU1Q2SXM2ams/view?usp=sharing
>>>
>>>
>>>     I then rebooted the server and disabled C-State, ACPI T-State, and
>>>     general EIST (Turbo boost) functionality in the CPU.
>>>
>>>     When I attempted to boot my guest VM, the iSCSI transfer
>>>     gradually ground to a halt during the boot loading process, and
>>>     the guest OS never completed its boot.
>>>
>>>     Here is a flamegraph taken while iSCSI is slowly dying:
>>>
>>>     https://drive.google.com/file/d/0BwyUMjibonYQM21JeFZPX3dZWTg/view?usp=sharing
>>>
>>>
>>>     I edited out cpu_idle_adaptive from the dtrace output and
>>>     regenerated the slowdown graph:
>>>
>>>     https://drive.google.com/file/d/0BwyUMjibonYQbTVwV3NvXzlPS1E/view?usp=sharing
>>>
>>>
>>>     I then edited cpu_idle_adaptive out of the speedy write operation
>>>     and regenerated that graph:
>>>
>>>     https://drive.google.com/file/d/0BwyUMjibonYQeWFYM0pCMDZ1X2s/view?usp=sharing
>>>
>>>
>>>     I have zero experience with interpreting flamegraphs, but the most
>>>     significant difference I see between the slow read example and the
>>>     fast write example is in unix`thread_start --> unix`idle. There's
>>>     a good chunk of "unix`i86_mwait" in the read example that is not
>>>     present in the write example at all.
>>>
>>>     Disabling the l2arc cache device didn't make a difference, and I
>>>     had to reenable EIST support on the CPU to get my VMs to boot.
>>>
>>>     I am seeing a variety of bug reports going back to 2010 regarding
>>>     excessive mwait operations, with the suggested solutions usually
>>>     being to set "cpupm enable poll-mode" in power.conf. That change
>>>     also had no effect on speed.
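>>>
>>>     (For completeness, the change amounts to adding one line, assuming the
>>>     stock /etc/power.conf location:
>>>
>>>         # /etc/power.conf
>>>         cpupm enable poll-mode
>>>
>>>     and then running pmconfig to apply it without a reboot.)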
>>>
>>>     -Warren V
>>>
>>>
>>>
>>>
>>>     -----Original Message-----
>>>
>>>     From: Chris Siebenmann [mailto:cks at cs.toronto.edu]
>>>
>>>     Sent: Monday, February 23, 2015 8:30 AM
>>>
>>>     To: W Verb
>>>
>>>     Cc: omnios-discuss at lists.omniti.com; cks at cs.toronto.edu
>>>
>>>     Subject: Re: [OmniOS-discuss] The ixgbe driver, Lindsay Lohan, and
>>>     the Greek economy
>>>
>>>
>>>     > Chris, thanks for your specific details. I'd appreciate it if you
>>>
>>>     > could tell me which copper NIC you tried, as well as to pass on the
>>>
>>>     > iSCSI tuning parameters.
>>>
>>>
>>>     Our copper NIC experience is with onboard X540-AT2 ports on
>>>     SuperMicro hardware (which have the guaranteed 10-20 msec lock
>>>     hold) and dual-port 82599EB TN cards (which have some sort of
>>>     driver/hardware failure under load that eventually leads to
>>>     2-second lock holds). I can't recommend either with the current
>>>     driver; we had to revert to 1G networking in order to get stable
>>>     servers.
>>>
>>>
>>>     The iSCSI parameter modifications we do, across both initiators
>>>     and targets, are:
>>>
>>>
>>>     initialr2t no
>>>
>>>     firstburstlength 128k
>>>
>>>     maxrecvdataseglen 128k  [only on Linux backends]
>>>
>>>     maxxmitdataseglen 128k  [only on Linux backends]
>>>
>>>
>>>     The OmniOS initiator doesn't need tuning for more than the first
>>>     two parameters; on the Linux backends we tune up all four. My
>>>     extended thoughts on these tuning parameters and why we touch them
>>>     can be found
>>>
>>>     here:
>>>
>>>
>>>     http://utcc.utoronto.ca/~cks/space/blog/tech/UnderstandingiSCSIProtocol
>>>
>>>     http://utcc.utoronto.ca/~cks/space/blog/tech/LikelyISCSITuning
>>>
>>>
>>>     The short version is that these parameters probably only make a
>>>     small difference but their overall goal is to do 128KB ZFS reads
>>>     and writes in single iSCSI operations (although they will be
>>>     fragmented at the TCP layer) and to do iSCSI writes without a
>>>     back-and-forth delay between initiator and target (that's
>>>     'initialr2t no').
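>>>
>>>     (Purely as an illustration of the four knobs above, here is how they
>>>     would be spelled in an open-iscsi iscsid.conf; the setup discussed
>>>     here is an OmniOS initiator against Linux targets, so the actual
>>>     files and tools differ:
>>>
>>>         node.session.iscsi.InitialR2T = No
>>>         node.session.iscsi.FirstBurstLength = 131072
>>>         node.conn[0].iscsi.MaxRecvDataSegmentLength = 131072
>>>         node.conn[0].iscsi.MaxXmitDataSegmentLength = 131072
>>>     )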
>>>
>>>
>>>     I think basically everyone should use InitialR2T set to no and in
>>>     fact that it should be the software default. These days only
>>>     unusually limited iSCSI targets should need it to be otherwise and
>>>     they can change their setting for it (initiator and target must
>>>     both agree to it being 'yes', so either can veto it).
>>>
>>>
>>>     - cks
>>>
>>>
>>>
>>>     On Mon, Feb 23, 2015 at 8:21 AM, Joerg Goltermann <jg at osn.de> wrote:
>>>
>>>         Hi,
>>>
>>>         I think your problem is caused by your link properties or your
>>>         switch settings. In general the standard ixgbe seems to perform
>>>         well.
>>>
>>>         I had trouble after changing the default flow control settings
>>>         to "bi"
>>>         and this was my motivation to update the ixgbe driver a long
>>>         time ago.
>>>         After I have updated our systems to ixgbe 2.5.8 I never had any
>>>         problems ....
>>>
>>>         Make sure your switch has support for jumbo frames and you use
>>>         the same mtu on all ports, otherwise the smallest will be used.
>>>
>>>         What switch do you use? I can tell you nice horror stories about
>>>         different vendors....
>>>
>>>          - Joerg
>>>
>>>         On 23.02.2015 10:31, W Verb wrote:
>>>
>>>             Thank you Joerg,
>>>
>>>             I've downloaded the package and will try it tomorrow.
>>>
>>>             The only thing I can add at this point is that upon review
>>>             of my
>>>             testing, I may have performed my "pkg -u" between the
>>>             initial quad-gig
>>>             performance test and installing the 10G NIC. So this may
>>>             be a new
>>>             problem introduced in the latest updates.
>>>
>>>             Those of you who are running 10G and have not upgraded to
>>>             the latest
>>>             kernel, etc, might want to do some additional testing
>>>             before running the
>>>             update.
>>>
>>>             -Warren V
>>>
>>>             On Mon, Feb 23, 2015 at 1:15 AM, Joerg Goltermann
>>>             <jg at osn.de> wrote:
>>>
>>>                 Hi,
>>>
>>>                 I remember there was a problem with the flow control
>>>             settings in the
>>>                 ixgbe
>>>                 driver, so I updated it a long time ago for our
>>>             internal servers to
>>>                 2.5.8.
>>>                 Last weekend I integrated the latest changes from the
>>>             FreeBSD driver
>>>                 to bring
>>>                 the illumos ixgbe to 2.5.25 but I had no time to test
>>>             it, so it's
>>>                 completely
>>>                 untested!
>>>
>>>
>>>                 If you would like to give the latest driver a try you
>>>             can fetch the
>>>                 kernel modules from
>>>             https://cloud.osn.de/index.php/s/Fb4so9RsNnXA7r9
>>>
>>>                 Clone your boot environment, place the modules in the
>>>             new environment
>>>                 and update the boot-archive of the new BE.
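>>>
>>>                 (Roughly, assuming the 64-bit module goes in
>>>                 /kernel/drv/amd64 and using an arbitrary BE name:
>>>
>>>                     beadm create ixgbe-test
>>>                     beadm mount ixgbe-test /mnt
>>>                     cp ixgbe /mnt/kernel/drv/amd64/ixgbe
>>>                     bootadm update-archive -R /mnt
>>>                     beadm activate ixgbe-test && reboot
>>>                 )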
>>>
>>>                   - Joerg
>>>
>>>
>>>
>>>
>>>
>>>                 On 23.02.2015 02:54, W Verb wrote:
>>>
>>>                     By the way, to those of you who have working
>>>             setups: please send me
>>>                     your pool/volume settings, interface linkprops,
>>>             and any kernel
>>>                     tuning
>>>                     parameters you may have set.
>>>
>>>                     Thanks,
>>>                     Warren V
>>>
>>>                     On Sat, Feb 21, 2015 at 7:59 AM, Schweiss, Chip
>>>                     <chip at innovates.com> wrote:
>>>
>>>                         I can't say I totally agree with your performance
>>>                         assessment.   I run Intel
>>>                         X520 in all my OmniOS boxes.
>>>
>>>                         Here is a capture of nfssvrtop I made while
>>>             running many
>>>                         storage vMotions
>>>                         between two OmniOS boxes hosting NFS
>>>             datastores.   This is a
>>>                         10 host VMware
>>>                         cluster.  Both OmniOS boxes are dual 10G
>>>             connected with
>>>                         copper twin-ax to
>>>                         the in rack Nexus 5010.
>>>
>>>                         VMware does 100% sync writes, I use ZeusRAM
>>>             SSDs for log
>>>                         devices.
>>>
>>>                         -Chip
>>>
>>>   2014 Apr 24 08:05:51, load: 12.64, read: 17330243 KB, swrite: 15985 KB, awrite: 1875455 KB
>>>
>>>   Ver  Client        NFSOPS  Reads SWrites AWrites Commits   Rd_bw SWr_bw  AWr_bw  Rd_t SWr_t AWr_t Com_t Align%
>>>   4    10.28.17.105       0      0       0       0       0       0      0       0     0     0     0     0      0
>>>   4    10.28.17.215       0      0       0       0       0       0      0       0     0     0     0     0      0
>>>   4    10.28.17.213       0      0       0       0       0       0      0       0     0     0     0     0      0
>>>   4    10.28.16.151       0      0       0       0       0       0      0       0     0     0     0     0      0
>>>   4    all                1      0       0       0       0       0      0       0     0     0     0     0      0
>>>   3    10.28.16.175       3      0       3       0       0       1     11       0  4806    48     0     0     85
>>>   3    10.28.16.183       6      0       6       0       0       3    162       0   549   124     0     0     73
>>>   3    10.28.16.180      11      0      10       0       0       3     27       0   776    89     0     0     67
>>>   3    10.28.16.176      28      2      26       0       0      10    405       0  2572   198     0     0    100
>>>   3    10.28.16.178    4606   4602       4       0       0  294534      3       0   723    49     0     0     99
>>>   3    10.28.16.179    4905   4879      26       0       0  312208    311       0   735   271     0     0     99
>>>   3    10.28.16.181    5515   5502      13       0       0  352107     77       0    89    87     0     0     99
>>>   3    10.28.16.184   12095  12059      10       0       0  763014     39       0   249   147     0     0     99
>>>   3    10.28.58.1     15401   6040     116    6354      53  191605    474  202346   192    96   144    83     99
>>>   3    all            42574  33086     217    6354      53 1913488   1582  202300   348   138   153   105     99
>>>
>>>
>>>
>>>
>>>
>>>                         On Fri, Feb 20, 2015 at 11:46 PM, W Verb
>>>                         <wverb73 at gmail.com> wrote:
>>>
>>>
>>>                             Hello All,
>>>
>>>                             Thank you for your replies.
>>>                             I tried a few things, and found the
>>> following:
>>>
>>>                             1: Disabling hyperthreading support in the
>>>             BIOS drops
>>>                             performance overall
>>>                             by a factor of 4.
>>>                             2: Disabling VT support also seems to have
>>>             some effect,
>>>                             although it
>>>                             appears to be minor. But this has the
>>>             amusing side
>>>                             effect of fixing the
>>>                             hangs I've been experiencing with fast
>>>             reboot. Probably
>>>                             by disabling kvm.
>>>                             3: The performance tests are a bit tricky
>>>             to quantify
>>>                             because of caching
>>>                             effects. In fact, I'm not entirely sure
>>>             what is
>>>                             happening here. It's just
>>>                             best to describe what I'm seeing:
>>>
>>>                             The commands I'm using to test are
>>>                             dd if=/dev/zero of=./test.dd bs=2M count=5000
>>>                             dd of=/dev/null if=./test.dd bs=2M count=5000
>>>                             The host vm is running Centos 6.6, and has
>>>             the latest
>>>                             vmtools installed.
>>>                             There is a host cache on an SSD local to
>>>             the host that
>>>                             is also in place.
>>>                             Disabling the host cache didn't
>>>             immediately have an
>>>                             effect as far as I could
>>>                             see.
>>>
>>>                             The host MTU set to 3000 on all iSCSI
>>>             interfaces for all
>>>                             tests.
>>>
>>>                             Test 1: Right after reboot, with an ixgbe
>>>             MTU of 9000,
>>>                             the write test
>>>                             yields an average speed over three tests
>>>             of 137MB/s. The
>>>                             read test yields an
>>>                             average over three tests of 5MB/s.
>>>
>>>                             Test 2: After setting "ifconfig ixgbe0 mtu
>>>             3000", the
>>>                             write tests yield
>>>                             140MB/s, and the read tests yield 53MB/s.
>>>             It's important
>>>                             to note here that
>>>                             if I cut the read test short at only
>>>             2-3GB, I get
>>>                             results upwards of
>>>                             350MB/s, which I assume is local
>>>             cache-related distortion.
>>>
>>>                             Test 3: MTU of 1500. Read tests are up to
>>>             156 MB/s.
>>>                             Write tests yield
>>>                             about 142MB/s.
>>>                             Test 4: MTU of 1000: Read test at 182MB/s.
>>>                             Test 5: MTU of 900: Read test at 130 MB/s.
>>>                             Test 6: MTU of 1000: Read test at 160MB/s.
>>>             Write tests
>>>                             are now
>>>                             consistently at about 300MB/s.
>>>                             Test 7: MTU of 1200: Read test at 124MB/s.
>>>                             Test 8: MTU of 1000: Read test at 161MB/s.
>>>             Write at 261MB/s.
>>>
>>>                             A few final notes:
>>>                             L1ARC grabs about 10GB of RAM during the
>>>             tests, so
>>>                             there's definitely some
>>>                             read caching going on.
>>>                             The write operations are easier to observe
>>>             with iostat,
>>>                             and I'm seeing io
>>>                             rates that closely correlate with the
>>>             network write speeds.
>>>
>>>
>>>                             Chris, thanks for your specific details.
>>>             I'd appreciate
>>>                             it if you could
>>>                             tell me which copper NIC you tried, as
>>>             well as to pass
>>>                             on the iSCSI tuning
>>>                             parameters.
>>>
>>>                             I've ordered an Intel EXPX9502AFXSR, which
>>>             uses the
>>>                             82598 chip instead of
>>>                             the 82599 in the X520. If I get similar
>>>             results with my
>>>                             fiber transcievers,
>>>                             I'll see if I can get a hold of copper ones.
>>>
>>>                             But I should mention that I did indeed
>>>             look at PHY/MAC
>>>                             error rates, and
>>>                             they are nil.
>>>
>>>                             -Warren V
>>>
>>>                             On Fri, Feb 20, 2015 at 7:25 PM, Chris Siebenmann
>>>                             <cks at cs.toronto.edu> wrote:
>>>
>>>
>>>                                     After installation and
>>>             configuration, I observed
>>>                                     all kinds of bad
>>>                                     behavior
>>>                                     in the network traffic between the
>>>             hosts and the
>>>                                     server. All of this
>>>                                     bad
>>>                                     behavior is traced to the ixgbe
>>>             driver on the
>>>                                     storage server. Without
>>>                                     going
>>>                                     into the full troubleshooting
>>>             process, here are
>>>                                     my takeaways:
>>>
>>>                                 [...]
>>>
>>>                                    For what it's worth, we managed to
>>>             achieve much
>>>                                 better line rates on
>>>                                 copper 10G ixgbe hardware of various
>>>             descriptions
>>>                                 between OmniOS
>>>                                 and CentOS 7 (I don't think we ever
>>>             tested OmniOS to
>>>                                 OmniOS). I don't
>>>                                 believe OmniOS could do TCP at full
>>>             line rate but I
>>>                                 think we managed 700+
>>>                                 Mbytes/sec on both transmit and
>>>             receive and we got
>>>                                 basically disk-limited
>>>                                 speeds with iSCSI (across multiple
>>>             disks on
>>>                                 multi-disk mirrored pools,
>>>                                 OmniOS iSCSI initiator, Linux iSCSI
>>>             targets).
>>>
>>>                                    I don't believe we did any specific
>>>             kernel tuning
>>>                                 (and in fact some of
>>>                                 our attempts to fiddle ixgbe driver
>>>             parameters blew
>>>                                 up in our face).
>>>                                 We did tune iSCSI connection
>>>             parameters to increase
>>>                                 various buffer
>>>                                 sizes so that ZFS could do even large
>>>             single
>>>                                 operations in single iSCSI
>>>                                 transactions. (More details available
>>>             if people are
>>>                                 interested.)
>>>
>>>                                     10: At the wire level, the speed
>>>             problems are
>>>                                     clearly due to pauses in
>>>                                     response time by omnios. At 9000
>>>             byte frame
>>>                                     sizes, I see a good number
>>>                                     of duplicate ACKs and fast
>>>             retransmits during
>>>                                     read operations (when
>>>                                     omnios is transmitting). But below
>>>             about a
>>>                                     4100-byte MTU on omnios
>>>                                     (which seems to correlate to
>>>             4096-byte iSCSI
>>>                                     block transfers), the
>>>                                     transmission errors fade away and
>>>             we only see
>>>                                     the transmission pause
>>>                                     problem.
>>>
>>>
>>>                                    This is what really attracted my
>>>             attention. In
>>>                                 our OmniOS setup, our
>>>                                 specific Intel hardware had ixgbe
>>>             driver issues that
>>>                                 could cause
>>>                                 activity stalls during once-a-second
>>>             link heartbeat
>>>                                 checks. This
>>>                                 obviously had an effect at the TCP and
>>>             iSCSI layers.
>>>                                 My initial message
>>>                                 to illumos-developer sparked a
>>> potentially
>>>                                 interesting discussion:
>>>
>>>
>>>             http://www.listbox.com/member/archive/182179/2014/10/sort/time_rev/page/16/entry/6:405/20141003125035:6357079A-4B1D-11E4-A39C-D534381BA44D/
>>>
>>>                                 If you think this is a possibility in
>>>             your setup,
>>>                                 I've put the DTrace
>>>                                 script I used to hunt for this up on
>>>             the web:
>>>
>>>             http://www.cs.toronto.edu/~cks/src/omnios-ixgbe/ixgbe_delay.d
>>>
>>>                                 This isn't the only potential source
>>>             of driver
>>>                                 stalls by any means, it's
>>>                                 just the one I found. You may also
>>>             want to look at
>>>                                 lockstat in general,
>>>                                 as information it reported is what led
>>>             us to look
>>>                                 specifically at the
>>>                                 ixgbe code here.
>>>
>>>                                 (If you suspect kernel/driver issues,
>>>             lockstat
>>>                                 combined with kernel
>>>                                 source is a really excellent resource.)
>>>
>>>                                           - cks
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>                 --
>>>                 OSN Online Service Nuernberg GmbH, Bucher Str. 78, 90408 Nuernberg
>>>                 Tel: +49 911 39905-0 - Fax: +49 911 39905-55 - http://www.osn.de
>>>                 HRB 15022 Nuernberg, USt-Id: DE189301263, GF: Joerg Goltermann
>>>
>>>
>>>
>>>         --
>>>         OSN Online Service Nuernberg GmbH, Bucher Str. 78, 90408 Nuernberg
>>>         Tel: +49 911 39905-0 - Fax: +49 911 39905-55 - http://www.osn.de
>>>         HRB 15022 Nuernberg, USt-Id: DE189301263, GF: Joerg Goltermann
>>>
>>>
>>>
>>>
>>
>>
>>
> --
> OSN Online Service Nuernberg GmbH, Bucher Str. 78, 90408 Nuernberg
> Tel: +49 911 39905-0 - Fax: +49 911 39905-55 - http://www.osn.de
> HRB 15022 Nuernberg, USt-Id: DE189301263, GF: Joerg Goltermann
>

