[OmniOS-discuss] zpool Write Bottlenecks

Michael Talbott mtalbott at lji.org
Fri Sep 30 05:19:38 UTC 2016


I'm very aware that dd is not the best tool for measuring disk performance in most cases, and I know the read throughput number is way off because of ZFS caching. However, it does work in this case to illustrate my point: there's a write throttle somewhere in the system. If anyone needs me to test with some other tool for illustrative purposes, I can do that too. It's just so odd that 1 card with a given set of disks in a pool provides roughly the same net throughput as 2 cards with 2 sets of disks in that pool.

But for the naysayers of dd testing, I'll provide this: an example using 2 x 11-disk raidz3 vdevs where all 22 drives live on one backplane attached to 1 card over an 8x-phy SAS connection, and then the same pool after adding 2 more 11-disk raidz3 vdevs connected through a separate card (also an 8x-phy SAS link). No compression. I ran bonnie++ to saturate the disks while pulling the numbers from iotop.
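
For reference, the bonnie++ invocation was along these lines; this is a minimal sketch rather than my exact command line, and the target directory and sizing flags here are assumptions:

# Hypothetical bonnie++ run used to saturate the pool with sequential I/O.
# -s sizes the dataset well past RAM so the ARC can't hide the disks,
# -n 0 skips the small-file creation phase, -f skips the per-char tests.
bonnie++ -d /datastore/bench -s 288g -n 0 -f -u root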

The following is output of iotop (which uses DTrace to measure disk I/O):
http://www.brendangregg.com/DTrace/iotop
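
The samples below were collected by running that script with a fixed interval, something like the following (the interval shown is just an example):

# Print rolling per-device byte counts every 10 seconds without clearing the screen.
./iotop -C 10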

Here's the 22 drive pool (all attached to 1 card):
------------------------------------------------------------------------------------------------------
2016 Sep 30 00:02:42,  load: 3.17,  disk_r:      0 KB,  disk_w: 3394084 KB

  UID    PID   PPID CMD              DEVICE  MAJ MIN D            BYTES
    0   7630      0 zpool-datastore  sd124   194 7936 W        161812480
    0   7630      0 zpool-datastore  sd118   194 7552 W        161820672
    0   7630      0 zpool-datastore  sd127   194 8128 W        161845248
    0   7630      0 zpool-datastore  sd128   194 8192 W        161845248
    0   7630      0 zpool-datastore  sd122   194 7808 W        161849344
    0   7630      0 zpool-datastore  sd119   194 7616 W        161857536
    0   7630      0 zpool-datastore  sd121   194 7744 W        161857536
    0   7630      0 zpool-datastore  sd125   194 8000 W        161865728
    0   7630      0 zpool-datastore  sd123   194 7872 W        161869824
    0   7630      0 zpool-datastore  sd126   194 8064 W        161873920
    0   7630      0 zpool-datastore  sd120   194 7680 W        161906688
    0   7630      0 zpool-datastore  sd136   194 8704 W        165916672
    0   7630      0 zpool-datastore  sd137   194 8768 W        165916672
    0   7630      0 zpool-datastore  sd138   194 8832 W        165933056
    0   7630      0 zpool-datastore  sd135   194 8640 W        165937152
    0   7630      0 zpool-datastore  sd139   194 8896 W        165941248
    0   7630      0 zpool-datastore  sd134   194 8576 W        165945344
    0   7630      0 zpool-datastore  sd130   194 8320 W        165974016
    0   7630      0 zpool-datastore  sd129   194 8256 W        165978112
    0   7630      0 zpool-datastore  sd132   194 8448 W        165994496
    0   7630      0 zpool-datastore  sd133   194 8512 W        165994496
    0   7630      0 zpool-datastore  sd131   194 8384 W        166006784

------------------------------------------------------------------------------------------------------


And here's the pool extended with 2 more raidz3 vdevs, now spread across the 2 cards.
Notice the per-drive throughput is almost LITERALLY cut in HALF (see the quick summation check after this sample)!


------------------------------------------------------------------------------------------------------


2016 Sep 30 00:01:07,  load: 4.59,  disk_r:      8 KB,  disk_w: 3609852 KB

  UID    PID   PPID CMD              DEVICE  MAJ MIN D            BYTES
    0   4550      0 zpool-datastore  sd133   194 8512 W         76558336
    0   4550      0 zpool-datastore  sd132   194 8448 W         76566528
    0   4550      0 zpool-datastore  sd135   194 8640 W         76570624
    0   4550      0 zpool-datastore  sd134   194 8576 W         76574720
    0   4550      0 zpool-datastore  sd136   194 8704 W         76582912
    0   4550      0 zpool-datastore  sd131   194 8384 W         76611584
    0   4550      0 zpool-datastore  sd130   194 8320 W         76644352
    0   4550      0 zpool-datastore  sd137   194 8768 W         76648448
    0   4550      0 zpool-datastore  sd129   194 8256 W         76660736
    0   4550      0 zpool-datastore  sd138   194 8832 W         76713984
    0   4550      0 zpool-datastore  sd139   194 8896 W         77369344
    0   4550      0 zpool-datastore  sd113   194 7232 W         77770752
    0   4550      0 zpool-datastore  sd115   194 7360 W         77832192
    0   4550      0 zpool-datastore  sd114   194 7296 W         77836288
    0   4550      0 zpool-datastore  sd111   194 7104 W         77840384
    0   4550      0 zpool-datastore  sd112   194 7168 W         77840384
    0   4550      0 zpool-datastore  sd108   194 6912 W         77844480
    0   4550      0 zpool-datastore  sd110   194 7040 W         77864960
    0   4550      0 zpool-datastore  sd116   194 7424 W         77873152
    0   4550      0 zpool-datastore  sd107   194 6848 W         77914112
    0   4550      0 zpool-datastore  sd106   194 6784 W         77918208
    0   4550      0 zpool-datastore  sd109   194 6976 W         77926400
    0   4550      0 zpool-datastore  sd128   194 8192 W         78938112
    0   4550      0 zpool-datastore  sd118   194 7552 W         78979072
    0   4550      0 zpool-datastore  sd125   194 8000 W         78991360
    0   4550      0 zpool-datastore  sd120   194 7680 W         78999552
    0   4550      0 zpool-datastore  sd127   194 8128 W         79007744
    0   4550      0 zpool-datastore  sd126   194 8064 W         79011840
    0   4550      0 zpool-datastore  sd119   194 7616 W         79020032
    0   4550      0 zpool-datastore  sd123   194 7872 W         79020032
    0   4550      0 zpool-datastore  sd122   194 7808 W         79048704
    0   4550      0 zpool-datastore  sd124   194 7936 W         79056896
    0   4550      0 zpool-datastore  sd121   194 7744 W         79065088
    0   4550      0 zpool-datastore  sd105   194 6720 W         82460672
    0   4550      0 zpool-datastore  sd102   194 6528 W         82468864
    0   4550      0 zpool-datastore  sd101   194 6464 W         82477056
    0   4550      0 zpool-datastore  sd104   194 6656 W         82477056
    0   4550      0 zpool-datastore  sd103   194 6592 W         82481152
    0   4550      0 zpool-datastore  sd141   194 9024 W         82489344
    0   4550      0 zpool-datastore  sd99    194 6336 W         82493440
    0   4550      0 zpool-datastore  sd98    194 6272 W         82501632
    0   4550      0 zpool-datastore  sd140   194 8960 W         82513920
    0   4550      0 zpool-datastore  sd97    194 6208 W         82538496
    0   4550      0 zpool-datastore  sd100   194 6400 W         82542592

------------------------------------------------------------------------------------------------------
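
As a quick sanity check, summing the BYTES column shows the aggregate barely moves: 22 drives at ~162 MB each and 44 drives at ~77-82 MB each both come out to roughly 3.5 GB per sample, consistent with the disk_w totals in the headers. A throwaway awk one-liner does the math (the capture file names here are just placeholders):

# Sum per-device write bytes from a saved iotop sample.
awk '/zpool-datastore/ { sum += $NF } END { printf "total: %.2f GB\n", sum / 1e9 }' iotop-22drives.txt
awk '/zpool-datastore/ { sum += $NF } END { printf "total: %.2f GB\n", sum / 1e9 }' iotop-44drives.txt

So the pool total stays flat, and the per-drive figure halves when the drive count doubles.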




Any thoughts on how to discover and/or overcome the true bottleneck would be much appreciated.
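
In case it helps, I can also collect per-device utilization with iostat during a run (interval is just an example); if the disks report low %b while the pool is pegged at ~1GB/s, that would point at something above the disks rather than the spindles themselves:

# Extended per-device stats with descriptive device names, sampled every 5 seconds.
iostat -xn 5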

Thanks

Michael



> On Sep 29, 2016, at 7:46 PM, Dale Ghent <daleg at omniti.com> wrote:
> 
> 
> Awesome that you're using LX Zones this way with BeeGFS.
> 
> A note on your testing methodology, however:
> http://lethargy.org/~jesus/writes/disk-benchmarking-with-dd-dont/#.V-3RUqOZPOY
> 
>> On Sep 29, 2016, at 10:21 PM, Michael Talbott <mtalbott at lji.org> wrote:
>> 
>> Hi, I'm trying to find a way to achieve massive write speeds with some decent hardware which will be used for some parallel computing needs (bioinformatics). Eventually, if all goes well and my testing succeeds, I'll duplicate this setup and run BeeGFS in a few LX zones (THANK YOU LX ZONES!!!) for some truly massive parallel-computing storage happiness, but I'd like to tune this box as much as possible first.
>> 
>> For this setup I'm using an Intel S2600GZ board with 2 x E5-2640 CPUs (six cores each) @ 2.5GHz and 144GB of ECC RAM. I have 3 x SAS2008-based LSI cards in that box: 1 is connected to 8 internal SSDs, and the other 2 cards (4 ports) are connected to a 45-bay drive enclosure. There are also 2 Intel 2-port 10GbE cards for connectivity.
>> 
>> I've created so many different zpools in different configurations: straight-up striped with no redundancy, raidz2, raidz3, multipath, non-multipath with 8x phy links instead of 4x multipath links, etc., etc., trying to find the magic combination for maximum performance, but something somewhere is capping raw throughput and I can't seem to find it.
>> 
>> Now the crazy part: I can, for instance, create zpoolA with ~20 drives (via cardA, attached only to backplaneA), create zpoolB with another ~20 drives (via cardB, attached only to backplaneB), and each of them gets the same performance individually (~1GB/s write and 2.5GB/s read). So my thought was, if I destroy zpoolB and attach all those drives to zpoolA as additional vdevs, it should double the performance or at least make some significant improvement. But nope, roughly the same speed. Then I thought, ok, maybe it's a slowest-vdev sort of thing, so I created vdevs such that each vdev used half its drives from backplaneA and the other half from backplaneB. That should force data distribution between the cards for each vdev, double the speed, and get me to 2GB/s write.. but nope, same deal: 1GB/s write and 2.5GB/s read :(
>> 
>> When I create the pool from scratch, for each vdev I add I see a linear increase in performance until I hit about 4-5 vdevs. That's where the performance flatlines, and no matter what I do beyond that it just won't go any faster :(
>> 
>> Also, if I create a pure SSD pool on cardC, its linear read/write performance hits the exact same numbers as the others :( Bottom line: no matter what pool configuration I use and no matter what recordsize is set in ZFS, I'm always capped at roughly 1GB/s write and 2.5GB/s read.
>> 
>> I thought maybe there weren't enough PCIe lanes to run all of those cards at 8x, but that's not the case; this board can run 6 x 8-lane PCIe 3.0 cards at full speed. I booted it up in Linux to use lspci -vv to make sure (since I'm not sure how to view PCIe link speeds in OmniOS), and sure enough, everything is running at 8x width, so that's not it.
>> 
>> Oh, and just fyi, this is my super simple throughput testing script that I run with compression disabled on the tested pool.
>> 
>> START=$(date +%s)
>> /usr/gnu/bin/dd if=/dev/zero of=/datastore/testdd bs=1M count=10k
>> sync
>> echo $(($(date +%s)-START))
>> 
>> My goal is to find a way to achieve at least 2GB/s write and 4GB/s read, which I think is theoretically possible with this hardware.
>> 
>> Anyone have any ideas about what could be limiting this or how to remedy it? Could it be the mpt_sas driver itself somehow throttling access to all these devices? Or do I maybe need to do some sort of IRQ-CPU pinning magic?
>> 
>> 
>> Thanks,
>> 
>> Michael
>> 
>> _______________________________________________
>> OmniOS-discuss mailing list
>> OmniOS-discuss at lists.omniti.com
>> http://lists.omniti.com/mailman/listinfo/omnios-discuss
> 
