[OmniOS-discuss] zpool Write Bottlenecks
Michael Talbott
mtalbott at lji.org
Fri Sep 30 05:19:38 UTC 2016
I'm very aware that dd is not the best tool for measuring disk performance in most cases, and I know the read throughput number is way off because of ZFS caching. However, it does work here to illustrate that there's a write throttle somewhere in the system. If anyone needs me to test with some other tool for illustrative purposes, I can do that too. It's just so odd that one card with a given set of disks in a pool provides roughly the same net throughput as two cards with two sets of disks in said pool.
But for the naysayers of dd testing, I'll provide this instead. Here's an example using 2 x 11-disk raidz3 vdevs where all 22 drives live on one backplane attached to one card over an 8x-phy SAS connection, and then the same pool after adding 2 more 11-disk raidz3 vdevs connected to the system through a separate card (also an 8x-phy SAS link). No compression. I'm running bonnie++ to saturate the disks while I pull the numbers from iotop.
The following is output from iotop (which uses DTrace to measure disk I/O):
http://www.brendangregg.com/DTrace/iotop
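For reference, here's roughly how I generate the load and pull those numbers. The exact flags may differ slightly from the run above, but the idea is a bonnie++ file set of 2x RAM so the ARC can't hide the writes, with iotop sampling in a second terminal:

# Terminal 1: sustained write load on the pool (dataset mounted at /datastore).
# -s 288g is 2x the 144GB of RAM; -u root because I'm running it as root.
bonnie++ -d /datastore -s 288g -u root

# Terminal 2: per-device bytes moved, sampled over 10-second windows.
./iotop 10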
Here's the 22-drive pool (all attached to one card):
------------------------------------------------------------------------------------------------------
2016 Sep 30 00:02:42, load: 3.17, disk_r: 0 KB, disk_w: 3394084 KB
UID PID PPID CMD DEVICE MAJ MIN D BYTES
0 7630 0 zpool-datastore sd124 194 7936 W 161812480
0 7630 0 zpool-datastore sd118 194 7552 W 161820672
0 7630 0 zpool-datastore sd127 194 8128 W 161845248
0 7630 0 zpool-datastore sd128 194 8192 W 161845248
0 7630 0 zpool-datastore sd122 194 7808 W 161849344
0 7630 0 zpool-datastore sd119 194 7616 W 161857536
0 7630 0 zpool-datastore sd121 194 7744 W 161857536
0 7630 0 zpool-datastore sd125 194 8000 W 161865728
0 7630 0 zpool-datastore sd123 194 7872 W 161869824
0 7630 0 zpool-datastore sd126 194 8064 W 161873920
0 7630 0 zpool-datastore sd120 194 7680 W 161906688
0 7630 0 zpool-datastore sd136 194 8704 W 165916672
0 7630 0 zpool-datastore sd137 194 8768 W 165916672
0 7630 0 zpool-datastore sd138 194 8832 W 165933056
0 7630 0 zpool-datastore sd135 194 8640 W 165937152
0 7630 0 zpool-datastore sd139 194 8896 W 165941248
0 7630 0 zpool-datastore sd134 194 8576 W 165945344
0 7630 0 zpool-datastore sd130 194 8320 W 165974016
0 7630 0 zpool-datastore sd129 194 8256 W 165978112
0 7630 0 zpool-datastore sd132 194 8448 W 165994496
0 7630 0 zpool-datastore sd133 194 8512 W 165994496
0 7630 0 zpool-datastore sd131 194 8384 W 166006784
------------------------------------------------------------------------------------------------------
And here's the same pool after extending it with 2 more raidz3 vdevs, so the 44 drives now span 2 cards.
Notice the per-drive write rate is almost LITERALLY cut in HALF: the aggregate disk_w for the sample is about the same (~3.6 GB here vs. ~3.4 GB above), but each of the 44 drives now sees ~77-82 MB instead of the ~162-166 MB each of the 22 drives got before.
------------------------------------------------------------------------------------------------------
2016 Sep 30 00:01:07, load: 4.59, disk_r: 8 KB, disk_w: 3609852 KB
UID PID PPID CMD DEVICE MAJ MIN D BYTES
0 4550 0 zpool-datastore sd133 194 8512 W 76558336
0 4550 0 zpool-datastore sd132 194 8448 W 76566528
0 4550 0 zpool-datastore sd135 194 8640 W 76570624
0 4550 0 zpool-datastore sd134 194 8576 W 76574720
0 4550 0 zpool-datastore sd136 194 8704 W 76582912
0 4550 0 zpool-datastore sd131 194 8384 W 76611584
0 4550 0 zpool-datastore sd130 194 8320 W 76644352
0 4550 0 zpool-datastore sd137 194 8768 W 76648448
0 4550 0 zpool-datastore sd129 194 8256 W 76660736
0 4550 0 zpool-datastore sd138 194 8832 W 76713984
0 4550 0 zpool-datastore sd139 194 8896 W 77369344
0 4550 0 zpool-datastore sd113 194 7232 W 77770752
0 4550 0 zpool-datastore sd115 194 7360 W 77832192
0 4550 0 zpool-datastore sd114 194 7296 W 77836288
0 4550 0 zpool-datastore sd111 194 7104 W 77840384
0 4550 0 zpool-datastore sd112 194 7168 W 77840384
0 4550 0 zpool-datastore sd108 194 6912 W 77844480
0 4550 0 zpool-datastore sd110 194 7040 W 77864960
0 4550 0 zpool-datastore sd116 194 7424 W 77873152
0 4550 0 zpool-datastore sd107 194 6848 W 77914112
0 4550 0 zpool-datastore sd106 194 6784 W 77918208
0 4550 0 zpool-datastore sd109 194 6976 W 77926400
0 4550 0 zpool-datastore sd128 194 8192 W 78938112
0 4550 0 zpool-datastore sd118 194 7552 W 78979072
0 4550 0 zpool-datastore sd125 194 8000 W 78991360
0 4550 0 zpool-datastore sd120 194 7680 W 78999552
0 4550 0 zpool-datastore sd127 194 8128 W 79007744
0 4550 0 zpool-datastore sd126 194 8064 W 79011840
0 4550 0 zpool-datastore sd119 194 7616 W 79020032
0 4550 0 zpool-datastore sd123 194 7872 W 79020032
0 4550 0 zpool-datastore sd122 194 7808 W 79048704
0 4550 0 zpool-datastore sd124 194 7936 W 79056896
0 4550 0 zpool-datastore sd121 194 7744 W 79065088
0 4550 0 zpool-datastore sd105 194 6720 W 82460672
0 4550 0 zpool-datastore sd102 194 6528 W 82468864
0 4550 0 zpool-datastore sd101 194 6464 W 82477056
0 4550 0 zpool-datastore sd104 194 6656 W 82477056
0 4550 0 zpool-datastore sd103 194 6592 W 82481152
0 4550 0 zpool-datastore sd141 194 9024 W 82489344
0 4550 0 zpool-datastore sd99 194 6336 W 82493440
0 4550 0 zpool-datastore sd98 194 6272 W 82501632
0 4550 0 zpool-datastore sd140 194 8960 W 82513920
0 4550 0 zpool-datastore sd97 194 6208 W 82538496
0 4550 0 zpool-datastore sd100 194 6400 W 82542592
------------------------------------------------------------------------------------------------------
Any thoughts on how to discover and/or overcome the true bottleneck are much appreciated.
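If it would help, I can also post output from the usual illumos observability tools while the test runs, e.g. (each in its own terminal):

# Per-vdev throughput for the pool, 5-second samples.
zpool iostat -v datastore 5

# Per-CPU utilization; one CPU pegged in %sys would point at a driver or
# interrupt bottleneck rather than the disks themselves.
mpstat 5

# Which CPUs are servicing which device interrupts (mpt_sas et al.).
intrstat 5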
Thanks
Michael
> On Sep 29, 2016, at 7:46 PM, Dale Ghent <daleg at omniti.com> wrote:
>
>
> Awesome that you're using LX Zones this way with BeeGFS.
>
> A note on your testing methodology, however:
> http://lethargy.org/~jesus/writes/disk-benchmarking-with-dd-dont/#.V-3RUqOZPOY
>
>> On Sep 29, 2016, at 10:21 PM, Michael Talbott <mtalbott at lji.org> wrote:
>>
>> Hi, I'm trying to find a way to achieve massive write speeds with some decent hardware that will be used for parallel computing (bioinformatics). Eventually, if all goes well and my testing succeeds, I'll duplicate this setup and run BeeGFS in a few LX zones (THANK YOU LX ZONES!!!) for some truly massive parallel-computing storage happiness, but I'd like to tune this box as much as possible first.
>>
>> For this setup I'm using an Intel S2600GZ board, 2 x E5-2640 CPUs (six cores each) @ 2.5GHz, and 144GB of ECC RAM. There are 3 SAS2008-based LSI cards in the box: one is connected to 8 internal SSDs, and the other 2 cards (4 ports) are connected to a 45-bay drive enclosure. There are also 2 dual-port Intel 10GbE cards for connectivity.
>>
>> I've created so many different zpools in different configurations: straight-up striped with no redundancy, raidz2, raidz3, multipathed, non-multipathed with 8x phy links instead of 4x multipath links, etc., all in order to find the magic combination for maximum performance. But something somewhere is capping raw throughput and I can't seem to find it.
>>
>> Now the crazy part: I can, for instance, create zpoolA with ~20 drives (via cardA, attached only to backplaneA) and create zpoolB with another ~20 drives (via cardB, attached only to backplaneB), and each of them gets the same performance individually (~1GB/s write and ~2.5GB/s read). So my thought was: if I destroy zpoolB and add all those drives to zpoolA as additional vdevs, it should double the performance, or at least make some significant improvement. But nope, roughly the same speed. Then I thought, OK, maybe it's a slowest-vdev sort of thing, so I created vdevs such that each vdev used half its drives from backplaneA and the other half from backplaneB. That should force data distribution between the cards for every vdev, double the speed, and get me to 2GB/s write. But nope, same deal: 1GB/s write and 2.5GB/s read :(
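>> Roughly like this, just as a sketch (the device names are placeholders, continuing the pattern for the remaining vdevs; the c1* disks hang off cardA/backplaneA and the c2* disks off cardB/backplaneB, so every raidz3 vdev spans both cards):
>>
>> zpool create datastore \
>>     raidz3 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 \
>>     raidz3 c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0 c2t5d0 c2t6d0 c2t7d0 c2t8d0 c2t9d0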
>>
>> When I create the pool from scratch, each vdev I add gives a linear increase in performance until I hit about 4-5 vdevs. That's where performance flatlines, and no matter what I do beyond that it just won't go any faster :(
>>
>> Also, if I create a pure-SSD pool on cardC, its sequential read/write performance hits the exact same numbers as the others :( Bottom line: no matter what pool configuration I use, and no matter what recordsize is set in ZFS, I'm always capped at roughly 1GB/s write and 2.5GB/s read.
>>
>> I thought maybe there weren't enough PCIe lanes to run all of those cards at 8x, but that's not the case; this board can run 6 x 8-lane PCIe 3.0 cards at full speed. I booted it up in Linux to check with lspci -vv (since I'm not sure how to view PCIe link speeds in OmniOS), and sure enough, everything is running at 8x width, so that's not it.
>>
>> Oh, and just FYI, this is my super simple throughput-testing script, which I run with compression disabled on the tested pool:
>>
>> # Time a 10 GiB sequential write (GNU dd, bs=1M count=10k) plus the sync
>> # that flushes it to disk, then print the elapsed seconds
>> # (throughput is roughly 10240 / elapsed, in MiB/s).
>> START=$(date +%s)
>> /usr/gnu/bin/dd if=/dev/zero of=/datastore/testdd bs=1M count=10k
>> sync
>> echo $(($(date +%s)-START))
>>
>> My goal is to find a way to achieve at least 2GB/s write and 4GB/s read, which I think is theoretically possible with this hardware.
>>
>> Anyone have any ideas about what could be limiting this or how to remedy it? Could it be the mpt_sas driver itself somehow throttling access to all these devices? Or do I maybe need to do some sort of IRQ-to-CPU pinning magic?
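>> For what it's worth, I can also read back the ZFS write-throttle tunables to rule those out. A quick sketch, assuming the stock illumos tunable names (these only print the current values, they don't change anything):
>>
>> # Dirty-data ceiling that drives the write throttle (bytes, 64-bit value).
>> echo zfs_dirty_data_max/E | mdb -k
>>
>> # Per-vdev cap on concurrent async writes (32-bit value).
>> echo zfs_vdev_async_write_max_active/D | mdb -k
>>
>> # Overall per-vdev concurrent I/O cap (32-bit value).
>> echo zfs_vdev_max_active/D | mdb -k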
>>
>>
>> Thanks,
>>
>> Michael
>>
>> _______________________________________________
>> OmniOS-discuss mailing list
>> OmniOS-discuss at lists.omniti.com
>> http://lists.omniti.com/mailman/listinfo/omnios-discuss
>