[OmniOS-discuss] zpool Write Bottlenecks
Linda Kateley
lkateley at kateley.com
Fri Sep 30 05:25:41 UTC 2016
One of the things I do is turn caching off on a dataset: # zfs set
primarycache=none dataset. Gives me better disk performance.
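For example (the dataset name below is just a placeholder):

# zfs set primarycache=none tank/mydataset
# zfs get primarycache tank/mydataset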
lk
On 9/30/16 12:19 AM, Michael Talbott wrote:
> I'm very aware that dd is not the best tool for measuring disk
> performance in most cases, and I know the read throughput number is
> way off because of ZFS caching. However, it does work here to
> illustrate my point: there is a write throttle somewhere in the
> system. If anyone needs me to test with some other tool for
> illustrative purposes, I can do that too. It's just so odd that 1 card
> with a given set of disks in a pool provides roughly the same
> net throughput as 2 cards with 2 sets of disks in said pool.
>
> But, for the nay-sayers of dd testing, I'll provide you with this.
> Here's an example using 2 x 11-disk raidz3 vdevs where all 22
> drives live on one backplane attached to 1 card over an 8x phy SAS
> connection, and then adding 2 more 11-disk raidz3 vdevs that are
> connected to the system on a separate card (also over an 8x phy SAS
> link). No compression. I'm running bonnie++ to saturate the disks
> while I pull the numbers from iotop.
>
> The following is output from iotop (which uses DTrace to measure
> disk I/O):
> http://www.brendangregg.com/DTrace/iotop
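>
> For context, the load and the measurement were roughly along these
> lines (the bonnie++ size, user, path and the iotop interval below are
> illustrative placeholders, not my exact command lines):
>
> # sustained write load; file size ~2x RAM so caching can't hide it
> bonnie++ -d /datastore/bench -s 288g -u root
> # in another shell: DTrace-based iotop, 10-second summaries, no screen clear
> ./iotop -C 10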
>
> Here's the 22 drive pool (all attached to 1 card):
> ------------------------------------------------------------------------------------------------------
> 2016 Sep 30 00:02:42, load: 3.17, disk_r: 0 KB, disk_w: 3394084 KB
>
> UID PID PPID CMD DEVICE MAJ MIN D BYTES
> 0 7630 0 zpool-datastore sd124 194 7936 W 161812480
> 0 7630 0 zpool-datastore sd118 194 7552 W 161820672
> 0 7630 0 zpool-datastore sd127 194 8128 W 161845248
> 0 7630 0 zpool-datastore sd128 194 8192 W 161845248
> 0 7630 0 zpool-datastore sd122 194 7808 W 161849344
> 0 7630 0 zpool-datastore sd119 194 7616 W 161857536
> 0 7630 0 zpool-datastore sd121 194 7744 W 161857536
> 0 7630 0 zpool-datastore sd125 194 8000 W 161865728
> 0 7630 0 zpool-datastore sd123 194 7872 W 161869824
> 0 7630 0 zpool-datastore sd126 194 8064 W 161873920
> 0 7630 0 zpool-datastore sd120 194 7680 W 161906688
> 0 7630 0 zpool-datastore sd136 194 8704 W 165916672
> 0 7630 0 zpool-datastore sd137 194 8768 W 165916672
> 0 7630 0 zpool-datastore sd138 194 8832 W 165933056
> 0 7630 0 zpool-datastore sd135 194 8640 W 165937152
> 0 7630 0 zpool-datastore sd139 194 8896 W 165941248
> 0 7630 0 zpool-datastore sd134 194 8576 W 165945344
> 0 7630 0 zpool-datastore sd130 194 8320 W 165974016
> 0 7630 0 zpool-datastore sd129 194 8256 W 165978112
> 0 7630 0 zpool-datastore sd132 194 8448 W 165994496
> 0 7630 0 zpool-datastore sd133 194 8512 W 165994496
> 0 7630 0 zpool-datastore sd131 194 8384 W 166006784
>
> ------------------------------------------------------------------------------------------------------
>
>
> And here's the same pool extended with 2 more 11-disk raidz3 vdevs on
> a second card. Notice the per-drive write volume is almost LITERALLY
> cut in HALF while the total disk_w barely moves (~3.4 GB vs ~3.6 GB
> per sample), so doubling the drive count added essentially no
> aggregate throughput!
>
>
> ------------------------------------------------------------------------------------------------------
>
>
> 2016 Sep 30 00:01:07, load: 4.59, disk_r: 8 KB, disk_w: 3609852 KB
>
> UID PID PPID CMD DEVICE MAJ MIN D BYTES
> 0 4550 0 zpool-datastore sd133 194 8512 W 76558336
> 0 4550 0 zpool-datastore sd132 194 8448 W 76566528
> 0 4550 0 zpool-datastore sd135 194 8640 W 76570624
> 0 4550 0 zpool-datastore sd134 194 8576 W 76574720
> 0 4550 0 zpool-datastore sd136 194 8704 W 76582912
> 0 4550 0 zpool-datastore sd131 194 8384 W 76611584
> 0 4550 0 zpool-datastore sd130 194 8320 W 76644352
> 0 4550 0 zpool-datastore sd137 194 8768 W 76648448
> 0 4550 0 zpool-datastore sd129 194 8256 W 76660736
> 0 4550 0 zpool-datastore sd138 194 8832 W 76713984
> 0 4550 0 zpool-datastore sd139 194 8896 W 77369344
> 0 4550 0 zpool-datastore sd113 194 7232 W 77770752
> 0 4550 0 zpool-datastore sd115 194 7360 W 77832192
> 0 4550 0 zpool-datastore sd114 194 7296 W 77836288
> 0 4550 0 zpool-datastore sd111 194 7104 W 77840384
> 0 4550 0 zpool-datastore sd112 194 7168 W 77840384
> 0 4550 0 zpool-datastore sd108 194 6912 W 77844480
> 0 4550 0 zpool-datastore sd110 194 7040 W 77864960
> 0 4550 0 zpool-datastore sd116 194 7424 W 77873152
> 0 4550 0 zpool-datastore sd107 194 6848 W 77914112
> 0 4550 0 zpool-datastore sd106 194 6784 W 77918208
> 0 4550 0 zpool-datastore sd109 194 6976 W 77926400
> 0 4550 0 zpool-datastore sd128 194 8192 W 78938112
> 0 4550 0 zpool-datastore sd118 194 7552 W 78979072
> 0 4550 0 zpool-datastore sd125 194 8000 W 78991360
> 0 4550 0 zpool-datastore sd120 194 7680 W 78999552
> 0 4550 0 zpool-datastore sd127 194 8128 W 79007744
> 0 4550 0 zpool-datastore sd126 194 8064 W 79011840
> 0 4550 0 zpool-datastore sd119 194 7616 W 79020032
> 0 4550 0 zpool-datastore sd123 194 7872 W 79020032
> 0 4550 0 zpool-datastore sd122 194 7808 W 79048704
> 0 4550 0 zpool-datastore sd124 194 7936 W 79056896
> 0 4550 0 zpool-datastore sd121 194 7744 W 79065088
> 0 4550 0 zpool-datastore sd105 194 6720 W 82460672
> 0 4550 0 zpool-datastore sd102 194 6528 W 82468864
> 0 4550 0 zpool-datastore sd101 194 6464 W 82477056
> 0 4550 0 zpool-datastore sd104 194 6656 W 82477056
> 0 4550 0 zpool-datastore sd103 194 6592 W 82481152
> 0 4550 0 zpool-datastore sd141 194 9024 W 82489344
> 0 4550 0 zpool-datastore sd99 194 6336 W 82493440
> 0 4550 0 zpool-datastore sd98 194 6272 W 82501632
> 0 4550 0 zpool-datastore sd140 194 8960 W 82513920
> 0 4550 0 zpool-datastore sd97 194 6208 W 82538496
> 0 4550 0 zpool-datastore sd100 194 6400 W 82542592
>
> ------------------------------------------------------------------------------------------------------
>
>
>
>
> Any thoughts on how to discover and/or overcome the true bottleneck
> are much appreciated.
>
> Thanks
>
> Michael
>
>
>
>> On Sep 29, 2016, at 7:46 PM, Dale Ghent <daleg at omniti.com> wrote:
>>
>>
>> Awesome that you're using LX Zones in a way with BeeGFS.
>>
>> A note on your testing methodology, however:
>> http://lethargy.org/~jesus/writes/disk-benchmarking-with-dd-dont/#.V-3RUqOZPOY
>>
>>> On Sep 29, 2016, at 10:21 PM, Michael Talbott <mtalbott at lji.org> wrote:
>>>
>>> Hi, I'm trying to find a way to achieve massive write speeds with
>>> some decent hardware which will be used for some parallel computing
>>> needs (bioinformatics). Eventually, if all goes well and my testing
>>> succeeds, I'll duplicate this setup and run BeeGFS in a few LX
>>> zones (THANK YOU LX ZONES!!!) for some truly massive parallel
>>> computing storage happiness, but I'd like to tune this box as much
>>> as possible first.
>>>
>>> For this setup I'm using an Intel S2600GZ board, 2 x E5-2640 (six
>>> cores each) @ 2.5GHz, and 144GB of ECC RAM. I have 3 x SAS2008-based
>>> LSI cards in the box: 1 of them is connected to 8 internal SSDs and
>>> the other 2 cards (4 ports) are connected to a 45-bay drive
>>> enclosure. There are also 2 Intel dual-port 10GbE cards for
>>> connectivity.
>>>
>>> I've created many different zpools in different configurations,
>>> everything from straight-up striped with no redundancy to raidz2
>>> and raidz3, multipathed and non-multipathed with 8x phy links
>>> instead of 4x multipath links, etc., trying to find the magic
>>> combination for maximum performance, but something somewhere is
>>> capping raw throughput and I can't seem to find it.
>>>
>>> Now the crazy part: I can, for instance, create zpoolA with ~20
>>> drives (via cardA, attached only to backplaneA) and zpoolB with
>>> another ~20 drives (via cardB, attached only to backplaneB), and
>>> each of them gets the same performance individually (~1GB/s write
>>> and 2.5GB/s read). So my thought was, if I destroy zpoolB and
>>> attach all those drives to zpoolA as additional vdevs, it should
>>> double the performance or at least make some significant
>>> improvement. But nope, roughly the same speed. Then I thought, ok,
>>> maybe it's a slowest-vdev sort of thing, so I created vdevs such
>>> that each vdev used half its drives from backplaneA and the other
>>> half from backplaneB. That would force data distribution between
>>> the cards for each vdev and double the speed, getting me to 2GB/s
>>> write.. but nope, same deal: 1GB/s write and 2.5GB/s read :(
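>>>
>>> (To illustrate the split layout, the pool creation was something
>>> like the following; the cXtYdZ device names are made up, and only
>>> two of the 11-disk raidz3 vdevs are shown; the rest follow the same
>>> pattern of mixing disks from both backplanes:)
>>>
>>> zpool create datastore \
>>>   raidz3 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
>>>          c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 \
>>>   raidz3 c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0 \
>>>          c2t6d0 c2t7d0 c2t8d0 c2t9d0 c2t10d0 c2t11d0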
>>>
>>> When I create the pool from scratch, for each vdev I add I see a
>>> linear increase in performance until I hit about 4-5 vdevs. That's
>>> where the performance flatlines; no matter what I do beyond that,
>>> it just won't go any faster :(
>>>
>>> Also, if I create a pure SSD pool on cardC, its sequential
>>> read/write performance hits the exact same numbers as the others :(
>>> Bottom line: no matter what pool configuration I use, and no matter
>>> what recordsize is set in ZFS, I'm always capped at roughly 1GB/s
>>> write and 2.5GB/s read.
>>>
>>> I thought maybe there weren't enough PCIe lanes to run all of those
>>> cards at 8x, but that's not the case; this board can run 6 x 8-lane
>>> PCIe 3.0 cards at full speed. I booted it into Linux and used lspci
>>> -vv to make sure (since I'm not sure how to view PCIe link speeds
>>> in OmniOS), and sure enough, everything is running at 8x width, so
>>> that's not it.
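>>>
>>> (For anyone who wants to repeat that check under Linux, something
>>> like this prints the negotiated link width and speed for one HBA;
>>> the PCI address is a placeholder:)
>>>
>>> lspci -vvs 03:00.0 | grep -E 'LnkCap|LnkSta'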
>>>
>>> Oh, and just fyi, this is my super simple throughput testing script
>>> that I run with compression disabled on the tested pool.
>>>
>>> # record start time (seconds since the epoch)
>>> START=$(date +%s)
>>> # write 10 GiB of zeros (compression is disabled on the tested pool)
>>> /usr/gnu/bin/dd if=/dev/zero of=/datastore/testdd bs=1M count=10k
>>> # flush dirty data so the elapsed time includes the actual writes
>>> sync
>>> # elapsed seconds; throughput is roughly 10240 MB / elapsed
>>> echo $(($(date +%s)-START))
>>>
>>> My goal is to find a way to achieve at least 2GB/s write and 4GB/s
>>> read, which I think is theoretically possible with this hardware.
>>>
>>> Anyone have any ideas about what could be limiting this, or how to
>>> remedy it? Could the mpt_sas driver itself somehow be throttling
>>> access to all these devices? Or do I maybe need to do some sort of
>>> IRQ-to-CPU pinning magic?
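>>>
>>> (A quick way to check for that kind of thing with stock illumos
>>> tools while the benchmark runs, to see whether a single CPU is
>>> saturated servicing mpt_sas interrupts:)
>>>
>>> mpstat 1                      # per-CPU utilization and interrupt rates
>>> intrstat 1                    # which device interrupts land on which CPUs
>>> echo "::interrupts" | mdb -k  # interrupt vector to CPU assignments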
>>>
>>>
>>> Thanks,
>>>
>>> Michael
>>>
>>
>
>
>
> _______________________________________________
> OmniOS-discuss mailing list
> OmniOS-discuss at lists.omniti.com
> http://lists.omniti.com/mailman/listinfo/omnios-discuss