[OmniOS-discuss] Building a new storage
Richard Elling
richard.elling at richardelling.com
Fri Apr 10 15:19:20 UTC 2015
> On Apr 10, 2015, at 3:17 AM, Matej Zerovnik <matej at zunaj.si> wrote:
>
> We are currently thinking of rebuilding our SAN, since we made some mistakes on the first build. But before we begin, we would like to plan accordingly, so I'm wondering how to measure some data (L2ARC and ZIL usage, current IOPS, ...) the right way.
>
> We currently have a single raidz2 pool built out of 50 SATA drives (Seagate Constellation), plus 2x Intel S3700 100GB as ZIL and 2x Intel S3700 100GB as L2ARC.
>
> For the new system, we plan to use an IBM 3550 M4 server with 256GB of memory and an LSI SAS 9207-8e HBA. We will have around 70-80 SAS 4TB drives in JBOD cases and, if needed, some SSDs for ZIL and L2ARC.
[sidebar conversation: I've experienced bad results with WD Black 4TB SAS drives]
>
> Questions:
>
> 1.)
> How do we measure the average IOPS of the current system? 'zpool iostat poolname 1' gives me weird numbers, saying the current drives perform around 300 read ops and 100 write ops per second. The drives are 7200 rpm SATA drives, so I know they can't perform that many IOPS.
Sure they can. The measurable peak will be closer to 20,000 IOPS for a 7,200 rpm drive at 512 bytes.
For HDDs, the biggest impact on response time is long seeks, and ZFS's placement algorithms
bias toward the outer cylinders. From outer to inner, bandwidth usually drops by about 30% and random IOPS
suffer from the longer seeks. Writes are also cached in the drive, so you rarely see seek penalties
for writes, which leads to a false sense of performance. This is why we often use the rule of thumb that an
HDD can deliver 100 IOPS @ 4KB and 100 MB/sec -- good enough for back-of-the-envelope capacity
planning, but not as good as real, long-term measurement.
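If you want a longer baseline before sizing the new box, letting the stock tools run for a while
is usually enough. A minimal sketch (the interval and count are arbitrary; note that the first
sample from each tool is the since-boot average, so skip it when summarizing):

    # per-disk backend stats, one sample per minute for an hour
    iostat -x 60 61 > /var/tmp/iostat-baseline.txt
    # pool-level view over the same window
    zpool iostat data 60 61 > /var/tmp/zpool-baseline.txt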
But the more interesting question is: where do you plan to measure the IOPS? The backend stats
as seen by iostat and zpool iostat are difficult to use because they do not account for caching and
the writes are coalesced. Write coalescing is particularly important for people who insist on counting
IOPS because, by default, 32 4KB random write IOPS will be coalesced into one 128KB write. Let's
take a closer look at your data...
> Output from iostat -vx (only some drives are pasted):
> Code:
> device r/s w/s kr/s kw/s wait actv svc_t %w %b
> data 36621,9 25740,2 19288,6 66191,0 197,6 25,9 3,6 40 77
> sd18 276,3 104,8 145,2 83,3 0,0 0,6 1,5 0 36
This version of iostat doesn't show average sizes :-( but you can calculate them from the data :-)
For pool data (data written from the pool to disks, not data written into the pool):
average write size = 66191,0 / 25740,2 = 2.5 KB
average read size = 19288,6 / 36621,9 = 526 bytes
For sd18:
average write size = 83,3 / 104,8 = 794 bytes
average read size = 145,2 / 276,3 = 525 bytes
From this we can suggest:
1. avoid 4KB sector sized disks for this configuration and workload
2. look further up the stack to determine why such small physical I/Os are being used
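If you want to automate that arithmetic across all the disks, a rough awk sketch against the
iostat -x output works (it assumes the column order shown above and a locale that prints decimal
points rather than the commas in your paste):

    # average I/O size per disk: kr/s divided by r/s, kw/s divided by w/s, in KB
    iostat -x 10 2 | awk '
        $1 ~ /^sd/ {
            r = ($2 > 0) ? $4 / $2 : 0
            w = ($3 > 0) ? $5 / $3 : 0
            printf "%-8s avg read %6.2f KB  avg write %6.2f KB\n", $1, r, w
        }'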
> sd19 283,3 106,7 152,1 83,3 0,0 0,6 1,5 0 24
> sd20 281,3 101,8 146,7 79,8 0,0 0,5 1,4 0 35
> sd21 286,3 117,7 146,7 84,3 0,0 0,3 0,7 0 21
> sd22 283,3 85,8 144,2 81,3 0,0 0,5 1,3 0 32
> sd23 275,3 116,7 139,7 82,8 0,0 0,3 0,8 0 21
> sd24 280,3 106,7 155,6 84,3 0,0 0,6 1,6 0 25
> sd25 288,3 106,7 148,6 86,3 0,0 0,4 1,0 0 24
> sd26 269,4 110,7 137,2 91,8 0,0 0,5 1,3 0 24
> sd27 272,4 87,8 141,7 78,3 0,0 0,7 1,8 0 34
> sd28 236,4 115,7 219,0 84,8 0,0 0,9 2,5 0 26
> sd29 235,4 108,7 228,5 83,8 0,0 0,9 2,7 0 33
> Output of 'zpool iostat -v data 1 | grep drive_id'
> Code:
> capacity operations bandwidth
> pool alloc free read write read write
> c8t5000C5004FD18DE9d0 - - 573 220 663K 607K
> c8t5000C5004FD18DE9d0 - - 563 0 318K 0
> c8t5000C5004FD18DE9d0 - - 586 314 361K 806K
> c8t5000C5004FD18DE9d0 - - 567 445 373K 1,02M
> c8t5000C5004FD18DE9d0 - - 464 25 299K 17,9K
> c8t5000C5004FD18DE9d0 - - 552 2 326K 3,68K
> c8t5000C5004FD18DE9d0 - - 421 41 249K 31,3K
> c8t5000C5004FD18DE9d0 - - 492 400 391K 944K
> c8t5000C5004FD18DE9d0 - - 313 148 242K 337K
> c8t5000C5004FD18DE9d0 - - 330 163 360K 390K
> c8t5000C5004FD18DE9d0 - - 655 23 577K 21,5K
> Is it just me, or are those too many IOPS for those drives to handle even in theory, let alone in practice? How do I get the right measurement?
To measure IOPS written into the pool, look at fsstat for file systems. For iSCSI, this isn't
quite so easy to gather, so we use DTrace; see iscsisvrtop as an example.
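For the file system case, for example (a minimal sketch; the mountpoints are placeholders,
substitute your own datasets):

    # per-second file-level read/write ops and bytes for all ZFS file systems
    fsstat zfs 1
    # or per dataset, to see who is generating the load
    fsstat /data/fs1 /data/fs2 1

These counts are what the clients actually asked for, before the ARC and before write coalescing,
so they are much closer to the IOPS number you care about for sizing.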
>
> 2.)
> Current ARC utilization on our system:
> Code:
> ARC Efficency:
> Cache Access Total: 2134027465
> Cache Hit Ratio: 64% 1381755042 [Defined State for buffer]
> Cache Miss Ratio: 35% 752272423 [Undefined State for Buffer]
> REAL Hit Ratio: 56% 1199175179 [MRU/MFU Hits Only]
> Code:
> ./arcstat.pl -f read,hits,miss,hit%,l2read,l2hits,l2miss,l2hit%,arcsz,l2size 1 2>/dev/null
> read hits miss hit% l2read l2hits l2miss l2hit% arcsz l2size
> 1 1 0 100 0 0 0 0 213G 235G
> 4.8K 3.0K 1.9K 61 1.9K 40 1.8K 2 213G 235G
> 4.3K 2.7K 1.6K 62 1.6K 35 1.5K 2 213G 235G
> 2.5K 853 1.6K 34 1.6K 45 1.6K 2 213G 235G
> 5.1K 3.0K 2.2K 57 2.2K 49 2.1K 2 213G 235G
> 6.5K 4.4K 2.1K 68 2.1K 30 2.0K 1 213G 235G
> 5.0K 2.5K 2.5K 49 2.5K 44 2.5K 1 213G 235G
> 11K 8.5K 2.8K 75 2.8K 13 2.8K 0 213G 235G
> 6.4K 4.8K 1.6K 74 1.6K 57 1.6K 3 213G 235G
> 2.3K 1.1K 1.2K 46 1.2K 88 1.1K 7 213G 235G
> 1.9K 532 1.3K 28 1.3K 83 1.2K 6 213G 235G
> As we can see, there are almost no L2ARC cache hits. What can be the reason for that? Is our L2ARC cache too small, or is the data on our storage just too random to be cached? I don't know what is on our iSCSI shares, since they are for outside customers, but as far as I know, it's mostly backups and some live data.
Unfortunately, most versions of arcstat do not measure what you want to know. The measurement
you're looking for is the reason for eviction. These are measured as kstats:
# kstat -p ::arcstats:evict_l2\*
zfs:0:arcstats:evict_l2_cached 0
zfs:0:arcstats:evict_l2_eligible 2224128
zfs:0:arcstats:evict_l2_ineligible 4096
For this example system, you can see:
+ nothing is in the L2 cache (mostly, because there is no L2 :-)
+ 2224128 ARC evictions were eligible to be satisfied from an L2 cache
+ 4096 ARC evictions were not eligible
This example system can benefit from an L2 cache.
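To turn those counters into a quick eligibility ratio on your own box, something like this works
(a rough sketch using only the kstats shown above):

    # share of ARC eviction traffic that an L2ARC could have absorbed
    kstat -p zfs:0:arcstats:evict_l2_eligible zfs:0:arcstats:evict_l2_ineligible | awk '
        /_eligible/ && !/_ineligible/ { e = $2 }
        /_ineligible/                 { i = $2 }
        END { if (e + i > 0) printf "L2ARC-eligible: %.1f%%\n", 100 * e / (e + i) }'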
>
> 3.)
> As far as ZIL goes, do we need it?
From the data below, yes, it will help.
> I think I read somewhere that the ZIL can only store 8k blocks and that you have to 'format' iSCSI drives accordingly. Is that still the case?
This was never the case; where did you read it?
> Output from 'zilstat':
> Code:
> N-Bytes N-Bytes/s N-Max-Rate B-Bytes B-Bytes/s B-Max-Rate ops <=4kB 4-32kB >=32kB
> 0 0 0 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0 0 0
> 178352 178352 178352 262144 262144 262144 2 0 0 2
> 134823992 134823992 134823992 221380608 221380608 221380608 1689 0 0 1689
> 102893848 102893848 102893848 168427520 168427520 168427520 1285 0 0 1285
> 0 0 0 0 0 0 0 0 0 0
> 4472 4472 4472 131072 131072 131072 1 0 0 1
> 0 0 0 0 0 0 0 0 0 0
> 41904 41904 41904 262144 262144 262144 2 0 0 2
> 134963824 134963824 134963824 221511680 221511680 221511680 1690 0 0 1690
> 0 0 0 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0 0 0
> 32789896 32789896 32789896 53346304 53346304 53346304 407 0 0 407
> 25467912 25467912 25467912 41811968 41811968 41811968 319 0 0 319
> Given the stats, is a ZIL even necessary? When I'm running zilstat, I see big ops every 5s. Why is that? I know the system is supposed to flush data from memory to spindles every 5s, but that shouldn't show up as a ZIL flush, is that correct?
>
> 4.)
> How should we put the drives together to get the best IOPS/capacity ratio out of them? We were thinking of 7 raidz2 vdevs with 10 drives each. That way we would get around a 224TB pool.
This depends on the workload. For more IOPS, use more vdevs.
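As a back-of-the-envelope comparison, using the 100 IOPS per HDD rule of thumb above and the
common rule of thumb that a raidz vdev delivers roughly the random IOPS of a single member drive
(a sketch, not a benchmark; mirrors can read from either side, so their read IOPS run higher still):

    # 70 x 4TB drives carved three different ways: raw data capacity vs rough random IOPS
    echo "7 x 10-wide raidz2 : $((7 * 8 * 4)) TB, ~$((7 * 100)) IOPS"
    echo "10 x 7-wide raidz2 : $((10 * 5 * 4)) TB, ~$((10 * 100)) IOPS"
    echo "35 x 2-way mirrors : $((35 * 4)) TB, ~$((35 * 100)) IOPS"

which is the usual trade-off: wider raidz stripes buy capacity, more vdevs buy IOPS.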
>
> 5.)
> In case we decide to go with 4 JBOD cases, would it be better to build 2 pools, just so that in case one pool has a hiccup, we won't lose all data?
This is a common configuration: two SAS pools + two servers configured such that either server can
serve either pool.
-- richard
>
> What else am I not considering?
>
> Thanks, Matej