[OmniOS-discuss] Building a new storage

Matej Zerovnik matej at zunaj.si
Fri Apr 10 10:17:27 UTC 2015


We are currently thinking of rebuilding our SAN, since we made some 
mistakes on the first build. Before we begin, we would like to plan 
accordingly, so I'm wondering how to measure some data (L2ARC and ZIL 
usage, current IOPS, ...) the right way.

We currently have a single raidz2 pool built out of 50 SATA drives 
(Seagate Constellation), with 2x Intel S3700 100GB as ZIL and 2x Intel 
S3700 100GB as L2ARC.

For the new system, we plan to use an IBM 3550 M4 server with 256GB of 
memory and an LSI SAS 9207-8e HBA. We will have around 70-80 4TB SAS 
drives in JBOD cases and, if needed, some SSDs for ZIL and L2ARC.

Questions:

*1.)*
How do I measure the average IOPS of the current system? 'zpool iostat 
poolname 1' gives me weird numbers, saying the current drives perform 
around 300 read ops and 100 write ops per second. These are 7200 RPM 
SATA drives, so I know they can't deliver that many IOPS.
Output from iostat -vx (only some drives are pasted):
Code:

device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b
data   36621,9 25740,2 19288,6 66191,0 197,6 25,9    3,6  40  77
sd18    276,3  104,8  145,2   83,3  0,0  0,6    1,5   0  36
sd19    283,3  106,7  152,1   83,3  0,0  0,6    1,5   0  24
sd20    281,3  101,8  146,7   79,8  0,0  0,5    1,4   0  35
sd21    286,3  117,7  146,7   84,3  0,0  0,3    0,7   0  21
sd22    283,3   85,8  144,2   81,3  0,0  0,5    1,3   0  32
sd23    275,3  116,7  139,7   82,8  0,0  0,3    0,8   0  21
sd24    280,3  106,7  155,6   84,3  0,0  0,6    1,6   0  25
sd25    288,3  106,7  148,6   86,3  0,0  0,4    1,0   0  24
sd26    269,4  110,7  137,2   91,8  0,0  0,5    1,3   0  24
sd27    272,4   87,8  141,7   78,3  0,0  0,7    1,8   0  34
sd28    236,4  115,7  219,0   84,8  0,0  0,9    2,5   0  26
sd29    235,4  108,7  228,5   83,8  0,0  0,9    2,7   0  33

Output of 'zpool iostat -v data 1 | grep drive_id'
Code:

                             capacity     operations    bandwidth
     pool                   alloc   free   read  write   read  write
     c8t5000C5004FD18DE9d0      -      -    573    220   663K   607K
     c8t5000C5004FD18DE9d0      -      -    563      0   318K      0
     c8t5000C5004FD18DE9d0      -      -    586    314   361K   806K
     c8t5000C5004FD18DE9d0      -      -    567    445   373K  1,02M
     c8t5000C5004FD18DE9d0      -      -    464     25   299K  17,9K
     c8t5000C5004FD18DE9d0      -      -    552      2   326K  3,68K
     c8t5000C5004FD18DE9d0      -      -    421     41   249K  31,3K
     c8t5000C5004FD18DE9d0      -      -    492    400   391K   944K
     c8t5000C5004FD18DE9d0      -      -    313    148   242K   337K
     c8t5000C5004FD18DE9d0      -      -    330    163   360K   390K
     c8t5000C5004FD18DE9d0      -      -    655     23   577K  21,5K

Is it just me, or are these too many IOPS for those drives to handle even 
in theory, let alone in practice? How do I get the right measurement?
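
For comparison, I was thinking of sampling plain iostat over a longer 
interval instead of 1-second snapshots, something like this (the first 
sample is the since-boot average, so it should be thrown away):
Code:

# 10 samples of 60 seconds each; -x extended stats, -n descriptive device
# names, -z hides idle devices; ignore the first (since-boot) sample
iostat -xnz 60 10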

*2.)*
Current ARC utilization on our system:
Code:

ARC Efficency:
          Cache Access Total:             2134027465
          Cache Hit Ratio:      64%       1381755042     [Defined State for buffer]
          Cache Miss Ratio:     35%       752272423      [Undefined State for Buffer]
          REAL Hit Ratio:       56%       1199175179     [MRU/MFU Hits Only]

Code:

./arcstat.pl -f read,hits,miss,hit%,l2read,l2hits,l2miss,l2hit%,arcsz,l2size 1 2>/dev/null
read  hits  miss  hit%  l2read  l2hits  l2miss  l2hit%  arcsz  l2size
    1     1     0   100       0       0       0       0   213G    235G
4.8K  3.0K  1.9K    61    1.9K      40    1.8K       2   213G    235G
4.3K  2.7K  1.6K    62    1.6K      35    1.5K       2   213G    235G
2.5K   853  1.6K    34    1.6K      45    1.6K       2   213G    235G
5.1K  3.0K  2.2K    57    2.2K      49    2.1K       2   213G    235G
6.5K  4.4K  2.1K    68    2.1K      30    2.0K       1   213G    235G
5.0K  2.5K  2.5K    49    2.5K      44    2.5K       1   213G    235G
  11K  8.5K  2.8K    75    2.8K      13    2.8K       0   213G    235G
6.4K  4.8K  1.6K    74    1.6K      57    1.6K       3   213G    235G
2.3K  1.1K  1.2K    46    1.2K      88    1.1K       7   213G    235G
1.9K   532  1.3K    28    1.3K      83    1.2K       6   213G    235G

As we can see, there are almost no L2ARC cache hits. What could be the 
reason for that? Is our L2ARC too small, or is the data on our storage 
simply too random to be cached? I don't know exactly what is on our 
iSCSI shares, since they belong to outside customers, but as far as I 
know it's mostly backups and some live data.
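
If it helps, I can also read the raw L2ARC counters straight from kstat; 
roughly something like this (values are cumulative since boot):
Code:

# L2ARC hits, misses and current size from the ZFS arcstats kstat
kstat -p zfs:0:arcstats:l2_hits zfs:0:arcstats:l2_misses zfs:0:arcstats:l2_size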

*3.)*
As far as the ZIL goes, do we need dedicated devices for it? I think I 
read somewhere that the ZIL can only store 8k blocks and that you have 
to 'format' iSCSI drives accordingly. Is that still the case? Output 
from 'zilstat':
Code:

    N-Bytes  N-Bytes/s N-Max-Rate    B-Bytes  B-Bytes/s B-Max-Rate    ops  <=4kB 4-32kB >=32kB
          0          0          0          0          0          0      0      0      0      0
          0          0          0          0          0          0      0      0      0      0
     178352     178352     178352     262144     262144     262144      2      0      0      2
  134823992  134823992  134823992  221380608  221380608  221380608   1689      0      0   1689
  102893848  102893848  102893848  168427520  168427520  168427520   1285      0      0   1285
          0          0          0          0          0          0      0      0      0      0
       4472       4472       4472     131072     131072     131072      1      0      0      1
          0          0          0          0          0          0      0      0      0      0
      41904      41904      41904     262144     262144     262144      2      0      0      2
  134963824  134963824  134963824  221511680  221511680  221511680   1690      0      0   1690
          0          0          0          0          0          0      0      0      0      0
          0          0          0          0          0          0      0      0      0      0
          0          0          0          0          0          0      0      0      0      0
          0          0          0          0          0          0      0      0      0      0
   32789896   32789896   32789896   53346304   53346304   53346304    407      0      0    407
   25467912   25467912   25467912   41811968   41811968   41811968    319      0      0    319

Given the stats, is a dedicated ZIL device even necessary? When I'm 
running zilstat, I see big bursts of ops every 5s. Why is that? I know 
the system is supposed to flush data from memory to the spindles every 
5s, but that shouldn't show up as a ZIL flush, correct?
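
Related to that, I was planning to double-check which datasets and zvols 
actually issue synchronous writes by looking at their sync and logbias 
properties, roughly like this:
Code:

# per-dataset sync behaviour (sync=standard/always/disabled,
# logbias=latency/throughput)
zfs get -r -t filesystem,volume sync,logbias data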

*4.)*
How should we put the drives together to get the best IOPS/capacity 
ratio out of them? We were thinking of 7 RAIDZ2 vdevs with 10 drives 
each. That way we would get a pool of around 224TB.
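
For what it's worth, the 224TB figure is just the raw parity arithmetic, 
before ZFS metadata overhead and the TB/TiB difference:
Code:

7 vdevs x (10 drives - 2 parity) x 4TB = 7 x 8 x 4TB = 224TB raw usable space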

*5.)*
In case we decide to go with 4 JBOD cases, would it be better to build 
2 pools, so that if one pool has a hiccup, we won't lose all the data?

What else am I not considering?

Thanks, Matej