[OmniOS-discuss] Building a new storage

Matej Zerovnik matej at zunaj.si
Mon Apr 13 10:16:24 UTC 2015


Hello Richard. Thank you very much for your answer.

On 10. 04. 2015 17:19, Richard Elling wrote:
> [sidebar conversation: I've experienced bad results with WD Black 4TB 
> SAS drives]
What kind of bad results? What drives do you use now, or what would you 
recommend? I was thinking of going with HGST anyway, since they have a good 
reputation in the Backblaze reports.

> But the more interesting question is: where do you plan to measure the 
> IOPS? The backend stats
> as seen by iostat and zpool iostat are difficult to use because they 
> do not account for caching and
> the writes are coalesced. Write coalescing is particularly important 
> for people who insist on counting
> IOPS because, by default, 32 4KB random write IOPS will be coalesced 
> into one 128KB write. Let's
> take a closer look at your data...
Thank you for explaining this; it is a lot clearer now.
Where are the write ops coalesced? In memory/the ZIL? In the drive cache?
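Just to check my own understanding, I was planning to compare the logical 
write rate with the physical one over the same interval; a rough sketch, 
assuming our pool is called 'data' (and, as you note below, the iSCSI side 
would need iscsisvrtop rather than fsstat):

   fsstat zfs 1              (logical file-level ops/bandwidth)
   iostat -xn 10             (physical ops/bandwidth issued to the disks)
   zpool iostat -v data 10

Is that a sensible way to see the coalescing, i.e. fewer but larger physical 
writes than logical ones?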

>> Output from iostat -vx (only some drives are pasted):
>> Code:
>> device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b
>> data   36621,9 25740,2 19288,6 66191,0 197,6 25,9    3,6  40  77
>> sd18    276,3  104,8  145,2   83,3  0,0  0,6    1,5   0  36
>
> This version of iostat doesn't show average sizes :-( but you can 
> calculate them from the data :-)
Is there a version that does show the average sizes? I see they can be 
calculated, but it would be nice to see the results right there :)
Is it in the official repo, or do I have to get it from somewhere else?
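In the meantime I did the division by hand on the columns above, e.g. for 
sd18:

   average read size  = kr/s / r/s = 145,2 / 276,3 ≈ 0.5 KB
   average write size = kw/s / w/s =  83,3 / 104,8 ≈ 0.8 KB

so if I read that right, the physical I/Os really are very small.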

> From this we can suggest:
> 1. avoid 4KB sector sized disks for this configuration and workload
> 2. look further up the stack to determine why such small physical I/Os 
> are being used
I will take a look at the enterprise drives, but I think most of them still 
use 512B sectors. I will keep an eye on that when ordering the new drives.
As for looking further up the stack, where should I look? Is there something 
that can be checked on the server itself, or do I have to check it on the 
iSCSI clients?
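For the record, this is roughly what I was planning to check on the server 
side; the dataset names below are just placeholders for our actual zvols and 
file systems:

   zfs get volblocksize data/lu1    (block size of a zvol backing an iSCSI LU)
   zfs get recordsize data/fs1      (record size of a plain file system)
   zdb -C data | grep ashift        (sector size the pool assumes per vdev)
   stmfadm list-lu -v               (block size advertised to the initiators)

Is that the right direction, or does 'up the stack' mean the initiators' 
file systems and their mount options?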

>
>> Is it just me, or are there too many IOPS for those drives to handle 
>> even in theory, let alone in practice? How do I get the right measurement?
>
> To measure IOPS written into the pool, look at fsstat for file 
> systems. For iSCSI, this isn't
> quite so easy to gather, so we use dtrace, see iscsisvrtop as an example.
Looking at iscsisvrtop, I see around 260 read ops (10 MB/s) and 140 write 
ops (14 MB/s) in total:
client              ops  reads  writes  nops   rd_bw   wr_bw  ard_sz  awr_sz   rd_t     wr_t  nop_t  align%
xxx.xxx.140.38       70     49      21     0     498    2201       1      10      6     6154      0     100
xxx.xxx.57.158      114     38      67     0    1210    7699       3      11     58   697704      0     100
xxx.xxx.7.3         233    188      46     0    9240    1241       4       2   5235    11183      0     100
all                 446    284     148     0   11204   12641       3       8   3453   326634      0       0

Why are the numbers here smaller than the ones reported by 'iostat' and 
'zpool iostat'?
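To compare like with like, I will try capturing both views over the same 
window; a rough sketch, assuming iscsisvrtop takes interval/count arguments 
like the other *top DTrace scripts:

   ./iscsisvrtop 10 1          (front-end: ops the initiators actually send)
   zpool iostat -v data 10 2   (back-end: ops issued to the disks in the same 10 s)

Or is some difference between the two simply expected (metadata writes and 
so on)?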

> Unfortunately, most versions of arcstat do not measure what you want 
> to know. The measurement
> you're looking for is the reason for eviction. These are measured as 
> kstats:
> # kstat -p ::arcstats:evict_l2\*
> zfs:0:arcstats:evict_l2_cached  0
> zfs:0:arcstats:evict_l2_eligible        2224128
> zfs:0:arcstats:evict_l2_ineligible      4096
>
> For this example system, you can see:
> +  nothing is in the L2 cache (mostly, because there is no L2 :-)
> +  2224128 ARC evictions were eligible to be satisfied from an L2 cache
> + 4096 ARC evictions were not eligible
>
> This example system can benefit from an L2 cache.
I'm not sure I understand this.
What do evict_l2_eligible and evict_l2_ineligible mean?

'ARC evictions were eligible to be satisfied from an L2 cache' <-- 
2224128 ARC evictions would have been moved to the L2ARC and served from it, 
so an L2ARC would have helped in 2224128 cases?
If that is the case, does the eligible value need to be high for an L2ARC to 
make sense?

These are the stats from our server:
kstat -p ::arcstats:evict_l2\*
zfs:0:arcstats:evict_l2_cached      10717615626752
zfs:0:arcstats:evict_l2_eligible    13868417704448
zfs:0:arcstats:evict_l2_ineligible   8230791512064
zfs:0:arcstats:evict_l2_skip                174367

What do those stats tell me? The eligible counter is high; is it safe to 
assume that a bigger L2 cache would help us?
Would it be better to add more vdevs instead of a bigger L2 cache?
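If I treat those counters as byte counts (the magnitude suggests they are, 
but please correct me if not), then roughly:

   evict_l2_cached       10717615626752 bytes ≈  9.7 TiB
   evict_l2_eligible     13868417704448 bytes ≈ 12.6 TiB
   evict_l2_ineligible    8230791512064 bytes ≈  7.5 TiB

   eligible / (eligible + ineligible) ≈ 63%

so almost two thirds of what gets evicted would at least have been eligible 
for an L2ARC, if I read the counters right.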

>> *3.)*
>> As far as ZIL goes, do we need it?
>
> From the data below, yes, it will help
>
>> Output from 'zilstat':
>> Code:
>>     N-Bytes  N-Bytes/s N-Max-Rate    B-Bytes  B-Bytes/s B-Max-Rate    ops  <=4kB 4-32kB >=32kB
>>           0          0          0          0          0          0      0      0      0      0
>>           0          0          0          0          0          0      0      0      0      0
>>      178352     178352     178352     262144     262144     262144      2      0      0      2
>>   134823992  134823992  134823992  221380608  221380608  221380608   1689      0      0   1689
>>   102893848  102893848  102893848  168427520  168427520  168427520   1285      0      0   1285
>>           0          0          0          0          0          0      0      0      0      0
>>        4472       4472       4472     131072     131072     131072      1      0      0      1
>>           0          0          0          0          0          0      0      0      0      0
>>       41904      41904      41904     262144     262144     262144      2      0      0      2
>>   134963824  134963824  134963824  221511680  221511680  221511680   1690      0      0   1690
>>           0          0          0          0          0          0      0      0      0      0
>>           0          0          0          0          0          0      0      0      0      0
>>           0          0          0          0          0          0      0      0      0      0
>>           0          0          0          0          0          0      0      0      0      0
>>    32789896   32789896   32789896   53346304   53346304   53346304    407      0      0    407
>>    25467912   25467912   25467912   41811968   41811968   41811968    319      0      0    319
>> Given the stats, is a ZIL even necessary? When I'm running zilstat, I 
>> see big ops every 5s. Why is that? I know the system is supposed to flush 
>> data from memory to the spindles every 5s, but that shouldn't be seen as a 
>> ZIL flush, is that correct?
Would you be so kind as to interpret the data a bit? What can I see, and 
what can I gather from this data?
What are N-Bytes and B-Bytes?
What I see here are writes going directly to the ZIL, and data is only 
written to the ZIL if I set the sync property to always or if an iSCSI 
client mounts its device with sync options, right?
Would you consider such a ZIL device a big performance benefit for our 
system, or would it be better to spend the money on more vdevs?
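Doing the arithmetic on the busier rows above, just to check my reading of 
the columns:

   134823992 N-Bytes / 1689 ops ≈ 80 KB of data per op
   221380608 B-Bytes / 1689 ops = 131072 bytes = exactly 128 KB per op
   (and the single-op row writes 131072 B-Bytes for only 4472 N-Bytes)

so my guess is that N-Bytes is the payload handed to the ZIL and B-Bytes the 
full ZIL blocks actually allocated; is that right?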


>
>>
>> *5.)*
>> In case we decide to go with 4 JBOD cases, would it be better to 
>> build 2 pools, just so that in case 1 pool has a hiccup, we won't 
>> lose all data?
>
> This is a common configuration: two SAS pools + two servers configured 
> such that either server can
> serve the pool.
How do you usually achieve that? Do you also use the RSF-1 software, or do 
you use other solutions?

Matej

