[OmniOS-discuss] Building a new storage
Matej Zerovnik
matej at zunaj.si
Mon Apr 13 10:16:24 UTC 2015
Hello Richard. Thank you very much for your answer.
On 10. 04. 2015 17:19, Richard Elling wrote:
> [sidebar conversation: I've experienced bad results with WD Black 4TB
> SAS drives]
What kind of bad results? Which drives do you use now, or what would you
recommend? I was thinking of going with HGST anyway, since they have a
good reputation in the Backblaze reports.
> But the more interesting question is: where do you plan to measure the
> IOPS? The backend stats
> as seen by iostat and zpool iostat are difficult to use because they
> do not account for caching and
> the writes are coalesced. Write coalescing is particularly important
> for people who insist on counting
> IOPS because, by default, 32 4KB random write IOPS will be coalesced
> into one 128KB write. Let's
> take a closer look at your data...
Thank you for explaining this. It is a lot clearer now.
Where are write ops coalesced? In memory/the ZIL? In the drive cache?
>> Output from iostat -vx (only some drives are pasted):
>> Code:
>> device r/s w/s kr/s kw/s wait actv svc_t %w %b
>> data 36621,9 25740,2 19288,6 66191,0 197,6 25,9 3,6 40 77
>> sd18 276,3 104,8 145,2 83,3 0,0 0,6 1,5 0 36
>
> This version of iostat doesn't show average sizes :-( but you can
> calculate them from the data :-)
Is there a version that does that? I see they can be calculated, but it
would be nice to see the results right there :)
Is it in the official repo, or do I have to get it from somewhere else?
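In the meantime, here is how I worked out the average I/O sizes from the
iostat numbers above (a rough sketch, assuming kr/s and kw/s are
kilobytes per second):

  # average read size = kr/s / r/s, average write size = kw/s / w/s
  data pool: 19288.6 / 36621.9 ~ 0.5 KB/read   66191.0 / 25740.2 ~ 2.6 KB/write
  sd18:        145.2 /   276.3 ~ 0.5 KB/read      83.3 /   104.8 ~ 0.8 KB/write

or scripted against live output (hypothetical one-liner, column positions
taken from 'iostat -x'; LC_ALL=C keeps the decimal separator a '.'):

  LC_ALL=C iostat -x 5 | awk '$2+0 > 0 && $3+0 > 0 { printf "%-8s avg_rd=%.1fKB avg_wr=%.1fKB\n", $1, $4/$2, $5/$3 }'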
> From this we can suggest:
> 1. avoid 4KB sector sized disks for this configuration and workload
> 2. look further up the stack to determine why such small physical I/Os
> are being used
I will take a look at the enterprise drives, but I think most of them
still use 512-byte sectors. I will be on the alert when ordering the new
drives.
As for looking further up the stack, where can I look for that? Is there
something that can be done on the server itself, or do I have to check
that on the iSCSI clients?
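For a start, I can at least check on the server what sizes the disks
actually see, and compare that with the request sizes iscsisvrtop
reports on the iSCSI side (a sketch using the stock DTrace io provider):

  # distribution of physical I/O sizes, per device
  dtrace -n 'io:::start { @sz[args[1]->dev_statname] = quantize(args[0]->b_bcount); }'

That should show whether the small physical I/Os originate on the server
or already arrive that small from the clients.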
>
>> Is it just me, or are there too many IOPS for those drives to handle
>> even in theory, let alone in practice? How do I get the right measurement?
>
> To measure IOPS written into the pool, look at fsstat for file
> systems. For iSCSI, this isn't
> quite so easy to gather, so we use dtrace, see iscsisvrtop as an example.
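For file systems, I assume you mean something along these lines (just a
sketch; 5-second samples, 'tank/fs' is a placeholder):

  fsstat zfs 5        # aggregate ops and bytes for all ZFS file systems
  fsstat /tank/fs 5   # or per mount point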
Looking at iscsisvrtop, I see around 260 read ops (10 MB/s) and 140
write ops (14 MB/s) in total:
client            ops  reads writes  nops  rd_bw  wr_bw ard_sz awr_sz  rd_t    wr_t nop_t align%
xxx.xxx.140.38     70     49     21     0    498   2201      1     10     6    6154     0    100
xxx.xxx.57.158    114     38     67     0   1210   7699      3     11    58  697704     0    100
xxx.xxx.7.3       233    188     46     0   9240   1241      4      2  5235   11183     0    100
all               446    284    148     0  11204  12641      3      8  3453  326634     0      0
Why are the numbers here smaller than the ones reported by 'iostat' and
'zpool iostat'?
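To compare like with like, I will try sampling both views over the same
interval (sketch; 'tank' is a placeholder, and I'm assuming iscsisvrtop
accepts an interval argument like the other *top DTrace scripts):

  # terminal 1: backend view, 5-second samples
  zpool iostat -v tank 5
  # terminal 2: the iSCSI view over the same 5 seconds
  ./iscsisvrtop 5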
> Unfortunately, most versions of arcstat do not measure what you want
> to know. The measurement
> you're looking for is the reason for eviction. These are measured as
> kstats:
> # kstat -p ::arcstats:evict_l2\*
> zfs:0:arcstats:evict_l2_cached 0
> zfs:0:arcstats:evict_l2_eligible 2224128
> zfs:0:arcstats:evict_l2_ineligible 4096
>
> For this example system, you can see:
> + nothing is in the L2 cache (mostly, because there is no L2 :-)
> + 2224128 ARC evictions were eligible to be satisfied from an L2 cache
> + 4096 ARC evictions were not eligible
>
> This example system can benefit from an L2 cache.
I'm not sure I understand this.
What do evict_l2_eligible and evict_l2_ineligible mean?
'ARC evictions were eligible to be satisfied from an L2 cache' <--
does that mean 2224128 ARC evictions would have been moved to L2 and
served from the L2ARC, so the L2ARC would have helped in 2224128 cases?
If that is the case, does the eligible value need to be high for L2 to
make sense?
These are the stats from our server:
kstat -p ::arcstats:evict_l2\*
zfs:0:arcstats:evict_l2_cached 10717615626752
zfs:0:arcstats:evict_l2_eligible 13868417704448
zfs:0:arcstats:evict_l2_ineligible 8230791512064
zfs:0:arcstats:evict_l2_skip 174367
What do those stats tell me? The eligible counter is high; is it safe to
assume that a bigger L2 cache would help us?
Would it be better to add more vdevs instead of a bigger L2 cache?
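To put a rough number on it, this is how I read the counters above (a
sketch; I'm assuming the evict_l2_* kstats are byte totals):

  # share of evicted data an L2ARC could have held:
  #   eligible / (eligible + ineligible)
  #   = 13868417704448 / (13868417704448 + 8230791512064) ~ 0.63, i.e. ~63%
  kstat -p ::arcstats:evict_l2_eligible ::arcstats:evict_l2_ineligible |
    awk '/ineligible/ { i=$2 } /eligible/ && !/ineligible/ { e=$2 }
         END { printf "%.0f%% of evicted bytes were L2-eligible\n", 100*e/(e+i) }'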
>> *3.)*
>> As far as ZIL goes, do we need it?
>
> From the data below, yes, it will help
>
>> Output from 'zilstat':
>> Code:
>> N-Bytes N-Bytes/s N-Max-Rate B-Bytes B-Bytes/s B-Max-Rate ops <=4kB 4-32kB >=32kB
>> 0 0 0 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0 0 0 0
>> 178352 178352 178352 262144 262144 262144 2 0 0 2
>> 134823992 134823992 134823992 221380608 221380608 221380608 1689 0 0 1689
>> 102893848 102893848 102893848 168427520 168427520 168427520 1285 0 0 1285
>> 0 0 0 0 0 0 0 0 0 0
>> 4472 4472 4472 131072 131072 131072 1 0 0 1
>> 0 0 0 0 0 0 0 0 0 0
>> 41904 41904 41904 262144 262144 262144 2 0 0 2
>> 134963824 134963824 134963824 221511680 221511680 221511680 1690 0 0 1690
>> 0 0 0 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0 0 0 0
>> 32789896 32789896 32789896 53346304 53346304 53346304 407 0 0 407
>> 25467912 25467912 25467912 41811968 41811968 41811968 319 0 0 319
>> Given the stats, is a ZIL even necessary? When I'm running zilstat,
>> I see big ops every 5s. Why is that? I know the system is supposed
>> to flush data from memory to the spindles every 5s, but that
>> shouldn't show up as a ZIL flush, is that correct?
Would you be so kind as to interpret the data a bit? What can I see and
what can I gather from this data?
What are N-Bytes and B-Bytes?
What I see here are writes going directly to the ZIL, and data is only
written to the ZIL if I set the sync option to always or if the iSCSI
client mounts its device with sync options, right?
Would you consider this ZIL a big performance benefit for our system,
or would it be better to spend the money on more vdevs?
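For what it's worth, this is how I checked which of our datasets
actually ask for synchronous semantics, plus a quick calculation on one
of the bursts above ('tank' is a placeholder for our pool):

  # which file systems / zvols request sync writes, and where the
  # setting comes from
  zfs get -r -o name,property,value,source sync tank

  # average B-Bytes per op in the 1689-op burst:
  #   221380608 / 1689 = 131072 bytes, i.e. exactly 128 KB per op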
>
>>
>> *5.)*
>> In case we decide to go with 4 JBOD cases, would it be better to
>> build 2 pools, just so that, in case one pool has a hiccup, we won't
>> lose all the data?
>
> This is a common configuration: two SAS pools + two servers configured
> such that either server can
> serve the pool.
How do you usually achieve that? Do you also use the RSF-1 software, or
do you use other solutions?
Matej