[OmniOS-discuss] Slow scrub performance

PK1048 paul at pk1048.com
Tue Jul 29 15:02:18 UTC 2014


On Jul 28, 2014, at 20:11, wuffers <moo at wuffers.net> wrote:

> Does this look normal?

Short answer, yes. … Keep in mind that 

1. a scrub runs in the background so as not to impact production I/O (this was not always the case; in the past a scrub could leave a pool effectively unresponsive)

2. a scrub essentially walks the zpool in transaction order, examining every allocated block (a resilver does the same)

So the time to complete a scrub depends on how many write transactions have occurred since the pool was created (generally related to the amount of data, but not always). You are limited by the random I/O capability of the disks involved. Since this is storage for VMs I assume it is acting as a file server, so the I/O size will also affect performance.
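
If the background throttle mentioned in point 1 is what you want to loosen, illumos exposes a few kernel tunables you can inspect with mdb. This is only a sketch; zfs_scrub_delay, zfs_scan_idle and zfs_top_maxinflight are the tunable names as I recall them from the illumos source, and setting the delay to 0 trades client latency for scrub speed, so test carefully:

    # Read the current throttle values (decimal):
    echo "zfs_scrub_delay/D" | mdb -k
    echo "zfs_scan_idle/D" | mdb -k
    echo "zfs_top_maxinflight/D" | mdb -k
    # Example only: drop the per-I/O scrub delay until the next reboot
    # (expect client latency to rise while the scrub runs faster):
    echo "zfs_scrub_delay/W0t0" | mdb -kw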

<snip>

> This is a ~90TB SAN on r151008, with 25 pairs of 4TB mirror drives. The last scrub I ran was about 3 months ago, which took (from my recollection) ~250 hours or so. I've only run about 4 scrubs so far on this installation.
> 
> The current scrub has been running for 2 weeks, with no end in sight. The last time I saw an estimate, it said around ~650 hours remaining. 

Run the numbers… you are scanning 24.2TB at about 5.5MB/sec, which works out to roughly 4,613,734 seconds, or about 53 days. And that assumes the same rate for the whole scan; the rate will change as other I/O competes for resources.
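
In case the arithmetic isn't obvious, that figure is just the total to scan divided by the current rate (treating TB and MB as binary units). A quick check at the shell:

    echo '24.2 * 1024 * 1024 / 5.5' | bc -l          # seconds for the whole scan
    echo '24.2 * 1024 * 1024 / 5.5 / 86400' | bc -l  # the same, in days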

> 
> This thread http://comments.gmane.org/gmane.os.solaris.opensolaris.zfs/46021 from over 3 years ago mention the metaslab_min_alloc_size as a way to improve this (reducing it to 4K from 10MB). Further reading into this property got me this Illumos bug: https://www.illumos.org/issues/54, which states "Turns out this tunable is made irrelevant as a result of a change to use the metaslab_df_ops allocator. We don't need to change it. I'm closing this bug.". So that seems like a dead end to me. 
> 
> This is the current load with scrub running (~350 VMs between Hyper-V and VMware environments):
> 
> # iostat -xnze
>                             extended device statistics       ---- errors ---
>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
>     0.4   12.5   39.7   78.8  0.1  0.0    5.0    0.1   0   0   0   0   0   0 rpool
>     0.2    6.9   19.9   39.4  0.0  0.0    0.0    0.1   0   0   0   0   0   0 c4t0d0
>     0.2    6.8   19.9   39.4  0.0  0.0    0.0    0.1   0   0   0   0   0   0 c4t1d0
>     4.4   29.3  209.7  962.7  0.0  0.0    0.0    1.4   0   3   0   0   0   0 c1t5000C50055F8723Bd0
>     4.7   25.1  209.4  962.3  0.0  0.0    0.0    1.5   0   3   0   0   0   0 c1t5000C50055E66B63d0
>     4.7   27.6  208.3  952.7  0.0  0.0    0.0    1.3   0   3   0   0   0   0 c1t5000C50055F87E73d0

<snip>

Looks like you have a fair bit of activity going on (almost 1MB/sec of writes per spindle).
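
If you want to see how the scrub and the client workload compete in real time, zpool iostat breaks the traffic down per vdev. A minimal example, assuming your pool is named "tank":

    # Per-vdev read/write operations and bandwidth, sampled every 5 seconds
    zpool iostat -v tank 5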

Since this is storage for VMs, I assume this is the storage server for separate compute servers? Have you tuned the block size for the file share you are using? That can make a huge difference in performance.
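
For reference, the knobs are the recordsize property on filesystem datasets and volblocksize on zvols. The dataset names and the 64K value below are only placeholders; match the size to the I/O your hypervisors actually issue:

    # Filesystem dataset (NFS/SMB share); a new recordsize only affects newly written blocks
    zfs get recordsize tank/vmstore
    zfs set recordsize=64K tank/vmstore

    # Zvol (iSCSI/FC LUN); volblocksize is fixed when the zvol is created
    zfs get volblocksize tank/vmlun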

I also noted that you only have a single log device. Best practice is to mirror log devices so you do not lose any data in flight if you are hit by a power outage (of course, if this server has more UPS runtime than all the clients, that may not matter).
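
Adding the mirror is a one-liner if you have a spare device of similar size and latency. The device names here are made up; take the current log device name from zpool status:

    # Attach a second device to the existing log device, turning it into a mirror
    zpool attach tank c1t5000C50055AAAAAAd0 c1t5000C50055BBBBBBd0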

You may want to ask this question over on the ZFS discuss list…

Subscribe here: https://www.listbox.com/subscribe/?listname=zfs@lists.illumos.org
