[OmniOS-discuss] Slow scrub performance
wuffers
moo at wuffers.net
Thu Jul 31 19:44:45 UTC 2014
This is going to be long-winded as well (apologies!)... lots of pasted data.
On Thu, Jul 31, 2014 at 1:37 AM, Richard Elling <richard.elling at richardelling.com> wrote:
>
> The %busy for controllers is a sum of the %busy for all disks on the
> controller, so it can be large, but overall isn't interesting. With HDDs,
> there is no way you can saturate the controller, so we don't really care
> what the %busy shows.
>
> The more important item is that the number of read ops is fairly low for
> all but 4 disks. Since you didn't post the pool configuration, we can only
> guess that they might be a source of the bottleneck.
>
> You're seeing a lot of reads from the cache devices. How much RAM does
> this system have?
>
>
I realized that the busy % was a sum after looking through some of that
data, but it's good to know that it isn't very relevant.

The pool configuration was in the original post, but here it is again
(after re-attaching the mirror log device). Just saw your edit; this has
been updated from the original post anyway.
  pool: tank
 state: ONLINE
  scan: scrub in progress since Tue Jul 29 15:41:27 2014
        82.5G scanned out of 24.5T at 555K/s, (scan is slow, no estimated time)
        0 repaired, 0.33% done
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            c1t5000C50055F9F637d0   ONLINE       0     0     0
            c1t5000C50055F9EF2Fd0   ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            c1t5000C50055F87D97d0   ONLINE       0     0     0
            c1t5000C50055F9D3B3d0   ONLINE       0     0     0
          mirror-2                  ONLINE       0     0     0
            c1t5000C50055E6606Fd0   ONLINE       0     0     0
            c1t5000C50055F9F92Bd0   ONLINE       0     0     0
          mirror-3                  ONLINE       0     0     0
            c1t5000C50055F856CFd0   ONLINE       0     0     0
            c1t5000C50055F9FE87d0   ONLINE       0     0     0
          mirror-4                  ONLINE       0     0     0
            c1t5000C50055F84A97d0   ONLINE       0     0     0
            c1t5000C50055FA0AF7d0   ONLINE       0     0     0
          mirror-5                  ONLINE       0     0     0
            c1t5000C50055F9D3E3d0   ONLINE       0     0     0
            c1t5000C50055F9F0B3d0   ONLINE       0     0     0
          mirror-6                  ONLINE       0     0     0
            c1t5000C50055F8A46Fd0   ONLINE       0     0     0
            c1t5000C50055F9FB8Bd0   ONLINE       0     0     0
          mirror-7                  ONLINE       0     0     0
            c1t5000C50055F8B21Fd0   ONLINE       0     0     0
            c1t5000C50055F9F89Fd0   ONLINE       0     0     0
          mirror-8                  ONLINE       0     0     0
            c1t5000C50055F8BE3Fd0   ONLINE       0     0     0
            c1t5000C50055F9E123d0   ONLINE       0     0     0
          mirror-9                  ONLINE       0     0     0
            c1t5000C50055F9379Bd0   ONLINE       0     0     0
            c1t5000C50055F9E7D7d0   ONLINE       0     0     0
          mirror-10                 ONLINE       0     0     0
            c1t5000C50055E65F0Fd0   ONLINE       0     0     0
            c1t5000C50055F9F80Bd0   ONLINE       0     0     0
          mirror-11                 ONLINE       0     0     0
            c1t5000C50055F8A22Bd0   ONLINE       0     0     0
            c1t5000C50055F8D48Fd0   ONLINE       0     0     0
          mirror-12                 ONLINE       0     0     0
            c1t5000C50055E65807d0   ONLINE       0     0     0
            c1t5000C50055F8BFA3d0   ONLINE       0     0     0
          mirror-13                 ONLINE       0     0     0
            c1t5000C50055E579F7d0   ONLINE       0     0     0
            c1t5000C50055E65877d0   ONLINE       0     0     0
          mirror-14                 ONLINE       0     0     0
            c1t5000C50055F9FA1Fd0   ONLINE       0     0     0
            c1t5000C50055F8CDA7d0   ONLINE       0     0     0
          mirror-15                 ONLINE       0     0     0
            c1t5000C50055F8BF9Bd0   ONLINE       0     0     0
            c1t5000C50055F9A607d0   ONLINE       0     0     0
          mirror-16                 ONLINE       0     0     0
            c1t5000C50055E66503d0   ONLINE       0     0     0
            c1t5000C50055E4FDE7d0   ONLINE       0     0     0
          mirror-17                 ONLINE       0     0     0
            c1t5000C50055F8E017d0   ONLINE       0     0     0
            c1t5000C50055F9F3EBd0   ONLINE       0     0     0
          mirror-18                 ONLINE       0     0     0
            c1t5000C50055F8B80Fd0   ONLINE       0     0     0
            c1t5000C50055F9F63Bd0   ONLINE       0     0     0
          mirror-19                 ONLINE       0     0     0
            c1t5000C50055F84FB7d0   ONLINE       0     0     0
            c1t5000C50055F9FEABd0   ONLINE       0     0     0
          mirror-20                 ONLINE       0     0     0
            c1t5000C50055F8CCAFd0   ONLINE       0     0     0
            c1t5000C50055F9F91Bd0   ONLINE       0     0     0
          mirror-21                 ONLINE       0     0     0
            c1t5000C50055E65ABBd0   ONLINE       0     0     0
            c1t5000C50055F8905Fd0   ONLINE       0     0     0
          mirror-22                 ONLINE       0     0     0
            c1t5000C50055E57A5Fd0   ONLINE       0     0     0
            c1t5000C50055F87E73d0   ONLINE       0     0     0
          mirror-23                 ONLINE       0     0     0
            c1t5000C50055E66053d0   ONLINE       0     0     0
            c1t5000C50055E66B63d0   ONLINE       0     0     0
          mirror-24                 ONLINE       0     0     0
            c1t5000C50055F8723Bd0   ONLINE       0     0     0
            c1t5000C50055F8C3ABd0   ONLINE       0     0     0
        logs
          mirror-25                 ONLINE       0     0     0
            c2t5000A72A3007811Dd0   ONLINE       0     0     0
            c12t5000A72B300780FFd0  ONLINE       0     0     0
        cache
          c2t500117310015D579d0     ONLINE       0     0     0
          c2t50011731001631FDd0     ONLINE       0     0     0
          c12t500117310015D59Ed0    ONLINE       0     0     0
          c12t500117310015D54Ed0    ONLINE       0     0     0
        spares
          c1t5000C50055FA2AEFd0     AVAIL
          c1t5000C50055E595B7d0     AVAIL
Basically, this is two head nodes (Supermicro 826BE26) connected to a
Supermicro 847E26 JBOD, using LSI 9207 HBAs. There are 52 Seagate
ST4000NM0023s (4TB SAS drives): 25 mirror pairs plus 2 spares. There are
4 Smart Optimus 400GB SSDs as cache devices and 2 Stec ZeusRAMs for
slogs. They're wired so that both nodes can see all the drives (data,
cache and log), and the data drives are on separate controllers from the
cache/slog devices. RSF-1 was also specced in here but is not in use at
the moment. All the SAN traffic goes over InfiniBand (SRP). Each head
unit has 256GB of RAM. Dedupe is not in use and all the latest feature
flags are enabled.
An arc_summary output:
System Memory:
        Physical RAM:  262103 MB
        Free Memory :  10273 MB
        LotsFree:      4095 MB

ZFS Tunables (/etc/system):
        set zfs:zfs_arc_shrink_shift = 10

ARC Size:
        Current Size:             225626 MB (arcsize)
        Target Size (Adaptive):   225626 MB (c)
        Min Size (Hard Limit):    8190 MB (zfs_arc_min)
        Max Size (Hard Limit):    261079 MB (zfs_arc_max)

ARC Size Breakdown:
        Most Recently Used Cache Size:   10%  23290 MB (p)
        Most Frequently Used Cache Size: 89%  202335 MB (c-p)

ARC Efficency:
        Cache Access Total:       27377320465
        Cache Hit Ratio:      93% 25532510784   [Defined State for buffer]
        Cache Miss Ratio:      6% 1844809681    [Undefined State for Buffer]
        REAL Hit Ratio:       92% 25243933796   [MRU/MFU Hits Only]

        Data Demand Efficiency:    95%
        Data Prefetch Efficiency:  40%

        CACHE HITS BY CACHE LIST:
          Anon:                       --%  Counter Rolled.
          Most Recently Used:         18%  4663226393 (mru)       [Return Customer]
          Most Frequently Used:       80%  20580707403 (mfu)      [Frequent Customer]
          Most Recently Used Ghost:    0%  176686906 (mru_ghost)  [Return Customer Evicted, Now Back]
          Most Frequently Used Ghost:  0%  126286869 (mfu_ghost)  [Frequent Customer Evicted, Now Back]

        CACHE HITS BY DATA TYPE:
          Demand Data:                95%  24413671342
          Prefetch Data:               1%  358419705
          Demand Metadata:             2%  698314899
          Prefetch Metadata:           0%  62104838

        CACHE MISSES BY DATA TYPE:
          Demand Data:                69%  1277347273
          Prefetch Data:              28%  519579788
          Demand Metadata:             2%  39512363
          Prefetch Metadata:           0%  8370257
And a sample of arcstat output (with the first summary line deleted):
# arcstat -f read,hits,miss,hit%,l2read,l2hits,l2miss,l2hit%,arcsz,l2size,l2asize 1
    read  hits  miss  hit%  l2read  l2hits  l2miss  l2hit%  arcsz  l2size  l2asize
    5.9K  4.6K  1.3K    78    1.3K    1.2K      80      93   220G    1.6T     901G
    6.7K  5.2K  1.5K    76    1.5K    1.3K     250      83   220G    1.6T     901G
    7.0K  5.3K  1.7K    76    1.7K    1.4K     316      81   220G    1.6T     901G
    6.5K  5.3K  1.2K    80    1.2K    1.1K     111      91   220G    1.6T     901G
    6.4K  5.2K  1.2K    81    1.2K    1.1K     100      91   220G    1.6T     901G
    7.2K  5.6K  1.6K    78    1.6K    1.3K     289      81   220G    1.6T     901G
    8.5K  6.8K  1.7K    80    1.7K    1.3K     351      79   220G    1.6T     901G
    7.5K  5.9K  1.6K    78    1.6K    1.3K     282      82   220G    1.6T     901G
    6.7K  5.6K  1.1K    83    1.1K     991     123      88   220G    1.6T     901G
    6.8K  5.5K  1.3K    80    1.3K    1.1K     234      82   220G    1.6T     901G
Interesting to see an l2asize of only 901G even though I'd expect more:
373G x 4 is just under 1.5TB of raw capacity. With the compressed L2ARC,
l2size reports 1.6T while l2asize -- the space actually used on the
devices -- is only 901G. I'd expect more of the raw capacity to be in
use. Perhaps Saso can comment on this portion, if he's following this
thread (snipped from "zpool iostat -v"):
cache                           -      -      -      -      -      -
  c2t500117310015D579d0      373G     8M    193     16  2.81M   394K
  c2t50011731001631FDd0      373G  5.29M    194     15  2.85M   360K
  c12t500117310015D59Ed0     373G  5.50M    191     17  2.74M   368K
  c12t500117310015D54Ed0     373G  5.57M    200     14  2.89M   300K
(That expectation comes from this discussion:
http://lists.omniti.com/pipermail/omnios-discuss/2014-February/002287.html.)
The uptime on this node is currently around 58 days, so it should have
had enough time to rotate through the L2ARC "rotor".
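For what it's worth, the same numbers can be pulled straight from the
arcstats kstats as a cross-check (a quick sketch; I'm going from memory
on the exact statistic names, so adjust as needed):

# logical vs. allocated L2ARC size straight from the kernel counters
kstat -p zfs:0:arcstats:l2_size zfs:0:arcstats:l2_asize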
> methinks the scrub I/Os are getting starved and since they are low
> priority, they could get very starved. In general, I wouldn't worry about
> it, but I understand why you might be nervous. Keep in mind that in ZFS,
> scrubs are intended to find errors on idle data, not frequently accessed
> data.
>
> more far below...
>
>
I'm worried because there's no way the scrub will ever complete before the
next reboot. Regular scrubs are important, right?
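Just to put a number on it, a rough bc calculation (treating the zpool
status figures as 1024-based units and pretending the current rate
holds):

# remaining data / scan rate, in days: ~24.5T total, 82.5G done, 555K/s
echo "scale=1; (24.5*1024^3 - 82.5*1024^2) / 555 / 86400" | bc
# => roughly 547 days at the current pace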
> ok, so the pool is issuing 720 read iops, including resilver workload, vs
> 1298 write iops. There is plenty of I/O capacity left on the table here,
> as you can see by the %busy being so low.
>
> So I think the pool is not scheduling scrub I/Os very well. You can
> increase the number of scrub I/Os in the scheduler by adjusting the
> zfs_vdev_scrub_max_active tunable. The default is 2, but you'll have to
> consider that a share (in the stock market sense) where the active sync
> reads and writes are getting 10 each. You can try bumping up the value
> and see what happens over some time, perhaps 10 minutes or so -- too
> short of a time and you won't get a good feeling for the impact (try
> this in off-peak time).
>
>     echo zfs_vdev_scrub_max_active/W0t5 | mdb -kw
>
> will change the value from 2 to 5, increasing its share of the total I/O
> workload.
>
> You can see the progress of the scan (scrubs do scan) workload by looking
> at the ZFS debug messages.
>
>     echo ::zfs_dbgmsg | mdb -k
>
> These will look mysterious... they are. But the interesting bits are
> about how many blocks are visited in some amount of time (txg sync
> interval). Ideally, this will change as you adjust
> zfs_vdev_scrub_max_active.
> -- richard
>
>
Actually, you used the data from before the resilver. During the
resilver this was the activity on the pool:

 3883.5 1357.7 40141.6  60739.5 22.8 38.6    4.4    7.4  54 100 tank

Are you looking at an individual drive's busy % or the pool's busy % to
determine whether it's "busy"? During the resilver this was the activity
on the drives (a sample; they were between 38% and 45% busy, whereas
during the scrub alone the individual drives were 2-5% busy):
   59.5   30.0   553.2    741.8  0.0  0.9    0.0    9.8   0  45 c1t5000C50055F8A46Fd0
   57.4   22.5   504.0    724.5  0.0  0.8    0.0    9.6   0  41 c1t5000C50055F856CFd0
   58.4   24.6   531.4    786.9  0.0  0.7    0.0    8.4   0  38 c1t5000C50055E6606Fd0
But yes, without the resilver the busy % was much lower (during the
scrub each individual drive was 2-4% busy). I've pasted the current
iostat output further below.
With zfs_vdev_scrub_max_active at the default of 2, it was visiting an
average of 162 blocks per txg sync:

doing scan sync txg 26678243; bm=897/1/0/15785978
scanned dataset 897 (tank/vmware-64k-5tb-1) with min=1 max=26652167; pausing=1
visited 162 blocks in 6090ms
doing scan sync txg 26678244; bm=897/1/0/15786126
scanned dataset 897 (tank/vmware-64k-5tb-1) with min=1 max=26652167; pausing=1
visited 162 blocks in 6094ms
After changing it to 5, and waiting about 20 mins, I'm not seeing anything
significantly different:
doing scan sync txg 26678816; bm=897/1/0/37082043
scanned dataset 897 (tank/vmware-64k-5tb-1) with min=1 max=26652167; pausing=1
visited 163 blocks in 6154ms
doing scan sync txg 26678817; bm=897/1/0/37082193
scanned dataset 897 (tank/vmware-64k-5tb-1) with min=1 max=26652167; pausing=1
visited 162 blocks in 6128ms
  pool: tank
 state: ONLINE
  scan: scrub in progress since Tue Jul 29 15:41:27 2014
        97.0G scanned out of 24.5T at 599K/s, (scan is slow, no estimated time)
        0 repaired, 0.39% done
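(Side note for anyone else trying this: the value can be read back to
confirm the mdb write took, e.g.

    echo zfs_vdev_scrub_max_active/D | mdb -k

which prints the variable as a decimal.)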
I'll keep the zfs_vdev_scrub_max_active tunable at 5, since it doesn't
appear to be hurting anything, and monitor for changes. What's strange
to me is that the scrub was "humming" along at 5.5MB/s at the two-week
mark (before reattaching the mirror log device) but is now 10x slower.
With the tunable change it *seems* marginally faster, going from 541K/s
to almost 600K/s...
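If it does end up making a real difference, I'll make it persistent in
/etc/system next to the existing arc tunable, since the mdb write only
lasts until the next reboot -- presumably something like:

* raise the scrub I/O scheduler share (default is 2)
set zfs:zfs_vdev_scrub_max_active = 5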
This is the current activity from "iostat -xnCz 60 2":
                    extended device statistics
    r/s    w/s    kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b device
  158.8 1219.2  3717.8  39969.8  0.0  1.6    0.0    1.1   0 139 c1
    3.6   35.1    86.2    730.9  0.0  0.0    0.0    0.9   0   3 c1t5000C50055F8723Bd0
    3.7   19.9    83.7    789.9  0.0  0.0    0.0    1.4   0   3 c1t5000C50055E66B63d0
    2.7   22.5    60.8    870.9  0.0  0.0    0.0    1.1   0   2 c1t5000C50055F87E73d0
    2.4   27.9    66.0    765.8  0.0  0.0    0.0    0.8   0   2 c1t5000C50055F8BFA3d0
    2.8   17.9    64.9    767.0  0.0  0.0    0.0    0.8   0   1 c1t5000C50055F9E123d0
    3.1   26.1    73.8    813.3  0.0  0.0    0.0    0.9   0   2 c1t5000C50055F9F0B3d0
    3.1   15.5    79.4    783.4  0.0  0.0    0.0    1.3   0   2 c1t5000C50055F9D3B3d0
    3.8   38.6    86.2    826.8  0.0  0.1    0.0    1.2   0   4 c1t5000C50055E4FDE7d0
    3.8   15.4    93.0    822.3  0.0  0.0    0.0    1.5   0   3 c1t5000C50055F9A607d0
    3.0   25.7    79.4    719.7  0.0  0.0    0.0    0.9   0   2 c1t5000C50055F8CDA7d0
    3.2   26.5    69.0    824.3  0.0  0.0    0.0    1.1   0   3 c1t5000C50055E65877d0
    3.7   42.6    79.2    834.1  0.0  0.1    0.0    1.3   0   5 c1t5000C50055F9E7D7d0
    3.3   23.2    79.5    778.0  0.0  0.0    0.0    1.2   0   3 c1t5000C50055FA0AF7d0
    3.4   30.2    77.0    805.9  0.0  0.0    0.0    0.9   0   3 c1t5000C50055F9FE87d0
    3.0   15.4    72.6    795.0  0.0  0.0    0.0    1.6   0   3 c1t5000C50055F9F91Bd0
    2.5   38.1    61.1    859.4  0.0  0.1    0.0    1.6   0   5 c1t5000C50055F9FEABd0
    2.1   13.2    42.7    801.6  0.0  0.0    0.0    1.6   0   2 c1t5000C50055F9F63Bd0
    3.0   20.0    62.6    766.6  0.0  0.0    0.0    1.1   0   2 c1t5000C50055F9F3EBd0
    3.7   24.3    80.2    807.9  0.0  0.0    0.0    1.0   0   2 c1t5000C50055F9F80Bd0
    3.2   35.2    66.1    852.4  0.0  0.0    0.0    1.2   0   4 c1t5000C50055F9FB8Bd0
    3.9   30.6    84.7    845.7  0.0  0.0    0.0    0.8   0   3 c1t5000C50055F9F92Bd0
    2.7   18.1    68.8    831.4  0.0  0.0    0.0    1.4   0   2 c1t5000C50055F8905Fd0
    2.7   17.7    61.4    762.1  0.0  0.0    0.0    1.0   0   2 c1t5000C50055F8D48Fd0
    3.5   17.5    87.8    749.7  0.0  0.0    0.0    1.7   0   3 c1t5000C50055F9F89Fd0
    2.6   13.7    58.6    780.9  0.0  0.0    0.0    1.7   0   3 c1t5000C50055F9EF2Fd0
    3.3   34.9    74.5    730.9  0.0  0.0    0.0    0.8   0   3 c1t5000C50055F8C3ABd0
    3.1   19.3    64.7    789.9  0.0  0.0    0.0    1.0   0   2 c1t5000C50055E66053d0
    3.8   38.5    82.9    826.8  0.0  0.1    0.0    1.3   0   4 c1t5000C50055E66503d0
    3.7   25.8    91.4    813.3  0.0  0.0    0.0    0.8   0   2 c1t5000C50055F9D3E3d0
    2.2   37.9    52.5    859.4  0.0  0.0    0.0    1.1   0   4 c1t5000C50055F84FB7d0
    2.8   20.0    62.8    766.6  0.0  0.0    0.0    1.0   0   2 c1t5000C50055F8E017d0
    3.9   26.1    86.5    824.3  0.0  0.0    0.0    1.1   0   3 c1t5000C50055E579F7d0
    3.1   27.7    79.9    765.8  0.0  0.0    0.0    1.2   0   3 c1t5000C50055E65807d0
    2.9   22.8    76.3    778.0  0.0  0.0    0.0    1.1   0   3 c1t5000C50055F84A97d0
    3.6   15.3    89.0    783.4  0.0  0.0    0.0    1.7   0   3 c1t5000C50055F87D97d0
    2.8   13.8    77.9    780.9  0.0  0.0    0.0    1.5   0   2 c1t5000C50055F9F637d0
    2.1   18.3    51.4    831.4  0.0  0.0    0.0    1.1   0   2 c1t5000C50055E65ABBd0
    3.1   15.4    70.9    822.3  0.0  0.0    0.0    1.2   0   2 c1t5000C50055F8BF9Bd0
    3.2   17.9    75.5    762.1  0.0  0.0    0.0    1.2   0   2 c1t5000C50055F8A22Bd0
    3.7   42.4    83.3    834.1  0.0  0.1    0.0    1.1   0   5 c1t5000C50055F9379Bd0
    4.0   22.7    86.8    870.9  0.0  0.0    0.0    1.0   0   2 c1t5000C50055E57A5Fd0
    2.6   15.5    67.5    795.0  0.0  0.0    0.0    1.4   0   2 c1t5000C50055F8CCAFd0
    2.9   13.2    65.4    801.6  0.0  0.0    0.0    1.9   0   3 c1t5000C50055F8B80Fd0
    3.3   25.7    82.7    719.7  0.0  0.0    0.0    1.1   0   3 c1t5000C50055F9FA1Fd0
    4.0   24.0    84.9    807.9  0.0  0.0    0.0    1.1   0   3 c1t5000C50055E65F0Fd0
    2.8   18.4    69.5    767.0  0.0  0.0    0.0    1.0   0   2 c1t5000C50055F8BE3Fd0
    3.3   17.6    81.6    749.7  0.0  0.0    0.0    1.4   0   3 c1t5000C50055F8B21Fd0
    3.3   35.1    64.2    852.4  0.0  0.0    0.0    1.1   0   4 c1t5000C50055F8A46Fd0
    3.5   30.0    82.1    805.9  0.0  0.0    0.0    0.9   0   3 c1t5000C50055F856CFd0
    3.9   30.4    89.5    845.7  0.0  0.0    0.0    0.9   0   3 c1t5000C50055E6606Fd0
  429.4  133.6  5933.0   8163.0  0.0  0.2    0.0    0.3   0  12 c2
  215.8   28.4  2960.4    677.7  0.0  0.1    0.0    0.2   0   5 c2t500117310015D579d0
  213.7   27.4  2972.6    654.1  0.0  0.1    0.0    0.2   0   5 c2t50011731001631FDd0
    0.0   77.8     0.0   6831.2  0.0  0.1    0.0    0.8   0   2 c2t5000A72A3007811Dd0
    0.0   12.3     0.0     46.8  0.0  0.0    0.0    0.0   0   0 c4
    0.0    6.2     0.0     23.4  0.0  0.0    0.0    0.0   0   0 c4t0d0
    0.0    6.1     0.0     23.4  0.0  0.0    0.0    0.0   0   0 c4t1d0
  418.4  134.8  5663.1   8197.8  0.0  0.2    0.0    0.3   0  11 c12
    0.0   77.8     0.0   6831.2  0.0  0.1    0.0    0.8   0   2 c12t5000A72B300780FFd0
  203.5   29.7  2738.0    715.8  0.0  0.1    0.0    0.2   0   5 c12t500117310015D59Ed0
  214.9   27.2  2925.2    650.8  0.0  0.1    0.0    0.2   0   5 c12t500117310015D54Ed0
    0.0   11.3     0.0     46.8  0.0  0.0    0.7    0.1   0   0 rpool
 1006.7 1478.2 15313.9  56330.7 30.4  2.0   12.2    0.8   6  64 tank
Seems the pool is busy at 64% but the individual drives are not taxed
at all (this load is virtually identical to when the scrub was running
before the resilver was triggered). I'm still not sure how to interpret
this data. Is the system over-stressed? Is there really a bottleneck
somewhere, or do I just need to fine-tune some settings?
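In the meantime I'll keep an eye on the scan rate with something simple
like this in a spare terminal:

# log the scrub progress line every 10 minutes
while true; do date; zpool status tank | grep scanned; sleep 600; done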
Going to try some iometer results via the VM I/O Analyzer.