[OmniOS-discuss] Mildly confusing ZFS iostat output
Richard Elling
richard.elling at richardelling.com
Tue Jan 27 04:14:14 UTC 2015
> On Jan 26, 2015, at 5:16 PM, W Verb <wverb73 at gmail.com> wrote:
>
> Hello All,
>
> I am mildly confused by something iostat does when displaying statistics for a zpool. Before I begin rooting through the iostat source, does anyone have an idea of why I am seeing high "wait" and "wsvc_t" values for "ppool" when my devices apparently are not busy? I would have assumed that the stats for the pool would be the sum of the stats for the vdevs....
welcome to queuing theory! ;-)
First, iostat knows nothing about the devices being measured. It is really just a processor
for kstats of type KSTAT_TYPE_IO (see the kstat(3KSTAT) man page for discussion). For that
type, you get a set of two queues. For many cases, two queues is a fine model, but when there is
only one interesting queue, developers sometimes choose to put the less interesting info in the
"wait" queue.
Second, it is the responsibility of the developer to define the queues. In the case of pools,
the queues are defined as:
wait = vdev_queue_io_add() until vdev_queue_io_remove()
run = vdev_queue_pending_add() until vdev_queue_pending_remove()
The run queue is closer to the actual measured I/O to the vdev (the juicy performance bits).
The wait queue is closer to the transaction engine and includes time spent waiting for aggregation.
Thus we expect the wait queue to be longer, especially for async workloads. As a sanity check, your
ppool line below is self-consistent with Little's Law (queue length = throughput x residence time):
roughly 6,056 ops/s x 0.1383 s of wsvc_t ~= 837, which matches the reported wait of 837.4. But since
I/Os can and do get aggregated prior to being sent to the vdev, the wait queue is not a very useful
measure of overall performance. In other words, "optimizing" it away could actually hurt performance.
In general, worry about the run queues and don't worry so much about the wait queues.
NB: iostat calls "run" queues "active" queues. You say tomato, I say 'mater.
-- richard
>
> extended device statistics
> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
> 10.0 9183.0 40.5 344942.0 0.0 1.8 0.0 0.2 0 178 c4
> 1.0 187.0 4.0 19684.0 0.0 0.1 0.0 0.5 0 8 c4t5000C5006A597B93d0
> 2.0 199.0 12.0 20908.0 0.0 0.1 0.0 0.6 0 12 c4t5000C500653DE049d0
> 2.0 197.0 8.0 20788.0 0.0 0.2 0.0 0.8 0 15 c4t5000C5003607D87Bd0
> 0.0 202.0 0.0 20908.0 0.0 0.1 0.0 0.6 0 11 c4t5000C5006A5903A2d0
> 0.0 189.0 0.0 19684.0 0.0 0.1 0.0 0.5 0 10 c4t5000C500653DEE58d0
> 5.0 957.0 16.5 1966.5 0.0 0.1 0.0 0.1 0 7 c4t50026B723A07AC78d0
> 0.0 201.0 0.0 20787.9 0.0 0.1 0.0 0.7 0 14 c4t5000C5003604ED37d0
> 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c4t5000C500653E447Ad0
> 0.0 3525.0 0.0 110107.7 0.0 0.5 0.0 0.2 0 51 c4t500253887000690Dd0
> 0.0 3526.0 0.0 110107.7 0.0 0.5 0.0 0.1 1 50 c4t5002538870006917d0
> 10.0 6046.0 40.5 344941.5 837.4 1.9 138.3 0.3 23 67 ppool
>
>
> For those following the VAAI thread, this is the system I will be using as my testbed.
>
> Here is the structure of ppool (taken at a different time than above):
>
> root at sanbox:/root# zpool iostat -v ppool
> capacity operations bandwidth
> pool alloc free read write read write
> ------------------------- ----- ----- ----- ----- ----- -----
> ppool 191G 7.97T 23 637 140K 15.0M
> mirror 63.5G 2.66T 7 133 46.3K 840K
> c4t5000C5006A597B93d0 - - 1 13 24.3K 844K
> c4t5000C500653DEE58d0 - - 1 13 24.1K 844K
> mirror 63.6G 2.66T 7 133 46.5K 839K
> c4t5000C5006A5903A2d0 - - 1 13 24.0K 844K
> c4t5000C500653DE049d0 - - 1 13 24.6K 844K
> mirror 63.5G 2.66T 7 133 46.8K 839K
> c4t5000C5003607D87Bd0 - - 1 13 24.5K 843K
> c4t5000C5003604ED37d0 - - 1 13 24.4K 843K
> logs - - - - - -
> mirror 301M 222G 0 236 0 12.5M
> c4t5002538870006917d0 - - 0 236 5 12.5M
> c4t500253887000690Dd0 - - 0 236 5 12.5M
> cache - - - - - -
> c4t50026B723A07AC78d0 62.3G 11.4G 19 113 83.0K 1.07M
> ------------------------- ----- ----- ----- ----- ----- -----
>
> root at sanbox:/root# zfs get all ppool
> NAME PROPERTY VALUE SOURCE
> ppool type filesystem -
> ppool creation Sat Jan 24 18:37 2015 -
> ppool used 5.16T -
> ppool available 2.74T -
> ppool referenced 96K -
> ppool compressratio 1.51x -
> ppool mounted yes -
> ppool quota none default
> ppool reservation none default
> ppool recordsize 128K default
> ppool mountpoint /ppool default
> ppool sharenfs off default
> ppool checksum on default
> ppool compression lz4 local
> ppool atime on default
> ppool devices on default
> ppool exec on default
> ppool setuid on default
> ppool readonly off default
> ppool zoned off default
> ppool snapdir hidden default
> ppool aclmode discard default
> ppool aclinherit restricted default
> ppool canmount on default
> ppool xattr on default
> ppool copies 1 default
> ppool version 5 -
> ppool utf8only off -
> ppool normalization none -
> ppool casesensitivity sensitive -
> ppool vscan off default
> ppool nbmand off default
> ppool sharesmb off default
> ppool refquota none default
> ppool refreservation none default
> ppool primarycache all default
> ppool secondarycache all default
> ppool usedbysnapshots 0 -
> ppool usedbydataset 96K -
> ppool usedbychildren 5.16T -
> ppool usedbyrefreservation 0 -
> ppool logbias latency default
> ppool dedup off default
> ppool mlslabel none default
> ppool sync standard local
> ppool refcompressratio 1.00x -
> ppool written 96K -
> ppool logicalused 445G -
> ppool logicalreferenced 9.50K -
> ppool filesystem_limit none default
> ppool snapshot_limit none default
> ppool filesystem_count none default
> ppool snapshot_count none default
> ppool redundant_metadata all default
>
> Currently, ppool contains a single 5TB zvol that I am hosting as an iSCSI block device. At the vdev level, I have ensured that the ashift is 12 for all devices, all physical devices are 4k-native SATA, and the cache/log SSDs are also set for 4k. The block sizes are manually set in sd.conf, and confirmed with "echo ::sd_state | mdb -k | egrep '(^un|_blocksize)'". The zvol blocksize is 4k, and the iSCSI block transfer size is 512B (not that it matters).
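A minimal sd.conf sketch of the kind of entry that forces a 4K physical block size is shown below; the
vendor/product string is a placeholder (the 8-character vendor field is space-padded) and must match
the INQUIRY data the drives actually report, which "iostat -En" will show:

# Excerpt from /kernel/drv/sd.conf -- illustrative only; the VID/PID pair is a placeholder
sd-config-list =
    "ATA     ST3000DM001-1CH1", "physical-block-size:4096";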
>
> All drives contain a single Solaris2 partition with an EFI label, and are properly aligned:
> format> verify
>
> Volume name = < >
> ascii name = <ATA-ST3000DM001-1CH1-CC27-2.73TB>
> bytes/sector = 512
> sectors = 5860533167
> accessible sectors = 5860533134
> Part Tag Flag First Sector Size Last Sector
> 0 usr wm 256 2.73TB 5860516750
> 1 unassigned wm 0 0 0
> 2 unassigned wm 0 0 0
> 3 unassigned wm 0 0 0
> 4 unassigned wm 0 0 0
> 5 unassigned wm 0 0 0
> 6 unassigned wm 0 0 0
> 8 reserved wm 5860516751 8.00MB 5860533134
>
> I scrubbed the pool last night, which completed without error. From "zdb ppool", I have extracted (with minor formatting):
>
> capacity operations bandwidth ---- errors ----
> description used avail read write read write read write cksum
> ppool 339G 7.82T 26.6K 0 175M 0 0 0 5
> mirror 113G 2.61T 8.87K 0 58.5M 0 0 0 2
> /dev/dsk/c4t5000C5006A597B93d0s0 3.15K 0 48.8M 0 0 0 2
> /dev/dsk/c4t5000C500653DEE58d0s0 3.10K 0 49.0M 0 0 0 2
>
> mirror 113G 2.61T 8.86K 0 58.5M 0 0 0 8
> /dev/dsk/c4t5000C5006A5903A2d0s0 3.12K 0 48.7M 0 0 0 8
> /dev/dsk/c4t5000C500653DE049d0s0 3.08K 0 48.9M 0 0 0 8
>
> mirror 113G 2.61T 8.86K 0 58.5M 0 0 0 10
> /dev/dsk/c4t5000C5003607D87Bd0s0 2.48K 0 48.8M 0 0 0 10
> /dev/dsk/c4t5000C5003604ED37d0s0 2.47K 0 48.9M 0 0 0 10
>
> log mirror 44.0K 222G 0 0 37 0 0 0 0
> /dev/dsk/c4t5002538870006917d0s0 0 0 290 0 0 0 0
> /dev/dsk/c4t500253887000690Dd0s0 0 0 290 0 0 0 0
> Cache
> /dev/dsk/c4t50026B723A07AC78d0s0 0 73.8G 0 0 35 0 0 0 0
> Spare
> /dev/dsk/c4t5000C500653E447Ad0s0 4 0 136K 0 0 0 0
>
> This shows a few checksum errors, which is not consistent with the output of "zpool status -v", and "iostat -eE" shows no physical error count. I again see a discrepancy between the "ppool" value and what I would expect, which would be the sum of the cksum errors for each vdev.
>
> I also observed a ton of leaked space, which I expect from a live pool, as well as a single:
> db_blkptr_cb: Got error 50 reading <96, 1, 2, 3fc8> DVA[0]=<1:1dc4962000:1000> DVA[1]=<2:1dc4654000:1000> [L2 zvol object] fletcher4 lz4 LE contiguous unique double size=4000L/a00P birth=52386L/52386P fill=4825 cksum=c70e8a7765:f2adce34f59c:c8a289b51fe11d:7e0af40fe154aab4 -- skipping
>
>
> By the way, I also found:
>
> Uberblock:
> magic = 0000000000bab10c
>
> Wow. Just wow.
>
>
> -Warren V
>
> _______________________________________________
> OmniOS-discuss mailing list
> OmniOS-discuss at lists.omniti.com
> http://lists.omniti.com/mailman/listinfo/omnios-discuss