[OmniOS-discuss] Understanding OmniOS disk IO timeouts and options to control them
Chris Siebenmann
cks at cs.toronto.edu
Wed Jan 4 18:04:25 UTC 2017
We recently had a server reboot due to the ZFS vdev_deadman/spa_deadman
timeout timer activating and panicking the system. If you haven't heard
of this timer before, that's not surprising; triggering it requires an
IO to a vdev to take more than 1000 seconds (by default; it's controlled
by the value of zfs_deadman_synctime_ms, in spa_misc.c).
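For reference, here is what I think the relevant knobs look like,
based on my reading of spa_misc.c plus the mdb and system(4)
documentation (untested; the values below are illustrations, not
recommendations):

    # read the current settings from the live kernel
    echo 'zfs_deadman_synctime_ms/E' | mdb -k
    echo 'zfs_deadman_enabled/D' | mdb -k

    * /etc/system, taking effect at the next boot:
    * lower the deadman threshold to 600 seconds
    set zfs:zfs_deadman_synctime_ms = 600000
    * or, if I'm reading spa_misc.c right, disable the deadman panic
    set zfs:zfs_deadman_enabled = 0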
Before this happened, I would not have expected that our OmniOS system
allowed an IO to run that long before timing it out and returning an
error to ZFS. Clearly I was wrong, so I'd like to understand
what disk IO timeouts OmniOS has and where (or if) we can control them
so that really long IOs get turned into forced errors well before 16
minutes go by. Unfortunately our disk topology is a bit complicated;
we have scsi_vhci multipathing on top of iSCSI disks.
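For context, this is how I look at the layering on our systems (the
device path argument is a placeholder):

    # multipath view: logical units and the paths under them
    mpathadm list lu
    mpathadm show lu /dev/rdsk/c0tXXXXXXXXd0s2

    # the iSCSI targets and sessions underneath
    iscsiadm list target -v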
In some Internet searching I've found sd_io_time (60 seconds by
default) and the sd driver's default retry count of 5 (I think; it may
be 3), which can be adjusted on a per-disk-type basis through the
retries-timeout property (per the sd(7D) manpage). Searching the kernel
code suggests that there are some hard-coded timeouts in the 30 to 90
second range, which also doesn't seem excessive.
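If I'm reading the sd manpage and the everycity.co.uk post (linked
below) correctly, the adjustments would look something like this; the
vendor/product strings and the numbers are placeholders rather than
values I'm actually proposing:

    # poke sd_io_time on the live kernel (0t marks a decimal value)
    echo 'sd_io_time/W 0t10' | mdb -kw

    * /etc/system equivalent, applied at the next boot
    set sd:sd_io_time = 10

    # /kernel/drv/sd.conf, per sd(7D): per-device-type retry tuning,
    # where the inquiry strings must match the real vendor/product
    sd-config-list = "VENDOR  PRODUCT", "retries-timeout:3";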
(I have a crash dump from this panic, so in theory I can use mdb to
look through it and see at what level an IO appears to be stuck, if I
know what to look for and how.)
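My rough plan for poking at the dump is something like the following;
I'm not certain these are the most useful dcmds, so corrections are
welcome:

    # open the dump (unix.0/vmcore.0 being whatever savecore produced)
    cd /var/crash/<hostname>
    mdb unix.0 vmcore.0

    # then, inside mdb: ZFS tunables, pool and vdev state, outstanding
    # zio trees, and threads sitting in ZFS code
    > ::zfs_params
    > ::spa -v
    > ::zio_state
    > ::stacks -m zfs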
Based on 'fmdump -eV' output, it looks like our server was
retrying IO repeatedly.
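For the record, this is roughly how I've been slicing the fmdump
output; the '-c' class glob is just my guess at the relevant ereport
classes:

    # one-line summary first, then full detail for the SCSI ereports
    fmdump -e
    fmdump -eV -c 'ereport.io.scsi.*'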
Does anyone know what I should be looking at to find and adjust
timeouts, retry counts, and so on? Is there any documentation that
I'm overlooking?
Thanks in advance.
- cks
PS: some links I've dug up in searches:
http://everycity.co.uk/alasdair/2011/05/adjusting-drive-timeouts-with-mdb-on-solaris-or-openindiana/
https://smartos.org/bugview/OS-2415
https://www.illumos.org/issues/1553