[OmniOS-discuss] Understanding OmniOS disk IO timeouts and options to control them
Richard Elling
richard.elling at richardelling.com
Wed Jan 4 18:29:51 UTC 2017
> On Jan 4, 2017, at 10:04 AM, Chris Siebenmann <cks at cs.toronto.edu> wrote:
>
> We recently had a server reboot due to the ZFS vdev_deadman/spa_deadman
> timeout timer activating and panicking the system. If you haven't heard
> of this timer before, that's not surprising; triggering it requires an
> IO to a vdev to take more than 1000 seconds (by default; it's controlled
> by the value of zfs_deadman_synctime_ms, in spa_misc.c).
>
> Before this happened, I would not have expected that our OmniOS system
> allowed an IO to run that long before timing it out and returning an
> error to ZFS. Clearly I'm wrong, which means that I'd like to understand
> what disk IO timeouts OmniOS has and where (or if) we can control them
> so that really long IOs get turned into forced errors well before 16
> minutes go by. Unfortunately our disk topology is a bit complicated;
> we have scsi_vhci multipathing on top of iSCSI disks.
Do not assume the timeout reflects properly operating software or firmware.
The original impetus for the deadman was to allow debugging of the underlying
stack. Prior to adding the deadman, the I/O could be stuck forever.
>
> In some Internet searching I've found sd_io_time (60 seconds by
> default) and the default SD retry count of 5 (I think, it may be
> 3), which can be adjusted on a per-disk-type basis through the
> retries-timeout parameter (per the sd manpage). Searching the kernel
> code suggests that there are some hard-coded timeouts in the 30 to 90
> second range, which also doesn't seem excessive.
At the sd level, most commands follow sd_io_time and the retry count. scsi_vhci
adds significant complexity above sd and below zfs.
-- richard
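
To put the numbers from this thread side by side, here is a rough sketch in
Python. It assumes a simplified model in which each attempt can take up to
sd_io_time seconds and a command is tried once plus up to the retry count.
That model is an assumption for illustration only, not the sd driver's actual
logic, and it ignores anything scsi_vhci or iSCSI add on top. The function and
constant names are made up for this example.

```python
# Values quoted in the thread.
SD_IO_TIME_S = 60                  # sd_io_time default, in seconds
DEADMAN_SYNCTIME_MS = 1_000_000    # zfs_deadman_synctime_ms default (1000 s)


def worst_case_sd_seconds(io_time_s: int, retries: int) -> int:
    """Upper bound for one command at the sd layer.

    ASSUMPTION: one initial attempt plus `retries` retries, each allowed
    to run for the full io_time_s before timing out.
    """
    return io_time_s * (retries + 1)


def deadman_seconds(synctime_ms: int) -> int:
    """Convert the deadman threshold from milliseconds to whole seconds."""
    return synctime_ms // 1000


deadman_s = deadman_seconds(DEADMAN_SYNCTIME_MS)
print(f"deadman threshold: {deadman_s} s ({deadman_s / 60:.1f} min)")

# The thread is unsure whether the default retry count is 3 or 5, so try both.
for retries in (3, 5):
    budget = worst_case_sd_seconds(SD_IO_TIME_S, retries)
    print(f"retries={retries}: sd-level worst case {budget} s, "
          f"{deadman_s - budget} s short of the deadman")
```

Under this model the sd layer alone would give up after 240 or 360 seconds,
well short of the 1000-second deadman. That is consistent with the extra time
coming from layers above sd, or from software or firmware that is not behaving
properly.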
>
> (I have a crash dump from this panic, so I can in theory use mdb
> to look through it to see just what level an IO appears stuck at
> if I know what to look for and how.)
>
> Based on 'fmdump -eV' output, it looks like our server was
> retrying IO repeatedly.
>
> Does anyone know what I should be looking at to find and adjust
> timeouts, retry counts, and so on? Is there any documentation that
> I'm overlooking?
>
> Thanks in advance.
>
> - cks
> PS: some links I've dug up in searches:
> http://everycity.co.uk/alasdair/2011/05/adjusting-drive-timeouts-with-mdb-on-solaris-or-openindiana/
> https://smartos.org/bugview/OS-2415
> https://www.illumos.org/issues/1553