[OmniOS-discuss] scsi command timeouts

Michael Talbott mtalbott at lji.org
Thu Jun 22 22:22:11 UTC 2017


A couple things that I've discovered over time that might help:

Don't ever use the root user for zpool queries such as "zpool status". If you have a really bad failing disk, a zpool status command can take forever to complete when run as root. A "su nobody -c 'zpool status'" will return results almost instantly. So if your device discovery script(s) use zpool commands, that might be a choke point.
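For example, a disk-discovery script could run its pool queries through an unprivileged account. A minimal sketch (the script body and the use of "zpool status -x" to show only unhealthy pools are just illustrations, not the author's actual script):

#!/bin/sh
# Run the read-only pool query as an unprivileged user so a hung,
# failing disk doesn't stall the rest of the script.
status=$(su nobody -c "zpool status -x" 2>/dev/null)
echo "$status"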

# make sure to prevent scsi bus resets (in /kernel/drv/sd.conf) especially in an HA environment
allow-bus-device-reset=0;
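Note that an edit to /kernel/drv/sd.conf isn't picked up until the sd driver re-reads its configuration; the author commits his later sd.conf change with update_drv -vf sd (shown further down), and a reboot is the surest way to make sure every already-attached disk sees the new setting:

update_drv -vf sd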

Also, depending on the disk model, I've found that some of them wreak havoc on the SAS topology itself when they start to fail. Some just handle errors really badly and can flood the SAS channel. If you have a SAS switch in between, you might be able to get an idea of which device is causing the grief from there based on the counts.

In my case I have had horrible experiences with the WD WD4001FYYG. That model of drive has caused me an insane amount of headache. The disk scan on boot literally takes 13 seconds per disk (when the disks are perfectly good, and much, much longer when one is dying). If I replace them with another make/model of drive, the disk scan is done in a fraction of a second. Also, booting the same machine into any Linux OS, the scan completes in a fraction of a second. Must be something about that model's firmware that doesn't play nicely with Illumos's driver. Anyway, that's a story for another time ;)

I've reduced the drive scan time at boot from 13 seconds per disk down to 5 for that horrible, accursed drive by adding this to /kernel/drv/sd.conf:

sd-config-list= "WD      WD4001FYYG","power-condition:false";

Followed by this command to commit it:
update_drv -vf sd
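For reference, the first field of each sd-config-list pair is the SCSI vendor ID padded out to 8 characters, immediately followed by the product ID; that's why there are six spaces after "WD" above. Multiple name:value tunables can be combined, comma-separated, inside the second quoted string. A sketch of the general format with a second, purely hypothetical drive entry (check the sd man page on your release for the full list of supported tunables):

# /kernel/drv/sd.conf -- "VVVVVVVVPPP...", "tunable:value[, tunable:value]"
sd-config-list=
    "WD      WD4001FYYG", "power-condition:false",
    "HGST    HUS724040AL", "power-condition:false";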

Hope this helps.


Michael


> On Jun 22, 2017, at 1:41 PM, Schweiss, Chip <chip at innovates.com> wrote:
> 
> I'm talking about an offline pool.   I started this thread after rebooting a server that is part of an HA pair. The other server has the pools online.  It's been over 4 hours now and it still hasn't completed its disk scan.   
> 
> Every tool I have that helps me locate disks suffers from the same insane command timeout, repeated many times, before moving on.   Operations that typically take seconds blow up to hours really fast because of a few dead disks.
> 
> -Chip
> 
> 
> 
> On Thu, Jun 22, 2017 at 3:12 PM, Dale Ghent <daleg at omniti.com> wrote:
> 
> Are you able to, and have you tried, offlining it in the zpool?
> 
> zpool offline thepool <disk>
> 
> I'm assuming the pool has some redundancy which would allow for this.
> 
> /dale
> 
> > On Jun 22, 2017, at 11:54 AM, Schweiss, Chip <chip at innovates.com> wrote:
> >
> > Whenever a disk goes south, several disk-related tasks become painfully slow.  Boot-up times can jump into the hours while the disk scans complete.
> >
> > The logs slowly get these type messages:
> >
> > genunix: WARNING /pci@0,0/pci8086,340c@5/pci15d9,400@0 (mpt_sas0):
> >     Timeout of 60 seconds expired with 1 commands on target 16 lun 0
> >
> > I thought this /etc/system setting would reduce the timeout to 5 seconds:
> > set sd:sd_io_time = 5
> >
> > But this doesn't seem to change anything.
> >
> > Is there any way to make this a more reasonable timeout, besides pulling the disk that's causing it?   Just locating the defective disk is also painfully slow because of this problem.
> >
> > -Chip
> > _______________________________________________
> > OmniOS-discuss mailing list
> > OmniOS-discuss at lists.omniti.com
> > http://lists.omniti.com/mailman/listinfo/omnios-discuss
> 
> 
> _______________________________________________
> OmniOS-discuss mailing list
> OmniOS-discuss at lists.omniti.com
> http://lists.omniti.com/mailman/listinfo/omnios-discuss

