[OmniOS-discuss] system hangs randomly

Fri Nov 8 22:02:26 UTC 2013

On Nov 8, 2013, at 8:20 AM, Jim Klimov <jimklimov at cos.ru> wrote:

> The logs specify that your IDE devices (I believe, these are the rpool
> SSDs in legacy mode) return errors on reads and timeout on retries or
> resets. This may mean a few things:
> 
> 1) Imminent device death, e.g. due to wear over the device's lifetime.
> Try to get these replaced with new units (especially if their age or
> actual diagnostic results from "smartctl" or vendor tools also point
> to such a scenario).

I vote for this one. The X25-E is well known for failing this way. The
only recourse is to replace the disk.
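
To see how often each stage of the read-error and reset escalation shows up in
the console log quoted further down, a small filter along these lines may help.
This is only a sketch: the function name tally_ata_errors and the file name
console.log are made up for illustration, and it assumes the messages were
saved to a text file in the same layout as the quoted log.

```shell
#!/bin/sh
# Sketch: tally ata timeout stages and failed commands from a saved log.
# tally_ata_errors is a hypothetical helper name, not an OmniOS tool.
tally_ata_errors() {
    # Split fields on single quotes so the command name in
    # "Error for command 'read sector'" lands in field 2.
    awk -F"'" '
        /timeout: / {
            stage = $0
            sub(/.*timeout: /, "", stage)   # drop everything before the stage
            sub(/,.*/, "", stage)           # drop ", target=0 lun=0"
            n["timeout: " stage]++
            next
        }
        /Error for command/ { n["failed command: " $2]++ }
        END { for (k in n) print n[k], k }
    ' | sort -rn
}

# Usage (console.log is a placeholder file name):
#   tally_ata_errors < console.log
```

Each output line is a count followed by the message type, most frequent first.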

> 
> 2) Bad diagnostics, perhaps due to IDE protocol limitations - try to
> switch the controller into SATA mode and use some illumos live media
> (OI LiveCD/LiveUSB or OmniOS equivalents) to boot the server with the
> rpool disks in SATA mode and run:

This isn't the cause of, or the solution to, the disk's woes, but I recommend
switching to AHCI mode at your convenience. You might be able to replace the
disk without an outage, but the switch to AHCI will require one.
 -- richard
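
For the disk replacement itself, the usual order of operations on a mirrored
rpool can be sketched as a dry run. The script below only prints commands, it
does not run them. Everything specific in it is an assumption: the thread does
not say which of c4d0s0 and c3d1s0 is the failing unit, c5d0s0 is a made-up
name for the replacement, and the installgrub line assumes a GRUB-legacy boot
setup. Whether the physical swap can be done without an outage depends on the
hardware.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the steps for swapping one half of
# a mirrored root pool. plan_rpool_swap is a hypothetical helper name.
plan_rpool_swap() {
    pool=$1      # pool name, e.g. rpool
    healthy=$2   # surviving mirror half (assumed, not confirmed in the thread)
    failing=$3   # suspect disk (assumed, not confirmed in the thread)
    new=$4       # replacement slice (hypothetical device name)

    # Attach the new device to the healthy half; this starts a resilver.
    echo "zpool attach $pool $healthy $new"
    # Check progress; wait for the resilver to finish before going further.
    echo "zpool status $pool"
    # Make the new disk bootable (assumes GRUB legacy on this system).
    echo "installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/$new"
    # Only then drop the suspect disk from the mirror.
    echo "zpool detach $pool $failing"
}

plan_rpool_swap rpool c3d1s0 c4d0s0 c5d0s0
```

Attaching first and detaching last keeps the pool at least two-way mirrored
for the whole procedure.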

> 
> zpool import -N -R /a -f rpool
> mount -F zfs rpool/ROOT/your_BE_name /a && \
>  touch /a/reconfigure
> zpool export rpool
> 
> Depending on your OS setup, the BE mounting may require some other
> command (like "zfs mount rpool/ROOT/your_BE_name").
> 
> This routine mounts the pool, indicates to the BE that it should make
> new device nodes (so it runs "devfsadm" early in the boot), and exports
> the pool. In the process, the rpool ZFS labels begin referencing the new
> hard-disk device node names which is what the rootfs procedure relies
> on. In some more difficult cases it might help to also copy (rsync) the
> /dev/ and /devices/ from the live environment into the on-disk BE so
> that these device names saved into the pool labels would match those
> discovered by the kernel upon boot.
> 
> Do have backups; it might make sense to complete this experiment with
> one of the mirror halves removed, so that if nothing works (even rolling
> back to an IDE-only setup) you can destroy this half's content and boot
> in IDE mode from the other half and re-attach the mirrored part to it.
> 
> As a variant, it might make sense (if you'd also refresh the hardware)
> to attach the new device(s) to the rpool as a 3/4-way mirror, and then
> complete the switcheroo to SATA with only the new couple plugged in -
> you'd be able to fall back on the old and tested set if all goes wrong
> somehow.
> 
> Good luck,
> //Jim
> 
> 
> On 2013-11-08 13:35, Hafiz Rafibeyli wrote:
>> The log on the monitor when the system hangs looked like this (I can send the actual screenshot to individual mail addresses):
>> 
>> scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata0):
>>         timeout: reset bus, target=0 lun=0
>> scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata0):
>>         timeout: early timeout, target=0 lun=0
>> gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0/cmdk@0,0 (Disk0):
>>         Error for command 'read sector'   Error Level: Informational
>> gda: [ID 107833 kern.notice]           Sense Key: aborted command
>> gda: [ID 107833 kern.notice]           Vendor 'Gen-ATA ' error code: 0x3
>> gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0/cmdk@0,0 (Disk0):
>>         Error for command 'read sector'   Error Level: Informational
>> gda: [ID 107833 kern.notice]           Sense Key: aborted command
>> gda: [ID 107833 kern.notice]           Vendor 'Gen-ATA ' error code: 0x3
>> scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata0):
>>         timeout: abort request, target=0 lun=0
>> scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata0):
>>         timeout: abort device, target=0 lun=0
>> scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata0):
>>         timeout: reset target, target=0 lun=0
>> scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata0):
>>         timeout: reset bus, target=0 lun=0
>> scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata0):
>>         timeout: early timeout, target=0 lun=0
>> gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0/cmdk@0,0 (Disk0):
>>         Error for command 'read sector'   Error Level: Informational
>> gda: [ID 107833 kern.notice]           Sense Key: aborted command
>> gda: [ID 107833 kern.notice]           Vendor 'Gen-ATA ' error code: 0x3
>> gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0/cmdk@0,0 (Disk0):
>> 
>> 
>> Hello,
>> 
>> OmniOS version: SunOS 5.11 omnios-b281e50
>> Server: Supermicro X8DAH (24x storage chassis)
>> 
>> We are using OmniOS as a production NFS server for ESXi hosts.
>> 
>> Everything was OK, but in the last 20 days the system has hung 3 times. Nothing changed on the hardware side.
>> 
>> For the OS disks we are using two SSDSA2SH032G1GN (32 GB Intel X25-E SSD) in a ZFS mirror, attached to the onboard SATA ports of the motherboard.
>> 
>> I captured a screenshot of the monitor when the system hung, and am sending it as an attachment.
>> 
>> 
>> My pools info:
>> 
>>   pool: rpool
>>  state: ONLINE
>>   scan: resilvered 20.0G in 0h3m with 0 errors on Sun Oct 20 14:01:01 2013
>> config:
>> 
>> 	NAME        STATE     READ WRITE CKSUM
>> 	rpool       ONLINE       0     0     0
>> 	  mirror-0  ONLINE       0     0     0
>> 	    c4d0s0  ONLINE       0     0     0
>> 	    c3d1s0  ONLINE       0     0     0
>> 
>> errors: No known data errors
>> 
>> 
>>   pool: zpool1
>>  state: ONLINE
>> status: Some supported features are not enabled on the pool. The pool can
>> 	still be used, but some features are unavailable.
>> action: Enable all features using 'zpool upgrade'. Once this is done,
>> 	the pool may no longer be accessible by software that does not support
>> 	the features. See zpool-features(5) for details.
>>   scan: scrub repaired 0 in 5h0m with 0 errors on Sat Oct 12 19:00:53 2013
>> config:
>> 
>> 	NAME                       STATE     READ WRITE CKSUM
>> 	zpool1                     ONLINE       0     0     0
>> 	  raidz1-0                 ONLINE       0     0     0
>> 	    c1t5000C50041E9D9A7d0  ONLINE       0     0     0
>> 	    c1t5000C50041F1A5EFd0  ONLINE       0     0     0
>> 	    c1t5000C5004253FF87d0  ONLINE       0     0     0
>> 	    c1t5000C50055A607E3d0  ONLINE       0     0     0
>> 	    c1t5000C50055A628EFd0  ONLINE       0     0     0
>> 	    c1t5000C50055A62F57d0  ONLINE       0     0     0
>> 	logs
>> 	  mirror-1                 ONLINE       0     0     0
>> 	    c1t5001517959627219d0  ONLINE       0     0     0
>> 	    c1t5001517BB2747BE7d0  ONLINE       0     0     0
>> 	cache
>> 	  c1t5001517803D007D8d0    ONLINE       0     0     0
>> 	  c1t5001517BB2AFB592d0    ONLINE       0     0     0
>> 	spares
>> 	  c1t5000C5005600A6B3d0    AVAIL
>> 	  c1t5000C5005600B43Bd0    AVAIL
>> 
>> errors: No known data errors
>> 
>> 
>> 
>> 
>> 
>> _______________________________________________
>> OmniOS-discuss mailing list
>> OmniOS-discuss at lists.omniti.com
>> http://lists.omniti.com/mailman/listinfo/omnios-discuss
>> 
> 

--

Richard.Elling at RichardElling.com
+1-760-896-4422
