[OmniOS-discuss] Fwd: system hangs randomly

Jim Klimov jimklimov at cos.ru
Fri Nov 8 16:20:06 UTC 2013


The logs indicate that your IDE devices (I believe these are the rpool
SSDs in legacy mode) return errors on reads and time out on retries or
resets. This may mean a few things:

1) Imminent device death, e.g. due to wear over the device's lifetime.
Try to get these drives replaced with new units, especially if their
age, or actual diagnostic results from "smartctl" or vendor tools, also
points to such a scenario.
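
For example, a quick look could go like this (a hypothetical invocation:
smartmontools is a separate install on OmniOS, and the device name and
"-d" option depend on your layout and driver; c4d0s0 is taken from your
"zpool status" output below):

smartctl -a -d sat /dev/rdsk/c4d0s0

Watch the reallocated-sector counts, Intel's Media_Wearout_Indicator
attribute, and the error log near the bottom of the output.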

2) Bad diagnostics, perhaps due to IDE protocol limitations. Try
switching the controller into SATA mode, then boot the server from some
illumos live media (OI LiveCD/LiveUSB or the OmniOS equivalents) with
the rpool disks in SATA mode and run:

zpool import -N -R /a -f rpool
mount -F zfs rpool/ROOT/your_BE_name /a && \
   touch /a/reconfigure
zpool export rpool

Depending on your OS setup, the BE mounting may require some other
command (like "zfs mount rpool/ROOT/your_BE_name").

This routine imports the pool, tells the BE to rebuild its device nodes
(so that it runs "devfsadm" early in the next boot), and exports the
pool. In the process the rpool ZFS labels start referencing the new
hard-disk device node names, which is what the root filesystem mounting
procedure relies on. In more difficult cases it may also help to copy
(rsync) /dev/ and /devices/ from the live environment into the on-disk
BE, so that the device names saved in the pool labels match those
discovered by the kernel at boot.
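
A rough sketch of that copy, assuming you are in the live environment
and the BE is still mounted at /a as above:

rsync -a /dev/ /a/dev/
rsync -a /devices/ /a/devices/

("rsync -a" preserves device nodes when run as root; the trailing
slashes copy the directory contents rather than creating nested
directories.)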

Do have backups. It might make sense to run this experiment with one of
the mirror halves detached, so that if nothing works (even rolling back
to an IDE-only setup), you can destroy that half's contents, boot in IDE
mode from the other half, and re-attach the detached disk to it.
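
For illustration, using the device names from your "zpool status" output
below (double-check against your actual layout, and note that the c#d#
names themselves will change once the controller runs in SATA mode):

zpool detach rpool c3d1s0
# ... experiment on the remaining half ...
zpool attach rpool c4d0s0 c3d1s0

The final "zpool attach" resilvers the detached disk back into mirror-0
when you are done.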

As a variant, if you'd also like to refresh the hardware, it might make
sense to attach the new device(s) to the rpool as a three- or four-way
mirror, and then complete the switch to SATA with only the new pair
plugged in; you'd be able to fall back on the old, tested set if
anything goes wrong.
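
A sketch of that, assuming the hypothetical new SSD shows up as c5t0d0
(the real name will differ), has been given a suitable slice 0, and that
this OmniOS release boots via GRUB:

zpool attach rpool c4d0s0 c5t0d0s0
installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c5t0d0s0

The attach turns mirror-0 into a three-way mirror and starts a resilver;
installgrub puts boot blocks on the new disk so it can boot on its own.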

Good luck,
//Jim


On 2013-11-08 13:35, Hafiz Rafibeyli wrote:
> The log on the monitor when the system hung looked like this (I can send the actual screenshot to individual mail addresses):
>
> scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata0):
>          timeout: reset bus, target=0 lun=0
> scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata0):
>          timeout: early timeout, target=0 lun=0
> gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0/cmdk@0,0 (Disk0):
>          Error for command 'read sector'   Error Level: Informational
> gda: [ID 107833 kern.notice]           Sense Key: aborted command
> gda: [ID 107833 kern.notice]           Vendor 'Gen-ATA ' error code: 0x3
> gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0/cmdk@0,0 (Disk0):
>          Error for command 'read sector'   Error Level: Informational
> gda: [ID 107833 kern.notice]           Sense Key: aborted command
> gda: [ID 107833 kern.notice]           Vendor 'Gen-ATA ' error code: 0x3
> scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata0):
>          timeout: abort request, target=0 lun=0
> scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata0):
>          timeout: abort device, target=0 lun=0
> scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata0):
>          timeout: reset target, target=0 lun=0
> scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata0):
>          timeout: reset bus, target=0 lun=0
> scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata0):
>          timeout: early timeout, target=0 lun=0
> gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0/cmdk@0,0 (Disk0):
>          Error for command 'read sector'   Error Level: Informational
> gda: [ID 107833 kern.notice]           Sense Key: aborted command
> gda: [ID 107833 kern.notice]           Vendor 'Gen-ATA ' error code: 0x3
> gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0/cmdk@0,0 (Disk0):
>
>
> Hello,
>
> OmniOS version: SunOS 5.11 omnios-b281e50
> Server: Supermicro X8DAH (24x storage chassis)
>
> We are using OmniOS as a production NFS server for ESXi hosts.
>
> Everything was OK, but in the last 20 days the system has hung 3 times. Nothing has changed on the hardware side.
>
> For the OS disks we are using two SSDSA2SH032G1GN (32 GB Intel X25-E SSD) drives in a ZFS mirror, attached to the onboard SATA ports of the motherboard.
>
> I captured a monitor screenshot when the system hung, and am sending it as an attachment.
>
>
> My pools info:
>
>    pool: rpool
>   state: ONLINE
>    scan: resilvered 20.0G in 0h3m with 0 errors on Sun Oct 20 14:01:01 2013
> config:
>
> 	NAME        STATE     READ WRITE CKSUM
> 	rpool       ONLINE       0     0     0
> 	  mirror-0  ONLINE       0     0     0
> 	    c4d0s0  ONLINE       0     0     0
> 	    c3d1s0  ONLINE       0     0     0
>
> errors: No known data errors
>
>
>    pool: zpool1
>   state: ONLINE
> status: Some supported features are not enabled on the pool. The pool can
> 	still be used, but some features are unavailable.
> action: Enable all features using 'zpool upgrade'. Once this is done,
> 	the pool may no longer be accessible by software that does not support
> 	the features. See zpool-features(5) for details.
>    scan: scrub repaired 0 in 5h0m with 0 errors on Sat Oct 12 19:00:53 2013
> config:
>
> 	NAME                       STATE     READ WRITE CKSUM
> 	zpool1                     ONLINE       0     0     0
> 	  raidz1-0                 ONLINE       0     0     0
> 	    c1t5000C50041E9D9A7d0  ONLINE       0     0     0
> 	    c1t5000C50041F1A5EFd0  ONLINE       0     0     0
> 	    c1t5000C5004253FF87d0  ONLINE       0     0     0
> 	    c1t5000C50055A607E3d0  ONLINE       0     0     0
> 	    c1t5000C50055A628EFd0  ONLINE       0     0     0
> 	    c1t5000C50055A62F57d0  ONLINE       0     0     0
> 	logs
> 	  mirror-1                 ONLINE       0     0     0
> 	    c1t5001517959627219d0  ONLINE       0     0     0
> 	    c1t5001517BB2747BE7d0  ONLINE       0     0     0
> 	cache
> 	  c1t5001517803D007D8d0    ONLINE       0     0     0
> 	  c1t5001517BB2AFB592d0    ONLINE       0     0     0
> 	spares
> 	  c1t5000C5005600A6B3d0    AVAIL
> 	  c1t5000C5005600B43Bd0    AVAIL
>
> errors: No known data errors
>
>
>
>
>
> _______________________________________________
> OmniOS-discuss mailing list
> OmniOS-discuss at lists.omniti.com
> http://lists.omniti.com/mailman/listinfo/omnios-discuss
>


