[OmniOS-discuss] system hangs randomly
Richard Elling
richard.elling at richardelling.com
Fri Nov 8 22:02:26 UTC 2013
On Nov 8, 2013, at 8:20 AM, Jim Klimov <jimklimov at cos.ru> wrote:
> The logs specify that your IDE devices (I believe, these are the rpool
> SSDs in legacy mode) return errors on reads and timeout on retries or
> resets. This may mean a few things:
>
> 1) Imminent device death, e.g. due to wear over their lifetime. Try to
> get these replaced with new units (especially if their age or actual
> diagnostic results from "smartctl" or vendor tools also indicate the
> possibility of such a scenario).
I vote for this one. The X25-E is well known for this failure mode.
The only recourse is to replace the disk.
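For reference, a minimal dry-run sketch of what swapping one half of the rpool mirror might look like. It only prints the commands and runs none of them. The device names are assumptions: the logs do not say which of c4d0 or c3d1 is the failing disk, and c5d0 is a hypothetical name for the replacement. It also assumes the new disk already carries a Solaris fdisk partition and that the boot environment uses legacy GRUB.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the steps for replacing one half
# of a mirrored rpool. All device names passed in are placeholders.
plan_rpool_replace() {
    old=$1    # failing mirror half (assumed; confirm before acting)
    good=$2   # healthy mirror half, used as the label template
    new=$3    # replacement disk (hypothetical device name)

    # Copy the slice layout from the healthy disk to the new one.
    echo "prtvtoc /dev/rdsk/${good}s2 | fmthard -s - /dev/rdsk/${new}s2"
    # Swap the failing device for the new one and let ZFS resilver.
    echo "zpool replace rpool ${old}s0 ${new}s0"
    # Make the new disk bootable (legacy GRUB assumed).
    echo "installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/${new}s0"
    # Watch the resilver until it completes.
    echo "zpool status rpool"
}

plan_rpool_replace c4d0 c3d1 c5d0
```

Review the printed commands against the real device names (for example from "zpool status rpool" and "format") before running any of them by hand.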
>
> 2) Bad diagnostics, perhaps due to IDE protocol limitations - try to
> switch the controller into SATA mode and use some illumos live media
> (OI LiveCD/LiveUSB or OmniOS equivalents) to boot the server with the
> rpool disks in SATA mode and run:
This is neither the cause of nor the solution to the disk's woes, but I
recommend switching to AHCI mode at your convenience. You might be able to
replace the disk without an outage, but the switch to AHCI will require one.
-- richard
>
> zpool import -N -R /a -f rpool
> mount -F zfs rpool/ROOT/your_BE_name /a && \
> touch /a/reconfigure
> zpool export rpool
>
> Depending on your OS setup, the BE mounting may require some other
> command (like "zfs mount rpool/ROOT/your_BE_name").
>
> This routine mounts the pool, indicates to the BE that it should make
> new device nodes (so it runs "devfsadm" early in the boot), and exports
> the pool. In the process, the rpool ZFS labels begin referencing the new
> hard-disk device node names which is what the rootfs procedure relies
> on. In some more difficult cases it might help to also copy (rsync) the
> /dev/ and /devices/ from the live environment into the on-disk BE so
> that these device names saved into the pool labels would match those
> discovered by the kernel upon boot.
>
> Do have backups; it might make sense to complete this experiment with
> one of the mirror halves removed, so that if nothing works (even rolling
> back to an IDE-only setup) you can destroy this half's content and boot
> in IDE mode from the other half and re-attach the mirrored part to it.
>
> As a variant, it might make sense (if you'd also refresh the hardware)
> to attach the new device(s) to the rpool as a 3/4-way mirror, and then
> complete the switcheroo to SATA with only the new couple plugged in -
> you'd be able to fall back on the old and tested set if all goes wrong
> somehow.
>
> Good luck,
> //Jim
>
>
> On 2013-11-08 13:35, Hafiz Rafibeyli wrote:
>> The log on the monitor when the system hung looked like this (I can send the actual screenshot to individual mail addresses):
>>
>> scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata0):
>> timeout: reset bus, target=0 lun=0
>> scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata0):
>> timeout: early timeout, target=0 lun=0
>> gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0/cmdk@0,0 (Disk0):
>> Error for command 'read sector' Error Level: Informational
>> gda: [ID 107833 kern.notice] Sense Key: aborted command
>> gda: [ID 107833 kern.notice] Vendor 'Gen-ATA ' error code: 0x3
>> gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0/cmdk@0,0 (Disk0):
>> Error for command 'read sector' Error Level: Informational
>> gda: [ID 107833 kern.notice] Sense Key: aborted command
>> gda: [ID 107833 kern.notice] Vendor 'Gen-ATA ' error code: 0x3
>> scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata0):
>> timeout: abort request, target=0 lun=0
>> scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata0):
>> timeout: abort device, target=0 lun=0
>> scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata0):
>> timeout: reset target, target=0 lun=0
>> scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata0):
>> timeout: reset bus, target=0 lun=0
>> scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0 (ata0):
>> timeout: early timeout, target=0 lun=0
>> gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0/cmdk@0,0 (Disk0):
>> Error for command 'read sector' Error Level: Informational
>> gda: [ID 107833 kern.notice] Sense Key: aborted command
>> gda: [ID 107833 kern.notice] Vendor 'Gen-ATA ' error code: 0x3
>> gda: [ID 107833 kern.warning] WARNING: /pci@0,0/pci-ide@1f,2/ide@0/cmdk@0,0 (Disk0):
>>
>>
>> Hello,
>>
>> OmniOS version: SunOS 5.11 omnios-b281e50
>> Server: Supermicro X8DAH (24x storage chassis)
>>
>> We are using OmniOS as a production NFS server for ESXi hosts.
>>
>> Everything was OK, but in the last 20 days the system has hung 3 times. Nothing changed on the hardware side.
>>
>> For the OS disks we are using two SSDSA2SH032G1GN (32 GB Intel X25-E SSD) in a ZFS mirror attached to the onboard SATA ports of the motherboard.
>>
>> I captured a screenshot of the monitor when the system hung and am sending it as an attachment.
>>
>>
>> My pools info:
>>
>>   pool: rpool
>>  state: ONLINE
>>   scan: resilvered 20.0G in 0h3m with 0 errors on Sun Oct 20 14:01:01 2013
>> config:
>>
>>         NAME        STATE     READ WRITE CKSUM
>>         rpool       ONLINE       0     0     0
>>           mirror-0  ONLINE       0     0     0
>>             c4d0s0  ONLINE       0     0     0
>>             c3d1s0  ONLINE       0     0     0
>>
>> errors: No known data errors
>>
>>
>>   pool: zpool1
>>  state: ONLINE
>> status: Some supported features are not enabled on the pool. The pool can
>>         still be used, but some features are unavailable.
>> action: Enable all features using 'zpool upgrade'. Once this is done,
>>         the pool may no longer be accessible by software that does not support
>>         the features. See zpool-features(5) for details.
>>   scan: scrub repaired 0 in 5h0m with 0 errors on Sat Oct 12 19:00:53 2013
>> config:
>>
>>         NAME                       STATE     READ WRITE CKSUM
>>         zpool1                     ONLINE       0     0     0
>>           raidz1-0                 ONLINE       0     0     0
>>             c1t5000C50041E9D9A7d0  ONLINE       0     0     0
>>             c1t5000C50041F1A5EFd0  ONLINE       0     0     0
>>             c1t5000C5004253FF87d0  ONLINE       0     0     0
>>             c1t5000C50055A607E3d0  ONLINE       0     0     0
>>             c1t5000C50055A628EFd0  ONLINE       0     0     0
>>             c1t5000C50055A62F57d0  ONLINE       0     0     0
>>         logs
>>           mirror-1                 ONLINE       0     0     0
>>             c1t5001517959627219d0  ONLINE       0     0     0
>>             c1t5001517BB2747BE7d0  ONLINE       0     0     0
>>         cache
>>           c1t5001517803D007D8d0    ONLINE       0     0     0
>>           c1t5001517BB2AFB592d0    ONLINE       0     0     0
>>         spares
>>           c1t5000C5005600A6B3d0    AVAIL
>>           c1t5000C5005600B43Bd0    AVAIL
>>
>> errors: No known data errors
>>
>>
>>
>>
>>
>> _______________________________________________
>> OmniOS-discuss mailing list
>> OmniOS-discuss at lists.omniti.com
>> http://lists.omniti.com/mailman/listinfo/omnios-discuss
>>
>
--
Richard.Elling at RichardElling.com
+1-760-896-4422