[OmniOS-discuss] OmniOS r06 locked up due to smartctl running?
Stephan Budach
stephan.budach at JVM.DE
Tue Jan 20 13:59:27 UTC 2015
On 20.01.15 at 14:15, Stephan Budach wrote:
> Hi guys,
>
> we just experienced a lock-up on one of our OmniOS r006 boxes that
> forced us to reset it to get it working again. The box runs on a
> SuperMicro storage server and had been checked with smartctl by our
> check_mk client every 10 minutes.
>
> Looking through the logs, I found these messages being repeatedly
> written to them…
>
> Dec 20 03:18:17 nfsvmpool01 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000cca22bc46337 (sd12):
> Dec 20 03:18:17 nfsvmpool01     Error for Command: <undecoded cmd 0x85>  Error Level: Recovered
> Dec 20 03:18:17 nfsvmpool01 scsi: [ID 107833 kern.notice]   Requested Block: 0  Error Block: 0
> Dec 20 03:18:17 nfsvmpool01 scsi: [ID 107833 kern.notice]   Vendor: ATA  Serial Number: PK1361
> Dec 20 03:18:17 nfsvmpool01 scsi: [ID 107833 kern.notice]   Sense Key: Soft_Error
> Dec 20 03:18:17 nfsvmpool01 scsi: [ID 107833 kern.notice]   ASC: 0x0 (<vendor unique code 0x0>), ASCQ: 0x1d, FRU: 0x0
> Dec 20 03:18:19 nfsvmpool01 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000cca22bc4e51d (sd11):
> Dec 20 03:18:19 nfsvmpool01     Error for Command: <undecoded cmd 0x85>  Error Level: Recovered
> Dec 20 03:18:19 nfsvmpool01 scsi: [ID 107833 kern.notice]   Requested Block: 0  Error Block: 0
> Dec 20 03:18:19 nfsvmpool01 scsi: [ID 107833 kern.notice]   Vendor: ATA  Serial Number: PK1361
> Dec 20 03:18:19 nfsvmpool01 scsi: [ID 107833 kern.notice]   Sense Key: Soft_Error
> Dec 20 03:18:19 nfsvmpool01 scsi: [ID 107833 kern.notice]   ASC: 0x0 (<vendor unique code 0x0>), ASCQ: 0x1d, FRU: 0x0
> Dec 20 03:18:21 nfsvmpool01 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000cca22bc512c5 (sd3):
> Dec 20 03:18:21 nfsvmpool01     Error for Command: <undecoded cmd 0x85>  Error Level: Recovered
>
> Could it be that the use of smartctl somehow caused that lock-up?
>
> Thanks,
> budy
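For context: the "undecoded cmd 0x85" in those messages is SCSI ATA PASS-THROUGH(16), the command smartctl uses to reach SATA disks behind a SAS HBA, so every check_mk poll produces exactly this kind of entry. A quick way to see how many distinct disks logged the warning is to grep the WWN-based device node out of the messages; the sketch below runs against an inlined sample of the excerpt above so it is self-contained (the /tmp path is illustrative):

```shell
# Sample of the /var/adm/messages excerpt above (illustrative path; on
# the live box you would grep /var/adm/messages directly).
cat > /tmp/messages.sample <<'EOF'
Dec 20 03:18:17 nfsvmpool01 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000cca22bc46337 (sd12):
Dec 20 03:18:19 nfsvmpool01 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000cca22bc4e51d (sd11):
Dec 20 03:18:21 nfsvmpool01 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000cca22bc512c5 (sd3):
EOF

# List each distinct disk (by WWN-based device node) that logged it.
grep -o 'disk@g[0-9a-f]*' /tmp/messages.sample | sort -u
```

Here three different disks report the same recovered "error" within seconds of each other, which fits a polling client walking all drives rather than a real media problem.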
Seems that this was the real issue:
=> this was smartctl:
Jan 20 13:14:04 nfsvmpool01 scsi: [ID 107833 kern.notice] ASC: 0x3a (medium not present - tray closed), ASCQ: 0x1, FRU: 0x0

Jan 20 13:18:58 nfsvmpool01 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3c08@3/pci1000,3020@0 (mpt_sas1):
Jan 20 13:18:58 nfsvmpool01     MPT Firmware Fault, code: 2651
Jan 20 13:19:00 nfsvmpool01 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3c08@3/pci1000,3020@0 (mpt_sas1):
Jan 20 13:19:00 nfsvmpool01     mpt1 Firmware version v15.0.0.0 (?)
Jan 20 13:19:00 nfsvmpool01 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3c08@3/pci1000,3020@0 (mpt_sas1):
Jan 20 13:19:00 nfsvmpool01     mpt1: IOC Operational.

=> System reset:
Jan 20 13:30:45 nfsvmpool01 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.11 Version omnios-b281e50 64-bit
Jan 20 13:30:45 nfsvmpool01 genunix: [ID 877030 kern.notice] Copyright (c) 1983, 2010, Oracle and/or its affiliates. All rights reserved.
I googled a bit for that fault code and came up with this entry from
the LSI SCS Engineering Release Notice:
(SCGCQ00257616 - Port of SCGCQ00237417)
HEADLINE: Controller may fault on bad response with incomplete write
data transfer
DESC OF CHANGE: When completing a write IO with incomplete data transfer
with bad status, clean the IO from the transmit hardware to prevent it
from accessing an invalid memory address while attempting to service the
already-completed IO.
TO REPRODUCE: Run heavy write IO against a very large topology of SAS
drives. Repeatedly cause multiple drives to send response frames
containing sense data for outstanding IOs before the initiator has
finished transferring the write data for the IOs.
ISSUE DESC: If a SAS drive sends a response frame with response or sense
data for a write command before the transfer length specified in the
last XferReady frame is satisfied, a 0xD04 or 0x2651 fault may occur.
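To check whether this fault recurs, a plain grep over the system log is enough. The sketch below runs against an inlined sample so it is self-contained (the /tmp path is illustrative; on the live box you would point it at /var/adm/messages):

```shell
# Sample log lines matching the fault seen above.
cat > /tmp/messages.mpt <<'EOF'
Jan 20 13:18:58 nfsvmpool01 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3c08@3/pci1000,3020@0 (mpt_sas1):
Jan 20 13:18:58 nfsvmpool01     MPT Firmware Fault, code: 2651
Jan 20 13:19:00 nfsvmpool01     mpt1: IOC Operational.
EOF

# Print timestamp and fault code for every firmware fault in the log.
awk '/MPT Firmware Fault/ { print $1, $2, $3, "code:", $NF }' /tmp/messages.mpt
```

A periodic run of this (e.g. from cron) would show whether 0x2651 is a one-off or a pattern worth a firmware upgrade.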
The question is, why did the box lock up? It seems that only one of the
LSI HBAs was affected, and my zpools are entirely spread across two
HBAs, except for the log and cache devices:
root@nfsvmpool01:/var/adm# zpool status sasTank
  pool: sasTank
 state: ONLINE
  scan: scrub repaired 0 in 0h8m with 0 errors on Wed Dec 24 09:21:40 2014
config:

        NAME                       STATE     READ WRITE CKSUM
        sasTank                    ONLINE       0     0     0
          mirror-0                 ONLINE       0     0     0
            c2t5000CCA04106EAA5d0  ONLINE       0     0     0
            c5t5000CCA04106EE41d0  ONLINE       0     0     0
          mirror-1                 ONLINE       0     0     0
            c3t5000CCA02A9BE9E1d0  ONLINE       0     0     0
            c6t5000CCA02ADEE805d0  ONLINE       0     0     0
          mirror-2                 ONLINE       0     0     0
            c4t5000CCA04106EF21d0  ONLINE       0     0     0
            c7t5000CCA04106C1F5d0  ONLINE       0     0     0
        logs
          c1t5001517803D653E2d0p1  ONLINE       0     0     0
          c1t5001517803D83760d0p1  ONLINE       0     0     0
        cache
          c1t50015179596C5A85d0    ONLINE       0     0     0

errors: No known data errors
root@nfsvmpool01:/var/adm# zpool status sataTank
  pool: sataTank
 state: ONLINE
  scan: scrub repaired 0 in 10h39m with 0 errors on Wed Dec 24 20:22:27 2014
config:

        NAME                       STATE     READ WRITE CKSUM
        sataTank                   ONLINE       0     0     0
          mirror-0                 ONLINE       0     0     0
            c1t5000CCA22BC4E51Dd0  ONLINE       0     0     0
            c1t5000CCA22BC512C5d0  ONLINE       0     0     0
          mirror-1                 ONLINE       0     0     0
            c1t5000CCA22BC51BADd0  ONLINE       0     0     0
            c1t5000CCA22BC46337d0  ONLINE       0     0     0
          mirror-2                 ONLINE       0     0     0
            c1t5000CCA22BC51BB9d0  ONLINE       0     0     0
            c1t5000CCA23DED646Fd0  ONLINE       0     0     0
        logs
          c1t5001517803D653E2d0p2  ONLINE       0     0     0
          c1t5001517803D83760d0p2  ONLINE       0     0     0
        cache
          c1t5001517803D00E64d0    ONLINE       0     0     0

errors: No known data errors
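One way to double-check how the pool devices spread across controllers is to tally them by their cN controller prefix. The snippet parses an inlined list of the sasTank device names from the output above, so it runs standalone (the /tmp path is illustrative):

```shell
# Device names taken from the sasTank `zpool status` output above.
cat > /tmp/sasTank.devs <<'EOF'
c2t5000CCA04106EAA5d0
c5t5000CCA04106EE41d0
c3t5000CCA02A9BE9E1d0
c6t5000CCA02ADEE805d0
c4t5000CCA04106EF21d0
c7t5000CCA04106C1F5d0
c1t5001517803D653E2d0p1
c1t5001517803D83760d0p1
c1t50015179596C5A85d0
EOF

# Count devices per controller: strip everything from the first 't'
# onward, leaving just the cN prefix, then tally.
awk '{ sub(/t.*/, ""); count[$0]++ }
     END { for (c in count) print c, count[c] }' /tmp/sasTank.devs | sort
```

The data mirrors land on c2 through c7 while all three log/cache devices sit on c1, matching the observation that only the log and cache devices share a single path.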
Cheers,
budy