[OmniOS-discuss] OmniOS r06 locked up due to smartctl running?
Stephan Budach
stephan.budach at JVM.DE
Tue Jan 20 13:59:27 UTC 2015
On 20.01.15 at 14:15, Stephan Budach wrote:
> Hi guys,
>
> we just experienced a lock-up on one of our OmniOS r006 boxes that
> forced us to reset it to get it working again. The box runs on a
> SuperMicro storage server and had been checked with smartctl by our
> check_mk client every 10 minutes.
>
> Looking through the logs, I found these messages being repeatedly
> written to them…
>
> Dec 20 03:18:17 nfsvmpool01 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000cca22bc46337 (sd12):
> Dec 20 03:18:17 nfsvmpool01     Error for Command: <undecoded cmd 0x85>  Error Level: Recovered
> Dec 20 03:18:17 nfsvmpool01 scsi: [ID 107833 kern.notice]   Requested Block: 0  Error Block: 0
> Dec 20 03:18:17 nfsvmpool01 scsi: [ID 107833 kern.notice]   Vendor: ATA  Serial Number: PK1361
> Dec 20 03:18:17 nfsvmpool01 scsi: [ID 107833 kern.notice]   Sense Key: Soft_Error
> Dec 20 03:18:17 nfsvmpool01 scsi: [ID 107833 kern.notice]   ASC: 0x0 (<vendor unique code 0x0>), ASCQ: 0x1d, FRU: 0x0
> Dec 20 03:18:19 nfsvmpool01 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000cca22bc4e51d (sd11):
> Dec 20 03:18:19 nfsvmpool01     Error for Command: <undecoded cmd 0x85>  Error Level: Recovered
> Dec 20 03:18:19 nfsvmpool01 scsi: [ID 107833 kern.notice]   Requested Block: 0  Error Block: 0
> Dec 20 03:18:19 nfsvmpool01 scsi: [ID 107833 kern.notice]   Vendor: ATA  Serial Number: PK1361
> Dec 20 03:18:19 nfsvmpool01 scsi: [ID 107833 kern.notice]   Sense Key: Soft_Error
> Dec 20 03:18:19 nfsvmpool01 scsi: [ID 107833 kern.notice]   ASC: 0x0 (<vendor unique code 0x0>), ASCQ: 0x1d, FRU: 0x0
> Dec 20 03:18:21 nfsvmpool01 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000cca22bc512c5 (sd3):
> Dec 20 03:18:21 nfsvmpool01     Error for Command: <undecoded cmd 0x85>  Error Level: Recovered
>
> Could it be that the use of smartctl somehow caused that lock-up?
>
> Thanks,
> budy
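For context: the "undecoded cmd 0x85" in those messages is SCSI ATA PASS-THROUGH(16), the command smartctl uses to reach SATA disks behind a SAS HBA, so every check_mk poll produces exactly this kind of entry. A quick way to see how many distinct disks logged the warning is to grep the WWN-based device node out of the messages; the sketch below runs against an inlined sample of the excerpt above so it is self-contained (the /tmp path is illustrative):

```shell
# Sample of the /var/adm/messages excerpt above (illustrative path; on
# the live box you would grep /var/adm/messages directly).
cat > /tmp/messages.sample <<'EOF'
Dec 20 03:18:17 nfsvmpool01 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000cca22bc46337 (sd12):
Dec 20 03:18:19 nfsvmpool01 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000cca22bc4e51d (sd11):
Dec 20 03:18:21 nfsvmpool01 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000cca22bc512c5 (sd3):
EOF

# List each distinct disk (by WWN-based device node) that logged it.
grep -o 'disk@g[0-9a-f]*' /tmp/messages.sample | sort -u
```

Here three different disks report the same recovered "error" within seconds of each other, which fits a polling client walking all drives rather than a real media problem.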
Seems that this was the real issue:
=> this was smartctl:
Jan 20 13:14:04 nfsvmpool01 scsi: [ID 107833 kern.notice] ASC: 0x3a (medium not present - tray closed), ASCQ: 0x1, FRU: 0x0

Jan 20 13:18:58 nfsvmpool01 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3c08@3/pci1000,3020@0 (mpt_sas1):
Jan 20 13:18:58 nfsvmpool01     MPT Firmware Fault, code: 2651
Jan 20 13:19:00 nfsvmpool01 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3c08@3/pci1000,3020@0 (mpt_sas1):
Jan 20 13:19:00 nfsvmpool01     mpt1 Firmware version v15.0.0.0 (?)
Jan 20 13:19:00 nfsvmpool01 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3c08@3/pci1000,3020@0 (mpt_sas1):
Jan 20 13:19:00 nfsvmpool01     mpt1: IOC Operational.

=> System reset:
Jan 20 13:30:45 nfsvmpool01 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.11 Version omnios-b281e50 64-bit
Jan 20 13:30:45 nfsvmpool01 genunix: [ID 877030 kern.notice] Copyright (c) 1983, 2010, Oracle and/or its affiliates. All rights reserved.
I googled a bit for that fault code and came up with this entry from
the LSI SCS Engineering Release Notice:
(SCGCQ00257616 - Port of SCGCQ00237417)
HEADLINE: Controller may fault on bad response with incomplete write
data transfer
DESC OF CHANGE: When completing a write IO with incomplete data transfer
with bad status, clean the IO from the transmit hardware to prevent it
from accessing an invalid memory address while attempting to service the
already-completed IO.
TO REPRODUCE: Run heavy write IO against a very large topology of SAS
drives. Repeatedly cause multiple drives to send response frames
containing sense data for outstanding IOs before the initiator has
finished transferring the write data for the IOs.
ISSUE DESC: If a SAS drive sends a response frame with response or sense
data for a write command before the transfer length specified in the
last XferReady frame is satisfied, a 0xD04 or 0x2651 fault may occur.
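To check whether this fault recurs, a plain grep over the system log is enough. The sketch below runs against an inlined sample so it is self-contained (the /tmp path is illustrative; on the live box you would point it at /var/adm/messages):

```shell
# Sample log lines matching the fault seen above.
cat > /tmp/messages.mpt <<'EOF'
Jan 20 13:18:58 nfsvmpool01 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3c08@3/pci1000,3020@0 (mpt_sas1):
Jan 20 13:18:58 nfsvmpool01     MPT Firmware Fault, code: 2651
Jan 20 13:19:00 nfsvmpool01     mpt1: IOC Operational.
EOF

# Print timestamp and fault code for every firmware fault in the log.
awk '/MPT Firmware Fault/ { print $1, $2, $3, "code:", $NF }' /tmp/messages.mpt
```

A periodic run of this (e.g. from cron) would show whether 0x2651 is a one-off or a pattern worth a firmware upgrade.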
The question is, why did the box lock up? It seems that only one of the
LSI HBAs was affected, and my zpools are entirely spread across two
HBAs, except for the log and cache devices:
root@nfsvmpool01:/var/adm# zpool status sasTank
  pool: sasTank
 state: ONLINE
  scan: scrub repaired 0 in 0h8m with 0 errors on Wed Dec 24 09:21:40 2014
config:

        NAME                       STATE     READ WRITE CKSUM
        sasTank                    ONLINE       0     0     0
          mirror-0                 ONLINE       0     0     0
            c2t5000CCA04106EAA5d0  ONLINE       0     0     0
            c5t5000CCA04106EE41d0  ONLINE       0     0     0
          mirror-1                 ONLINE       0     0     0
            c3t5000CCA02A9BE9E1d0  ONLINE       0     0     0
            c6t5000CCA02ADEE805d0  ONLINE       0     0     0
          mirror-2                 ONLINE       0     0     0
            c4t5000CCA04106EF21d0  ONLINE       0     0     0
            c7t5000CCA04106C1F5d0  ONLINE       0     0     0
        logs
          c1t5001517803D653E2d0p1  ONLINE       0     0     0
          c1t5001517803D83760d0p1  ONLINE       0     0     0
        cache
          c1t50015179596C5A85d0    ONLINE       0     0     0

errors: No known data errors
root@nfsvmpool01:/var/adm# zpool status sataTank
  pool: sataTank
 state: ONLINE
  scan: scrub repaired 0 in 10h39m with 0 errors on Wed Dec 24 20:22:27 2014
config:

        NAME                       STATE     READ WRITE CKSUM
        sataTank                   ONLINE       0     0     0
          mirror-0                 ONLINE       0     0     0
            c1t5000CCA22BC4E51Dd0  ONLINE       0     0     0
            c1t5000CCA22BC512C5d0  ONLINE       0     0     0
          mirror-1                 ONLINE       0     0     0
            c1t5000CCA22BC51BADd0  ONLINE       0     0     0
            c1t5000CCA22BC46337d0  ONLINE       0     0     0
          mirror-2                 ONLINE       0     0     0
            c1t5000CCA22BC51BB9d0  ONLINE       0     0     0
            c1t5000CCA23DED646Fd0  ONLINE       0     0     0
        logs
          c1t5001517803D653E2d0p2  ONLINE       0     0     0
          c1t5001517803D83760d0p2  ONLINE       0     0     0
        cache
          c1t5001517803D00E64d0    ONLINE       0     0     0

errors: No known data errors
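One way to double-check how the pool devices spread across controllers is to tally them by their cN controller prefix. The snippet parses an inlined list of the sasTank device names from the output above, so it runs standalone (the /tmp path is illustrative):

```shell
# Device names taken from the sasTank `zpool status` output above.
cat > /tmp/sasTank.devs <<'EOF'
c2t5000CCA04106EAA5d0
c5t5000CCA04106EE41d0
c3t5000CCA02A9BE9E1d0
c6t5000CCA02ADEE805d0
c4t5000CCA04106EF21d0
c7t5000CCA04106C1F5d0
c1t5001517803D653E2d0p1
c1t5001517803D83760d0p1
c1t50015179596C5A85d0
EOF

# Count devices per controller: strip everything from the first 't'
# onward, leaving just the cN prefix, then tally.
awk '{ sub(/t.*/, ""); count[$0]++ }
     END { for (c in count) print c, count[c] }' /tmp/sasTank.devs | sort
```

The data mirrors land on c2 through c7 while all three log/cache devices sit on c1, matching the observation that only the log and cache devices share a single path.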
Cheers,
budy