[OmniOS-discuss] mpt_sas wedge

Paul B. Henson henson at acm.org
Mon Dec 9 21:45:46 UTC 2013


My storage server (currently running OmniOS stable r151006) with an LSI
9201-16i SAS controller misbehaved this morning :(.

It started out with some complaints about a target:

Dec  9 05:18:23 storage kernel: scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3c0a@3,2/pci1000,30c0@0 (mpt_sas0):
        Log info 0x31080000 received for target 17.
        scsi_status=0x0, ioc_status=0x804b, scsi_state=0x0

It repeated quite a few of these, then one with a different status:

Dec  9 05:24:11 storage kernel: scsi_vhci: [ID 734749 kern.warning] WARNING: vhci_scsi_reset 0x1
Dec  9 05:24:11 storage kernel: scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3c0a@3,2/pci1000,30c0@0 (mpt_sas0):
        Log info 0x31140000 received for target 17.
        scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc

Then it seems it tried to reset the bus:

Dec  9 05:28:18 storage kernel: scsi_vhci: [ID 734749 kern.warning] WARNING: vhci_scsi_reset 0x1

And then it got really unhappy:

Dec  9 05:28:42 storage kernel: scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3c0a@3,2/pci1000,30c0@0 (mpt_sas0):
        mptsas_check_task_mgt: Task 0x3 failed. IOCStatus=0x4a IOCLogInfo=0x0 target=17
Dec  9 05:28:42 storage kernel: scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3c0a@3,2/pci1000,30c0@0 (mpt_sas0):
        mptsas_ioc_task_management failed try to reset ioc to recovery!
Dec  9 05:28:44 storage kernel: scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3c0a@3,2/pci1000,30c0@0 (mpt_sas0):
        mpt0 Firmware version v16.0.0.0 (?)
Dec  9 05:28:44 storage kernel: scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3c0a@3,2/pci1000,30c0@0 (mpt_sas0):
        mpt0: IOC Operational.
Dec  9 05:29:45 storage kernel: scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3c0a@3,2/pci1000,30c0@0 (mpt_sas0):
        config header request timeout
Dec  9 05:29:45 storage kernel: scsi: [ID 365881 kern.warning] WARNING: /pci@0,0/pci8086,3c0a@3,2/pci1000,30c0@0 (mpt_sas0):
        NULL command for address reply in slot 6
Dec  9 05:30:45 storage kernel: scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3c0a@3,2/pci1000,30c0@0 (mpt_sas0):
        config header request timeout
Dec  9 05:31:45 storage kernel: scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3c0a@3,2/pci1000,30c0@0 (mpt_sas0):
        config header request timeout
Dec  9 05:31:45 storage kernel: scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
        sd10: path mpt_sas7/disk@w50014ee603347e67,0, reset 1 failed
Dec  9 05:31:45 storage kernel: scsi: [ID 365881 kern.warning] WARNING: /pci@0,0/pci8086,3c0a@3,2/pci1000,30c0@0 (mpt_sas0):
        NULL command for address reply in slot 38
Dec  9 05:31:45 storage kernel: scsi: [ID 365881 kern.warning] WARNING: /pci@0,0/pci8086,3c0a@3,2/pci1000,30c0@0 (mpt_sas0):
        NULL command for address reply in slot 54

At this point, it looks like any I/O to any device on that controller was
wedged, which pretty much took out my storage pool. I sent an NMI to try to
get a crash dump, but although my dump device is a raw disk that isn't part
of any zpool, it hangs off that same controller, so after about an hour I
gave up and just did a hard reset.
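
For next time, I'm tempted to move the dedicated dump device to a disk that
isn't behind the LSI HBA, so a wedged controller can't also take out the
crash dump path. Roughly something like this, where the slice name is just a
placeholder for whatever disk on the onboard controller I'd actually use:

        # show the current crash dump configuration
        dumpadm
        # point the dump device at a slice that is not behind the LSI HBA
        # (c2t1d0s1 is a hypothetical slice on the onboard controller)
        pfexec dumpadm -d /dev/dsk/c2t1d0s1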

It came up okay and seemed to be working fine. I ran a few scrubs to see
what would happen; on every scrub I saw some of the same errors:

Dec  9 13:03:02 storage kernel: scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3c0a@3,2/pci1000,30c0@0 (mpt_sas0):
        Log info 0x31080000 received for target 17.
        scsi_status=0x0, ioc_status=0x804b, scsi_state=0x0

Sometimes it would reset the bus, sometimes it wouldn't, but while I was
poking at it, it never completely wedged and froze like it did earlier this
morning.
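
(For reference, the poking mostly amounted to kicking off scrubs and watching
the pool and the logs, along the lines of the following, with "tank" standing
in for my actual pool name:)

        # start a scrub and watch its progress and any per-device errors
        zpool scrub tank
        zpool status -v tank
        # watch for mpt_sas warnings as they show up
        tail -f /var/adm/messages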

Disks fail; I'm running RAIDZ2 on this box plus a hot spare, so no worries
there, but it kind of sucks when a misbehaving disk takes out the entire
controller :(. I went ahead and offlined the disk and replaced it with the
hot spare; I'll poke at it some more once it's done resilvering and make sure
I don't get any more errors. Is it a pretty sure bet that the disk is going
bad, or should I take the time and trouble to pull it and run diagnostics on
it in a separate machine before RMA'ing it?
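
Before pulling it, I figure I can at least see what the OS has already
recorded against that disk; something along these lines, where cXtYdZ is a
placeholder for the suspect disk's device name, and smartctl comes from the
separate smartmontools package, so the exact options may need tweaking:

        # per-device soft/hard/transport error counters
        iostat -En
        # FMA telemetry, to see which device the ereports were logged against
        fmdump -eV | less
        fmadm faulty
        # if smartmontools is installed, pull the drive's own error/defect logs
        smartctl -a /dev/rdsk/cXtYdZs0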

I vaguely recall seeing some mpt_sas improvements fly by over the past 6-8
months; once the first update for 151008 comes out I guess I'll go ahead and
upgrade to that, and maybe next time things go wiggy it won't take out the
whole controller.

One question: in this case, when it completely died, it ended up printing
out the WWN of the drive in question, so it was easy to find. However, I'm
not quite sure how I would map "target 17" to an actual disk if all I had
were the warning messages. Any pointers?
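
The best I've come up with so far is grepping back through the logs for an
earlier line that mentions the same target number next to a w<WWN> address,
then matching that WWN against the device links and the pool, roughly like
this (assuming such a line exists; "tank" again stands in for my pool name,
and with multipathing the /dev name may use the logical-unit WWN rather than
the port WWN, in which case mpathadm list lu might help bridge the two), but
I'd still love to hear if there's a more direct way:

        # look for an earlier message that pairs "target 17" with a w<WWN> address
        grep -i "target 17" /var/adm/messages*
        # once a WWN turns up, find the matching cXtYdZ device link
        ls -l /dev/rdsk | grep -i 50014ee603347e67
        # and see where that disk sits in the pool
        zpool status tank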

Thanks much.