[OmniOS-discuss] Scrub leaves pool in weird state with all devices in "repairing"?
Eric Sproul
eric.sproul at circonus.com
Fri Jun 17 18:32:16 UTC 2016
On Fri, Jun 17, 2016 at 2:21 PM, <steve at linuxsuite.org> wrote:
> I successfully exported the problematic zfs pool and installed it into
> another JBOD chassis and imported it. scrub and zfs send now run fine.
> So it isn't the disks; it must be the cable, chassis, HBA, or ....
FWIW there is a handy tool [1] that decodes LSI log info codes.
Looking at your logs, there are two unique IOCLogInfo codes:
IOCLogInfo=0x3112010c
IOCLogInfo=0x31120302
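In case anyone wants to reproduce this, the unique codes can be pulled
out of the syslog with a quick filter. A minimal sketch in Python (the
log path /var/adm/messages and the exact "IOCLogInfo=0x..." message
format are assumptions; adjust for your system):

#!/usr/bin/env python3
# Collect the unique IOCLogInfo codes from a syslog file.
# The path below is an assumption; point it at wherever mpt_sas logs.
import re

codes = set()
with open("/var/adm/messages") as log:
    for line in log:
        codes.update(re.findall(r"IOCLogInfo=(0x[0-9a-fA-F]+)", line))

for code in sorted(codes):
    print(code)

Feeding each code through the decoder: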
$ ./lsi_decode_loginfo.py 0x3112010c
Value 3112010Ch
Type: 30000000h SAS
Origin: 01000000h PL
Code: 00120000h PL_LOGINFO_CODE_ABORT See Sub-Codes below (PL_LOGINFO_SUB_CODE)
Sub Code: 00000100h PL_LOGINFO_SUB_CODE_OPEN_FAILURE
SubSub Code: 0000000Ch PL_LOGINFO_SUB_CODE_OPEN_FAIL_OPEN_TIMEOUT_EXP
$ ./lsi_decode_loginfo.py 0x31120302
Value 31120302h
Type: 30000000h SAS
Origin: 01000000h PL
Code: 00120000h PL_LOGINFO_CODE_ABORT See Sub-Codes below (PL_LOGINFO_SUB_CODE)
Sub Code: 00000300h PL_LOGINFO_SUB_CODE_WRONG_REL_OFF_OR_FRAME_LENGTH
Unparsed 00000002h
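For the curious, the decode is just a bit-field split of the 32-bit
value. Here's a rough sketch of the layout implied by the output above;
the lookup tables are illustrative, trimmed to the two codes seen in
this thread (the real tool carries the full tables from the LSI
headers):

#!/usr/bin/env python3
# Split an IOCLogInfo value into its fields: type (bits 28-31),
# origin (bits 24-27), code (bits 16-23), sub-code (bits 8-15),
# sub-sub-code (bits 0-7).
# Name tables below are illustrative, not the full LSI tables.

TYPES   = {0x3: "SAS"}
ORIGINS = {0x1: "PL"}
CODES   = {0x12: "PL_LOGINFO_CODE_ABORT"}
SUBS    = {0x01: "PL_LOGINFO_SUB_CODE_OPEN_FAILURE",
           0x03: "PL_LOGINFO_SUB_CODE_WRONG_REL_OFF_OR_FRAME_LENGTH"}

def decode(value):
    print("Value %08Xh" % value)
    print("  Type:     %s" % TYPES.get(value >> 28, "unknown"))
    print("  Origin:   %s" % ORIGINS.get((value >> 24) & 0xF, "unknown"))
    print("  Code:     %s" % CODES.get((value >> 16) & 0xFF, "unknown"))
    print("  Sub Code: %s" % SUBS.get((value >> 8) & 0xFF, "unknown"))
    print("  SubSub:   %02Xh" % (value & 0xFF))

for v in (0x3112010C, 0x31120302):
    decode(v)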
If I had to hazard a guess, I'd say there's a low-level issue in the
SAS fabric, perhaps a bad expander or cable disrupting everything. In
both cases the HBA is aborting commands it's waiting on: in the first
because the target never responds within the open timeout, and in the
second possibly because the protocol traffic itself is being corrupted
(wrong relative offset or frame length). That would align with your
finding that moving the disks to a new chassis made the issue go away.
Eric
[1] https://github.com/baruch/lsi_decode_loginfo