[OmniOS-discuss] Scrub leaves pool in weird state with all devices in "repairing"?

Eric Sproul eric.sproul at circonus.com
Fri Jun 17 18:32:16 UTC 2016


On Fri, Jun 17, 2016 at 2:21 PM, <steve at linuxsuite.org> wrote:
>       I successfully exported the problematic zfs pool, installed it
> into another JBOD chassis, and imported it. Scrub and zfs send now run
> fine. So it isn't the disks; it must be the cable, chassis, HBA, or ....

FWIW there is a handy tool [1] that decodes LSI log info codes.
Looking at your logs, there are two unique IOCLogInfo codes:

IOCLogInfo=0x3112010c
IOCLogInfo=0x31120302

$ ./lsi_decode_loginfo.py 0x3112010c
Value     3112010Ch
Type:     30000000h SAS
Origin:   01000000h PL
Code:     00120000h PL_LOGINFO_CODE_ABORT See Sub-Codes below
(PL_LOGINFO_SUB_CODE)
Sub Code: 00000100h PL_LOGINFO_SUB_CODE_OPEN_FAILURE
SubSub Code: 0000000Ch PL_LOGINFO_SUB_CODE_OPEN_FAIL_OPEN_TIMEOUT_EXP

$ ./lsi_decode_loginfo.py 0x31120302
Value     31120302h
Type:     30000000h SAS
Origin:   01000000h PL
Code:     00120000h PL_LOGINFO_CODE_ABORT See Sub-Codes below
(PL_LOGINFO_SUB_CODE)
Sub Code: 00000300h PL_LOGINFO_SUB_CODE_WRONG_REL_OFF_OR_FRAME_LENGTH
Unparsed   00000002h
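
For the curious, an IOCLogInfo value is just a packed 32-bit bitfield,
and a decoder like [1] is essentially masking out the fields shown
above. Here's a minimal sketch of that split in Python (masks inferred
from the decoded output above; the lookup tables are illustrative stubs,
not the full tables from the MPT headers):

#!/usr/bin/env python
# Sketch of how an LSI/MPT IOCLogInfo value breaks into fields.
# Masks follow the layout visible in the decoded output above; the
# lookup tables below are illustrative stubs, not the full MPT set.

TYPES   = {0x30000000: "SAS"}
ORIGINS = {0x01000000: "PL"}
CODES   = {0x00120000: "PL_LOGINFO_CODE_ABORT"}

def split_loginfo(value):
    """Split a 32-bit IOCLogInfo into its packed fields."""
    return {
        "type":    value & 0xF0000000,  # bus type (e.g. SAS)
        "origin":  value & 0x0F000000,  # firmware module (e.g. PL)
        "code":    value & 0x00FF0000,  # top-level code
        "subcode": value & 0x0000FF00,  # sub-code, depends on code
        "subsub":  value & 0x000000FF,  # sub-sub-code, when defined
    }

if __name__ == "__main__":
    for v in (0x3112010C, 0x31120302):
        f = split_loginfo(v)
        print("Value %08Xh: type=%s origin=%s code=%s "
              "subcode=%04Xh subsub=%02Xh" % (
                  v,
                  TYPES.get(f["type"], "%08Xh" % f["type"]),
                  ORIGINS.get(f["origin"], "%08Xh" % f["origin"]),
                  CODES.get(f["code"], "%08Xh" % f["code"]),
                  f["subcode"], f["subsub"]))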

If I had to hazard a guess, I'd say there's a low-level issue in the
SAS fabric, maybe a bad expander or cable that's disrupting
everything. The HBA is aborting commands it never got answers to: in
the first case the open timeout expired because the target never
responded, and in the second the frame offset or length was wrong,
which points to corruption of the protocol traffic. That would align
with your finding that moving the disks to a new chassis made the
issue go away.

Eric

[1] https://github.com/baruch/lsi_decode_loginfo


More information about the OmniOS-discuss mailing list