[OmniOS-discuss] How bad are these controller / io errors??
steve at linuxsuite.org
Fri Aug 16 14:41:10 UTC 2013
> We're seeing something similar on the same gear (LSI/supermicro expanders,
> lsi controllers, sata drives).
>
> We've tried standard hardware debugging (cable reseat/replacement, etc)
> and
> the problem in our case seems to follow the sas expander backplane.
>
> We did a disk by disk migration into a different expander and they
> stopped.
>
> How high are your error counts? (in our case, we were getting about
> 1500/day/device). Is your performance impacted? (it was in our case)
> -nld
>
A different expander, but still SATA behind a SAS expander, on a
Supermicro 847 chassis?
Is your setup stable, i.e. it works and drives don't drop out as failed?
Performance isn't an issue here, but stability is.
It is definitely a SATA-behind-SAS-expander issue. I did lots of testing
with pools built on SAS drives and they showed no errors. I also did a
lot of stress testing with 20T SATA pools, and they were completely
unusable: a scrub would always wipe out the pool because drives would
"drop out" as failed, though a hardware power cycle of the SuperMicro
chassis would bring them all back. Then I turned off NCQ on the LSI
controller and everything worked fine. I couldn't get anything to fail
no matter how hard I beat on it.
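(An aside, in case it is useful to anyone reading the archive: a
host-side sketch of a similar workaround, assuming illumos's
sd-config-list syntax, is to cap the queue depth the sd driver keeps
outstanding per disk, so the drives never see a deep NCQ queue. The
drive model string below is taken from the iostat output quoted later;
the vendor field must be padded to 8 characters, and the setting takes
effect after a reboot or `update_drv -f sd`.)

    # /kernel/drv/sd.conf (sketch) -- per-disk queue-depth cap.
    # "ATA" padded to 8 characters, then the product string.
    # throttle-max:1 serializes commands to each disk;
    # disksort:false leaves request ordering to the drive.
    sd-config-list = "ATA     ST3000DM001-9YN1", "throttle-max:1, disksort:false";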
I will start to track error rates; we are not moving much data yet.
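For the archives, here is a minimal sketch of the tracker I have in
mind; it assumes the `iostat -En` layout shown in the quoted message
below, where the first line of each stanza carries the
Soft/Hard/Transport counters:

    #!/usr/bin/env python
    # Snapshot per-device error counters from `iostat -En` so that
    # successive runs (e.g. daily from cron) can be diffed into a rate.
    import re
    import subprocess
    import time

    # First line of each `iostat -En` stanza, e.g.:
    # c5t5000C500489947A8d0 Soft Errors: 0 Hard Errors: 2 Transport Errors: 11
    DEVLINE = re.compile(r'^(\S+)\s+Soft Errors:\s*(\d+)\s+'
                         r'Hard Errors:\s*(\d+)\s+Transport Errors:\s*(\d+)')

    out = subprocess.check_output(['iostat', '-En'], universal_newlines=True)
    stamp = time.strftime('%Y-%m-%dT%H:%M:%S')
    for line in out.splitlines():
        m = DEVLINE.match(line)
        if m:
            dev, soft, hard, transport = m.groups()
            # Append this to a log file; errors/day is the delta
            # between two stamps divided by the elapsed time.
            print('%s %s soft=%s hard=%s transport=%s'
                  % (stamp, dev, soft, hard, transport))

The counters are cumulative (they reset on reboot), so only the deltas
between runs are meaningful.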
Would SATA port multipliers be a better solution? Does Solaris/OmniOS
support such a hardware config?
I just came across this:
http://www.45drives.com/
which I think is a SATA port multiplier solution. It seems aimed at
CentOS/NetBSD; can it work with OmniOS?
-steve
>
> On Tue, Aug 13, 2013 at 10:20 AM, <steve at linuxsuite.org> wrote:
>
>>
>> Howdy!
>>
>> This is a SuperMicro JBOD with SATA disks. I am aware of the
>> issues of having
>> SATA on SAS, but was wondering just how serious these kinds of errors
>> are. A scrub of the pool
>> completes without noticeable problems. I did a lot of stress testing
>> earlier and could
>> not get a failure. Disabling NCQ on the controller was necessary.
>> What is the practical risk to data?
>>
>> See below info for iostat / syslog
>>
>> thanx - steve
>>
>> syslog info
>>
>> kern.warning<4>: Aug 13 10:39:10 dfs1 scsi: [ID 243001 kern.warning]
>> WARNING: /pci@0,0/pci8086,340d@6/pci1000,3080@0 (mpt_sas0):
>> kern.warning<4>: Aug 13 10:39:10 dfs1 #011mptsas_handle_event_sync:
>> IOCStatus=0x8000, IOCLogInfo=0x31120303
>> kern.warning<4>: Aug 13 10:39:10 dfs1 scsi: [ID 243001 kern.warning]
>> WARNING: /pci@0,0/pci8086,340d@6/pci1000,3080@0 (mpt_sas0):
>> kern.warning<4>: Aug 13 10:39:10 dfs1 #011mptsas_handle_event_sync:
>> IOCStatus=0x8000, IOCLogInfo=0x31120436
>> kern.warning<4>: Aug 13 10:39:10 dfs1 scsi: [ID 243001 kern.warning]
>> WARNING: /pci@0,0/pci8086,340d@6/pci1000,3080@0 (mpt_sas0):
>> kern.warning<4>: Aug 13 10:39:10 dfs1 #011mptsas_handle_event:
>> IOCStatus=0x8000, IOCLogInfo=0x31120303
>> kern.warning<4>: Aug 13 10:39:10 dfs1 scsi: [ID 243001 kern.warning]
>> WARNING: /pci@0,0/pci8086,340d@6/pci1000,3080@0 (mpt_sas0):
>>
>> Blah Blah...
>>
>> kern.warning<4>: Aug 13 10:39:10 dfs1 #011mptsas_handle_event:
>> IOCStatus=0x8000, IOCLogInfo=0x31120436
>> kern.info<6>: Aug 13 10:39:11 dfs1 scsi: [ID 365881 kern.info]
>> /pci@0,0/pci8086,340d@6/pci1000,3080@0 (mpt_sas0):
>> kern.info<6>: Aug 13 10:39:11 dfs1 #011Log info 0x31120303 received for
>> target 13.
>> kern.info<6>: Aug 13 10:39:11 dfs1 #011scsi_status=0x0,
>> ioc_status=0x804b,
>> scsi_state=0xc
>> kern.info<6>: Aug 13 10:39:11 dfs1 scsi: [ID 365881 kern.info]
>> /pci@0,0/pci8086,340d@6/pci1000,3080@0 (mpt_sas0):
>> kern.info<6>: Aug 13 10:39:11 dfs1 #011Log info 0x31120303 received for
>> target 13.
>> kern.info<6>: Aug 13 10:39:11 dfs1 #011scsi_status=0x0,
>> ioc_status=0x804b,
>> scsi_state=0xc
>> kern.info<6>: Aug 13 10:39:11 dfs1 scsi: [ID 365881 kern.info]
>> /pci@0,0/pci8086,340d@6/pci1000,3080@0 (mpt_sas0):
>>
>> Output of iostat -En
>>
>> It looks like "Hard Errors" and "No Device" correspond. What
>> do "Transport Errors" and "Recoverable" mean? I see no evidence
>> of data corruption/loss; does ZFS deal with and recover from these
>> errors in a
>> safe way?
>>
>>
>> c5t5000C500489947A8d0 Soft Errors: 0 Hard Errors: 2 Transport Errors: 11
>> Vendor: ATA Product: ST3000DM001-9YN1 Revision: CC4H Serial No:
>> W1F0AAMA
>> Size: 3000.59GB <3000592982016 bytes>
>> Media Error: 0 Device Not Ready: 0 No Device: 2 Recoverable: 0
>> Illegal Request: 2 Predictive Failure Analysis: 0
>>
>> c5t5000C500525EB2B9d0 Soft Errors: 0 Hard Errors: 5 Transport Errors: 46
>> Vendor: ATA Product: ST3000DM001-9YN1 Revision: CC4H Serial No:
>> W1F0QM5H
>> Size: 3000.59GB <3000592982016 bytes>
>> Media Error: 0 Device Not Ready: 0 No Device: 5 Recoverable: 0
>> Illegal Request: 5 Predictive Failure Analysis: 0
>>
>> c5t5000C50045561CEAd0 Soft Errors: 0 Hard Errors: 1 Transport Errors: 7
>> Vendor: ATA Product: ST3000DM001-9YN1 Revision: CC4H Serial No:
>> W1F09G4Q
>> Size: 3000.59GB <3000592982016 bytes>
>> Media Error: 0 Device Not Ready: 0 No Device: 1 Recoverable: 0
>> Illegal Request: 1 Predictive Failure Analysis: 0
>>
>>
>>
>> _______________________________________________
>> OmniOS-discuss mailing list
>> OmniOS-discuss at lists.omniti.com
>> http://lists.omniti.com/mailman/listinfo/omnios-discuss
>>
>