[OmniOS-discuss] Bad ZeusRAM? How to tell if another component is causing issues?

Wed Dec 3 06:31:23 UTC 2014

On Dec 2, 2014, at 10:02 PM, wuffers <moo at wuffers.net> wrote:

> I'm at home just looking into the health of our SAN and came across a bunch of errors on the Stec ZeusRAM (in a mirrored log configuration):
> 
> # iostat -En
> c12t5000A72B300780FFd0 Soft Errors: 0 Hard Errors: 1 Transport Errors: 5224

ZeusRAMs are more sensitive to noisy fabric or cables than some other drives.
At the OS level, one symptom is transport errors, but that is only one view of the
fabric and you'll need more than one view to reach root cause.
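
For the OS-level view, the per-device counters printed by `iostat -En` can be
tallied across every device with a few lines of Python. This is a rough sketch,
not something posted in the thread: the function name `error_counters` and the
`SAMPLE` text are illustrative (SAMPLE reuses the header line quoted above), and
the regex assumes the header layout shown in this message.

```python
import re

# Matches the first line of each device block in `iostat -En` output, e.g.
# "c12t... Soft Errors: 0 Hard Errors: 1 Transport Errors: 5224"
HEADER = re.compile(
    r"^(?P<dev>\S+)\s+Soft Errors:\s*(?P<soft>\d+)\s+"
    r"Hard Errors:\s*(?P<hard>\d+)\s+"
    r"Transport Errors:\s*(?P<transport>\d+)"
)


def error_counters(iostat_text):
    """Return {device: {"soft": n, "hard": n, "transport": n}}."""
    counters = {}
    for line in iostat_text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            counters[match.group("dev")] = {
                key: int(match.group(key))
                for key in ("soft", "hard", "transport")
            }
    return counters


# Illustrative input: the device header and vendor line quoted in this thread.
SAMPLE = """\
c12t5000A72B300780FFd0 Soft Errors: 0 Hard Errors: 1 Transport Errors: 5224
Vendor: STEC     Product: ZeusRAM          Revision: C018 Serial No: STM000170C98
"""

if __name__ == "__main__":
    for dev, counts in error_counters(SAMPLE).items():
        if counts["transport"] > 0:
            print(dev, counts)
```

In practice the text would come from running `iostat -En` on the host. Comparing
the result across devices that share a cable, backplane or HBA port is one way
to see whether the errors follow the drive or the path.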

Check the drive's health using something like sg3_utils:
	sg_logs -a /dev/rdsk/c#t#d#
and look for link stats, especially loss of DWORD sync and running disparity errors.
 -- richard
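
The link counters mentioned above can be pulled out of saved `sg_logs -a`
output with a short Python filter. This is only a sketch under assumptions: the
counter labels in the regex are what I would expect sg3_utils to print for the
SAS phy log, they may differ between versions, and the numbers in `SAMPLE` are
invented placeholders, not data from this SAN. The name `link_error_counters`
is hypothetical.

```python
import re

# Assumed label text for the per-phy link counters; adjust to match the
# output of the installed sg3_utils version.
LINK_COUNTER = re.compile(
    r"^\s*(Invalid DWORD count|Running disparity error count|"
    r"Loss of DWORD synchronization(?: count)?|"
    r"Phy reset problem(?: count)?)\s*[=:]\s*(\d+)",
    re.IGNORECASE,
)


def link_error_counters(sg_logs_text):
    """Sum each link counter over all phys found in the text."""
    totals = {}
    for line in sg_logs_text.splitlines():
        match = LINK_COUNTER.match(line)
        if not match:
            continue
        name = match.group(1).lower()
        if name.endswith(" count"):
            name = name[: -len(" count")]  # normalise label variants
        totals[name] = totals.get(name, 0) + int(match.group(2))
    return totals


# Placeholder input with made-up values, for illustration only.
SAMPLE = """\
  phy identifier = 0
    Invalid DWORD count = 12
    Running disparity error count = 7
    Loss of DWORD synchronization = 3
    Phy reset problem = 0
  phy identifier = 1
    Invalid DWORD count = 1
    Running disparity error count = 2
    Loss of DWORD synchronization = 0
    Phy reset problem = 1
"""

if __name__ == "__main__":
    for name, total in link_error_counters(SAMPLE).items():
        print(f"{name}: {total}")
```

Running the same filter against a healthy drive on a different port gives a
baseline to compare with.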

> Vendor: STEC     Product: ZeusRAM          Revision: C018 Serial No: STM000170C98
> Size: 8.00GB <8000000000 bytes>
> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> Illegal Request: 391 Predictive Failure Analysis: 0
> 
> # fmdump -eV
> Dec 03 2014 00:26:22.592888816 ereport.io.scsi.cmd.disk.recovered
> nvlist version: 0
>         class = ereport.io.scsi.cmd.disk.recovered
>         ena = 0xd38b237e7ed02001
>         detector = (embedded nvlist)
>         nvlist version: 0
>                 version = 0x0
>                 scheme = dev
>                 device-path = /pci@0,0/pci8086,3c08@3/pci1000,3030@0/iport@f/disk@w5000a72b300780ff,0
>                 devid = id1,sd@n5000a720300780ff
>         (end detector)
> 
>         devid = id1,sd@n5000a720300780ff
>         driver-assessment = recovered
>         op-code = 0x2a
>         cdb = 0x2a 0x0 0x0 0x2d 0xda 0x0 0x0 0x0 0xf8 0x0
>         pkt-reason = 0x0
>         pkt-state = 0x1f
>         pkt-stats = 0x0
>         __ttl = 0x1
>         __tod = 0x547e9efe 0x2356c3f0
> 
> # dmesg
> Dec  3 00:28:24 san1 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3c08@3/pci1000,3030@0 (mpt_sas1):
> Dec  3 00:28:24 san1    Log info 0x31120303 received for target 10 w5000a72b300780ff.
> Dec  3 00:28:24 san1    scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
> 
> from format:
> 
> 57. c12t5000A72B300780FFd0 <STEC-ZeusRAM-C018-7.45GB>
>           /pci@0,0/pci8086,3c08@3/pci1000,3030@0/iport@f/disk@w5000a72b300780ff,0
> 
> Both fmdump and dmesg show these errors repeating over and over. Everything seems to point to the drive. I suppose I would have to physically move the drive to rule out cable, backplane or controller issues. Is there another way to tell from these error logs alone, or is the physical test the way to go?
> 
> Are logs enough to justify an RMA? 
