[OmniOS-discuss] ZeusRAM - predictive failure

Richard Elling richard.elling at richardelling.com
Wed Apr 12 00:24:34 UTC 2017


> On Apr 10, 2017, at 4:30 PM, Machine Man <gearboxes at outlook.com> wrote:
> 
> Do you select drives based on DWPD?

Not really. Inside a given product line, the difference in DWPD is a matter of overprovisioning.
You can adjust the overprovisioning yourself, if needed.

Note to the lurkers: overprovisioning also affects write performance during garbage collection.
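
For the lurkers, a rough sketch of the arithmetic behind that, in Python. The capacities and endurance ratings below are made-up illustration numbers, not taken from any data sheet. The formulas are the usual ones: overprovisioning as spare NAND relative to user-visible capacity, and DWPD as rated terabytes written (TBW) divided by user capacity times the days in the warranty period.

```python
def overprovisioning_pct(raw_gb: float, user_gb: float) -> float:
    """Spare capacity as a percentage of the user-visible capacity."""
    return (raw_gb - user_gb) / user_gb * 100.0


def dwpd(tbw_tb: float, user_gb: float, warranty_years: float = 5.0) -> float:
    """Drive writes per day implied by a rated endurance in TB written."""
    days = warranty_years * 365
    return (tbw_tb * 1000.0) / (user_gb * days)


if __name__ == "__main__":
    # Hypothetical product line: same 512 GB of raw NAND, two user capacities.
    # The TBW ratings are assumptions chosen only to illustrate the formula.
    for name, user_gb, tbw_tb in [("480 GB model", 480, 876),
                                  ("400 GB model", 400, 2190)]:
        op = overprovisioning_pct(512, user_gb)
        print(f"{name}: OP {op:.1f}%, DWPD {dwpd(tbw_tb, user_gb):.1f}")
```

With these assumed numbers, the model that exposes less of the same NAND has the larger overprovisioning percentage and the higher DWPD rating, which is the sense in which DWPD within a product line is mostly an overprovisioning choice.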

> I am struggling to find $500 - $700 drives in stock. I am limited to a number of distributors, and pretty much unless it's HP, Cisco or Dell it's not kept in stock. On a number of disk options I got a ship date of late June, and all 3 distributors indicate SSD drives are constrained.

Yes, there is a global shortage and all major vendors are on allocation.

> I am now down to adding a single SSD during busy hours or when the alerts start rolling in, and removing the ZIL after hours or when the load drops again.
> 
> My only other options for the next 3 weeks are:
> 1 - add 15K drives for ZIL and see if that helps.
> 2 - Hope for the best on the single old OCZ Talos 2

I have had bad luck with these.

> 3 - Mix SAS/SATA on the same backplane.

No guarantees, but for more modern expanders and HBAs, we see fewer problems mixing.
I wouldn’t attempt it with 3G SAS/SATA, but 12G seems more robust.
 — richard


> 
> I was 100% banking on the ZeusRAM since that is what I could get my hands on immediately.
> From: Richard Elling <richard.elling at richardelling.com>
> Sent: Monday, April 10, 2017 5:49:55 PM
> To: Machine Man
> Cc: omnios-discuss at lists.omniti.com
> Subject: Re: [OmniOS-discuss] ZeusRAM - predictive failure
>  
> 
>> On Apr 10, 2017, at 2:39 PM, Machine Man <gearboxes at outlook.com> wrote:
>> 
>> Thank you. I am sending it back to where we purchased it from. I thought these were no longer available, but the distributor still listed them and had them in stock.
>> I was hesitant to purchase, but I am in desperate need of a ZIL.
> 
> ZeusRAMs have been EOL for a year or more. AIUI, the parts are no longer available to build them.
> We do see better performance from the modern, enterprise-class, 12G SAS parts from HGST and Toshiba.
> Unfortunately, they are priced by $/GB and not $/latency, so the smaller capacity (GB) drives are also slower.
>  — richard
> 
>> 
>> 
>> From: Richard Elling <richard.elling at richardelling.com>
>> Sent: Monday, April 10, 2017 4:15:32 PM
>> To: Machine Man
>> Cc: omnios-discuss at lists.omniti.com
>> Subject: Re: [OmniOS-discuss] ZeusRAM - predictive failure
>>  
>> 
>>> On Apr 10, 2017, at 1:00 PM, Machine Man <gearboxes at outlook.com> wrote:
>>> 
>>> Today I received one of the two ZeusRAM drives that I ordered, both brand new. I was struggling to find SAS SSD drives that were available in my price range, as I desperately need to add a ZIL.
>>> I decided to order ZeusRAM since they had one in stock, and figured I'll add it while waiting for the other one, as they really should not be prone to failure based on their design. I have not used them before and would normally just prefer to use regular SSD drives.
>>> 
>>> I slotted the ZeusRAM in and it began to blink rapidly, the same as the disks that are currently in the pool on that backplane. Running the format command never returned a list of disks. I left it for about 15 min and then pulled it, since it says on the disk that it can take up to 10 min for the caps. I could see an amber and a green LED blinking on the drive itself, even when it was removed.
>>> I slotted it back in and the disk was then available. After a few min the fault light came on and the disk was unavailable due to the following:
>>> 
>>> Fault class : fault.io.disk.predictive-failure
>> 
>> This occurs when the drive responds to an I/O and indicates a predictive failure or
>> the periodic query for drives sees a predicted failure. It is the drive telling the OS that
>> the drive thinks it will fail. There is nothing you can do on the OS to “fix” this.
>> 
>> It is possible that HGST (née STEC) can help with further diagnosis using the vendor-specific
>> log pages. Several years ago, STEC helped us root-cause a failing ultracapacitor in a drive.
>> AFAIK, there is no publicly available decoder for those log pages.
>>  — richard
>> 
>> 
>>> Affects     : dev:///:devid=id1,sd@n5000a720300b3d57//pci@0,0/pci8086,340e@7/pci1000,3040@0/iport@f0/disk@w5000a72a300b3d57,0
>>>                   faulted and taken out of service
>>> FRU         : "Slot 09" (hc://:product-id=LSI-SAS2X36:server-id=:chassis-id=50030480178cf57f:serial=STM000****:part=STEC-ZeusRAM:revision=C025/ses-enclosure=1/bay=8/disk=0)
>>>                   faulty
>>> Description : SMART health-monitoring firmware reported that a disk
>>>               failure is imminent.
>>> 
>>> 
>>> I cleared the fault and the drive was usable again for a few min, then the same thing happened. Eventually the amber light on the disk itself (not the enclosure disk light) stopped blinking, and the disk was online for quite some time before the alert above reappeared.
>>> 
>>> 
>>> === START OF INFORMATION SECTION ===
>>> Vendor:               STEC
>>> Product:              ZeusRAM
>>> Revision:             C025
>>> Compliance:           SPC-4
>>> User Capacity:        8,000,000,000 bytes [8.00 GB]
>>> Logical block size:   512 bytes
>>> Rotation Rate:        Solid State Device
>>> Form Factor:          3.5 inches
>>> Logical Unit id:      0x5000a720300b3d57
>>> Serial number:        STM000******
>>> Device type:          disk
>>> Transport protocol:   SAS (SPL-3)
>>> Local Time is:        Mon Apr 10 19:17:23 2017 UTC
>>> SMART support is:     Available - device has SMART capability.
>>> SMART support is:     Enabled
>>> Temperature Warning:  Enabled
>>> === START OF READ SMART DATA SECTION ===
>>> SMART Health Status: OK
>>> Current Drive Temperature:     40 C
>>> Drive Trip Temperature:        80 C
>>> Elements in grown defect list: 0
>>> Vendor (Seagate) cache information
>>>   Blocks sent to initiator = 0
>>>   Blocks sent to initiator = 0
>>> Error counter log:
>>>            Errors Corrected by           Total   Correction     Gigabytes    Total
>>>                ECC          rereads/    errors   algorithm      processed    uncorrected
>>>            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
>>> read:          0        0         0         0          0         21.323           0
>>> write:         0        0         0         0          0         83.809           0
>>> Non-medium error count:        0
>>> 
>>> 
>>> 
>>> Is there anything special that should be done for ZeusRAM in sd.conf? It's a two-node install and both nodes can see all the drives. I don't see any SMART errors listed, but fmadm shows the disk as faulty due to predictive failure.
>>> OmniOS r20 all patches applied.
>>> 
>>> 
>>> thanks,  
>>> _______________________________________________
>>> OmniOS-discuss mailing list
>>> OmniOS-discuss at lists.omniti.com
>>> http://lists.omniti.com/mailman/listinfo/omnios-discuss
>> --
>> 
>> Richard.Elling at RichardElling.com
>> +1-760-896-4422

--

Richard.Elling at RichardElling.com
+1-760-896-4422
