[OmniOS-discuss] Overheating faults with ST4000NM0023

Mon Feb 24 02:42:47 UTC 2014

Clarification below...

On Feb 13, 2014, at 2:18 AM, Thibault VINCENT <thibault.vincent at smartjog.com> wrote:

> On 02/12/2014 09:59 PM, Steamer wrote:
>> Did you ever find a solution to the overheating faults with the
>> ST4000NM0023?
>> 
>> I'm currently having the exact same issue with ST1000NM0023 drives;
>> it seems Seagate has the user temperature probe set at 40°C. The manual
>> states that the temperature settings are programmable via SMART, but I
>> haven't found a way to do that.
> 
> Hello Emile,
> 
> I've found a workaround, but the definitive fix should be handled by
> Illumos, I guess. There is no open ticket; I was waiting for
> something to happen with #4051 first, before going back to using that
> distro and kernel.
> 
> Here's the story:
> The SCSI specification defines two registers to store the temperature
> thresholds in SMART data. One contains the recommended maximum operating
> temperature for best MTBF, and the other register is for the absolute
> maximum rating. The industry has usually put the same value in
> both, and that is the absolute maximum.

Yes, because the SPC-4 standard (section 7.3.21.3) is very precise in its defined use.
Seagate made a mistake and didn't follow the spec. AIUI, this is being corrected in firmware
0004, available RSN.
 -- richard

> That's why we always see
> something like 60/65°C from SMART. But recently Seagate changed that
> because a large OS company asked it to comply with the
> specification for better hardware monitoring integration. The change did
> not only occur in newer products: it also came as a firmware update for
> existing disks, and that update was applied to the production line, which
> explains why some disks may or may not expose this problem although they
> are the same model. Our disks are of the Megalodon series and all share
> the same firmware base code.
> 
> So any Seagate disk will now trigger faults in FMA if it has a
> firmware with the newer policy. I also think other brands will follow
> the same path.
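
To see which values a given drive actually reports, the temperature log
page can be read with sg_logs, the same tool used in the scripts further
down. The sketch below is only an illustration: temp_fields is a
hypothetical helper name, and it assumes the page text contains lines of
the form "Current temperature = N C" and "Reference temperature = N C",
which is the wording the reset script in this message greps for.

```shell
# Hypothetical helper: read temperature log page text on stdin and print
# "<current> <reference>" in degrees C. A value that is missing from the
# input is printed as "?".
temp_fields() {
  awk -F' = ' '
    /Current temperature/   { split($2, a, " "); cur = a[1] }
    /Reference temperature/ { split($2, a, " "); ref = a[1] }
    END {
      if (cur == "") cur = "?"
      if (ref == "") ref = "?"
      print cur, ref
    }
  '
}
```

Usage might look like "sg_logs /dev/sdb --page=0x0d | temp_fields", where
/dev/sdb is a placeholder device name.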
> 
> As other members suggested in that thread, maybe nothing should change
> in FMA, but let's face it: you can't keep the temperature steadily
> under 40°C in a JBOD of hundreds of busy disks, especially in
> eco-friendly datacenters. IMHO we should not trigger a fault on the
> lower threshold, and certainly not a drive retirement. It breaks storage
> servers on reboot or before a pool import, and spare disks could
> disappear when the retirement is triggered.
> 
> The workaround is to downgrade the firmware to the last version before
> the change, and to reset the register with a SCSI command. It is not
> possible to set the register to a user-specified value as the
> documentation suggests; they confirmed it.
> 
> I'm sending a working firmware to you in a private mail. I'm not aware
> of any issue working with that older version and hopefully it should
> upload to 1TB drives as well.
> I'm applying it like this, but from Linux, not OmniOS:
> # ./dl_sea_fw-0.2.3_32 -f Megalodon_StdOEM_SAS_0002+C84C.lod -m ST4000NM0023
> # ./dl_sea_fw-0.2.3_32 -i
> 
> Then you should reset the drives so they reload the firmware.
> Here's our example for 4TB drives:
> -------------
> for i in $(lsscsi | grep 'ST4000NM0023' | awk '{print $6}') ; do
>  sg_reset -d $i
> done
> -------------
> 
> Then reset the register, which still contains the value from the
> previous firmware. The reset doesn't work reliably, so we run this script
> a few times until all disks have got it. Again, it matches 4TB Megalodon.
> -------------
> for i in $(lsscsi | grep 'ST4000NM0023' | awk '{print $6}') ; do
>  echo -n "$i "
>  if sg_logs $i --page=0x0d | grep 'Reference temperature = 68 C' > /dev/null ; then
>    echo 'ok'
>  else
>    sg_logs $i --page=0x0d --reset
>    echo 'reset'
>  fi
> done
> -------------
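
The "run it a few times" step above could also be wrapped in a bounded
retry loop per device. This is a minimal sketch, not the script actually
used: reset_until_ok and the SG_LOGS override are hypothetical names
introduced here, the default of 68 is simply the value the script above
checks for, and it assumes sg_logs behaves as in that script.

```shell
# Hypothetical wrapper around the check-and-reset step for one device.
# SG_LOGS can be pointed at a stub for a dry run; it defaults to sg_logs.
reset_until_ok() {
  dev=$1          # device node, e.g. /dev/sdb
  max_resets=$2   # give up after this many reset attempts
  want=${3:-68}   # expected reference temperature in C (assumed default)
  resets=0
  while :; do
    # Done as soon as the log page shows the expected reference value.
    if "${SG_LOGS:-sg_logs}" "$dev" --page=0x0d |
        grep -q "Reference temperature = $want C"; then
      return 0
    fi
    # Stop once the reset budget is used up.
    [ "$resets" -ge "$max_resets" ] && return 1
    "${SG_LOGS:-sg_logs}" "$dev" --page=0x0d --reset > /dev/null
    resets=$((resets + 1))
  done
}
```

Inside the same for loop as above, one could then call
"reset_until_ok $i 5 || echo "$i still not reset"", with 5 being an
arbitrary retry budget.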
> 
> 
> Cheers
> 
> -- 
> Thibault VINCENT - Lead System Engineer, Infrastructure -
> Arkena | Phone: +33 1 58 68 62 38 | Mobile: +33 6 78 97 01 08
> 27 Blvd Hippolyte Marquès, 94200 Ivry-sur-Seine, France | www.arkena.com
> Arkena - Ready to Play
> _______________________________________________
> OmniOS-discuss mailing list
> OmniOS-discuss at lists.omniti.com
> http://lists.omniti.com/mailman/listinfo/omnios-discuss

--

Richard.Elling at RichardElling.com
+1-760-896-4422
