[OmniOS-discuss] Overheating faults with ST4000NM0023

Thu Feb 13 10:18:58 UTC 2014

On 02/12/2014 09:59 PM, Steamer wrote:
> Did you ever find a solution to the overheating faults with the
> ST4000NM0023?
>  
> I'm currently having the exact same issue with ST1000NM0023 drives.
> It seems Seagate has the user temperature probe set at 40°C. The manual
> states that the temperature settings are programmable via SMART, but I
> haven't found a way to do that.

Hello Emile,

I've found a workaround, but I guess the definitive fix should be handled
in Illumos. There is no open ticket for it yet: I was first waiting for
something to happen with #4051 before going back to using that distro
and kernel.

Here's the story:
The SCSI specification defines two registers to store the temperature
thresholds in SMART data. One contains the recommended maximum operating
temperature for best MTBF, and the other holds the absolute maximum
rating. Until now the industry has put the same value in both, and that
value is the absolute maximum. That's why we always see something like
60/65°C from SMART. But recently Seagate changed that, because a large
OS company asked them to comply with the specification for better
hardware monitoring integration. The change did not only occur in newer
products: it also came as a firmware update for existing disks, and that
update was applied on the production line. This explains why some disks
show the problem and others don't although they are the same model. Our
disks are of the Megalodon series and all share the same firmware code
base.
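
To see what a given disk reports, here is a minimal sketch. It assumes
the temperature log page (0x0d) printed by sg_logs carries lines of the
form "Current temperature = N C" and "Reference temperature = N C" (the
second is the string the reset script further down greps for). The
helper name temp_fields is made up for illustration, and which of the
two registers the reference temperature corresponds to is not something
this sketch tries to settle.

```shell
# Hypothetical helper: read the text output of `sg_logs <dev> --page=0x0d`
# on stdin and print "<current> <reference>" in degrees C.
# A field that is missing or not numeric is printed as "?".
temp_fields() {
  awk '
    /Current temperature =/   { if ($(NF-1) ~ /^[0-9]+$/) cur = $(NF-1) }
    /Reference temperature =/ { if ($(NF-1) ~ /^[0-9]+$/) ref = $(NF-1) }
    END {
      if (cur == "") cur = "?"
      if (ref == "") ref = "?"
      print cur, ref
    }
  '
}
```

Usage would be something like "sg_logs /dev/sdX --page=0x0d | temp_fields",
with /dev/sdX standing in for the real device node.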

So any Seagate disk will now trigger faults in FMA if they have a
firmware with the newer policy. Also I think other brands will follow
the same path.

As other members suggested in that thread, maybe nothing should change
in FMA. But let's face it, you can't keep the temperature steadily
under 40°C in a JBOD of hundreds of busy disks, especially in
eco-friendly datacenters. IMHO we should not trigger a fault on the
lower threshold, and certainly not a drive retirement. It breaks storage
servers on reboot or before a pool import, and spare disks can
disappear once the retirement is triggered.

The workaround is to downgrade the firmware to the last version before
the change, and to reset the register with a SCSI command. It is not
possible to set the register to a user-specified value as the
documentation suggests; they confirmed it.

I'm sending a working firmware to you in a private mail. I'm not aware
of any issue with that older version, and hopefully it will upload to
1TB drives as well.
I apply it like this, but from Linux, not OmniOS:
# ./dl_sea_fw-0.2.3_32 -f Megalodon_StdOEM_SAS_0002+C84C.lod -m ST4000NM0023
# ./dl_sea_fw-0.2.3_32 -i

Then you should reset the drives so they reload the firmware.
Here's our example for 4TB drives:
-------------
for i in $(lsscsi | grep 'ST4000NM0023' | awk '{print $6}') ; do
  sg_reset -d "$i"
done
-------------

Then reset the register, which still contains the value from the
previous firmware. The reset doesn't always take, so we run this script
a few times until all disks have it. Again it matches 4TB Megalodon.
-------------
for i in $(lsscsi | grep 'ST4000NM0023' | awk '{print $6}') ; do
  printf '%s ' "$i"
  # 68 C is the value reported once the register has been reset
  if sg_logs "$i" --page=0x0d | grep -q 'Reference temperature = 68 C' ; then
    echo 'ok'
  else
    sg_logs "$i" --page=0x0d --reset
    echo 'reset'
  fi
done
-------------
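
Since the script has to be run a few times, the repetition can be
wrapped in a loop. This is only a sketch under assumptions: the function
names, the RESET_DELAY variable and the default of 5 passes are made up
for illustration, and the commands are the same lsscsi and sg_logs calls
as above, so it matches 4TB Megalodon only.

```shell
# Hypothetical helpers built from the commands in the script above.
list_disks() { lsscsi | grep 'ST4000NM0023' | awk '{print $6}'; }

# Succeeds when the disk already reports the 68 C reference temperature.
is_reset() { sg_logs "$1" --page=0x0d | grep -q 'Reference temperature = 68 C'; }

# One pass over all disks: reset those not yet at 68 C and print how
# many disks needed a reset during this pass.
reset_pass() {
  pending=0
  for dev in $(list_disks); do
    if ! is_reset "$dev"; then
      sg_logs "$dev" --page=0x0d --reset >/dev/null
      pending=$((pending + 1))
    fi
  done
  echo "$pending"
}

# Repeat passes until a pass finds nothing to reset, or give up after
# max_passes (arbitrary default: 5). Returns 0 on success, 1 otherwise.
reset_until_done() {
  max_passes=${1:-5}
  pass=1
  while [ "$pass" -le "$max_passes" ]; do
    [ "$(reset_pass)" -eq 0 ] && return 0
    pass=$((pass + 1))
    sleep "${RESET_DELAY:-1}"
  done
  return 1
}
```

Calling "reset_until_done 10" would then try up to ten passes and return
non-zero if some disk still does not report 68 C at the end.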

Cheers

-- 
Thibault VINCENT - Lead System Engineer, Infrastructure -
Arkena | Phone: +33 1 58 68 62 38 | Mobile: +33 6 78 97 01 08
27 Blvd Hippolyte Marquès, 94200 Ivry-sur-Seine, France | www.arkena.com
Arkena - Ready to Play