[OmniOS-discuss] slow drive response times
Richard Elling
richard.elling at richardelling.com
Fri Jan 2 22:45:02 UTC 2015
> On Jan 2, 2015, at 1:50 PM, Kevin Swab <Kevin.Swab at colostate.edu> wrote:
>
> I've run 'sg_logs' on the drive I pulled last week. There were a lot of
> errors in the background scan section of the output, which made it very
> large, so I put it here:
>
> http://pastebin.com/jx5BvSep
>
> When I pulled this drive, the SMART health status was OK.
SMART isn’t smart :-P
> However, when
> I put it in a test system to run 'sg_logs', the status changed to
> "impending failure...". Had the SMART status changed before pulling the
> drive, I'm sure 'fmd' would have alerted me to the problem…
By default, fmd checks for predictive failure (PFA) and self-test status every hour using the disk_transport
agent; fmstat should show activity there. When a PFA is seen, an ereport is generated
and, in most cases, a syslog message. However, this will not cause a zfs-retire event.
Vendors have significant leeway in how they implement SMART. In my experience the only thing
you can say for sure is if the vendor thinks the drive’s death is imminent, then you should replace
it. I suspect these policies are financially motivated rather than scientific… some amount of truthiness
is to be expected.
In the logs, clearly the one disk has lots of errors that have been corrected and the rate is increasing.
The rate of change for "Errors corrected with possible delays" may correlate to your performance issues,
but the interpretation is left up to the vendors.
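If you want to trend that counter over time, it can be scraped out of the sg_logs error-counter pages; a sketch, where the fragment below is a made-up excerpt in the usual `sg_logs -a` layout (on a real system, pipe sg_logs in instead):

```shell
# Pull the delay-related corrected-error counter out of sg_logs-style
# output so its rate of change can be logged and trended. Sample text
# is fabricated; labels follow the usual sg3_utils formatting.
sg_sample='Read error counter page  [0x3]
  Errors corrected without substantial delay = 123456
  Errors corrected with possible delays = 789
  Total errors corrected = 124245'
echo "$sg_sample" | awk -F' = ' '/with possible delays/ { print $2 }'
```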
In the case of this naughty drive, yep it needs replacing.
>
> Since that drive had other indications of trouble, I ran 'sg_logs' on
> another drive I pulled recently that has a SMART health status of OK,
> but exibits similar slow service time behavior:
>
> http://pastebin.com/Q0t8Jnug
This one looks mostly healthy.
Another place to look for latency issues is the phy logs. In the sg_logs output, this is the
Protocol Specific port log page for SAS SSP. Key values are running disparity error
count and loss of dword sync count. The trick here is that you need to look at both ends
of the wire for each wire. For a simple case, this means looking at both the HBA's phy error
counts and the drive's. If you have expanders in the mix, it is more work. You'll want to look at
the phy health counters of the HBA, the expanders, and the drives, for all phys.
This can get tricky because wide ports are mostly dumb. For example, if an HBA has a 4-link
wide port (common) and one of the links is acting up (all too common) the latency impacts
will be random.
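A sketch of picking the two key counters out of that page; the fragment below mimics `sg_logs --page=0x18` output (exact labels can vary between sg3_utils versions):

```shell
# Extract running disparity errors and loss-of-dword-sync counts from a
# fabricated fragment of the SAS SSP Protocol Specific log page. On a
# live system: sg_logs --page=0x18 /dev/rdsk/<disk>s0 (drive side), and
# the matching smp_utils query on the expander side.
phy_page='  phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 17
    Loss of dword synchronization count = 3
    Phy reset problem count = 0'
echo "$phy_page" | awk -F' = ' '
  /Running disparity error count/ { print "disparity_errors=" $2 }
  /Loss of dword synchronization/ { print "dword_sync_losses=" $2 }'
```

Nonzero, growing values on either end of a link are what to watch for.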
To see HBA and expander link health, you can use sg3_utils, its companion smp_utils, or
sasinfo (installed as a separate package on OmniOS, IIRC). For example:
    sasinfo hba-port -l
HTH
— richard
>
> Thanks for taking the time to look at these, please let me know what you
> find...
>
> Kevin
>
>
>
>
> On 12/31/2014 06:13 PM, Richard Elling wrote:
>>
>>> On Dec 31, 2014, at 4:30 PM, Kevin Swab <Kevin.Swab at ColoState.EDU> wrote:
>>>
>>> Hello Richard and group, thanks for your reply!
>>>
>>> I'll look into sg_logs for one of these devices once I have a chance to
>>> track that program down...
>>>
>>> Thanks for the tip on the 500 ms latency, I wasn't aware that could
>>> happen in normal cases. However, I don't believe what I'm seeing
>>> constitutes normal behavior.
>>>
>>> First, some anecdotal evidence: If I pull and replace the suspect
>>> drive, my downstream systems stop complaining, and the high service time
>>> numbers go away.
>>
>> We call these "wounded soldiers" -- it takes more resources to manage a
>> wounded soldier than a dead soldier, so one strategy of war is to wound your
>> enemy causing them to consume resources tending the wounded. The sg_logs
>> should be enlightening.
>>
>> NB, consider a 4TB disk with 5 platters: if a head or surface starts to go, then
>> you have a 1/10 chance that the data you request is under the damaged head
>> and will need to be recovered by the drive. So it is not uncommon to see
>> 90+% of the I/Os to the drive completing quickly. It is also not unusual to see
>> only a small number of sectors or tracks affected.
>>
>> Detecting these becomes tricky, especially as you reduce the timeout/retry
>> interval, since the problem is rarely seen in the average latency -- that which
>> iostat and sar record. This is an area where we can and are improving.
>> -- richard
>>
>>>
>>> I threw out 500 ms as a guess at the point where I start seeing
>>> problems. However, I see service times far in excess of that, sometimes
>>> over 30,000 ms! Below is 20 minutes of sar output from a drive I pulled
>>> a few days ago, during a time when downstream VMWare servers were
>>> complaining. (since the sar output is so verbose, I grepped out the
>>> info just for the suspect drive):
>>>
>>> # sar -d -f /var/adm/sa/sa28 -s 14:50 -e 15:10 | egrep '(device|sd91,a)'
>>> 14:50:00 device %busy avque r+w/s blks/s avwait avserv
>>> sd91,a 99 5.3 1 42 0.0 7811.7
>>> sd91,a 100 11.3 1 53 0.0 11016.0
>>> sd91,a 100 3.8 1 75 0.0 3615.8
>>> sd91,a 100 4.9 1 25 0.0 8633.5
>>> sd91,a 93 3.9 1 55 0.0 4385.3
>>> sd91,a 86 3.5 2 75 0.0 2060.5
>>> sd91,a 91 3.1 4 80 0.0 823.8
>>> sd91,a 97 3.5 1 50 0.0 3984.5
>>> sd91,a 100 4.4 1 56 0.0 6068.6
>>> sd91,a 100 5.0 1 55 0.0 8836.0
>>> sd91,a 100 5.7 1 51 0.0 7939.6
>>> sd91,a 98 9.9 1 42 0.0 12526.8
>>> sd91,a 100 7.4 0 10 0.0 36813.6
>>> sd91,a 51 3.8 8 90 0.0 500.2
>>> sd91,a 88 3.4 1 60 0.0 2338.8
>>> sd91,a 100 4.5 1 28 0.0 6969.2
>>> sd91,a 93 3.8 1 59 0.0 5138.9
>>> sd91,a 79 3.1 1 59 0.0 3143.9
>>> sd91,a 99 4.7 1 52 0.0 5598.4
>>> sd91,a 100 4.8 1 62 0.0 6638.4
>>> sd91,a 94 5.0 1 54 0.0 3752.7
>>>
>>> For comparison, here's the sar output from another drive in the same
>>> pool for the same period of time:
>>>
>>> # sar -d -f /var/adm/sa/sa28 -s 14:50 -e 15:10 | egrep '(device|sd82,a)'
>>> 14:50:00 device %busy avque r+w/s blks/s avwait avserv
>>> sd82,a 0 0.0 2 28 0.0 5.6
>>> sd82,a 1 0.0 3 51 0.0 5.4
>>> sd82,a 1 0.0 4 66 0.0 6.3
>>> sd82,a 1 0.0 3 48 0.0 4.3
>>> sd82,a 1 0.0 3 45 0.0 6.1
>>> sd82,a 1 0.0 6 82 0.0 2.7
>>> sd82,a 1 0.0 8 112 0.0 2.8
>>> sd82,a 0 0.0 3 27 0.0 1.8
>>> sd82,a 1 0.0 5 80 0.0 3.1
>>> sd82,a 0 0.0 3 35 0.0 3.1
>>> sd82,a 1 0.0 3 35 0.0 3.8
>>> sd82,a 1 0.0 4 49 0.0 3.2
>>> sd82,a 0 0.0 0 0 0.0 4.1
>>> sd82,a 3 0.0 9 84 0.0 4.1
>>> sd82,a 1 0.0 6 55 0.0 3.7
>>> sd82,a 0 0.0 1 23 0.0 7.0
>>> sd82,a 0 0.0 6 57 0.0 1.8
>>> sd82,a 1 0.0 5 70 0.0 2.3
>>> sd82,a 1 0.0 4 55 0.0 3.7
>>> sd82,a 1 0.0 5 72 0.0 4.1
>>> sd82,a 1 0.0 4 54 0.0 3.6
>>>
>>> The other drives in this pool all show data similar to that of sd82.
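[As an aside: listings like the two above can be summarized by averaging the avserv column to rank drives. A minimal sketch over three of the sd91,a rows quoted above:]

```shell
# Average the last column (avserv, ms) of grepped sar rows. The three
# rows are copied from the sd91,a listing; on a live system, pipe in:
#   sar -d -f /var/adm/sa/sa28 -s 14:50 -e 15:10 | grep 'sd91,a'
sar_rows='sd91,a 99 5.3 1 42 0.0 7811.7
sd91,a 100 11.3 1 53 0.0 11016.0
sd91,a 100 3.8 1 75 0.0 3615.8'
echo "$sar_rows" | awk '{ sum += $NF; n++ } END { printf "%.1f ms avg avserv\n", sum/n }'
```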
>>>
>>> Your point about tuning blindly is well taken, and I'm certainly no
>>> expert on the IO stack. What's a humble sysadmin to do?
>>>
>>> For further reference, this system is running r151010. The drive in
>>> question is a Toshiba MG03SCA300 (7200rpm SAS), and the pool the drive
>>> was in is using lz4 compression and looks like this:
>>>
>>> # zpool status data1
>>> pool: data1
>>> state: ONLINE
>>> scan: resilvered 1.67T in 70h56m with 0 errors on Wed Dec 31 14:40:20 2014
>>> config:
>>>
>>> NAME STATE READ WRITE CKSUM
>>> data1 ONLINE 0 0 0
>>> raidz2-0 ONLINE 0 0 0
>>> c6t5000039468CB54F0d0 ONLINE 0 0 0
>>> c6t5000039478CB5138d0 ONLINE 0 0 0
>>> c6t5000039468D000DCd0 ONLINE 0 0 0
>>> c6t5000039468D000E8d0 ONLINE 0 0 0
>>> c6t5000039468D00F5Cd0 ONLINE 0 0 0
>>> c6t5000039478C816CCd0 ONLINE 0 0 0
>>> c6t5000039478C8546Cd0 ONLINE 0 0 0
>>> raidz2-1 ONLINE 0 0 0
>>> c6t5000039478C855F0d0 ONLINE 0 0 0
>>> c6t5000039478C856E8d0 ONLINE 0 0 0
>>> c6t5000039478C856ECd0 ONLINE 0 0 0
>>> c6t5000039478C856F4d0 ONLINE 0 0 0
>>> c6t5000039478C86374d0 ONLINE 0 0 0
>>> c6t5000039478C8C2A8d0 ONLINE 0 0 0
>>> c6t5000039478C8C364d0 ONLINE 0 0 0
>>> raidz2-2 ONLINE 0 0 0
>>> c6t5000039478C9958Cd0 ONLINE 0 0 0
>>> c6t5000039478C995C4d0 ONLINE 0 0 0
>>> c6t5000039478C9DACCd0 ONLINE 0 0 0
>>> c6t5000039478C9DB30d0 ONLINE 0 0 0
>>> c6t5000039478C9DB6Cd0 ONLINE 0 0 0
>>> c6t5000039478CA73B4d0 ONLINE 0 0 0
>>> c6t5000039478CB3A20d0 ONLINE 0 0 0
>>> raidz2-3 ONLINE 0 0 0
>>> c6t5000039478CB3A64d0 ONLINE 0 0 0
>>> c6t5000039478CB3A70d0 ONLINE 0 0 0
>>> c6t5000039478CB3E7Cd0 ONLINE 0 0 0
>>> c6t5000039478CB3EB0d0 ONLINE 0 0 0
>>> c6t5000039478CB3FBCd0 ONLINE 0 0 0
>>> c6t5000039478CB4048d0 ONLINE 0 0 0
>>> c6t5000039478CB4054d0 ONLINE 0 0 0
>>> raidz2-4 ONLINE 0 0 0
>>> c6t5000039478CB424Cd0 ONLINE 0 0 0
>>> c6t5000039478CB4250d0 ONLINE 0 0 0
>>> c6t5000039478CB470Cd0 ONLINE 0 0 0
>>> c6t5000039478CB471Cd0 ONLINE 0 0 0
>>> c6t5000039478CB4E50d0 ONLINE 0 0 0
>>> c6t5000039478CB50A8d0 ONLINE 0 0 0
>>> c6t5000039478CB50BCd0 ONLINE 0 0 0
>>> spares
>>> c6t50000394A8CBC93Cd0 AVAIL
>>>
>>> errors: No known data errors
>>>
>>>
>>> Thanks for your help,
>>> Kevin
>>>
>>> On 12/31/2014 3:22 PM, Richard Elling wrote:
>>>>
>>>>> On Dec 31, 2014, at 11:25 AM, Kevin Swab <Kevin.Swab at colostate.edu> wrote:
>>>>>
>>>>> Hello Everyone,
>>>>>
>>>>> We've been running OmniOS on a number of SuperMicro 36bay chassis, with
>>>>> Supermicro motherboards, LSI SAS controllers (9211-8i & 9207-8i) and
>>>>> various SAS HDD's. These systems are serving block storage via Comstar
>>>>> and Qlogic FC HBA's, and have been running well for several years.
>>>>>
>>>>> The problem we've got is that as the drives age, some of them start to
>>>>> perform slowly (intermittently) without failing - no zpool or iostat
>>>>> errors, and nothing logged in /var/adm/messages. The slow performance
>>>>> can be seen as high average service times in iostat or sar.
>>>>
>>>> Look at the drive's error logs using sg_logs (-a for all)
>>>>
>>>>>
>>>>> When these service times get above 500ms, they start to cause IO
>>>>> timeouts on the downstream storage consumers, which is bad...
>>>>
>>>> 500 milliseconds is not unusual for a busy HDD with SCSI TCQ or SATA NCQ
>>>>
>>>>>
>>>>> I'm wondering - is there a way to tune OmniOS' behavior so that it
>>>>> doesn't try so hard to complete IOs to these slow disks, and instead
>>>>> just gives up and fails them?
>>>>
>>>> Yes, the tuning in Alasdair's blog should work as he describes. More below...
>>>>
>>>>>
>>>>> I found an old post from 2011 which states that some tunables exist,
>>>>> but are ignored by the mpt_sas driver:
>>>>>
>>>>> http://everycity.co.uk/alasdair/2011/05/adjusting-drive-timeouts-with-mdb-on-solaris-or-openindiana/
>>>>>
>>>>> Does anyone know the current status of these tunables, or have any other
>>>>> suggestions that might help?
>>>>
>>>> These tunables are on the order of seconds. The default, 60, is obviously too big
>>>> unless you have old, slow, SCSI CD-ROMs. But setting it below the manufacturer's
>>>> internal limit (default or tuned) can lead to an unstable system. Some vendors are
>>>> better than others at documenting these, but in any case you'll need to see their spec.
>>>> Expect values on the order of 6 to 15 seconds for modern HDDs and SSDs.
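[For concreteness, a tuning along the lines of Alasdair's post might land in /etc/system; the tunable name below is from the sd(7D) driver, and the value is illustrative only -- it must stay above the drive's documented internal recovery limit:]

    * Hypothetical /etc/system entry -- check the vendor's spec first
    set sd:sd_io_time = 15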
>>>>
>>>> There are a lot of tunables in this area at all levels of the architecture. Out of the box, the OmniOS
>>>> settings ensure stable behaviour. Tuning any layer without understanding the others can
>>>> lead to unstable systems, as demonstrated by your current downstream consumers.
>>>> -- richard
>>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> Kevin
>>>>>
>>>>>
>>>>> --
>>>>> -------------------------------------------------------------------
>>>>> Kevin Swab UNIX Systems Administrator
>>>>> ACNS Colorado State University
>>>>> Phone: (970)491-6572 Email: Kevin.Swab at ColoState.EDU
>>>>> GPG Fingerprint: 7026 3F66 A970 67BD 6F17 8EB8 8A7D 142F 2392 791C
>>>>> _______________________________________________
>>>>> OmniOS-discuss mailing list
>>>>> OmniOS-discuss at lists.omniti.com
>>>>> http://lists.omniti.com/mailman/listinfo/omnios-discuss
>>>
>
> --
> -------------------------------------------------------------------
> Kevin Swab UNIX Systems Administrator
> ACNS Colorado State University
> Phone: (970)491-6572 Email: Kevin.Swab at ColoState.EDU
> GPG Fingerprint: 7026 3F66 A970 67BD 6F17 8EB8 8A7D 142F 2392 791C
--
Richard.Elling at RichardElling.com
+1-760-896-4422