[OmniOS-discuss] slow drive response times
Richard Elling
richard.elling at richardelling.com
Tue Jan 6 21:23:14 UTC 2015
> On Jan 6, 2015, at 12:18 PM, Kevin Swab <Kevin.Swab at ColoState.EDU> wrote:
>
> SAS expanders are involved in my systems, so I installed 'sasinfo' and
> 'smp_utils'. After a bit of poking around in the dark, I came up with
> the following commands which I think get at the error counters you
> mentioned.
Yes, this data looks fine.
>
> Unfortunately, I had to remove the "wounded soldier" from this system
> since it was causing problems. This output is from the same slot, but
> with a healthy replacement drive:
>
> # sasinfo hba-port -a SUNW-mpt_sas-1 -l
> HBA Name: SUNW-mpt_sas-1
> HBA Port Name: /dev/cfg/c7
> Phy Information:
> Identifier: 0
> Link Error Statistics:
> Invalid Dword: 0
> Running Disparity Error: 0
> Loss of Dword Sync: 0
> Reset Problem: 0
> Identifier: 1
> Link Error Statistics:
> Invalid Dword: 0
> Running Disparity Error: 0
> Loss of Dword Sync: 0
> Reset Problem: 0
> Identifier: 2
> Link Error Statistics:
> Invalid Dword: 0
> Running Disparity Error: 0
> Loss of Dword Sync: 0
> Reset Problem: 0
> Identifier: 3
> Link Error Statistics:
> Invalid Dword: 0
> Running Disparity Error: 0
> Loss of Dword Sync: 0
> Reset Problem: 0
perfect!
> HBA Port Name: /dev/cfg/c8
> Phy Information:
> Identifier: 4
> Link Error Statistics:
> Invalid Dword: 0
> Running Disparity Error: 0
> Loss of Dword Sync: 0
> Reset Problem: 0
> Identifier: 5
> Link Error Statistics:
> Invalid Dword: 0
> Running Disparity Error: 0
> Loss of Dword Sync: 0
> Reset Problem: 0
> Identifier: 6
> Link Error Statistics:
> Invalid Dword: 0
> Running Disparity Error: 0
> Loss of Dword Sync: 0
> Reset Problem: 0
> Identifier: 7
> Link Error Statistics:
> Invalid Dword: 0
> Running Disparity Error: 0
> Loss of Dword Sync: 0
perfect!
>
>
>
> # ./smp_discover /dev/smp/expd9 | egrep '(c982|c983)'
> phy 26:U:attached:[50000394a8cbc982:00 t(SSP)] 6 Gbps
> # ./smp_discover /dev/smp/expd11 | egrep '(c982|c983)'
> phy 26:U:attached:[50000394a8cbc983:01 t(SSP)] 6 Gbps
> # ./smp_rep_phy_err_log --phy=26 /dev/smp/expd9
> Report phy error log response:
> Expander change count: 228
> phy identifier: 26
> invalid dword count: 0
> running disparity error count: 0
> loss of dword synchronization count: 0
> phy reset problem count: 0
> # ./smp_rep_phy_err_log --phy=26 /dev/smp/expd11
> Report phy error log response:
> Expander change count: 228
> phy identifier: 26
> invalid dword count: 0
> running disparity error count: 0
> loss of dword synchronization count: 0
> phy reset problem count: 0
> #
>
> "disparity error count" and "loss of dword sync count" are 0 in all of
> this output, in contrast with the non-zero values seen in the sg_logs
> output for the "wounded soldier".
perfect!
>
> Am I looking at the right output?
Yes, this is not showing any errors, which is a good thing.
> Does "phy" in the above commands
> refer to the HDD itself or the port on the expander it's connected to?
Expander port. The HDD's view is in the sg_logs --page=0x18 /dev/rdsk/...
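For example (the device path here is just a placeholder -- substitute the
rdsk path of the drive you care about):

# sg_logs --page=0x18 /dev/rdsk/cXtWWNd0s0

That prints the Protocol Specific port page for the drive's own phys; the
"running disparity error count" and "loss of dword synchronization count"
fields there are the drive-side counterparts of the expander counters above.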
> Had I been able to run the above commands with the "wounded soldier"
> still installed, what should I have been looking for?
The process is to rule out errors. You have succeeded.
-- richard
>
> Thanks again for your help,
> Kevin
>
>
> On 01/02/2015 03:45 PM, Richard Elling wrote:
>>
>>> On Jan 2, 2015, at 1:50 PM, Kevin Swab <Kevin.Swab at colostate.edu
>>> <mailto:Kevin.Swab at colostate.edu>> wrote:
>>>
>>> I've run 'sg_logs' on the drive I pulled last week. There were a lot of
>>> errors in the background scan section of the output, which made it very
>>> large, so I put it here:
>>>
>>> http://pastebin.com/jx5BvSep
>>>
>>> When I pulled this drive, the SMART health status was OK.
>>
>> SMART isn’t smart :-P
>>
>>> However, when
>>> I put it in a test system to run 'sg_logs', the status changed to
>>> "impending failure...". Had the SMART status changed before pulling the
>>> drive, I'm sure 'fmd' would have alerted me to the problem…
>>
>> By default, fmd looks for the predictive failure (PFA) and self-test
>> every hour using the disk_transport
>> agent. fmstat should show activity there. When a PFA is seen, then there
>> will be an ereport generated
>> and, for most cases, a syslog message. However, this will not cause a
>> zfs-retire event.
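>>
>> A quick way to check that the agent is alive and whether any ereports
>> have landed (a sketch -- the module name below is how it appears on the
>> builds I've looked at):
>>
>> # fmstat | grep -i disk
>> # fmdump -e | grep -i disk
>>
>> fmstat should show the disk-transport module ticking over once an hour,
>> and fmdump -e lists the timestamped ereport classes if a PFA was ever
>> reported.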
>>
>> Vendors have significant leeway in how they implement SMART. In my
>> experience the only thing
>> you can say for sure is if the vendor thinks the drive’s death is
>> imminent, then you should replace
>> it. I suspect these policies are financially motivated rather than
>> scientific… some amount of truthiness
>> is to be expected.
>>
>> In the logs, clearly the one disk has lots of errors that have been
>> corrected and the rate is increasing.
>> The rate of change for "Errors corrected with possible delays” may
>> correlate to your performance issues,
>> but the interpretation is left up to the vendors.
>>
>> In the case of this naughty drive, yep it needs replacing.
>>
>>>
>>> Since that drive had other indications of trouble, I ran 'sg_logs' on
>>> another drive I pulled recently that has a SMART health status of OK,
>>> but exhibits similar slow service time behavior:
>>>
>>> http://pastebin.com/Q0t8Jnug
>>
>> This one looks mostly healthy.
>>
>> Another place to look for latency issues is the phy logs. In the sg_logs
>> output, this is the
>> Protocol Specific port log page for SAS SSP. Key values are running
>> disparity error
>> count and loss of dword sync count. The trick here is that you need to
>> look at both ends
>> of the wire for each wire. For a simple case, this means looking at both
>> the HBA’s phys error
>> counts and the driver. If you have expanders in the mix, it is more
>> work. You’ll want to look at
>> all of the HBA, expander, and drive phys health counters for all phys.
>>
>> This can get tricky because wide ports are mostly dumb. For example, if
>> an HBA has a 4-link
>> wide port (common) and one of the links is acting up (all too common),
>> the latency impacts
>> will be random.
>>
>> To see HBA and expander link health, you can use sg3_utils, its
>> companion smp_utils, or
>> sasinfo (installed as a separate package from OmniOS, IIRC). For example,
>> sasinfo hba-port -l
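>>
>> Putting the three ends of each wire together, the checks look roughly
>> like this (expander and disk paths are placeholders):
>>
>> HBA end: sasinfo hba-port -l
>> expander end: smp_rep_phy_err_log --phy=N /dev/smp/expdX
>> drive end: sg_logs --page=0x18 /dev/rdsk/cXtWWNd0s0
>>
>> In all three outputs you are looking at the same counters: invalid dword,
>> running disparity error, loss of dword sync, and phy reset problem.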
>>
>> HTH
>> — richard
>>
>>
>>>
>>> Thanks for taking the time to look at these, please let me know what you
>>> find...
>>>
>>> Kevin
>>>
>>>
>>>
>>>
>>> On 12/31/2014 06:13 PM, Richard Elling wrote:
>>>>
>>>>> On Dec 31, 2014, at 4:30 PM, Kevin Swab <Kevin.Swab at ColoState.EDU
>>>>> <mailto:Kevin.Swab at ColoState.EDU>> wrote:
>>>>>
>>>>> Hello Richard and group, thanks for your reply!
>>>>>
>>>>> I'll look into sg_logs for one of these devices once I have a chance to
>>>>> track that program down...
>>>>>
>>>>> Thanks for the tip on the 500 ms latency, I wasn't aware that could
>>>>> happen in normal cases. However, I don't believe what I'm seeing
>>>>> constitutes normal behavior.
>>>>>
>>>>> First, some anecdotal evidence: If I pull and replace the suspect
>>>>> drive, my downstream systems stop complaining, and the high service time
>>>>> numbers go away.
>>>>
>>>> We call these "wounded soldiers" -- it takes more resources to manage a
>>>> wounded soldier than a dead soldier, so one strategy of war is to
>>>> wound your
>>>> enemy causing them to consume resources tending the wounded. The sg_logs
>>>> should be enlightening.
>>>>
>>>> NB, consider a 4TB disk with 5 platters (10 surfaces, 10 heads): if a head
>>>> or surface starts to go, then you have a 1/10 chance that the data you
>>>> request is under the damaged head
>>>> and will need to be recovered by the drive. So it is not uncommon to see
>>>> 90+% of the I/Os to the drive completing quickly. It is also not
>>>> unusual to see
>>>> only a small number of sectors or tracks affected.
>>>>
>>>> Detecting these becomes tricky, especially as you reduce the
>>>> timeout/retry
>>>> interval, since the problem is rarely seen in the average latency --
>>>> that which
>>>> iostat and sar record. This is an area where we can and are improving.
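>>>>
>>>> If you want to see the outliers directly rather than the averages, a
>>>> small DTrace sketch along these lines (standard io-provider idiom,
>>>> untested here) quantizes per-device I/O latency so the slow tail shows
>>>> up as its own buckets:
>>>>
>>>> #!/usr/sbin/dtrace -s
>>>> /* record the start time of each I/O, keyed by buf address */
>>>> io:::start { start_ts[arg0] = timestamp; }
>>>>
>>>> /* on completion, aggregate latency (in ms) per device */
>>>> io:::done /start_ts[arg0]/
>>>> {
>>>>     @lat_ms[args[1]->dev_statname] =
>>>>         quantize((timestamp - start_ts[arg0]) / 1000000);
>>>>     start_ts[arg0] = 0;
>>>> }
>>>>
>>>> /* print and reset the histograms every 10 seconds */
>>>> tick-10s { printa(@lat_ms); trunc(@lat_ms); }
>>>>
>>>> A wounded soldier should show up as a cluster in the multi-second
>>>> buckets while its healthy neighbours stay in single-digit milliseconds.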
>>>> -- richard
>>>>
>>>>>
>>>>> I threw out 500 ms as a guess to the point at which I start seeing
>>>>> problems. However, I see service times far in excess of that, sometimes
>>>>> over 30,000 ms! Below is 20 minutes of sar output from a drive I pulled
>>>>> a few days ago, during a time when downstream VMWare servers were
>>>>> complaining. (since the sar output is so verbose, I grepped out the
>>>>> info just for the suspect drive):
>>>>>
>>>>> # sar -d -f /var/adm/sa/sa28 -s 14:50 -e 15:10 | egrep '(device|sd91,a)'
>>>>> 14:50:00 device %busy avque r+w/s blks/s avwait avserv
>>>>> sd91,a 99 5.3 1 42 0.0 7811.7
>>>>> sd91,a 100 11.3 1 53 0.0 11016.0
>>>>> sd91,a 100 3.8 1 75 0.0 3615.8
>>>>> sd91,a 100 4.9 1 25 0.0 8633.5
>>>>> sd91,a 93 3.9 1 55 0.0 4385.3
>>>>> sd91,a 86 3.5 2 75 0.0 2060.5
>>>>> sd91,a 91 3.1 4 80 0.0 823.8
>>>>> sd91,a 97 3.5 1 50 0.0 3984.5
>>>>> sd91,a 100 4.4 1 56 0.0 6068.6
>>>>> sd91,a 100 5.0 1 55 0.0 8836.0
>>>>> sd91,a 100 5.7 1 51 0.0 7939.6
>>>>> sd91,a 98 9.9 1 42 0.0 12526.8
>>>>> sd91,a 100 7.4 0 10 0.0 36813.6
>>>>> sd91,a 51 3.8 8 90 0.0 500.2
>>>>> sd91,a 88 3.4 1 60 0.0 2338.8
>>>>> sd91,a 100 4.5 1 28 0.0 6969.2
>>>>> sd91,a 93 3.8 1 59 0.0 5138.9
>>>>> sd91,a 79 3.1 1 59 0.0 3143.9
>>>>> sd91,a 99 4.7 1 52 0.0 5598.4
>>>>> sd91,a 100 4.8 1 62 0.0 6638.4
>>>>> sd91,a 94 5.0 1 54 0.0 3752.7
>>>>>
>>>>> For comparison, here's the sar output from another drive in the same
>>>>> pool for the same period of time:
>>>>>
>>>>> # sar -d -f /var/adm/sa/sa28 -s 14:50 -e 15:10 | egrep '(device|sd82,a)'
>>>>> 14:50:00 device %busy avque r+w/s blks/s avwait avserv
>>>>> sd82,a 0 0.0 2 28 0.0 5.6
>>>>> sd82,a 1 0.0 3 51 0.0 5.4
>>>>> sd82,a 1 0.0 4 66 0.0 6.3
>>>>> sd82,a 1 0.0 3 48 0.0 4.3
>>>>> sd82,a 1 0.0 3 45 0.0 6.1
>>>>> sd82,a 1 0.0 6 82 0.0 2.7
>>>>> sd82,a 1 0.0 8 112 0.0 2.8
>>>>> sd82,a 0 0.0 3 27 0.0 1.8
>>>>> sd82,a 1 0.0 5 80 0.0 3.1
>>>>> sd82,a 0 0.0 3 35 0.0 3.1
>>>>> sd82,a 1 0.0 3 35 0.0 3.8
>>>>> sd82,a 1 0.0 4 49 0.0 3.2
>>>>> sd82,a 0 0.0 0 0 0.0 4.1
>>>>> sd82,a 3 0.0 9 84 0.0 4.1
>>>>> sd82,a 1 0.0 6 55 0.0 3.7
>>>>> sd82,a 0 0.0 1 23 0.0 7.0
>>>>> sd82,a 0 0.0 6 57 0.0 1.8
>>>>> sd82,a 1 0.0 5 70 0.0 2.3
>>>>> sd82,a 1 0.0 4 55 0.0 3.7
>>>>> sd82,a 1 0.0 5 72 0.0 4.1
>>>>> sd82,a 1 0.0 4 54 0.0 3.6
>>>>>
>>>>> The other drives in this pool all show data similar to that of sd82.
>>>>>
>>>>> Your point about tuning blindly is well taken, and I'm certainly no
>>>>> expert on the IO stack. What's a humble sysadmin to do?
>>>>>
>>>>> For further reference, this system is running r151010. The drive in
>>>>> question is a Toshiba MG03SCA300 (7200rpm SAS), and the pool the drive
>>>>> was in is using lz4 compression and looks like this:
>>>>>
>>>>> # zpool status data1
>>>>> pool: data1
>>>>> state: ONLINE
>>>>> scan: resilvered 1.67T in 70h56m with 0 errors on Wed Dec 31
>>>>> 14:40:20 2014
>>>>> config:
>>>>>
>>>>> NAME STATE READ WRITE CKSUM
>>>>> data1 ONLINE 0 0 0
>>>>> raidz2-0 ONLINE 0 0 0
>>>>> c6t5000039468CB54F0d0 ONLINE 0 0 0
>>>>> c6t5000039478CB5138d0 ONLINE 0 0 0
>>>>> c6t5000039468D000DCd0 ONLINE 0 0 0
>>>>> c6t5000039468D000E8d0 ONLINE 0 0 0
>>>>> c6t5000039468D00F5Cd0 ONLINE 0 0 0
>>>>> c6t5000039478C816CCd0 ONLINE 0 0 0
>>>>> c6t5000039478C8546Cd0 ONLINE 0 0 0
>>>>> raidz2-1 ONLINE 0 0 0
>>>>> c6t5000039478C855F0d0 ONLINE 0 0 0
>>>>> c6t5000039478C856E8d0 ONLINE 0 0 0
>>>>> c6t5000039478C856ECd0 ONLINE 0 0 0
>>>>> c6t5000039478C856F4d0 ONLINE 0 0 0
>>>>> c6t5000039478C86374d0 ONLINE 0 0 0
>>>>> c6t5000039478C8C2A8d0 ONLINE 0 0 0
>>>>> c6t5000039478C8C364d0 ONLINE 0 0 0
>>>>> raidz2-2 ONLINE 0 0 0
>>>>> c6t5000039478C9958Cd0 ONLINE 0 0 0
>>>>> c6t5000039478C995C4d0 ONLINE 0 0 0
>>>>> c6t5000039478C9DACCd0 ONLINE 0 0 0
>>>>> c6t5000039478C9DB30d0 ONLINE 0 0 0
>>>>> c6t5000039478C9DB6Cd0 ONLINE 0 0 0
>>>>> c6t5000039478CA73B4d0 ONLINE 0 0 0
>>>>> c6t5000039478CB3A20d0 ONLINE 0 0 0
>>>>> raidz2-3 ONLINE 0 0 0
>>>>> c6t5000039478CB3A64d0 ONLINE 0 0 0
>>>>> c6t5000039478CB3A70d0 ONLINE 0 0 0
>>>>> c6t5000039478CB3E7Cd0 ONLINE 0 0 0
>>>>> c6t5000039478CB3EB0d0 ONLINE 0 0 0
>>>>> c6t5000039478CB3FBCd0 ONLINE 0 0 0
>>>>> c6t5000039478CB4048d0 ONLINE 0 0 0
>>>>> c6t5000039478CB4054d0 ONLINE 0 0 0
>>>>> raidz2-4 ONLINE 0 0 0
>>>>> c6t5000039478CB424Cd0 ONLINE 0 0 0
>>>>> c6t5000039478CB4250d0 ONLINE 0 0 0
>>>>> c6t5000039478CB470Cd0 ONLINE 0 0 0
>>>>> c6t5000039478CB471Cd0 ONLINE 0 0 0
>>>>> c6t5000039478CB4E50d0 ONLINE 0 0 0
>>>>> c6t5000039478CB50A8d0 ONLINE 0 0 0
>>>>> c6t5000039478CB50BCd0 ONLINE 0 0 0
>>>>> spares
>>>>> c6t50000394A8CBC93Cd0 AVAIL
>>>>>
>>>>> errors: No known data errors
>>>>>
>>>>>
>>>>> Thanks for your help,
>>>>> Kevin
>>>>>
>>>>> On 12/31/2014 3:22 PM, Richard Elling wrote:
>>>>>>
>>>>>>> On Dec 31, 2014, at 11:25 AM, Kevin Swab <Kevin.Swab at colostate.edu
>>>>>>> <mailto:Kevin.Swab at colostate.edu>> wrote:
>>>>>>>
>>>>>>> Hello Everyone,
>>>>>>>
>>>>>>> We've been running OmniOS on a number of SuperMicro 36-bay chassis,
>>>>>>> with
>>>>>>> Supermicro motherboards, LSI SAS controllers (9211-8i & 9207-8i) and
>>>>>>> various SAS HDD's. These systems are serving block storage via
>>>>>>> Comstar
>>>>>>> and Qlogic FC HBA's, and have been running well for several years.
>>>>>>>
>>>>>>> The problem we've got is that as the drives age, some of them start to
>>>>>>> perform slowly (intermittently) without failing - no zpool or iostat
>>>>>>> errors, and nothing logged in /var/adm/messages. The slow performance
>>>>>>> can be seen as high average service times in iostat or sar.
>>>>>>
>>>>>> Look at the drive's error logs using sg_logs (-a for all)
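>>>>>>
>>>>>> Something like (the path is a placeholder for the suspect drive):
>>>>>>
>>>>>> # sg_logs -a /dev/rdsk/cXtWWNd0s0
>>>>>>
>>>>>> The read/write/verify error-counter pages and the background scan
>>>>>> results are the parts worth staring at for a drive that is slow but
>>>>>> not failing outright.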
>>>>>>
>>>>>>>
>>>>>>> When these service times get above 500ms, they start to cause IO
>>>>>>> timeouts on the downstream storage consumers, which is bad...
>>>>>>
>>>>>> 500 milliseconds is not unusual for a busy HDD with SCSI TCQ or
>>>>>> SATA NCQ.
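>>>>>>
>>>>>> Rough back-of-envelope: a 7200rpm drive manages on the order of
>>>>>> 100-150 random IOPS, i.e. roughly 7-10 ms per command, and TCQ/NCQ
>>>>>> lets 32 or more commands queue inside the drive. A request landing at
>>>>>> the back of a full queue can therefore wait around 32 x 10 ms = 320 ms
>>>>>> before it even starts, so ~500 ms service times on a saturated but
>>>>>> healthy disk are plausible.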
>>>>>>
>>>>>>>
>>>>>>> I'm wondering - is there a way to tune OmniOS' behavior so that it
>>>>>>> doesn't try so hard to complete IOs to these slow disks, and instead
>>>>>>> just gives up and fails them?
>>>>>>
>>>>>> Yes, the tuning in Alasdair's blog should work as he describes.
>>>>>> More below...
>>>>>>
>>>>>>>
>>>>>>> I found an old post from 2011 which states that some tunables exist,
>>>>>>> but are ignored by the mpt_sas driver:
>>>>>>>
>>>>>>> http://everycity.co.uk/alasdair/2011/05/adjusting-drive-timeouts-with-mdb-on-solaris-or-openindiana/
>>>>>>>
>>>>>>> Does anyone know the current status of these tunables, or have any
>>>>>>> other
>>>>>>> suggestions that might help?
>>>>>>
>>>>>> These tunables are on the order of seconds. The default, 60, is
>>>>>> obviously too big
>>>>>> unless you have old, slow, SCSI CD-ROMs. But setting it below the
>>>>>> manufacturer's
>>>>>> internal limit (default or tuned) can lead to an unstable system.
>>>>>> Some vendors are
>>>>>> better than others at documenting these, but in any case you'll
>>>>>> need to see their spec.
>>>>>> Expect values on the order of 6 to 15 seconds for modern HDDs and SSDs.
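>>>>>>
>>>>>> A sketch, assuming the tunable in question is the sd driver's command
>>>>>> timeout, sd_io_time (default 60 seconds). To check the current value
>>>>>> on a live system:
>>>>>>
>>>>>> # echo "sd_io_time/D" | mdb -k
>>>>>>
>>>>>> and to make a change persist across reboots, add something like
>>>>>>
>>>>>> set sd:sd_io_time = 15
>>>>>>
>>>>>> to /etc/system. Whether mpt_sas honours the related retry tunables is
>>>>>> the question your blog link raises, so test on a non-production box
>>>>>> first.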
>>>>>>
>>>>>> There are a lot of tunables in this area at all levels of the
>>>>>> architecture. OOB, the OmniOS
>>>>>> settings ensure stable behaviour. Tuning any layer without
>>>>>> understanding the others can
>>>>>> lead to unstable systems, as demonstrated by your current
>>>>>> downstream consumers.
>>>>>> -- richard
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kevin
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> -------------------------------------------------------------------
>>>>>>> Kevin Swab UNIX Systems Administrator
>>>>>>> ACNS Colorado State University
>>>>>>> Phone: (970)491-6572 Email:
>>>>>>> Kevin.Swab at ColoState.EDU <mailto:Kevin.Swab at ColoState.EDU>
>>>>>>> GPG Fingerprint: 7026 3F66 A970 67BD 6F17 8EB8 8A7D 142F 2392 791C
>>>>>>> _______________________________________________
>>>>>>> OmniOS-discuss mailing list
>>>>>>> OmniOS-discuss at lists.omniti.com
>>>>>>> <mailto:OmniOS-discuss at lists.omniti.com>
>>>>>>> http://lists.omniti.com/mailman/listinfo/omnios-discuss
>>>>>
>>>
>>> --
>>> -------------------------------------------------------------------
>>> Kevin Swab UNIX Systems Administrator
>>> ACNS Colorado State University
>>> Phone: (970)491-6572 Email: Kevin.Swab at ColoState.EDU
>>> <mailto:Kevin.Swab at ColoState.EDU>
>>> GPG Fingerprint: 7026 3F66 A970 67BD 6F17 8EB8 8A7D 142F 2392 791C
>>
>> --
>>
>> Richard.Elling at RichardElling.com <mailto:Richard.Elling at RichardElling.com>
>> +1-760-896-4422
>>
>>
>>
>
> --
> -------------------------------------------------------------------
> Kevin Swab UNIX Systems Administrator
> ACNS Colorado State University
> Phone: (970)491-6572 Email: Kevin.Swab at ColoState.EDU
> GPG Fingerprint: 7026 3F66 A970 67BD 6F17 8EB8 8A7D 142F 2392 791C