[OmniOS-discuss] slow drive response times
Kevin Swab
Kevin.Swab at ColoState.EDU
Tue Jan 6 20:18:22 UTC 2015
SAS expanders are involved in my systems, so I installed 'sasinfo' and
'smp_utils'. After a bit of poking around in the dark, I came up with
the following commands which I think get at the error counters you
mentioned.
Unfortunately, I had to remove the "wounded soldier" from this system
since it was causing problems. This output is from the same slot, but
with a healthy replacement drive:
# sasinfo hba-port -a SUNW-mpt_sas-1 -l
HBA Name: SUNW-mpt_sas-1
HBA Port Name: /dev/cfg/c7
Phy Information:
Identifier: 0
Link Error Statistics:
Invalid Dword: 0
Running Disparity Error: 0
Loss of Dword Sync: 0
Reset Problem: 0
Identifier: 1
Link Error Statistics:
Invalid Dword: 0
Running Disparity Error: 0
Loss of Dword Sync: 0
Reset Problem: 0
Identifier: 2
Link Error Statistics:
Invalid Dword: 0
Running Disparity Error: 0
Loss of Dword Sync: 0
Reset Problem: 0
Identifier: 3
Link Error Statistics:
Invalid Dword: 0
Running Disparity Error: 0
Loss of Dword Sync: 0
Reset Problem: 0
HBA Port Name: /dev/cfg/c8
Phy Information:
Identifier: 4
Link Error Statistics:
Invalid Dword: 0
Running Disparity Error: 0
Loss of Dword Sync: 0
Reset Problem: 0
Identifier: 5
Link Error Statistics:
Invalid Dword: 0
Running Disparity Error: 0
Loss of Dword Sync: 0
Reset Problem: 0
Identifier: 6
Link Error Statistics:
Invalid Dword: 0
Running Disparity Error: 0
Loss of Dword Sync: 0
Reset Problem: 0
Identifier: 7
Link Error Statistics:
Invalid Dword: 0
Running Disparity Error: 0
Loss of Dword Sync: 0
# ./smp_discover /dev/smp/expd9 | egrep '(c982|c983)'
phy 26:U:attached:[50000394a8cbc982:00 t(SSP)] 6 Gbps
# ./smp_discover /dev/smp/expd11 | egrep '(c982|c983)'
phy 26:U:attached:[50000394a8cbc983:01 t(SSP)] 6 Gbps
# ./smp_rep_phy_err_log --phy=26 /dev/smp/expd9
Report phy error log response:
Expander change count: 228
phy identifier: 26
invalid dword count: 0
running disparity error count: 0
loss of dword synchronization count: 0
phy reset problem count: 0
# ./smp_rep_phy_err_log --phy=26 /dev/smp/expd11
Report phy error log response:
Expander change count: 228
phy identifier: 26
invalid dword count: 0
running disparity error count: 0
loss of dword synchronization count: 0
phy reset problem count: 0
#
"disparity error count" and "loss of dword sync count" are 0 in all of
this output, in contrast with the non-zero values seen in the sg_logs
output for the "wounded soldier".
Am I looking at the right output? Does "phy" in the above commands
refer to the HDD itself or the port on the expander it's connected to?
Had I been able to run the above commands with the "wounded soldier"
still installed, what should I have been looking for?
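If it would help for next time, my plan is to snapshot those counters, let the
suspect drive run under load for a while, and then diff them to see which ones
climb - something along these lines (reusing the expander path and phy number
from the example above):
# ./smp_rep_phy_err_log --phy=26 /dev/smp/expd9 > before.txt
# sleep 600
# ./smp_rep_phy_err_log --phy=26 /dev/smp/expd9 > after.txt
# diff before.txt after.txt
Does that sound like the right approach?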
Thanks again for your help,
Kevin
On 01/02/2015 03:45 PM, Richard Elling wrote:
>
>> On Jan 2, 2015, at 1:50 PM, Kevin Swab <Kevin.Swab at colostate.edu> wrote:
>>
>> I've run 'sg_logs' on the drive I pulled last week. There were a lot of
>> errors in the background scan section of the output, which made it very
>> large, so I put it here:
>>
>> http://pastebin.com/jx5BvSep
>>
>> When I pulled this drive, the SMART health status was OK.
>
> SMART isn’t smart :-P
>
>> However, when
>> I put it in a test system to run 'sg_logs', the status changed to
>> "impending failure...". Had the SMART status changed before pulling the
>> drive, I'm sure 'fmd' would have alerted me to the problem…
>
> By default, fmd looks for the predictive failure (PFA) and self-test
> every hour using the disk_transport
> agent. fmstat should show activity there. When a PFA is seen, then there
> will be an ereport generated
> and, for most cases, a syslog message. However, this will not cause a
> zfs-retire event.
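>
> For example (the module and class names can vary a bit between releases, so
> treat this as a sketch):
>
> # fmstat | grep disk
> # fmdump -e
>
> The first shows whether the disk_transport agent is doing its hourly polls;
> the second lists any ereports it has posted.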
>
> Vendors have significant leeway in how they implement SMART. In my
> experience the only thing
> you can say for sure is if the vendor thinks the drive’s death is
> imminent, then you should replace
> it. I suspect these policies are financially motivated rather than
> scientific… some amount of truthiness
> is to be expected.
>
> In the logs, clearly the one disk has lots of errors that have been
> corrected and the rate is increasing.
> The rate of change for "Errors corrected with possible delays” may
> correlate to your performance issues,
> but the interpretation is left up to the vendors.
>
> In the case of this naughty drive, yep it needs replacing.
>
>>
>> Since that drive had other indications of trouble, I ran 'sg_logs' on
>> another drive I pulled recently that has a SMART health status of OK,
>> but exhibits similar slow service time behavior:
>>
>> http://pastebin.com/Q0t8Jnug
>
> This one looks mostly healthy.
>
> Another place to look for latency issues is the phy logs. In the sg_logs
> output, this is the
> Protocol Specific port log page for SAS SSP. Key values are running
> disparity error
> count and loss of dword sync count. The trick here is that you need to
> look at both ends
> of the wire for each wire. For a simple case, this means looking at both
> the HBA’s phy error
> counts and the drive’s. If you have expanders in the mix, it is more
> work. You’ll want to look at
> all of the HBA, expander, and drive phys health counters for all phys.
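>
> A quick way to sweep the drive side (device paths here are illustrative, and
> the exact field names can differ between sg3_utils versions):
>
> # for d in /dev/rdsk/c6t*d0s0; do echo $d; sg_logs --page=0x18 $d | egrep -i 'disparity|dword sync'; done
>
> and compare those counts against what the HBA (sasinfo) and the expanders
> (smp_rep_phy_err_log) report for the same links.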
>
> This can get tricky because wide ports are mostly dumb. For example, if
> an HBA has a 4-link
> wide port (common) and one of the links is acting up (all too common),
> the latency impacts
> will be random.
>
> To see HBA and expander link health, you can use sg3_utils, its
> companion smp_utils, or
> sasinfo (installed as a separate package from OmniOS, IIRC). For example,
> sasinfo hba-port -l
>
> HTH
> — richard
>
>
>>
>> Thanks for taking the time to look at these, please let me know what you
>> find...
>>
>> Kevin
>>
>>
>>
>>
>> On 12/31/2014 06:13 PM, Richard Elling wrote:
>>>
>>>> On Dec 31, 2014, at 4:30 PM, Kevin Swab <Kevin.Swab at ColoState.EDU> wrote:
>>>>
>>>> Hello Richard and group, thanks for your reply!
>>>>
>>>> I'll look into sg_logs for one of these devices once I have a chance to
>>>> track that program down...
>>>>
>>>> Thanks for the tip on the 500 ms latency, I wasn't aware that could
>>>> happen in normal cases. However, I don't believe what I'm seeing
>>>> constitutes normal behavior.
>>>>
>>>> First, some anecdotal evidence: If I pull and replace the suspect
>>>> drive, my downstream systems stop complaining, and the high service time
>>>> numbers go away.
>>>
>>> We call these "wounded soldiers" -- it takes more resources to manage a
>>> wounded soldier than a dead soldier, so one strategy of war is to
>>> wound your
>>> enemy causing them to consume resources tending the wounded. The sg_logs
>>> should be enlightening.
>>>
>>> NB, consider a 4TB disk with 5 platters: if a head or surface starts
>>> to go, then
>>> you have a 1/10 chance that the data you request is under the
>>> damaged head
>>> and will need to be recovered by the drive. So it is not uncommon to see
>>> 90+% of the I/Os to the drive completing quickly. It is also not
>>> unusual to see
>>> only a small number of sectors or tracks affected.
>>>
>>> Detecting these becomes tricky, especially as you reduce the
>>> timeout/retry
>>> interval, since the problem is rarely seen in the average latency --
>>> that which
>>> iostat and sar record. This is an area where we can and are improving.
>>> -- richard
>>>
>>>>
>>>> I threw out 500 ms as a guess at the point at which I start seeing
>>>> problems. However, I see service times far in excess of that, sometimes
>>>> over 30,000 ms! Below is 20 minutes of sar output from a drive I pulled
>>>> a few days ago, during a time when downstream VMWare servers were
>>>> complaining. (since the sar output is so verbose, I grepped out the
>>>> info just for the suspect drive):
>>>>
>>>> # sar -d -f /var/adm/sa/sa28 -s 14:50 -e 15:10 | egrep '(device|sd91,a)'
>>>> 14:50:00 device %busy avque r+w/s blks/s avwait avserv
>>>> sd91,a 99 5.3 1 42 0.0 7811.7
>>>> sd91,a 100 11.3 1 53 0.0 11016.0
>>>> sd91,a 100 3.8 1 75 0.0 3615.8
>>>> sd91,a 100 4.9 1 25 0.0 8633.5
>>>> sd91,a 93 3.9 1 55 0.0 4385.3
>>>> sd91,a 86 3.5 2 75 0.0 2060.5
>>>> sd91,a 91 3.1 4 80 0.0 823.8
>>>> sd91,a 97 3.5 1 50 0.0 3984.5
>>>> sd91,a 100 4.4 1 56 0.0 6068.6
>>>> sd91,a 100 5.0 1 55 0.0 8836.0
>>>> sd91,a 100 5.7 1 51 0.0 7939.6
>>>> sd91,a 98 9.9 1 42 0.0 12526.8
>>>> sd91,a 100 7.4 0 10 0.0 36813.6
>>>> sd91,a 51 3.8 8 90 0.0 500.2
>>>> sd91,a 88 3.4 1 60 0.0 2338.8
>>>> sd91,a 100 4.5 1 28 0.0 6969.2
>>>> sd91,a 93 3.8 1 59 0.0 5138.9
>>>> sd91,a 79 3.1 1 59 0.0 3143.9
>>>> sd91,a 99 4.7 1 52 0.0 5598.4
>>>> sd91,a 100 4.8 1 62 0.0 6638.4
>>>> sd91,a 94 5.0 1 54 0.0 3752.7
>>>>
>>>> For comparison, here's the sar output from another drive in the same
>>>> pool for the same period of time:
>>>>
>>>> # sar -d -f /var/adm/sa/sa28 -s 14:50 -e 15:10 | egrep '(device|sd82,a)'
>>>> 14:50:00 device %busy avque r+w/s blks/s avwait avserv
>>>> sd82,a 0 0.0 2 28 0.0 5.6
>>>> sd82,a 1 0.0 3 51 0.0 5.4
>>>> sd82,a 1 0.0 4 66 0.0 6.3
>>>> sd82,a 1 0.0 3 48 0.0 4.3
>>>> sd82,a 1 0.0 3 45 0.0 6.1
>>>> sd82,a 1 0.0 6 82 0.0 2.7
>>>> sd82,a 1 0.0 8 112 0.0 2.8
>>>> sd82,a 0 0.0 3 27 0.0 1.8
>>>> sd82,a 1 0.0 5 80 0.0 3.1
>>>> sd82,a 0 0.0 3 35 0.0 3.1
>>>> sd82,a 1 0.0 3 35 0.0 3.8
>>>> sd82,a 1 0.0 4 49 0.0 3.2
>>>> sd82,a 0 0.0 0 0 0.0 4.1
>>>> sd82,a 3 0.0 9 84 0.0 4.1
>>>> sd82,a 1 0.0 6 55 0.0 3.7
>>>> sd82,a 0 0.0 1 23 0.0 7.0
>>>> sd82,a 0 0.0 6 57 0.0 1.8
>>>> sd82,a 1 0.0 5 70 0.0 2.3
>>>> sd82,a 1 0.0 4 55 0.0 3.7
>>>> sd82,a 1 0.0 5 72 0.0 4.1
>>>> sd82,a 1 0.0 4 54 0.0 3.6
>>>>
>>>> The other drives in this pool all show data similar to that of sd82.
>>>>
>>>> Your point about tuning blindly is well taken, and I'm certainly no
>>>> expert on the IO stack. What's a humble sysadmin to do?
>>>>
>>>> For further reference, this system is running r151010. The drive in
>>>> question is a Toshiba MG03SCA300 (7200rpm SAS), and the pool the drive
>>>> was in is using lz4 compression and looks like this:
>>>>
>>>> # zpool status data1
>>>> pool: data1
>>>> state: ONLINE
>>>> scan: resilvered 1.67T in 70h56m with 0 errors on Wed Dec 31
>>>> 14:40:20 2014
>>>> config:
>>>>
>>>> NAME STATE READ WRITE CKSUM
>>>> data1 ONLINE 0 0 0
>>>> raidz2-0 ONLINE 0 0 0
>>>> c6t5000039468CB54F0d0 ONLINE 0 0 0
>>>> c6t5000039478CB5138d0 ONLINE 0 0 0
>>>> c6t5000039468D000DCd0 ONLINE 0 0 0
>>>> c6t5000039468D000E8d0 ONLINE 0 0 0
>>>> c6t5000039468D00F5Cd0 ONLINE 0 0 0
>>>> c6t5000039478C816CCd0 ONLINE 0 0 0
>>>> c6t5000039478C8546Cd0 ONLINE 0 0 0
>>>> raidz2-1 ONLINE 0 0 0
>>>> c6t5000039478C855F0d0 ONLINE 0 0 0
>>>> c6t5000039478C856E8d0 ONLINE 0 0 0
>>>> c6t5000039478C856ECd0 ONLINE 0 0 0
>>>> c6t5000039478C856F4d0 ONLINE 0 0 0
>>>> c6t5000039478C86374d0 ONLINE 0 0 0
>>>> c6t5000039478C8C2A8d0 ONLINE 0 0 0
>>>> c6t5000039478C8C364d0 ONLINE 0 0 0
>>>> raidz2-2 ONLINE 0 0 0
>>>> c6t5000039478C9958Cd0 ONLINE 0 0 0
>>>> c6t5000039478C995C4d0 ONLINE 0 0 0
>>>> c6t5000039478C9DACCd0 ONLINE 0 0 0
>>>> c6t5000039478C9DB30d0 ONLINE 0 0 0
>>>> c6t5000039478C9DB6Cd0 ONLINE 0 0 0
>>>> c6t5000039478CA73B4d0 ONLINE 0 0 0
>>>> c6t5000039478CB3A20d0 ONLINE 0 0 0
>>>> raidz2-3 ONLINE 0 0 0
>>>> c6t5000039478CB3A64d0 ONLINE 0 0 0
>>>> c6t5000039478CB3A70d0 ONLINE 0 0 0
>>>> c6t5000039478CB3E7Cd0 ONLINE 0 0 0
>>>> c6t5000039478CB3EB0d0 ONLINE 0 0 0
>>>> c6t5000039478CB3FBCd0 ONLINE 0 0 0
>>>> c6t5000039478CB4048d0 ONLINE 0 0 0
>>>> c6t5000039478CB4054d0 ONLINE 0 0 0
>>>> raidz2-4 ONLINE 0 0 0
>>>> c6t5000039478CB424Cd0 ONLINE 0 0 0
>>>> c6t5000039478CB4250d0 ONLINE 0 0 0
>>>> c6t5000039478CB470Cd0 ONLINE 0 0 0
>>>> c6t5000039478CB471Cd0 ONLINE 0 0 0
>>>> c6t5000039478CB4E50d0 ONLINE 0 0 0
>>>> c6t5000039478CB50A8d0 ONLINE 0 0 0
>>>> c6t5000039478CB50BCd0 ONLINE 0 0 0
>>>> spares
>>>> c6t50000394A8CBC93Cd0 AVAIL
>>>>
>>>> errors: No known data errors
>>>>
>>>>
>>>> Thanks for your help,
>>>> Kevin
>>>>
>>>> On 12/31/2014 3:22 PM, Richard Elling wrote:
>>>>>
>>>>>> On Dec 31, 2014, at 11:25 AM, Kevin Swab <Kevin.Swab at colostate.edu> wrote:
>>>>>>
>>>>>> Hello Everyone,
>>>>>>
>>>>>> We've been running OmniOS on a number of SuperMicro 36-bay chassis, with
>>>>>> Supermicro motherboards, LSI SAS controllers (9211-8i & 9207-8i), and
>>>>>> various SAS HDDs. These systems are serving block storage via Comstar
>>>>>> and QLogic FC HBAs, and have been running well for several years.
>>>>>>
>>>>>> The problem we've got is that as the drives age, some of them start to
>>>>>> perform slowly (intermittently) without failing - no zpool or iostat
>>>>>> errors, and nothing logged in /var/adm/messages. The slow performance
>>>>>> can be seen as high average service times in iostat or sar.
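>>>>>>
>>>>>> For example, one way we watch it live (the interval is arbitrary):
>>>>>>
>>>>>> # iostat -xnz 10
>>>>>>
>>>>>> keeping an eye on the asvc_t column for the suspect drive.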
>>>>>
>>>>> Look at the drive's error logs using sg_logs (-a for all)
>>>>>
>>>>>>
>>>>>> When these service times get above 500ms, they start to cause IO
>>>>>> timeouts on the downstream storage consumers, which is bad...
>>>>>
>>>>> 500 milliseconds is not unusual for a busy HDD with SCSI TCQ or
>>>>> SATA NCQ
>>>>>
>>>>>>
>>>>>> I'm wondering - is there a way to tune OmniOS' behavior so that it
>>>>>> doesn't try so hard to complete IOs to these slow disks, and instead
>>>>>> just gives up and fails them?
>>>>>
>>>>> Yes, the tuning in Alasdair's blog should work as he describes.
>>>>> More below...
>>>>>
>>>>>>
>>>>>> I found an old post from 2011 which states that some tunables exist,
>>>>>> but are ignored by the mpt_sas driver:
>>>>>>
>>>>>> http://everycity.co.uk/alasdair/2011/05/adjusting-drive-timeouts-with-mdb-on-solaris-or-openindiana/
>>>>>>
>>>>>> Does anyone know the current status of these tunables, or have any
>>>>>> other
>>>>>> suggestions that might help?
>>>>>
>>>>> These tunables are on the order of seconds. The default, 60, is
>>>>> obviously too big
>>>>> unless you have old, slow, SCSI CD-ROMs. But setting it below the
>>>>> manufacturer's
>>>>> internal limit (default or tuned) can lead to an unstable system.
>>>>> Some vendors are
>>>>> better than others at documenting these, but in any case you'll
>>>>> need to see their spec.
>>>>> Expect values on the order of 6 to 15 seconds for modern HDDs and SSDs.
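>>>>>
>>>>> If you do go that route, the knob most people reach for is the sd driver's
>>>>> global command timeout. As a sketch only (check your release and your
>>>>> drives' specs before changing anything), the current value can be read with:
>>>>>
>>>>> # echo 'sd_io_time/D' | mdb -k
>>>>>
>>>>> and a persistent change would be a line like this in /etc/system:
>>>>>
>>>>> set sd:sd_io_time = 30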
>>>>>
>>>>> There are a lot of tunables in this area at all levels of the
>>>>> architecture. OOB, the OmniOS
>>>>> settings ensure stable behaviour. Tuning any layer without
>>>>> understanding the others can
>>>>> lead to unstable systems, as demonstrated by your current
>>>>> downstream consumers.
>>>>> -- richard
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Kevin
>>>>>>
>>>>>>
>>>>>> --
>>>>>> -------------------------------------------------------------------
>>>>>> Kevin Swab UNIX Systems Administrator
>>>>>> ACNS Colorado State University
>>>>>> Phone: (970)491-6572 Email: Kevin.Swab at ColoState.EDU
>>>>>> GPG Fingerprint: 7026 3F66 A970 67BD 6F17 8EB8 8A7D 142F 2392 791C
>>>>>> _______________________________________________
>>>>>> OmniOS-discuss mailing list
>>>>>> OmniOS-discuss at lists.omniti.com
>>>>>> http://lists.omniti.com/mailman/listinfo/omnios-discuss
>>>>
>>
>> --
>> -------------------------------------------------------------------
>> Kevin Swab UNIX Systems Administrator
>> ACNS Colorado State University
>> Phone: (970)491-6572 Email: Kevin.Swab at ColoState.EDU
>> GPG Fingerprint: 7026 3F66 A970 67BD 6F17 8EB8 8A7D 142F 2392 791C
>
> --
>
> Richard.Elling at RichardElling.com
> +1-760-896-4422
>
>
>
--
-------------------------------------------------------------------
Kevin Swab UNIX Systems Administrator
ACNS Colorado State University
Phone: (970)491-6572 Email: Kevin.Swab at ColoState.EDU
GPG Fingerprint: 7026 3F66 A970 67BD 6F17 8EB8 8A7D 142F 2392 791C