[OmniOS-discuss] slow drive response times

Richard Elling richard.elling at richardelling.com
Tue Jan 6 21:23:14 UTC 2015


> On Jan 6, 2015, at 12:18 PM, Kevin Swab <Kevin.Swab at ColoState.EDU> wrote:
> 
> SAS expanders are involved in my systems, so I installed 'sasinfo' and
> 'smp_utils'.  After a bit of poking around in the dark, I came up with
> the following commands which I think get at the error counters you
> mentioned.

Yes, this data looks fine.

> 
> Unfortunately, I had to remove the "wounded soldier" from this system
> since it was causing problems.  This output is from the same slot, but
> with a healthy replacement drive:
> 
> # sasinfo hba-port -a SUNW-mpt_sas-1 -l
> HBA Name: SUNW-mpt_sas-1
>  HBA Port Name: /dev/cfg/c7
>    Phy Information:
>      Identifier: 0
>        Link Error Statistics:
>            Invalid Dword: 0
>            Running Disparity Error: 0
>            Loss of Dword Sync: 0
>            Reset Problem: 0
>      Identifier: 1
>        Link Error Statistics:
>            Invalid Dword: 0
>            Running Disparity Error: 0
>            Loss of Dword Sync: 0
>            Reset Problem: 0
>      Identifier: 2
>        Link Error Statistics:
>            Invalid Dword: 0
>            Running Disparity Error: 0
>            Loss of Dword Sync: 0
>            Reset Problem: 0
>      Identifier: 3
>        Link Error Statistics:
>            Invalid Dword: 0
>            Running Disparity Error: 0
>            Loss of Dword Sync: 0
>            Reset Problem: 0

perfect!

>  HBA Port Name: /dev/cfg/c8
>    Phy Information:
>      Identifier: 4
>        Link Error Statistics:
>            Invalid Dword: 0
>            Running Disparity Error: 0
>            Loss of Dword Sync: 0
>            Reset Problem: 0
>      Identifier: 5
>        Link Error Statistics:
>            Invalid Dword: 0
>            Running Disparity Error: 0
>            Loss of Dword Sync: 0
>            Reset Problem: 0
>      Identifier: 6
>        Link Error Statistics:
>            Invalid Dword: 0
>            Running Disparity Error: 0
>            Loss of Dword Sync: 0
>            Reset Problem: 0
>      Identifier: 7
>        Link Error Statistics:
>            Invalid Dword: 0
>            Running Disparity Error: 0
>            Loss of Dword Sync: 0

perfect!

> 
> 
> 
> # ./smp_discover /dev/smp/expd9 | egrep '(c982|c983)'
>  phy  26:U:attached:[50000394a8cbc982:00  t(SSP)]  6 Gbps
> # ./smp_discover /dev/smp/expd11 | egrep '(c982|c983)'
>  phy  26:U:attached:[50000394a8cbc983:01  t(SSP)]  6 Gbps
> # ./smp_rep_phy_err_log --phy=26 /dev/smp/expd9
> Report phy error log response:
>  Expander change count: 228
>  phy identifier: 26
>  invalid dword count: 0
>  running disparity error count: 0
>  loss of dword synchronization count: 0
>  phy reset problem count: 0
> # ./smp_rep_phy_err_log --phy=26 /dev/smp/expd11
> Report phy error log response:
>  Expander change count: 228
>  phy identifier: 26
>  invalid dword count: 0
>  running disparity error count: 0
>  loss of dword synchronization count: 0
>  phy reset problem count: 0
> #
> 
> "disparity error count" and "loss of dword sync count" are 0 in all of
> this output, in contrast with the non-zero values seen in the sg_logs
> output for the "wounded soldier".

perfect!

> 
> Am I looking at the right output?

Yes, this is not showing any errors, which is a good thing.

>  Does "phy" in the above commands
> refer to the HDD itself or the port on the expander it's connected to?

Expander port. The HDD's view is in the output of sg_logs --page=0x18 /dev/rdsk/...
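
For example, a rough sketch (substitute the suspect drive's actual rdsk
path for the placeholder below):

  # sg_logs --page=0x18 /dev/rdsk/<suspect-disk>

That prints the Protocol Specific port page from the drive's side of the
link; its invalid dword, running disparity error, and loss of dword sync
counters are the drive's counterparts to the expander phy counters above.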

> Had I been able to run the above commands with the "wounded soldier"
> still installed, what should I have been looking for?

The process is to rule out errors. You have succeeded.
 -- richard

> 
> Thanks again for your help,
> Kevin
> 
> 
> On 01/02/2015 03:45 PM, Richard Elling wrote:
>> 
>>> On Jan 2, 2015, at 1:50 PM, Kevin Swab <Kevin.Swab at colostate.edu
>>> <mailto:Kevin.Swab at colostate.edu>> wrote:
>>> 
>>> I've run 'sg_logs' on the drive I pulled last week.  There were a lot of
>>> errors in the background scan section of the output, which made it very
>>> large, so I put it here:
>>> 
>>> http://pastebin.com/jx5BvSep
>>> 
>>> When I pulled this drive, the SMART health status was OK.  
>> 
>> SMART isn’t smart :-P
>> 
>>> However, when
>>> I put it in a test system to run 'sg_logs', the status changed to
>>> "impending failure...".  Had the SMART status changed before pulling the
>>> drive, I'm sure 'fmd' would have alerted me to the problem…
>> 
>> By default, fmd looks for the predictive failure (PFA) and self-test
>> every hour using the disk_transport 
>> agent. fmstat should show activity there. When a PFA is seen, then there
>> will be an ereport generated
>> and, for most cases, a syslog message. However, this will not cause a
>> zfs-retire event.
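>> 
>> A quick sanity check, assuming the stock fmd modules are loaded, is
>> something like:
>> 
>>   # fmstat | egrep 'module|disk'
>>   # fmdump -e
>> 
>> The first should show the disk transport agent's row accumulating
>> events; the second lists any ereports it has posted.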
>> 
>> Vendors have significant leeway in how they implement SMART. In my
>> experience the only thing
>> you can say for sure is if the vendor thinks the drive’s death is
>> imminent, then you should replace
>> it. I suspect these policies are financially motivated rather than
>> scientific… some amount of truthiness
>> is to be expected.
>> 
>> In the logs, clearly the one disk has lots of errors that have been
>> corrected and the rate is increasing.
>> The rate of change for "Errors corrected with possible delays” may
>> correlate to your performance issues,
>> but the interpretation is left up to the vendors.
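>> 
>> If you want to watch that rate over time, a crude sketch (the hourly
>> interval and the placeholder disk path are only illustrative):
>> 
>>   # while sleep 3600; do
>>       date
>>       sg_logs --page=0x03 /dev/rdsk/<suspect-disk> | grep -i corrected
>>     done
>> 
>> Page 0x03 is the read error counter page, which carries the "Errors
>> corrected with possible delays" field.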
>> 
>> In the case of this naughty drive, yep it needs replacing.
>> 
>>> 
>>> Since that drive had other indications of trouble, I ran 'sg_logs' on
>>> another drive I pulled recently that has a SMART health status of OK,
>>> but exhibits similar slow service time behavior:
>>> 
>>> http://pastebin.com/Q0t8Jnug
>> 
>> This one looks mostly healthy.
>> 
>> Another place to look for latency issues is the phy logs. In the sg_logs
>> output, this is the
>> Protocol Specific port log page for SAS SSP. Key values are running
>> disparity error 
>> count and loss of dword sync count. The trick here is that you need to
>> look at both ends
>> of the wire for each wire. For a simple case, this means looking at both
>> the HBA’s phys error
>> counts and the driver. If you have expanders in the mix, it is more
>> work. You’ll want to look at 
>> all of the HBA, expander, and drive phys health counters for all phys.
>> 
>> This can get tricky because wide ports are mostly dumb. For example, if
>> an HBA has a 4-link
>> wide port (common) and one of the links is acting up (all too common)
>> the latency impacts 
>> will be random.
>> 
>> To see HBA and expander link health, you can use sg3_utils, its
>> companion smp_utils, or
>> sasinfo (installed as a separate package from OmniOS, IIRC). For example, 
>> sasinfo hba-port -l
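>> 
>> To cover every phy on an expander in one pass, a small loop works --
>> a sketch only, assuming a 36-phy expander and that seq(1) is
>> available; adjust the range and SMP device path to your enclosure:
>> 
>>   # for p in $(seq 0 35); do
>>       ./smp_rep_phy_err_log --phy=$p /dev/smp/expd9
>>     done | egrep 'identifier|count'
>> 
>> Any phy whose disparity error or dword sync counters keep climbing is
>> worth chasing.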
>> 
>> HTH
>> — richard
>> 
>> 
>>> 
>>> Thanks for taking the time to look at these, please let me know what you
>>> find...
>>> 
>>> Kevin
>>> 
>>> 
>>> 
>>> 
>>> On 12/31/2014 06:13 PM, Richard Elling wrote:
>>>> 
>>>>> On Dec 31, 2014, at 4:30 PM, Kevin Swab <Kevin.Swab at ColoState.EDU
>>>>> <mailto:Kevin.Swab at ColoState.EDU>> wrote:
>>>>> 
>>>>> Hello Richard and group, thanks for your reply!
>>>>> 
>>>>> I'll look into sg_logs for one of these devices once I have a chance to
>>>>> track that program down...
>>>>> 
>>>>> Thanks for the tip on the 500 ms latency, I wasn't aware that could
>>>>> happen in normal cases.  However, I don't believe what I'm seeing
>>>>> constitutes normal behavior.
>>>>> 
>>>>> First, some anecdotal evidence:  If I pull and replace the suspect
>>>>> drive, my downstream systems stop complaining, and the high service time
>>>>> numbers go away.
>>>> 
>>>> We call these "wounded soldiers"  -- it takes more resources to manage a
>>>> wounded soldier than a dead soldier, so one strategy of war is to
>>>> wound your
>>>> enemy causing them to consume resources tending the wounded. The sg_logs
>>>> should be enlightening.
>>>> 
>>>> NB, consider a 4TB disk with 5 platters (10 heads/surfaces): if a head
>>>> or surface starts to go, then you have a 1/10 chance that the data you
>>>> request is under the damaged head and will need to be recovered by the
>>>> drive. So it is not uncommon to see
>>>> 90+% of the I/Os to the drive completing quickly. It is also not
>>>> unusual to see
>>>> only a small number of sectors or tracks affected.
>>>> 
>>>> Detecting these becomes tricky, especially as you reduce the
>>>> timeout/retry
>>>> interval, since the problem is rarely seen in the average latency --
>>>> that which
>>>> iostat and sar record. This is an area where we can and are improving.
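>>>> 
>>>> One way to see the whole distribution rather than the average is the
>>>> DTrace io provider; a rough sketch, reporting latency in milliseconds
>>>> per device:
>>>> 
>>>>   # dtrace -n '
>>>>     io:::start { ts[arg0] = timestamp; }
>>>>     io:::done /ts[arg0]/ {
>>>>       @[args[1]->dev_statname] =
>>>>           quantize((timestamp - ts[arg0]) / 1000000);
>>>>       ts[arg0] = 0;
>>>>     }'
>>>> 
>>>> A wounded drive shows up as a long tail in its histogram even when
>>>> most of its I/Os complete quickly.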
>>>> -- richard
>>>> 
>>>>> 
>>>>> I threw out 500 ms as a guess to the point at which I start seeing
>>>>> problems.  However, I see service times far in excess of that, sometimes
>>>>> over 30,000 ms!  Below is 20 minutes of sar output from a drive I pulled
>>>>> a few days ago, during a time when downstream VMWare servers were
>>>>> complaining.  (since the sar output is so verbose, I grepped out the
>>>>> info just for the suspect drive):
>>>>> 
>>>>> # sar -d -f /var/adm/sa/sa28 -s 14:50 -e 15:10 | egrep '(device|sd91,a)'
>>>>> 14:50:00   device        %busy   avque   r+w/s  blks/s  avwait  avserv
>>>>>         sd91,a           99     5.3       1      42     0.0  7811.7
>>>>>         sd91,a          100    11.3       1      53     0.0 11016.0
>>>>>         sd91,a          100     3.8       1      75     0.0  3615.8
>>>>>         sd91,a          100     4.9       1      25     0.0  8633.5
>>>>>         sd91,a           93     3.9       1      55     0.0  4385.3
>>>>>         sd91,a           86     3.5       2      75     0.0  2060.5
>>>>>         sd91,a           91     3.1       4      80     0.0   823.8
>>>>>         sd91,a           97     3.5       1      50     0.0  3984.5
>>>>>         sd91,a          100     4.4       1      56     0.0  6068.6
>>>>>         sd91,a          100     5.0       1      55     0.0  8836.0
>>>>>         sd91,a          100     5.7       1      51     0.0  7939.6
>>>>>         sd91,a           98     9.9       1      42     0.0 12526.8
>>>>>         sd91,a          100     7.4       0      10     0.0 36813.6
>>>>>         sd91,a           51     3.8       8      90     0.0   500.2
>>>>>         sd91,a           88     3.4       1      60     0.0  2338.8
>>>>>         sd91,a          100     4.5       1      28     0.0  6969.2
>>>>>         sd91,a           93     3.8       1      59     0.0  5138.9
>>>>>         sd91,a           79     3.1       1      59     0.0  3143.9
>>>>>         sd91,a           99     4.7       1      52     0.0  5598.4
>>>>>         sd91,a          100     4.8       1      62     0.0  6638.4
>>>>>         sd91,a           94     5.0       1      54     0.0  3752.7
>>>>> 
>>>>> For comparison, here's the sar output from another drive in the same
>>>>> pool for the same period of time:
>>>>> 
>>>>> # sar -d -f /var/adm/sa/sa28 -s 14:50 -e 15:10 | egrep '(device|sd82,a)'
>>>>> 14:50:00   device        %busy   avque   r+w/s  blks/s  avwait  avserv
>>>>>         sd82,a            0     0.0       2      28     0.0     5.6
>>>>>         sd82,a            1     0.0       3      51     0.0     5.4
>>>>>         sd82,a            1     0.0       4      66     0.0     6.3
>>>>>         sd82,a            1     0.0       3      48     0.0     4.3
>>>>>         sd82,a            1     0.0       3      45     0.0     6.1
>>>>>         sd82,a            1     0.0       6      82     0.0     2.7
>>>>>         sd82,a            1     0.0       8     112     0.0     2.8
>>>>>         sd82,a            0     0.0       3      27     0.0     1.8
>>>>>         sd82,a            1     0.0       5      80     0.0     3.1
>>>>>         sd82,a            0     0.0       3      35     0.0     3.1
>>>>>         sd82,a            1     0.0       3      35     0.0     3.8
>>>>>         sd82,a            1     0.0       4      49     0.0     3.2
>>>>>         sd82,a            0     0.0       0       0     0.0     4.1
>>>>>         sd82,a            3     0.0       9      84     0.0     4.1
>>>>>         sd82,a            1     0.0       6      55     0.0     3.7
>>>>>         sd82,a            0     0.0       1      23     0.0     7.0
>>>>>         sd82,a            0     0.0       6      57     0.0     1.8
>>>>>         sd82,a            1     0.0       5      70     0.0     2.3
>>>>>         sd82,a            1     0.0       4      55     0.0     3.7
>>>>>         sd82,a            1     0.0       5      72     0.0     4.1
>>>>>         sd82,a            1     0.0       4      54     0.0     3.6
>>>>> 
>>>>> The other drives in this pool all show data similar to that of sd82.
>>>>> 
>>>>> Your point about tuning blindly is well taken, and I'm certainly no
>>>>> expert on the IO stack.  What's a humble sysadmin to do?
>>>>> 
>>>>> For further reference, this system is running r151010.  The drive in
>>>>> question is a Toshiba MG03SCA300 (7200rpm SAS), and the pool the drive
>>>>> was in is using lz4 compression and looks like this:
>>>>> 
>>>>> # zpool status data1
>>>>> pool: data1
>>>>> state: ONLINE
>>>>> scan: resilvered 1.67T in 70h56m with 0 errors on Wed Dec 31
>>>>> 14:40:20 2014
>>>>> config:
>>>>> 
>>>>>      NAME                       STATE     READ WRITE CKSUM
>>>>>      data1                      ONLINE       0     0     0
>>>>>        raidz2-0                 ONLINE       0     0     0
>>>>>          c6t5000039468CB54F0d0  ONLINE       0     0     0
>>>>>          c6t5000039478CB5138d0  ONLINE       0     0     0
>>>>>          c6t5000039468D000DCd0  ONLINE       0     0     0
>>>>>          c6t5000039468D000E8d0  ONLINE       0     0     0
>>>>>          c6t5000039468D00F5Cd0  ONLINE       0     0     0
>>>>>          c6t5000039478C816CCd0  ONLINE       0     0     0
>>>>>          c6t5000039478C8546Cd0  ONLINE       0     0     0
>>>>>        raidz2-1                 ONLINE       0     0     0
>>>>>          c6t5000039478C855F0d0  ONLINE       0     0     0
>>>>>          c6t5000039478C856E8d0  ONLINE       0     0     0
>>>>>          c6t5000039478C856ECd0  ONLINE       0     0     0
>>>>>          c6t5000039478C856F4d0  ONLINE       0     0     0
>>>>>          c6t5000039478C86374d0  ONLINE       0     0     0
>>>>>          c6t5000039478C8C2A8d0  ONLINE       0     0     0
>>>>>          c6t5000039478C8C364d0  ONLINE       0     0     0
>>>>>        raidz2-2                 ONLINE       0     0     0
>>>>>          c6t5000039478C9958Cd0  ONLINE       0     0     0
>>>>>          c6t5000039478C995C4d0  ONLINE       0     0     0
>>>>>          c6t5000039478C9DACCd0  ONLINE       0     0     0
>>>>>          c6t5000039478C9DB30d0  ONLINE       0     0     0
>>>>>          c6t5000039478C9DB6Cd0  ONLINE       0     0     0
>>>>>          c6t5000039478CA73B4d0  ONLINE       0     0     0
>>>>>          c6t5000039478CB3A20d0  ONLINE       0     0     0
>>>>>        raidz2-3                 ONLINE       0     0     0
>>>>>          c6t5000039478CB3A64d0  ONLINE       0     0     0
>>>>>          c6t5000039478CB3A70d0  ONLINE       0     0     0
>>>>>          c6t5000039478CB3E7Cd0  ONLINE       0     0     0
>>>>>          c6t5000039478CB3EB0d0  ONLINE       0     0     0
>>>>>          c6t5000039478CB3FBCd0  ONLINE       0     0     0
>>>>>          c6t5000039478CB4048d0  ONLINE       0     0     0
>>>>>          c6t5000039478CB4054d0  ONLINE       0     0     0
>>>>>        raidz2-4                 ONLINE       0     0     0
>>>>>          c6t5000039478CB424Cd0  ONLINE       0     0     0
>>>>>          c6t5000039478CB4250d0  ONLINE       0     0     0
>>>>>          c6t5000039478CB470Cd0  ONLINE       0     0     0
>>>>>          c6t5000039478CB471Cd0  ONLINE       0     0     0
>>>>>          c6t5000039478CB4E50d0  ONLINE       0     0     0
>>>>>          c6t5000039478CB50A8d0  ONLINE       0     0     0
>>>>>          c6t5000039478CB50BCd0  ONLINE       0     0     0
>>>>>      spares
>>>>>        c6t50000394A8CBC93Cd0    AVAIL
>>>>> 
>>>>> errors: No known data errors
>>>>> 
>>>>> 
>>>>> Thanks for your help,
>>>>> Kevin
>>>>> 
>>>>> On 12/31/2014 3:22 PM, Richard Elling wrote:
>>>>>> 
>>>>>>> On Dec 31, 2014, at 11:25 AM, Kevin Swab <Kevin.Swab at colostate.edu
>>>>>>> <mailto:Kevin.Swab at colostate.edu>> wrote:
>>>>>>> 
>>>>>>> Hello Everyone,
>>>>>>> 
>>>>>>> We've been running OmniOS on a number of Supermicro 36-bay chassis,
>>>>>>> with
>>>>>>> Supermicro motherboards, LSI SAS controllers (9211-8i & 9207-8i) and
>>>>>>> various SAS HDD's.  These systems are serving block storage via
>>>>>>> Comstar
>>>>>>> and Qlogic FC HBA's, and have been running well for several years.
>>>>>>> 
>>>>>>> The problem we've got is that as the drives age, some of them start to
>>>>>>> perform slowly (intermittently) without failing - no zpool or iostat
>>>>>>> errors, and nothing logged in /var/adm/messages.  The slow performance
>>>>>>> can be seen as high average service times in iostat or sar.
>>>>>> 
>>>>>> Look at the drive's error logs using sg_logs (-a for all)
>>>>>> 
>>>>>>> 
>>>>>>> When these service times get above 500ms, they start to cause IO
>>>>>>> timeouts on the downstream storage consumers, which is bad...
>>>>>> 
>>>>>> 500 milliseconds is not unusual for a busy HDD with SCSI TCQ or
>>>>>> SATA NCQ
>>>>>> 
>>>>>>> 
>>>>>>> I'm wondering - is there a way to tune OmniOS' behavior so that it
>>>>>>> doesn't try so hard to complete IOs to these slow disks, and instead
>>>>>>> just gives up and fails them?
>>>>>> 
>>>>>> Yes, the tuning in Alasdair's blog should work as he describes.
>>>>>> More below...
>>>>>> 
>>>>>>> 
>>>>>>> I found an old post from 2011 which states that some tunables exist,
>>>>>>> but are ignored by the mpt_sas driver:
>>>>>>> 
>>>>>>> http://everycity.co.uk/alasdair/2011/05/adjusting-drive-timeouts-with-mdb-on-solaris-or-openindiana/
>>>>>>> 
>>>>>>> Does anyone know the current status of these tunables, or have any
>>>>>>> other
>>>>>>> suggestions that might help?
>>>>>> 
>>>>>> These tunables are on the order of seconds. The default, 60, is
>>>>>> obviously too big
>>>>>> unless you have old, slow, SCSI CD-ROMs. But setting it below the
>>>>>> manufacturer's
>>>>>> internal limit (default or tuned) can lead to an unstable system.
>>>>>> Some vendors are
>>>>>> better than others at documenting these, but in any case you'll
>>>>>> need to see their spec.
>>>>>> Expect values on the order of 6 to 15 seconds for modern HDDs and SSDs.
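>>>>>> 
>>>>>> For illustration only -- and per the caveat above, do not set it
>>>>>> below the drive's own internal limit -- the classic global knob is
>>>>>> the sd timeout:
>>>>>> 
>>>>>>   in /etc/system:    set sd:sd_io_time = 10
>>>>>>   live via mdb -kw:  sd_io_time/W 0t10
>>>>>> 
>>>>>> Alasdair's post (linked above) covers the mdb approach in more detail.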
>>>>>> 
>>>>>> There are a lot of tunables in this area at all levels of the
>>>>>> architecture. OOB, the OmniOS
>>>>>> settings ensure stable behaviour. Tuning any layer without
>>>>>> understanding the others can
>>>>>> lead to unstable systems, as demonstrated by your current
>>>>>> downstream consumers.
>>>>>> -- richard
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Kevin
>>>>>>> 
>>>>>>> 
>>>>>>> -- 
>>>>>>> -------------------------------------------------------------------
>>>>>>> Kevin Swab                          UNIX Systems Administrator
>>>>>>> ACNS                                Colorado State University
>>>>>>> Phone: (970)491-6572                Email:
>>>>>>> Kevin.Swab at ColoState.EDU <mailto:Kevin.Swab at ColoState.EDU>
>>>>>>> GPG Fingerprint: 7026 3F66 A970 67BD 6F17  8EB8 8A7D 142F 2392 791C
>>>>>>> _______________________________________________
>>>>>>> OmniOS-discuss mailing list
>>>>>>> OmniOS-discuss at lists.omniti.com
>>>>>>> <mailto:OmniOS-discuss at lists.omniti.com>
>>>>>>> http://lists.omniti.com/mailman/listinfo/omnios-discuss
>>>>> 
>>> 
>>> -- 
>>> -------------------------------------------------------------------
>>> Kevin Swab                          UNIX Systems Administrator
>>> ACNS                                Colorado State University
>>> Phone: (970)491-6572                Email: Kevin.Swab at ColoState.EDU
>>> <mailto:Kevin.Swab at ColoState.EDU>
>>> GPG Fingerprint: 7026 3F66 A970 67BD 6F17  8EB8 8A7D 142F 2392 791C
>> 
>> --
>> 
>> Richard.Elling at RichardElling.com <mailto:Richard.Elling at RichardElling.com>
>> +1-760-896-4422
>> 
>> 
>> 
> 
> -- 
> -------------------------------------------------------------------
> Kevin Swab                          UNIX Systems Administrator
> ACNS                                Colorado State University
> Phone: (970)491-6572                Email: Kevin.Swab at ColoState.EDU
> GPG Fingerprint: 7026 3F66 A970 67BD 6F17  8EB8 8A7D 142F 2392 791C


