[OmniOS-discuss] slow drive response times

Kevin Swab Kevin.Swab at ColoState.EDU
Fri Jan 2 21:50:29 UTC 2015


I've run 'sg_logs' on the drive I pulled last week.  There were a lot of
errors in the background scan section of the output, which made it very
large, so I put it here:

http://pastebin.com/jx5BvSep

When I pulled this drive, the SMART health status was OK.  However, when
I put it in a test system to run 'sg_logs', the status changed to
"impending failure...".  Had the SMART status changed before pulling the
drive, I'm sure 'fmd' would have alerted me to the problem...

Since that drive had other indications of trouble, I also ran 'sg_logs' on
another drive I pulled recently that has a SMART health status of OK,
but exhibits the same slow service time behavior:

http://pastebin.com/Q0t8Jnug

Thanks for taking the time to look at these, please let me know what you
find...
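In case it helps anyone watching for the same symptom, here's a rough
sketch of how I'm flagging suspect drives by service time (Python; the
column layout is assumed from the 'sar -d' output quoted below, and the
500 ms threshold is just the point where my downstream consumers start
complaining - adjust both for your environment):

```python
# Flag drives whose avserv (last column of 'sar -d' output) ever exceeds
# a threshold.  Assumes the column layout quoted later in this thread;
# adjust the parsing if your sar build prints extra columns.

SLOW_MS = 500.0

def slow_devices(sar_lines, threshold=SLOW_MS):
    """Return {device: worst_avserv_ms} for devices exceeding the threshold."""
    worst = {}
    for line in sar_lines:
        fields = line.split()
        if len(fields) < 7 or not fields[0].startswith("sd"):
            continue  # skip headers, timestamps, and blank lines
        dev, avserv = fields[0], float(fields[-1])
        if avserv > threshold:
            worst[dev] = max(worst.get(dev, 0.0), avserv)
    return worst

sample = [
    "14:50:00   device        %busy   avque   r+w/s  blks/s  avwait  avserv",
    "          sd91,a           99     5.3       1      42     0.0  7811.7",
    "          sd82,a            0     0.0       2      28     0.0     5.6",
]
print(slow_devices(sample))   # only sd91,a clears the 500 ms bar
```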

Kevin




On 12/31/2014 06:13 PM, Richard Elling wrote:
> 
>> On Dec 31, 2014, at 4:30 PM, Kevin Swab <Kevin.Swab at ColoState.EDU> wrote:
>>
>> Hello Richard and group, thanks for your reply!
>>
>> I'll look into sg_logs for one of these devices once I have a chance to
>> track that program down...
>>
>> Thanks for the tip on the 500 ms latency, I wasn't aware that could
>> happen in normal cases.  However, I don't believe what I'm seeing
>> constitutes normal behavior.
>>
>> First, some anecdotal evidence:  If I pull and replace the suspect
>> drive, my downstream systems stop complaining, and the high service time
>> numbers go away.
> 
> We call these "wounded soldiers"  -- it takes more resources to manage a
> wounded soldier than a dead soldier, so one strategy of war is to wound your
> enemy causing them to consume resources tending the wounded. The sg_logs
> should be enlightening.
> 
> NB, consider a 4TB disk with 5 platters: if a head or surface starts to go, then
> you have a 1/10 chance that the data you request is under the damaged head
> and will need to be recovered by the drive. So it is not uncommon to see
> 90+% of the I/Os to the drive completing quickly. It is also not unusual to see
> only a small number of sectors or tracks affected.
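[The 1-in-10 figure above follows directly from the head count; a quick
sketch of the arithmetic, taking the 5-platter geometry as stated:]

```python
# A 4 TB drive with 5 platters has 10 recording surfaces, one head each.
platters = 5
heads = 2 * platters

# If one head/surface degrades, a uniformly placed request lands on it with:
p_damaged = 1 / heads
print(p_damaged)        # 0.1

# So roughly 90% of I/Os still complete at normal speed, matching the
# "90+% of the I/Os completing quickly" observation above.
p_fast = 1 - p_damaged
print(p_fast)
```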
> 
> Detecting these becomes tricky, especially as you reduce the timeout/retry
> interval, since the problem is rarely seen in the average latency -- that which
> iostat and sar record. This is an area where we can and are improving.
>  -- richard
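[The averaging effect described above is easy to demonstrate; a minimal
sketch with made-up latency numbers, showing how a per-interval mean can
hide a multi-second stall:]

```python
# 1000 I/Os: all but two complete in ~5 ms; two stall for 8 s on the
# damaged region.  (Numbers are illustrative, not measured.)
latencies = [5.0] * 998 + [8000.0] * 2

mean = sum(latencies) / len(latencies)   # ~21 ms: looks like a merely busy disk
worst = max(latencies)                   # 8000 ms: the stall the average hides
print(mean, worst)
```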
> 
>>
>> I threw out 500 ms as a rough guess at the point where I start seeing
>> problems.  However, I see service times far in excess of that, sometimes
>> over 30,000 ms!  Below is 20 minutes of sar output from a drive I pulled
>> a few days ago, during a time when downstream VMWare servers were
>> complaining.  (since the sar output is so verbose, I grepped out the
>> info just for the suspect drive):
>>
>> # sar -d -f /var/adm/sa/sa28 -s 14:50 -e 15:10 | egrep '(device|sd91,a)'
>> 14:50:00   device        %busy   avque   r+w/s  blks/s  avwait  avserv
>>           sd91,a           99     5.3       1      42     0.0  7811.7
>>           sd91,a          100    11.3       1      53     0.0 11016.0
>>           sd91,a          100     3.8       1      75     0.0  3615.8
>>           sd91,a          100     4.9       1      25     0.0  8633.5
>>           sd91,a           93     3.9       1      55     0.0  4385.3
>>           sd91,a           86     3.5       2      75     0.0  2060.5
>>           sd91,a           91     3.1       4      80     0.0   823.8
>>           sd91,a           97     3.5       1      50     0.0  3984.5
>>           sd91,a          100     4.4       1      56     0.0  6068.6
>>           sd91,a          100     5.0       1      55     0.0  8836.0
>>           sd91,a          100     5.7       1      51     0.0  7939.6
>>           sd91,a           98     9.9       1      42     0.0 12526.8
>>           sd91,a          100     7.4       0      10     0.0 36813.6
>>           sd91,a           51     3.8       8      90     0.0   500.2
>>           sd91,a           88     3.4       1      60     0.0  2338.8
>>           sd91,a          100     4.5       1      28     0.0  6969.2
>>           sd91,a           93     3.8       1      59     0.0  5138.9
>>           sd91,a           79     3.1       1      59     0.0  3143.9
>>           sd91,a           99     4.7       1      52     0.0  5598.4
>>           sd91,a          100     4.8       1      62     0.0  6638.4
>>           sd91,a           94     5.0       1      54     0.0  3752.7
>>
>> For comparison, here's the sar output from another drive in the same
>> pool for the same period of time:
>>
>> # sar -d -f /var/adm/sa/sa28 -s 14:50 -e 15:10 | egrep '(device|sd82,a)'
>> 14:50:00   device        %busy   avque   r+w/s  blks/s  avwait  avserv
>>           sd82,a            0     0.0       2      28     0.0     5.6
>>           sd82,a            1     0.0       3      51     0.0     5.4
>>           sd82,a            1     0.0       4      66     0.0     6.3
>>           sd82,a            1     0.0       3      48     0.0     4.3
>>           sd82,a            1     0.0       3      45     0.0     6.1
>>           sd82,a            1     0.0       6      82     0.0     2.7
>>           sd82,a            1     0.0       8     112     0.0     2.8
>>           sd82,a            0     0.0       3      27     0.0     1.8
>>           sd82,a            1     0.0       5      80     0.0     3.1
>>           sd82,a            0     0.0       3      35     0.0     3.1
>>           sd82,a            1     0.0       3      35     0.0     3.8
>>           sd82,a            1     0.0       4      49     0.0     3.2
>>           sd82,a            0     0.0       0       0     0.0     4.1
>>           sd82,a            3     0.0       9      84     0.0     4.1
>>           sd82,a            1     0.0       6      55     0.0     3.7
>>           sd82,a            0     0.0       1      23     0.0     7.0
>>           sd82,a            0     0.0       6      57     0.0     1.8
>>           sd82,a            1     0.0       5      70     0.0     2.3
>>           sd82,a            1     0.0       4      55     0.0     3.7
>>           sd82,a            1     0.0       5      72     0.0     4.1
>>           sd82,a            1     0.0       4      54     0.0     3.6
>>
>> The other drives in this pool all show data similar to that of sd82.
>>
>> Your point about tuning blindly is well taken, and I'm certainly no
>> expert on the IO stack.  What's a humble sysadmin to do?
>>
>> For further reference, this system is running r151010.  The drive in
>> question is a Toshiba MG03SCA300 (7200rpm SAS), and the pool the drive
>> was in is using lz4 compression and looks like this:
>>
>> # zpool status data1
>>  pool: data1
>> state: ONLINE
>>  scan: resilvered 1.67T in 70h56m with 0 errors on Wed Dec 31 14:40:20 2014
>> config:
>>
>>        NAME                       STATE     READ WRITE CKSUM
>>        data1                      ONLINE       0     0     0
>>          raidz2-0                 ONLINE       0     0     0
>>            c6t5000039468CB54F0d0  ONLINE       0     0     0
>>            c6t5000039478CB5138d0  ONLINE       0     0     0
>>            c6t5000039468D000DCd0  ONLINE       0     0     0
>>            c6t5000039468D000E8d0  ONLINE       0     0     0
>>            c6t5000039468D00F5Cd0  ONLINE       0     0     0
>>            c6t5000039478C816CCd0  ONLINE       0     0     0
>>            c6t5000039478C8546Cd0  ONLINE       0     0     0
>>          raidz2-1                 ONLINE       0     0     0
>>            c6t5000039478C855F0d0  ONLINE       0     0     0
>>            c6t5000039478C856E8d0  ONLINE       0     0     0
>>            c6t5000039478C856ECd0  ONLINE       0     0     0
>>            c6t5000039478C856F4d0  ONLINE       0     0     0
>>            c6t5000039478C86374d0  ONLINE       0     0     0
>>            c6t5000039478C8C2A8d0  ONLINE       0     0     0
>>            c6t5000039478C8C364d0  ONLINE       0     0     0
>>          raidz2-2                 ONLINE       0     0     0
>>            c6t5000039478C9958Cd0  ONLINE       0     0     0
>>            c6t5000039478C995C4d0  ONLINE       0     0     0
>>            c6t5000039478C9DACCd0  ONLINE       0     0     0
>>            c6t5000039478C9DB30d0  ONLINE       0     0     0
>>            c6t5000039478C9DB6Cd0  ONLINE       0     0     0
>>            c6t5000039478CA73B4d0  ONLINE       0     0     0
>>            c6t5000039478CB3A20d0  ONLINE       0     0     0
>>          raidz2-3                 ONLINE       0     0     0
>>            c6t5000039478CB3A64d0  ONLINE       0     0     0
>>            c6t5000039478CB3A70d0  ONLINE       0     0     0
>>            c6t5000039478CB3E7Cd0  ONLINE       0     0     0
>>            c6t5000039478CB3EB0d0  ONLINE       0     0     0
>>            c6t5000039478CB3FBCd0  ONLINE       0     0     0
>>            c6t5000039478CB4048d0  ONLINE       0     0     0
>>            c6t5000039478CB4054d0  ONLINE       0     0     0
>>          raidz2-4                 ONLINE       0     0     0
>>            c6t5000039478CB424Cd0  ONLINE       0     0     0
>>            c6t5000039478CB4250d0  ONLINE       0     0     0
>>            c6t5000039478CB470Cd0  ONLINE       0     0     0
>>            c6t5000039478CB471Cd0  ONLINE       0     0     0
>>            c6t5000039478CB4E50d0  ONLINE       0     0     0
>>            c6t5000039478CB50A8d0  ONLINE       0     0     0
>>            c6t5000039478CB50BCd0  ONLINE       0     0     0
>>        spares
>>          c6t50000394A8CBC93Cd0    AVAIL
>>
>> errors: No known data errors
>>
>>
>> Thanks for your help,
>> Kevin
>>
>> On 12/31/2014 3:22 PM, Richard Elling wrote:
>>>
>>>> On Dec 31, 2014, at 11:25 AM, Kevin Swab <Kevin.Swab at colostate.edu> wrote:
>>>>
>>>> Hello Everyone,
>>>>
>>>> 	We've been running OmniOS on a number of SuperMicro 36-bay chassis, with
>>>> Supermicro motherboards, LSI SAS controllers (9211-8i & 9207-8i) and
>>>> various SAS HDD's.  These systems are serving block storage via Comstar
>>>> and Qlogic FC HBA's, and have been running well for several years.
>>>>
>>>> 	The problem we've got is that as the drives age, some of them start to
>>>> perform slowly (intermittently) without failing - no zpool or iostat
>>>> errors, and nothing logged in /var/adm/messages.  The slow performance
>>>> can be seen as high average service times in iostat or sar.
>>>
>>> Look at the drive's error logs using sg_logs (-a for all)
>>>
>>>>
>>>> 	When these service times get above 500ms, they start to cause IO
>>>> timeouts on the downstream storage consumers, which is bad...
>>>
>>> 500 milliseconds is not unusual for a busy HDD with SCSI TCQ or SATA NCQ
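[The 500 ms figure is plausible from queue arithmetic alone; a
back-of-envelope sketch, assuming a 32-deep queue and ~8 ms per random
access - both round numbers, not measurements:]

```python
# With tagged queueing (TCQ/NCQ), a newly issued command can wait behind
# every command already queued at the drive.
queue_depth = 32      # common TCQ/NCQ depth
service_ms = 8.0      # ~seek + half a rotation per random read, 7200 rpm

# Draining one full queue ahead of you:
one_queue_ms = queue_depth * service_ms
print(one_queue_ms)            # 256.0 ms

# Arrive behind a full queue that refills once, and the wait clears
# 500 ms with no errors involved at all:
print(2 * one_queue_ms)        # 512.0 ms
```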
>>>
>>>>
>>>> 	I'm wondering - is there a way to tune OmniOS' behavior so that it
>>>> doesn't try so hard to complete IOs to these slow disks, and instead
>>>> just gives up and fails them?
>>>
>>> Yes, the tuning in Alasdair's blog should work as he describes. More below...
>>>
>>>>
>>>> 	I found an old post from 2011 which states that some tunables exist,
>>>> but are ignored by the mpt_sas driver:
>>>>
>>>> http://everycity.co.uk/alasdair/2011/05/adjusting-drive-timeouts-with-mdb-on-solaris-or-openindiana/
>>>>
>>>> Does anyone know the current status of these tunables, or have any other
>>>> suggestions that might help?
>>>
>>> These tunables are on the order of seconds. The default, 60, is obviously too big
>>> unless you have old, slow, SCSI CD-ROMs. But setting it below the manufacturer's
>>> internal limit (default or tuned) can lead to an unstable system. Some vendors are
>>> better than others at documenting these, but in any case you'll need to see their spec.
>>> Expect values on the order of 6 to 15 seconds for modern HDDs and SSDs.
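[For reference, the per-command timeout being described here is
`sd_io_time` in the sd driver, the tunable Alasdair's post adjusts with
mdb; it can also be set persistently in /etc/system.  The value below is
an example only - per the caution above, check your drive vendor's
documented internal recovery limit before lowering it:]

```
* /etc/system fragment: lower sd's per-command timeout (default 60 s).
* Keep this above the drive's internal error-recovery time, or commands
* may be aborted and retried while the drive is still working on them.
set sd:sd_io_time = 15
```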
>>>
>>> There are a lot of tunables in this area at all levels of the architecture. OOB, the OmniOS
>>> settings ensure stable behaviour. Tuning any layer without understanding the others can
>>> lead to unstable systems, as demonstrated by your current downstream consumers.
>>> -- richard
>>>
>>>
>>>>
>>>> Thanks,
>>>> Kevin
>>>>
>>>>
>>>> -- 
>>>> -------------------------------------------------------------------
>>>> Kevin Swab                          UNIX Systems Administrator
>>>> ACNS                                Colorado State University
>>>> Phone: (970)491-6572                Email: Kevin.Swab at ColoState.EDU
>>>> GPG Fingerprint: 7026 3F66 A970 67BD 6F17  8EB8 8A7D 142F 2392 791C
>>>> _______________________________________________
>>>> OmniOS-discuss mailing list
>>>> OmniOS-discuss at lists.omniti.com
>>>> http://lists.omniti.com/mailman/listinfo/omnios-discuss
>>

-- 
-------------------------------------------------------------------
Kevin Swab                          UNIX Systems Administrator
ACNS                                Colorado State University
Phone: (970)491-6572                Email: Kevin.Swab at ColoState.EDU
GPG Fingerprint: 7026 3F66 A970 67BD 6F17  8EB8 8A7D 142F 2392 791C

