[OmniOS-discuss] slow drive response times
Richard Elling
richard.elling at richardelling.com
Thu Jan 1 01:13:04 UTC 2015
> On Dec 31, 2014, at 4:30 PM, Kevin Swab <Kevin.Swab at ColoState.EDU> wrote:
>
> Hello Richard and group, thanks for your reply!
>
> I'll look into sg_logs for one of these devices once I have a chance to
> track that program down...
>
> Thanks for the tip on the 500 ms latency, I wasn't aware that could
> happen in normal cases. However, I don't believe what I'm seeing
> constitutes normal behavior.
>
> First, some anecdotal evidence: If I pull and replace the suspect
> drive, my downstream systems stop complaining, and the high service time
> numbers go away.
We call these "wounded soldiers" -- it takes more resources to manage a
wounded soldier than a dead soldier, so one strategy of war is to wound your
enemy, causing them to consume resources tending the wounded. The sg_logs
output should be enlightening.
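For example (sg_logs ships with sg3_utils; the device path here is
illustrative -- substitute your suspect drive's rdsk node):

   # sg_logs -a /dev/rdsk/c6t5000039468CB54F0d0s0

The read/write error counter pages and the background scan results page are
usually the interesting ones for a wounded soldier.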
NB, consider a 4TB disk with 5 platters, hence 10 heads and recording
surfaces: if one head or surface starts to go, the data you request has
roughly a 1/10 chance of sitting under the damaged head and needing to be
recovered by the drive. So it is not uncommon to see 90+% of the I/Os to the
drive completing quickly. It is also not unusual for only a small number of
sectors or tracks to be affected.
Detecting these becomes tricky, especially as you reduce the timeout/retry
interval, since the problem is rarely visible in the average latency -- which
is all that iostat and sar record. This is an area where we can improve, and
are improving.
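In the meantime, a per-I/O latency histogram shows the tail that the averages
hide. A rough DTrace sketch using the stock io provider (run as root; ^C to
stop; buckets are in milliseconds):

   # dtrace -n '
       io:::start { start[arg0] = timestamp; }
       io:::done /start[arg0]/ {
           @[args[1]->dev_statname] =
               quantize((timestamp - start[arg0]) / 1000000);
           start[arg0] = 0;
       }'

A wounded soldier shows counts out in the thousands-of-ms buckets while its
healthy siblings stay in the single digits.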
-- richard
>
> I threw out 500 ms as a guess at the point where I start seeing
> problems. However, I see service times far in excess of that, sometimes
> over 30,000 ms! Below is 20 minutes of sar output from a drive I pulled
> a few days ago, during a time when downstream VMware servers were
> complaining. (Since the sar output is so verbose, I grepped out the
> info just for the suspect drive):
>
> # sar -d -f /var/adm/sa/sa28 -s 14:50 -e 15:10 | egrep '(device|sd91,a)'
> 14:50:00   device   %busy   avque   r+w/s   blks/s   avwait    avserv
>            sd91,a      99     5.3       1       42      0.0    7811.7
>            sd91,a     100    11.3       1       53      0.0   11016.0
>            sd91,a     100     3.8       1       75      0.0    3615.8
>            sd91,a     100     4.9       1       25      0.0    8633.5
>            sd91,a      93     3.9       1       55      0.0    4385.3
>            sd91,a      86     3.5       2       75      0.0    2060.5
>            sd91,a      91     3.1       4       80      0.0     823.8
>            sd91,a      97     3.5       1       50      0.0    3984.5
>            sd91,a     100     4.4       1       56      0.0    6068.6
>            sd91,a     100     5.0       1       55      0.0    8836.0
>            sd91,a     100     5.7       1       51      0.0    7939.6
>            sd91,a      98     9.9       1       42      0.0   12526.8
>            sd91,a     100     7.4       0       10      0.0   36813.6
>            sd91,a      51     3.8       8       90      0.0     500.2
>            sd91,a      88     3.4       1       60      0.0    2338.8
>            sd91,a     100     4.5       1       28      0.0    6969.2
>            sd91,a      93     3.8       1       59      0.0    5138.9
>            sd91,a      79     3.1       1       59      0.0    3143.9
>            sd91,a      99     4.7       1       52      0.0    5598.4
>            sd91,a     100     4.8       1       62      0.0    6638.4
>            sd91,a      94     5.0       1       54      0.0    3752.7
>
> For comparison, here's the sar output from another drive in the same
> pool for the same period of time:
>
> # sar -d -f /var/adm/sa/sa28 -s 14:50 -e 15:10 | egrep '(device|sd82,a)'
> 14:50:00   device   %busy   avque   r+w/s   blks/s   avwait    avserv
>            sd82,a       0     0.0       2       28      0.0       5.6
>            sd82,a       1     0.0       3       51      0.0       5.4
>            sd82,a       1     0.0       4       66      0.0       6.3
>            sd82,a       1     0.0       3       48      0.0       4.3
>            sd82,a       1     0.0       3       45      0.0       6.1
>            sd82,a       1     0.0       6       82      0.0       2.7
>            sd82,a       1     0.0       8      112      0.0       2.8
>            sd82,a       0     0.0       3       27      0.0       1.8
>            sd82,a       1     0.0       5       80      0.0       3.1
>            sd82,a       0     0.0       3       35      0.0       3.1
>            sd82,a       1     0.0       3       35      0.0       3.8
>            sd82,a       1     0.0       4       49      0.0       3.2
>            sd82,a       0     0.0       0        0      0.0       4.1
>            sd82,a       3     0.0       9       84      0.0       4.1
>            sd82,a       1     0.0       6       55      0.0       3.7
>            sd82,a       0     0.0       1       23      0.0       7.0
>            sd82,a       0     0.0       6       57      0.0       1.8
>            sd82,a       1     0.0       5       70      0.0       2.3
>            sd82,a       1     0.0       4       55      0.0       3.7
>            sd82,a       1     0.0       5       72      0.0       4.1
>            sd82,a       1     0.0       4       54      0.0       3.6
>
> The other drives in this pool all show data similar to that of sd82.
>
> Your point about tuning blindly is well taken, and I'm certainly no
> expert on the IO stack. What's a humble sysadmin to do?
>
> For further reference, this system is running r151010. The drive in
> question is a Toshiba MG03SCA300 (7200rpm SAS), and the pool it was in
> uses lz4 compression and looks like this:
>
> # zpool status data1
>   pool: data1
>  state: ONLINE
>   scan: resilvered 1.67T in 70h56m with 0 errors on Wed Dec 31 14:40:20 2014
> config:
>
>         NAME                       STATE     READ WRITE CKSUM
>         data1                      ONLINE       0     0     0
>           raidz2-0                 ONLINE       0     0     0
>             c6t5000039468CB54F0d0  ONLINE       0     0     0
>             c6t5000039478CB5138d0  ONLINE       0     0     0
>             c6t5000039468D000DCd0  ONLINE       0     0     0
>             c6t5000039468D000E8d0  ONLINE       0     0     0
>             c6t5000039468D00F5Cd0  ONLINE       0     0     0
>             c6t5000039478C816CCd0  ONLINE       0     0     0
>             c6t5000039478C8546Cd0  ONLINE       0     0     0
>           raidz2-1                 ONLINE       0     0     0
>             c6t5000039478C855F0d0  ONLINE       0     0     0
>             c6t5000039478C856E8d0  ONLINE       0     0     0
>             c6t5000039478C856ECd0  ONLINE       0     0     0
>             c6t5000039478C856F4d0  ONLINE       0     0     0
>             c6t5000039478C86374d0  ONLINE       0     0     0
>             c6t5000039478C8C2A8d0  ONLINE       0     0     0
>             c6t5000039478C8C364d0  ONLINE       0     0     0
>           raidz2-2                 ONLINE       0     0     0
>             c6t5000039478C9958Cd0  ONLINE       0     0     0
>             c6t5000039478C995C4d0  ONLINE       0     0     0
>             c6t5000039478C9DACCd0  ONLINE       0     0     0
>             c6t5000039478C9DB30d0  ONLINE       0     0     0
>             c6t5000039478C9DB6Cd0  ONLINE       0     0     0
>             c6t5000039478CA73B4d0  ONLINE       0     0     0
>             c6t5000039478CB3A20d0  ONLINE       0     0     0
>           raidz2-3                 ONLINE       0     0     0
>             c6t5000039478CB3A64d0  ONLINE       0     0     0
>             c6t5000039478CB3A70d0  ONLINE       0     0     0
>             c6t5000039478CB3E7Cd0  ONLINE       0     0     0
>             c6t5000039478CB3EB0d0  ONLINE       0     0     0
>             c6t5000039478CB3FBCd0  ONLINE       0     0     0
>             c6t5000039478CB4048d0  ONLINE       0     0     0
>             c6t5000039478CB4054d0  ONLINE       0     0     0
>           raidz2-4                 ONLINE       0     0     0
>             c6t5000039478CB424Cd0  ONLINE       0     0     0
>             c6t5000039478CB4250d0  ONLINE       0     0     0
>             c6t5000039478CB470Cd0  ONLINE       0     0     0
>             c6t5000039478CB471Cd0  ONLINE       0     0     0
>             c6t5000039478CB4E50d0  ONLINE       0     0     0
>             c6t5000039478CB50A8d0  ONLINE       0     0     0
>             c6t5000039478CB50BCd0  ONLINE       0     0     0
>         spares
>           c6t50000394A8CBC93Cd0    AVAIL
>
> errors: No known data errors
>
>
> Thanks for your help,
> Kevin
>
> On 12/31/2014 3:22 PM, Richard Elling wrote:
>>
>>> On Dec 31, 2014, at 11:25 AM, Kevin Swab <Kevin.Swab at colostate.edu> wrote:
>>>
>>> Hello Everyone,
>>>
>>> We've been running OmniOS on a number of Supermicro 36-bay chassis, with
>>> Supermicro motherboards, LSI SAS controllers (9211-8i & 9207-8i) and
>>> various SAS HDDs. These systems are serving block storage via Comstar
>>> and Qlogic FC HBAs, and have been running well for several years.
>>>
>>> The problem we've got is that as the drives age, some of them start to
>>> perform slowly (intermittently) without failing - no zpool or iostat
>>> errors, and nothing logged in /var/adm/messages. The slow performance
>>> can be seen as high average service times in iostat or sar.
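>>> (E.g., watching "iostat -xnz 10", the suspect drive's asvc_t runs to
>>> hundreds or thousands of milliseconds while its neighbors stay in the
>>> single digits.)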
>>
>> Look at the drive's error logs using sg_logs (-a for all)
>>
>>>
>>> When these service times get above 500ms, they start to cause IO
>>> timeouts on the downstream storage consumers, which is bad...
>>
>> 500 milliseconds is not unusual for a busy HDD with SCSI TCQ or SATA NCQ
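>> For a sense of scale: with TCQ/NCQ a busy drive can have 32 commands
>> queued, and at roughly 10 ms per random I/O the last command in the queue
>> waits about 32 x 10 ms = 320 ms before it is serviced; bursts can push
>> that past 500 ms on a perfectly healthy drive (numbers illustrative).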
>>
>>>
>>> I'm wondering - is there a way to tune OmniOS' behavior so that it
>>> doesn't try so hard to complete IOs to these slow disks, and instead
>>> just gives up and fails them?
>>
>> Yes, the tuning in Alasdair's blog should work as he describes. More below...
>>
>>>
>>> I found an old post from 2011 which states that some tunables exist,
>>> but are ignored by the mpt_sas driver:
>>>
>>> http://everycity.co.uk/alasdair/2011/05/adjusting-drive-timeouts-with-mdb-on-solaris-or-openindiana/
>>>
>>> Does anyone know the current status of these tunables, or have any other
>>> suggestions that might help?
>>
>> These tunables are on the order of seconds. The default, 60, is obviously too big
>> unless you have old, slow, SCSI CD-ROMs. But setting it below the manufacturer's
>> internal limit (default or tuned) can lead to an unstable system. Some vendors are
>> better than others at documenting these, but in any case you'll need to see their spec.
>> Expect values on the order of 6 to 15 seconds for modern HDDs and SSDs.
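>>
>> As a sketch of the kind of change Alasdair's post makes (the value here is
>> illustrative -- check your drives' documented internal recovery limits
>> first), the global sd(7D) command timeout can be set live with mdb:
>>
>>   # echo "sd_io_time/W 0t10" | mdb -kw
>>
>> or persistently in /etc/system, followed by a reboot:
>>
>>   set sd:sd_io_time = 10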
>>
>> There are a lot of tunables in this area at all levels of the architecture. Out of
>> the box, the OmniOS settings ensure stable behaviour. Tuning any layer without
>> understanding the others can lead to unstable systems, as demonstrated by your
>> current downstream consumers.
>> -- richard
>>
>>
>>>
>>> Thanks,
>>> Kevin
>>>
>>>
>>> --
>>> -------------------------------------------------------------------
>>> Kevin Swab UNIX Systems Administrator
>>> ACNS Colorado State University
>>> Phone: (970)491-6572 Email: Kevin.Swab at ColoState.EDU
>>> GPG Fingerprint: 7026 3F66 A970 67BD 6F17 8EB8 8A7D 142F 2392 791C
>>> _______________________________________________
>>> OmniOS-discuss mailing list
>>> OmniOS-discuss at lists.omniti.com
>>> http://lists.omniti.com/mailman/listinfo/omnios-discuss
>