[OmniOS-discuss] tuning ahci timeouts/retries/ticks (and mpt_sas) (repost from illumos-discuss)
Andrew Kinney
andykinney at advantagecom.net
Wed Aug 10 21:52:24 UTC 2016
Hello,
I'm new to omnios-discuss. Probably should have joined a while ago. We're
running r151014.
I sent the message below to illumos-discuss and it's been crickets over
there for a couple of days now with no responses at all. Probably just
TL;DR.
Executive summary: badly behaved enterprise SATA SSD firmware caused
ahci driver timeouts that blocked all pool IO for 2 minutes before recovering.
All the iSCSI initiators and their loads fell over. I need to tune the
related tuneables to prevent this, but I'm unsure where those are in
illumos/omnios or if they even work. Details below.
Andrew
Hello,
I'm a long time lurker dating back to OpenSolaris days (2009), but today
I have a question. Hopefully the answers are of sufficient value to the
community to be shared. I've tried to include enough detail to satisfy
the inevitable questions about the devices, their setup, and "did you
try xyz?". The system is in light production, so I can do minimally
disruptive things for troubleshooting.
We have an iSCSI storage server that recently knocked all its initiators
offline when ZFS blocked all IO to the pool during a failed erase (soft,
remapped) on an SSD used as a log (zil) device. Fortunately, the system
is in early production use and we have not yet moved anything critical
to it. Even so, it was quite problematic that several diskless
systems were knocked offline along with their entire workloads.
I've been beating my head against a wall trying to sort out where to
configure all the various related timeouts, retries, and ticks, but
documentation is sparse on these topics, even in the code.
Here's our system info:
We're using OmniOS due to its uptime favorable release policies.
# uname -a
SunOS storage0 5.11 omnios-b13298f i86pc i386 i86pc
# zpool list -v xenpool
NAME                       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
xenpool                   14.5T   266G  14.2T         -     1%     1%  1.00x  ONLINE  -
  raidz2                  14.5T   266G  14.2T         -     1%     1%
    c1t5000CCA05C68D505d0     -      -      -         -      -      -
    c1t5000CCA05C681EB9d0     -      -      -         -      -      -
    c1t5000CCA03B1007E5d0     -      -      -         -      -      -
    c1t5000CCA03B10C085d0     -      -      -         -      -      -
  c4t0d0p2                24.9G  2.62M  24.9G         -    92%     0%
  c4t1d0p2                24.9G  2.59M  24.9G         -    92%     0%
cache                         -      -      -         -      -      -
  c4t0d0p3                 336G  6.60G   329G         -     0%     1%
  c4t1d0p3                 336G  6.70G   329G         -     0%     1%
# zpool status -v xenpool
pool: xenpool
state: ONLINE
scan: scrub repaired 0 in 0h42m with 0 errors on Sun Aug 7 03:42:37 2016
config:
        NAME                       STATE     READ WRITE CKSUM
        xenpool                    ONLINE       0     0     0
          raidz2-0                 ONLINE       0     0     0
            c1t5000CCA05C68D505d0  ONLINE       0     0     0
            c1t5000CCA05C681EB9d0  ONLINE       0     0     0
            c1t5000CCA03B1007E5d0  ONLINE       0     0     0
            c1t5000CCA03B10C085d0  ONLINE       0     0     0
        logs
          c4t0d0p2                 ONLINE       0     0     0
          c4t1d0p2                 ONLINE       0     0     0
        cache
          c4t0d0p3                 ONLINE       0     0     0
          c4t1d0p3                 ONLINE       0     0     0
errors: No known data errors
There are no SAS or SATA expanders in the system. All direct-attach.
The logged evidence of the issue:
Jul 12 19:55:46 storage0 genunix: [ID 517647 kern.warning] WARNING:
ahci0: watchdog port 1 satapkt 0xffffff11c59e2978 timed out
Jul 12 19:57:27 storage0 last message repeated 5 times
The device that had the problem was c4t1d0, which is used as a log (zil)
device. We have 'sync' set to 'always' on the dataset containing our
iSCSI extents, so every write waits on the log (zil) device.
# zfs get sync xenpool
NAME     PROPERTY  VALUE   SOURCE
xenpool  sync      always  local
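For completeness, that property was set the usual way; a minimal example,
shown against the pool root since that's where the property is local here:
# zfs set sync=always xenpool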
The device that had the issue is on an Intel C600 series AHCI controller:
# prtconf -ad /dev/dsk/c4t1d0p0
i86pc
    pci, instance #0
        pci15d9,702 (pciex8086,1d02) [Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller], instance #0
            disk, instance #1
It's a purpose-appropriate SSD with power loss protection, sufficient
durability, and decent speed: Samsung 845DC Pro 400GB.
At the time (it's been rebooted since), you could see the "Soft Errors"
had incremented just as the logs suggested (notations mine). Ignore the
"Illegal Request" counters because that's just me mucking about with
smartctl trying to find the problem.
# iostat -Exn
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
<snip - irrelevant content>
c4t0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Samsung SSD 845D Revision: 0X3Q Serial No: <redacted>
Size: 400.09GB <400088457216 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 3967 Predictive Failure Analysis: 0
-------------found you bugger!------------------
c4t1d0 Soft Errors: 6 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Samsung SSD 845D Revision: 0X3Q Serial No: <redacted>
Size: 400.09GB <400088457216 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 3966 Predictive Failure Analysis: 0
-------------/found you bugger!------------------
c1t5000CCA05C68D505d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: HGST Product: HUS724040ALS640 Revision: A1C4 Serial No: <redacted>
Size: 4000.79GB <4000787030016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c1t5000CCA05C681EB9d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: HGST Product: HUS724040ALS640 Revision: A1C4 Serial No: <redacted>
Size: 4000.79GB <4000787030016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c1t5000CCA03B1007E5d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: HGST Product: HUS724040ALS640 Revision: A1C4 Serial No: <redacted>
Size: 4000.79GB <4000787030016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c1t5000CCA03B10C085d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: HGST Product: HUS724040ALS640 Revision: A1C4 Serial No: <redacted>
Size: 4000.79GB <4000787030016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
You can see from its SMART data (notations mine) that it experienced a
failed block erase and swapped that block with a spare block:
# smartctl -a -d sat,12 /dev/rdsk/c4t1d0s0
smartctl 6.5 2016-05-07 r4318 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: Samsung SSD 845DC PRO 400GB
Serial Number: <redacted>
LU WWN Device Id: 5 002538 570000c00
Firmware Version: DXV80X3Q
User Capacity: 400,088,457,216 bytes [400 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Aug 7 16:02:38 2016 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
<snip - miscellaneous unrelated cruft>
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
---problem---
5 Reallocated_Sector_Ct 0x0033 099 099 010 Pre-fail Always - 1
---/problem---
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 13901
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 38
177 Wear_Leveling_Count 0x0013 099 099 005 Pre-fail Always - 132
---problem---
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 099 099 010 Pre-fail Always - 1
---/problem---
180 Unused_Rsvd_Blk_Cnt_Tot 0x0013 099 099 010 Pre-fail Always - 3519
181 Program_Fail_Cnt_Total 0x0032 100 100 000 Old_age Always - 0
---problem---
182 Erase_Fail_Count_Total 0x0032 099 099 000 Old_age Always - 1
---/problem---
---problem---
183 Runtime_Bad_Block 0x0013 099 099 010 Pre-fail Always - 1
---/problem---
184 End-to-End_Error 0x0033 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0032 061 053 000 Old_age Always - 39
195 Hardware_ECC_Recovered 0x001a 200 200 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
202 Unknown_SSD_Attribute 0x0033 100 100 010 Pre-fail Always - 0
235 Unknown_Attribute 0x0012 099 099 000 Old_age Always - 32
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 10096198631
242 Total_LBAs_Read 0x0032 099 099 000 Old_age Always - 2030227326
250 Read_Error_Retry_Rate 0x0032 100 100 001 Old_age Always - 0
<snip - miscellaneous unrelated cruft>
It *did* recover successfully (2 minutes 1 second later or so, going by the
logs), but it was so late in doing so that all the iSCSI initiators had
freaked out by the time ZFS started allowing write IO to flow to the
pool again.
While, technically, I could configure every iSCSI initiator and its
related loads to tolerate a multi-minute pause in disk IO, none of our
users will find that acceptable. They don't want to experience problems
and want the vagaries of storage hidden from view. It just has to work
and be fast, all the time. Truthfully, a 2 minute pause is better than a
crash and reboot of the iSCSI initiators and their loads, so there's
that. I could use that as a final fallback if there is no satisfaction
to be had otherwise, though it flies in the face of most OS level defaults.
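For illustration only (this assumes Linux open-iscsi initiators; other
initiator stacks have equivalent knobs), riding out a pause like that would
mean raising the session replacement timeout well past its default, e.g. in
/etc/iscsi/iscsid.conf:
# seconds to wait for session re-establishment before failing I/O back up the
# stack; the default is 120, and anything shorter than the storage-side stall
# still falls over
node.session.timeo.replacement_timeout = 180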
My initial gut reaction was that this was a firmware problem with the
SSD and it should have responded or failed/errored the IO much sooner
than it did (like TLER in virtually every enterprise device). However,
there is no newer firmware and Samsung has abandoned this model for
whatever reason. I'll probably never buy Samsung SSDs again, no matter
how good the hardware, purely because their support for "datacenter"
SSDs is worse than their consumer gear, but I digress on that point
(long story about my interactions with them on this issue). We'll be
RMA'ing the SSD, but doing it through the server vendor that sold it to
us so we don't have to deal with Samsung.
So, in light of badly behaved storage device firmware (Ding! The number 1
problem in storage today!), I'm faced with tuning drivers to fix the
problem unless there is some magical way to tell the device itself not
to wait so darn long before failing an IO. The timeout, retry, and tick
values in the various drivers default pretty high for legacy reasons
(tape, optical media, etc.), none of which exist (or ever will) in our
sufficiently modern 2 year old system, and those defaults work against
the purpose of this system.
While I'd like to resolve this specific issue with the SSD, I also think
it prudent to deal with timeouts, retries, and ticks for the other
storage controller (mpt_sas driver) and disks in the system:
# prtconf -ad /dev/dsk/c1t5000CCA05C68D505d0
i86pc
    pci, instance #0
        pci8086,e06 (pciex8086,e06) [Intel Corporation Xeon E7 v2/Xeon E5 v2/Core i7 PCI Express Root Port 2c], instance #2
            pci15d9,691 (pciex1000,86) [LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2], instance #0
                iport, instance #1
                    disk, instance #4
Here's what I've done so far since the incident.
In /etc/system :
* reduce the storage device command timeout from the default to 5 seconds
set sd:sd_io_time = 0x5
In /kernel/drv/sd.conf :
# set command retries limit to 3 for all storage devices - 20160726 ATK
#
sd-config-list = "**","retries-timeout:3";
There is a report (Alasdair in 2011,
http://everycity.co.uk/alasdair/2011/05/adjusting-drive-timeouts-with-mdb-on-solaris-or-openindiana/
) that this didn't work for mpt_sas, but that a new kernel patch was in
the works to make the drivers pay attention to these values. I was
unable to discern if the mentioned patch was ever created and applied.
It is unclear if mpt_sas or ahci in my system will honor these settings.
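For what it's worth, the way that post checks whether sd actually picked the
values up is to walk the sd soft state with mdb and print the per-LUN fields,
roughly like this (field names assume the stock illumos sd driver):
# show the command timeout and retry count for every attached sd LUN
echo "::walk sd_state | ::grep '.!=0' | ::print struct sd_lun un_cmd_timeout un_retry_count" | mdb -k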
Further, while the sd driver picks up the value, it seems that ahci does
its own thing. It has a hardcoded value for the watchdog timeout of 5
seconds with a 5 second tick interval:
root@storage0:/root# echo "ahci_watchdog_timeout/D" | mdb -k
ahci_watchdog_timeout:
ahci_watchdog_timeout: 5
root@storage0:/root# echo "ahci_watchdog_tick/D" | mdb -k
ahci_watchdog_tick:
ahci_watchdog_tick: 500
(5 seconds if 10ms per tick)
I suspect there are timing issues here since the tick interval is not less
than the timeout. The effective real world timeout seems to be 20
seconds, if you believe the logs, and then it retries 5 times (or more)
after the initial failure, creating a 2 minute delay for the IO. In my
case, it succeeded on the 7th try, so I don't know how many more retries
it would have done before the ahci driver gave up and returned an error
for ZFS to munch. I have been unable to find where the number of retries
is defined for the ahci driver or what its current value is.
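In case anyone wants to experiment: both ahci globals appear to be ordinary
kernel variables, so in principle they can be poked live with mdb -kw or set
from /etc/system at boot; whether the running driver (and the tick value
derived from the timeout) actually honors a change after attach is exactly
what I can't confirm:
# live poke, example value only (decimal needs the 0t prefix):
echo "ahci_watchdog_timeout/W 0t10" | mdb -kw
# or the /etc/system equivalent for the next boot:
set ahci:ahci_watchdog_timeout = 10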
mpt_sas and scsi have their own set of values as well:
root@storage0:/root# echo "mptsas_scsi_watchdog_tick/D" | mdb -k
mptsas_scsi_watchdog_tick:
mptsas_scsi_watchdog_tick: 1
(1 second or 10ms?)
root@storage0:/root# echo "scsi_watchdog_tick/D" | mdb -k
scsi_watchdog_tick:
scsi_watchdog_tick: 10
(10 seconds or 100ms?)
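Same mechanism applies here if someone wants to experiment: these are plain
kernel globals, so they can be written with mdb -kw and read back with /D,
with the big caveats that the units question above is unresolved and I don't
know whether the drivers re-read them after attach:
echo "scsi_watchdog_tick/W 0t5" | mdb -kw
echo "scsi_watchdog_tick/D" | mdb -k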
I see that Joyent addressed some of this with mpt_sas back in 2013 in
their SmartOS fork (
https://github.com/joyent/smartos-live/commit/e07b2d7fd21c58c9a366d7a5ae040a212f850af5
), but, since they're just adjusting conf files, I don't know if the
settings are even honored by mpt_sas, let alone ahci.
What I'm having trouble wrapping my brain around is this:
1. What timeout, retries, and tick values affect ahci devices and
where/how do I set them so they're honored?
2. What timeout, retries, and tick values affect mpt_sas devices and
where/how do I set them so they're honored?
3. Is there documentation on any of this? If so, where? I'm happy to
RTFM, if only I knew where it was. None of the driver man pages were of
any help.
Sincerely,
Andrew Kinney