[OmniOS-discuss] tuning ahci timeouts/retries/ticks (and mpt_sas) (repost from illumos-discuss)
Andrew Kinney
andykinney at advantagecom.net
Wed Aug 10 21:52:24 UTC 2016
Hello,
I'm new to omnios-discuss. Probably should have joined a while ago. We're
running r151014.
I sent the message below to illumos-discuss and it's been crickets over
there for a couple of days now with no responses at all. Probably just
TL;DR.
Executive summary: badly behaved enterprise SATA SSD firmware caused
ahci driver timeouts that blocked all pool IO for 2 minutes before recovering.
All the iSCSI initiators and their loads fell over. I need to tune the
related tuneables to prevent this, but I'm unsure where those are in
illumos/omnios or if they even work. Details below.
Andrew
Hello,
I'm a long time lurker dating back to OpenSolaris days (2009), but today
I have a question. Hopefully the answers are of sufficient value to the
community to be shared. I've tried to include enough detail to satisfy
the inevitable questions about the devices, their setup, and "did you
try xyz?". The system is in light production, so I can do minimally
disruptive things for troubleshooting.
We have an iSCSI storage server that recently knocked all its initiators
offline when ZFS blocked all IO to the pool during a failed erase (soft,
remapped) on an SSD used as a log (zil) device. Fortunately, the system
is in early production use and we have not yet moved anything critical
to it. Even so, it was quite problematic that several diskless
systems were knocked offline along with their entire workloads.
I've been beating my head against a wall trying to sort out where to
configure all the various related timeouts, retries, and ticks, but
documentation is sparse on these topics, even in the code.
Here's our system info:
We're using OmniOS due to its uptime favorable release policies.
# uname -a
SunOS storage0 5.11 omnios-b13298f i86pc i386 i86pc
# zpool list -v xenpool
NAME                       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
xenpool                   14.5T   266G  14.2T         -     1%     1%  1.00x  ONLINE  -
  raidz2                  14.5T   266G  14.2T         -     1%     1%
    c1t5000CCA05C68D505d0     -      -      -         -      -      -
    c1t5000CCA05C681EB9d0     -      -      -         -      -      -
    c1t5000CCA03B1007E5d0     -      -      -         -      -      -
    c1t5000CCA03B10C085d0     -      -      -         -      -      -
  c4t0d0p2                24.9G  2.62M  24.9G         -    92%     0%
  c4t1d0p2                24.9G  2.59M  24.9G         -    92%     0%
cache                         -      -      -         -      -      -
  c4t0d0p3                 336G  6.60G   329G         -     0%     1%
  c4t1d0p3                 336G  6.70G   329G         -     0%     1%
# zpool status -v xenpool
pool: xenpool
state: ONLINE
scan: scrub repaired 0 in 0h42m with 0 errors on Sun Aug 7 03:42:37 2016
config:
        NAME                       STATE     READ WRITE CKSUM
        xenpool                    ONLINE       0     0     0
          raidz2-0                 ONLINE       0     0     0
            c1t5000CCA05C68D505d0  ONLINE       0     0     0
            c1t5000CCA05C681EB9d0  ONLINE       0     0     0
            c1t5000CCA03B1007E5d0  ONLINE       0     0     0
            c1t5000CCA03B10C085d0  ONLINE       0     0     0
        logs
          c4t0d0p2                 ONLINE       0     0     0
          c4t1d0p2                 ONLINE       0     0     0
        cache
          c4t0d0p3                 ONLINE       0     0     0
          c4t1d0p3                 ONLINE       0     0     0
errors: No known data errors
There are no SAS or SATA expanders in the system. All direct-attach.
The logged evidence of the issue:
Jul 12 19:55:46 storage0 genunix: [ID 517647 kern.warning] WARNING:
ahci0: watchdog port 1 satapkt 0xffffff11c59e2978 timed out
Jul 12 19:57:27 storage0 last message repeated 5 times
The device that had the problem was c4t1d0, which is used as a log (zil)
device. We have 'sync' set to 'always' on the dataset containing our
iSCSI extents, so every write waits on the log (zil) device.
# zfs get sync xenpool
NAME     PROPERTY  VALUE   SOURCE
xenpool  sync      always  local
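For completeness, that property was set the usual way; a minimal example,
shown against the pool root since that's where the property is local here:
# zfs set sync=always xenpool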
The device that had the issue is on an Intel C600 series AHCI controller:
# prtconf -ad /dev/dsk/c4t1d0p0
i86pc
    pci, instance #0
        pci15d9,702 (pciex8086,1d02) [Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller], instance #0
            disk, instance #1
It's a purpose-appropriate SSD with power loss protection, sufficient
durability, and decent speed: Samsung 845DC Pro 400GB.
At the time (it's been rebooted since), you could see the "Soft Errors"
had incremented just as the logs suggested (notations mine). Ignore the
"Illegal Request" counters because that's just me mucking about with
smartctl trying to find the problem.
# iostat -Exn
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
<snip - irrelevant content>
c4t0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Samsung SSD 845D Revision: 0X3Q Serial No: <redacted>
Size: 400.09GB <400088457216 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 3967 Predictive Failure Analysis: 0
-------------found you bugger!------------------
c4t1d0 Soft Errors: 6 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Samsung SSD 845D Revision: 0X3Q Serial No: <redacted>
Size: 400.09GB <400088457216 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 3966 Predictive Failure Analysis: 0
-------------/found you bugger!------------------
c1t5000CCA05C68D505d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: HGST Product: HUS724040ALS640 Revision: A1C4 Serial No: <redacted>
Size: 4000.79GB <4000787030016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c1t5000CCA05C681EB9d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: HGST Product: HUS724040ALS640 Revision: A1C4 Serial No: <redacted>
Size: 4000.79GB <4000787030016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c1t5000CCA03B1007E5d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: HGST Product: HUS724040ALS640 Revision: A1C4 Serial No: <redacted>
Size: 4000.79GB <4000787030016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c1t5000CCA03B10C085d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: HGST Product: HUS724040ALS640 Revision: A1C4 Serial No: <redacted>
Size: 4000.79GB <4000787030016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
You can see from its SMART data (notations mine) that it experienced a
failed block erase and swapped that block with a spare block:
# smartctl -a -d sat,12 /dev/rdsk/c4t1d0s0
smartctl 6.5 2016-05-07 r4318 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: Samsung SSD 845DC PRO 400GB
Serial Number: <redacted>
LU WWN Device Id: 5 002538 570000c00
Firmware Version: DXV80X3Q
User Capacity: 400,088,457,216 bytes [400 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Aug 7 16:02:38 2016 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
<snip - miscellaneous unrelated cruft>
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
---problem---
5 Reallocated_Sector_Ct 0x0033 099 099 010 Pre-fail Always - 1
---/problem---
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 13901
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 38
177 Wear_Leveling_Count 0x0013 099 099 005 Pre-fail Always - 132
---problem---
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 099 099 010 Pre-fail Always - 1
---/problem---
180 Unused_Rsvd_Blk_Cnt_Tot 0x0013 099 099 010 Pre-fail Always - 3519
181 Program_Fail_Cnt_Total 0x0032 100 100 000 Old_age Always - 0
---problem---
182 Erase_Fail_Count_Total 0x0032 099 099 000 Old_age Always - 1
---/problem---
---problem---
183 Runtime_Bad_Block 0x0013 099 099 010 Pre-fail Always - 1
---/problem---
184 End-to-End_Error 0x0033 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0032 061 053 000 Old_age Always - 39
195 Hardware_ECC_Recovered 0x001a 200 200 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
202 Unknown_SSD_Attribute 0x0033 100 100 010 Pre-fail Always - 0
235 Unknown_Attribute 0x0012 099 099 000 Old_age Always - 32
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 10096198631
242 Total_LBAs_Read 0x0032 099 099 000 Old_age Always - 2030227326
250 Read_Error_Retry_Rate 0x0032 100 100 001 Old_age Always - 0
<snip - miscellaneous unrelated cruft>
It *did* recover successfully (2 minutes 1 second later or so, going by the
logs), but it was so late in doing so that all the iSCSI initiators had
freaked out by the time ZFS started allowing write IO to flow to the
pool again.
While, technically, I could configure every iSCSI initiator and its
related loads to tolerate a multi-minute pause in disk IO, none of our
users will find that acceptable. They don't want to experience problems
and want the vagaries of storage hidden from view. It just has to work
and be fast, all the time. Truthfully, a 2 minute pause is better than a
crash and reboot of the iSCSI initiators and their loads, so there's
that. I could use that as a final fallback if there is no satisfaction
to be had otherwise, though it flies in the face of most OS level defaults.
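For illustration only (this assumes Linux open-iscsi initiators; other
initiator stacks have equivalent knobs), riding out a pause like that would
mean raising the session replacement timeout well past its default, e.g. in
/etc/iscsi/iscsid.conf:
# seconds to wait for session re-establishment before failing I/O back up the
# stack; the default is 120, and anything shorter than the storage-side stall
# still falls over
node.session.timeo.replacement_timeout = 180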
My initial gut reaction was that this was a firmware problem with the
SSD and it should have responded or failed/errored the IO much sooner
than it did (like TLER in virtually every enterprise device). However,
there is no newer firmware and Samsung has abandoned this model for
whatever reason. I'll probably never buy Samsung SSDs again, no matter
how good the hardware, purely because their support for "datacenter"
SSDs is worse than their consumer gear, but I digress on that point
(long story about my interactions with them on this issue). We'll be
RMA'ing the SSD, but doing it through the server vendor that sold it to
us so we don't have to deal with Samsung.
So, in light of badly behaved storage device firmware (Ding! The number 1
problem in storage today!), I'm faced with tuning drivers to fix the
problem unless there is some magical way to tell the device itself not
to wait so darn long before failing an IO. The timeout, retry, and tick
values in the various drivers default pretty high for legacy reasons
(tape, optical media, etc.), none of which exist (or ever will) in our
sufficiently modern 2 year old system, and those defaults work against
the purpose of this system.
While I'd like to resolve this specific issue with the SSD, I also think
it prudent to deal with timeouts, retries, and ticks for the other
storage controller (mpt_sas driver) and disks in the system:
# prtconf -ad /dev/dsk/c1t5000CCA05C68D505d0
i86pc
    pci, instance #0
        pci8086,e06 (pciex8086,e06) [Intel Corporation Xeon E7 v2/Xeon E5 v2/Core i7 PCI Express Root Port 2c], instance #2
            pci15d9,691 (pciex1000,86) [LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2], instance #0
                iport, instance #1
                    disk, instance #4
Here's what I've done so far since the incident.
In /etc/system :
* reduce the storage device command timeout from the default to 5 seconds
set sd:sd_io_time = 0x5
In /kernel/drv/sd.conf :
# set command retries limit to 3 for all storage devices - 20160726 ATK
#
sd-config-list = "**","retries-timeout:3";
There is a report (Alasdair in 2011,
http://everycity.co.uk/alasdair/2011/05/adjusting-drive-timeouts-with-mdb-on-solaris-or-openindiana/
) that this didn't work for mpt_sas, but that a new kernel patch was in
the works to make the drivers pay attention to these values. I was
unable to discern if the mentioned patch was ever created and applied.
It is unclear if mpt_sas or ahci in my system will honor these settings.
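For what it's worth, the way that post checks whether sd actually picked the
values up is to walk the sd soft state with mdb and print the per-LUN fields,
roughly like this (field names assume the stock illumos sd driver):
# show the command timeout and retry count for every attached sd LUN
echo "::walk sd_state | ::grep '.!=0' | ::print struct sd_lun un_cmd_timeout un_retry_count" | mdb -k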
Further, while the sd driver picks up the value, it seems that ahci does
its own thing. It has a hardcoded value for the watchdog timeout of 5
seconds with a 5 second tick interval:
root@storage0:/root# echo "ahci_watchdog_timeout/D" | mdb -k
ahci_watchdog_timeout:
ahci_watchdog_timeout: 5
root@storage0:/root# echo "ahci_watchdog_tick/D" | mdb -k
ahci_watchdog_tick:
ahci_watchdog_tick: 500
(5 seconds if 10ms per tick)
I suspect there are timing issues here since the tick interval is not less
than the timeout. The effective real world timeout seems to be 20
seconds, if you believe the logs, and then it retries 5 times (or more)
after the initial failure, creating a 2 minute delay for the IO. In my
case, it succeeded on the 7th try, so I don't know how many more retries
it would have done before the ahci driver gave up and returned an error
for ZFS to munch. I have been unable to find where the number of retries
is defined for the ahci driver or what its current value is.
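In case anyone wants to experiment: both ahci globals appear to be ordinary
kernel variables, so in principle they can be poked live with mdb -kw or set
from /etc/system at boot; whether the running driver (and the tick value
derived from the timeout) actually honors a change after attach is exactly
what I can't confirm:
# live poke, example value only (decimal needs the 0t prefix):
echo "ahci_watchdog_timeout/W 0t10" | mdb -kw
# or the /etc/system equivalent for the next boot:
set ahci:ahci_watchdog_timeout = 10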
mpt_sas and scsi have their own set of values as well:
root@storage0:/root# echo "mptsas_scsi_watchdog_tick/D" | mdb -k
mptsas_scsi_watchdog_tick:
mptsas_scsi_watchdog_tick: 1
(1 second or 10ms?)
root@storage0:/root# echo "scsi_watchdog_tick/D" | mdb -k
scsi_watchdog_tick:
scsi_watchdog_tick: 10
(10 seconds or 100ms?)
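Same mechanism applies here if someone wants to experiment: these are plain
kernel globals, so they can be written with mdb -kw and read back with /D,
with the big caveats that the units question above is unresolved and I don't
know whether the drivers re-read them after attach:
echo "scsi_watchdog_tick/W 0t5" | mdb -kw
echo "scsi_watchdog_tick/D" | mdb -k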
I see that Joyent addressed some of this with mpt_sas back in 2013 in
their SmartOS fork (
https://github.com/joyent/smartos-live/commit/e07b2d7fd21c58c9a366d7a5ae040a212f850af5
), but, since they're just adjusting conf files, I don't know if the
settings are even honored by mpt_sas, let alone ahci.
What I'm having trouble wrapping my brain around is this:
1. What timeout, retries, and tick values affect ahci devices and
where/how do I set them so they're honored?
2. What timeout, retries, and tick values affect mpt_sas devices and
where/how do I set them so they're honored?
3. Is there documentation on any of this? If so, where? I'm happy to
RTFM, if only I knew where it was. None of the driver man pages were of
any help.
Sincerely,
Andrew Kinney