[OmniOS-discuss] disk failure causing reboot?
Rune Tipsmark
rt at steait.net
Tue May 19 09:19:17 UTC 2015
Same issue here around two months ago when an L2ARC device failed… failmode was at its default, and the device was actually an mSATA SSD mounted in a PCI-E mSATA card:
http://www.addonics.com/products/ad4mspx2.php and the disk was one of four of these http://www.samsung.com/us/computer/memory-storage/MZ-MTE1T0BW
Can these reboots be avoided in any way?
Br,
Rune
From: OmniOS-discuss [mailto:omnios-discuss-bounces at lists.omniti.com] On Behalf Of Schweiss, Chip
Sent: Monday, May 18, 2015 10:31 PM
To: Paul B. Henson
Cc: omnios-discuss
Subject: Re: [OmniOS-discuss] disk failure causing reboot?
I had the exact same failure mode last week. With over 1000 spindles I see this about once a month.
I can publish my dump also if anyone actually wants to try to fix this problem, but I think there are several dumps of the same thing already linked to tickets in Illumos-gate.
Pools for the most part should be set to failmode=panic or wait, but a failed disk should not cause a panic. On the system where this happened to me, failmode was set to wait. It is also on r151012, waiting on a window to upgrade to r151014. My pool is raidz3, so there is no reason not to kick a bad disk.
All my disks are SAS in DataON JBODs, dual connected across two LSI HBAs. BTW, pull a SAS cable and you get a panic too, not degraded multipath. Illumos seems to panic on just about any SAS event these days regardless of redundancy.
-Chip
On Mon, May 18, 2015 at 3:08 PM, Paul B. Henson <henson at acm.org> wrote:
On Mon, May 18, 2015 at 06:25:34PM +0000, Jeff Stockett wrote:
> A drive failed in one of our supermicro 5048R-E1CR36L servers running
> omnios r151012 last night, and somewhat unexpectedly, the whole system
> seems to have panicked.
You don't happen to have failmode set to panic on the pool?
From the zpool manpage:
    failmode=wait | continue | panic

        Controls the system behavior in the event of catastrophic pool
        failure. This condition is typically a result of a loss of
        connectivity to the underlying storage device(s) or a failure of
        all devices within the pool. The behavior of such an event is
        determined as follows:

        wait
            Blocks all I/O access until the device connectivity is
            recovered and the errors are cleared. This is the
            default behavior.

        continue
            Returns EIO to any new write I/O requests but allows
            reads to any of the remaining healthy devices. Any
            write requests that have yet to be committed to disk
            would be blocked.

        panic
            Prints out a message to the console and generates a
            system crash dump.
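
To check which of these three values a pool is actually using, the property can be read with `zpool get failmode <pool>`. The following is only a rough Python sketch of parsing that command's default four-column output (NAME, PROPERTY, VALUE, SOURCE). The helper names and the sample text are made up for illustration (the pool names "tank" and "backup" are hypothetical, not from anyone's system in this thread), and the exact column spacing on a given release may differ.

```python
import subprocess

# The three values documented in the zpool manpage excerpt above.
VALID_FAILMODES = {"wait", "continue", "panic"}


def parse_failmode(zpool_get_output):
    """Parse the default tabular output of `zpool get failmode <pool> ...`.

    Returns a dict mapping pool name to its failmode value.
    """
    modes = {}
    for line in zpool_get_output.splitlines():
        fields = line.split()
        # Skip blank or short lines and the NAME/PROPERTY/VALUE/SOURCE header.
        if len(fields) < 3 or fields[:2] == ["NAME", "PROPERTY"]:
            continue
        name, prop, value = fields[0], fields[1], fields[2]
        if prop == "failmode" and value in VALID_FAILMODES:
            modes[name] = value
    return modes


def pools_set_to_panic(zpool_get_output):
    """Return the sorted names of pools whose failmode is 'panic'."""
    parsed = parse_failmode(zpool_get_output)
    return sorted(pool for pool, mode in parsed.items() if mode == "panic")


def read_failmode(pool):
    """Run `zpool get failmode <pool>` and return the value, or None.

    Hypothetical helper: needs a host with ZFS and a pool of that name.
    """
    result = subprocess.run(
        ["zpool", "get", "failmode", pool],
        check=True, capture_output=True, text=True,
    )
    return parse_failmode(result.stdout).get(pool)


# Made-up sample output for illustration only, not a real log.
SAMPLE = """\
NAME    PROPERTY  VALUE  SOURCE
tank    failmode  wait   default
backup  failmode  panic  local
"""
```

With the sample text, `parse_failmode(SAMPLE)` maps "tank" to "wait" and "backup" to "panic". Note that a pool reporting "wait" can still sit on a host that panics on a disk failure, as described elsewhere in this thread, so this check only rules out failmode=panic as the cause.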