[OmniOS-discuss] disk failure causing reboot?

Tue May 19 09:19:17 UTC 2015

Same issue here around two months ago when an L2ARC device failed… failmode was at the default, and the device was actually an mSATA SSD mounted in a PCI-E mSATA card:

http://www.addonics.com/products/ad4mspx2.php and the disk was one of four of these: http://www.samsung.com/us/computer/memory-storage/MZ-MTE1T0BW

Can these reboots be avoided in any way?

Br,
Rune

From: OmniOS-discuss [mailto:omnios-discuss-bounces at lists.omniti.com] On Behalf Of Schweiss, Chip
Sent: Monday, May 18, 2015 10:31 PM
To: Paul B. Henson
Cc: omnios-discuss
Subject: Re: [OmniOS-discuss] disk failure causing reboot?

I had the exact same failure mode last week.  With over 1000 spindles I see this about once a month.

I can publish my dump too if anyone actually wants to try to fix this problem, but I think several dumps of the same thing are already linked to tickets in Illumos-gate.
For the most part, pools should be set to failmode=panic or wait, but a failed disk should not cause a panic. On the system where this happened to me, failmode was set to wait. It is also on r151012, waiting on a window to upgrade to r151014. My pool is raidz3, so there is no reason not to kick a bad disk.
All my disks are SAS in DataON JBODs, dual connected across two LSI HBAs. BTW, pull a SAS cable and you get a panic too, not degraded multipath. Illumos seems to panic on just about any SAS event these days, regardless of redundancy.
-Chip

On Mon, May 18, 2015 at 3:08 PM, Paul B. Henson <henson at acm.org> wrote:
On Mon, May 18, 2015 at 06:25:34PM +0000, Jeff Stockett wrote:
> A drive failed in one of our supermicro 5048R-E1CR36L servers running
> omnios r151012 last night, and somewhat unexpectedly, the whole system
> seems to have panicked.

You don't happen to have failmode set to panic on the pool?

From the zpool manpage:

       failmode=wait | continue | panic
           Controls the system behavior in the event of catastrophic pool
           failure. This condition is typically a result of a loss of
           connectivity to the underlying storage device(s) or a failure of
           all devices within the pool. The behavior of such an event is
           determined as follows:

           wait
                       Blocks all I/O access until the device connectivity is
                       recovered and the errors are cleared. This is the
                       default behavior.

           continue
                       Returns EIO to any new write I/O requests but allows
                       reads to any of the remaining healthy devices. Any
                       write requests that have yet to be committed to disk
                       would be blocked.

           panic
                       Prints out a message to the console and generates a
                       system crash dump.
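
To check which pools are actually set to panic, the property can be read with `zpool get failmode <pool>`. The following is only a rough sketch of scripting that check across several pools; it assumes the usual four-column NAME/PROPERTY/VALUE/SOURCE table that `zpool get` prints, and the helper names and the pool names `tank` and `data` are made up for illustration.

```python
import subprocess


def parse_failmode(zpool_get_output):
    """Map pool name -> failmode value from `zpool get failmode` table output."""
    modes = {}
    for line in zpool_get_output.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a failmode row.
        if len(fields) < 3 or fields[1] != "failmode":
            continue
        modes[fields[0]] = fields[2]
    return modes


def pools_that_panic(modes):
    """Return the pools whose failmode is explicitly 'panic', sorted by name."""
    return sorted(pool for pool, mode in modes.items() if mode == "panic")


def current_failmodes(pools):
    """Query live pools; only works on a host where the zpool command exists."""
    result = subprocess.run(
        ["zpool", "get", "failmode", *pools],
        check=True, capture_output=True, text=True,
    )
    return parse_failmode(result.stdout)


if __name__ == "__main__":
    # Illustrative sample text, not output captured from a real system.
    sample = (
        "NAME  PROPERTY  VALUE  SOURCE\n"
        "tank  failmode  wait   default\n"
        "data  failmode  panic  local\n"
    )
    print(pools_that_panic(parse_failmode(sample)))
```

A pool that shows up here can be changed with `zpool set failmode=wait <pool>` (or `continue`). That said, as reported elsewhere in this thread, panics were also seen with failmode=wait, so this check only rules out the property as the cause.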

_______________________________________________
OmniOS-discuss mailing list
OmniOS-discuss at lists.omniti.com
http://lists.omniti.com/mailman/listinfo/omnios-discuss
