[OmniOS-discuss] ZFS crash/reboot loop

Mon Jul 13 00:26:54 UTC 2015

On 7/12/15 3:21 PM, Günther Alka wrote:
> First action:
> If you can mount the pool read-only, update your backup

We are securing all the non-scratch data currently before messing with
the pool any more.  We had backups as recent as the night before but it
is still going to be faster to pull the current data from the readonly
pool than from backups.

> Then
> I would expect a single bad disk to be the cause of the problem on a
> write command. I would first check the system and fault logs or
> SMART values for hints about a bad disk. If there is a suspicious disk,
> remove it and retry a regular import.

We pulled all disks individually yesterday to test this exact
theory.  We have hit the mpt_sas disk failure panics before, so we had
already tried this.

> If there is no hint
> Next what I would try is a pool export. Then create a script that
> imports the pool followed by a scrub cancel. (Hope that the cancel is
> faster than the crash). Then check logs during some pool activity.

If I have not imported the pool read-write, can I still export it?  I
thought we had tried this, but I will have to confer.
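
The import-then-cancel script suggested above could be sketched roughly as
follows. This is a minimal, untested sketch under assumptions: the pool name
"tank" is a placeholder, the helper names are hypothetical, and it assumes
`zpool` is on the PATH and the script runs as root. It relies only on
`zpool import <pool>` and `zpool scrub -s <pool>` (stop a running scrub).

```python
import subprocess


def build_commands(pool):
    """Return the two zpool commands to run back to back.

    `zpool scrub -s` stops an in-progress scrub on the pool.
    """
    return [
        ["zpool", "import", pool],
        ["zpool", "scrub", "-s", pool],
    ]


def import_and_cancel_scrub(pool, run=subprocess.call):
    """Run the commands in order and stop at the first failure.

    `run` is injectable so the sequence can be dry-run without a pool.
    Returns the exit status of the last command that was run.
    """
    status = 0
    for cmd in build_commands(pool):
        status = run(cmd)
        if status != 0:
            break
    return status


# Example (placeholder pool name, run as root):
#   import_and_cancel_scrub("tank")
```

Running both commands from one process keeps the gap between the import and
the cancel as short as possible, which is the point of scripting it.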

> If this does not help, I would remove all data disks and boot up.
> Then hot-plug disk by disk, check that each is detected properly, and
> check the logs. Your pool remains offline until enough disks come back.
> Adding disk by disk and checking logs should help to find a bad disk
> that initiates a crash.

This is interesting and we will try this once we secure the data.

> Next option is to try a pool import with a different disk missing
> each time. As long as there are no writes, missing disks are not a
> problem with ZFS (you may need to clear errors).

Wouldn't this be the same as hot-plugging disk by disk, as described above?

> Last option:
> Try the import on another server (to rule out a mainboard, power, HBA or
> backplane problem): remove all disks and do a nondestructive or SMART
> test on another machine.

Sadly, we do not have a spare chassis with 40 slots around to test this.
So far I am unconvinced that this is a hardware problem, though.

We will most likely boot into a Linux live CD to run smartctl and see
if it has any information on the disks.
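
With 40 disks, one way to triage the smartctl results is to decode its exit
status per disk instead of reading every report. The following is a rough
sketch, not a tested tool: the device paths are placeholders, the helper
names are hypothetical, and the bit meanings are paraphrased from the
smartctl(8) man page.

```python
import subprocess

# Bit meanings of smartctl's exit status, paraphrased from smartctl(8).
STATUS_BITS = {
    0: "command line did not parse",
    1: "device open failed",
    2: "a SMART command failed or returned a checksum error",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes at or below threshold",
    5: "attributes at or below threshold at some time in the past",
    6: "device error log contains records of errors",
    7: "self-test log contains records of errors",
}


def decode_status(code):
    """Return the messages for every bit set in a smartctl exit status."""
    return [msg for bit, msg in sorted(STATUS_BITS.items()) if code & (1 << bit)]


def check_disks(devices, run=subprocess.call):
    """Run smartctl quietly on each device and decode its exit status.

    `run` is injectable so the decoding can be exercised without real disks.
    """
    report = {}
    for dev in devices:
        # -a prints all SMART information; -q silent suppresses the output
        # so that only the exit status is used.
        code = run(["smartctl", "-q", "silent", "-a", dev])
        report[dev] = decode_status(code)
    return report


# Example (placeholder device names):
#   check_disks(["/dev/sda", "/dev/sdb"])
```

A disk that maps to an empty list reported nothing; any other disk is worth
a full `smartctl -a` run by hand.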

-- 
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies