[OmniOS-discuss] ZFS crash/reboot loop

Richard Elling richard.elling at richardelling.com
Mon Jul 13 01:18:17 UTC 2015


> On Jul 12, 2015, at 5:26 PM, Derek Yarnell <derek at umiacs.umd.edu> wrote:
> 
> On 7/12/15 3:21 PM, Günther Alka wrote:
>> First action:
>> if you can import the pool read-only, update your backup.
> 
> We are currently securing all the non-scratch data before messing with
> the pool any more.  We had backups as recent as the night before, but it
> is still going to be faster to pull the current data from the read-only
> pool than from backups.
> 
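For reference, a read-only import is typically done like this (the pool
name "tank" is assumed here; substitute your own):

    # imports the pool without writing anything to the devices
    zpool import -o readonly=on tank
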
>> Then
>> I would expect that a single bad disk is the cause of the problem on a
>> write command. I would first check the system and fault logs or SMART
>> values for hints about a bad disk. If there is a suspicious disk,
>> remove it and retry a regular import.
> 
> We pulled all disks individually yesterday to test this exact
> theory.  We have hit the mpt_sas disk-failure panics before, so we had
> already tried this.
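
For the log and SMART checks mentioned above, the usual illumos commands
would be:

    fmadm faulty   # components FMA has faulted, if any
    fmdump -eV     # raw error telemetry, including disk ereports
    iostat -En     # per-device soft/hard/transport error counters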

I don't believe this is a bad disk.

Some additional block pointer verification code was added in changeset
f63ab3d5a84a12b474655fc7e700db3efba6c4c9 and is likely the cause
of this assertion. In general, assertion failures are almost always software
problems -- the programmer didn't see what they expected.
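
For anyone who wants to read the new checks themselves, assuming that
hash refers to the illumos-gate repository:

    git clone https://github.com/illumos/illumos-gate.git
    git -C illumos-gate show f63ab3d5a84a12b474655fc7e700db3efba6c4c9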

Dan, if you're listening, Matt would be the best person to weigh in on this.
 -- richard

> 
>> If there is no hint:
>> next, I would try a pool export. Then create a script that imports the
>> pool followed by a scrub cancel (hope that the cancel is faster than
>> the crash). Then check the logs during some pool activity.
> 
> If I have not imported the pool read-write, can I export the pool?  I
> thought we had tried this, but I will have to confer.
> 
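A minimal sketch of that import-then-cancel script (pool name assumed):

    #!/bin/sh
    # Import, then immediately stop the scrub that resumes on import.
    zpool import tank && zpool scrub -s tank
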
>> If this does not help, I would remove all data disks and boot up.
>> Then hot-plug disk by disk and check whether each one is detected
>> properly, and check the logs. Your pool remains offline until enough
>> disks come back. Adding disk by disk and checking the logs should help
>> to find a bad disk that initiates a crash.
> 
> This is interesting and we will try this once we secure the data.
> 
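While hot-plugging, something like this should show each disk arriving:

    tail -f /var/adm/messages   # kernel/driver messages as disks attach
    cfgadm -al                  # confirm each attachment point is configured
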
>> Next option: try a pool import with a different single disk missing
>> each time. As long as nothing is written, missing disks are not a
>> problem for ZFS (you may need to clear errors).
> 
> Wouldn't this be the same as above hot-plugging disk by disk?
> 
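Roughly, per iteration with one disk pulled (pool name assumed):

    zpool import tank    # pool comes up DEGRADED with one disk missing
    zpool clear tank     # clear errors from the missing device, if needed
    zpool export tank    # export, reinsert, pull the next disk, repeat
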
>> Last option:
>> try the import on another server (to rule out a mainboard, power, HBA,
>> or backplane problem), or remove all disks and run a nondestructive or
>> SMART test on another machine.
> 
> Sadly, we do not have a spare chassis with 40 slots around to test this.
> So far I am unconvinced that this is a hardware problem, though.
> 
> We will most likely boot into a Linux live CD to run smartctl and see
> if it has any information on the disks.
> 
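From a live CD, something like this (device names are illustrative):

    smartctl -a /dev/sda           # full SMART report for one disk
    smartctl -t long /dev/sda      # start a long self-test
    smartctl -l selftest /dev/sda  # read the results afterwards
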
> -- 
> Derek T. Yarnell
> University of Maryland
> Institute for Advanced Computer Studies


