[OmniOS-discuss] big zfs storage?
Mick Burns
bmx1955 at gmail.com
Thu Oct 8 15:14:57 UTC 2015
Thanks to everyone who answered; very insightful.
What scares me the most is hearing about the panics and FMA not having
time to react at all, and also the stories of multiple hot spares
kicking in sub-optimally, as described by Chip. A recipe for disaster.
I guess this is an area where Nexenta has worked hard to implement
their own graceful handling of various tested failure scenarios.
However, you're covered if and only if you have a system conforming to
their HCL.
This goes in line with what Chris has implemented where he works: very
customized to their environment and policies. (I've put a rough sketch
of that kind of topology-aware spare selection at the bottom of this
message.)
On Wed, Oct 7, 2015 at 9:36 PM, Chris Siebenmann <cks at cs.toronto.edu> wrote:
>> I completely concur with Richard on this. Let me give a real example
>> that emphasizes this point, as it's a critical design decision.
> [...]
>> Now I only run one hot spare per pool. Most of my pools are raidz2 or
>> raidz3. This way an event like this cannot take out more than one
>> disk, and parity protection will never be lost.
>>
>> There are other causes that can trigger multiple disk replacements. I
>> have not encountered them, but if I do, the limit of one hot spare
>> means they won't hurt my data.
>
> My view is that spare handling needs to be a local decision based on
> your storage topology and pool and vdev structure (and on your durability
> needs, and even on how staffing is handled, eg if you have a 24/7 on
> call rotation). I don't think there is any single global right answer;
> hot spares will be good for some people and bad for others.
>
> Locally we use mirrored vdevs, multiple pools, an iSCSI SAN to connect
> to actual disks, multiple backend disk controllers, and no 24/7 on call
> setup. We've developed an automated spares handling system that knows a
> great deal about our local storage topology (so it knows what are 'good'
> and 'bad' spares for any particular bad disk, using various criteria)
> and having it available has been very helpful in the face of various
> things going wrong, both individual disk failures and entire backend
> disk controllers suffering power failures after the end of the workday.
> Our solution is of course very local, but the important thing is that
> it's clear that automating this has been the right tradeoff *for us*.
>
> (In another environment it would probably be the wrong answer, eg if we
> had a 24/7 NOC staffed with people to swap physical disks and hardware
> at any time of the day, night, or holidays, and a 24/7 on call sysadmin
> to do system things like 'zpool replace'. There are other parts of
> the university which do have this. I suspect that they don't use an
> automated spares system of any kind, although I don't know for sure.)
>
> - cks
> _______________________________________________
> OmniOS-discuss mailing list
> OmniOS-discuss at lists.omniti.com
> http://lists.omniti.com/mailman/listinfo/omnios-discuss
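
PS: to make sure I understood Chris's approach, here is a rough sketch
in Python of the kind of topology-aware spare selection he describes.
To be clear, this is not his actual system; the topology map, device
names and pool name are all invented, and the real "good"/"bad"
criteria would be whatever your own environment needs before you hand
off to 'zpool replace':

#!/usr/bin/env python3
# Hypothetical sketch only: illustrates topology-aware spare selection,
# not Chris's actual system.  Every name below (the controller map, the
# device names, the pool name) is invented for the example.

import subprocess

# Invented map: device -> backend iSCSI controller it lives on.
CONTROLLER_OF = {
    "c0t5000C500AAAAAAAAd0": "backend-a",
    "c0t5000C500BBBBBBBBd0": "backend-b",
    "c0t5000C500CCCCCCCCd0": "backend-a",
    "c0t5000C500DDDDDDDDd0": "backend-b",
}

def pick_spare(mirror_partner, free_spares):
    # A 'good' spare sits on a different backend controller than the
    # surviving mirror partner, so one controller outage cannot take
    # out both halves of the vdev.  'Bad' spares are a last resort.
    partner_ctrl = CONTROLLER_OF.get(mirror_partner)
    good = [d for d in free_spares if CONTROLLER_OF.get(d) != partner_ctrl]
    bad = [d for d in free_spares if d not in good]
    candidates = good + bad
    return candidates[0] if candidates else None

def replace_disk(pool, failed_disk, new_disk, dry_run=True):
    # The only real command in the sketch; everything above just
    # decides which device to hand to it.
    cmd = ["zpool", "replace", pool, failed_disk, new_disk]
    if dry_run:
        print("would run:", " ".join(cmd))
    else:
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Example: the disk on backend-a died and its mirror partner is on
    # backend-b, so we prefer a free spare that is NOT on backend-b.
    spare = pick_spare(
        mirror_partner="c0t5000C500BBBBBBBBd0",
        free_spares=["c0t5000C500CCCCCCCCd0", "c0t5000C500DDDDDDDDd0"],
    )
    if spare:
        replace_disk("tank", "c0t5000C500AAAAAAAAd0", spare)

The dry_run flag is only there because I'd want to eyeball the chosen
spare for a while before letting anything run 'zpool replace'
unattended.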