[OmniOS-discuss] zpool degraded while smart says disks are OK
Tobias Oetiker
tobi at oetiker.ch
Mon Mar 31 14:16:08 UTC 2014
Hi Richard,
Mar 23 Richard Elling wrote:
>
> On Mar 21, 2014, at 10:13 PM, Tobias Oetiker <tobi at oetiker.ch> wrote:
>
> > Yesterday Richard Elling wrote:
> >
> >>
> >> On Mar 21, 2014, at 3:23 PM, Tobias Oetiker <tobi at oetiker.ch> wrote:
> >
> > [...]
> >>>
> >>> it happened over time as you can see from the timestamps in the
> >>> log. The errors from zfs's point of view were 1 read and about 30 write
> >>>
> >>> but according to smart the disks are without flaw
> >>
> >> Actually, SMART is pretty dumb. In most cases, it only looks for uncorrectable
> >> errors that are related to media or heads. For a clue to more permanent errors,
> >> you will want to look at the read/write error reports for errors that are
> >> corrected with possible delays. You can also look at the grown defects list.
> >>
> >> This behaviour is expected for drives with errors that are not being quickly
> >> corrected or have firmware bugs (horrors!) and where the disk does not do TLER
> >> (or its vendor's equivalent)
> >> -- richard
> >
> > the error counters look like this:
> >
> >
> > Error counter log:
> >            Errors Corrected by           Total   Correction     Gigabytes    Total
> >                ECC          rereads/    errors   algorithm      processed    uncorrected
> >            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> > read:       3494        0         0      3494      44904        530.879           0
> > write:         0        0         0         0      39111       1793.323           0
> > verify:        0        0         0         0       8133          0.000           0
>
> Errors corrected without delay looks good. The problem lies elsewhere.
>
> >
> > the disk vendor is HGST in case anyone has further ideas ... the system has
> > 20 of these disks and the problems occurred with three of them. The system
> > has been running fine for two months previously.
>
> ...and yet there are aborted commands, likely due to a reset after a timeout.
> Resets aren't issued without cause.
>
> There are two different resets issued by the sd driver: LU and bus. If the
> LU reset doesn't work, the resets are escalated to bus. This is, of course,
> tunable, but is rarely tuned. A bus reset for SAS is a questionable practice,
> since SAS is a fabric, not a bus. But the effect of a device in the fabric
> being reset could be seen as aborted commands by more than one target. To
> troubleshoot these cases, you need to look at all of the devices in the data
> path and map the common causes: HBAs, expanders, enclosures, etc. Traverse
> the devices looking for errors, as you did with the disks. Useful tools:
> sasinfo, lsiutil/sas2ircu, smp_utils, sg3_utils, mpathadm, fmtopo.

thanks for the hints ... after detaching/attaching the 'failed'
disks, they got resilvered and a subsequent scrub did not detect
any errors ...

all a bit mysterious ... will keep an eye on the box to see how it
fares in the future ...
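
for keeping an eye on the counters, here is a minimal Python sketch (an
illustration only, not something that was run in this thread) that reads an
'Error counter log' table like the one quoted above and flags rows where
corrections were delayed or errors went uncorrected, which is what Richard
suggested looking at. The names parse_error_counters and needs_attention are
made up for this example, and the column order is assumed to match the
smartctl output quoted above.

```python
import re
from typing import Dict, List, NamedTuple


class Counters(NamedTuple):
    """One row of the error counter log, in the column order quoted above."""
    ecc_fast: int
    ecc_delayed: int
    rereads_rewrites: int
    total_corrected: int
    algorithm_invocations: int
    gigabytes_processed: float
    total_uncorrected: int


# Accept rows with or without a leading mail quote prefix such as "> > ".
ROW = re.compile(r"^[>\s]*(read|write|verify):\s+(.+)$")


def parse_error_counters(text: str) -> Dict[str, Counters]:
    """Return {operation: Counters} for every well-formed row in text."""
    rows: Dict[str, Counters] = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue
        fields = match.group(2).split()
        if len(fields) != 7:
            continue  # not the seven-column layout this sketch assumes
        rows[match.group(1)] = Counters(
            int(fields[0]), int(fields[1]), int(fields[2]), int(fields[3]),
            int(fields[4]), float(fields[5]), int(fields[6]),
        )
    return rows


def needs_attention(rows: Dict[str, Counters]) -> List[str]:
    """List operations with delayed corrections or uncorrected errors."""
    return [
        op for op, c in rows.items()
        if c.ecc_delayed > 0 or c.total_uncorrected > 0
    ]


# The counters quoted earlier in this thread.
SAMPLE = """\
read:   3494 0 0 3494 44904 530.879 0
write:  0 0 0 0 39111 1793.323 0
verify: 0 0 0 0 8133 0.000 0
"""

if __name__ == "__main__":
    print(needs_attention(parse_error_counters(SAMPLE)))
```

a clean result here only says the drive's own counters look fine; as Richard
points out, the resets may well come from elsewhere in the data path.
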
cheers
tobi
--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
www.oetiker.ch tobi at oetiker.ch +41 62 775 9902
*** We are hiring IT staff: www.oetiker.ch/jobs ***