[OmniOS-discuss] ZFS data corruption
Stephan Budach
stephan.budach at JVM.DE
Mon Aug 24 09:54:29 UTC 2015
On 22.08.15 at 19:02, Doug Hughes wrote:
> I've been experiencing spontaneous checksum failure/corruption on read
> at the zvol level recently on a box running r12 as well. None of the
> disks show any errors. All of the errors show up at the zvol level
> until all the disks in the vol get marked as degraded and then a
> reboot clears it up. Repeated scrubs find files to delete, but then,
> after additional heavy read I/O activity, more checksum-on-read errors
> occur, and more files need to be removed. So far on r14 I haven't seen
> this, but I'm keeping an eye on it.
>
> The write activity on this server is very low. I'm currently trying to
> evacuate it with zfs send | mbuffer to another host over 10g, so the
> read activity is very high and consistent over a long period of time
> since I have to move about 10TB.
>
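For reference, the kind of evacuation pipeline Doug describes usually looks
something like the following - the host names, pool/dataset names and buffer
sizes below are only placeholders, not taken from his setup:

# on the receiving host: listen on a TCP port and feed the stream into zfs receive
mbuffer -s 128k -m 1G -I 9090 | zfs receive -F tank/evac

# on the sending host: pipe a replication stream through mbuffer over the 10G link
zfs send -R tank/data@evac | mbuffer -s 128k -m 1G -O otherhost:9090
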
This morning, I received another of these zvol errors, which was also
reported up to my RAC cluster. I haven't fully checked that yet, but I
think ASM/ADVM simply issued a re-read and was happy with the
result. Otherwise, ASM would have issued a read against the mirror side
and probably taken the "faulty" failure group offline, which it didn't.
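For completeness, the first place I keep checking on the OmniOS side is plain
zpool status; something like the following (the pool name is just a
placeholder) lists the per-vdev READ/WRITE/CKSUM counters and, under
"errors:", any objects with permanent errors, which shows whether the problem
sits at the zvol level or on the underlying disks:

zpool status -v tank
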
However, I was wondering how to get some more information out of the STMF
framework and found a post describing how to read the STMF trace buffer:
root at nfsvmpool07:/root# echo '*stmf_trace_buf/s' | mdb -k | more
0xffffff090f828000: :0002579: Imported the LU 600144f090860e6b0000550c3a290001
:0002580: Imported the LU 600144f090860e6b0000550c3e240002
:0002581: Imported the LU 600144f090860e6b0000550c3e270003
:0002603: Imported the LU 600144f090860e6b000055925a120001
:0002604: Imported the LU 600144f090860e6b000055a50ebf0002
:0002604: Imported the LU 600144f090860e6b000055a8f7d70003
:0002605: Imported the LU 600144f090860e6b000055a8f7e30004
:150815416: UIO_READ failed, ret = 5, resid = 131072
:224314824: UIO_READ failed, ret = 5, resid = 131072
So this basically shows two read errors (ret = 5 is EIO), which is
consistent with the incidents I have had on this system. Unfortunately,
this doesn't buy me much more, since I don't know how to track it down
any further, but it seems that COMSTAR had issues reading from the zvol.
Is it possible to debug this further?
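Two things I will probably try next - these are only sketches, the module and
function names are assumptions on my part and may differ between releases, so
I would verify them with dtrace -l on the box first:

# 1) grab a kernel stack every time an entry is written to the STMF trace
#    buffer, to see which code path logs the UIO_READ failure (assumes
#    stmf_trace() lives in the stmf module and is not inlined)
dtrace -n 'fbt:stmf:stmf_trace:entry { stack(); }'

# 2) catch EIO (errno 5) coming back from the DMU read path while the LU is
#    being read, assuming the sbd reads end up in dmu_read_uio*()
dtrace -n 'fbt:zfs:dmu_read*:return /arg1 == 5/ { stack(); }'
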
>
> On 8/21/2015 2:06 AM, wuffers wrote:
>> Oh, the PSOD is not caused by the corruption in ZFS - I suspect it
>> was the other way around (VMware host PSOD -> ZFS corruption). I've
>> experienced the PSOD before; it may be related to I/O issues, which I
>> outlined in another post here:
>> http://lists.omniti.com/pipermail/omnios-discuss/2015-June/005222.html
>>
>> Nobody chimed in, but it's an ongoing issue. I need to dedicate more
>> time to troubleshooting, but other projects are taking my attention right
>> now (coupled with a personal house move, time is at a premium!).
>>
>> Also, I've had many improper shutdowns of the hosts and VMs, and this
>> was the first time I've seen a ZFS corruption.
>>
>> I know I'm repeating myself, but my question is still:
>> - Can I safely use this block device again now that it reports no
>> errors? Again, I've moved all data off of it, and there are no other
>> signs of hardware issues. Recreate it?
>>
>> On Wed, Aug 19, 2015 at 12:49 PM, Stephan Budach
>> <stephan.budach at jvm.de> wrote:
>>
>> Hi Joerg,
>>
>> On 19.08.15 at 14:59, Joerg Goltermann wrote:
>>
>> Hi,
>>
>> The PSOD you got can cause the problems on your Exchange
>> database.
>>
>> Can you check the ESXi logs for the root cause of the PSOD?
>>
>> I never got a PSOD on such a "corruption". I still think this is
>> a "cosmetic" bug, but this should be verified by one of the ZFS
>> developers ...
>>
>> - Joerg
>>