[OmniOS-discuss] ZFS data corruption
Stephan Budach
stephan.budach at JVM.DE
Mon Aug 24 09:54:29 UTC 2015
On 22.08.15 at 19:02, Doug Hughes wrote:
> I've been experiencing spontaneous checksum failure/corruption on read
> at the zvol level recently on a box running r12 as well. None of the
> disks show any errors. All of the errors show up at the zvol level
> until all the disks in the vol get marked as degraded and then a
> reboot clears it up. Repeated scrubs find files to delete, but then,
> after additional heavy read I/O activity, more checksum-on-read errors
> occur, and more files need to be removed. So far on r14 I haven't seen
> this, but I'm keeping an eye on it.
>
> The write activity on this server is very low. I'm currently trying to
> evacuate it with zfs send | mbuffer to another host over 10g, so the
> read activity is very high and consistent over a long period of time
> since I have to move about 10TB.
>
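For reference, the kind of evacuation pipeline Doug describes usually looks
something like the following - the host names, pool/dataset names and buffer
sizes below are only placeholders, not taken from his setup:

# on the receiving host: listen on a TCP port and feed the stream into zfs receive
mbuffer -s 128k -m 1G -I 9090 | zfs receive -F tank/evac

# on the sending host: pipe a replication stream through mbuffer over the 10G link
zfs send -R tank/data@evac | mbuffer -s 128k -m 1G -O otherhost:9090
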
This morning, I received another of these zvol errors, which was also
reported up to my RAC cluster. I haven't fully checked that yet, but I
think ASM/ADVM simply issued a re-read and was happy with the
result. Otherwise, ASM would have issued a read against the mirror side
and probably taken the "faulty" failure group offline, which it didn't.
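For completeness, the first place I keep checking on the OmniOS side is plain
zpool status; something like the following (the pool name is just a
placeholder) lists the per-vdev READ/WRITE/CKSUM counters and, under
"errors:", any objects with permanent errors, which shows whether the problem
sits at the zvol level or on the underlying disks:

zpool status -v tank
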
However, I was wondering how to get some more information out of the STMF
framework and found a post describing how to read the STMF trace buffer:
root at nfsvmpool07:/root# echo '*stmf_trace_buf/s' | mdb -k | more
0xffffff090f828000: :0002579: Imported the LU 600144f090860e6b0000550c3a290001
:0002580: Imported the LU 600144f090860e6b0000550c3e240002
:0002581: Imported the LU 600144f090860e6b0000550c3e270003
:0002603: Imported the LU 600144f090860e6b000055925a120001
:0002604: Imported the LU 600144f090860e6b000055a50ebf0002
:0002604: Imported the LU 600144f090860e6b000055a8f7d70003
:0002605: Imported the LU 600144f090860e6b000055a8f7e30004
:150815416: UIO_READ failed, ret = 5, resid = 131072
:224314824: UIO_READ failed, ret = 5, resid = 131072
So this basically shows two read errors (ret = 5 is EIO), which is
consistent with the incidents I have had on this system. Unfortunately,
this doesn't buy me much more, since I don't know how to track it down
any further, but it seems that COMSTAR had issues reading from the zvol.
Is it possible to debug this further?
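Two things I will probably try next - these are only sketches, the module and
function names are assumptions on my part and may differ between releases, so
I would verify them with dtrace -l on the box first:

# 1) grab a kernel stack every time an entry is written to the STMF trace
#    buffer, to see which code path logs the UIO_READ failure (assumes
#    stmf_trace() lives in the stmf module and is not inlined)
dtrace -n 'fbt:stmf:stmf_trace:entry { stack(); }'

# 2) catch EIO (errno 5) coming back from the DMU read path while the LU is
#    being read, assuming the sbd reads end up in dmu_read_uio*()
dtrace -n 'fbt:zfs:dmu_read*:return /arg1 == 5/ { stack(); }'
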
>
> On 8/21/2015 2:06 AM, wuffers wrote:
>> Oh, the PSOD is not caused by the corruption in ZFS - I suspect it
>> was the other way around (VMware host PSOD -> ZFS corruption). I've
>> experienced the PSOD before; it may be related to I/O issues, which I
>> outlined in another post here:
>> http://lists.omniti.com/pipermail/omnios-discuss/2015-June/005222.html
>>
>> Nobody chimed in, but it's an ongoing issue. I need to dedicate more
>> time to troubleshooting, but other projects are taking my attention right
>> now (coupled with a personal house move, time is at a premium!).
>>
>> Also, I've had many improper shutdowns of the hosts and VMs, and this
>> was the first time I've seen a ZFS corruption.
>>
>> I know I'm repeating myself, but my question is still:
>> - Can I safely use this block device again now that it reports no
>> errors? Again, I've moved all data off of it, and there are no other
>> signs of hardware issues. Recreate it?
>>
>> On Wed, Aug 19, 2015 at 12:49 PM, Stephan Budach
>> <stephan.budach at jvm.de> wrote:
>>
>> Hi Joerg,
>>
>> On 19.08.15 at 14:59, Joerg Goltermann wrote:
>>
>> Hi,
>>
>> The PSOD you got can cause the problems on your Exchange
>> database.
>>
>> Can you check the ESXi logs for the root cause of the PSOD?
>>
>> I never got a PSOD on such a "corruption". I still think this is
>> a "cosmetic" bug, but this should be verified by one of the ZFS
>> developers ...
>>
>> - Joerg
>>