[OmniOS-discuss] core dump while trying to import pool

Fri Dec 4 19:33:09 UTC 2015

I ran into this same issue after rebooting one of my OmniOS machines.
I did have L2ARC devices on my pool until the bug was announced. At that
point I immediately removed them and didn't reboot the machine until a
convenient time, when I could manage it if something bad were to happen.
Well, it was good I planned for that reboot ;)

I was able to boot into single-user mode, delete the pool cache file, reboot,
import without mounting (zpool import -N <pool>), and then scrub. The scrub
fixed 16 KB of data in my 254 TB pool. I then exported the pool and imported
it read-write, only to discover that the scrub had not fixed the problem at
all. Importing read-only allows proper mounting, so data can be pulled off.
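
A rough sketch of the sequence above, not the exact commands that were run.
The pool name "tank" is a placeholder, the cache file path is the usual
illumos default, and the sketch moves the cache file aside rather than
deleting it. It only prints commands unless APPLY=1 is set, and each stage
is run separately because there is a reboot and a long scrub in between.

```shell
#!/bin/sh
# Sketch only. POOL is a hypothetical name; CACHE is the usual default path.
POOL="${POOL:-tank}"
CACHE="${CACHE:-/etc/zfs/zpool.cache}"

# Print the command by default; execute it only when APPLY=1.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

stage() {
    case "$1" in
        cache)
            # From single-user mode: stop the pool being auto-imported at boot.
            # (The post deleted the file; moving it aside is an assumption.)
            run mv "$CACHE" "$CACHE.bad" ;;
        scrub)
            # After the reboot: import without mounting anything, then scrub.
            run zpool import -N "$POOL"
            run zpool scrub "$POOL" ;;
        status)
            # Check scrub progress and any repaired or damaged data.
            run zpool status -v "$POOL" ;;
        readonly)
            # Once the scrub is done: re-import read-only to copy data off.
            run zpool export "$POOL"
            run zpool import -o readonly=on "$POOL" ;;
        *)
            echo "usage: stage cache|scrub|status|readonly" >&2
            return 1 ;;
    esac
}
```

For example, "stage scrub" prints the import and scrub commands, and
"APPLY=1 stage scrub" would actually run them.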

For me the problem comes down to mounting 1 of my 52 filesystems read-write.
After a zpool import -N, I was able to mount the filesystems one by one to
discover which one was causing the issue.
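
One way to do the one-by-one mounting, as a sketch under assumptions: the
pool name and log path are placeholders, and the idea of logging each
attempt before the mount (so the last line of the log names the suspect
after a panic) is an addition, not something described above.

```shell
#!/bin/sh
# Sketch only. POOL and LOG are hypothetical; keep LOG off the affected pool.
POOL="${POOL:-tank}"
LOG="${LOG:-/var/tmp/mount-attempts.log}"

list_filesystems() {
    # -H: no header, -o name: names only, -r: recurse,
    # -t filesystem: skip volumes and snapshots.
    zfs list -H -o name -t filesystem -r "$POOL"
}

mount_one_by_one() {
    # Reads dataset names on stdin, one per line.
    while read -r fs; do
        # Record the attempt first; after a panic and reboot, the last
        # line of the log is the dataset that was being mounted.
        echo "$fs" >> "$LOG"
        sync    # best-effort flush of the log before the risky mount
        if [ "${APPLY:-0}" = "1" ]; then
            zfs mount "$fs"
        else
            echo "would run: zfs mount $fs"
        fi
    done
}

# Usage (after "zpool import -N"):
#   list_filesystems | mount_one_by_one
```

Without APPLY=1 this only prints the mount commands, which is a cheap way to
check the dataset list before trying it for real.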

I'm still rsyncing the problem filesystem out since, as luck would have it,
it was the only one I wasn't replicating (probably a good thing, considering)
because I used it as a scratch drive. My plan is to destroy and then recreate
the problem fs after the sync finishes, and rsync the data back. And cross my
fingers that the problem doesn't come back or get worse.
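
The plan above could look roughly like this. The dataset name, mountpoint,
and destination are all placeholders, and the rsync flags are a guess at
reasonable defaults rather than what is actually being used. Whether the
destroy itself is safe on a filesystem in this state is untested here.
Commands are only printed unless APPLY=1 is set.

```shell
#!/bin/sh
# Sketch only. FS, SRC and DEST are hypothetical names.
FS="${FS:-tank/scratch}"          # the problem dataset
SRC="${SRC:-/tank/scratch}"       # its mountpoint
DEST="${DEST:-/backup/scratch}"   # somewhere on a different pool or host

# Print the command by default; execute it only when APPLY=1.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

copy_out() {
    # With the pool imported read-only: -a preserves permissions, owners
    # and times; -H preserves hard links.
    run rsync -aH "$SRC/" "$DEST/"
}

save_props() {
    # Locally set properties are lost on destroy, so list them first.
    run zfs get -s local all "$FS"
}

recreate() {
    # With the pool imported read-write (-N, nothing mounted):
    # -r also destroys snapshots and child datasets.
    run zfs destroy -r "$FS"
    run zfs create "$FS"
}

copy_back() {
    run rsync -aH "$DEST/" "$SRC/"
}
```

Run copy_out and save_props first, check the copy, and only then recreate and
copy_back.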

The panic I'm seeing when this happens is:
BAD TRAP: type=e (#pf Page fault) rp=ffffff00f5cee290 addr=20 occurred in
module "zfs" due to a NULL pointer dereference

Here are the details of my crash, which appears to be the same as yours:

root@store2:/var/crash/unknown# mdb unix.2 vmcore.2
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix
scsi_vhci zfs mr_sas sd ip hook neti sockfs arp usba stmf stmf_sbd random
md lofs idm sata cpc crypto kvm mpt_sas ufs logindmux nsmb ptm smbsrv nfs
ipc ]
> $c
zap_leaf_lookup_closest+0x45(ffffff223e7bd290, 0, 0, ffffff00f5cee3f0)
fzap_cursor_retrieve+0xbb(ffffff223e7bd290, ffffff00f5cee650,
ffffff00f5cee530)
zap_cursor_retrieve+0x11e(ffffff00f5cee650, ffffff00f5cee530)
zfs_purgedir+0x67(ffffff2232f41bc0)
zfs_rmnode+0x202(ffffff2232f41bc0)
zfs_zinactive+0xe8(ffffff2232f41bc0)
zfs_inactive+0x75(ffffff2232f44640, ffffff221918b468, 0)
fop_inactive+0x76(ffffff2232f44640, ffffff221918b468, 0)
vn_rele+0x82(ffffff2232f44640)
zfs_unlinked_drain+0xaa(ffffff21f254d000)
zfsvfs_setup+0xe8(ffffff21f254d000, 1)
zfs_domount+0x131(ffffff223d709368, ffffff222916fd80)
zfs_mount+0x24f(ffffff223d709368, ffffff21f2645400, ffffff00f5ceee00,
ffffff221918b468)
fsop_mount+0x1e(ffffff223d709368, ffffff21f2645400, ffffff00f5ceee00,
ffffff221918b468)
domount+0x86b(0, ffffff00f5ceee00, ffffff21f2645400, ffffff221918b468,
ffffff00f5ceee40)
mount+0x167(ffffff2228e61c38, ffffff00f5ceee90)
syscall_ap+0x94()
_sys_sysenter_post_swapgs+0x149()
> ::status
debugging crash dump vmcore.2 (64-bit) from store2
operating system: 5.11 omnios-8322307 (i86pc)
image uuid: 69a1d6dd-f13a-627d-c2a0-b00c9e50a800
panic message:
BAD TRAP: type=e (#pf Page fault) rp=ffffff00f5cee290 addr=20 occurred in
module "zfs" due to a NULL pointer dereference
dump content: kernel pages only
> ::stack
zap_leaf_lookup_closest+0x45(ffffff223e7bd290, 0, 0, ffffff00f5cee3f0)
fzap_cursor_retrieve+0xbb(ffffff223e7bd290, ffffff00f5cee650,
ffffff00f5cee530)
zap_cursor_retrieve+0x11e(ffffff00f5cee650, ffffff00f5cee530)
zfs_purgedir+0x67(ffffff2232f41bc0)
zfs_rmnode+0x202(ffffff2232f41bc0)
zfs_zinactive+0xe8(ffffff2232f41bc0)
zfs_inactive+0x75(ffffff2232f44640, ffffff221918b468, 0)
fop_inactive+0x76(ffffff2232f44640, ffffff221918b468, 0)
vn_rele+0x82(ffffff2232f44640)
zfs_unlinked_drain+0xaa(ffffff21f254d000)
zfsvfs_setup+0xe8(ffffff21f254d000, 1)
zfs_domount+0x131(ffffff223d709368, ffffff222916fd80)
zfs_mount+0x24f(ffffff223d709368, ffffff21f2645400, ffffff00f5ceee00,
ffffff221918b468)
fsop_mount+0x1e(ffffff223d709368, ffffff21f2645400, ffffff00f5ceee00,
ffffff221918b468)
domount+0x86b(0, ffffff00f5ceee00, ffffff21f2645400, ffffff221918b468,
ffffff00f5ceee40)
mount+0x167(ffffff2228e61c38, ffffff00f5ceee90)
syscall_ap+0x94()
_sys_sysenter_post_swapgs+0x149()
> ::panicinfo
             cpu                3
          thread ffffff21f2968440
         message
BAD TRAP: type=e (#pf Page fault) rp=ffffff00f5cee290 addr=20 occurred in
module "zfs" due to a NULL pointer dereference
             rdi ffffff223e7bd290
             rsi                0
             rdx                8
             rcx         4170d6eb
              r8 ffffff00f5cee3f0
              r9 ffffff00f5cee1c8
             rax         4170d6f0
             rbx ffffff00f5cee650
             rbp ffffff00f5cee3d0
             r10 fffffffffb854358
             r11                0
             r12              800
             r13                0
             r14 ffffff00f5cee3f0
             r15 ffffff00f5cee530
          fsbase                0
          gsbase ffffff21f169c000
              ds               4b
              es               4b
              fs                0
              gs              1c3
          trapno                e
             err                0
             rip fffffffff7a11e95
              cs               30
          rflags            10206
             rsp ffffff00f5cee380
              ss               38
          gdt_hi                0
          gdt_lo         700001ef
          idt_hi                0
          idt_lo         40000fff
             ldt                0
            task               70
             cr0         8005003b
             cr2               20
             cr3       206fe00000
             cr4            426f8
>

________________________
Michael Talbott
Systems Administrator
La Jolla Institute

On Dec 4, 2015, at 7:56 AM, Dan McDonald <danmcd at omniti.com> wrote:

On Dec 4, 2015, at 10:53 AM, Lawrence Giam <paladinemishakal at gmail.com>
wrote:

Should I cancel the scrub and try the method that John suggested?

I'd let the scrub run to be sure.  If it's the class of bug I'm thinking,
though, scrub won't catch it.  :(

And if you can provide one of those r151014 core dumps, that'd be great.
If this pool has confidential data, though, I can understand why not.

Dan

_______________________________________________
OmniOS-discuss mailing list
OmniOS-discuss at lists.omniti.com
http://lists.omniti.com/mailman/listinfo/omnios-discuss