[OmniOS-discuss] kernel panic - anon_decref

Sat Nov 16 07:48:43 UTC 2013

When it rains, it pours. With r151006y, I had two kernel panics in quick
succession while creating four eager-zeroed thick disks at the same time
in ESXi. The panic string is now "kernel heap corruption detected" instead
of anon_decref.

Kernel panic 2 (dump info:
https://drive.google.com/file/d/0B7mCJnZUzJPKMHhqZHJnaDEzYkk)
http://i.imgur.com/eIssxmc.png?1
http://i.imgur.com/MXJy4zP.png?1

TIME                           UUID                                 SUNW-MSG-ID
Nov 16 2013 00:51:24.912170000 5998ba1e-3aa5-ccac-e885-be4897cfcfe8 SUNOS-8000-KL

  TIME                 CLASS                                         ENA
  Nov 16 00:51:24.8638 ireport.os.sunos.panic.dump_available         0x0000000000000000
  Nov 16 00:49:58.8671 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = 5998ba1e-3aa5-ccac-e885-be4897cfcfe8
        code = SUNOS-8000-KL
        diag-time = 1384581084 866703
        de = fmd:///module/software-diagnosis
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = defect.sunos.kernel.panic
                certainty = 0x64
                asru =
sw:///:path=/var/crash/unknown/.5998ba1e-3aa5-ccac-e885-be4897cfcfe8
                resource =
sw:///:path=/var/crash/unknown/.5998ba1e-3aa5-ccac-e885-be4897cfcfe8
                savecore-succcess = 1
                dump-dir = /var/crash/unknown
                dump-files = vmdump.1
                os-instance-uuid = 5998ba1e-3aa5-ccac-e885-be4897cfcfe8
                panicstr = kernel heap corruption detected
                panicstack = fffffffffba49c04 () |
genunix:kmem_slab_free+c1 () | genunix:kmem_magazine_destroy+6e () |
genunix:kmem_depot_ws_reap+5d () | genunix:kmem_cache_magazine_purge+118 ()
| genunix:kmem_cache_magazine_resize+40 () | genunix:taskq_thread+2d0 () |
unix:thread_start+8 () |
                crashtime = 1384577735
                panic-time = Fri Nov 15 23:55:35 2013 EST
        (end fault-list[0])

        fault-status = 0x1
        severity = Major
        __ttl = 0x1
        __tod = 0x528707dc 0x365e9c10
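
To sanity-check the timestamps in the record above, here is a minimal Python sketch that converts the epoch value in crashtime to local time and decodes the first word of __tod. It assumes that "EST" in panic-time means a fixed UTC-5 offset; the helper name epoch_to_est is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "EST" in panic-time is a fixed UTC-5 offset (no DST handling).
EST = timezone(timedelta(hours=-5), "EST")


def epoch_to_est(epoch_seconds):
    """Convert a Unix epoch value to an aware datetime at fixed UTC-5."""
    utc = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return utc.astimezone(EST)


crashtime = 1384577735      # crashtime from fault-list[0]
tod_seconds = 0x528707dc    # first word of __tod

# Should line up with panic-time and with the seconds part of diag-time.
print(epoch_to_est(crashtime).strftime("%Y-%m-%d %H:%M:%S %Z"))
print(tod_seconds)
```

If the conversion agrees with panic-time, crashtime is the moment of the panic itself, and the first word of __tod corresponds to the seconds field of diag-time, i.e. when the dump was diagnosed after reboot.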

Kernel panic 3 (dump info:
https://drive.google.com/file/d/0B7mCJnZUzJPKbnZIeWZzQjhUOTQ):
(looked the same, no screenshots)

TIME                           UUID                                 SUNW-MSG-ID
Nov 16 2013 01:44:43.327489000 a6592c60-199f-ead5-9586-ff013bf5ab2d SUNOS-8000-KL

  TIME                 CLASS                                         ENA
  Nov 16 01:44:43.2941 ireport.os.sunos.panic.dump_available         0x0000000000000000
  Nov 16 01:44:03.5356 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = a6592c60-199f-ead5-9586-ff013bf5ab2d
        code = SUNOS-8000-KL
        diag-time = 1384584283 296816
        de = fmd:///module/software-diagnosis
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = defect.sunos.kernel.panic
                certainty = 0x64
                asru =
sw:///:path=/var/crash/unknown/.a6592c60-199f-ead5-9586-ff013bf5ab2d
                resource =
sw:///:path=/var/crash/unknown/.a6592c60-199f-ead5-9586-ff013bf5ab2d
                savecore-succcess = 1
                dump-dir = /var/crash/unknown
                dump-files = vmdump.2
                os-instance-uuid = a6592c60-199f-ead5-9586-ff013bf5ab2d
                panicstr = kernel heap corruption detected
                panicstack = fffffffffba49c04 () |
genunix:kmem_slab_free+c1 () | genunix:kmem_magazine_destroy+6e () |
genunix:kmem_cache_magazine_purge+dc () |
genunix:kmem_cache_magazine_resize+40 () | genunix:taskq_thread+2d0 () |
unix:thread_start+8 () |
                crashtime = 1384582658
                panic-time = Sat Nov 16 01:17:38 2013 EST
        (end fault-list[0])

        fault-status = 0x1
        severity = Major
        __ttl = 0x1
        __tod = 0x5287145b 0x138515e8
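
The two panicstack values are nearly identical. A rough Python sketch for comparing them follows; it splits each string on "|", drops the offsets, and lists the frames that appear in only one stack. The function name parse_panicstack is hypothetical, and the strings are copied from the two records above.

```python
def parse_panicstack(text):
    """Split an fmdump panicstack string into frame names without offsets."""
    frames = []
    for part in text.split("|"):
        frame = part.replace("()", "").strip()
        if frame:
            frames.append(frame.split("+")[0])
    return frames


panic2 = (
    "fffffffffba49c04 () | genunix:kmem_slab_free+c1 () | "
    "genunix:kmem_magazine_destroy+6e () | genunix:kmem_depot_ws_reap+5d () | "
    "genunix:kmem_cache_magazine_purge+118 () | "
    "genunix:kmem_cache_magazine_resize+40 () | genunix:taskq_thread+2d0 () | "
    "unix:thread_start+8 () |"
)
panic3 = (
    "fffffffffba49c04 () | genunix:kmem_slab_free+c1 () | "
    "genunix:kmem_magazine_destroy+6e () | "
    "genunix:kmem_cache_magazine_purge+dc () | "
    "genunix:kmem_cache_magazine_resize+40 () | genunix:taskq_thread+2d0 () | "
    "unix:thread_start+8 () |"
)

frames2 = parse_panicstack(panic2)
frames3 = parse_panicstack(panic3)

# Frames present in one stack but not the other.
only_in_2 = [f for f in frames2 if f not in frames3]
only_in_3 = [f for f in frames3 if f not in frames2]
print(only_in_2, only_in_3)
```

Both panics come through the same magazine resize path in a taskq thread and end in kmem_slab_free; the only difference is whether the purge goes through kmem_depot_ws_reap first.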

---
Having looked through all three dumps, I can see that the first two
contain some warnings:

WARNING: /pci@0,0/pci8086,3c08@3/pci1000,3030@0 (mpt_sas1):
        mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31120303

/var/adm/messages also had a sprinkling of these:
Nov 15 23:36:43 san1 scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,3c08@3/pci1000,3030@0 (mpt_sas1):
Nov 15 23:36:43 san1    mptsas_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31120303
Nov 15 23:36:43 san1 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3c08@3/pci1000,3030@0 (mpt_sas1):
Nov 15 23:36:43 san1    Log info 0x31120303 received for target 10.
Nov 15 23:36:43 san1    scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
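
For reference, here is a small Python sketch that splits the IOCLogInfo and ioc_status values from these messages into fields. The bit layout is an assumption based on my reading of the LSI MPI headers (bus type in the top nibble, then originator, code, and sub-code, with 0x8000 in the status word as a "log info available" flag); it is not taken from the mpt_sas driver source, so treat the field names as illustrative.

```python
# Assumed layout, per the LSI MPI headers as I understand them:
#   LogInfo:   [31:28] bus type, [27:24] originator, [23:16] code, [15:0] sub-code
#   IOCStatus: bit 15 = "log info available" flag, low 15 bits = status code
IOCSTATUS_FLAG_LOG_INFO_AVAILABLE = 0x8000
IOCSTATUS_MASK = 0x7FFF


def decode_sas_loginfo(loginfo):
    """Split a 32-bit IOCLogInfo value into its assumed fields."""
    return {
        "bus_type": (loginfo >> 28) & 0xF,
        "originator": (loginfo >> 24) & 0xF,
        "code": (loginfo >> 16) & 0xFF,
        "subcode": loginfo & 0xFFFF,
    }


def split_ioc_status(ioc_status):
    """Return (log_info_available, status_code) for a 16-bit IOCStatus."""
    flag = bool(ioc_status & IOCSTATUS_FLAG_LOG_INFO_AVAILABLE)
    return flag, ioc_status & IOCSTATUS_MASK


info = decode_sas_loginfo(0x31120303)
print({name: hex(value) for name, value in info.items()})
print(split_ioc_status(0x804B))
```

Under those assumptions, 0x31120303 breaks down into bus type 0x3, originator 0x1, code 0x12, and sub-code 0x0303, and ioc_status 0x804b is status code 0x4b with the log-info flag set. The meaning of each code still has to be looked up in the MPI headers.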

I followed
http://lists.omniti.com/pipermail/omnios-discuss/2013-March/000544.html to
map the target to a disk. If I've done it right, target 10 is my STEC
ZeusRAM ZIL drive, which is configured as a mirror. I didn't see these
errors in the third dump, so I don't know whether they are contributing.
I may run a memtest on the system tomorrow in case it's a hardware issue.

My zpool status shows all my drives okay with no known data errors.

I'm not sure how to proceed from here. My Hyper-V hosts have been using
the SAN over SRP and IB with no issues for the 2+ months since it was
brought up and configured. I'd expect the VM hosts to crash before my SAN
does.

Of course, I can make the vmdump.x files available to anyone who wants to
look at them (7GB, 8GB, 4GB).