[OmniOS-discuss] Re: stmf trouble, crash and dump
Johan Kragsterman
johan.kragsterman at capvert.se
Thu Oct 6 16:00:39 UTC 2016
Hi!
Responding to my own mail instead of Dan's, because his didn't have the history included. See the mdb output further down. It is obvious that the pool hung, but not why...
-----"OmniOS-discuss" <omnios-discuss-bounces at lists.omniti.com> skrev: -----
Till: omnios-discuss at lists.omniti.com
Från: Johan Kragsterman
Sänt av: "OmniOS-discuss"
Datum: 2016-10-06 11:03
Ärende: [OmniOS-discuss] stmf trouble, crash and dump
Hi!
Got a problem here a couple of days ago when I sent a snapshot stream over Fibre Channel from my home/business/devel server to the clone backup server.
Systems: OmniOS 5.11 omnios-r151018-95eaa7e on both systems, initiator on one and target on the other. Also the same hardware: Dell Precision workstations with dual 6-core Xeons and 96 GB of registered RAM, an Intel quad-port GbE NIC, and QLogic QLE2462 HBAs.
Configured with one LUN provisioned from the target/backup system to the initiator system as a backup LUN, and that backup LUN configured as a zpool on the initiator system. I should also say that I run this FC connection point-to-point, no switch involved, just a single fibre pair, 10 m.
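For reference, the setup is roughly along these lines (the zvol name and size are just examples, not the actual ones; the GUID is the one that shows up in the logs further down; target mode on the QLE2462 and the stmf service are already set up):

On the target, omni2:
omni2# zfs create -V 200G data/backuplun
omni2# stmfadm create-lu /dev/zvol/rdsk/data/backuplun
omni2# stmfadm add-view 600144f0c648ae73000057ef6d370001

On the initiator, omni:
omni# zpool create backpool c0t600144F0C648AE73000057EF6D370001d0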
I did a zfs send/recv of a snapshot, around 67 GB, and I thought it took a long time. Then the initiator system crashed and dumped. It rebooted, and I got into it again without any trouble.
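The send/recv is just a local pipe on omni into the FC-backed pool, something like this (dataset and snapshot names are only examples):

omni# zfs send tank/data@backup1 | zfs receive -F backpool/data

With -v on zfs send one can at least watch how far it gets before things go wrong.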
What I immediately saw was that the zpool "backpool" that was backed by the FC LUN was no longer present. I did a zpool import, and it was back again. I did another test and sent a much smaller snap, around 450 MB; that worked fine, although I thought it took a lot of time.
I tried the bigger snap once again, and the same thing happened: the system crashed and dumped. I have those two dump files, but I don't know whether this is a problem on the target side or the initiator side.
I can provide access to dump files.
Here is some information from the two systems that I find interesting:
The initiator system, omni:
omni:
root@omni:/var/log# dmesg | grep qlc
Oct 2 18:33:08 omni qlc: [ID 439991 kern.info] NOTICE: Qlogic qlc(0,0): Loop OFFLINE
root@omni:/# dmesg | grep scsi
Oct 2 18:34:58 omni scsi: [ID 243001 kern.info] /pci@19,0/pci8086,3410@9/pci1077,138@0/fp@0,0 (fcp0):
Oct 2 18:34:58 omni genunix: [ID 408114 kern.info] /scsi_vhci/disk@g600144f0c648ae73000057ef6d370001 (sd5) offline
Oct 2 18:34:58 omni genunix: [ID 483743 kern.info] /scsi_vhci/disk@g600144f0c648ae73000057ef6d370001 (sd5) multipath status: failed: path 4 fp0/disk@w2101001b32a19a92,0 is offline
root@omni:/# dmesg | grep multipath
Oct 2 18:34:58 omni genunix: [ID 483743 kern.info] /scsi_vhci/disk@g600144f0c648ae73000057ef6d370001 (sd5) multipath status: failed: path 4 fp0/disk@w2101001b32a19a92,0 is offline
As you can see, the loop is marked offline at the time of the crash. But, notably strange, there is also a message about a failed multipath path...? Why? There is no multipathing configured here...
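Judging from the /scsi_vhci/disk@g... device path above, the LUN is enumerated under scsi_vhci (mpxio) even though there is only one path, so I suppose the "multipath status: failed" line is just the framework reporting that its single path went offline. The path state can be listed like this (the device name is simply the GUID from the messages above):

omni# mpathadm list lu
omni# mpathadm show lu /dev/rdsk/c0t600144F0C648AE73000057EF6D370001d0s2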
The target system, omni2:
root@omni2:/root# grep stmf /var/adm/messages
Oct 2 09:56:37 omni2 pseudo: [ID 129642 kern.info] pseudo-device: stmf_sbd0
Oct 2 09:56:37 omni2 genunix: [ID 936769 kern.info] stmf_sbd0 is /pseudo/stmf_sbd@0
Oct 2 09:56:46 omni2 pseudo: [ID 129642 kern.info] pseudo-device: stmf0
Oct 2 09:56:46 omni2 genunix: [ID 936769 kern.info] stmf0 is /pseudo/stmf@0
Oct 2 09:57:31 omni2 pseudo: [ID 129642 kern.info] pseudo-device: stmf0
Oct 2 09:57:31 omni2 genunix: [ID 936769 kern.info] stmf0 is /pseudo/stmf@0
Oct 2 09:57:31 omni2 stmf_sbd: [ID 690249 kern.warning] WARNING: ioctl(DKIOCINFO) failed 25
There is this warning, ioctl(DKIOCINFO) failed 25, which I tried to find out more about, but without success.
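If the 25 is an errno, it would be ENOTTY, "Inappropriate ioctl for device", which can be checked against the system headers:

omni2# grep -w 25 /usr/include/sys/errno.h
#define ENOTTY  25      /* Inappropriate ioctl for device       */

That would fit an LU backed by a zvol or a plain file rather than a whole disk (DKIOCINFO is a disk ioctl), so it is probably harmless, but I'm not certain.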
Perhaps it is simply that the FC connection isn't good enough. The cable shouldn't be a problem, since it is brand new, but it could of course be something with the HBAs. I could get another cable to do multipath and see how that works, but let's start with this first.
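Before getting another cable, the link error counters on both HBAs could be checked; growing loss-of-sync or invalid-CRC counts would point at the cable or the optics (the WWN is a placeholder for the local HBA port WWN from the hba-port output):

omni# fcinfo hba-port -l
omni# fcinfo remote-port -l -p <local HBA port WWN>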
mdb output from the initiator system follows. It is exactly the same for both crash dumps, so no question about what panicked; the question is why...? On the target system I only found that warning "stmf_sbd: [ID 690249 kern.warning] WARNING: ioctl(DKIOCINFO) failed 25". A few mdb dcmds that might help dig further are listed after the dumps.
root@omni:/var/crash/unknown# mdb unix.0 vmcore.0
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp scsi_vhci zfs mpt sd ip hook neti sockfs arp usba uhci mm stmf stmf_sbd md lofs mpt_sas sata random idm cpc crypto kvm ufs logindmux nsmb ptm smbsrv nfs ]
> ::status
debugging crash dump vmcore.0 (64-bit) from omni
operating system: 5.11 omnios-r151018-95eaa7e (i86pc)
image uuid: 0fdcc7e7-aaf9-4d9d-cc75-ac766a3c3b5a
panic message: I/O to pool 'backpool' appears to be hung.
dump content: kernel pages only
> C
debugging crash dump vmcore.0 (64-bit) from omni
operating system: 5.11 omnios-r151018-95eaa7e (i86pc)
image uuid: 0fdcc7e7-aaf9-4d9d-cc75-ac766a3c3b5a
panic message: I/O to pool 'backpool' appears to be hung.
dump content: kernel pages only
> ::panicinfo
cpu 20
thread ffffff007ba86c40
message I/O to pool 'backpool' appears to be hung.
rdi fffffffff7a72290
rsi ffffff007ba869c0
rdx ffffff007ba86c40
rcx ffffff196891f274
r8 20
r9 a
rax ffffff007ba869e0
rbx ffffff25200971c8
rbp ffffff007ba86a20
r10 0
r11 ffffff007ba868c0
r12 ffffff19ae36b000
r13 e984f1e438
r14 ffffff2520096cc0
r15 ffffff753a6df538
fsbase fffffd7fff061a40
gsbase ffffff195ca63a80
ds 38
es 38
fs 0
gs 0
trapno 0
err 0
rip fffffffffb860190
cs 30
rflags 246
rsp ffffff007ba869a8
ss 38
gdt_hi 0
gdt_lo 5000ffff
idt_hi 0
idt_lo 4000ffff
ldt 0
task 70
cr0 8005003b
cr2 7ffcb34d6618
cr3 c000000
cr4 26f8
>
root@omni:/var/crash/unknown# mdb unix.1 vmcore.1
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp scsi_vhci zfs mpt sd ip hook neti sockfs arp usba uhci mm stmf stmf_sbd idm mpt_sas sata cpc crypto md kvm random lofs ufs logindmux nsmb ptm smbsrv nfs ]
> ::status
debugging crash dump vmcore.1 (64-bit) from omni
operating system: 5.11 omnios-r151018-95eaa7e (i86pc)
image uuid: 748b6c72-5dec-c92c-a155-f1788f51b3fd
panic message: I/O to pool 'backpool' appears to be hung.
dump content: kernel pages only
> C
debugging crash dump vmcore.1 (64-bit) from omni
operating system: 5.11 omnios-r151018-95eaa7e (i86pc)
image uuid: 748b6c72-5dec-c92c-a155-f1788f51b3fd
panic message: I/O to pool 'backpool' appears to be hung.
dump content: kernel pages only
> ::panicinfo
cpu 2
thread ffffff007a4afc40
message I/O to pool 'backpool' appears to be hung.
rdi fffffffff7a72290
rsi ffffff007a4af9c0
rdx ffffff007a4afc40
rcx ffffff1964e842ee
r8 20
r9 a
rax ffffff007a4af9e0
rbx ffffff1966da1188
rbp ffffff007a4afa20
r10 0
r11 ffffff007a4af8c0
r12 ffffff1997848000
r13 e9840bec87
r14 ffffff1966da0c80
r15 ffffff19e954c718
fsbase fffffd7fff072a40
gsbase ffffff195c0e6040
ds 4b
es 4b
fs 0
gs 0
trapno 0
err 0
rip fffffffffb860190
cs 30
rflags 246
rsp ffffff007a4af9a8
ss 38
gdt_hi 0
gdt_lo d000ffff
idt_hi 0
idt_lo c000ffff
ldt 0
task 70
cr0 8005003b
cr2 7ffd18fc0568
cr3 c400000
cr4 26f8
>
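As far as I understand, the panic string "I/O to pool 'backpool' appears to be hung." is the ZFS deadman timer firing: it panics the box when an I/O against the pool has been outstanding for longer than zfs_deadman_synctime_ms (1000 seconds by default), so the dumps should show which zios are stuck and on which vdev. These are the dcmds from the zfs mdb module I would start with (the thread address is the panic thread from ::panicinfo above):

> ::stacks -m zfs
> ::spa -v
> ::zio_state
> ffffff007ba86c40::findstack -v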
Best regards from
Johan Kragsterman
Capvert
_______________________________________________
OmniOS-discuss mailing list
OmniOS-discuss at lists.omniti.com
http://lists.omniti.com/mailman/listinfo/omnios-discuss