[OmniOS-discuss] Re: stmf trouble, crash and dump
Johan Kragsterman
johan.kragsterman at capvert.se
Thu Oct 6 16:00:39 UTC 2016
Hi!
Responding to my own mail instead of Dan's, because his didn't have the history included. See the mdb output further down. It is obvious that the pool hung, but not why...
-----"OmniOS-discuss" <omnios-discuss-bounces at lists.omniti.com> skrev: -----
Till: omnios-discuss at lists.omniti.com
Från: Johan Kragsterman
Sänt av: "OmniOS-discuss"
Datum: 2016-10-06 11:03
Ärende: [OmniOS-discuss] stmf trouble, crash and dump
Hi!
Got a problem here a couple of days ago when I sent a snapshot stream over Fibre Channel from my home/business/devel server to the clone backup server.
Systems: OmniOS 5.11 omnios-r151018-95eaa7e on both systems, initiator on one and target on the other. Also the same hardware: Dell Precision workstations with dual 6-core Xeons and 96 GB of registered RAM, an Intel quad-port GbE NIC, and QLogic QLE2462 HBAs.
Configured with one LUN provisioned from the target/backup system to the initiator system as a backup LUN, and that backup LUN configured as a zpool on the initiator system. I should also say that I run this FC connection point-to-point, no switch involved, just a single fibre pair, 10 m.
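For reference, the setup is roughly along these lines (the zvol name and size are just examples, not the actual ones; the GUID is the one that shows up in the logs further down; target mode on the QLE2462 and the stmf service are already set up):

On the target, omni2:
omni2# zfs create -V 200G data/backuplun
omni2# stmfadm create-lu /dev/zvol/rdsk/data/backuplun
omni2# stmfadm add-view 600144f0c648ae73000057ef6d370001

On the initiator, omni:
omni# zpool create backpool c0t600144F0C648AE73000057EF6D370001d0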
I did a zfs send/recv of a snapshot, around 67 GB, and I thought it took a long time. Then the initiator system crashed and dumped. It rebooted, and I got into it again without any trouble.
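The send/recv is just a local pipe on omni into the FC-backed pool, something like this (dataset and snapshot names are only examples):

omni# zfs send tank/data@backup1 | zfs receive -F backpool/data

With -v on zfs send one can at least watch how far it gets before things go wrong.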
What I immediately saw was that the zpool "backpool" that was backed by the FC LUN was no longer present. I did a zpool import, and it was back again. I did another test and sent a much smaller snap, around 450 MB; that worked fine, although I thought it took a lot of time.
I tried the bigger snap once again, and the same thing happened: the system crashed and dumped. I have those two dump files, but I don't know whether this is a problem on the target side or the initiator side.
I can provide access to dump files.
Here is some information from the two systems that I find interesting:
The initiator system, omni:
omni:
root@omni:/var/log# dmesg | grep qlc
Oct 2 18:33:08 omni qlc: [ID 439991 kern.info] NOTICE: Qlogic qlc(0,0): Loop OFFLINE
root@omni:/# dmesg | grep scsi
Oct 2 18:34:58 omni scsi: [ID 243001 kern.info] /pci@19,0/pci8086,3410@9/pci1077,138@0/fp@0,0 (fcp0):
Oct 2 18:34:58 omni genunix: [ID 408114 kern.info] /scsi_vhci/disk@g600144f0c648ae73000057ef6d370001 (sd5) offline
Oct 2 18:34:58 omni genunix: [ID 483743 kern.info] /scsi_vhci/disk@g600144f0c648ae73000057ef6d370001 (sd5) multipath status: failed: path 4 fp0/disk@w2101001b32a19a92,0 is offline
root@omni:/# dmesg | grep multipath
Oct 2 18:34:58 omni genunix: [ID 483743 kern.info] /scsi_vhci/disk@g600144f0c648ae73000057ef6d370001 (sd5) multipath status: failed: path 4 fp0/disk@w2101001b32a19a92,0 is offline
As you can see, the loop is marked offline at the time of the crash. But, notably strange, there is also a message about a failed multipath path...? Why? There is no multipathing configured here...
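Judging from the /scsi_vhci/disk@g... device path above, the LUN is enumerated under scsi_vhci (mpxio) even though there is only one path, so I suppose the "multipath status: failed" line is just the framework reporting that its single path went offline. The path state can be listed like this (the device name is simply the GUID from the messages above):

omni# mpathadm list lu
omni# mpathadm show lu /dev/rdsk/c0t600144F0C648AE73000057EF6D370001d0s2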
The target system, omni2:
root@omni2:/root# grep stmf /var/adm/messages
Oct 2 09:56:37 omni2 pseudo: [ID 129642 kern.info] pseudo-device: stmf_sbd0
Oct 2 09:56:37 omni2 genunix: [ID 936769 kern.info] stmf_sbd0 is /pseudo/stmf_sbd@0
Oct 2 09:56:46 omni2 pseudo: [ID 129642 kern.info] pseudo-device: stmf0
Oct 2 09:56:46 omni2 genunix: [ID 936769 kern.info] stmf0 is /pseudo/stmf@0
Oct 2 09:57:31 omni2 pseudo: [ID 129642 kern.info] pseudo-device: stmf0
Oct 2 09:57:31 omni2 genunix: [ID 936769 kern.info] stmf0 is /pseudo/stmf@0
Oct 2 09:57:31 omni2 stmf_sbd: [ID 690249 kern.warning] WARNING: ioctl(DKIOCINFO) failed 25
There is this warning, ioctl(DKIOCINFO) failed 25, which I tried to find out more about, but without success.
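If the 25 is an errno, it would be ENOTTY, "Inappropriate ioctl for device", which can be checked against the system headers:

omni2# grep -w 25 /usr/include/sys/errno.h
#define ENOTTY  25      /* Inappropriate ioctl for device       */

That would fit an LU backed by a zvol or a plain file rather than a whole disk (DKIOCINFO is a disk ioctl), so it is probably harmless, but I'm not certain.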
Perhaps it is simply that the FC connection isn't good enough. The cable shouldn't be a problem, since it is brand new, but it could of course be something with the HBAs. I could get another cable to do multipath and see how that works, but let's start with this first.
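Before getting another cable, the link error counters on both HBAs could be checked; growing loss-of-sync or invalid-CRC counts would point at the cable or the optics (the WWN is a placeholder for the local HBA port WWN from the hba-port output):

omni# fcinfo hba-port -l
omni# fcinfo remote-port -l -p <local HBA port WWN>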
mdb output from the initiator system follows. It is exactly the same for both crash dumps, so no question about what panicked; the question is why...? On the target system I only found that warning "stmf_sbd: [ID 690249 kern.warning] WARNING: ioctl(DKIOCINFO) failed 25". A few mdb dcmds that might help dig further are listed after the dumps.
root@omni:/var/crash/unknown# mdb unix.0 vmcore.0
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp scsi_vhci zfs mpt sd ip hook neti sockfs arp usba uhci mm stmf stmf_sbd md lofs mpt_sas sata random idm cpc crypto kvm ufs logindmux nsmb ptm smbsrv nfs ]
> ::status
debugging crash dump vmcore.0 (64-bit) from omni
operating system: 5.11 omnios-r151018-95eaa7e (i86pc)
image uuid: 0fdcc7e7-aaf9-4d9d-cc75-ac766a3c3b5a
panic message: I/O to pool 'backpool' appears to be hung.
dump content: kernel pages only
> C
debugging crash dump vmcore.0 (64-bit) from omni
operating system: 5.11 omnios-r151018-95eaa7e (i86pc)
image uuid: 0fdcc7e7-aaf9-4d9d-cc75-ac766a3c3b5a
panic message: I/O to pool 'backpool' appears to be hung.
dump content: kernel pages only
> ::panicinfo
cpu 20
thread ffffff007ba86c40
message I/O to pool 'backpool' appears to be hung.
rdi fffffffff7a72290
rsi ffffff007ba869c0
rdx ffffff007ba86c40
rcx ffffff196891f274
r8 20
r9 a
rax ffffff007ba869e0
rbx ffffff25200971c8
rbp ffffff007ba86a20
r10 0
r11 ffffff007ba868c0
r12 ffffff19ae36b000
r13 e984f1e438
r14 ffffff2520096cc0
r15 ffffff753a6df538
fsbase fffffd7fff061a40
gsbase ffffff195ca63a80
ds 38
es 38
fs 0
gs 0
trapno 0
err 0
rip fffffffffb860190
cs 30
rflags 246
rsp ffffff007ba869a8
ss 38
gdt_hi 0
gdt_lo 5000ffff
idt_hi 0
idt_lo 4000ffff
ldt 0
task 70
cr0 8005003b
cr2 7ffcb34d6618
cr3 c000000
cr4 26f8
>
root@omni:/var/crash/unknown# mdb unix.1 vmcore.1
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp scsi_vhci zfs mpt sd ip hook neti sockfs arp usba uhci mm stmf stmf_sbd idm mpt_sas sata cpc crypto md kvm random lofs ufs logindmux nsmb ptm smbsrv nfs ]
> ::status
debugging crash dump vmcore.1 (64-bit) from omni
operating system: 5.11 omnios-r151018-95eaa7e (i86pc)
image uuid: 748b6c72-5dec-c92c-a155-f1788f51b3fd
panic message: I/O to pool 'backpool' appears to be hung.
dump content: kernel pages only
> C
debugging crash dump vmcore.1 (64-bit) from omni
operating system: 5.11 omnios-r151018-95eaa7e (i86pc)
image uuid: 748b6c72-5dec-c92c-a155-f1788f51b3fd
panic message: I/O to pool 'backpool' appears to be hung.
dump content: kernel pages only
> ::panicinfo
cpu 2
thread ffffff007a4afc40
message I/O to pool 'backpool' appears to be hung.
rdi fffffffff7a72290
rsi ffffff007a4af9c0
rdx ffffff007a4afc40
rcx ffffff1964e842ee
r8 20
r9 a
rax ffffff007a4af9e0
rbx ffffff1966da1188
rbp ffffff007a4afa20
r10 0
r11 ffffff007a4af8c0
r12 ffffff1997848000
r13 e9840bec87
r14 ffffff1966da0c80
r15 ffffff19e954c718
fsbase fffffd7fff072a40
gsbase ffffff195c0e6040
ds 4b
es 4b
fs 0
gs 0
trapno 0
err 0
rip fffffffffb860190
cs 30
rflags 246
rsp ffffff007a4af9a8
ss 38
gdt_hi 0
gdt_lo d000ffff
idt_hi 0
idt_lo c000ffff
ldt 0
task 70
cr0 8005003b
cr2 7ffd18fc0568
cr3 c400000
cr4 26f8
>
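As far as I understand, the panic string "I/O to pool 'backpool' appears to be hung." is the ZFS deadman timer firing: it panics the box when an I/O against the pool has been outstanding for longer than zfs_deadman_synctime_ms (1000 seconds by default), so the dumps should show which zios are stuck and on which vdev. These are the dcmds from the zfs mdb module I would start with (the thread address is the panic thread from ::panicinfo above):

> ::stacks -m zfs
> ::spa -v
> ::zio_state
> ffffff007ba86c40::findstack -v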
Best regards from
Johan Kragsterman
Capvert
_______________________________________________
OmniOS-discuss mailing list
OmniOS-discuss at lists.omniti.com
http://lists.omniti.com/mailman/listinfo/omnios-discuss