[OmniOS-discuss] stmf trouble, crash and dump

Johan Kragsterman johan.kragsterman at capvert.se
Thu Oct 6 09:00:58 UTC 2016


Hi!


Got a problem here a couple of days ago when I ran a snapshot stream over fibre channel on my home/business/devel server to the clone backup server.

Systems: OmniOS 5.11, omnios-r151018-95eaa7e on both systems; initiator on one, target on the other. Same hardware as well: Dell Precision workstations with dual 6-core Xeons and 96 GB registered RAM, Intel quad-port GbE NICs, and QLogic QLE2462 HBAs.

Configured with one LUN provisioned from the target/backup system to the initiator system as a backup LUN, and that backup LUN configured as a zpool on the initiator system. I should also say that I run this FC connection point-to-point, no switch involved, over a single 10 m fibre pair.
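
Roughly how it is set up, for reference (pool/zvol name and size below are placeholders, not the exact ones I use; the GUID is the one sbdadm list-lu reports, the same GUID that shows up in the initiator's dmesg further down):

On the target, omni2:

  zfs create -V 500g tank/backuplun
  sbdadm create-lu /dev/zvol/rdsk/tank/backuplun
  stmfadm add-view 600144f0c648ae73000057ef6d370001

On the initiator, omni:

  zpool create backpool c0t600144F0C648AE73000057EF6D370001d0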

I did a zfs send/recv of a snapshot, around 67 GB, and I thought it took a long time. Then the initiator system crashed and dumped. It rebooted, and I got back into it without any trouble.
What I immediately saw was that the zpool "backpool" backed by the FC LUN was no longer present. I did a zpool import, and it was back again. I did another test with a much smaller snapshot, around 450 MB, and that worked fine, although it also seemed to take a lot of time.
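
For completeness, the transfer is of the usual form (dataset and snapshot names here are just examples, not the real ones):

  zfs send tank/data@backup-snap | zfs receive -F backpool/data

and after the reboot the pool came back with a plain

  zpool import backpool
  zpool status backpool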

I tried the bigger snapshot once again, and the same thing happened: the system crashed and dumped. I now have two dump files, but I don't know whether this is a problem on the target side or the initiator side.

I can provide access to the dump files.
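
If someone wants a quick look before pulling down the whole dumps, I can also run something like this against them myself (assuming savecore has already expanded the dump under /var/crash/omni):

  cd /var/crash/omni
  savecore -f vmdump.0     # only needed if just the compressed vmdump.0 is there
  mdb unix.0 vmcore.0
  > ::status               # panic string and dump summary
  > ::stack                # stack of the panicking thread
  > ::msgbuf               # console messages leading up to the panic
  > $q

and post the output.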

Here is some information from the two systems that I find interesting:

The initiator system, omni:

omni:

root@omni:/var/log# dmesg | grep qlc
Oct  2 18:33:08 omni qlc: [ID 439991 kern.info] NOTICE: Qlogic qlc(0,0): Loop OFFLINE


root@omni:/# dmesg | grep scsi
Oct  2 18:34:58 omni scsi: [ID 243001 kern.info] /pci@19,0/pci8086,3410@9/pci1077,138@0/fp@0,0 (fcp0):
Oct  2 18:34:58 omni genunix: [ID 408114 kern.info] /scsi_vhci/disk@g600144f0c648ae73000057ef6d370001 (sd5) offline
Oct  2 18:34:58 omni genunix: [ID 483743 kern.info] /scsi_vhci/disk@g600144f0c648ae73000057ef6d370001 (sd5) multipath status: failed: path 4 fp0/disk@w2101001b32a19a92,0 is offline
root@omni:/# dmesg | grep multipath
Oct  2 18:34:58 omni genunix: [ID 483743 kern.info] /scsi_vhci/disk@g600144f0c648ae73000057ef6d370001 (sd5) multipath status: failed: path 4 fp0/disk@w2101001b32a19a92,0 is offline

As you can see, the loop is marked offline at the time of the crash. Strangely, there is also a message about a failed multipath... Why? There is no multipath configured here...
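
Unless it is simply that scsi_vhci/mpxio manages the FC LUN even when there is only a single path, in which case "multipath status: failed" would just mean that the one and only path went offline? To see how the LUN is enumerated I could run something like this (the /dev/rdsk name is a guess based on the GUID in the messages above):

  mpathadm list lu
  mpathadm show lu /dev/rdsk/c0t600144F0C648AE73000057EF6D370001d0s2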

The target system, omni2:

root@omni2:/root# grep stmf /var/adm/messages
Oct  2 09:56:37 omni2 pseudo: [ID 129642 kern.info] pseudo-device: stmf_sbd0
Oct  2 09:56:37 omni2 genunix: [ID 936769 kern.info] stmf_sbd0 is /pseudo/stmf_sbd@0
Oct  2 09:56:46 omni2 pseudo: [ID 129642 kern.info] pseudo-device: stmf0
Oct  2 09:56:46 omni2 genunix: [ID 936769 kern.info] stmf0 is /pseudo/stmf@0
Oct  2 09:57:31 omni2 pseudo: [ID 129642 kern.info] pseudo-device: stmf0
Oct  2 09:57:31 omni2 genunix: [ID 936769 kern.info] stmf0 is /pseudo/stmf@0
Oct  2 09:57:31 omni2 stmf_sbd: [ID 690249 kern.warning] WARNING: ioctl(DKIOCINFO) failed 25


There is this warning, ioctl(DKIOCINFO) failed 25, which I have tried to find out more about, but without success.
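
Errno 25 on illumos is ENOTTY, "inappropriate ioctl for device":

  grep -w 25 /usr/include/sys/errno.h

My guess, and it is only a guess, is that stmf_sbd issues DKIOCINFO against the LU backing store, and a backing store that is not a whole physical disk (a zvol or a file) doesn't support that particular disk ioctl, so the warning is probably harmless. Corrections welcome.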


Perhaps it is simply that the FC connection isn't good enough. The cable shouldn't be a problem, since it is brand new, but it could of course be something with the HBAs. I could get another cable and set up multipath to see how that works, but let's start with this first.
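
Before ordering anything I will at least look at the link error counters on the initiator side, something like

  fcinfo hba-port -l

(and stmfadm list-target -v on the target side to see the target port state), and check whether counters like loss-of-sync and invalid CRC climb during a big transfer.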

Best regards from/Med vänliga hälsningar från

Johan Kragsterman

Capvert

