[OmniOS-discuss] NULL pointer in scsi_vhci triggered by mpathadm (vhci_mpapi_sync_lu_oid_list)
Thibault VINCENT
thibault.vincent at smartjog.com
Wed Aug 14 23:17:19 UTC 2013
Sorry, this is a cross-post; it was initially sent to the illumos developer ML, as recommended on #omnios.
It may still be of interest to the OmniOS audience, as tests (not yet complete) suggest there is no issue under SmartOS.
-----------------
Hello,
I'm building ZFS servers with multipathed SAS enclosures and experience
a very reproducible BAD TRAP when running "mpathadm list lu" in some
situations.
You'll find an fmdump at the end of this message and a stack trace
attached. The core is huge but if required I can set kmem_flags to any
value and have it hosted on FTP.
Hardware is:
* Dell R720xd server with two CPUs and 192GB RAM (all firmware up to date)
* 2 * LSI 9207-8e (fw version 16, the latest AFAIK)
* Supermicro SC847E26-RJBOD1 JBODs with LSI expanders
* Seagate ST4000NM0023 drives and STEC SSDs
It's running OmniOS, stable or up-to-date bloody: it crashed in both
cases. "stmsboot -D mpt_sas -e" was used to enable MPxIO.
The NULL dereference happens immediately with "mpathadm list lu", at any
time, whether just after booting or 2 days later. Otherwise the system
runs fine with multipathed drives as long as the command is not used;
"stmsboot -L" even reports its device name mappings with no problem.
However, the fault is closely tied to the SAS topology:
* one JBOD with 45 drives on a single HBA, single cable (no multipath)
==> no crash, "list lu" reports one path which is expected
* same JBOD with a second cable, making a path to the same HBA or to
another identical LSI HBA ==> no crash, I see two paths and mpathadm is
functional
* either of the two scenarios above but with a second JBOD daisy-chained
to the first, raising the topology to 90 drives ==> CRASH in
vhci_mpapi_sync_lu_oid_list()/ddi_get_instance()
* any scenario involving two JBODs and their 90 drives, such as one on
each HBA with a single cable ==> same CRASH
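
As a reading aid, the topology observations above can be written down as data to check which factor lines up with the crash. This is only a sketch: the field names and the `predicts_crash` helper are made up for illustration, the path counts for the two daisy-chained cases are assumed to carry over from the single-JBOD scenarios they extend, and the earlier Dell H310/H710 setup (where 6 disks were enough) is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    jbods: int      # number of JBOD enclosures in the SAS topology
    drives: int     # total drives visible behind the LSI HBAs
    paths: int      # paths per LU (1 = no multipath)
    crashed: bool   # did "mpathadm list lu" panic the box?


# Transcribed from the list above; hypothetical encoding, not tool output.
OBSERVED = [
    Scenario(jbods=1, drives=45, paths=1, crashed=False),  # single cable
    Scenario(jbods=1, drives=45, paths=2, crashed=False),  # second cable
    Scenario(jbods=2, drives=90, paths=1, crashed=True),   # daisy chain, one cable
    Scenario(jbods=2, drives=90, paths=2, crashed=True),   # daisy chain, two cables
    Scenario(jbods=2, drives=90, paths=1, crashed=True),   # one JBOD per HBA
]


def predicts_crash(field):
    """Return True if each value of `field` always gave the same outcome."""
    outcomes = {}
    for scenario in OBSERVED:
        outcomes.setdefault(getattr(scenario, field), set()).add(scenario.crashed)
    return all(len(seen) == 1 for seen in outcomes.values())


if __name__ == "__main__":
    for name in ("jbods", "drives", "paths"):
        print(name, predicts_crash(name))
```

With these five observations the number of JBODs (or drives) separates crash from no crash, while the number of paths does not, which is why the report points at topology size rather than at multipathing as such.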
Also, it is very interesting that:
* the problem was far more serious when using a Dell H310 or H710
controller [mr_sas] for the system drives. Then only 6 disks in the SAS
topology of the LSI HBAs were enough to trigger the fault; 5 were
not. I have since stopped using the Dell controllers, which gives the
cases above.
* it doesn't crash on a Dell R510, all other peripherals being
identical. Even under heavy ZFS load.
* I tried removing one CPU, leaving only one RAM module, disabling
hyperthreading, disabling NUMA, disabling virtual devices from Dell
iDRACs, changing PCI-e slot.
* forcing 4k blocks, the f_sym module, and the load-balance algo in vhci are
not responsible: I tested with and without each of these.
* it happens with or without a filesystem imported (even with brand new drives).
# fmdump -u 110c95e8-1906-cb20-c364-8c27099d8b3c -V
TIME                           UUID                                 SUNW-MSG-ID
Aug 14 2013 13:01:41.614295000 110c95e8-1906-cb20-c364-8c27099d8b3c SUNOS-8000-KL
TIME                 CLASS                                         ENA
Aug 14 13:01:41.0142 ireport.os.sunos.panic.dump_available         0x0000000000000000
Aug 14 13:01:14.8130 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000
nvlist version: 0
version = 0x0
class = list.suspect
uuid = 110c95e8-1906-cb20-c364-8c27099d8b3c
code = SUNOS-8000-KL
diag-time = 1376485301 39052
de = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = fmd
authority = (embedded nvlist)
nvlist version: 0
version = 0x0
product-id = PowerEdge-R720xd
chassis-id = 9FSLFY1
server-id = san-1
(end authority)
mod-name = software-diagnosis
mod-version = 0.1
(end de)
fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list[0])
nvlist version: 0
version = 0x0
class = defect.sunos.kernel.panic
certainty = 0x64
asru = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = sw
object = (embedded nvlist)
nvlist version: 0
path = /var/crash/unknown/.110c95e8-1906-cb20-c364-8c27099d8b3c
(end object)
(end asru)
resource = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = sw
object = (embedded nvlist)
nvlist version: 0
path = /var/crash/unknown/.110c95e8-1906-cb20-c364-8c27099d8b3c
(end object)
(end resource)
savecore-succcess = 1
dump-dir = /var/crash/unknown
dump-files = vmdump.14
os-instance-uuid = 110c95e8-1906-cb20-c364-8c27099d8b3c
panicstr = BAD TRAP: type=e (#pf Page fault) rp=ffffff00f50999a0 addr=2c occurred in module "genunix" due to a NULL pointer dereference
panicstack = unix:die+df () | unix:trap+db3 () | unix:cmntrap+e6 () | genunix:ddi_get_instance+8 () | scsi_vhci:vhci_mpapi_sync_lu_oid_list+5d () | scsi_vhci:vhci_mpapi_ioctl+a8 () | scsi_vhci:vhci_mpapi_ctl+e3 () | scsi_vhci:vhci_ioctl+5d () | genunix:cdev_ioctl+39 () | specfs:spec_ioctl+60 () | genunix:fop_ioctl+55 () | genunix:ioctl+9b () | unix:brand_sys_sysenter+1c9 () |
crashtime = 1376476491
panic-time = Wed Aug 14 10:34:51 2013 UTC
(end fault-list[0])
fault-status = 0x1
severity = Major
__ttl = 0x1
__tod = 0x520b7fb5 0x249d65d8
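
For readers who want to pick the panic record apart, here is a minimal sketch that splits the panicstack string into frames and compares the fault address (addr=2c) with a structure offset. The helper names are hypothetical. The `DevInfoPrefix` layout is an assumption: it mirrors the leading fields of illumos `struct dev_info` as recalled from `ddi_impldefs.h` on a 64-bit kernel, and should be checked against the source of the build in question before drawing conclusions.

```python
import ctypes
import itertools
import re

# Copied from the panicstack field of the fmdump output above.
PANICSTACK = (
    "unix:die+df () | unix:trap+db3 () | unix:cmntrap+e6 () | "
    "genunix:ddi_get_instance+8 () | "
    "scsi_vhci:vhci_mpapi_sync_lu_oid_list+5d () | "
    "scsi_vhci:vhci_mpapi_ioctl+a8 () | scsi_vhci:vhci_mpapi_ctl+e3 () | "
    "scsi_vhci:vhci_ioctl+5d () | genunix:cdev_ioctl+39 () | "
    "specfs:spec_ioctl+60 () | genunix:fop_ioctl+55 () | genunix:ioctl+9b () | "
    "unix:brand_sys_sysenter+1c9 () |"
)
FAULT_ADDR = 0x2C  # "addr=2c" in panicstr

FRAME_RE = re.compile(r"(\w+):(\w+)\+([0-9a-f]+) \(\)")
TRAP_HANDLERS = {"die", "trap", "cmntrap"}  # frames added by trap handling


def parse_panicstack(text):
    """Return a list of (module, function, offset) tuples, innermost first."""
    return [(m.group(1), m.group(2), int(m.group(3), 16))
            for m in FRAME_RE.finditer(text)]


def faulting_frames(frames):
    """Skip the leading trap-handler frames; return (faulting frame, caller)."""
    rest = list(itertools.dropwhile(
        lambda f: f[0] == "unix" and f[1] in TRAP_HANDLERS, frames))
    return rest[0], rest[1]


class DevInfoPrefix(ctypes.Structure):
    # ASSUMPTION: leading fields of struct dev_info, 64-bit pointers.
    _fields_ = [
        ("devi_parent", ctypes.c_uint64),
        ("devi_child", ctypes.c_uint64),
        ("devi_sibling", ctypes.c_uint64),
        ("devi_binding_name", ctypes.c_uint64),
        ("devi_addr", ctypes.c_uint64),
        ("devi_nodeid", ctypes.c_int32),
        ("devi_instance", ctypes.c_int32),
    ]


if __name__ == "__main__":
    fault, caller = faulting_frames(parse_panicstack(PANICSTACK))
    print("fault in", fault, "called from", caller)
    print("devi_instance offset:", hex(DevInfoPrefix.devi_instance.offset))
```

Under that assumed layout the offset of `devi_instance` equals the fault address, which would be consistent with ddi_get_instance() being handed a NULL dev_info pointer by vhci_mpapi_sync_lu_oid_list(). This is an interpretation of the trace, not something the dump output states directly.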
Any thoughts about this?
Thanks
Best regards,
--
Thibault VINCENT - Infrastructure Engineer
SmartJog | T: +33 1 5868 6238
27 Blvd Hippolyte Marquès, 94200 Ivry-sur-Seine, France
www.smartjog.com | a TDF Group company
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: fault.txt
URL: <https://omniosce.org/ml-archive/attachments/20130814/9f7246dd/attachment-0001.txt>