[OmniOS-discuss] mpt_sas & ixgbe kernel buffer alloc failures
Marion Hakanson
hakansom at ohsu.edu
Fri Jan 30 19:46:47 UTC 2015
Greetings,
We've had a tough one to fix here, and I'm hoping for some help.
This is an OmniOS-151012 system, and it started suffering ZFS pool
lockups under busy I/O conditions a couple months ago.
The server is an Intel S2600WP server (128GB RAM), and a Quanta
M4600H JBOD, connected via an LSI SAS 9206-16e HBA (with four SAS cables
in a multipath arrangement). The JBOD has 60x 4TB drives, configured as
five 12-drive raidz3 vdevs in one pool.
Initial testing of this system had been done with a 9207-8e (two SAS cables),
and it was stable under as heavy a load as I could generate using filebench,
etc. We updated the 9206-16e to LSI's P19 firmware (was P17), to match what
had been on the 9207-8e for the stress testing, but the problem remained.
We then took two of the four SAS cables out of service, and at that time a
scrub passed without further lockups. As time passed, however, the
lockups returned, giving the usual device timeouts, and now also warning that
the HBA's firmware signature was invalid. A reboot and power-cycle of the
JBOD were necessary to get things completely unstuck (and the firmware
signature warnings disappeared too).
Suspecting a faulty HBA, we swapped back in the known-good 9207-8e
(and two SAS cables), but now we're facing what looks like a different
recurring problem. Rather than SAS timeouts, we've started seeing errors
logged like these:
ixgbe: [ID 611667 kern.info] NOTICE: ixgbe1: ixgbe_rx_copy: allocate buffer failed
. . .
scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e04@2/pci1000,3040@0 (mpt_sas0):
        Unable to allocate dma memory for extra SGL.
scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e04@2/pci1000,3040@0 (mpt_sas0):
        MPT SGL mem alloc failed
The system is not completely hung, and existing login sessions continue to
work, but new login sessions cannot be established (no network response).
A scrub on the pool has produced lots of checksum errors as well, though
it's hard to know if these are real checksum errors (perhaps from earlier
hang/reboot incidents), or ones induced by the lack of DMA buffers, etc.
We've tried some different BIOS settings in an attempt to give more
memory-mapped device address space. Enabling "memory-mapped devices
above 4GB" unfortunately causes the dual Intel X540 10GbE NICs to not
be attached by the OS, and produces these kernel warnings at boot:
NOTICE: unsupported 64-bit prefetch memory on pci-pci bridge [0/3/2]
NOTICE: unsupported 64-bit prefetch memory on pci-pci bridge [0/17/0]
Enabling "Maximize memory below 4GB" doesn't cause any problems, but
did not alleviate the problem either.
Here's a "::memstat" dump, after the dma memory allocation errors
have started showing up:
# echo "::memstat" | mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                   25477565             99521   76%
ZFS File Data             6275294             24512   19%
Anon                        19707                76    0%
Exec and libs                 724                 2    0%
Page cache                   4292                16    0%
Free (cachelist)             7535                29    0%
Free (freelist)           1747133              6824    5%
Total                    33532250            130985
Physical                 33532249            130985
#
Is it just me, or is that an awful lot of kernel memory in use?
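As a rough cross-check of the ::memstat figures above, here is a minimal
Python sketch that recomputes the MB and %Tot columns from the page counts.
It assumes a 4 KiB page size (the usual x86 base page; the output itself does
not state it), and the names PAGES and summarize are illustrative only, not
part of mdb.

```python
PAGE_SIZE = 4096  # bytes; ASSUMPTION: 4 KiB base pages, not stated in the output

# Page counts copied from the ::memstat output above.
PAGES = {
    "Kernel": 25477565,
    "ZFS File Data": 6275294,
    "Anon": 19707,
    "Exec and libs": 724,
    "Page cache": 4292,
    "Free (cachelist)": 7535,
    "Free (freelist)": 1747133,
}


def summarize(pages, page_size=PAGE_SIZE):
    """Return (total_pages, {name: (megabytes, percent_of_total)})."""
    total = sum(pages.values())
    rows = {}
    for name, count in pages.items():
        mb = count * page_size // (1024 * 1024)  # whole MB, truncated
        pct = round(100 * count / total)         # nearest whole percent
        rows[name] = (mb, pct)
    return total, rows


if __name__ == "__main__":
    total, rows = summarize(PAGES)
    for name, (mb, pct) in rows.items():
        print(f"{name:<18}{mb:>8} MB {pct:>4}%")
    print(f"{'Total':<18}{total * PAGE_SIZE // (1024 * 1024):>8} MB")
```

Under that assumption the numbers agree with the mdb output: the per-category
page counts sum to the reported total of 33532250 pages, and the Kernel row
works out to 99521 MB, about 76% of all pages.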
Regards,
Marion