[OmniOS-discuss] OmniOS NFS fileserver hanging under sustained high write loads
Chris Siebenmann
cks at cs.toronto.edu
Mon May 4 21:45:27 UTC 2015
We now have a reproducible setup with OmniOS r151014 where an OmniOS
NFS fileserver will experience memory exhaustion and then hang in the
kernel if it receives sustained NFS write traffic from multiple clients
at a rate faster than its local disks can sustain. The machine will run
okay for a while but with mdb -k's ::memstat showing steadily increasing
'Kernel' memory usage; after a while it tips over the edge, the ZFS ARC
starts shrinking, free RAM reported by 'vmstat' goes basically to nothing
(eg 182 MB), and the system locks hard.
(We have not at this point tried to make a crash dump, but past attempts
to do so in similar situations have been failures.)
A fairly reliable signal that the system is about to lock up very
soon is that '::svc_pool nfs' will report a steadily increasing and often
very large number of 'Pending requests' (as well as all configured threads
being active). Our most recent lockup reported over 270,000 pending
requests. Our working hypothesis is that something in the NFS server code
is accepting (too many) incoming requests and filling all memory with them,
which then leads to the hard lock.
(It's possible that lower levels are also involved, eg TCP socket
receive buffers.)
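For anyone who wants to watch for this warning sign on their own
fileserver, here is a minimal Python sketch of polling the pending
request count. It is only an illustration under stated assumptions, not
something we run in production. It assumes the '::svc_pool nfs' output
contains a line with the text 'Pending requests' followed by a number;
the exact layout is not shown in this message, so the pattern is
deliberately loose. The function names and the three-sample threshold
are made up for the example.

```python
import re
import subprocess

# Assumption: the dcmd output has a line containing "Pending requests"
# followed by a number. The separator ("=", ":" or just spaces) is a guess,
# so the pattern accepts any of them.
PENDING_RE = re.compile(r"Pending requests\s*[=:]?\s*(\d+)")


def parse_pending_requests(text):
    """Return the first 'Pending requests' count found in text, or None."""
    match = PENDING_RE.search(text)
    return int(match.group(1)) if match else None


def is_steadily_increasing(samples, min_samples=3):
    """True if there are enough samples and each one exceeds the previous."""
    if len(samples) < min_samples:
        return False
    return all(later > earlier
               for earlier, later in zip(samples, samples[1:]))


def read_svc_pool():
    """Feed '::svc_pool nfs' to mdb -k and return its output.

    This must run as root on the fileserver itself.
    """
    result = subprocess.run(
        ["mdb", "-k"],
        input="::svc_pool nfs\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A cron job or loop could call read_svc_pool() every few seconds, append
parse_pending_requests() of the result to a list, and raise an alert
once is_steadily_increasing() returns true for the most recent samples.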
Our current simplified test setup: the OmniOS machine has 64 GB RAM
with 2x 1G Ethernet for incoming NFS writes, writing to a single pool of
a mirrored pair of 2 TB WD SE SATA drives. There are six client machines
on one network, 25 on the other, and all client machines are running
multiple processes that are writing files of various sizes (from 50 MB
through several GB); all client machines are Ubuntu Linux. We believe
(but have not tested) that multiple clients and possibly multiple
processes are required to provoke this behavior. All NFS traffic is
NFS v3 over TCP.
Has anyone seen or heard of anything like this before?
Is there any way to limit the number of pending NFS requests that the
system will accept? Allowing 270,000 strikes me as kind of absurd.
(I don't suppose anyone with a test environment wants to take a shot
at reproducing this. For us, this happens within an hour or three of
running at this load, and generally happens faster with a smaller number
of NFS server threads.)
- cks