[OmniOS-discuss] What should we look at in a memory exhaustion situation?
Doug Hughes
doug at will.to
Tue Apr 28 19:08:09 UTC 2015
I've seen something similar to this in Solaris 10 (that Oracle won't fix).
It was diagnosed as basically "storage is too slow to service requests;
throw more hardware at it." What happens is that NFS buffers just keep
building up and building up (even if it's a read request storm), the I/O
system can't keep up with the request rate, and eventually the system goes
into panic/desperation swapping and then becomes totally unresponsive.
You could verify whether the same thing is happening by watching something
like 'vmstat 3'. Once free memory gets down to around 3MB (if it's the same
issue), the paging stats will climb until numbers start showing up in the
'de' column, and then you're dead.
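For reference, a minimal way to watch for that pattern from the shell (the
log path below is just an arbitrary example; the column meanings assume the
standard illumos/Solaris vmstat output):

    # watch 'free' (free memory, KB), 'sr' (page scan rate) and 'de'
    # (anticipated short-term memory shortfall, KB); 'de' is normally 0
    vmstat 3

    # or keep a timestamped log you can look back at after a lockup
    while true; do date; vmstat 3 2; done >> /var/tmp/vmstat.log

Sustained non-zero numbers in 'de' together with a collapsing 'free' column
is the signature I'm describing.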
Anyway, it's plausible that the same thing is happening here. You could
watch for it and see whether you observe something similar.
On Tue, Apr 28, 2015 at 11:06 AM, Chris Siebenmann <cks at cs.toronto.edu>
wrote:
> We now have a reproducible situation where high NFS load can cause our
> specific fileserver configuration to lock up with what looks like a
> memory exhaustion deadlock. So far, attempts to get a crash dump haven't
> worked (although they may someday). Without a crash dump, what system
> stats and so on should we be looking at that would be useful for people
> to identify the kernel bugs involved here, either in the run up to the
> lockup or in kmdb after the fact once we break into it after the machine
> locks up?
>
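For what it's worth, once you do get into kmdb (or onto the live system with
'mdb -k' before it locks up), the generic starting points for a suspected
kernel memory exhaustion are standard dcmds rather than anything specific to
this bug; roughly:

    ::memstat            (kernel vs. ZFS/ARC vs. free page breakdown)
    ::kmastat            (per-kmem-cache usage; look for caches that only grow)
    ::arc                (ARC sizes and targets)
    ::stacks -m nfssrv   (summarized kernel stacks for the NFS server module;
                          the module name here is an assumption on my part)
    ::svc_pool -v nfs    (the pending-request counts mentioned further down)

Treat that as a sketch of where to look, not a diagnosis.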
> (Our specific fileserver is NFS to ZFS pools with mirrored vdevs where
> each side of the mirror is an iSCSI disk accessed over two backend
> iSCSI networks. We suspect that iSCSI is a contributing factor. Our
> fileservers have 64 GB of RAM.)
>
> So far we have vmstat, mpstat, network volume, arcstat, '::memstat'
> from mdb -k, and some homegrown NFS and ZFS DTrace activity monitoring
> scripts. These say that just before the lockup happens, free memory and
> ARC usage collapses abruptly and catastrophically, 'Kernel' memory usage
> goes to almost or totally 100%, all CPUs spike to over 90% system time,
> we see over a thousand runnable processes[*], and we have over a GB
> of in-flight NFS writes but there is very little actual ZFS (and NFS)
> IO either in flight or being completed.
>
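For anyone who wants a rough stand-in for the homegrown DTrace monitoring
(only a sketch of the idea, assuming NFSv3 clients and the stock nfsv3
DTrace provider, not the actual scripts mentioned above), a one-liner like
this shows NFS read/write operation counts per 5-second interval:

    dtrace -n '
        nfsv3:::op-read-start, nfsv3:::op-write-start
        {
                @ops[probename] = count();
        }
        tick-5sec
        {
                printa(@ops);
                trunc(@ops);
        }'

Comparing those counts against what the disks are actually completing
(e.g. 'iostat -xn 5') gives the in-flight-versus-completed picture
described above.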
> (The NFS service pool also reports a massively increasing and
> jaw-dropping number of 'Pending requests'; the last snapshot of
> '::svc_pool -v nfs' a few seconds before the crash has 376862 of them.)
>
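(If anyone else wants to collect the same '::svc_pool' numbers over time, a
simple loop like the one below is enough; the interval and the log path are
arbitrary:)

    #!/bin/sh
    # snapshot the NFS service pool state from the live kernel every 10s
    while true; do
        date
        echo '::svc_pool -v nfs' | mdb -k
        sleep 10
    done >> /var/tmp/svc_pool.log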
> This seems to happen more often with lower values for the number of
> NFS server threads, but even very large values are not immune from it
> (our most recent lockup happened at 4096 threads). Slowdowns in NFS server
> IO responsiveness seem to make this more likely to happen; past slowdowns
> have come from both disk IO problems and from full or nearly full pools.
>
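(For context on the thread-count knob: on OmniOS/illumos the NFS server
thread limit is normally the 'servers' property of the nfs protocol in
sharectl; the value below is only an example, and whether raising it helps
at all is exactly what is in question here:)

    # inspect the current setting
    sharectl get -p servers nfs

    # raise the limit (example value), then restart the NFS server
    sharectl set -p servers=4096 nfs
    svcadm restart svc:/network/nfs/server:default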
> Thanks in advance.
>
> - cks
> [*: this was with quite a lot of NFS server threads configured.]