[OmniOS-discuss] Clues for tracking down why kernel memory isn't being released?

Chris Siebenmann cks at cs.toronto.edu
Thu Jul 16 16:48:14 UTC 2015


I wrote:
>  We have one ZFS-based NFS fileserver that persistently runs at a very
> high level of non-ARC kernel memory usage that never seems to shrink.
> On a 128 GB machine, mdb's ::memstat reports 95% memory usage by just
> 'Kernel' while the ZFS ARC is only at about 21 GB (as reported by
> 'kstat -m') although c_max should allow it to grow much bigger.
>
>  According to ::kmastat, a *huge* amount of this memory appears to be
> vanishing into allocated but not used kmem_alloc_131072 slab buffers:
> 
> > ::kmastat
> cache                            buf       buf       buf memory      alloc alloc
> name                            size    in use     total in use    succeed  fail
> ------------------------------ ----- --------- --------- ------ ---------- -----
> [...]
> kmem_alloc_131072               128K         6    613033  74.8G  196862991     0
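
As a sanity check on that row, here is a minimal sketch of the arithmetic in Python. The constants are copied from the ::kmastat line above; the helper name idle_bytes is made up for illustration. It only multiplies buffer counts by buffer size, so it ignores any slab overhead that ::kmastat may include in its 'memory in use' column.

```python
# Figures copied from the kmem_alloc_131072 row of the ::kmastat output.
BUF_SIZE = 131072     # bytes per buffer (128K)
BUFS_TOTAL = 613033   # buffers currently held by the cache
BUFS_IN_USE = 6       # buffers actually in use

GIB = 1024 ** 3


def idle_bytes(buf_size, total, in_use):
    """Bytes in buffers that the cache holds but nothing is using."""
    return (total - in_use) * buf_size


held = BUFS_TOTAL * BUF_SIZE
idle = idle_bytes(BUF_SIZE, BUFS_TOTAL, BUFS_IN_USE)

print(f"held by cache: {held / GIB:.1f} GiB")
print(f"idle:          {idle / GIB:.1f} GiB")
```

Both figures come out at about 74.8 GiB, which agrees with the 74.8G that ::kmastat reports: essentially all of the cache is idle.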

 It turns out that the explanation for this is relatively simple, as
is the workaround. Put simply: the OmniOS kernel does not actually
free up these deallocated cache objects until the system is put under
relatively strong memory pressure. Crucially, *the ZFS ARC does not
create this memory pressure*; I think that you pretty much need a
user-level program allocating enough memory to trigger it, and I think
the memory growth needs to happen fast enough that the kernel doesn't
reclaim enough memory through lesser means (such as shrinking the ZFS
ARC).

(Specifically, you need to force kmem_reap() to be called. The primary
path for this is if 'freemem' drops under 'lotsfree', which is only a few
hundred MB on many systems. See usr/src/uts/common/os/vm_pageout.c in
the OmniOS source repo.)
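
The 'freemem' versus 'lotsfree' comparison can be sketched from user level. The following Python is only an illustration, not how the kernel does it: it assumes that 'kstat -p unix:0:system_pages' prints lines of the form 'unix:0:system_pages:freemem<TAB>value' with values in pages, which should be checked on your own system. The sample text and the function names are made up.

```python
def parse_kstat_pages(text):
    """Parse 'kstat -p' style lines into {statistic name: integer value}."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        key, value = parts
        name = key.rsplit(":", 1)[-1]  # last field of module:inst:name:stat
        try:
            stats[name] = int(value)
        except ValueError:
            continue  # skip non-integer statistics
    return stats


def below_lotsfree(stats):
    """True if free memory is under the threshold discussed above."""
    return stats["freemem"] < stats["lotsfree"]


# Made-up sample values (in pages), NOT taken from a real system.
SAMPLE = (
    "unix:0:system_pages:freemem\t1500000\n"
    "unix:0:system_pages:lotsfree\t65536\n"
)

stats = parse_kstat_pages(SAMPLE)
print("freemem below lotsfree:", below_lotsfree(stats))
```

With the made-up sample, freemem is far above lotsfree, which is the steady state in which no reaping would be triggered through this path.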

 Since our fileservers are purely NFS fileservers and have a basically
static level of user memory usage, they rarely or never rapidly use up
enough memory to trigger this 'allocated but unused' reclaim[*].

 The good news is that it's easy enough these days to eat memory at the
user level (you can do it with modern 64-bit scripting languages like
Python, even at an interactive prompt). The bad news is that when we did
this on the server in question we provoked a significant system stall at
both the NFS server level and even the level of ssh logins and shells;
this is clearly not something that we'd want to automate.
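
For illustration, a user-level memory eater along these lines might look like the sketch below. This is not the exact thing we ran; the function name, the chunk size, and the 4096-byte page size are all assumptions. It writes one byte per page so that the memory is really backed by RAM rather than just reserved. The demo at the bottom only takes 16 MiB; scaling it up to tens of GB is what risks the stall described above.

```python
def eat_memory(total_bytes, chunk_bytes=64 * 1024 * 1024, page_size=4096):
    """Allocate total_bytes in chunks and touch every page of each chunk.

    Returns the list of chunks; the memory stays allocated for as long
    as the caller keeps a reference to that list.
    """
    chunks = []
    allocated = 0
    while allocated < total_bytes:
        size = min(chunk_bytes, total_bytes - allocated)
        buf = bytearray(size)
        # Write one byte per (assumed) page so the page is actually used.
        for offset in range(0, size, page_size):
            buf[offset] = 1
        chunks.append(buf)
        allocated += size
    return chunks


# Deliberately tiny demo: 16 MiB in 4 MiB chunks.
held = eat_memory(16 * 1024 * 1024, chunk_bytes=4 * 1024 * 1024)
print(f"holding {sum(len(c) for c in held)} bytes in {len(held)} chunks")
del held  # dropping the reference lets Python free the memory
```

Allocating in chunks rather than in one giant request makes it possible to stop partway if the system starts to struggle.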

 It's my personal opinion that there should be something in the kernel
that automatically reaps drastically outsized kmem caches after a
while. It's absurd that we've run for weeks with more than 70 GB of RAM
sitting unused and an undersized ZFS ARC because of this.

	- cks

[*: interested parties can see how often cache reaping has been triggered
    with the following 'mdb -k' command:
	::walk kmem_cache | ::printf "%4d %s\n" kmem_cache_t cache_reap cache_name

    Even on this heavily used fileserver, up for 45 days, the reap count
    was *8*.  Many of our other fileservers, with less usage, have reap
    counts of 0.
]
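
If you save the output of that mdb command to a file, a few lines of Python can summarize it. This is a rough sketch that assumes each output line is a reap count followed by a cache name, as the "%4d %s\n" format implies; the sample text is made up (apart from the count of 8 mentioned above) and the function names are hypothetical.

```python
def parse_reap_counts(text):
    """Turn '<count> <cache name>' lines into {cache name: count}."""
    counts = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip anything that is not '<number> <name>'
        counts[parts[1].strip()] = int(parts[0])
    return counts


def never_reaped(counts):
    """Names of caches whose reap count is still zero, sorted."""
    return sorted(name for name, n in counts.items() if n == 0)


# Made-up sample in the assumed output format.
SAMPLE = "   8 kmem_alloc_131072\n   0 example_cache_a\n   8 example_cache_b\n"

counts = parse_reap_counts(SAMPLE)
print("max reap count:", max(counts.values()))
print("never reaped:", never_reaped(counts))
```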

