[OmniOS-discuss] ILB memory leak?

Thu Nov 5 11:38:30 UTC 2015

To the mailing list as well...

On 22/10/2015 09:43, Al Slater wrote:
 > On 21/10/2015 17:35, Dan McDonald wrote:
 >>
 >>> On Oct 21, 2015, at 6:08 AM, Al Slater <al.slater at scluk.com>
 >>> wrote:
 >>>
 >>> Hi,
 >>>
 >>> I am running omnios r151014 on a couple of machines with a couple
 >>> of zones each.  1 zone runs apache as an SSL reverse proxy, the
 >>> other runs ILB for load balancing web to app tier connections.
 >>>
 >>> I noticed that in the ILB zone, the ilbd process memory grows to
 >>> about 2 GB.  Restarting ILB releases the memory, and then the
 >>> memory usage gradually increases again, with each increase
 >>> roughly twice the size of the previous one.  I run a cron job
 >>> twice a day (8am and 8pm) which restarts the ilb service and
 >>> releases the memory.
 >>>
 >>> A graph of memory usage is available at
 >>> https://www.dropbox.com/s/zaz51apxslnivlq/ILB_Memory_2_days.png?dl=0
 >>>
 >>> There are currently 62 rules in the load balancer, with a total
 >>> of 664 server/port pairs.
 >>>
 >>> Is there anything I can provide that would help track this down?
 >>
 >> You can use svccfg(1M) to enable user-level memory debugging on ilb.
 >> It may cause the ilb daemon to dump core.  (And you're just noticing
 >> this in the process, not kernel memory consumption, correct?)
 >
 > I am seeing kernel memory consumption increasing as well, but that may
 > be a different issue.  The ilbd process memory is definitely growing.
 >
 >> As root:
 >>
 >> svcadm disable -t ilb
 >> svccfg -s ilb setenv LD_PRELOAD libumem.so
 >> svccfg -s ilb setenv UMEM_DEBUG default
 >> svccfg -s ilb refresh
 >> svcadm enable ilb
 >>
 >> That should enable user-level memory debugging.  If you get a
 >> coredump, save it and share it.  If you don't and the ilb daemon
 >> keeps running, eventually please:
 >>
 >> gcore `pgrep ilbd`
 >>
 >> and share THAT corefile.  You can also do this by yourself:
 >>
 >> mdb <ilbd-core>
 >> > ::findleaks
 >>
 >> and share the ::findleaks output.
 >>
 >> Once you're done generating corefiles, repeat the steps above, but
 >> use "unsetenv LD_PRELOAD" and "unsetenv UMEM_DEBUG" instead of the
 >> setenv lines.
 >
 > Thanks Dan.  As we are talking about production boxes here, I will have
 > to try and reproduce on another box and then I will give the process
 > above a go and see what we come up with.
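
For reference, the procedure quoted above can be sketched as a small
shell helper.  This is only a sketch: the function name umem_debug_cmds
is made up for illustration, and it just prints the commands from Dan's
steps (with setenv or unsetenv) so they can be reviewed before being
run as root.

```shell
#!/bin/sh
# Print (do not run) the commands that toggle libumem debugging on the
# ilb service, following the steps quoted above.
# umem_debug_cmds is a hypothetical helper name, not an existing tool.
umem_debug_cmds() {
    case "$1" in
        on)
            # Preload libumem and turn on its default debugging features.
            env_preload="svccfg -s ilb setenv LD_PRELOAD libumem.so"
            env_debug="svccfg -s ilb setenv UMEM_DEBUG default"
            ;;
        off)
            # Undo the two settings once the corefiles have been taken.
            env_preload="svccfg -s ilb unsetenv LD_PRELOAD"
            env_debug="svccfg -s ilb unsetenv UMEM_DEBUG"
            ;;
        *)
            echo "usage: umem_debug_cmds on|off" >&2
            return 1
            ;;
    esac
    echo "svcadm disable -t ilb"
    echo "$env_preload"
    echo "$env_debug"
    echo "svccfg -s ilb refresh"
    echo "svcadm enable ilb"
}
```

Running "umem_debug_cmds on" prints the five commands in order; once
they look right they can be pasted into a root shell, and
"umem_debug_cmds off" gives the matching clean-up sequence.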

I have reproduced the problem on a test box.

prstat shows:

3041 daemon   3946M 3946M sleep   59    0   0:48:03 0.1% ilbd/1

memstat:

root@loki:/export/home/BRIGHTON/aslate# echo ::memstat | mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     238420               931   12%
ZFS File Data              630861              2464   31%
Anon                      1054835              4120   51%
Exec and libs                2204                 8    0%
Page cache                  10624                41    1%
Free (cachelist)             9236                36    0%
Free (freelist)            105626               412    5%

Total                     2051806              8014
Physical                  2051805              8014
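
As a sanity check on the table above, the MB column can be reproduced
from the page counts.  The sketch below assumes 4096-byte pages (the
usual x86 base page size; I have not confirmed it on this box, and
pagesize(1) would tell), and pages_to_mb is just an illustrative name.

```shell
#!/bin/sh
# Convert a ::memstat page count to whole megabytes (truncated).
# ASSUMPTION: 4096-byte pages unless a second argument says otherwise.
# pages_to_mb is a hypothetical helper name.
pages_to_mb() {
    pages=$1
    page_size=${2:-4096}
    echo $(( pages * page_size / 1048576 ))
}
```

With that assumption, "pages_to_mb 1054835" gives 4120, matching the
Anon row, which is in the same range as the 3946M that prstat reports
for ilbd.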

mdb findleaks:

root@loki:/export/home/BRIGHTON/aslate# mdb core.3041
Loading modules: [ libumem.so.1 libc.so.1 libcmdutils.so.1 libuutil.so.1 ld.so.1 ]
  > ::findleaks
findleaks: no memory leaks detected
  >

I am now seeing lots of log messages like the following in
/var/adm/messages:

Nov  5 11:17:01 l1-lb2 ilbd[3041]: [ID 410242 daemon.error]
ilbd_hc_probe_timer: cannot restart timer: rule ggp server _ggp.11,
disabling it
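
To see which rules and servers are affected, messages of this form can
be tallied.  The following is a rough sketch: count_timer_failures is a
name invented here, and it assumes each message occupies a single line
in the log file and keeps the "rule X server Y," wording shown above.

```shell
#!/bin/sh
# Tally "cannot restart timer" errors per rule/server pair.
# Reads syslog lines on stdin and prints "count rule server" lines.
# count_timer_failures is a hypothetical helper name.
count_timer_failures() {
    grep 'ilbd_hc_probe_timer: cannot restart timer' |
        sed 's/.*rule \([^ ]*\) server \([^,]*\),.*/\1 \2/' |
        sort | uniq -c
}
```

It would be used as "count_timer_failures < /var/adm/messages".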

So I was wrong about it growing to 2 GB; the truth is nearer 4 GB.  I am
guessing that ilbd_hc_restart_timer is failing because no more memory
can be allocated.

I have the 4 GB core file.  Is there anything useful I can extract from
it to try and spot where the problem is?

-- Al Slater