[OmniOS-discuss] Who here had lockd/nlockmgr problems?

Dan McDonald danmcd at omniti.com
Mon Nov 17 15:12:50 UTC 2014


ISTR someone here had a problem where his/her nlockmgr SMF service wouldn't start.  I've encountered this problem myself recently (on an OI VM, but it's running the same new open-source lockd that's in OmniOS r151010 and later), and wanted to share a conversation from the illumos developers' list.

FYI,
Dan


> Begin forwarded message:
> 
> Subject: Re: [developer] lockd not starting?
> From: Dan McDonald <danmcd at omniti.com>
> Date: November 17, 2014 at 10:10:38 AM EST
> Cc: illumos Developer <developer at lists.illumos.org>, Matt Amdur <matt.amdur at delphix.com>
> To: Sebastien Roy <sebastien.roy at delphix.com>
> 
> 
>> On Nov 16, 2014, at 7:54 PM, Sebastien Roy <sebastien.roy at delphix.com> wrote:
>> 
>> Hey Dan,
>> 
>> I think we're looking into a similar problem here at Delphix. We've noticed that the nfs/nlockmgr service goes into maintenance mode after a timeout of the SM_CRASH call to statd.
>> 
>> Upon startup, the statd daemon blocks while notifying clients. These clients are cached in /var/statmon/sm.bak/... If any of these clients are unreachable, the SM_CRASH call issued by klm will time out (this timeout is much shorter than the one statd uses to give up on client notifications), and the nfs/nlockmgr service will then end up in maintenance state.
>> 
>> We've been discussing possible fixes to this problem. One easy fix that Matt Amdur (Cc'ed) proposed would be to shorten the timeout that statd uses to notify clients to reduce the chance that lockd's SM_CRASH call would timeout in the event of an unreachable client. We haven't implemented the fix yet.
>> 
>> A workaround is to clear statd's cached clients.
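[Editor's note: a hedged sketch of how one might spot the wedging client. The entry names under /var/statmon/sm.bak encode the client address after an "ipv4." prefix, as the listing further down this message shows; the helper names here are mine, and the `ping host timeout` form assumes illumos-style ping.]

```shell
#!/bin/sh
# Sketch: list statd's cached clients and flag the unreachable ones that
# can stall notification (and thus time out lockd's SM_CRASH call).

statmon_addr() {
    # Strip the "ipv4." prefix to recover the dotted-quad address,
    # e.g. ipv4.10.0.1.68 -> 10.0.1.68
    printf '%s\n' "${1#ipv4.}"
}

check_statmon_clients() {
    dir=${1:-/var/statmon/sm.bak}
    for entry in "$dir"/ipv4.*; do
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        addr=$(statmon_addr "$(basename "$entry")")
        # illumos-style "ping host timeout"; substitute your ping flags.
        if ping "$addr" 2 >/dev/null 2>&1; then
            echo "reachable:   $addr"
        else
            echo "UNREACHABLE: $addr  ($entry)"
        fi
    done
}
```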
> 
> Okay!  This makes some modicum of sense.  Here's what I have now:
> 
> # ls -lt /var/statmon/sm.bak/
> total 2
> lrwxrwxrwx   1 daemon   daemon        10 Nov 12 19:55 ipv4.10.0.1.68 -> Everywhere
> lrwxrwxrwx   1 daemon   daemon        28 Nov 11 12:44 ipv4.10.8.3.241 -> everywhere.office.omniti.com
> # nslookup everywhere
> Server:         10.8.3.1
> Address:        10.8.3.1#53
> 
> Name:   everywhere.office.omniti.com
> Address: 10.8.3.241
> 
> # 
> 
> While travelling last week, I had to park myself on a 10.0.1.0/24 network.  My VMs are configured as link-sharers (same-link peers), so I was using NFS to my VM over that same link.  Once I renumbered back to the intended OmniTI networks, things started going screwy.
> 
> I'm going to remove the 10.0.1.68 one and try again... BINGO!
> 
> # svcs -xv nlockmgr
> svc:/network/nfs/nlockmgr:default (NFS lock manager)
> State: maintenance since November 16, 2014 04:00:05 PM EST
> Reason: Start method failed repeatedly, last exited with status 1.
>   See: http://illumos.org/msg/SMF-8000-KS
>   See: man -M /usr/share/man -s 1M lockd
>   See: /var/svc/log/network-nfs-nlockmgr:default.log
> Impact: 1 dependent service is not running:
>        svc:/network/nfs/server:default
> # rm /var/statmon/sm.bak/ipv4.10.0.1.68 
> # svcadm clear nlockmgr
> # svcs -xv nlockmgr
> svc:/network/nfs/nlockmgr:default (NFS lock manager)
> State: online since November 17, 2014 10:08:53 AM EST
>   See: man -M /usr/share/man -s 1M lockd
>   See: /var/svc/log/network-nfs-nlockmgr:default.log
> Impact: None.
> # 
> 
> This is actually good, in that I believe the circumstances for this bug are now reproducible.
> 
> Thanks!
> Dan
> 
> 
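For anyone hitting the same symptom, the recovery steps in the forwarded transcript boil down to a short sequence. A hedged sketch follows: the function name and the parameterized directory are mine (so the removal step can be exercised outside illumos); the `svcadm`/`svcs` follow-up is exactly as shown above and assumes an illumos box.

```shell
#!/bin/sh
# Sketch of the recovery procedure from the thread above.
# clear_stale_statmon removes one stale statd client entry; the directory
# and entry name are parameters rather than hard-coded /var/statmon/sm.bak.
clear_stale_statmon() {
    statmon_dir=$1    # normally /var/statmon/sm.bak
    stale_entry=$2    # e.g. ipv4.10.0.1.68 (the unreachable client)
    if [ ! -e "$statmon_dir/$stale_entry" ] && [ ! -L "$statmon_dir/$stale_entry" ]; then
        echo "no such entry: $stale_entry" >&2
        return 1
    fi
    rm "$statmon_dir/$stale_entry"
}

# On an actual illumos system, follow up by clearing the SMF maintenance
# state and verifying, exactly as in the transcript:
#   svcadm clear nlockmgr
#   svcs -xv nlockmgr
```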


