[OmniOS-discuss] Who here had lockd/nlockmgr problems?

Jacob Vosmaer contact at jacobvosmaer.nl
Sun Nov 23 19:35:01 UTC 2014


Hi Dan,

Thanks for sharing! I run OmniOS v11 r151012 on a home NFS server that gets
turned on and off a lot; I was pretty consistently getting the nlockmgr
maintenance state just after boot. I even went as far as to set the start
timeout to 10 minutes:

svccfg -s nlockmgr setprop start/timeout_seconds = count: 600

And then I edited the /lib/svc/method/nlockmgr startup script to retry up
to 100 times:

diff --git a/nlockmgr b/nlockmgr
index 3a543d6..0b24bbc 100755
--- a/nlockmgr
+++ b/nlockmgr
@@ -61,4 +61,12 @@ then
        fi
 fi

-exec /usr/lib/nfs/lockd
+i=0
+while ! (echo "$0: starting lockd (attempt $i)"; /usr/lib/nfs/lockd) ; do
+  if [ $i -ge 100 ] ; then
+    exit 1
+  fi
+  i=$((i + 1))
+done
+
+exit 0

An ugly hack but at least it freed me from having to baby-sit the NFS
server after boot.
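
For anyone who wants the same idea without patching the loop inline, here
is a minimal sketch of the retry logic as a standalone function. The name
`retry` and its argument order are made up for illustration; nothing like
it ships in the stock /lib/svc/method/nlockmgr script.

```shell
#!/bin/sh
# retry MAX CMD [ARGS...]: run CMD until it succeeds, at most MAX times.
# Hypothetical helper for illustration; not part of the stock method script.
retry() {
    max=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        # Log each attempt to stderr so it ends up in the SMF service log.
        echo "retry: attempt $attempt of $max: $*" >&2
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
    done
    # Every attempt failed.
    return 1
}
```

In the method script this would be used as
`retry 100 /usr/lib/nfs/lockd || exit 1`. It is still a workaround, since
it only waits out the failures rather than removing their cause.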

But then I saw your email and I decided to just remove the contents of
/var/statmon/sm.bak:

rm /var/statmon/sm.bak/*

I have not had any nlockmgr trouble since. Yay! I only did the sm.bak
cleanup today though.
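
Before wiping the whole directory it may be worth looking at what statd
has cached. A rough sketch, assuming the entries are symlinks named
ipv4.<address> as in Dan's listing below, and assuming readlink(1) is
available on the system. The function name `list_sm_clients` is
hypothetical.

```shell
#!/bin/sh
# list_sm_clients DIR: print "ADDRESS TARGET" for each ipv4.* entry in DIR.
# Assumes the ipv4.<address> naming seen in /var/statmon/sm.bak; the
# function name is made up for this example.
list_sm_clients() {
    dir=$1
    for entry in "$dir"/ipv4.*; do
        # Skip the unexpanded pattern when nothing matches. The -L test
        # is needed because the symlinks may be dangling.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        name=${entry##*/}      # strip the directory part
        addr=${name#ipv4.}     # strip the "ipv4." prefix
        # Fall back to "?" when the entry is not a symlink.
        target=$(readlink "$entry" 2>/dev/null) || target="?"
        echo "$addr $target"
    done
}
```

Run as `list_sm_clients /var/statmon/sm.bak`, this prints one cached
client per line, which makes it easier to remove only the addresses that
are no longer reachable instead of everything.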

Best wishes,

Jacob



2014-11-17 16:12 GMT+01:00 Dan McDonald <danmcd at omniti.com>:

> ISTR someone here had a problem where his/her nlockmgr SMF service
> wouldn't start.  I've encountered this problem myself recently (on an OI VM,
> but it's running the same new open-source lockd that's in OmniOS with
> r151010 and later), and wanted to share a conversation from the illumos
> developer's list.
>
> FYI,
> Dan
>
>
> > Begin forwarded message:
> >
> > Subject: Re: [developer] lockd not starting?
> > From: Dan McDonald <danmcd at omniti.com>
> > Date: November 17, 2014 at 10:10:38 AM EST
> > Cc: illumos Developer <developer at lists.illumos.org>, Matt Amdur <matt.amdur at delphix.com>
> > To: Sebastien Roy <sebastien.roy at delphix.com>
> >
> >
> >> On Nov 16, 2014, at 7:54 PM, Sebastien Roy <sebastien.roy at delphix.com> wrote:
> >>
> >> Hey Dan,
> >>
> >> I think we're looking into a similar problem here at Delphix. We've
> >> noticed that the nfs/nlockmgr service goes into maintenance mode after a
> >> timeout of the SM_CRASH call to statd.
> >>
> >> Upon startup, the statd daemon is blocked notifying clients. These
> >> clients are cached in /var/statmon/sm.bak/... If any of these clients are
> >> unreachable, the SM_CRASH call issued by klm will time out (this timeout is
> >> much shorter than that which statd uses to give up on client
> >> notifications), and the nfs/nlockmgr service will then end up in
> >> maintenance state.
> >>
> >> We've been discussing possible fixes to this problem. One easy fix that
> >> Matt Amdur (Cc'ed) proposed would be to shorten the timeout that statd uses
> >> to notify clients to reduce the chance that lockd's SM_CRASH call would
> >> time out in the event of an unreachable client. We haven't implemented the
> >> fix yet.
> >>
> >> A workaround is to clear statd's cached clients.
> >
> > Okay!  This makes some modicum of sense.  Here's what I have now:
> >
> > # ls -lt /var/statmon/sm.bak/
> > total 2
> > lrwxrwxrwx   1 daemon   daemon        10 Nov 12 19:55 ipv4.10.0.1.68 -> Everywhere
> > lrwxrwxrwx   1 daemon   daemon        28 Nov 11 12:44 ipv4.10.8.3.241 -> everywhere.office.omniti.com
> > # nslookup everywhere
> > Server:         10.8.3.1
> > Address:        10.8.3.1#53
> >
> > Name:   everywhere.office.omniti.com
> > Address: 10.8.3.241
> >
> > #
> >
> > While travelling last week, I had to park myself on a 10.0.1.0/24
> > network.  My VMs are configured to be link-sharers (same-link peers), so I
> > was using NFS over the same link to my VM.  Once I renumbered back to the
> > intended OmniTI networks, things started going screwy.
> >
> > I'm going to remove the 10.0.1.68 one and try again... BINGO!
> >
> > # svcs -xv nlockmgr
> > svc:/network/nfs/nlockmgr:default (NFS lock manager)
> > State: maintenance since November 16, 2014 04:00:05 PM EST
> > Reason: Start method failed repeatedly, last exited with status 1.
> >   See: http://illumos.org/msg/SMF-8000-KS
> >   See: man -M /usr/share/man -s 1M lockd
> >   See: /var/svc/log/network-nfs-nlockmgr:default.log
> > Impact: 1 dependent service is not running:
> >        svc:/network/nfs/server:default
> > # rm /var/statmon/sm.bak/ipv4.10.0.1.68
> > # svcadm clear nlockmgr
> > # svcs -xv nlockmgr
> > svc:/network/nfs/nlockmgr:default (NFS lock manager)
> > State: online since November 17, 2014 10:08:53 AM EST
> >   See: man -M /usr/share/man -s 1M lockd
> >   See: /var/svc/log/network-nfs-nlockmgr:default.log
> > Impact: None.
> > #
> >
> > This is actually good, in that I believe the circumstances for this bug
> > are now reproducible.
> >
> > Thanks!
> > Dan
> >
> >
>