[OmniOS-discuss] r151012 nlockmgr fails to start
Schweiss, Chip
chip at innovates.com
Fri Oct 10 19:09:46 UTC 2014
Apparently something common in my OmniOS setup is triggering this. I have
no idea what yet, and I'm still pretty green at digging through this kind
of issue.

On one of my VMs used for script development, I exported the data pool,
planning to test importing it with a different cache location, and the
problem happened immediately. Now I cannot get nlockmgr to start at all
on this VM. I tried disabling all the NFS services and re-enabling them.
It still fails with:

/usr/lib/nfs/lockd[862]: [ID 491006 daemon.error] Cannot establish NLM
service over <file desc. 9, protocol udp> : I/O error. Exiting
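The error reads like lockd is failing while setting up the NLM service on
its UDP endpoint. I'm only guessing at where to look, but checking whether
the lock manager ever gets registered with rpcbind seems like a reasonable
first step:

rpcinfo -p | egrep 'nlockmgr|status'   # nlockmgr is RPC program 100021; status (statd) is 100024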
root@ZFSsendTest1:/root# svcs -a|grep nfs
disabled       13:47:05 svc:/network/nfs/log:default
disabled       13:47:11 svc:/network/nfs/rquota:default
disabled       13:55:05 svc:/network/nfs/server:default
disabled       13:55:32 svc:/network/nfs/nlockmgr:default
disabled       13:55:32 svc:/network/nfs/mapid:default
disabled       13:55:32 svc:/network/nfs/status:default
disabled       13:55:32 svc:/network/nfs/client:default
disabled       13:55:57 svc:/network/nfs/cbd:default
root@ZFSsendTest1:/root# svcadm enable svc:/network/nfs/status:default \
    svc:/network/nfs/cbd:default svc:/network/nfs/mapid:default \
    svc:/network/nfs/server:default svc:/network/nfs/nlockmgr:default
root@ZFSsendTest1:/root# svcs -a|grep nfs
disabled       13:47:05 svc:/network/nfs/log:default
disabled       13:47:11 svc:/network/nfs/rquota:default
disabled       13:55:32 svc:/network/nfs/client:default
online         13:56:56 svc:/network/nfs/status:default
online         13:56:56 svc:/network/nfs/cbd:default
online         13:56:56 svc:/network/nfs/mapid:default
offline        13:56:56 svc:/network/nfs/server:default
offline*       13:56:56 svc:/network/nfs/nlockmgr:default
root@ZFSsendTest1:/root# svcs -a|grep nfs
disabled       13:47:05 svc:/network/nfs/log:default
disabled       13:47:11 svc:/network/nfs/rquota:default
disabled       13:55:32 svc:/network/nfs/client:default
online         13:56:56 svc:/network/nfs/status:default
online         13:56:56 svc:/network/nfs/cbd:default
online         13:56:56 svc:/network/nfs/mapid:default
offline        13:56:56 svc:/network/nfs/server:default
maintenance    13:58:11 svc:/network/nfs/nlockmgr:default
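For anyone following along, the places I know to look for the reason it
goes to maintenance are svcs -xv and the service's SMF log (assuming the
log file follows the usual SMF naming convention):

svcs -xv svc:/network/nfs/nlockmgr:default
tail /var/svc/log/network-nfs-nlockmgr:default.log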
This VM has never had RSF-1 on it, so that definitely isn't the trigger.
It has never exhibited this problem before today, and it has been rebooted
many times.
I wonder if the problem is triggered by exporting a pool with NFS exports
that have active client connections. That is always the case on my
production systems. This VM has one NFS client that was connected when I
exported the pool.
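If anyone wants to try to reproduce this, the sequence I suspect is roughly
the following; the pool, dataset, and cachefile names here are just
placeholders for illustration:

# share a test dataset over NFS and mount it from a client
zfs set sharenfs=on testpool/export
# ...leave the client mount active, then export the pool out from under it
zpool export testpool
# re-import with an alternate cachefile, the way RSF-1 does
zpool import -o cachefile=/var/tmp/testpool.cache testpool
# watch whether nlockmgr goes back into maintenance
svcs -xv nlockmgr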
Now nlockmgr dies and goes into maintenance mode regardless of whether I
import the data pool or not.
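Something I may try next is to trace which system call is actually
returning the I/O error when lockd gives up; roughly along these lines,
assuming DTrace is usable on this VM (errno 5 is EIO):

# in one shell, watch for any system call from lockd that fails with EIO
dtrace -qn 'syscall:::return
    /execname == "lockd" && errno == 5/
    { printf("%s() returned EIO\n", probefunc); }'
# in another shell, kick the service again while tracing
svcadm clear svc:/network/nfs/nlockmgr:default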
Any advice on where to dig for a better diagnosis of this would be helpful.
If any developers would like access to this VM, I'd be happy to arrange
that too.
-Chip
On Fri, Oct 10, 2014 at 9:26 AM, Richard Elling <
richard.elling at richardelling.com> wrote:
>
> On Oct 10, 2014, at 6:15 AM, "Schweiss, Chip" <chip at innovates.com> wrote:
>
>
> On Thu, Oct 9, 2014 at 9:54 PM, Dan McDonald <danmcd at omniti.com> wrote:
>
>>
>> On Oct 9, 2014, at 10:23 PM, Schweiss, Chip <chip at innovates.com> wrote:
>>
>> > Just tried my 2nd system. On r151010, nlockmgr starts after clearing
>> maintenance mode; on r151012 it will not start at all. nfs/status was
>> enabled and online.
>> >
>> > The commonality I see on the two systems I have tried is that they are
>> both part of an HA cluster. So they don't import the pool at boot, but
>> RSF-1 imports it with the cache mapped to a different location.
>>
>> That could be something HA Inc. needs to further test. We don't directly
>> support RSF-1, after all.
>>
>>
> There really isn't anything different from an auto-imported pool. I'm
> suspecting that using an alternate cache location may be triggering
> something else to go wrong in nlockmgr.
>
>
> No, these are totally separate subsystems. RSF-1 imports the pool. NFS
> sharing is started by the zpool command, in userland, after the dataset
> is mounted. You can do the same procedure manually... no magic pixie
> dust needed.
>
>
> Here's the command RSF-1 uses to import the pool:
>
> zpool import -c /opt/HAC/RSF-1/etc/volume-cache/nrgpool.cache \
>     -o cachefile=/opt/HAC/RSF-1/etc/volume-cache/nrgpool.cache-live \
>     -o failmode=panic nrgpool
>
> After the pool import, it puts the IP addresses back and is done. That
> happens in less than 1 second.
>
> In the meantime, the NFS services auto-start and nlockmgr starts spinning.
>
>
> Perhaps share doesn't properly start all of the services? Does it work ok
> if you manually "svcadm enable" all of the NFS services?
>
>
> -- richard
>
>
>
>
>> > nlockmgr is becoming a real show stopper.
>>
>> svcadm disable nlockmgr nfs/status
>> svcadm enable nfs/status
>> svcadm enable nlockmgr
>>
>> You may wish to discuss this on illumos as well; I'm not sure who else
>> is seeing this besides me (once) and you (seemingly many times).
>>
>
> I did that this time; no joy. Today I'm working on a virtual setup with
> HA to see if I can get this reproduced on r151012.
>
> I thought this nlockmgr problem was related to lots of NFS exports until
> I ran into this on my SSD pool. It used to be able to fail over in about
> 3-5 seconds. Now nlockmgr sits in a spinning state for a few minutes and
> fails every time. Clearing the maintenance state brings it back nearly
> instantly. This is on r151010. On r151012 it fails every time.
>
> Hopefully I can reproduce it, and I'll start a new thread copying illumos too.
>
> -Chip
>
>
> _______________________________________________
> OmniOS-discuss mailing list
> OmniOS-discuss at lists.omniti.com
> http://lists.omniti.com/mailman/listinfo/omnios-discuss
>
>