[OmniOS-discuss] Testing RSF-1 with zpool/nfs HA
Stephan Budach
stephan.budach at JVM.DE
Thu Feb 18 14:00:37 UTC 2016
Am 18.02.16 um 12:14 schrieb Michael Rasmussen:
> On Thu, 18 Feb 2016 07:13:36 +0100
> Stephan Budach <stephan.budach at JVM.DE> wrote:
>
>> So, when I issue a simple ls -l on the folder of the vdisks while the switchover is happening, the command sometimes concludes in 18 to 20 seconds, but sometimes ls will just sit there for minutes.
>>
> This is a known limitation in NFS. NFS was never intended to be
> clustered, so what you experience is that the NFS process on the client
> side keeps kernel locks for the now-unavailable NFS server, and any
> request to the process hangs waiting for these locks to be resolved.
> This can be compared to a situation where you hot-swap a drive in the
> pool without notifying the pool.
>
> The only way to resolve this is to forcefully kill all NFS client
> processes and then restart the NFS client.
>
>
>
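(For reference, that forced cleanup on an illumos client would look
roughly like the sketch below; /mnt/vdisks is a placeholder mount
point, and the remount assumes a matching /etc/vfstab entry.)

    # hypothetical forced cleanup of a hung NFS mount on the client side
    fuser -ck /mnt/vdisks   # kill all processes holding the mount
    umount -f /mnt/vdisks   # force-unmount the stale NFS mount
    mount /mnt/vdisks       # remount; assumes an /etc/vfstab entry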
This is not the main issue, as this is not a clustered NFS but a
failover one. Of course the client will have to reset its connection,
but it seems the NFS client does just that, once the NFS share becomes
available on the failover host. Looking at the tcpdump, I found that
failing over from the primary NFS server to the secondary works
straight away; the service stalls only a few seconds longer than RSF-1
needs to switch over the ZPOOL and the VIP. In my tests it was always
the switchback that caused these issues. Looking at the tcpdump, I
noticed that when the switchback occurred, the dump was swamped with
DUP! ACKs. This indicated to me that the NFS server still running on
the primary kept sending outstanding ACKs to the now-returned client,
which vigorously rejected them.
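(The capture itself was nothing fancy; something like the following,
where e1000g0 and 192.168.10.10 stand in for the actual interface and
cluster VIP, shows the NFS traffic during the switchback.)

    # watch NFS traffic to/from the cluster VIP during the switchback
    # e1000g0 and 192.168.10.10 are placeholders for interface and VIP
    tcpdump -i e1000g0 -n host 192.168.10.10 and port 2049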
I think the outcome is that the server finally gave up sending those
ACKs and the connection then resumed. So, what helped in this case was
to restart the NFS server on the primary after the ZPOOL had been
switched over to the secondary…
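On OmniOS that restart is a one-liner, assuming the stock NFS server
SMF service:

    # restart the NFS server on the node the ZPOOL moved away from
    svcadm restart svc:/network/nfs/server:default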
I have just tried waiting at least 5 minutes before failing back from
the secondary to the primary node, and this time it went as smoothly as
it did when I initially failed over from the primary to the secondary.
However, I think for sanity's sake the RSF-1 agent should also restart
the NFS server on the host it has just moved the ZPOOL away from.
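A minimal sketch of such a hook, purely hypothetical since I don't know
offhand where RSF-1 expects its post-release scripts, would be
something like:

    #!/bin/sh
    # hypothetical RSF-1 post-release hook: run on the node that just
    # gave up the ZPOOL, to drop stale NFS/TCP state towards clients
    svcadm restart svc:/network/nfs/server:default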
So, as far as I am concerned, this issue is resolved. Thanks everybody
for chiming in on this.
Cheers,
Stephan