
[RFC,1/1] nfsd: Initial implementation of NFSv4 Courteous Server

Message ID 20210603181438.109851-1-dai.ngo@oracle.com (mailing list archive)
State New
Series [RFC,1/1] nfsd: Initial implementation of NFSv4 Courteous Server

Commit Message

Dai Ngo June 3, 2021, 6:14 p.m. UTC
Currently an NFSv4 client must maintain its lease by using at least one
of its state tokens or, if nothing else, by issuing a RENEW (4.0) or a
singleton SEQUENCE (4.1) at least once during each lease period. If the
client fails to renew the lease, for any reason, the Linux server expunges
the state tokens immediately upon detecting the "failure to renew the
lease" condition and begins returning NFS4ERR_EXPIRED if the client should
reconnect and attempt to use the (now) expired state.

The default lease period for the Linux server is 90 seconds.  The typical
client cuts that in half and will issue a lease renewing operation every
45 seconds. The 90-second lease period is very short considering the
potential for moderately long-term network partitions.  A network partition
refers to any loss of network connectivity between the NFS client and the
NFS server, regardless of its root cause.  This includes NIC failures, NIC
driver bugs, network misconfigurations & administrative errors, routers &
switches crashing and/or having software updates applied, even down to
cables being physically pulled.  In most cases, these network failures are
transient, although the duration is unknown.

A server which does not immediately expunge the state on lease expiration
is known as a Courteous Server.  A Courteous Server continues to recognize
previously generated state tokens as valid until a conflict arises between
the expired state and a request from another client, or until the server
reboots.

The initial implementation of the Courteous Server will do the following:

. when the laundromat thread detects an expired client that still has
established state on the Linux server, mark the client as a
COURTESY_CLIENT and skip destroying the client and all its state;
otherwise destroy the client as usual.

. on detecting a conflict between an OPEN request and a COURTESY_CLIENT,
destroy the expired client and all its state, skip the delegation recall,
then allow the conflicting request to succeed.

. on detecting a conflict between a LOCK/LOCKT request and a
COURTESY_CLIENT, destroy the expired client and all its state, then allow
the conflicting request to succeed.

To be done:

. fix a problem with 4.1 where the Linux server keeps returning
SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply after the
expired client reconnects, causing the client to keep sending BCTS
requests to the server.

. handle OPEN conflict with share reservations.

. instead of destroying the client and all its state on conflict, only
destroy the state that conflicts with the current request.

. destroy the COURTESY_CLIENT either after a fixed period of time, to
release resources, or in reaction to memory pressure.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
---
 fs/nfsd/nfs4state.c        | 94 +++++++++++++++++++++++++++++++++++++++++++---
 fs/nfsd/state.h            |  1 +
 include/linux/sunrpc/svc.h |  1 +
 3 files changed, 91 insertions(+), 5 deletions(-)

Comments

Dai Ngo June 11, 2021, 8:42 a.m. UTC | #1
Hi Bruce,

Just a reminder that this RFC still needs a review.

Thanks,
-Dai

On 6/3/21 11:14 AM, Dai Ngo wrote:
> Currently an NFSv4 client must maintain its lease by using at least one
> of its state tokens or, if nothing else, by issuing a RENEW (4.0) or a
> singleton SEQUENCE (4.1) at least once during each lease period. If the
> client fails to renew the lease, for any reason, the Linux server expunges
> the state tokens immediately upon detecting the "failure to renew the
> lease" condition and begins returning NFS4ERR_EXPIRED if the client should
> reconnect and attempt to use the (now) expired state.
>
> The default lease period for the Linux server is 90 seconds.  The typical
> client cuts that in half and will issue a lease renewing operation every
> 45 seconds. The 90-second lease period is very short considering the
> potential for moderately long-term network partitions.  A network partition
> refers to any loss of network connectivity between the NFS client and the
> NFS server, regardless of its root cause.  This includes NIC failures, NIC
> driver bugs, network misconfigurations & administrative errors, routers &
> switches crashing and/or having software updates applied, even down to
> cables being physically pulled.  In most cases, these network failures are
> transient, although the duration is unknown.
>
> A server which does not immediately expunge the state on lease expiration
> is known as a Courteous Server.  A Courteous Server continues to recognize
> previously generated state tokens as valid until a conflict arises between
> the expired state and a request from another client, or until the server
> reboots.
>
> The initial implementation of the Courteous Server will do the following:
>
> . when the laundromat thread detects an expired client that still has
> established state on the Linux server, mark the client as a
> COURTESY_CLIENT and skip destroying the client and all its state;
> otherwise destroy the client as usual.
>
> . on detecting a conflict between an OPEN request and a COURTESY_CLIENT,
> destroy the expired client and all its state, skip the delegation recall,
> then allow the conflicting request to succeed.
>
> . on detecting a conflict between a LOCK/LOCKT request and a
> COURTESY_CLIENT, destroy the expired client and all its state, then allow
> the conflicting request to succeed.
>
> To be done:
>
> . fix a problem with 4.1 where the Linux server keeps returning
> SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply after the
> expired client reconnects, causing the client to keep sending BCTS
> requests to the server.
>
> . handle OPEN conflict with share reservations.
>
> . instead of destroying the client and all its state on conflict, only
> destroy the state that conflicts with the current request.
>
> . destroy the COURTESY_CLIENT either after a fixed period of time, to
> release resources, or in reaction to memory pressure.
>
> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
> ---
>   fs/nfsd/nfs4state.c        | 94 +++++++++++++++++++++++++++++++++++++++++++---
>   fs/nfsd/state.h            |  1 +
>   include/linux/sunrpc/svc.h |  1 +
>   3 files changed, 91 insertions(+), 5 deletions(-)
>
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index b517a8794400..969995872752 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -171,6 +171,7 @@ renew_client_locked(struct nfs4_client *clp)
>   
>   	list_move_tail(&clp->cl_lru, &nn->client_lru);
>   	clp->cl_time = ktime_get_boottime_seconds();
> +	clear_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags);
>   }
>   
>   static void put_client_renew_locked(struct nfs4_client *clp)
> @@ -4610,6 +4611,38 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp)
>   	nfsd4_run_cb(&dp->dl_recall);
>   }
>   
> +/*
> + * Set rq_conflict_client if the conflicting client has expired
> + * and return true; otherwise return false.
> + */
> +static bool
> +nfsd_set_conflict_client(struct nfs4_delegation *dp)
> +{
> +	struct svc_rqst *rqst;
> +	struct nfs4_client *clp;
> +	struct nfsd_net *nn;
> +	bool ret;
> +
> +	if (!i_am_nfsd())
> +		return false;
> +	rqst = kthread_data(current);
> +	if (rqst->rq_prog != NFS_PROGRAM || rqst->rq_vers < 4)
> +		return false;
> +	rqst->rq_conflict_client = NULL;
> +	clp = dp->dl_recall.cb_clp;
> +	nn = net_generic(clp->net, nfsd_net_id);
> +	spin_lock(&nn->client_lock);
> +
> +	if (test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags) &&
> +				!mark_client_expired_locked(clp)) {
> +		rqst->rq_conflict_client = clp;
> +		ret = true;
> +	} else
> +		ret = false;
> +	spin_unlock(&nn->client_lock);
> +	return ret;
> +}
> +
>   /* Called from break_lease() with i_lock held. */
>   static bool
>   nfsd_break_deleg_cb(struct file_lock *fl)
> @@ -4618,6 +4651,8 @@ nfsd_break_deleg_cb(struct file_lock *fl)
>   	struct nfs4_delegation *dp = (struct nfs4_delegation *)fl->fl_owner;
>   	struct nfs4_file *fp = dp->dl_stid.sc_file;
>   
> +	if (nfsd_set_conflict_client(dp))
> +		return false;
>   	trace_nfsd_deleg_break(&dp->dl_stid.sc_stateid);
>   
>   	/*
> @@ -5237,6 +5272,22 @@ static void nfsd4_deleg_xgrade_none_ext(struct nfsd4_open *open,
>   	 */
>   }
>   
> +static bool
> +nfs4_destroy_courtesy_client(struct nfs4_client *clp)
> +{
> +	struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);
> +
> +	spin_lock(&nn->client_lock);
> +	if (!test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags) ||
> +			mark_client_expired_locked(clp)) {
> +		spin_unlock(&nn->client_lock);
> +		return false;
> +	}
> +	spin_unlock(&nn->client_lock);
> +	expire_client(clp);
> +	return true;
> +}
> +
>   __be32
>   nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nfsd4_open *open)
>   {
> @@ -5286,7 +5337,13 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
>   			goto out;
>   		}
>   	} else {
> +		rqstp->rq_conflict_client = NULL;
>   		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
> +		if (status == nfserr_jukebox && rqstp->rq_conflict_client) {
> +			if (nfs4_destroy_courtesy_client(rqstp->rq_conflict_client))
> +				status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
> +		}
> +
>   		if (status) {
>   			stp->st_stid.sc_type = NFS4_CLOSED_STID;
>   			release_open_stateid(stp);
> @@ -5472,6 +5529,8 @@ nfs4_laundromat(struct nfsd_net *nn)
>   	};
>   	struct nfs4_cpntf_state *cps;
>   	copy_stateid_t *cps_t;
> +	struct nfs4_stid *stid;
> +	int id = 0;
>   	int i;
>   
>   	if (clients_still_reclaiming(nn)) {
> @@ -5495,6 +5554,15 @@ nfs4_laundromat(struct nfsd_net *nn)
>   		clp = list_entry(pos, struct nfs4_client, cl_lru);
>   		if (!state_expired(&lt, clp->cl_time))
>   			break;
> +
> +		spin_lock(&clp->cl_lock);
> +		stid = idr_get_next(&clp->cl_stateids, &id);
> +		spin_unlock(&clp->cl_lock);
> +		if (stid) {
> +			/* client still has states */
> +			set_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags);
> +			continue;
> +		}
>   		if (mark_client_expired_locked(clp)) {
>   			trace_nfsd_clid_expired(&clp->cl_clientid);
>   			continue;
> @@ -6440,7 +6508,7 @@ static const struct lock_manager_operations nfsd_posix_mng_ops  = {
>   	.lm_put_owner = nfsd4_fl_put_owner,
>   };
>   
> -static inline void
> +static inline int
>   nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
>   {
>   	struct nfs4_lockowner *lo;
> @@ -6453,6 +6521,8 @@ nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
>   			/* We just don't care that much */
>   			goto nevermind;
>   		deny->ld_clientid = lo->lo_owner.so_client->cl_clientid;
> +		if (nfs4_destroy_courtesy_client(lo->lo_owner.so_client))
> +			return -EAGAIN;
>   	} else {
>   nevermind:
>   		deny->ld_owner.len = 0;
> @@ -6467,6 +6537,7 @@ nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
>   	deny->ld_type = NFS4_READ_LT;
>   	if (fl->fl_type != F_RDLCK)
>   		deny->ld_type = NFS4_WRITE_LT;
> +	return 0;
>   }
>   
>   static struct nfs4_lockowner *
> @@ -6734,6 +6805,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>   	unsigned int fl_flags = FL_POSIX;
>   	struct net *net = SVC_NET(rqstp);
>   	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
> +	bool retried = false;
>   
>   	dprintk("NFSD: nfsd4_lock: start=%Ld length=%Ld\n",
>   		(long long) lock->lk_offset,
> @@ -6860,7 +6932,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>   		list_add_tail(&nbl->nbl_lru, &nn->blocked_locks_lru);
>   		spin_unlock(&nn->blocked_locks_lock);
>   	}
> -
> +again:
>   	err = vfs_lock_file(nf->nf_file, F_SETLK, file_lock, conflock);
>   	switch (err) {
>   	case 0: /* success! */
> @@ -6875,7 +6947,12 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>   	case -EAGAIN:		/* conflock holds conflicting lock */
>   		status = nfserr_denied;
>   		dprintk("NFSD: nfsd4_lock: conflicting lock found!\n");
> -		nfs4_set_lock_denied(conflock, &lock->lk_denied);
> +
> +		/* try again if conflict with courtesy client  */
> +		if (nfs4_set_lock_denied(conflock, &lock->lk_denied) == -EAGAIN && !retried) {
> +			retried = true;
> +			goto again;
> +		}
>   		break;
>   	case -EDEADLK:
>   		status = nfserr_deadlock;
> @@ -6962,6 +7039,8 @@ nfsd4_lockt(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>   	struct nfs4_lockowner *lo = NULL;
>   	__be32 status;
>   	struct nfsd_net *nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
> +	bool retried = false;
> +	int ret;
>   
>   	if (locks_in_grace(SVC_NET(rqstp)))
>   		return nfserr_grace;
> @@ -7010,14 +7089,19 @@ nfsd4_lockt(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>   	file_lock->fl_end = last_byte_offset(lockt->lt_offset, lockt->lt_length);
>   
>   	nfs4_transform_lock_offset(file_lock);
> -
> +again:
>   	status = nfsd_test_lock(rqstp, &cstate->current_fh, file_lock);
>   	if (status)
>   		goto out;
>   
>   	if (file_lock->fl_type != F_UNLCK) {
>   		status = nfserr_denied;
> -		nfs4_set_lock_denied(file_lock, &lockt->lt_denied);
> +		ret = nfs4_set_lock_denied(file_lock, &lockt->lt_denied);
> +		if (ret == -EAGAIN && !retried) {
> +			retried = true;
> +			fh_clear_wcc(&cstate->current_fh);
> +			goto again;
> +		}
>   	}
>   out:
>   	if (lo)
> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
> index e73bdbb1634a..bdc3605e3722 100644
> --- a/fs/nfsd/state.h
> +++ b/fs/nfsd/state.h
> @@ -345,6 +345,7 @@ struct nfs4_client {
>   #define NFSD4_CLIENT_UPCALL_LOCK	(5)	/* upcall serialization */
>   #define NFSD4_CLIENT_CB_FLAG_MASK	(1 << NFSD4_CLIENT_CB_UPDATE | \
>   					 1 << NFSD4_CLIENT_CB_KILL)
> +#define NFSD4_CLIENT_COURTESY		(6)	/* be nice to expired client */
>   	unsigned long		cl_flags;
>   	const struct cred	*cl_cb_cred;
>   	struct rpc_clnt		*cl_cb_client;
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index e91d51ea028b..2f0382f9d0ff 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -304,6 +304,7 @@ struct svc_rqst {
>   						 * net namespace
>   						 */
>   	void **			rq_lease_breaker; /* The v4 client breaking a lease */
> +	void			*rq_conflict_client;
>   };
>   
>   #define SVC_NET(rqst) (rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
J. Bruce Fields June 16, 2021, 4:02 p.m. UTC | #2
On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
> Currently an NFSv4 client must maintain its lease by using at least one
> of its state tokens or, if nothing else, by issuing a RENEW (4.0) or a
> singleton SEQUENCE (4.1) at least once during each lease period. If the
> client fails to renew the lease, for any reason, the Linux server expunges
> the state tokens immediately upon detecting the "failure to renew the
> lease" condition and begins returning NFS4ERR_EXPIRED if the client should
> reconnect and attempt to use the (now) expired state.
> 
> The default lease period for the Linux server is 90 seconds.  The typical
> client cuts that in half and will issue a lease renewing operation every
> 45 seconds. The 90-second lease period is very short considering the
> potential for moderately long-term network partitions.  A network partition
> refers to any loss of network connectivity between the NFS client and the
> NFS server, regardless of its root cause.  This includes NIC failures, NIC
> driver bugs, network misconfigurations & administrative errors, routers &
> switches crashing and/or having software updates applied, even down to
> cables being physically pulled.  In most cases, these network failures are
> transient, although the duration is unknown.
> 
> A server which does not immediately expunge the state on lease expiration
> is known as a Courteous Server.  A Courteous Server continues to recognize
> previously generated state tokens as valid until a conflict arises between
> the expired state and a request from another client, or until the server
> reboots.
> 
> The initial implementation of the Courteous Server will do the following:
> 
> . when the laundromat thread detects an expired client that still has
> established state on the Linux server, mark the client as a
> COURTESY_CLIENT and skip destroying the client and all its state;
> otherwise destroy the client as usual.
>
> . on detecting a conflict between an OPEN request and a COURTESY_CLIENT,
> destroy the expired client and all its state, skip the delegation recall,
> then allow the conflicting request to succeed.
>
> . on detecting a conflict between a LOCK/LOCKT request and a
> COURTESY_CLIENT, destroy the expired client and all its state, then allow
> the conflicting request to succeed.
> 
> To be done:
> 
> . fix a problem with 4.1 where the Linux server keeps returning
> SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply after the
> expired client reconnects, causing the client to keep sending BCTS
> requests to the server.

Hm, any progress working out what's causing that?

> . handle OPEN conflict with share reservations.

Sounds doable.

> . instead of destroying the client and all its state on conflict, only
> destroy the state that conflicts with the current request.

The other todos I think have to be done before we merge, but this one I
think can wait.

> . destroy the COURTESY_CLIENT either after a fixed period of time, to
> release resources, or in reaction to memory pressure.

I think we need something here, but it can be pretty simple.

> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
> ---
>  fs/nfsd/nfs4state.c        | 94 +++++++++++++++++++++++++++++++++++++++++++---
>  fs/nfsd/state.h            |  1 +
>  include/linux/sunrpc/svc.h |  1 +
>  3 files changed, 91 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index b517a8794400..969995872752 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -171,6 +171,7 @@ renew_client_locked(struct nfs4_client *clp)
>  
>  	list_move_tail(&clp->cl_lru, &nn->client_lru);
>  	clp->cl_time = ktime_get_boottime_seconds();
> +	clear_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags);
>  }
>  
>  static void put_client_renew_locked(struct nfs4_client *clp)
> @@ -4610,6 +4611,38 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp)
>  	nfsd4_run_cb(&dp->dl_recall);
>  }
>  
> +/*
> + * Set rq_conflict_client if the conflicting client has expired
> + * and return true; otherwise return false.
> + */
> +static bool
> +nfsd_set_conflict_client(struct nfs4_delegation *dp)
> +{
> +	struct svc_rqst *rqst;
> +	struct nfs4_client *clp;
> +	struct nfsd_net *nn;
> +	bool ret;
> +
> +	if (!i_am_nfsd())
> +		return false;
> +	rqst = kthread_data(current);
> +	if (rqst->rq_prog != NFS_PROGRAM || rqst->rq_vers < 4)
> +		return false;

This says we do nothing if the lock request is not coming from nfsd and
v4.  That doesn't sound right.

For example, if the conflicting lock is coming from a local process (not
from an NFS client at all), we still need to revoke the courtesy state
so that process can make progress.

--b.

> +	rqst->rq_conflict_client = NULL;
> +	clp = dp->dl_recall.cb_clp;
> +	nn = net_generic(clp->net, nfsd_net_id);
> +	spin_lock(&nn->client_lock);
> +
> +	if (test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags) &&
> +				!mark_client_expired_locked(clp)) {
> +		rqst->rq_conflict_client = clp;
> +		ret = true;
> +	} else
> +		ret = false;
> +	spin_unlock(&nn->client_lock);
> +	return ret;
> +}
> +
>  /* Called from break_lease() with i_lock held. */
>  static bool
>  nfsd_break_deleg_cb(struct file_lock *fl)
> @@ -4618,6 +4651,8 @@ nfsd_break_deleg_cb(struct file_lock *fl)
>  	struct nfs4_delegation *dp = (struct nfs4_delegation *)fl->fl_owner;
>  	struct nfs4_file *fp = dp->dl_stid.sc_file;
>  
> +	if (nfsd_set_conflict_client(dp))
> +		return false;
>  	trace_nfsd_deleg_break(&dp->dl_stid.sc_stateid);
>  
>  	/*
> @@ -5237,6 +5272,22 @@ static void nfsd4_deleg_xgrade_none_ext(struct nfsd4_open *open,
>  	 */
>  }
>  
> +static bool
> +nfs4_destroy_courtesy_client(struct nfs4_client *clp)
> +{
> +	struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);
> +
> +	spin_lock(&nn->client_lock);
> +	if (!test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags) ||
> +			mark_client_expired_locked(clp)) {
> +		spin_unlock(&nn->client_lock);
> +		return false;
> +	}
> +	spin_unlock(&nn->client_lock);
> +	expire_client(clp);
> +	return true;
> +}
> +
>  __be32
>  nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nfsd4_open *open)
>  {
> @@ -5286,7 +5337,13 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
>  			goto out;
>  		}
>  	} else {
> +		rqstp->rq_conflict_client = NULL;
>  		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
> +		if (status == nfserr_jukebox && rqstp->rq_conflict_client) {
> +			if (nfs4_destroy_courtesy_client(rqstp->rq_conflict_client))
> +				status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
> +		}
> +
>  		if (status) {
>  			stp->st_stid.sc_type = NFS4_CLOSED_STID;
>  			release_open_stateid(stp);
> @@ -5472,6 +5529,8 @@ nfs4_laundromat(struct nfsd_net *nn)
>  	};
>  	struct nfs4_cpntf_state *cps;
>  	copy_stateid_t *cps_t;
> +	struct nfs4_stid *stid;
> +	int id = 0;
>  	int i;
>  
>  	if (clients_still_reclaiming(nn)) {
> @@ -5495,6 +5554,15 @@ nfs4_laundromat(struct nfsd_net *nn)
>  		clp = list_entry(pos, struct nfs4_client, cl_lru);
>  		if (!state_expired(&lt, clp->cl_time))
>  			break;
> +
> +		spin_lock(&clp->cl_lock);
> +		stid = idr_get_next(&clp->cl_stateids, &id);
> +		spin_unlock(&clp->cl_lock);
> +		if (stid) {
> +			/* client still has states */
> +			set_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags);
> +			continue;
> +		}
>  		if (mark_client_expired_locked(clp)) {
>  			trace_nfsd_clid_expired(&clp->cl_clientid);
>  			continue;
> @@ -6440,7 +6508,7 @@ static const struct lock_manager_operations nfsd_posix_mng_ops  = {
>  	.lm_put_owner = nfsd4_fl_put_owner,
>  };
>  
> -static inline void
> +static inline int
>  nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
>  {
>  	struct nfs4_lockowner *lo;
> @@ -6453,6 +6521,8 @@ nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
>  			/* We just don't care that much */
>  			goto nevermind;
>  		deny->ld_clientid = lo->lo_owner.so_client->cl_clientid;
> +		if (nfs4_destroy_courtesy_client(lo->lo_owner.so_client))
> +			return -EAGAIN;
>  	} else {
>  nevermind:
>  		deny->ld_owner.len = 0;
> @@ -6467,6 +6537,7 @@ nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
>  	deny->ld_type = NFS4_READ_LT;
>  	if (fl->fl_type != F_RDLCK)
>  		deny->ld_type = NFS4_WRITE_LT;
> +	return 0;
>  }
>  
>  static struct nfs4_lockowner *
> @@ -6734,6 +6805,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	unsigned int fl_flags = FL_POSIX;
>  	struct net *net = SVC_NET(rqstp);
>  	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
> +	bool retried = false;
>  
>  	dprintk("NFSD: nfsd4_lock: start=%Ld length=%Ld\n",
>  		(long long) lock->lk_offset,
> @@ -6860,7 +6932,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  		list_add_tail(&nbl->nbl_lru, &nn->blocked_locks_lru);
>  		spin_unlock(&nn->blocked_locks_lock);
>  	}
> -
> +again:
>  	err = vfs_lock_file(nf->nf_file, F_SETLK, file_lock, conflock);
>  	switch (err) {
>  	case 0: /* success! */
> @@ -6875,7 +6947,12 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	case -EAGAIN:		/* conflock holds conflicting lock */
>  		status = nfserr_denied;
>  		dprintk("NFSD: nfsd4_lock: conflicting lock found!\n");
> -		nfs4_set_lock_denied(conflock, &lock->lk_denied);
> +
> +		/* try again if conflict with courtesy client  */
> +		if (nfs4_set_lock_denied(conflock, &lock->lk_denied) == -EAGAIN && !retried) {
> +			retried = true;
> +			goto again;
> +		}
>  		break;
>  	case -EDEADLK:
>  		status = nfserr_deadlock;
> @@ -6962,6 +7039,8 @@ nfsd4_lockt(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	struct nfs4_lockowner *lo = NULL;
>  	__be32 status;
>  	struct nfsd_net *nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
> +	bool retried = false;
> +	int ret;
>  
>  	if (locks_in_grace(SVC_NET(rqstp)))
>  		return nfserr_grace;
> @@ -7010,14 +7089,19 @@ nfsd4_lockt(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	file_lock->fl_end = last_byte_offset(lockt->lt_offset, lockt->lt_length);
>  
>  	nfs4_transform_lock_offset(file_lock);
> -
> +again:
>  	status = nfsd_test_lock(rqstp, &cstate->current_fh, file_lock);
>  	if (status)
>  		goto out;
>  
>  	if (file_lock->fl_type != F_UNLCK) {
>  		status = nfserr_denied;
> -		nfs4_set_lock_denied(file_lock, &lockt->lt_denied);
> +		ret = nfs4_set_lock_denied(file_lock, &lockt->lt_denied);
> +		if (ret == -EAGAIN && !retried) {
> +			retried = true;
> +			fh_clear_wcc(&cstate->current_fh);
> +			goto again;
> +		}
>  	}
>  out:
>  	if (lo)
> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
> index e73bdbb1634a..bdc3605e3722 100644
> --- a/fs/nfsd/state.h
> +++ b/fs/nfsd/state.h
> @@ -345,6 +345,7 @@ struct nfs4_client {
>  #define NFSD4_CLIENT_UPCALL_LOCK	(5)	/* upcall serialization */
>  #define NFSD4_CLIENT_CB_FLAG_MASK	(1 << NFSD4_CLIENT_CB_UPDATE | \
>  					 1 << NFSD4_CLIENT_CB_KILL)
> +#define NFSD4_CLIENT_COURTESY		(6)	/* be nice to expired client */
>  	unsigned long		cl_flags;
>  	const struct cred	*cl_cb_cred;
>  	struct rpc_clnt		*cl_cb_client;
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index e91d51ea028b..2f0382f9d0ff 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -304,6 +304,7 @@ struct svc_rqst {
>  						 * net namespace
>  						 */
>  	void **			rq_lease_breaker; /* The v4 client breaking a lease */
> +	void			*rq_conflict_client;
>  };
>  
>  #define SVC_NET(rqst) (rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
> -- 
> 2.9.5
Chuck Lever III June 16, 2021, 4:32 p.m. UTC | #3
> On Jun 16, 2021, at 12:02 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> 
> On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
>> . instead of destroying the client and all its state on conflict, only
>> destroy the state that conflicts with the current request.
> 
> The other todos I think have to be done before we merge, but this one I
> think can wait.

I agree on both points: this one can wait, but the others
should be done before merge.


>> . destroy the COURTESY_CLIENT either after a fixed period of time, to
>> release resources, or in reaction to memory pressure.
> 
> I think we need something here, but it can be pretty simple.

We should work out a policy now.

A lower bound is good to have. Keep courtesy clients at least
this long. Average network partition length times two as a shot
in the dark. Or it could be N times the lease expiry time.

An upper bound is harder to guess at. Obviously these things
will go away when the server reboots. The laundromat could
handle this sooner. However using a shrinker might be nicer and
more Linux-y, keeping the clients as long as practical, without
the need for adding another administrative setting.
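The lower-bound/upper-bound policy being discussed could be expressed as
a simple predicate. The sketch below is illustrative user-space C, not
the kernel shrinker API; the two-lease lower bound and one-day upper
bound are arbitrary placeholders, not values anyone has proposed here.

```c
#include <stdbool.h>
#include <time.h>

/* Keep a courtesy client at least this many lease periods past expiry;
 * before that it is never reclaimed. (Placeholder value.) */
#define COURTESY_MIN_LEASES 2

/* Hard upper bound after which a courtesy client is reclaimed even
 * without memory pressure: one day. (Placeholder value.) */
#define COURTESY_MAX_SECS (24 * 60 * 60)

/* Decide whether a courtesy client may be destroyed now. Within the
 * lower bound it is always kept; past the lower bound it is reclaimed
 * under memory pressure, or unconditionally past the upper bound. */
static bool courtesy_reclaimable(time_t expired_at, time_t now,
				 time_t lease, bool memory_pressure)
{
	time_t held = now - expired_at;

	if (held < COURTESY_MIN_LEASES * lease)
		return false;		/* inside the lower bound: keep it */
	return memory_pressure || held > COURTESY_MAX_SECS;
}
```

A shrinker-based version would evaluate a predicate like this from its
scan callback, releasing the oldest courtesy clients first; a
laundromat-based version would apply it on each pass.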


--
Chuck Lever
Dai Ngo June 16, 2021, 7:17 p.m. UTC | #4
On 6/16/21 9:02 AM, J. Bruce Fields wrote:
> On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
>> Currently an NFSv4 client must maintain its lease by using at least one
>> of its state tokens or, if nothing else, by issuing a RENEW (4.0) or a
>> singleton SEQUENCE (4.1) at least once during each lease period. If the
>> client fails to renew the lease, for any reason, the Linux server expunges
>> the state tokens immediately upon detecting the "failure to renew the
>> lease" condition and begins returning NFS4ERR_EXPIRED if the client should
>> reconnect and attempt to use the (now) expired state.
>>
>> The default lease period for the Linux server is 90 seconds.  The typical
>> client cuts that in half and will issue a lease renewing operation every
>> 45 seconds. The 90-second lease period is very short considering the
>> potential for moderately long-term network partitions.  A network partition
>> refers to any loss of network connectivity between the NFS client and the
>> NFS server, regardless of its root cause.  This includes NIC failures, NIC
>> driver bugs, network misconfigurations & administrative errors, routers &
>> switches crashing and/or having software updates applied, even down to
>> cables being physically pulled.  In most cases, these network failures are
>> transient, although the duration is unknown.
>>
>> A server which does not immediately expunge the state on lease expiration
>> is known as a Courteous Server.  A Courteous Server continues to recognize
>> previously generated state tokens as valid until a conflict arises between
>> the expired state and a request from another client, or until the server
>> reboots.
>>
>> The initial implementation of the Courteous Server will do the following:
>>
>> . when the laundromat thread detects an expired client that still has
>> established state on the Linux server, mark the client as a
>> COURTESY_CLIENT and skip destroying the client and all its state;
>> otherwise destroy the client as usual.
>>
>> . on detecting a conflict between an OPEN request and a COURTESY_CLIENT,
>> destroy the expired client and all its state, skip the delegation recall,
>> then allow the conflicting request to succeed.
>>
>> . on detecting a conflict between a LOCK/LOCKT request and a
>> COURTESY_CLIENT, destroy the expired client and all its state, then allow
>> the conflicting request to succeed.
>>
>> To be done:
>>
>> . fix a problem with 4.1 where the Linux server keeps returning
>> SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply after the
>> expired client reconnects, causing the client to keep sending BCTS
>> requests to the server.
> Hm, any progress working out what's causing that?

I will fix this in the v2 patch.

>
>> . handle OPEN conflict with share reservations.
> Sounds doable.

Yes. But I don't know a way to force the Linux client to specify the
deny mode on OPEN. I may have to test this with an SMB client: an expired
NFS client should not prevent an SMB client from opening the file with a
deny mode.

>
>> . instead of destroying the client and all its state on conflict, only
>> destroy the state that conflicts with the current request.
> The other todos I think have to be done before we merge, but this one I
> think can wait.
>
>> . destroy the COURTESY_CLIENT either after a fixed period of time, to
>> release resources, or in reaction to memory pressure.
> I think we need something here, but it can be pretty simple.

ok.

>
>> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
>> ---
>>   fs/nfsd/nfs4state.c        | 94 +++++++++++++++++++++++++++++++++++++++++++---
>>   fs/nfsd/state.h            |  1 +
>>   include/linux/sunrpc/svc.h |  1 +
>>   3 files changed, 91 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>> index b517a8794400..969995872752 100644
>> --- a/fs/nfsd/nfs4state.c
>> +++ b/fs/nfsd/nfs4state.c
>> @@ -171,6 +171,7 @@ renew_client_locked(struct nfs4_client *clp)
>>   
>>   	list_move_tail(&clp->cl_lru, &nn->client_lru);
>>   	clp->cl_time = ktime_get_boottime_seconds();
>> +	clear_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags);
>>   }
>>   
>>   static void put_client_renew_locked(struct nfs4_client *clp)
>> @@ -4610,6 +4611,38 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp)
>>   	nfsd4_run_cb(&dp->dl_recall);
>>   }
>>   
>> +/*
>> + * Set rq_conflict_client if the conflicting client has expired
>> + * and return true; otherwise return false.
>> + */
>> +static bool
>> +nfsd_set_conflict_client(struct nfs4_delegation *dp)
>> +{
>> +	struct svc_rqst *rqst;
>> +	struct nfs4_client *clp;
>> +	struct nfsd_net *nn;
>> +	bool ret;
>> +
>> +	if (!i_am_nfsd())
>> +		return false;
>> +	rqst = kthread_data(current);
>> +	if (rqst->rq_prog != NFS_PROGRAM || rqst->rq_vers < 4)
>> +		return false;
> This says we do nothing if the lock request is not coming from nfsd and
> v4.  That doesn't sound right.
>
> For example, if the conflicting lock is coming from a local process (not
> from an NFS client at all), we still need to revoke the courtesy state
> so that process can make progress.

The reason it checks for an NFS request is that it uses rq_conflict_client
to pass the expired client back to the upper layer, which then does the
expire_client(). If this is not an NFS request then kthread_data() might
not return the svc_rqst. In that case the code would try to recall the
delegation from the expired client and the recall would eventually time out.

I will rework this code.

Thanks,
-Dai
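
The pass-back scheme described above can be modeled in a few lines of
userspace C. All names here are hypothetical stand-ins, not the kernel
API: the lease-break callback records the conflicting courtesy client in
per-request context instead of issuing a recall, and the open path then
expires that client and retries:

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace model of the pass-back scheme (hypothetical names). */
struct model_client { bool courtesy; bool destroyed; };
struct model_rqst   { struct model_client *conflict_client; };

/*
 * Stands in for nfsd_break_deleg_cb(): when the conflicting state
 * belongs to a courtesy client, stash it in the request context and
 * return false to skip the delegation recall.
 */
static bool model_break_cb(struct model_rqst *rqst, struct model_client *clp)
{
	if (clp->courtesy) {
		rqst->conflict_client = clp;
		return false;	/* skip recall; caller will expire + retry */
	}
	return true;		/* normal recall path */
}

/*
 * Stands in for the retry in nfsd4_process_open2(): 0 on success,
 * -1 when a recall is in progress (would be NFS4ERR_DELAY).
 */
static int model_open(struct model_rqst *rqst, struct model_client *holder)
{
	rqst->conflict_client = NULL;
	if (holder && !holder->destroyed) {
		if (model_break_cb(rqst, holder))
			return -1;	/* recall issued */
	}
	if (rqst->conflict_client) {
		/* courtesy conflict: expire the stale client, then retry */
		rqst->conflict_client->destroyed = true;
		return model_open(rqst, holder);
	}
	return 0;	/* open granted */
}
```

The point Bruce raises still applies to this sketch: it only works when
the per-request context exists, i.e. when the conflict comes in through
nfsd, which is why the non-NFS (local process) case needs a different
mechanism.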

>> +	rqst->rq_conflict_client = NULL;
>> +	clp = dp->dl_recall.cb_clp;
>> +	nn = net_generic(clp->net, nfsd_net_id);
>> +	spin_lock(&nn->client_lock);
>> +
>> +	if (test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags) &&
>> +				!mark_client_expired_locked(clp)) {
>> +		rqst->rq_conflict_client = clp;
>> +		ret = true;
>> +	} else
>> +		ret = false;
>> +	spin_unlock(&nn->client_lock);
>> +	return ret;
>> +}
>> +
>>   /* Called from break_lease() with i_lock held. */
>>   static bool
>>   nfsd_break_deleg_cb(struct file_lock *fl)
>> @@ -4618,6 +4651,8 @@ nfsd_break_deleg_cb(struct file_lock *fl)
>>   	struct nfs4_delegation *dp = (struct nfs4_delegation *)fl->fl_owner;
>>   	struct nfs4_file *fp = dp->dl_stid.sc_file;
>>   
>> +	if (nfsd_set_conflict_client(dp))
>> +		return false;
>>   	trace_nfsd_deleg_break(&dp->dl_stid.sc_stateid);
>>   
>>   	/*
>> @@ -5237,6 +5272,22 @@ static void nfsd4_deleg_xgrade_none_ext(struct nfsd4_open *open,
>>   	 */
>>   }
>>   
>> +static bool
>> +nfs4_destroy_courtesy_client(struct nfs4_client *clp)
>> +{
>> +	struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);
>> +
>> +	spin_lock(&nn->client_lock);
>> +	if (!test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags) ||
>> +			mark_client_expired_locked(clp)) {
>> +		spin_unlock(&nn->client_lock);
>> +		return false;
>> +	}
>> +	spin_unlock(&nn->client_lock);
>> +	expire_client(clp);
>> +	return true;
>> +}
>> +
>>   __be32
>>   nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nfsd4_open *open)
>>   {
>> @@ -5286,7 +5337,13 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
>>   			goto out;
>>   		}
>>   	} else {
>> +		rqstp->rq_conflict_client = NULL;
>>   		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
>> +		if (status == nfserr_jukebox && rqstp->rq_conflict_client) {
>> +			if (nfs4_destroy_courtesy_client(rqstp->rq_conflict_client))
>> +				status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
>> +		}
>> +
>>   		if (status) {
>>   			stp->st_stid.sc_type = NFS4_CLOSED_STID;
>>   			release_open_stateid(stp);
>> @@ -5472,6 +5529,8 @@ nfs4_laundromat(struct nfsd_net *nn)
>>   	};
>>   	struct nfs4_cpntf_state *cps;
>>   	copy_stateid_t *cps_t;
>> +	struct nfs4_stid *stid;
>> +	int id = 0;
>>   	int i;
>>   
>>   	if (clients_still_reclaiming(nn)) {
>> @@ -5495,6 +5554,15 @@ nfs4_laundromat(struct nfsd_net *nn)
>>   		clp = list_entry(pos, struct nfs4_client, cl_lru);
>>   		if (!state_expired(&lt, clp->cl_time))
>>   			break;
>> +
>> +		spin_lock(&clp->cl_lock);
>> +		stid = idr_get_next(&clp->cl_stateids, &id);
>> +		spin_unlock(&clp->cl_lock);
>> +		if (stid) {
>> +			/* client still has states */
>> +			set_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags);
>> +			continue;
>> +		}
>>   		if (mark_client_expired_locked(clp)) {
>>   			trace_nfsd_clid_expired(&clp->cl_clientid);
>>   			continue;
>> @@ -6440,7 +6508,7 @@ static const struct lock_manager_operations nfsd_posix_mng_ops  = {
>>   	.lm_put_owner = nfsd4_fl_put_owner,
>>   };
>>   
>> -static inline void
>> +static inline int
>>   nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
>>   {
>>   	struct nfs4_lockowner *lo;
>> @@ -6453,6 +6521,8 @@ nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
>>   			/* We just don't care that much */
>>   			goto nevermind;
>>   		deny->ld_clientid = lo->lo_owner.so_client->cl_clientid;
>> +		if (nfs4_destroy_courtesy_client(lo->lo_owner.so_client))
>> +			return -EAGAIN;
>>   	} else {
>>   nevermind:
>>   		deny->ld_owner.len = 0;
>> @@ -6467,6 +6537,7 @@ nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
>>   	deny->ld_type = NFS4_READ_LT;
>>   	if (fl->fl_type != F_RDLCK)
>>   		deny->ld_type = NFS4_WRITE_LT;
>> +	return 0;
>>   }
>>   
>>   static struct nfs4_lockowner *
>> @@ -6734,6 +6805,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>>   	unsigned int fl_flags = FL_POSIX;
>>   	struct net *net = SVC_NET(rqstp);
>>   	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
>> +	bool retried = false;
>>   
>>   	dprintk("NFSD: nfsd4_lock: start=%Ld length=%Ld\n",
>>   		(long long) lock->lk_offset,
>> @@ -6860,7 +6932,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>>   		list_add_tail(&nbl->nbl_lru, &nn->blocked_locks_lru);
>>   		spin_unlock(&nn->blocked_locks_lock);
>>   	}
>> -
>> +again:
>>   	err = vfs_lock_file(nf->nf_file, F_SETLK, file_lock, conflock);
>>   	switch (err) {
>>   	case 0: /* success! */
>> @@ -6875,7 +6947,12 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>>   	case -EAGAIN:		/* conflock holds conflicting lock */
>>   		status = nfserr_denied;
>>   		dprintk("NFSD: nfsd4_lock: conflicting lock found!\n");
>> -		nfs4_set_lock_denied(conflock, &lock->lk_denied);
>> +
>> +		/* try again if conflict with courtesy client  */
>> +		if (nfs4_set_lock_denied(conflock, &lock->lk_denied) == -EAGAIN && !retried) {
>> +			retried = true;
>> +			goto again;
>> +		}
>>   		break;
>>   	case -EDEADLK:
>>   		status = nfserr_deadlock;
>> @@ -6962,6 +7039,8 @@ nfsd4_lockt(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>>   	struct nfs4_lockowner *lo = NULL;
>>   	__be32 status;
>>   	struct nfsd_net *nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
>> +	bool retried = false;
>> +	int ret;
>>   
>>   	if (locks_in_grace(SVC_NET(rqstp)))
>>   		return nfserr_grace;
>> @@ -7010,14 +7089,19 @@ nfsd4_lockt(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>>   	file_lock->fl_end = last_byte_offset(lockt->lt_offset, lockt->lt_length);
>>   
>>   	nfs4_transform_lock_offset(file_lock);
>> -
>> +again:
>>   	status = nfsd_test_lock(rqstp, &cstate->current_fh, file_lock);
>>   	if (status)
>>   		goto out;
>>   
>>   	if (file_lock->fl_type != F_UNLCK) {
>>   		status = nfserr_denied;
>> -		nfs4_set_lock_denied(file_lock, &lockt->lt_denied);
>> +		ret = nfs4_set_lock_denied(file_lock, &lockt->lt_denied);
>> +		if (ret == -EAGAIN && !retried) {
>> +			retried = true;
>> +			fh_clear_wcc(&cstate->current_fh);
>> +			goto again;
>> +		}
>>   	}
>>   out:
>>   	if (lo)
>> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
>> index e73bdbb1634a..bdc3605e3722 100644
>> --- a/fs/nfsd/state.h
>> +++ b/fs/nfsd/state.h
>> @@ -345,6 +345,7 @@ struct nfs4_client {
>>   #define NFSD4_CLIENT_UPCALL_LOCK	(5)	/* upcall serialization */
>>   #define NFSD4_CLIENT_CB_FLAG_MASK	(1 << NFSD4_CLIENT_CB_UPDATE | \
>>   					 1 << NFSD4_CLIENT_CB_KILL)
>> +#define NFSD4_CLIENT_COURTESY		(6)	/* be nice to expired client */
>>   	unsigned long		cl_flags;
>>   	const struct cred	*cl_cb_cred;
>>   	struct rpc_clnt		*cl_cb_client;
>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>> index e91d51ea028b..2f0382f9d0ff 100644
>> --- a/include/linux/sunrpc/svc.h
>> +++ b/include/linux/sunrpc/svc.h
>> @@ -304,6 +304,7 @@ struct svc_rqst {
>>   						 * net namespace
>>   						 */
>>   	void **			rq_lease_breaker; /* The v4 client breaking a lease */
>> +	void			*rq_conflict_client;
>>   };
>>   
>>   #define SVC_NET(rqst) (rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
>> -- 
>> 2.9.5
Calum Mackay June 16, 2021, 7:19 p.m. UTC | #5
On 16/06/2021 8:17 pm, dai.ngo@oracle.com wrote:
> 
> On 6/16/21 9:02 AM, J. Bruce Fields wrote:
>> On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
>>> Currently an NFSv4 client must maintain its lease by using the at least
>>> one of the state tokens or if nothing else, by issuing a RENEW (4.0), or
>>> a singleton SEQUENCE (4.1) at least once during each lease period. If 
>>> the
>>> client fails to renew the lease, for any reason, the Linux server 
>>> expunges
>>> the state tokens immediately upon detection of the "failure to renew the
>>> lease" condition and begins returning NFS4ERR_EXPIRED if the client 
>>> should
>>> reconnect and attempt to use the (now) expired state.
>>>
>>> The default lease period for the Linux server is 90 seconds.  The 
>>> typical
>>> client cuts that in half and will issue a lease renewing operation every
>>> 45 seconds. The 90 second lease period is very short considering the
>>> potential for moderately long term network partitions.  A network 
>>> partition
>>> refers to any loss of network connectivity between the NFS client and 
>>> the
>>> NFS server, regardless of its root cause.  This includes NIC 
>>> failures, NIC
>>> driver bugs, network misconfigurations & administrative errors, 
>>> routers &
>>> switches crashing and/or having software updates applied, even down to
>>> cables being physically pulled.  In most cases, these network 
>>> failures are
>>> transient, although the duration is unknown.
>>>
>>> A server which does not immediately expunge the state on lease 
>>> expiration
>>> is known as a Courteous Server.  A Courteous Server continues to 
>>> recognize
>>> previously generated state tokens as valid until conflict arises between
>>> the expired state and the requests from another client, or the server 
>>> reboots.
>>>
>>> The initial implementation of the Courteous Server will do the 
>>> following:
>>>
>>> . when the laundromat thread detects an expired client and if that 
>>> client
>>> still has established states on the Linux server then marks the 
>>> client as a
>>> COURTESY_CLIENT and skips destroying the client and all its states,
>>> otherwise destroy the client as usual.
>>>
>>> . detects conflict of OPEN request with a COURTESY_CLIENT, destroys the
>>> expired client and all its states, skips the delegation recall then 
>>> allows
>>> the conflicting request to succeed.
>>>
>>> . detects conflict of LOCK/LOCKT request with a COURTESY_CLIENT, 
>>> destroys
>>> the expired client and all its states then allows the conflicting 
>>> request
>>> to succeed.
>>>
>>> To be done:
>>>
>>> . fix a problem with 4.1 where the Linux server keeps returning
>>> SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply, after the 
>>> expired
>>> client re-connects, causing the client to keep sending BCTS requests 
>>> to server.
>> Hm, any progress working out what's causing that?
> 
> I will fix this in v2 patch.
> 
>>
>>> . handle OPEN conflict with share reservations.
>> Sounds doable.
> 
> Yes. But I don't know a way to force the Linux client to specify the
> deny mode on OPEN.

We could do this with pynfs, perhaps, for testing?

cheers,
c.



>  I may have to test this with an SMB client: an expired
> NFS client should not prevent an SMB client from opening the file with
> a deny mode.
> 
>>
>>> . instead of destroying the client and all its state on conflict, only
>>> destroy the state that conflicts with the current request.
>> The other todos I think have to be done before we merge, but this one I
>> think can wait.
>>
>>> . destroy the COURTESY_CLIENT either after a fixed period of time to 
>>> release
>>> resources or as reacting to memory pressure.
>> I think we need something here, but it can be pretty simple.
> 
> ok.
> 
>>
Dai Ngo June 16, 2021, 7:25 p.m. UTC | #6
On 6/16/21 9:32 AM, Chuck Lever III wrote:
>> On Jun 16, 2021, at 12:02 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>
>> On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
>>> . instead of destroying the client and all its state on conflict, only destroy
>>> the state that conflicts with the current request.
>> The other todos I think have to be done before we merge, but this one I
>> think can wait.
> I agree on both points: this one can wait, but the others
> should be done before merge.

yes, will do.

>
>
>>> . destroy the COURTESY_CLIENT either after a fixed period of time to release
>>> resources or as reacting to memory pressure.
>> I think we need something here, but it can be pretty simple.
> We should work out a policy now.
>
> A lower bound is good to have. Keep courtesy clients at least
> this long. Average network partition length times two as a shot
> in the dark. Or it could be N times the lease expiry time.
>
> An upper bound is harder to guess at. Obviously these things
> will go away when the server reboots. The laundromat could
> handle this sooner. However using a shrinker might be nicer and
> more Linux-y, keeping the clients as long as practical, without
> the need for adding another administrative setting.

Can we start out with a simple 12 or 24 hours to accommodate long
network outages for this phase?

Thanks,
-Dai
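
A fixed-window policy like the one floated here is essentially a
one-liner; a sketch of the laundromat check (the 24-hour constant is
just the value suggested above, not a decided default):

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * Hypothetical policy sketch: a courtesy client is kept for a fixed
 * window past its last lease renewal (cl_time), then becomes eligible
 * for destruction by the laundromat.
 */
#define COURTESY_CLIENT_TIMEOUT	(24 * 60 * 60)	/* seconds */

static bool courtesy_client_expired(int64_t now, int64_t cl_time)
{
	return now - cl_time > COURTESY_CLIENT_TIMEOUT;
}
```

A shrinker, as Chuck suggests, could later replace or supplement the
upper bound without changing this laundromat-side check.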

>
>
> --
> Chuck Lever
>
>
>
Dai Ngo June 16, 2021, 7:27 p.m. UTC | #7
On 6/16/21 12:19 PM, Calum Mackay wrote:
> On 16/06/2021 8:17 pm, dai.ngo@oracle.com wrote:
>>
>> On 6/16/21 9:02 AM, J. Bruce Fields wrote:
>>> On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
>>> Hm, any progress working out what's causing that?
>>
>> I will fix this in v2 patch.
>>
>>>
>>>> . handle OPEN conflict with share reservations.
>>> Sounds doable.
>>
>> Yes. But I don't know a way to force the Linux client to specify the
>> deny mode on OPEN.
>
> We could do this with pynfs, perhaps, for testing?

Ah, you're right. I think we can; I might need your help with this.

Thanks,
-Dai

>
> cheers,
> c.
>
>
>
>  I may have to test this with SMB client: expired
>> NFS client should not prevent SMB client from open the file with
>> deny mode.
>>
>>>
>>>> . instead of destroy the client anf all its state on conflict, only 
>>>> destroy
>>>> the state that is conflicted with the current request.
>>> The other todos I think have to be done before we merge, but this one I
>>> think can wait.
>>>
>>>> . destroy the COURTESY_CLIENT either after a fixed period of time 
>>>> to release
>>>> resources or as reacting to memory pressure.
>>> I think we need something here, but it can be pretty simple.
>>
>> ok.
>>
>>>
>>>> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
>>>> ---
>>>>   fs/nfsd/nfs4state.c        | 94 
>>>> +++++++++++++++++++++++++++++++++++++++++++---
>>>>   fs/nfsd/state.h            |  1 +
>>>>   include/linux/sunrpc/svc.h |  1 +
>>>>   3 files changed, 91 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>>>> index b517a8794400..969995872752 100644
>>>> --- a/fs/nfsd/nfs4state.c
>>>> +++ b/fs/nfsd/nfs4state.c
>>>> @@ -171,6 +171,7 @@ renew_client_locked(struct nfs4_client *clp)
>>>>       list_move_tail(&clp->cl_lru, &nn->client_lru);
>>>>       clp->cl_time = ktime_get_boottime_seconds();
>>>> +    clear_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags);
>>>>   }
>>>>   static void put_client_renew_locked(struct nfs4_client *clp)
>>>> @@ -4610,6 +4611,38 @@ static void nfsd_break_one_deleg(struct 
>>>> nfs4_delegation *dp)
>>>>       nfsd4_run_cb(&dp->dl_recall);
>>>>   }
>>>> +/*
>>>> + * Set rq_conflict_client if the conflict client have expired
>>>> + * and return true, otherwise return false.
>>>> + */
>>>> +static bool
>>>> +nfsd_set_conflict_client(struct nfs4_delegation *dp)
>>>> +{
>>>> +    struct svc_rqst *rqst;
>>>> +    struct nfs4_client *clp;
>>>> +    struct nfsd_net *nn;
>>>> +    bool ret;
>>>> +
>>>> +    if (!i_am_nfsd())
>>>> +        return false;
>>>> +    rqst = kthread_data(current);
>>>> +    if (rqst->rq_prog != NFS_PROGRAM || rqst->rq_vers < 4)
>>>> +        return false;
>>> This says we do nothing if the lock request is not coming from nfsd and
>>> v4.  That doesn't sound right.
>>>
>>> For example, if the conflicting lock is coming from a local process 
>>> (not
>>> from an NFS client at all), we still need to revoke the courtesy state
>>> so that process can make progress.
>>
>> The reason it checks for NFS request is because it uses 
>> rq_conflict_client
>> to pass the expired client back to the upper layer which then does the
>> expire_client. If this is not from a NFS request then kthread_data() 
>> might
>> not be the svc_rqst. In that case the code will try to recall the 
>> delegation
>> from the expired client and the recall will eventually timed out.
>>
>> I will rework this code.
>>
>> Thanks,
>> -Dai
>>
>>>> +    rqst->rq_conflict_client = NULL;
>>>> +    clp = dp->dl_recall.cb_clp;
>>>> +    nn = net_generic(clp->net, nfsd_net_id);
>>>> +    spin_lock(&nn->client_lock);
>>>> +
>>>> +    if (test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags) &&
>>>> +                !mark_client_expired_locked(clp)) {
>>>> +        rqst->rq_conflict_client = clp;
>>>> +        ret = true;
>>>> +    } else
>>>> +        ret = false;
>>>> +    spin_unlock(&nn->client_lock);
>>>> +    return ret;
>>>> +}
>>>> +
>>>>   /* Called from break_lease() with i_lock held. */
>>>>   static bool
>>>>   nfsd_break_deleg_cb(struct file_lock *fl)
>>>> @@ -4618,6 +4651,8 @@ nfsd_break_deleg_cb(struct file_lock *fl)
>>>>       struct nfs4_delegation *dp = (struct nfs4_delegation 
>>>> *)fl->fl_owner;
>>>>       struct nfs4_file *fp = dp->dl_stid.sc_file;
>>>> +    if (nfsd_set_conflict_client(dp))
>>>> +        return false;
>>>> trace_nfsd_deleg_break(&dp->dl_stid.sc_stateid);
>>>>       /*
>>>> @@ -5237,6 +5272,22 @@ static void 
>>>> nfsd4_deleg_xgrade_none_ext(struct nfsd4_open *open,
>>>>        */
>>>>   }
>>>> +static bool
>>>> +nfs4_destroy_courtesy_client(struct nfs4_client *clp)
>>>> +{
>>>> +    struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);
>>>> +
>>>> +    spin_lock(&nn->client_lock);
>>>> +    if (!test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags) ||
>>>> +            mark_client_expired_locked(clp)) {
>>>> +        spin_unlock(&nn->client_lock);
>>>> +        return false;
>>>> +    }
>>>> +    spin_unlock(&nn->client_lock);
>>>> +    expire_client(clp);
>>>> +    return true;
>>>> +}
>>>> +
>>>>   __be32
>>>>   nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nfsd4_open *open)
>>>>   {
>>>> @@ -5286,7 +5337,13 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
>>>>               goto out;
>>>>           }
>>>>       } else {
>>>> +        rqstp->rq_conflict_client = NULL;
>>>>           status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
>>>> +        if (status == nfserr_jukebox && rqstp->rq_conflict_client) {
>>>> +            if (nfs4_destroy_courtesy_client(rqstp->rq_conflict_client))
>>>> +                status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
>>>> +        }
>>>> +
>>>>           if (status) {
>>>>               stp->st_stid.sc_type = NFS4_CLOSED_STID;
>>>>               release_open_stateid(stp);
>>>> @@ -5472,6 +5529,8 @@ nfs4_laundromat(struct nfsd_net *nn)
>>>>       };
>>>>       struct nfs4_cpntf_state *cps;
>>>>       copy_stateid_t *cps_t;
>>>> +    struct nfs4_stid *stid;
>>>> +    int id = 0;
>>>>       int i;
>>>>       if (clients_still_reclaiming(nn)) {
>>>> @@ -5495,6 +5554,15 @@ nfs4_laundromat(struct nfsd_net *nn)
>>>>           clp = list_entry(pos, struct nfs4_client, cl_lru);
>>>>           if (!state_expired(&lt, clp->cl_time))
>>>>               break;
>>>> +
>>>> +        spin_lock(&clp->cl_lock);
>>>> +        stid = idr_get_next(&clp->cl_stateids, &id);
>>>> +        spin_unlock(&clp->cl_lock);
>>>> +        if (stid) {
>>>> +            /* client still has states */
>>>> +            set_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags);
>>>> +            continue;
>>>> +        }
>>>>           if (mark_client_expired_locked(clp)) {
>>>>               trace_nfsd_clid_expired(&clp->cl_clientid);
>>>>               continue;
>>>> @@ -6440,7 +6508,7 @@ static const struct lock_manager_operations nfsd_posix_mng_ops  = {
>>>>       .lm_put_owner = nfsd4_fl_put_owner,
>>>>   };
>>>> -static inline void
>>>> +static inline int
>>>>   nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
>>>>   {
>>>>       struct nfs4_lockowner *lo;
>>>> @@ -6453,6 +6521,8 @@ nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
>>>>               /* We just don't care that much */
>>>>               goto nevermind;
>>>>           deny->ld_clientid = lo->lo_owner.so_client->cl_clientid;
>>>> +        if (nfs4_destroy_courtesy_client(lo->lo_owner.so_client))
>>>> +            return -EAGAIN;
>>>>       } else {
>>>>   nevermind:
>>>>           deny->ld_owner.len = 0;
>>>> @@ -6467,6 +6537,7 @@ nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
>>>>       deny->ld_type = NFS4_READ_LT;
>>>>       if (fl->fl_type != F_RDLCK)
>>>>           deny->ld_type = NFS4_WRITE_LT;
>>>> +    return 0;
>>>>   }
>>>>   static struct nfs4_lockowner *
>>>> @@ -6734,6 +6805,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>>>>       unsigned int fl_flags = FL_POSIX;
>>>>       struct net *net = SVC_NET(rqstp);
>>>>       struct nfsd_net *nn = net_generic(net, nfsd_net_id);
>>>> +    bool retried = false;
>>>>       dprintk("NFSD: nfsd4_lock: start=%Ld length=%Ld\n",
>>>>           (long long) lock->lk_offset,
>>>> @@ -6860,7 +6932,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>>>>           list_add_tail(&nbl->nbl_lru, &nn->blocked_locks_lru);
>>>>           spin_unlock(&nn->blocked_locks_lock);
>>>>       }
>>>> -
>>>> +again:
>>>>       err = vfs_lock_file(nf->nf_file, F_SETLK, file_lock, conflock);
>>>>       switch (err) {
>>>>       case 0: /* success! */
>>>> @@ -6875,7 +6947,12 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>>>>       case -EAGAIN:        /* conflock holds conflicting lock */
>>>>           status = nfserr_denied;
>>>>           dprintk("NFSD: nfsd4_lock: conflicting lock found!\n");
>>>> -        nfs4_set_lock_denied(conflock, &lock->lk_denied);
>>>> +
>>>> +        /* try again if conflict with courtesy client  */
>>>> +        if (nfs4_set_lock_denied(conflock, &lock->lk_denied) == -EAGAIN && !retried) {
>>>> +            retried = true;
>>>> +            goto again;
>>>> +        }
>>>>           break;
>>>>       case -EDEADLK:
>>>>           status = nfserr_deadlock;
>>>> @@ -6962,6 +7039,8 @@ nfsd4_lockt(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>>>>       struct nfs4_lockowner *lo = NULL;
>>>>       __be32 status;
>>>>       struct nfsd_net *nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
>>>> +    bool retried = false;
>>>> +    int ret;
>>>>       if (locks_in_grace(SVC_NET(rqstp)))
>>>>           return nfserr_grace;
>>>> @@ -7010,14 +7089,19 @@ nfsd4_lockt(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>>>>       file_lock->fl_end = last_byte_offset(lockt->lt_offset, lockt->lt_length);
>>>>       nfs4_transform_lock_offset(file_lock);
>>>> -
>>>> +again:
>>>>       status = nfsd_test_lock(rqstp, &cstate->current_fh, file_lock);
>>>>       if (status)
>>>>           goto out;
>>>>       if (file_lock->fl_type != F_UNLCK) {
>>>>           status = nfserr_denied;
>>>> -        nfs4_set_lock_denied(file_lock, &lockt->lt_denied);
>>>> +        ret = nfs4_set_lock_denied(file_lock, &lockt->lt_denied);
>>>> +        if (ret == -EAGAIN && !retried) {
>>>> +            retried = true;
>>>> +            fh_clear_wcc(&cstate->current_fh);
>>>> +            goto again;
>>>> +        }
>>>>       }
>>>>   out:
>>>>       if (lo)
>>>> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
>>>> index e73bdbb1634a..bdc3605e3722 100644
>>>> --- a/fs/nfsd/state.h
>>>> +++ b/fs/nfsd/state.h
>>>> @@ -345,6 +345,7 @@ struct nfs4_client {
>>>>   #define NFSD4_CLIENT_UPCALL_LOCK    (5)    /* upcall serialization */
>>>>   #define NFSD4_CLIENT_CB_FLAG_MASK    (1 << NFSD4_CLIENT_CB_UPDATE | \
>>>>                        1 << NFSD4_CLIENT_CB_KILL)
>>>> +#define NFSD4_CLIENT_COURTESY        (6)    /* be nice to expired client */
>>>>       unsigned long        cl_flags;
>>>>       const struct cred    *cl_cb_cred;
>>>>       struct rpc_clnt        *cl_cb_client;
>>>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>>>> index e91d51ea028b..2f0382f9d0ff 100644
>>>> --- a/include/linux/sunrpc/svc.h
>>>> +++ b/include/linux/sunrpc/svc.h
>>>> @@ -304,6 +304,7 @@ struct svc_rqst {
>>>>                            * net namespace
>>>>                            */
>>>>       void **            rq_lease_breaker; /* The v4 client breaking a lease */
>>>> +    void            *rq_conflict_client;
>>>>   };
>>>>   #define SVC_NET(rqst) (rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
>>>> -- 
>>>> 2.9.5
>
Chuck Lever III June 16, 2021, 7:29 p.m. UTC | #8
> On Jun 16, 2021, at 3:25 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
> 
> On 6/16/21 9:32 AM, Chuck Lever III wrote:
>>> On Jun 16, 2021, at 12:02 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>> 
>>> On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
>>>> . instead of destroying the client and all its state on conflict, only destroy
>>>> the state that conflicts with the current request.
>>> The other todos I think have to be done before we merge, but this one I
>>> think can wait.
>> I agree on both points: this one can wait, but the others
>> should be done before merge.
> 
> yes, will do.
> 
>> 
>> 
>>>> . destroy the COURTESY_CLIENT either after a fixed period of time to release
>>>> resources or as reacting to memory pressure.
>>> I think we need something here, but it can be pretty simple.
>> We should work out a policy now.
>> 
>> A lower bound is good to have. Keep courtesy clients at least
>> this long. Average network partition length times two as a shot
>> in the dark. Or it could be N times the lease expiry time.
>> 
>> An upper bound is harder to guess at. Obviously these things
>> will go away when the server reboots. The laundromat could
>> handle this sooner. However using a shrinker might be nicer and
>> more Linux-y, keeping the clients as long as practical, without
>> the need for adding another administrative setting.
> 
> Can we start out with a simple 12 or 24 hours to accommodate long
> network outages for this phase?

Sure. Let's go with 24 hours.

Bill suggested adding a "clear_locks"-like mechanism that could be
used to throw out all courteous clients at once. Maybe another
phase 2 project!
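For reference, the agreed policy is small enough to sketch outside the
kernel. This is only an illustration of the comparison, not nfsd code: the
constant and helper names are made up, and in the server the check would
run in nfs4_laundromat() against clp->cl_time.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical cutoff: keep a courtesy client until its last
 * renewal is more than 24 hours old. */
#define NFSD_COURTESY_CLIENT_TIMEOUT	(24 * 60 * 60)	/* seconds */

static bool courtesy_client_expired(time_t cl_time, time_t now)
{
	/* cl_time is the client's last renewal time */
	return now - cl_time > NFSD_COURTESY_CLIENT_TIMEOUT;
}
```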


--
Chuck Lever
J. Bruce Fields June 16, 2021, 8:30 p.m. UTC | #9
On Wed, Jun 16, 2021 at 07:29:37PM +0000, Chuck Lever III wrote:
> 
> 
> > On Jun 16, 2021, at 3:25 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
> > 
> > On 6/16/21 9:32 AM, Chuck Lever III wrote:
> >>> On Jun 16, 2021, at 12:02 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> >>> 
> >>> On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
> >>>> . instead of destroying the client and all its state on conflict, only destroy
> >>>> the state that conflicts with the current request.
> >>> The other todos I think have to be done before we merge, but this one I
> >>> think can wait.
> >> I agree on both points: this one can wait, but the others
> >> should be done before merge.
> > 
> > yes, will do.
> > 
> >> 
> >> 
> >>>> . destroy the COURTESY_CLIENT either after a fixed period of time to release
> >>>> resources or as reacting to memory pressure.
> >>> I think we need something here, but it can be pretty simple.
> >> We should work out a policy now.
> >> 
> >> A lower bound is good to have. Keep courtesy clients at least
> >> this long. Average network partition length times two as a shot
> >> in the dark. Or it could be N times the lease expiry time.
> >> 
> >> An upper bound is harder to guess at. Obviously these things
> >> will go away when the server reboots. The laundromat could
> >> handle this sooner. However using a shrinker might be nicer and
> >> more Linux-y, keeping the clients as long as practical, without
> >> the need for adding another administrative setting.
> > 
> > Can we start out with a simple 12 or 24 hours to accommodate long
> > network outages for this phase?
> 
> Sure. Let's go with 24 hours.
> 
> Bill suggested adding a "clear_locks" like mechanism that could be
> used to throw out all courteous clients at once. Maybe another
> phase 2 project!

For what it's worth, you can forcibly expire a client by writing
"expire" to /proc/fs/nfsd/client/xxx/ctl.  So it shouldn't be hard to
script this, if we add some kind of "courtesy" flag to
client_info_show() and/or a number of seconds since the most recent
renew.

Maybe adding a command like "expire_if_courtesy" would also simplify
that and avoid a race where the renew comes in simultaneously with the
expire command.

Or we could just add a single call to clear all courtesy clients.  But
the per-client approach would allow more flexibility if you wanted (e.g.
to throw out only clients over a certain age).
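The per-client loop described above can be sketched from userspace. This is
purely illustrative: it assumes the /proc/fs/nfsd/clients layout with a
writable ctl file per client, takes the directory as a parameter so it can be
exercised against a scratch tree, and expires every client it finds (the
"courtesy" flag in the info file that a real script would filter on does not
exist yet).

```c
#include <dirent.h>
#include <stdio.h>

/* Write "expire" to each client's ctl file under clients_dir.
 * Returns the number of clients expired, or -1 on error. */
static int expire_all_clients(const char *clients_dir)
{
	char path[4096];
	struct dirent *d;
	int n = 0;
	DIR *dir = opendir(clients_dir);

	if (!dir)
		return -1;
	while ((d = readdir(dir)) != NULL) {
		FILE *ctl;

		if (d->d_name[0] == '.')	/* skip "." and ".." */
			continue;
		snprintf(path, sizeof(path), "%s/%s/ctl",
			 clients_dir, d->d_name);
		ctl = fopen(path, "w");
		if (!ctl)
			continue;
		fputs("expire\n", ctl);
		fclose(ctl);
		n++;
	}
	closedir(dir);
	return n;
}
```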

--b.
J. Bruce Fields June 24, 2021, 2:02 p.m. UTC | #10
By the way, have we thought about the -ENOSPC case?

An expired client could prevent a lot of disk space from being freed if
it happens to be holding the last reference to a large file.

I'm not sure how to deal with this.  I don't think there's an efficient
way to determine which expired client is responsible for an ENOSPC.  So,
maybe you'd check for ENOSPC and when you see it, remove all expired
clients and retry if there were any?
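The check-and-retry above, reduced to its shape with stand-in callbacks --
nothing here is real nfsd code. op() stands in for the vfs call that hit
ENOSPC and reap_courtesy() for "expire every courtesy client"; the fake_*
functions below only exist to exercise the pattern.

```c
#include <errno.h>
#include <stddef.h>

/* Run the operation; if it fails with -ENOSPC and reaping courtesy
 * clients freed anything, retry exactly once. */
static int do_with_courtesy_reclaim(int (*op)(void *), void *arg,
				    int (*reap_courtesy)(void))
{
	int err = op(arg);

	if (err == -ENOSPC && reap_courtesy() > 0)
		err = op(arg);
	return err;
}

/* Tiny simulation: the first attempt hits ENOSPC, the retry succeeds. */
static int fake_calls;

static int fake_create(void *arg)
{
	(void)arg;
	return fake_calls++ == 0 ? -ENOSPC : 0;
}

static int fake_reap(void)
{
	return 1;	/* pretend one courtesy client was expired */
}
```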

--b.

On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
> Currently an NFSv4 client must maintain its lease by using at least
> one of the state tokens or, if nothing else, by issuing a RENEW (4.0) or
> a singleton SEQUENCE (4.1) at least once during each lease period. If the
> client fails to renew the lease, for any reason, the Linux server expunges
> the state tokens immediately upon detection of the "failure to renew the
> lease" condition and begins returning NFS4ERR_EXPIRED if the client should
> reconnect and attempt to use the (now) expired state.
> 
> The default lease period for the Linux server is 90 seconds.  The typical
> client cuts that in half and will issue a lease renewing operation every
> 45 seconds. The 90 second lease period is very short considering the
> potential for moderately long term network partitions.  A network partition
> refers to any loss of network connectivity between the NFS client and the
> NFS server, regardless of its root cause.  This includes NIC failures, NIC
> driver bugs, network misconfigurations & administrative errors, routers &
> switches crashing and/or having software updates applied, even down to
> cables being physically pulled.  In most cases, these network failures are
> transient, although the duration is unknown.
> 
> A server which does not immediately expunge the state on lease expiration
> is known as a Courteous Server.  A Courteous Server continues to recognize
> previously generated state tokens as valid until conflict arises between
> the expired state and the requests from another client, or the server reboots.
> 
> The initial implementation of the Courteous Server will do the following:
> 
> . when the laundromat thread detects an expired client and if that client
> still has established states on the Linux server then marks the client as a
> COURTESY_CLIENT and skips destroying the client and all its states,
> otherwise destroy the client as usual.
> 
> . detects conflict of OPEN request with a COURTESY_CLIENT, destroys the
> expired client and all its states, skips the delegation recall then allows
> the conflicting request to succeed.
> 
> . detects conflict of LOCK/LOCKT request with a COURTESY_CLIENT, destroys
> the expired client and all its states then allows the conflicting request
> to succeed.
> 
> To be done:
> 
> . fix a problem with 4.1 where the Linux server keeps returning
> SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply, after the expired
> client re-connects, causing the client to keep sending BCTS requests to server.
> 
> . handle OPEN conflict with share reservations.
> 
> . instead of destroying the client and all its state on conflict, only destroy
> the state that conflicts with the current request.
> 
> . destroy the COURTESY_CLIENT either after a fixed period of time to release
> resources or as reacting to memory pressure.
> 
> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
> ---
>  fs/nfsd/nfs4state.c        | 94 +++++++++++++++++++++++++++++++++++++++++++---
>  fs/nfsd/state.h            |  1 +
>  include/linux/sunrpc/svc.h |  1 +
>  3 files changed, 91 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index b517a8794400..969995872752 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -171,6 +171,7 @@ renew_client_locked(struct nfs4_client *clp)
>  
>  	list_move_tail(&clp->cl_lru, &nn->client_lru);
>  	clp->cl_time = ktime_get_boottime_seconds();
> +	clear_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags);
>  }
>  
>  static void put_client_renew_locked(struct nfs4_client *clp)
> @@ -4610,6 +4611,38 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp)
>  	nfsd4_run_cb(&dp->dl_recall);
>  }
>  
> +/*
> + * Set rq_conflict_client if the conflict client have expired
> + * and return true, otherwise return false.
> + */
> +static bool
> +nfsd_set_conflict_client(struct nfs4_delegation *dp)
> +{
> +	struct svc_rqst *rqst;
> +	struct nfs4_client *clp;
> +	struct nfsd_net *nn;
> +	bool ret;
> +
> +	if (!i_am_nfsd())
> +		return false;
> +	rqst = kthread_data(current);
> +	if (rqst->rq_prog != NFS_PROGRAM || rqst->rq_vers < 4)
> +		return false;
> +	rqst->rq_conflict_client = NULL;
> +	clp = dp->dl_recall.cb_clp;
> +	nn = net_generic(clp->net, nfsd_net_id);
> +	spin_lock(&nn->client_lock);
> +
> +	if (test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags) &&
> +				!mark_client_expired_locked(clp)) {
> +		rqst->rq_conflict_client = clp;
> +		ret = true;
> +	} else
> +		ret = false;
> +	spin_unlock(&nn->client_lock);
> +	return ret;
> +}
> +
>  /* Called from break_lease() with i_lock held. */
>  static bool
>  nfsd_break_deleg_cb(struct file_lock *fl)
> @@ -4618,6 +4651,8 @@ nfsd_break_deleg_cb(struct file_lock *fl)
>  	struct nfs4_delegation *dp = (struct nfs4_delegation *)fl->fl_owner;
>  	struct nfs4_file *fp = dp->dl_stid.sc_file;
>  
> +	if (nfsd_set_conflict_client(dp))
> +		return false;
>  	trace_nfsd_deleg_break(&dp->dl_stid.sc_stateid);
>  
>  	/*
> @@ -5237,6 +5272,22 @@ static void nfsd4_deleg_xgrade_none_ext(struct nfsd4_open *open,
>  	 */
>  }
>  
> +static bool
> +nfs4_destroy_courtesy_client(struct nfs4_client *clp)
> +{
> +	struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);
> +
> +	spin_lock(&nn->client_lock);
> +	if (!test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags) ||
> +			mark_client_expired_locked(clp)) {
> +		spin_unlock(&nn->client_lock);
> +		return false;
> +	}
> +	spin_unlock(&nn->client_lock);
> +	expire_client(clp);
> +	return true;
> +}
> +
>  __be32
>  nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nfsd4_open *open)
>  {
> @@ -5286,7 +5337,13 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
>  			goto out;
>  		}
>  	} else {
> +		rqstp->rq_conflict_client = NULL;
>  		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
> +		if (status == nfserr_jukebox && rqstp->rq_conflict_client) {
> +			if (nfs4_destroy_courtesy_client(rqstp->rq_conflict_client))
> +				status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
> +		}
> +
>  		if (status) {
>  			stp->st_stid.sc_type = NFS4_CLOSED_STID;
>  			release_open_stateid(stp);
> @@ -5472,6 +5529,8 @@ nfs4_laundromat(struct nfsd_net *nn)
>  	};
>  	struct nfs4_cpntf_state *cps;
>  	copy_stateid_t *cps_t;
> +	struct nfs4_stid *stid;
> +	int id = 0;
>  	int i;
>  
>  	if (clients_still_reclaiming(nn)) {
> @@ -5495,6 +5554,15 @@ nfs4_laundromat(struct nfsd_net *nn)
>  		clp = list_entry(pos, struct nfs4_client, cl_lru);
>  		if (!state_expired(&lt, clp->cl_time))
>  			break;
> +
> +		spin_lock(&clp->cl_lock);
> +		stid = idr_get_next(&clp->cl_stateids, &id);
> +		spin_unlock(&clp->cl_lock);
> +		if (stid) {
> +			/* client still has states */
> +			set_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags);
> +			continue;
> +		}
>  		if (mark_client_expired_locked(clp)) {
>  			trace_nfsd_clid_expired(&clp->cl_clientid);
>  			continue;
> @@ -6440,7 +6508,7 @@ static const struct lock_manager_operations nfsd_posix_mng_ops  = {
>  	.lm_put_owner = nfsd4_fl_put_owner,
>  };
>  
> -static inline void
> +static inline int
>  nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
>  {
>  	struct nfs4_lockowner *lo;
> @@ -6453,6 +6521,8 @@ nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
>  			/* We just don't care that much */
>  			goto nevermind;
>  		deny->ld_clientid = lo->lo_owner.so_client->cl_clientid;
> +		if (nfs4_destroy_courtesy_client(lo->lo_owner.so_client))
> +			return -EAGAIN;
>  	} else {
>  nevermind:
>  		deny->ld_owner.len = 0;
> @@ -6467,6 +6537,7 @@ nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
>  	deny->ld_type = NFS4_READ_LT;
>  	if (fl->fl_type != F_RDLCK)
>  		deny->ld_type = NFS4_WRITE_LT;
> +	return 0;
>  }
>  
>  static struct nfs4_lockowner *
> @@ -6734,6 +6805,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	unsigned int fl_flags = FL_POSIX;
>  	struct net *net = SVC_NET(rqstp);
>  	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
> +	bool retried = false;
>  
>  	dprintk("NFSD: nfsd4_lock: start=%Ld length=%Ld\n",
>  		(long long) lock->lk_offset,
> @@ -6860,7 +6932,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  		list_add_tail(&nbl->nbl_lru, &nn->blocked_locks_lru);
>  		spin_unlock(&nn->blocked_locks_lock);
>  	}
> -
> +again:
>  	err = vfs_lock_file(nf->nf_file, F_SETLK, file_lock, conflock);
>  	switch (err) {
>  	case 0: /* success! */
> @@ -6875,7 +6947,12 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	case -EAGAIN:		/* conflock holds conflicting lock */
>  		status = nfserr_denied;
>  		dprintk("NFSD: nfsd4_lock: conflicting lock found!\n");
> -		nfs4_set_lock_denied(conflock, &lock->lk_denied);
> +
> +		/* try again if conflict with courtesy client  */
> +		if (nfs4_set_lock_denied(conflock, &lock->lk_denied) == -EAGAIN && !retried) {
> +			retried = true;
> +			goto again;
> +		}
>  		break;
>  	case -EDEADLK:
>  		status = nfserr_deadlock;
> @@ -6962,6 +7039,8 @@ nfsd4_lockt(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	struct nfs4_lockowner *lo = NULL;
>  	__be32 status;
>  	struct nfsd_net *nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
> +	bool retried = false;
> +	int ret;
>  
>  	if (locks_in_grace(SVC_NET(rqstp)))
>  		return nfserr_grace;
> @@ -7010,14 +7089,19 @@ nfsd4_lockt(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	file_lock->fl_end = last_byte_offset(lockt->lt_offset, lockt->lt_length);
>  
>  	nfs4_transform_lock_offset(file_lock);
> -
> +again:
>  	status = nfsd_test_lock(rqstp, &cstate->current_fh, file_lock);
>  	if (status)
>  		goto out;
>  
>  	if (file_lock->fl_type != F_UNLCK) {
>  		status = nfserr_denied;
> -		nfs4_set_lock_denied(file_lock, &lockt->lt_denied);
> +		ret = nfs4_set_lock_denied(file_lock, &lockt->lt_denied);
> +		if (ret == -EAGAIN && !retried) {
> +			retried = true;
> +			fh_clear_wcc(&cstate->current_fh);
> +			goto again;
> +		}
>  	}
>  out:
>  	if (lo)
> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
> index e73bdbb1634a..bdc3605e3722 100644
> --- a/fs/nfsd/state.h
> +++ b/fs/nfsd/state.h
> @@ -345,6 +345,7 @@ struct nfs4_client {
>  #define NFSD4_CLIENT_UPCALL_LOCK	(5)	/* upcall serialization */
>  #define NFSD4_CLIENT_CB_FLAG_MASK	(1 << NFSD4_CLIENT_CB_UPDATE | \
>  					 1 << NFSD4_CLIENT_CB_KILL)
> +#define NFSD4_CLIENT_COURTESY		(6)	/* be nice to expired client */
>  	unsigned long		cl_flags;
>  	const struct cred	*cl_cb_cred;
>  	struct rpc_clnt		*cl_cb_client;
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index e91d51ea028b..2f0382f9d0ff 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -304,6 +304,7 @@ struct svc_rqst {
>  						 * net namespace
>  						 */
>  	void **			rq_lease_breaker; /* The v4 client breaking a lease */
> +	void			*rq_conflict_client;
>  };
>  
>  #define SVC_NET(rqst) (rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
> -- 
> 2.9.5
Dai Ngo June 24, 2021, 7:50 p.m. UTC | #11
On 6/24/21 7:02 AM, J. Bruce Fields wrote:
> By the way, have we thought about the -ENOSPC case?
>
> An expired client could prevent a lot of disk space from being freed if
> it happens to be holding the last reference to a large file.

I think memory shortage is probably more likely to happen than running out
of disk space, but we have to handle this too.

>
> I'm not sure how to deal with this.  I don't think there's an efficient
> way to determine which expired client is responsible for an ENOSPC.  So,
> maybe you'd check for ENOSPC and when you see it, remove all expired
> clients and retry if there were any?

I did a quick check on fs/nfsd and did not see anywhere that the server
returns ENOSPC/nfserr_nospc, so I'm not sure where to check. Currently we
plan to destroy the courtesy client after 24hr if it has not reconnected,
but that's a long time.

There are other immediate issues that need to be resolved:

1. before an expired client can be set as a courtesy client, we need to make
sure no other NFSv4 client is interested in any of the locks owned by the
expired client; another v4 client may have tried to acquire the lock and been
denied (the lock request is not blocked). We need a mechanism to detect this
condition and not allow the expired client to become a courtesy client.

2. handle conflict with a v3 lock request.  The v3 lock code can detect a
lock conflict caused by a v4 client, but it needs a way to call into the v4
code to destroy the courtesy client.

3. handle conflict with a local lock.  Currently the local lock code does not
even need to get the conflicting lock info in case of a conflict; it just
blocks and waits for the lock to be released. We need to detect this condition
before allowing an expired v4 client to become a courtesy client, and also
handle the case where a local lock comes in after the v4 client was made a
courtesy client. I'm not sure how to handle this without modifying fs/locks.c.
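For issue 2, one possible shape (purely illustrative; none of these names
exist in nfsd today) is an ops table that the v4 code registers and the
v3 or local lock path consults when it finds a conflict, so the generic
code never has to know about nfs4_client internals:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hook: v4 registers it; the v3/local lock code calls it
 * on conflict.  Returns true when the conflicting owner belonged to a
 * courtesy client that has now been destroyed, so the caller retries. */
struct courtesy_ops {
	bool (*expire_conflicting_owner)(void *owner);
};

static const struct courtesy_ops *courtesy_ops;

static void register_courtesy_ops(const struct courtesy_ops *ops)
{
	courtesy_ops = ops;
}

/* called from the v3/local lock conflict path */
static bool try_expire_courtesy_owner(void *owner)
{
	if (!courtesy_ops || !courtesy_ops->expire_conflicting_owner)
		return false;
	return courtesy_ops->expire_conflicting_owner(owner);
}

/* demo implementation standing in for the v4 side */
static bool demo_expire(void *owner)
{
	(void)owner;
	return true;
}

static const struct courtesy_ops demo_ops = {
	.expire_conflicting_owner = demo_expire,
};
```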

-Dai

>
> --b.
>
> On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
>> Currently an NFSv4 client must maintain its lease by using at least
>> one of the state tokens or, if nothing else, by issuing a RENEW (4.0) or
>> a singleton SEQUENCE (4.1) at least once during each lease period. If the
>> client fails to renew the lease, for any reason, the Linux server expunges
>> the state tokens immediately upon detection of the "failure to renew the
>> lease" condition and begins returning NFS4ERR_EXPIRED if the client should
>> reconnect and attempt to use the (now) expired state.
>>
>> The default lease period for the Linux server is 90 seconds.  The typical
>> client cuts that in half and will issue a lease renewing operation every
>> 45 seconds. The 90 second lease period is very short considering the
>> potential for moderately long term network partitions.  A network partition
>> refers to any loss of network connectivity between the NFS client and the
>> NFS server, regardless of its root cause.  This includes NIC failures, NIC
>> driver bugs, network misconfigurations & administrative errors, routers &
>> switches crashing and/or having software updates applied, even down to
>> cables being physically pulled.  In most cases, these network failures are
>> transient, although the duration is unknown.
>>
>> A server which does not immediately expunge the state on lease expiration
>> is known as a Courteous Server.  A Courteous Server continues to recognize
>> previously generated state tokens as valid until conflict arises between
>> the expired state and the requests from another client, or the server reboots.
>>
>> The initial implementation of the Courteous Server will do the following:
>>
>> . when the laundromat thread detects an expired client and if that client
>> still has established states on the Linux server then marks the client as a
>> COURTESY_CLIENT and skips destroying the client and all its states,
>> otherwise destroy the client as usual.
>>
>> . detects conflict of OPEN request with a COURTESY_CLIENT, destroys the
>> expired client and all its states, skips the delegation recall then allows
>> the conflicting request to succeed.
>>
>> . detects conflict of LOCK/LOCKT request with a COURTESY_CLIENT, destroys
>> the expired client and all its states then allows the conflicting request
>> to succeed.
>>
>> To be done:
>>
>> . fix a problem with 4.1 where the Linux server keeps returning
>> SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply, after the expired
>> client re-connects, causing the client to keep sending BCTS requests to server.
>>
>> . handle OPEN conflict with share reservations.
>>
>> . instead of destroying the client and all its state on conflict, only destroy
>> the state that conflicts with the current request.
>>
>> . destroy the COURTESY_CLIENT either after a fixed period of time to release
>> resources or as reacting to memory pressure.
>>
>> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
>> ---
>>   fs/nfsd/nfs4state.c        | 94 +++++++++++++++++++++++++++++++++++++++++++---
>>   fs/nfsd/state.h            |  1 +
>>   include/linux/sunrpc/svc.h |  1 +
>>   3 files changed, 91 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>> index b517a8794400..969995872752 100644
>> --- a/fs/nfsd/nfs4state.c
>> +++ b/fs/nfsd/nfs4state.c
>> @@ -171,6 +171,7 @@ renew_client_locked(struct nfs4_client *clp)
>>   
>>   	list_move_tail(&clp->cl_lru, &nn->client_lru);
>>   	clp->cl_time = ktime_get_boottime_seconds();
>> +	clear_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags);
>>   }
>>   
>>   static void put_client_renew_locked(struct nfs4_client *clp)
>> @@ -4610,6 +4611,38 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp)
>>   	nfsd4_run_cb(&dp->dl_recall);
>>   }
>>   
>> +/*
>> + * Set rq_conflict_client if the conflict client have expired
>> + * and return true, otherwise return false.
>> + */
>> +static bool
>> +nfsd_set_conflict_client(struct nfs4_delegation *dp)
>> +{
>> +	struct svc_rqst *rqst;
>> +	struct nfs4_client *clp;
>> +	struct nfsd_net *nn;
>> +	bool ret;
>> +
>> +	if (!i_am_nfsd())
>> +		return false;
>> +	rqst = kthread_data(current);
>> +	if (rqst->rq_prog != NFS_PROGRAM || rqst->rq_vers < 4)
>> +		return false;
>> +	rqst->rq_conflict_client = NULL;
>> +	clp = dp->dl_recall.cb_clp;
>> +	nn = net_generic(clp->net, nfsd_net_id);
>> +	spin_lock(&nn->client_lock);
>> +
>> +	if (test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags) &&
>> +				!mark_client_expired_locked(clp)) {
>> +		rqst->rq_conflict_client = clp;
>> +		ret = true;
>> +	} else
>> +		ret = false;
>> +	spin_unlock(&nn->client_lock);
>> +	return ret;
>> +}
>> +
>>   /* Called from break_lease() with i_lock held. */
>>   static bool
>>   nfsd_break_deleg_cb(struct file_lock *fl)
>> @@ -4618,6 +4651,8 @@ nfsd_break_deleg_cb(struct file_lock *fl)
>>   	struct nfs4_delegation *dp = (struct nfs4_delegation *)fl->fl_owner;
>>   	struct nfs4_file *fp = dp->dl_stid.sc_file;
>>   
>> +	if (nfsd_set_conflict_client(dp))
>> +		return false;
>>   	trace_nfsd_deleg_break(&dp->dl_stid.sc_stateid);
>>   
>>   	/*
>> @@ -5237,6 +5272,22 @@ static void nfsd4_deleg_xgrade_none_ext(struct nfsd4_open *open,
>>   	 */
>>   }
>>   
>> +static bool
>> +nfs4_destroy_courtesy_client(struct nfs4_client *clp)
>> +{
>> +	struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);
>> +
>> +	spin_lock(&nn->client_lock);
>> +	if (!test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags) ||
>> +			mark_client_expired_locked(clp)) {
>> +		spin_unlock(&nn->client_lock);
>> +		return false;
>> +	}
>> +	spin_unlock(&nn->client_lock);
>> +	expire_client(clp);
>> +	return true;
>> +}
>> +
>>   __be32
>>   nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nfsd4_open *open)
>>   {
>> @@ -5286,7 +5337,13 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
>>   			goto out;
>>   		}
>>   	} else {
>> +		rqstp->rq_conflict_client = NULL;
>>   		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
>> +		if (status == nfserr_jukebox && rqstp->rq_conflict_client) {
>> +			if (nfs4_destroy_courtesy_client(rqstp->rq_conflict_client))
>> +				status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
>> +		}
>> +
>>   		if (status) {
>>   			stp->st_stid.sc_type = NFS4_CLOSED_STID;
>>   			release_open_stateid(stp);
>> @@ -5472,6 +5529,8 @@ nfs4_laundromat(struct nfsd_net *nn)
>>   	};
>>   	struct nfs4_cpntf_state *cps;
>>   	copy_stateid_t *cps_t;
>> +	struct nfs4_stid *stid;
>> +	int id = 0;
>>   	int i;
>>   
>>   	if (clients_still_reclaiming(nn)) {
>> @@ -5495,6 +5554,15 @@ nfs4_laundromat(struct nfsd_net *nn)
>>   		clp = list_entry(pos, struct nfs4_client, cl_lru);
>>   		if (!state_expired(&lt, clp->cl_time))
>>   			break;
>> +
>> +		spin_lock(&clp->cl_lock);
>> +		stid = idr_get_next(&clp->cl_stateids, &id);
>> +		spin_unlock(&clp->cl_lock);
>> +		if (stid) {
>> +			/* client still has states */
>> +			set_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags);
>> +			continue;
>> +		}
>>   		if (mark_client_expired_locked(clp)) {
>>   			trace_nfsd_clid_expired(&clp->cl_clientid);
>>   			continue;
>> @@ -6440,7 +6508,7 @@ static const struct lock_manager_operations nfsd_posix_mng_ops  = {
>>   	.lm_put_owner = nfsd4_fl_put_owner,
>>   };
>>   
>> -static inline void
>> +static inline int
>>   nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
>>   {
>>   	struct nfs4_lockowner *lo;
>> @@ -6453,6 +6521,8 @@ nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
>>   			/* We just don't care that much */
>>   			goto nevermind;
>>   		deny->ld_clientid = lo->lo_owner.so_client->cl_clientid;
>> +		if (nfs4_destroy_courtesy_client(lo->lo_owner.so_client))
>> +			return -EAGAIN;
>>   	} else {
>>   nevermind:
>>   		deny->ld_owner.len = 0;
>> @@ -6467,6 +6537,7 @@ nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
>>   	deny->ld_type = NFS4_READ_LT;
>>   	if (fl->fl_type != F_RDLCK)
>>   		deny->ld_type = NFS4_WRITE_LT;
>> +	return 0;
>>   }
>>   
>>   static struct nfs4_lockowner *
>> @@ -6734,6 +6805,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>>   	unsigned int fl_flags = FL_POSIX;
>>   	struct net *net = SVC_NET(rqstp);
>>   	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
>> +	bool retried = false;
>>   
>>   	dprintk("NFSD: nfsd4_lock: start=%Ld length=%Ld\n",
>>   		(long long) lock->lk_offset,
>> @@ -6860,7 +6932,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>>   		list_add_tail(&nbl->nbl_lru, &nn->blocked_locks_lru);
>>   		spin_unlock(&nn->blocked_locks_lock);
>>   	}
>> -
>> +again:
>>   	err = vfs_lock_file(nf->nf_file, F_SETLK, file_lock, conflock);
>>   	switch (err) {
>>   	case 0: /* success! */
>> @@ -6875,7 +6947,12 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>>   	case -EAGAIN:		/* conflock holds conflicting lock */
>>   		status = nfserr_denied;
>>   		dprintk("NFSD: nfsd4_lock: conflicting lock found!\n");
>> -		nfs4_set_lock_denied(conflock, &lock->lk_denied);
>> +
>> +		/* try again if conflict with courtesy client  */
>> +		if (nfs4_set_lock_denied(conflock, &lock->lk_denied) == -EAGAIN && !retried) {
>> +			retried = true;
>> +			goto again;
>> +		}
>>   		break;
>>   	case -EDEADLK:
>>   		status = nfserr_deadlock;
>> @@ -6962,6 +7039,8 @@ nfsd4_lockt(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>>   	struct nfs4_lockowner *lo = NULL;
>>   	__be32 status;
>>   	struct nfsd_net *nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
>> +	bool retried = false;
>> +	int ret;
>>   
>>   	if (locks_in_grace(SVC_NET(rqstp)))
>>   		return nfserr_grace;
>> @@ -7010,14 +7089,19 @@ nfsd4_lockt(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>>   	file_lock->fl_end = last_byte_offset(lockt->lt_offset, lockt->lt_length);
>>   
>>   	nfs4_transform_lock_offset(file_lock);
>> -
>> +again:
>>   	status = nfsd_test_lock(rqstp, &cstate->current_fh, file_lock);
>>   	if (status)
>>   		goto out;
>>   
>>   	if (file_lock->fl_type != F_UNLCK) {
>>   		status = nfserr_denied;
>> -		nfs4_set_lock_denied(file_lock, &lockt->lt_denied);
>> +		ret = nfs4_set_lock_denied(file_lock, &lockt->lt_denied);
>> +		if (ret == -EAGAIN && !retried) {
>> +			retried = true;
>> +			fh_clear_wcc(&cstate->current_fh);
>> +			goto again;
>> +		}
>>   	}
>>   out:
>>   	if (lo)
>> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
>> index e73bdbb1634a..bdc3605e3722 100644
>> --- a/fs/nfsd/state.h
>> +++ b/fs/nfsd/state.h
>> @@ -345,6 +345,7 @@ struct nfs4_client {
>>   #define NFSD4_CLIENT_UPCALL_LOCK	(5)	/* upcall serialization */
>>   #define NFSD4_CLIENT_CB_FLAG_MASK	(1 << NFSD4_CLIENT_CB_UPDATE | \
>>   					 1 << NFSD4_CLIENT_CB_KILL)
>> +#define NFSD4_CLIENT_COURTESY		(6)	/* be nice to expired client */
>>   	unsigned long		cl_flags;
>>   	const struct cred	*cl_cb_cred;
>>   	struct rpc_clnt		*cl_cb_client;
>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>> index e91d51ea028b..2f0382f9d0ff 100644
>> --- a/include/linux/sunrpc/svc.h
>> +++ b/include/linux/sunrpc/svc.h
>> @@ -304,6 +304,7 @@ struct svc_rqst {
>>   						 * net namespace
>>   						 */
>>   	void **			rq_lease_breaker; /* The v4 client breaking a lease */
>> +	void			*rq_conflict_client;
>>   };
>>   
>>   #define SVC_NET(rqst) (rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
>> -- 
>> 2.9.5
J. Bruce Fields June 24, 2021, 8:36 p.m. UTC | #12
On Thu, Jun 24, 2021 at 12:50:25PM -0700, dai.ngo@oracle.com wrote:
> On 6/24/21 7:02 AM, J. Bruce Fields wrote:
> >I'm not sure how to deal with this.  I don't think there's an efficient
> >way to determine which expired client is responsible for an ENOSPC.  So,
> >maybe you'd check for ENOSPC and when you see it, remove all expired
> >clients and retry if there were any?
> 
> I did a quick check on fs/nfsd and did not see anywhere the server returns
> ENOSPC/nfserr_nospc, so I'm not sure where to check. Currently we plan to
> destroy the courtesy client after 24hr if it has not reconnected, but that's
> a long time.

The error normally originates from the filesystem.  Anything that might
need to allocate space could hit it, and it could happen to a non-nfsd
process.

I'm not seeing an easy way to do this; we might need ideas from other
filesystem developers.

--b.
J. Bruce Fields June 28, 2021, 8:23 p.m. UTC | #13
On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
> @@ -6875,7 +6947,12 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	case -EAGAIN:		/* conflock holds conflicting lock */
>  		status = nfserr_denied;
>  		dprintk("NFSD: nfsd4_lock: conflicting lock found!\n");
> -		nfs4_set_lock_denied(conflock, &lock->lk_denied);
> +
> +		/* try again if conflict with courtesy client  */
> +		if (nfs4_set_lock_denied(conflock, &lock->lk_denied) == -EAGAIN && !retried) {
> +			retried = true;
> +			goto again;
> +		}

Ugh, apologies, this was my idea, but I just noticed it only handles conflicts
from other NFSv4 clients.  The conflicting lock could just as well come from
NLM or a local process.  So we need cooperation from the common locks.c code.

I'm not sure what to suggest....

Maybe something like:

@@ -1159,6 +1159,7 @@ static int posix_lock_inode(struct inode *inode, struct file_lock *request,
        }
 
        percpu_down_read(&file_rwsem);
+retry:
        spin_lock(&ctx->flc_lock);
        /*
         * New lock request. Walk all POSIX locks and look for conflicts. If
@@ -1169,6 +1170,11 @@ static int posix_lock_inode(struct inode *inode, struct file_lock *request,
                list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
                        if (!posix_locks_conflict(request, fl))
                                continue;
+                       if (fl->fl_lops->fl_expire_lock(fl, 1)) {
+                               spin_unlock(&ctx->flc_lock);
+                               fl->fl_lops->fl_expire_locks(fl, 0);
+                               goto retry;
+                       }
                        if (conflock)
                                locks_copy_conflock(conflock, fl);
                        error = -EAGAIN;


where ->fl_expire_lock is a new lock callback with second argument "check"
where:

	check = 1 means: just check whether this lock could be freed
	check = 0 means: go ahead and free this lock if you can
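
The retry-plus-callback pattern proposed above can be modeled in plain
userspace C. Everything below is an illustrative sketch, not the actual
patch: `model_lock`, `expire_lock`, and `posix_lock_model` are invented
stand-ins for the kernel's file_lock machinery and the proposed
->fl_expire_lock callback.

```c
/* Userspace model of the proposed ->fl_expire_lock retry loop.
 * All struct and function names are illustrative, not the kernel's. */
#include <assert.h>
#include <stdbool.h>

#define MAX_LOCKS 8

struct model_lock {
	int start, end;
	bool courtesy;	/* owned by an expired (courtesy) client */
	bool active;
};

static struct model_lock locks[MAX_LOCKS];

static bool conflicts(const struct model_lock *fl, int start, int end)
{
	return fl->active && fl->start <= end && start <= fl->end;
}

/* check = 1: only report whether this lock could be freed.
 * check = 0: actually free it (stand-in for expiring the client). */
static bool expire_lock(struct model_lock *fl, int check)
{
	if (!fl->courtesy)
		return false;
	if (check == 0)
		fl->active = false;	/* "expire_client(clp)" would run here */
	return true;
}

/* Returns 0 on success, -1 (an EAGAIN stand-in) on a real conflict. */
static int posix_lock_model(int start, int end)
{
retry:
	for (int i = 0; i < MAX_LOCKS; i++) {
		struct model_lock *fl = &locks[i];

		if (!conflicts(fl, start, end))
			continue;
		if (expire_lock(fl, 1)) {
			/* the kernel version would drop ctx->flc_lock here */
			expire_lock(fl, 0);
			goto retry;	/* list may have changed; rescan */
		}
		return -1;		/* conflict with a live client */
	}
	return 0;			/* grant the lock */
}
```

The property the sketch preserves is that the scan restarts from the top
after a lock is freed, since the lock list may have changed while the
(simulated) spinlock was dropped.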

--b.
Dai Ngo June 28, 2021, 11:39 p.m. UTC | #14
On 6/28/21 1:23 PM, J. Bruce Fields wrote:
> On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
>> @@ -6875,7 +6947,12 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>>   	case -EAGAIN:		/* conflock holds conflicting lock */
>>   		status = nfserr_denied;
>>   		dprintk("NFSD: nfsd4_lock: conflicting lock found!\n");
>> -		nfs4_set_lock_denied(conflock, &lock->lk_denied);
>> +
>> +		/* try again if conflict with courtesy client  */
>> +		if (nfs4_set_lock_denied(conflock, &lock->lk_denied) == -EAGAIN && !retried) {
>> +			retried = true;
>> +			goto again;
>> +		}
> Ugh, apologies, this was my idea, but I just noticed it only handles conflicts
> from other NFSv4 clients.  The conflicting lock could just as well come from
> NLM or a local process.  So we need cooperation from the common locks.c code.
>
> I'm not sure what to suggest....
>
> Maybe something like:
>
> @@ -1159,6 +1159,7 @@ static int posix_lock_inode(struct inode *inode, struct file_lock *request,
>          }
>   
>          percpu_down_read(&file_rwsem);
> +retry:
>          spin_lock(&ctx->flc_lock);
>          /*
>           * New lock request. Walk all POSIX locks and look for conflicts. If
> @@ -1169,6 +1170,11 @@ static int posix_lock_inode(struct inode *inode, struct file_lock *request,
>                  list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
>                          if (!posix_locks_conflict(request, fl))
>                                  continue;
> +                       if (fl->fl_lops->fl_expire_lock(fl, 1)) {
> +                               spin_unlock(&ctx->flc_lock);
> +                               fl->fl_lops->fl_expire_locks(fl, 0);
> +                               goto retry;
> +                       }
>                          if (conflock)
>                                  locks_copy_conflock(conflock, fl);
>                          error = -EAGAIN;
>
>
> where ->fl_expire_lock is a new lock callback with second argument "check"
> where:
>
> 	check = 1 means: just check whether this lock could be freed
> 	check = 0 means: go ahead and free this lock if you can

Thanks Bruce, I will look into this approach.

-Dai

>
> --b.
Dai Ngo June 29, 2021, 4:40 a.m. UTC | #15
On 6/28/21 4:39 PM, dai.ngo@oracle.com wrote:
>
> On 6/28/21 1:23 PM, J. Bruce Fields wrote:
>> On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
>>> @@ -6875,7 +6947,12 @@ nfsd4_lock(struct svc_rqst *rqstp, struct 
>>> nfsd4_compound_state *cstate,
>>>       case -EAGAIN:        /* conflock holds conflicting lock */
>>>           status = nfserr_denied;
>>>           dprintk("NFSD: nfsd4_lock: conflicting lock found!\n");
>>> -        nfs4_set_lock_denied(conflock, &lock->lk_denied);
>>> +
>>> +        /* try again if conflict with courtesy client  */
>>> +        if (nfs4_set_lock_denied(conflock, &lock->lk_denied) == 
>>> -EAGAIN && !retried) {
>>> +            retried = true;
>>> +            goto again;
>>> +        }
>> Ugh, apologies, this was my idea, but I just noticed it only handles 
>> conflicts
>> from other NFSv4 clients.  The conflicting lock could just as well 
>> come from
>> NLM or a local process.  So we need cooperation from the common 
>> locks.c code.
>>
>> I'm not sure what to suggest....

One option is to use locks_copy_conflock/nfsd4_fl_get_owner to detect
that the lock being copied belongs to a courtesy client and schedule the
laundromat to run to destroy the courtesy client. This option requires
callers of vfs_lock_file to provide the 'conflock' argument.
This option, let's call it option 1, is similar to what you suggested
below (option 2). Both of these options can handle lock conflicts between
an NFSv4 client and another NFSv4 or NFSv3/NLM client.

Option 1:
For a lock request that uses F_LOCK, the client retries the request when
the lock is denied. The denied request also causes the courtesy client to
be destroyed, so the client succeeds on the next retry.

For a lock request that uses F_TLOCK, the request fails with lock denied;
the client does not retry and returns the error to the application. If the
application tries again it will succeed, since the courtesy client was
destroyed when the lock conflict was detected by nfsd4_fl_get_owner. I think
this behavior is ok since the client cannot rely on a specific lease
expiration time maintained by the server.

Option 2:
For a lock request that uses F_LOCK, since the new call, fl_expire_lock, is
made outside of a spinlock context, we can call expire_client to destroy the
courtesy client in this new call and modify posix_lock_inode to retry the
loop. With this option, the conflicting lock request is allowed to succeed
without the client having to receive a denial and then retry.

For a lock request that uses F_TLOCK, the conflicting lock request is allowed
to succeed, same as in the F_LOCK case. The application does not need to retry.

Option 1 does not need any modification in fs/locks.c to add a new call and
do the retry. It does need to modify fs/lockd/svclock.c to add the
'conflock' arg for vfs_lock_file.

Which option do you think we should take? I'm open to either one.

Regarding local lock conflict, do_lock_file_wait calls vfs_lock_file and
just blocks waiting for the lock to be released. Both of the options
above do not handle the case where the local lock happens before the
v4 client expires and becomes a courtesy client. In this case we cannot
let the v4 client become a courtesy client. We need to have a way to
detect that someone is blocked on a lock owned by the v4 client and
not allow that client to become a courtesy client. One way to handle
this is to mark the v4 lock as 'has_waiter', and then before allowing
the expired v4 client to become a courtesy client we need to search
all the locks of this v4 client for any lock with the 'has_waiter' flag
and disallow it. The part that I don't like about this approach is
having to search all locks of each lockowner of the v4 client for
a lock with 'has_waiter'. I need some suggestions here.
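
That waiter scan can be sketched as a small userspace model. All names
below (`model_client`, `nwaiters`, `try_mark_courtesy`) are invented for
illustration; `nwaiters` stands in for a per-lock indicator that some task
is blocked on the lock.

```c
/* Userspace model: allow an expired client to become a courtesy client
 * only if no other task is blocked waiting on one of its locks. */
#include <assert.h>
#include <stdbool.h>

#define MAX_CLIENT_LOCKS 4

struct model_lock {
	int nwaiters;	/* stand-in for a per-lock waiter list */
};

struct model_client {
	struct model_lock locks[MAX_CLIENT_LOCKS];
	int nlocks;
	bool courtesy;
};

/* Iterate over all of the client's locks; any lock with a waiter
 * disqualifies the client from courtesy treatment. */
static bool try_mark_courtesy(struct model_client *clp)
{
	for (int i = 0; i < clp->nlocks; i++)
		if (clp->locks[i].nwaiters > 0)
			return false;	/* someone would wait forever */
	clp->courtesy = true;
	return true;
}
```

The cost this models is exactly the objection raised above: there is no
way around iterating over every lock of every lockowner of the client.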

-Dai

>>
>> Maybe something like:
>>
>> @@ -1159,6 +1159,7 @@ static int posix_lock_inode(struct inode 
>> *inode, struct file_lock *request,
>>          }
>>            percpu_down_read(&file_rwsem);
>> +retry:
>>          spin_lock(&ctx->flc_lock);
>>          /*
>>           * New lock request. Walk all POSIX locks and look for 
>> conflicts. If
>> @@ -1169,6 +1170,11 @@ static int posix_lock_inode(struct inode 
>> *inode, struct file_lock *request,
>>                  list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
>>                          if (!posix_locks_conflict(request, fl))
>>                                  continue;
>> +                       if (fl->fl_lops->fl_expire_lock(fl, 1)) {
>> +                               spin_unlock(&ctx->flc_lock);
>> +                               fl->fl_lops->fl_expire_lock(fl, 0);
>> +                               goto retry;
>> +                       }
>>                          if (conflock)
>>                                  locks_copy_conflock(conflock, fl);
>>                          error = -EAGAIN;
>>
>>
>> where ->fl_expire_lock is a new lock callback with second argument 
>> "check"
>> where:
>>
>>     check = 1 means: just check whether this lock could be freed
>>     check = 0 means: go ahead and free this lock if you can
>
> Thanks Bruce, I will look into this approach.
>
> -Dai
>
>>
>> --b.
J. Bruce Fields June 30, 2021, 1:35 a.m. UTC | #16
On Mon, Jun 28, 2021 at 09:40:56PM -0700, dai.ngo@oracle.com wrote:
> 
> On 6/28/21 4:39 PM, dai.ngo@oracle.com wrote:
> >
> >On 6/28/21 1:23 PM, J. Bruce Fields wrote:
> >>On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
> >>>@@ -6875,7 +6947,12 @@ nfsd4_lock(struct svc_rqst *rqstp,
> >>>struct nfsd4_compound_state *cstate,
> >>>      case -EAGAIN:        /* conflock holds conflicting lock */
> >>>          status = nfserr_denied;
> >>>          dprintk("NFSD: nfsd4_lock: conflicting lock found!\n");
> >>>-        nfs4_set_lock_denied(conflock, &lock->lk_denied);
> >>>+
> >>>+        /* try again if conflict with courtesy client  */
> >>>+        if (nfs4_set_lock_denied(conflock, &lock->lk_denied)
> >>>== -EAGAIN && !retried) {
> >>>+            retried = true;
> >>>+            goto again;
> >>>+        }
> >>Ugh, apologies, this was my idea, but I just noticed it only
> >>handles conflicts
> >>from other NFSv4 clients.  The conflicting lock could just as
> >>well come from
> >>NLM or a local process.  So we need cooperation from the common
> >>locks.c code.
> >>
> >>I'm not sure what to suggest....
> 
> One option is to use locks_copy_conflock/nfsd4_fl_get_owner to detect
> the lock being copied belongs to a courtesy client and schedule the
> laundromat to run to destroy the courtesy client. This option requires
> callers of vfs_lock_file to provide the 'conflock' argument.

I'm not sure I follow.  What's the advantage of doing it this way?

> > Regarding local lock conflict, do_lock_file_wait calls vfs_lock_file and
> just block waiting for the lock to be released. Both of the options
> above do not handle the case where the local lock happens before the
> v4 client expires and becomes courtesy client. In this case we can not
> let the v4 client becomes courtesy client.

Oh, good point, yes, we don't want that waiter stuck waiting forever on
this expired client....

> We need to have a way to
> detect that someone is blocked on a lock owned by the v4 client and
> do not allow that client to become courtesy client.  One way to handle
> this to mark the v4 lock as 'has_waiter', and then before allowing
> the expired v4 client to become courtesy client we need to search
> all the locks of this v4 client for any lock with 'has_waiter' flag
> and disallow it. The part that I don't like about this approach is
> having to search all locks of each lockowner of the v4 client for
> lock with 'has_waiter'.  I need some suggestions here.

I'm not seeing a way to do it without iterating over all the client's
locks.

I don't think you should need a new flag, though, shouldn't
!list_empty(&lock->fl_blocked_requests) be enough?

--b.

> 
> -Dai
> 
> >>
> >>Maybe something like:
> >>
> >>@@ -1159,6 +1159,7 @@ static int posix_lock_inode(struct inode
> >>*inode, struct file_lock *request,
> >>         }
> >>           percpu_down_read(&file_rwsem);
> >>+retry:
> >>         spin_lock(&ctx->flc_lock);
> >>         /*
> >>          * New lock request. Walk all POSIX locks and look for
> >>conflicts. If
> >>@@ -1169,6 +1170,11 @@ static int posix_lock_inode(struct inode
> >>*inode, struct file_lock *request,
> >>                 list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
> >>                         if (!posix_locks_conflict(request, fl))
> >>                                 continue;
> >>+                       if (fl->fl_lops->fl_expire_lock(fl, 1)) {
> >>+                               spin_unlock(&ctx->flc_lock);
> >>+                               fl->fl_lops->fl_expire_lock(fl, 0);
> >>+                               goto retry;
> >>+                       }
> >>                         if (conflock)
> >>                                 locks_copy_conflock(conflock, fl);
> >>                         error = -EAGAIN;
> >>
> >>
> >>where ->fl_expire_lock is a new lock callback with second
> >>argument "check"
> >>where:
> >>
> >>    check = 1 means: just check whether this lock could be freed
> >>    check = 0 means: go ahead and free this lock if you can
> >
> >Thanks Bruce, I will look into this approach.
> >
> >-Dai
> >
> >>
> >>--b.
Dai Ngo June 30, 2021, 8:41 a.m. UTC | #17
On 6/29/21 6:35 PM, J. Bruce Fields wrote:
> On Mon, Jun 28, 2021 at 09:40:56PM -0700, dai.ngo@oracle.com wrote:
>> On 6/28/21 4:39 PM, dai.ngo@oracle.com wrote:
>>> On 6/28/21 1:23 PM, J. Bruce Fields wrote:
>>>> On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
>>>>> @@ -6875,7 +6947,12 @@ nfsd4_lock(struct svc_rqst *rqstp,
>>>>> struct nfsd4_compound_state *cstate,
>>>>>        case -EAGAIN:        /* conflock holds conflicting lock */
>>>>>            status = nfserr_denied;
>>>>>            dprintk("NFSD: nfsd4_lock: conflicting lock found!\n");
>>>>> -        nfs4_set_lock_denied(conflock, &lock->lk_denied);
>>>>> +
>>>>> +        /* try again if conflict with courtesy client  */
>>>>> +        if (nfs4_set_lock_denied(conflock, &lock->lk_denied)
>>>>> == -EAGAIN && !retried) {
>>>>> +            retried = true;
>>>>> +            goto again;
>>>>> +        }
>>>> Ugh, apologies, this was my idea, but I just noticed it only
>>>> handles conflicts
>>> >from other NFSv4 clients.  The conflicting lock could just as
>>>> well come from
>>>> NLM or a local process.  So we need cooperation from the common
>>>> locks.c code.
>>>>
>>>> I'm not sure what to suggest....
>> One option is to use locks_copy_conflock/nfsd4_fl_get_owner to detect
>> the lock being copied belongs to a courtesy client and schedule the
>> laundromat to run to destroy the courtesy client. This option requires
>> callers of vfs_lock_file to provide the 'conflock' argument.
> I'm not sure I follow.  What's the advantage of doing it this way?

I'm not sure it's an advantage, but I was trying to minimize changes to
the fs code. The only change we need is to add the conflock argument
to do_lock_file_wait to handle local lock conflicts.

If you don't think we're going to get objections to the new callback,
fl_expire_lock, then I will take that approach. We still need to add
the conflock argument to do_lock_file_wait in this case.

>
>> Regarding local lock conflict, do_lock_file_wait calls vfs_lock_file and
>> just block waiting for the lock to be released. Both of the options
>> above do not handle the case where the local lock happens before the
>> v4 client expires and becomes courtesy client. In this case we can not
>> let the v4 client becomes courtesy client.
> Oh, good point, yes, we don't want that waiter stuck waiting forever on
> this expired client....
>
>> We need to have a way to
>> detect that someone is blocked on a lock owned by the v4 client and
>> do not allow that client to become courtesy client.  One way to handle
>> this to mark the v4 lock as 'has_waiter', and then before allowing
>> the expired v4 client to become courtesy client we need to search
>> all the locks of this v4 client for any lock with 'has_waiter' flag
>> and disallow it. The part that I don't like about this approach is
>> having to search all locks of each lockowner of the v4 client for
>> lock with 'has_waiter'.  I need some suggestions here.
> I'm not seeing a way to do it without iterating over all the client's
> locks.

ok, i feel a bit better :-)

>
> I don't think you should need a new flag, though, shouldn't
> !list_empty(&lock->fl_blocked_requests) be enough?

Thanks Bruce, this is what I was looking for.

-Dai

>
> --b.
>
>> -Dai
>>
>>>> Maybe something like:
>>>>
>>>> @@ -1159,6 +1159,7 @@ static int posix_lock_inode(struct inode
>>>> *inode, struct file_lock *request,
>>>>           }
>>>>             percpu_down_read(&file_rwsem);
>>>> +retry:
>>>>           spin_lock(&ctx->flc_lock);
>>>>           /*
>>>>            * New lock request. Walk all POSIX locks and look for
>>>> conflicts. If
>>>> @@ -1169,6 +1170,11 @@ static int posix_lock_inode(struct inode
>>>> *inode, struct file_lock *request,
>>>>                   list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
>>>>                           if (!posix_locks_conflict(request, fl))
>>>>                                   continue;
>>>> +                       if (fl->fl_lops->fl_expire_lock(fl, 1)) {
>>>> +                               spin_unlock(&ctx->flc_lock);
>>>> +                               fl->fl_lops->fl_expire_lock(fl, 0);
>>>> +                               goto retry;
>>>> +                       }
>>>>                           if (conflock)
>>>>                                   locks_copy_conflock(conflock, fl);
>>>>                           error = -EAGAIN;
>>>>
>>>>
>>>> where ->fl_expire_lock is a new lock callback with second
>>>> argument "check"
>>>> where:
>>>>
>>>>      check = 1 means: just check whether this lock could be freed
>>>>      check = 0 means: go ahead and free this lock if you can
>>> Thanks Bruce, I will look into this approach.
>>>
>>> -Dai
>>>
>>>> --b.
J. Bruce Fields June 30, 2021, 2:52 p.m. UTC | #18
On Wed, Jun 30, 2021 at 01:41:32AM -0700, dai.ngo@oracle.com wrote:
> 
> On 6/29/21 6:35 PM, J. Bruce Fields wrote:
> >On Mon, Jun 28, 2021 at 09:40:56PM -0700, dai.ngo@oracle.com wrote:
> >>On 6/28/21 4:39 PM, dai.ngo@oracle.com wrote:
> >>>On 6/28/21 1:23 PM, J. Bruce Fields wrote:
> >>>>On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
> >>>>>@@ -6875,7 +6947,12 @@ nfsd4_lock(struct svc_rqst *rqstp,
> >>>>>struct nfsd4_compound_state *cstate,
> >>>>>       case -EAGAIN:        /* conflock holds conflicting lock */
> >>>>>           status = nfserr_denied;
> >>>>>           dprintk("NFSD: nfsd4_lock: conflicting lock found!\n");
> >>>>>-        nfs4_set_lock_denied(conflock, &lock->lk_denied);
> >>>>>+
> >>>>>+        /* try again if conflict with courtesy client  */
> >>>>>+        if (nfs4_set_lock_denied(conflock, &lock->lk_denied)
> >>>>>== -EAGAIN && !retried) {
> >>>>>+            retried = true;
> >>>>>+            goto again;
> >>>>>+        }
> >>>>Ugh, apologies, this was my idea, but I just noticed it only
> >>>>handles conflicts
> >>>>from other NFSv4 clients.  The conflicting lock could just as
> >>>>well come from
> >>>>NLM or a local process.  So we need cooperation from the common
> >>>>locks.c code.
> >>>>
> >>>>I'm not sure what to suggest....
> >>One option is to use locks_copy_conflock/nfsd4_fl_get_owner to detect
> >>the lock being copied belongs to a courtesy client and schedule the
> >>laundromat to run to destroy the courtesy client. This option requires
> >>callers of vfs_lock_file to provide the 'conflock' argument.
> >I'm not sure I follow.  What's the advantage of doing it this way?
> 
> I'm not sure it's an advantage but I was trying to minimize changes to
> the fs code. The only change we need is to add the conflock argument
> to do_lock_file_wait to handle local lock conflicts.

Got it.

That's a clever but kind of unexpected use of lm_get_owner; I think it
could be confusing to a future reader.  And I'd rather not require the
extra retry.  A new lock callback is a complication, but at least it's
pretty obvious what it does.

> If you don't think we're going to get objection with the new callback,
> fl_expire_lock, then I will take that approach. We still need to add
> the conflock argument to do_lock_file_wait in this case.

Why is that?

--b.
J. Bruce Fields June 30, 2021, 3:13 p.m. UTC | #19
On Mon, Jun 28, 2021 at 04:23:31PM -0400, J. Bruce Fields wrote:
> On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
> > @@ -6875,7 +6947,12 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> >  	case -EAGAIN:		/* conflock holds conflicting lock */
> >  		status = nfserr_denied;
> >  		dprintk("NFSD: nfsd4_lock: conflicting lock found!\n");
> > -		nfs4_set_lock_denied(conflock, &lock->lk_denied);
> > +
> > +		/* try again if conflict with courtesy client  */
> > +		if (nfs4_set_lock_denied(conflock, &lock->lk_denied) == -EAGAIN && !retried) {
> > +			retried = true;
> > +			goto again;
> > +		}
> 
> Ugh, apologies, this was my idea, but I just noticed it only handles
> conflicts from other NFSv4 clients.

(Oh, and I only just *now* noticed that you'd already pointed out that
problem in an email I apparently stopped reading after the first
paragraph:

	https://lore.kernel.org/linux-nfs/c983218b-d270-bbae-71c5-b591bfbce473@oracle.com/

Sorry!)

--b.
Dai Ngo June 30, 2021, 5:51 p.m. UTC | #20
> On 6/28/21 1:23 PM, J. Bruce Fields wrote:
>>
>> where ->fl_expire_lock is a new lock callback with second argument 
>> "check"
>> where:
>>
>>     check = 1 means: just check whether this lock could be freed

Why do we need this? Is there a use case for it? Can we just always try
to expire the lock and return success/fail?

-Dai

>>     check = 0 means: go ahead and free this lock if you can
J. Bruce Fields June 30, 2021, 6:05 p.m. UTC | #21
On Wed, Jun 30, 2021 at 10:51:27AM -0700, dai.ngo@oracle.com wrote:
> >On 6/28/21 1:23 PM, J. Bruce Fields wrote:
> >>
> >>where ->fl_expire_lock is a new lock callback with second
> >>argument "check"
> >>where:
> >>
> >>    check = 1 means: just check whether this lock could be freed
> 
> Why do we need this, is there a use case for it? can we just always try
> to expire the lock and return success/fail?

We can't expire the client while holding the flc_lock.  And once we drop
that lock we need to restart the loop.  Clearly we can't do that every
time.

(So, my code was wrong, it should have been:


	if (fl->fl_lops->fl_expire_lock(fl, 1)) {
		spin_unlock(&ctx->flc_lock);
		fl->fl_lops->fl_expire_lock(fl, 0);
		goto retry;
	}

)

But the 1 and 0 cases are starting to look pretty different; maybe they
should be two different callbacks.

--b.
Dai Ngo June 30, 2021, 6:49 p.m. UTC | #22
On 6/30/21 11:05 AM, J. Bruce Fields wrote:
> On Wed, Jun 30, 2021 at 10:51:27AM -0700, dai.ngo@oracle.com wrote:
>>> On 6/28/21 1:23 PM, J. Bruce Fields wrote:
>>>> where ->fl_expire_lock is a new lock callback with second
>>>> argument "check"
>>>> where:
>>>>
>>>>      check = 1 means: just check whether this lock could be freed
>> Why do we need this, is there a use case for it? can we just always try
>> to expire the lock and return success/fail?
> We can't expire the client while holding the flc_lock.  And once we drop
> that lock we need to restart the loop.  Clearly we can't do that every
> time.
>
> (So, my code was wrong, it should have been:
>
>
> 	if (fl->fl_lops->fl_expire_lock(fl, 1)) {
> 		spin_unlock(&ct->flc_lock);
> 		fl->fl_lops->fl_expire_locks(fl, 0);
> 		goto retry;
> 	}
>
> )

This is what I currently have:

retry:
                 list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
                         if (!posix_locks_conflict(request, fl))
                                 continue;

                         if (fl->fl_lmops && fl->fl_lmops->lm_expire_lock) {
                                 spin_unlock(&ctx->flc_lock);
                                 ret = fl->fl_lmops->lm_expire_lock(fl, 0);
                                 spin_lock(&ctx->flc_lock);
                                 if (ret)
                                         goto retry;
                         }

                         if (conflock)
                                 locks_copy_conflock(conflock, fl);

>
> But the 1 and 0 cases are starting to look pretty different; maybe they
> should be two different callbacks.

Why is the case of 1 (test only) needed? Who would use this call?

-Dai

>
> --b.
J. Bruce Fields June 30, 2021, 6:55 p.m. UTC | #23
On Wed, Jun 30, 2021 at 11:49:18AM -0700, dai.ngo@oracle.com wrote:
> 
> On 6/30/21 11:05 AM, J. Bruce Fields wrote:
> >On Wed, Jun 30, 2021 at 10:51:27AM -0700, dai.ngo@oracle.com wrote:
> >>>On 6/28/21 1:23 PM, J. Bruce Fields wrote:
> >>>>where ->fl_expire_lock is a new lock callback with second
> >>>>argument "check"
> >>>>where:
> >>>>
> >>>>     check = 1 means: just check whether this lock could be freed
> >>Why do we need this, is there a use case for it? can we just always try
> >>to expire the lock and return success/fail?
> >We can't expire the client while holding the flc_lock.  And once we drop
> >that lock we need to restart the loop.  Clearly we can't do that every
> >time.
> >
> >(So, my code was wrong, it should have been:
> >
> >
> >	if (fl->fl_lops->fl_expire_lock(fl, 1)) {
> >		spin_unlock(&ct->flc_lock);
> >		fl->fl_lops->fl_expire_locks(fl, 0);
> >		goto retry;
> >	}
> >
> >)
> 
> This is what I currently have:
> 
> retry:
>                 list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
>                         if (!posix_locks_conflict(request, fl))
>                                 continue;
> 
>                         if (fl->fl_lmops && fl->fl_lmops->lm_expire_lock) {
>                                 spin_unlock(&ctx->flc_lock);
>                                 ret = fl->fl_lmops->lm_expire_lock(fl, 0);
>                                 spin_lock(&ctx->flc_lock);
>                                 if (ret)
>                                         goto retry;

We have to retry regardless of the return value.  Once we've dropped
flc_lock, it's not safe to continue trying to iterate through the list.

>                         }
> 
>                         if (conflock)
>                                 locks_copy_conflock(conflock, fl);
> 
> >
> >But the 1 and 0 cases are starting to look pretty different; maybe they
> >should be two different callbacks.
> 
> why the case of 1 (test only) is needed,  who would use this call?

We need to avoid dropping the spinlock in the case there are no clients
to expire, otherwise we'll make no forward progress.
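
That forward-progress argument can be made concrete with a small model:
the cheap check runs under the (simulated) spinlock, the expensive expiry
only runs after dropping it, and the number of rescans is bounded by the
number of expirable locks. Everything below is an invented sketch, not
kernel code; `spinlock_held` is just a flag that lets the assertions
verify the locking discipline.

```c
/* Userspace model of the two-callback split: a test that is safe under
 * the spinlock, and an expiry that must run with the lock dropped. */
#include <assert.h>
#include <stdbool.h>

static bool spinlock_held;	/* stands in for ctx->flc_lock */

struct model_lock {
	bool courtesy;
	bool active;
};

/* Cheap test: may be called while holding the spinlock. */
static bool lock_expirable(const struct model_lock *fl)
{
	assert(spinlock_held);
	return fl->active && fl->courtesy;
}

/* Expensive expiry: must run with the spinlock dropped, because
 * destroying a client can sleep. */
static void lock_expire(struct model_lock *fl)
{
	assert(!spinlock_held);
	fl->active = false;
}

static int scan(struct model_lock *locks, int n, int *retries)
{
retry:
	spinlock_held = true;
	for (int i = 0; i < n; i++) {
		if (!locks[i].active)
			continue;
		if (lock_expirable(&locks[i])) {
			spinlock_held = false;	/* drop before expiring */
			lock_expire(&locks[i]);
			(*retries)++;
			goto retry;	/* list may have changed; rescan */
		}
	}
	spinlock_held = false;
	return 0;
}
```

Because the test-only call never drops the lock, a scan that finds no
expirable locks completes in one pass; each rescan consumes one courtesy
lock, so the loop terminates.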

--b.
Dai Ngo June 30, 2021, 7:13 p.m. UTC | #24
On 6/30/21 11:55 AM, J. Bruce Fields wrote:
> On Wed, Jun 30, 2021 at 11:49:18AM -0700, dai.ngo@oracle.com wrote:
>> On 6/30/21 11:05 AM, J. Bruce Fields wrote:
>>> On Wed, Jun 30, 2021 at 10:51:27AM -0700, dai.ngo@oracle.com wrote:
>>>>> On 6/28/21 1:23 PM, J. Bruce Fields wrote:
>>>>>> where ->fl_expire_lock is a new lock callback with second
>>>>>> argument "check"
>>>>>> where:
>>>>>>
>>>>>>      check = 1 means: just check whether this lock could be freed
>>>> Why do we need this, is there a use case for it? can we just always try
>>>> to expire the lock and return success/fail?
>>> We can't expire the client while holding the flc_lock.  And once we drop
>>> that lock we need to restart the loop.  Clearly we can't do that every
>>> time.
>>>
>>> (So, my code was wrong, it should have been:
>>>
>>>
>>> 	if (fl->fl_lops->fl_expire_lock(fl, 1)) {
>>> 		spin_unlock(&ct->flc_lock);
>>> 		fl->fl_lops->fl_expire_locks(fl, 0);
>>> 		goto retry;
>>> 	}
>>>
>>> )
>> This is what I currently have:
>>
>> retry:
>>                  list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
>>                          if (!posix_locks_conflict(request, fl))
>>                                  continue;
>>
>>                          if (fl->fl_lmops && fl->fl_lmops->lm_expire_lock) {
>>                                  spin_unlock(&ctx->flc_lock);
>>                                  ret = fl->fl_lmops->lm_expire_lock(fl, 0);
>>                                  spin_lock(&ctx->flc_lock);
>>                                  if (ret)
>>                                          goto retry;
> We have to retry regardless of the return value.  Once we've dropped
> flc_lock, it's not safe to continue trying to iterate through the list.

Yes, thanks!

>
>>                          }
>>
>>                          if (conflock)
>>                                  locks_copy_conflock(conflock, fl);
>>
>>> But the 1 and 0 cases are starting to look pretty different; maybe they
>>> should be two different callbacks.
>> why the case of 1 (test only) is needed,  who would use this call?
> We need to avoid dropping the spinlock in the case there are no clients
> to expire, otherwise we'll make no forward progress.

I think we can remember the last checked file_lock and skip it:

retry:
                 list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
                         if (!posix_locks_conflict(request, fl))
                                 continue;

                         if (checked_fl != fl && fl->fl_lmops &&
                                         fl->fl_lmops->lm_expire_lock) {
                                 checked_fl = fl;
                                 spin_unlock(&ctx->flc_lock);
                                 fl->fl_lmops->lm_expire_lock(fl);
                                 spin_lock(&ctx->flc_lock);
                                 goto retry;
                         }

                         if (conflock)
                                 locks_copy_conflock(conflock, fl);

-Dai

>
> --b.
J. Bruce Fields June 30, 2021, 7:24 p.m. UTC | #25
On Wed, Jun 30, 2021 at 12:13:35PM -0700, dai.ngo@oracle.com wrote:
> 
> On 6/30/21 11:55 AM, J. Bruce Fields wrote:
> >On Wed, Jun 30, 2021 at 11:49:18AM -0700, dai.ngo@oracle.com wrote:
> >>On 6/30/21 11:05 AM, J. Bruce Fields wrote:
> >>>On Wed, Jun 30, 2021 at 10:51:27AM -0700, dai.ngo@oracle.com wrote:
> >>>>>On 6/28/21 1:23 PM, J. Bruce Fields wrote:
> >>>>>>where ->fl_expire_lock is a new lock callback with second
> >>>>>>argument "check"
> >>>>>>where:
> >>>>>>
> >>>>>>     check = 1 means: just check whether this lock could be freed
> >>>>Why do we need this, is there a use case for it? can we just always try
> >>>>to expire the lock and return success/fail?
> >>>We can't expire the client while holding the flc_lock.  And once we drop
> >>>that lock we need to restart the loop.  Clearly we can't do that every
> >>>time.
> >>>
> >>>(So, my code was wrong, it should have been:
> >>>
> >>>
> >>>	if (fl->fl_lops->fl_expire_lock(fl, 1)) {
> >>>		spin_unlock(&ct->flc_lock);
> >>>		fl->fl_lops->fl_expire_locks(fl, 0);
> >>>		goto retry;
> >>>	}
> >>>
> >>>)
> >>This is what I currently have:
> >>
> >>retry:
> >>                 list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
> >>                         if (!posix_locks_conflict(request, fl))
> >>                                 continue;
> >>
> >>                         if (fl->fl_lmops && fl->fl_lmops->lm_expire_lock) {
> >>                                 spin_unlock(&ctx->flc_lock);
> >>                                 ret = fl->fl_lmops->lm_expire_lock(fl, 0);
> >>                                 spin_lock(&ctx->flc_lock);
> >>                                 if (ret)
> >>                                         goto retry;
> >We have to retry regardless of the return value.  Once we've dropped
> >flc_lock, it's not safe to continue trying to iterate through the list.
> 
> Yes, thanks!
> 
> >
> >>                         }
> >>
> >>                         if (conflock)
> >>                                 locks_copy_conflock(conflock, fl);
> >>
> >>>But the 1 and 0 cases are starting to look pretty different; maybe they
> >>>should be two different callbacks.
> >>why the case of 1 (test only) is needed,  who would use this call?
> >We need to avoid dropping the spinlock in the case there are no clients
> >to expire, otherwise we'll make no forward progress.
> 
> I think we can remember the last checked file_lock and skip it:

I doubt that works in the case there are multiple locks with
lm_expire_lock set.

If you really don't want another callback here, maybe you could set some
kind of flag on the lock.

At the time a client expires, you're going to have to walk all of its
locks to see if anyone's waiting for them.  At the same time maybe you
could set an FL_EXPIRABLE flag on all those locks, and test for that
here.

If the network partition heals and the client comes back, you'd have to
remember to clear that flag again.

--b.

> retry:
>                 list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
>                         if (!posix_locks_conflict(request, fl))
>                                 continue;
> 
>                         if (checked_fl != fl && fl->fl_lmops &&
>                                         fl->fl_lmops->lm_expire_lock) {
>                                 checked_fl = fl;
>                                 spin_unlock(&ctx->flc_lock);
>                                 fl->fl_lmops->lm_expire_lock(fl);
>                                 spin_lock(&ctx->flc_lock);
>                                 goto retry;
>                         }
> 
>                         if (conflock)
>                                 locks_copy_conflock(conflock, fl);
> 
> -Dai
> 
> >
> >--b.
Dai Ngo June 30, 2021, 11:48 p.m. UTC | #26
On 6/30/21 12:24 PM, J. Bruce Fields wrote:
> On Wed, Jun 30, 2021 at 12:13:35PM -0700, dai.ngo@oracle.com wrote:
>> On 6/30/21 11:55 AM, J. Bruce Fields wrote:
>>> On Wed, Jun 30, 2021 at 11:49:18AM -0700, dai.ngo@oracle.com wrote:
>>>> On 6/30/21 11:05 AM, J. Bruce Fields wrote:
>>>>> On Wed, Jun 30, 2021 at 10:51:27AM -0700, dai.ngo@oracle.com wrote:
>>>>>>> On 6/28/21 1:23 PM, J. Bruce Fields wrote:
>>>>>>>> where ->fl_expire_lock is a new lock callback with second
>>>>>>>> argument "check"
>>>>>>>> where:
>>>>>>>>
>>>>>>>>      check = 1 means: just check whether this lock could be freed
>>>>>> Why do we need this, is there a use case for it? can we just always try
>>>>>> to expire the lock and return success/fail?
>>>>> We can't expire the client while holding the flc_lock.  And once we drop
>>>>> that lock we need to restart the loop.  Clearly we can't do that every
>>>>> time.
>>>>>
>>>>> (So, my code was wrong, it should have been:
>>>>>
>>>>>
>>>>> 	if (fl->fl_lops->fl_expire_lock(fl, 1)) {
>>>>> 		spin_unlock(&ct->flc_lock);
>>>>> 		fl->fl_lops->fl_expire_locks(fl, 0);
>>>>> 		goto retry;
>>>>> 	}
>>>>>
>>>>> )
>>>> This is what I currently have:
>>>>
>>>> retry:
>>>>                  list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
>>>>                          if (!posix_locks_conflict(request, fl))
>>>>                                  continue;
>>>>
>>>>                          if (fl->fl_lmops && fl->fl_lmops->lm_expire_lock) {
>>>>                                  spin_unlock(&ctx->flc_lock);
>>>>                                  ret = fl->fl_lmops->lm_expire_lock(fl, 0);
>>>>                                  spin_lock(&ctx->flc_lock);
>>>>                                  if (ret)
>>>>                                          goto retry;
>>> We have to retry regardless of the return value.  Once we've dropped
>>> flc_lock, it's not safe to continue trying to iterate through the list.
>> Yes, thanks!
>>
>>>>                          }
>>>>
>>>>                          if (conflock)
>>>>                                  locks_copy_conflock(conflock, fl);
>>>>
>>>>> But the 1 and 0 cases are starting to look pretty different; maybe they
>>>>> should be two different callbacks.
>>>> why the case of 1 (test only) is needed,  who would use this call?
>>> We need to avoid dropping the spinlock in the case there are no clients
>>> to expire, otherwise we'll make no forward progress.
>> I think we can remember the last checked file_lock and skip it:
> I doubt that works in the case there are multiple locks with
> lm_expire_lock set.
>
> If you really don't want another callback here, maybe you could set some
> kind of flag on the lock.
>
> At the time a client expires, you're going to have to walk all of its
> locks to see if anyone's waiting for them.  At the same time maybe you
> could set an FL_EXPIRABLE flag on all those locks, and test for that
> here.
>
> If the network partition heals and the client comes back, you'd have to
> remember to clear that flag again.

It's too much unnecessary work.

Would this suffice:

retry:
                 list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
                         if (!posix_locks_conflict(request, fl))
                                 continue;
                         if (fl->fl_lmops && fl->fl_lmops->lm_expire_lock &&
                                         fl->fl_lmops->lm_expire_lock(fl, 1)) {
                                 spin_unlock(&ctx->flc_lock);
                                 fl->fl_lmops->lm_expire_lock(fl, 0);
                                 spin_lock(&ctx->flc_lock);
                                 goto retry;
                         }
                         if (conflock)
                                 locks_copy_conflock(conflock, fl);

-Dai

>
> --b.
>
>> retry:
>>                  list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
>>                          if (!posix_locks_conflict(request, fl))
>>                                  continue;
>>
>>                          if (checked_fl != fl && fl->fl_lmops &&
>>                                          fl->fl_lmops->lm_expire_lock) {
>>                                  checked_fl = fl;
>>                                  spin_unlock(&ctx->flc_lock);
>>                                  fl->fl_lmops->lm_expire_lock(fl);
>>                                  spin_lock(&ctx->flc_lock);
>>                                  goto retry;
>>                          }
>>
>>                          if (conflock)
>>                                  locks_copy_conflock(conflock, fl);
>>
>> -Dai
>>
>>> --b.
J. Bruce Fields July 1, 2021, 1:16 a.m. UTC | #27
On Wed, Jun 30, 2021 at 04:48:57PM -0700, dai.ngo@oracle.com wrote:
> 
> On 6/30/21 12:24 PM, J. Bruce Fields wrote:
> >On Wed, Jun 30, 2021 at 12:13:35PM -0700, dai.ngo@oracle.com wrote:
> >>On 6/30/21 11:55 AM, J. Bruce Fields wrote:
> >>>On Wed, Jun 30, 2021 at 11:49:18AM -0700, dai.ngo@oracle.com wrote:
> >>>>On 6/30/21 11:05 AM, J. Bruce Fields wrote:
> >>>>>On Wed, Jun 30, 2021 at 10:51:27AM -0700, dai.ngo@oracle.com wrote:
> >>>>>>>On 6/28/21 1:23 PM, J. Bruce Fields wrote:
> >>>>>>>>where ->fl_expire_lock is a new lock callback with second
> >>>>>>>>argument "check"
> >>>>>>>>where:
> >>>>>>>>
> >>>>>>>>     check = 1 means: just check whether this lock could be freed
> >>>>>>Why do we need this, is there a use case for it? can we just always try
> >>>>>>to expire the lock and return success/fail?
> >>>>>We can't expire the client while holding the flc_lock.  And once we drop
> >>>>>that lock we need to restart the loop.  Clearly we can't do that every
> >>>>>time.
> >>>>>
> >>>>>(So, my code was wrong, it should have been:
> >>>>>
> >>>>>
> >>>>>	if (fl->fl_lops->fl_expire_lock(fl, 1)) {
> >>>>>		spin_unlock(&ct->flc_lock);
> >>>>>		fl->fl_lops->fl_expire_locks(fl, 0);
> >>>>>		goto retry;
> >>>>>	}
> >>>>>
> >>>>>)
> >>>>This is what I currently have:
> >>>>
> >>>>retry:
> >>>>                 list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
> >>>>                         if (!posix_locks_conflict(request, fl))
> >>>>                                 continue;
> >>>>
> >>>>                         if (fl->fl_lmops && fl->fl_lmops->lm_expire_lock) {
> >>>>                                 spin_unlock(&ctx->flc_lock);
> >>>>                                 ret = fl->fl_lmops->lm_expire_lock(fl, 0);
> >>>>                                 spin_lock(&ctx->flc_lock);
> >>>>                                 if (ret)
> >>>>                                         goto retry;
> >>>We have to retry regardless of the return value.  Once we've dropped
> >>>flc_lock, it's not safe to continue trying to iterate through the list.
> >>Yes, thanks!
> >>
> >>>>                         }
> >>>>
> >>>>                         if (conflock)
> >>>>                                 locks_copy_conflock(conflock, fl);
> >>>>
> >>>>>But the 1 and 0 cases are starting to look pretty different; maybe they
> >>>>>should be two different callbacks.
> >>>>why the case of 1 (test only) is needed,  who would use this call?
> >>>We need to avoid dropping the spinlock in the case there are no clients
> >>>to expire, otherwise we'll make no forward progress.
> >>I think we can remember the last checked file_lock and skip it:
> >I doubt that works in the case there are multiple locks with
> >lm_expire_lock set.
> >
> >If you really don't want another callback here, maybe you could set some
> >kind of flag on the lock.
> >
> >At the time a client expires, you're going to have to walk all of its
> >locks to see if anyone's waiting for them.  At the same time maybe you
> >could set an FL_EXPIRABLE flag on all those locks, and test for that
> >here.
> >
> >If the network partition heals and the client comes back, you'd have to
> >remember to clear that flag again.
> 
> It's too much unnecessary work.
> 
> Would this suffice:
> 
> retry:
>                 list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
>                         if (!posix_locks_conflict(request, fl))
>                                 continue;
>                         if (fl->fl_lmops && fl->fl_lmops->lm_expire_lock &&
>                                         fl->fl_lmops->lm_expire_lock(fl, 1)) {
>                                 spin_unlock(&ctx->flc_lock);
>                                 fl->fl_lmops->lm_expire_lock(fl, 0);
>                                 spin_lock(&ctx->flc_lock);
>                                 goto retry;
>                         }
>                         if (conflock)
>                                 locks_copy_conflock(conflock, fl);

Looks OK to me.

--b.

> 
> -Dai
> 
> >
> >--b.
> >
> >>retry:
> >>                 list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
> >>                         if (!posix_locks_conflict(request, fl))
> >>                                 continue;
> >>
> >>                         if (checked_fl != fl && fl->fl_lmops &&
> >>                                         fl->fl_lmops->lm_expire_lock) {
> >>                                 checked_fl = fl;
> >>                                 spin_unlock(&ctx->flc_lock);
> >>                                 fl->fl_lmops->lm_expire_lock(fl);
> >>                                 spin_lock(&ctx->flc_lock);
> >>                                 goto retry;
> >>                         }
> >>
> >>                         if (conflock)
> >>                                 locks_copy_conflock(conflock, fl);
> >>
> >>-Dai
> >>
> >>>--b.

Patch

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index b517a8794400..969995872752 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -171,6 +171,7 @@  renew_client_locked(struct nfs4_client *clp)
 
 	list_move_tail(&clp->cl_lru, &nn->client_lru);
 	clp->cl_time = ktime_get_boottime_seconds();
+	clear_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags);
 }
 
 static void put_client_renew_locked(struct nfs4_client *clp)
@@ -4610,6 +4611,38 @@  static void nfsd_break_one_deleg(struct nfs4_delegation *dp)
 	nfsd4_run_cb(&dp->dl_recall);
 }
 
+/*
+ * Set rq_conflict_client if the conflicting client has expired,
+ * and return true; otherwise return false.
+ */
+static bool
+nfsd_set_conflict_client(struct nfs4_delegation *dp)
+{
+	struct svc_rqst *rqst;
+	struct nfs4_client *clp;
+	struct nfsd_net *nn;
+	bool ret;
+
+	if (!i_am_nfsd())
+		return false;
+	rqst = kthread_data(current);
+	if (rqst->rq_prog != NFS_PROGRAM || rqst->rq_vers < 4)
+		return false;
+	rqst->rq_conflict_client = NULL;
+	clp = dp->dl_recall.cb_clp;
+	nn = net_generic(clp->net, nfsd_net_id);
+	spin_lock(&nn->client_lock);
+
+	if (test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags) &&
+				!mark_client_expired_locked(clp)) {
+		rqst->rq_conflict_client = clp;
+		ret = true;
+	} else
+		ret = false;
+	spin_unlock(&nn->client_lock);
+	return ret;
+}
+
 /* Called from break_lease() with i_lock held. */
 static bool
 nfsd_break_deleg_cb(struct file_lock *fl)
@@ -4618,6 +4651,8 @@  nfsd_break_deleg_cb(struct file_lock *fl)
 	struct nfs4_delegation *dp = (struct nfs4_delegation *)fl->fl_owner;
 	struct nfs4_file *fp = dp->dl_stid.sc_file;
 
+	if (nfsd_set_conflict_client(dp))
+		return false;
 	trace_nfsd_deleg_break(&dp->dl_stid.sc_stateid);
 
 	/*
@@ -5237,6 +5272,22 @@  static void nfsd4_deleg_xgrade_none_ext(struct nfsd4_open *open,
 	 */
 }
 
+static bool
+nfs4_destroy_courtesy_client(struct nfs4_client *clp)
+{
+	struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);
+
+	spin_lock(&nn->client_lock);
+	if (!test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags) ||
+			mark_client_expired_locked(clp)) {
+		spin_unlock(&nn->client_lock);
+		return false;
+	}
+	spin_unlock(&nn->client_lock);
+	expire_client(clp);
+	return true;
+}
+
 __be32
 nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nfsd4_open *open)
 {
@@ -5286,7 +5337,13 @@  nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
 			goto out;
 		}
 	} else {
+		rqstp->rq_conflict_client = NULL;
 		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
+		if (status == nfserr_jukebox && rqstp->rq_conflict_client) {
+			if (nfs4_destroy_courtesy_client(rqstp->rq_conflict_client))
+				status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
+		}
+
 		if (status) {
 			stp->st_stid.sc_type = NFS4_CLOSED_STID;
 			release_open_stateid(stp);
@@ -5472,6 +5529,8 @@  nfs4_laundromat(struct nfsd_net *nn)
 	};
 	struct nfs4_cpntf_state *cps;
 	copy_stateid_t *cps_t;
+	struct nfs4_stid *stid;
+	int id = 0;
 	int i;
 
 	if (clients_still_reclaiming(nn)) {
@@ -5495,6 +5554,15 @@  nfs4_laundromat(struct nfsd_net *nn)
 		clp = list_entry(pos, struct nfs4_client, cl_lru);
 		if (!state_expired(&lt, clp->cl_time))
 			break;
+
+		spin_lock(&clp->cl_lock);
+		stid = idr_get_next(&clp->cl_stateids, &id);
+		spin_unlock(&clp->cl_lock);
+		if (stid) {
+			/* client still has states */
+			set_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags);
+			continue;
+		}
 		if (mark_client_expired_locked(clp)) {
 			trace_nfsd_clid_expired(&clp->cl_clientid);
 			continue;
@@ -6440,7 +6508,7 @@  static const struct lock_manager_operations nfsd_posix_mng_ops  = {
 	.lm_put_owner = nfsd4_fl_put_owner,
 };
 
-static inline void
+static inline int
 nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
 {
 	struct nfs4_lockowner *lo;
@@ -6453,6 +6521,8 @@  nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
 			/* We just don't care that much */
 			goto nevermind;
 		deny->ld_clientid = lo->lo_owner.so_client->cl_clientid;
+		if (nfs4_destroy_courtesy_client(lo->lo_owner.so_client))
+			return -EAGAIN;
 	} else {
 nevermind:
 		deny->ld_owner.len = 0;
@@ -6467,6 +6537,7 @@  nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
 	deny->ld_type = NFS4_READ_LT;
 	if (fl->fl_type != F_RDLCK)
 		deny->ld_type = NFS4_WRITE_LT;
+	return 0;
 }
 
 static struct nfs4_lockowner *
@@ -6734,6 +6805,7 @@  nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	unsigned int fl_flags = FL_POSIX;
 	struct net *net = SVC_NET(rqstp);
 	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
+	bool retried = false;
 
 	dprintk("NFSD: nfsd4_lock: start=%Ld length=%Ld\n",
 		(long long) lock->lk_offset,
@@ -6860,7 +6932,7 @@  nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 		list_add_tail(&nbl->nbl_lru, &nn->blocked_locks_lru);
 		spin_unlock(&nn->blocked_locks_lock);
 	}
-
+again:
 	err = vfs_lock_file(nf->nf_file, F_SETLK, file_lock, conflock);
 	switch (err) {
 	case 0: /* success! */
@@ -6875,7 +6947,12 @@  nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	case -EAGAIN:		/* conflock holds conflicting lock */
 		status = nfserr_denied;
 		dprintk("NFSD: nfsd4_lock: conflicting lock found!\n");
-		nfs4_set_lock_denied(conflock, &lock->lk_denied);
+
+		/* try again if the conflict is with a courtesy client */
+		if (nfs4_set_lock_denied(conflock, &lock->lk_denied) == -EAGAIN && !retried) {
+			retried = true;
+			goto again;
+		}
 		break;
 	case -EDEADLK:
 		status = nfserr_deadlock;
@@ -6962,6 +7039,8 @@  nfsd4_lockt(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	struct nfs4_lockowner *lo = NULL;
 	__be32 status;
 	struct nfsd_net *nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
+	bool retried = false;
+	int ret;
 
 	if (locks_in_grace(SVC_NET(rqstp)))
 		return nfserr_grace;
@@ -7010,14 +7089,19 @@  nfsd4_lockt(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	file_lock->fl_end = last_byte_offset(lockt->lt_offset, lockt->lt_length);
 
 	nfs4_transform_lock_offset(file_lock);
-
+again:
 	status = nfsd_test_lock(rqstp, &cstate->current_fh, file_lock);
 	if (status)
 		goto out;
 
 	if (file_lock->fl_type != F_UNLCK) {
 		status = nfserr_denied;
-		nfs4_set_lock_denied(file_lock, &lockt->lt_denied);
+		ret = nfs4_set_lock_denied(file_lock, &lockt->lt_denied);
+		if (ret == -EAGAIN && !retried) {
+			retried = true;
+			fh_clear_wcc(&cstate->current_fh);
+			goto again;
+		}
 	}
 out:
 	if (lo)
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index e73bdbb1634a..bdc3605e3722 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -345,6 +345,7 @@  struct nfs4_client {
 #define NFSD4_CLIENT_UPCALL_LOCK	(5)	/* upcall serialization */
 #define NFSD4_CLIENT_CB_FLAG_MASK	(1 << NFSD4_CLIENT_CB_UPDATE | \
 					 1 << NFSD4_CLIENT_CB_KILL)
+#define NFSD4_CLIENT_COURTESY		(6)	/* be nice to expired client */
 	unsigned long		cl_flags;
 	const struct cred	*cl_cb_cred;
 	struct rpc_clnt		*cl_cb_client;
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index e91d51ea028b..2f0382f9d0ff 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -304,6 +304,7 @@  struct svc_rqst {
 						 * net namespace
 						 */
 	void **			rq_lease_breaker; /* The v4 client breaking a lease */
+	void			*rq_conflict_client;
 };
 
 #define SVC_NET(rqst) (rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)