diff mbox series

[RFC] NFSD: Fix possible sleep during nfsd4_release_lockowner()

Message ID 165323344948.2381.7808135229977810927.stgit@bazille.1015granger.net (mailing list archive)
State New
Headers show
Series [RFC] NFSD: Fix possible sleep during nfsd4_release_lockowner() | expand

Commit Message

Chuck Lever May 22, 2022, 3:38 p.m. UTC
nfsd4_release_lockowner() holds clp->cl_lock when it calls
check_for_locks(). However, check_for_locks() calls nfsd_file_get()
/ nfsd_file_put() to access the backing inode's flc_posix list, and
nfsd_file_put() can sleep if the inode was recently removed.

Let's instead rely on the stateowner's reference count to gate
whether the release is permitted. This should be a reliable
indication of locks-in-use since file lock operations and
->lm_get_owner take appropriate references, which are released
appropriately when file locks are removed.

Reported-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
---
 fs/nfsd/nfs4state.c |    9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

This might be a naive approach, but let's start with it.

This passes light testing, but it's not clear how much our existing
fleet of tests exercises this area. I've locally built a couple of
pynfs tests (one is based on the one Dai posted last week) and they
pass too.

I don't believe that FREE_STATEID needs the same simplification.

Comments

Jeff Layton May 23, 2022, 1:40 p.m. UTC | #1
On Sun, 2022-05-22 at 11:38 -0400, Chuck Lever wrote:
> nfsd4_release_lockowner() holds clp->cl_lock when it calls
> check_for_locks(). However, check_for_locks() calls nfsd_file_get()
> / nfsd_file_put() to access the backing inode's flc_posix list, and
> nfsd_file_put() can sleep if the inode was recently removed.
> 

It might be good to add a might_sleep() to nfsd_file_put?

> Let's instead rely on the stateowner's reference count to gate
> whether the release is permitted. This should be a reliable
> indication of locks-in-use since file lock operations and
> ->lm_get_owner take appropriate references, which are released
> appropriately when file locks are removed.
> 
> Reported-by: Dai Ngo <dai.ngo@oracle.com>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> Cc: stable@vger.kernel.org
> ---
>  fs/nfsd/nfs4state.c |    9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> This might be a naive approach, but let's start with it.
> 
> This passes light testing, but it's not clear how much our existing
> fleet of tests exercises this area. I've locally built a couple of
> pynfs tests (one is based on the one Dai posted last week) and they
> pass too.
> 
> I don't believe that FREE_STATEID needs the same simplification.
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index a280256cbb03..b77894e668a4 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -7559,12 +7559,9 @@ nfsd4_release_lockowner(struct svc_rqst *rqstp,
>  
>  		/* see if there are still any locks associated with it */
>  		lo = lockowner(sop);
> -		list_for_each_entry(stp, &sop->so_stateids, st_perstateowner) {
> -			if (check_for_locks(stp->st_stid.sc_file, lo)) {
> -				status = nfserr_locks_held;
> -				spin_unlock(&clp->cl_lock);
> -				return status;
> -			}
> +		if (atomic_read(&sop->so_count) > 1) {
> +			spin_unlock(&clp->cl_lock);
> +			return nfserr_locks_held;
>  		}
>  
>  		nfs4_get_stateowner(sop);
> 
> 

lm_get_owner is called from locks_copy_conflock, so if someone else
happens to be doing a LOCKT or F_GETLK call at the same time that
RELEASE_LOCKOWNER gets called, then this may end up returning an error
inappropriately. My guess is that that would be pretty hard to hit the
timing right, but not impossible.

What we may want to do is have the kernel do this check and only if it
comes back >1 do the actual check for locks. That won't fix the original
problem though.

In other places in nfsd, we've plumbed in a dispose_list head and
deferred the sleeping functions until the spinlock can be dropped. I
haven't looked closely at whether that's possible here, but it may be a
more reliable approach.
Chuck Lever May 23, 2022, 3 p.m. UTC | #2
> On May 23, 2022, at 9:40 AM, Jeff Layton <jlayton@kernel.org> wrote:
> 
> On Sun, 2022-05-22 at 11:38 -0400, Chuck Lever wrote:
>> nfsd4_release_lockowner() holds clp->cl_lock when it calls
>> check_for_locks(). However, check_for_locks() calls nfsd_file_get()
>> / nfsd_file_put() to access the backing inode's flc_posix list, and
>> nfsd_file_put() can sleep if the inode was recently removed.
>> 
> 
> It might be good to add a might_sleep() to nfsd_file_put?

I intend to include the patch you reviewed last week that
adds the might_sleep(), as part of this series.


>> Let's instead rely on the stateowner's reference count to gate
>> whether the release is permitted. This should be a reliable
>> indication of locks-in-use since file lock operations and
>> ->lm_get_owner take appropriate references, which are released
>> appropriately when file locks are removed.
>> 
>> Reported-by: Dai Ngo <dai.ngo@oracle.com>
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> Cc: stable@vger.kernel.org
>> ---
>> fs/nfsd/nfs4state.c |    9 +++------
>> 1 file changed, 3 insertions(+), 6 deletions(-)
>> 
>> This might be a naive approach, but let's start with it.
>> 
>> This passes light testing, but it's not clear how much our existing
>> fleet of tests exercises this area. I've locally built a couple of
>> pynfs tests (one is based on the one Dai posted last week) and they
>> pass too.
>> 
>> I don't believe that FREE_STATEID needs the same simplification.
>> 
>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>> index a280256cbb03..b77894e668a4 100644
>> --- a/fs/nfsd/nfs4state.c
>> +++ b/fs/nfsd/nfs4state.c
>> @@ -7559,12 +7559,9 @@ nfsd4_release_lockowner(struct svc_rqst *rqstp,
>> 
>> 		/* see if there are still any locks associated with it */
>> 		lo = lockowner(sop);
>> -		list_for_each_entry(stp, &sop->so_stateids, st_perstateowner) {
>> -			if (check_for_locks(stp->st_stid.sc_file, lo)) {
>> -				status = nfserr_locks_held;
>> -				spin_unlock(&clp->cl_lock);
>> -				return status;
>> -			}
>> +		if (atomic_read(&sop->so_count) > 1) {
>> +			spin_unlock(&clp->cl_lock);
>> +			return nfserr_locks_held;
>> 		}
>> 
>> 		nfs4_get_stateowner(sop);
>> 
>> 
> 
> lm_get_owner is called from locks_copy_conflock, so if someone else
> happens to be doing a LOCKT or F_GETLK call at the same time that
> RELEASE_LOCKOWNER gets called, then this may end up returning an error
> inappropriately.

IMO releasing the lockowner while it's being used for _anything_
seems risky and surprising. If RELEASE_LOCKOWNER succeeds while
the client is still using the lockowner for any reason, a
subsequent error will occur if the client tries to use it again.
Heck, I can see the server failing in mid-COMPOUND with this kind
of race. Better I think to just leave the lockowner in place if
there's any ambiguity.

The spec language does not say RELEASE_LOCKOWNER must not return
LOCKS_HELD for other reasons, and it does say that there is no
choice of using another NFSERR value (RFC 7530 Section 13.2).


> My guess is that that would be pretty hard to hit the
> timing right, but not impossible.
> 
> What we may want to do is have the kernel do this check and only if it
> comes back >1 do the actual check for locks. That won't fix the original
> problem though.
> 
> In other places in nfsd, we've plumbed in a dispose_list head and
> deferred the sleeping functions until the spinlock can be dropped. I
> haven't looked closely at whether that's possible here, but it may be a
> more reliable approach.

That was proposed by Dai last week.

https://lore.kernel.org/linux-nfs/1653079929-18283-1-git-send-email-dai.ngo@oracle.com/T/#u

Trond pointed out that if two separate clients were releasing a
lockowner on the same inode, there is nothing that protects the
dispose_list, and it would get corrupted.

https://lore.kernel.org/linux-nfs/31E87CEF-C83D-4FA8-A774-F2C389011FCE@oracle.com/T/#mf1fc1ae0503815c0a36ae75a95086c3eff892614


--
Chuck Lever
Jeff Layton May 23, 2022, 3:26 p.m. UTC | #3
On Mon, 2022-05-23 at 15:00 +0000, Chuck Lever III wrote:
> 
> > On May 23, 2022, at 9:40 AM, Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Sun, 2022-05-22 at 11:38 -0400, Chuck Lever wrote:
> > > nfsd4_release_lockowner() holds clp->cl_lock when it calls
> > > check_for_locks(). However, check_for_locks() calls nfsd_file_get()
> > > / nfsd_file_put() to access the backing inode's flc_posix list, and
> > > nfsd_file_put() can sleep if the inode was recently removed.
> > > 
> > 
> > It might be good to add a might_sleep() to nfsd_file_put?
> 
> I intend to include the patch you reviewed last week that
> adds the might_sleep(), as part of this series.
> 
> 
> > > Let's instead rely on the stateowner's reference count to gate
> > > whether the release is permitted. This should be a reliable
> > > indication of locks-in-use since file lock operations and
> > > ->lm_get_owner take appropriate references, which are released
> > > appropriately when file locks are removed.
> > > 
> > > Reported-by: Dai Ngo <dai.ngo@oracle.com>
> > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > Cc: stable@vger.kernel.org
> > > ---
> > > fs/nfsd/nfs4state.c |    9 +++------
> > > 1 file changed, 3 insertions(+), 6 deletions(-)
> > > 
> > > This might be a naive approach, but let's start with it.
> > > 
> > > This passes light testing, but it's not clear how much our existing
> > > fleet of tests exercises this area. I've locally built a couple of
> > > pynfs tests (one is based on the one Dai posted last week) and they
> > > pass too.
> > > 
> > > I don't believe that FREE_STATEID needs the same simplification.
> > > 
> > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > index a280256cbb03..b77894e668a4 100644
> > > --- a/fs/nfsd/nfs4state.c
> > > +++ b/fs/nfsd/nfs4state.c
> > > @@ -7559,12 +7559,9 @@ nfsd4_release_lockowner(struct svc_rqst *rqstp,
> > > 
> > > 		/* see if there are still any locks associated with it */
> > > 		lo = lockowner(sop);
> > > -		list_for_each_entry(stp, &sop->so_stateids, st_perstateowner) {
> > > -			if (check_for_locks(stp->st_stid.sc_file, lo)) {
> > > -				status = nfserr_locks_held;
> > > -				spin_unlock(&clp->cl_lock);
> > > -				return status;
> > > -			}
> > > +		if (atomic_read(&sop->so_count) > 1) {
> > > +			spin_unlock(&clp->cl_lock);
> > > +			return nfserr_locks_held;
> > > 		}
> > > 
> > > 		nfs4_get_stateowner(sop);
> > > 
> > > 
> > 
> > lm_get_owner is called from locks_copy_conflock, so if someone else
> > happens to be doing a LOCKT or F_GETLK call at the same time that
> > RELEASE_LOCKOWNER gets called, then this may end up returning an error
> > inappropriately.
> 
> IMO releasing the lockowner while it's being used for _anything_
> seems risky and surprising. If RELEASE_LOCKOWNER succeeds while
> the client is still using the lockowner for any reason, a
> subsequent error will occur if the client tries to use it again.
> Heck, I can see the server failing in mid-COMPOUND with this kind
> of race. Better I think to just leave the lockowner in place if
> there's any ambiguity.
> 

The problem here is not the client itself calling RELEASE_LOCKOWNER
while it's still in use, but rather a different client altogether
calling LOCKT (or a local process does a F_GETLK) on an inode where a
lock is held by a client. The LOCKT gets a reference to it (for the
conflock), while the client that has the lockowner releases the lock and
then the lockowner while the refcount is still high.

The race window for this is probably quite small, but I think it's
theoretically possible. The point is that an elevated refcount on the
lockowner doesn't necessarily mean that locks are actually being held by
it.

> The spec language does not say RELEASE_LOCKOWNER must not return
> LOCKS_HELD for other reasons, and it does say that there is no
> choice of using another NFSERR value (RFC 7530 Section 13.2).
> 

What recourse does the client have if this happens? It released all of
its locks and tried to release the lockowner, but the server says "locks
held". Should it just give up at that point? RELEASE_LOCKOWNER is a sort
of a courtesy by the client, I suppose...

> 
> > My guess is that that would be pretty hard to hit the
> > timing right, but not impossible.
> > 
> > What we may want to do is have the kernel do this check and only if it
> > comes back >1 do the actual check for locks. That won't fix the original
> > problem though.
> > 
> > In other places in nfsd, we've plumbed in a dispose_list head and
> > deferred the sleeping functions until the spinlock can be dropped. I
> > haven't looked closely at whether that's possible here, but it may be a
> > more reliable approach.
> 
> That was proposed by Dai last week.
> 
> https://lore.kernel.org/linux-nfs/1653079929-18283-1-git-send-email-dai.ngo@oracle.com/T/#u
> 
> Trond pointed out that if two separate clients were releasing a
> lockowner on the same inode, there is nothing that protects the
> dispose_list, and it would get corrupted.
> 
> https://lore.kernel.org/linux-nfs/31E87CEF-C83D-4FA8-A774-F2C389011FCE@oracle.com/T/#mf1fc1ae0503815c0a36ae75a95086c3eff892614
> 

Yeah, that doesn't look like what's needed.

What I was going to suggest is a nfsd_file_put variant that takes a
list_head. If the refcount goes to zero and the thing ends up being
unhashed, then you put it on the dispose list rather than doing the
blocking operations, and then clean it up later.

That said, nfsd_file_put has grown significantly in complexity over the
years, so maybe that's not simple to do now.
Chuck Lever May 23, 2022, 3:41 p.m. UTC | #4
> On May 23, 2022, at 11:26 AM, Jeff Layton <jlayton@kernel.org> wrote:
> 
> On Mon, 2022-05-23 at 15:00 +0000, Chuck Lever III wrote:
>> 
>>> On May 23, 2022, at 9:40 AM, Jeff Layton <jlayton@kernel.org> wrote:
>>> 
>>> On Sun, 2022-05-22 at 11:38 -0400, Chuck Lever wrote:
>>>> nfsd4_release_lockowner() holds clp->cl_lock when it calls
>>>> check_for_locks(). However, check_for_locks() calls nfsd_file_get()
>>>> / nfsd_file_put() to access the backing inode's flc_posix list, and
>>>> nfsd_file_put() can sleep if the inode was recently removed.
>>>> 
>>> 
>>> It might be good to add a might_sleep() to nfsd_file_put?
>> 
>> I intend to include the patch you reviewed last week that
>> adds the might_sleep(), as part of this series.
>> 
>> 
>>>> Let's instead rely on the stateowner's reference count to gate
>>>> whether the release is permitted. This should be a reliable
>>>> indication of locks-in-use since file lock operations and
>>>> ->lm_get_owner take appropriate references, which are released
>>>> appropriately when file locks are removed.
>>>> 
>>>> Reported-by: Dai Ngo <dai.ngo@oracle.com>
>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>> Cc: stable@vger.kernel.org
>>>> ---
>>>> fs/nfsd/nfs4state.c |    9 +++------
>>>> 1 file changed, 3 insertions(+), 6 deletions(-)
>>>> 
>>>> This might be a naive approach, but let's start with it.
>>>> 
>>>> This passes light testing, but it's not clear how much our existing
>>>> fleet of tests exercises this area. I've locally built a couple of
>>>> pynfs tests (one is based on the one Dai posted last week) and they
>>>> pass too.
>>>> 
>>>> I don't believe that FREE_STATEID needs the same simplification.
>>>> 
>>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>>>> index a280256cbb03..b77894e668a4 100644
>>>> --- a/fs/nfsd/nfs4state.c
>>>> +++ b/fs/nfsd/nfs4state.c
>>>> @@ -7559,12 +7559,9 @@ nfsd4_release_lockowner(struct svc_rqst *rqstp,
>>>> 
>>>> 		/* see if there are still any locks associated with it */
>>>> 		lo = lockowner(sop);
>>>> -		list_for_each_entry(stp, &sop->so_stateids, st_perstateowner) {
>>>> -			if (check_for_locks(stp->st_stid.sc_file, lo)) {
>>>> -				status = nfserr_locks_held;
>>>> -				spin_unlock(&clp->cl_lock);
>>>> -				return status;
>>>> -			}
>>>> +		if (atomic_read(&sop->so_count) > 1) {
>>>> +			spin_unlock(&clp->cl_lock);
>>>> +			return nfserr_locks_held;
>>>> 		}
>>>> 
>>>> 		nfs4_get_stateowner(sop);
>>>> 
>>>> 
>>> 
>>> lm_get_owner is called from locks_copy_conflock, so if someone else
>>> happens to be doing a LOCKT or F_GETLK call at the same time that
>>> RELEASE_LOCKOWNER gets called, then this may end up returning an error
>>> inappropriately.
>> 
>> IMO releasing the lockowner while it's being used for _anything_
>> seems risky and surprising. If RELEASE_LOCKOWNER succeeds while
>> the client is still using the lockowner for any reason, a
>> subsequent error will occur if the client tries to use it again.
>> Heck, I can see the server failing in mid-COMPOUND with this kind
>> of race. Better I think to just leave the lockowner in place if
>> there's any ambiguity.
>> 
> 
> The problem here is not the client itself calling RELEASE_LOCKOWNER
> while it's still in use, but rather a different client altogether
> calling LOCKT (or a local process does a F_GETLK) on an inode where a
> lock is held by a client. The LOCKT gets a reference to it (for the
> conflock), while the client that has the lockowner releases the lock and
> then the lockowner while the refcount is still high.
> 
> The race window for this is probably quite small, but I think it's
> theoretically possible. The point is that an elevated refcount on the
> lockowner doesn't necessarily mean that locks are actually being held by
> it.

Sure, I get that the lockowner's reference count is not 100%
reliable. The question is whether it's good enough.

We are looking for a mechanism that can simply count the number
of locks held by a lockowner. It sounds like you believe that
lm_get_owner / put_owner might not be a reliable way to do
that.


>> The spec language does not say RELEASE_LOCKOWNER must not return
>> LOCKS_HELD for other reasons, and it does say that there is no
>> choice of using another NFSERR value (RFC 7530 Section 13.2).
>> 
> 
> What recourse does the client have if this happens? It released all of
> its locks and tried to release the lockowner, but the server says "locks
> held". Should it just give up at that point? RELEASE_LOCKOWNER is a sort
> of a courtesy by the client, I suppose...

RELEASE_LOCKOWNER is a courtesy for the server. Most clients
ignore the return code IIUC.

So the hazard caused by this race would be a small resource
leak on the server that would go away once the client's lease
was purged.


>>> My guess is that that would be pretty hard to hit the
>>> timing right, but not impossible.
>>> 
>>> What we may want to do is have the kernel do this check and only if it
>>> comes back >1 do the actual check for locks. That won't fix the original
>>> problem though.
>>> 
>>> In other places in nfsd, we've plumbed in a dispose_list head and
>>> deferred the sleeping functions until the spinlock can be dropped. I
>>> haven't looked closely at whether that's possible here, but it may be a
>>> more reliable approach.
>> 
>> That was proposed by Dai last week.
>> 
>> https://lore.kernel.org/linux-nfs/1653079929-18283-1-git-send-email-dai.ngo@oracle.com/T/#u
>> 
>> Trond pointed out that if two separate clients were releasing a
>> lockowner on the same inode, there is nothing that protects the
>> dispose_list, and it would get corrupted.
>> 
>> https://lore.kernel.org/linux-nfs/31E87CEF-C83D-4FA8-A774-F2C389011FCE@oracle.com/T/#mf1fc1ae0503815c0a36ae75a95086c3eff892614
>> 
> 
> Yeah, that doesn't look like what's needed.
> 
> What I was going to suggest is a nfsd_file_put variant that takes a
> list_head. If the refcount goes to zero and the thing ends up being
> unhashed, then you put it on the dispose list rather than doing the
> blocking operations, and then clean it up later.

Trond doesn't like that approach; see the e-mail thread.


> That said, nfsd_file_put has grown significantly in complexity over the
> years, so maybe that's not simple to do now.
> -- 
> Jeff Layton <jlayton@kernel.org>

--
Chuck Lever
Jeff Layton May 23, 2022, 4:37 p.m. UTC | #5
On Mon, 2022-05-23 at 15:41 +0000, Chuck Lever III wrote:
> 
> > On May 23, 2022, at 11:26 AM, Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Mon, 2022-05-23 at 15:00 +0000, Chuck Lever III wrote:
> > > 
> > > > On May 23, 2022, at 9:40 AM, Jeff Layton <jlayton@kernel.org> wrote:
> > > > 
> > > > On Sun, 2022-05-22 at 11:38 -0400, Chuck Lever wrote:
> > > > > nfsd4_release_lockowner() holds clp->cl_lock when it calls
> > > > > check_for_locks(). However, check_for_locks() calls nfsd_file_get()
> > > > > / nfsd_file_put() to access the backing inode's flc_posix list, and
> > > > > nfsd_file_put() can sleep if the inode was recently removed.
> > > > > 
> > > > 
> > > > It might be good to add a might_sleep() to nfsd_file_put?
> > > 
> > > I intend to include the patch you reviewed last week that
> > > adds the might_sleep(), as part of this series.
> > > 
> > > 
> > > > > Let's instead rely on the stateowner's reference count to gate
> > > > > whether the release is permitted. This should be a reliable
> > > > > indication of locks-in-use since file lock operations and
> > > > > ->lm_get_owner take appropriate references, which are released
> > > > > appropriately when file locks are removed.
> > > > > 
> > > > > Reported-by: Dai Ngo <dai.ngo@oracle.com>
> > > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > > > Cc: stable@vger.kernel.org
> > > > > ---
> > > > > fs/nfsd/nfs4state.c |    9 +++------
> > > > > 1 file changed, 3 insertions(+), 6 deletions(-)
> > > > > 
> > > > > This might be a naive approach, but let's start with it.
> > > > > 
> > > > > This passes light testing, but it's not clear how much our existing
> > > > > fleet of tests exercises this area. I've locally built a couple of
> > > > > pynfs tests (one is based on the one Dai posted last week) and they
> > > > > pass too.
> > > > > 
> > > > > I don't believe that FREE_STATEID needs the same simplification.
> > > > > 
> > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > > > index a280256cbb03..b77894e668a4 100644
> > > > > --- a/fs/nfsd/nfs4state.c
> > > > > +++ b/fs/nfsd/nfs4state.c
> > > > > @@ -7559,12 +7559,9 @@ nfsd4_release_lockowner(struct svc_rqst *rqstp,
> > > > > 
> > > > > 		/* see if there are still any locks associated with it */
> > > > > 		lo = lockowner(sop);
> > > > > -		list_for_each_entry(stp, &sop->so_stateids, st_perstateowner) {
> > > > > -			if (check_for_locks(stp->st_stid.sc_file, lo)) {
> > > > > -				status = nfserr_locks_held;
> > > > > -				spin_unlock(&clp->cl_lock);
> > > > > -				return status;
> > > > > -			}
> > > > > +		if (atomic_read(&sop->so_count) > 1) {
> > > > > +			spin_unlock(&clp->cl_lock);
> > > > > +			return nfserr_locks_held;
> > > > > 		}
> > > > > 
> > > > > 		nfs4_get_stateowner(sop);
> > > > > 
> > > > > 
> > > > 
> > > > lm_get_owner is called from locks_copy_conflock, so if someone else
> > > > happens to be doing a LOCKT or F_GETLK call at the same time that
> > > > RELEASE_LOCKOWNER gets called, then this may end up returning an error
> > > > inappropriately.
> > > 
> > > IMO releasing the lockowner while it's being used for _anything_
> > > seems risky and surprising. If RELEASE_LOCKOWNER succeeds while
> > > the client is still using the lockowner for any reason, a
> > > subsequent error will occur if the client tries to use it again.
> > > Heck, I can see the server failing in mid-COMPOUND with this kind
> > > of race. Better I think to just leave the lockowner in place if
> > > there's any ambiguity.
> > > 
> > 
> > The problem here is not the client itself calling RELEASE_LOCKOWNER
> > while it's still in use, but rather a different client altogether
> > calling LOCKT (or a local process does a F_GETLK) on an inode where a
> > lock is held by a client. The LOCKT gets a reference to it (for the
> > conflock), while the client that has the lockowner releases the lock and
> > then the lockowner while the refcount is still high.
> > 
> > The race window for this is probably quite small, but I think it's
> > theoretically possible. The point is that an elevated refcount on the
> > lockowner doesn't necessarily mean that locks are actually being held by
> > it.
> 
> Sure, I get that the lockowner's reference count is not 100%
> reliable. The question is whether it's good enough.
> 
> We are looking for a mechanism that can simply count the number
> of locks held by a lockowner. It sounds like you believe that
> lm_get_owner / put_owner might not be a reliable way to do
> that.
> 
> 
> > > The spec language does not say RELEASE_LOCKOWNER must not return
> > > LOCKS_HELD for other reasons, and it does say that there is no
> > > choice of using another NFSERR value (RFC 7530 Section 13.2).
> > > 
> > 
> > What recourse does the client have if this happens? It released all of
> > its locks and tried to release the lockowner, but the server says "locks
> > held". Should it just give up at that point? RELEASE_LOCKOWNER is a sort
> > of a courtesy by the client, I suppose...
> 
> RELEASE_LOCKOWNER is a courtesy for the server. Most clients
> ignore the return code IIUC.
> 
> So the hazard caused by this race would be a small resource
> leak on the server that would go away once the client's lease
> was purged.
> 
> 
> > > > My guess is that that would be pretty hard to hit the
> > > > timing right, but not impossible.
> > > > 
> > > > What we may want to do is have the kernel do this check and only if it
> > > > comes back >1 do the actual check for locks. That won't fix the original
> > > > problem though.
> > > > 
> > > > In other places in nfsd, we've plumbed in a dispose_list head and
> > > > deferred the sleeping functions until the spinlock can be dropped. I
> > > > haven't looked closely at whether that's possible here, but it may be a
> > > > more reliable approach.
> > > 
> > > That was proposed by Dai last week.
> > > 
> > > https://lore.kernel.org/linux-nfs/1653079929-18283-1-git-send-email-dai.ngo@oracle.com/T/#u
> > > 
> > > Trond pointed out that if two separate clients were releasing a
> > > lockowner on the same inode, there is nothing that protects the
> > > dispose_list, and it would get corrupted.
> > > 
> > > https://lore.kernel.org/linux-nfs/31E87CEF-C83D-4FA8-A774-F2C389011FCE@oracle.com/T/#mf1fc1ae0503815c0a36ae75a95086c3eff892614
> > > 
> > 
> > Yeah, that doesn't look like what's needed.
> > 
> > What I was going to suggest is a nfsd_file_put variant that takes a
> > list_head. If the refcount goes to zero and the thing ends up being
> > unhashed, then you put it on the dispose list rather than doing the
> > blocking operations, and then clean it up later.
> 
> Trond doesn't like that approach; see the e-mail thread.
> 

I didn't see him saying that that would be wrong, per-se, but the
initial implementation was racy.

His suggestion was just to keep a counter in the lockowner of how many
locks are associated with it. That seems like a good suggestion, though
you'd probably need to add a parameter to lm_get_owner to indicate
whether you were adding a new lock or just doing a conflock copy.

Checking the object refcount like this patch does seems wrong though.
Chuck Lever May 23, 2022, 5:25 p.m. UTC | #6
> On May 23, 2022, at 12:37 PM, Jeff Layton <jlayton@kernel.org> wrote:
> 
> On Mon, 2022-05-23 at 15:41 +0000, Chuck Lever III wrote:
>> 
>>> On May 23, 2022, at 11:26 AM, Jeff Layton <jlayton@kernel.org> wrote:
>>> 
>>> What I was going to suggest is a nfsd_file_put variant that takes a
>>> list_head. If the refcount goes to zero and the thing ends up being
>>> unhashed, then you put it on the dispose list rather than doing the
>>> blocking operations, and then clean it up later.
>> 
>> Trond doesn't like that approach; see the e-mail thread.
> 
> I didn't see him saying that that would be wrong, per-se, but the
> initial implementation was racy.

I proposed this for check_for_locks() to use:

> void nfsd_file_put_async(struct nfsd_file *nf)
> {
> 	if (refcount_dec_and_test(&nf->nf_ref))
> 		nfsd_file_close_inode(nf->nf_inode);
> }


Trond's response was:

> That approach moves the sync to the garbage collector, which was
> exactly what we're trying to avoid in the first place.

Basically he's opposed to any flush work being done by
the garbage collector because that has a known negative
performance impact.


> His suggestion was just to keep a counter in the lockowner of how many
> locks are associated with it. That seems like a good suggestion, though
> you'd probably need to add a parameter to lm_get_owner to indicate
> whether you were adding a new lock or just doing a conflock copy.

locks_copy_conflock() would need to take a boolean parameter
that callers would set when they actually manipulate a lock.

I would feel more comfortable with that approach if
locks_copy_conflock() was a private API used only in fs/locks.c
so we can audit its usage to ensure that callers are managing
the lock count correctly.


--
Chuck Lever
Jeff Layton May 23, 2022, 5:38 p.m. UTC | #7
On Mon, 2022-05-23 at 17:25 +0000, Chuck Lever III wrote:
> 
> > On May 23, 2022, at 12:37 PM, Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Mon, 2022-05-23 at 15:41 +0000, Chuck Lever III wrote:
> > > 
> > > > On May 23, 2022, at 11:26 AM, Jeff Layton <jlayton@kernel.org> wrote:
> > > > 
> > > > What I was going to suggest is a nfsd_file_put variant that takes a
> > > > list_head. If the refcount goes to zero and the thing ends up being
> > > > unhashed, then you put it on the dispose list rather than doing the
> > > > blocking operations, and then clean it up later.
> > > 
> > > Trond doesn't like that approach; see the e-mail thread.
> > 
> > I didn't see him saying that that would be wrong, per-se, but the
> > initial implementation was racy.
> 
> I proposed this for check_for_locks() to use:
> 
> > void nfsd_file_put_async(struct nfsd_file *nf)
> > {
> > 	if (refcount_dec_and_test(&nf->nf_ref))
> > 		nfsd_file_close_inode(nf->nf_inode);
> > }
> 
> 
> Trond's response was:
> 
> > That approach moves the sync to the garbage collector, which was
> > exactly what we're trying to avoid in the first place.
> 
> Basically he's opposed to any flush work being done by
> the garbage collector because that has a known negative
> performance impact.
> 
> 

Fair enough. I like his other suggestion better anyway.

> > His suggestion was just to keep a counter in the lockowner of how many
> > locks are associated with it. That seems like a good suggestion, though
> > you'd probably need to add a parameter to lm_get_owner to indicate
> > whether you were adding a new lock or just doing a conflock copy.
> 
> locks_copy_conflock() would need to take a boolean parameter
> that callers would set when they actually manipulate a lock.
> 

Yep. You'd also have to add a bool arg to lm_put_owner so that you know
whether you need to decrement the counter.

> I would feel more comfortable with that approach if
> locks_copy_conflock() was a private API used only in fs/locks.c
> so we can audit its usage to ensure that callers are managing
> the lock count correctly.
> 
> 

It basically is. In fact, I'm not sure why it's exported since nothing
outside of locks.c calls it. Looking at the git history, it probably got
exported by mistake when we split up locks_copy_lock and
locks_copy_conflock, as the former was exported.

I think this approach is quite doable...
Trond Myklebust May 23, 2022, 5:43 p.m. UTC | #8
On Mon, 2022-05-23 at 12:37 -0400, Jeff Layton wrote:
> On Mon, 2022-05-23 at 15:41 +0000, Chuck Lever III wrote:
> > 
> > > On May 23, 2022, at 11:26 AM, Jeff Layton <jlayton@kernel.org>
> > > wrote:
> > > 
> > > On Mon, 2022-05-23 at 15:00 +0000, Chuck Lever III wrote:
> > > > 
> > > > > On May 23, 2022, at 9:40 AM, Jeff Layton <jlayton@kernel.org>
> > > > > wrote:
> > > > > 
> > > > > On Sun, 2022-05-22 at 11:38 -0400, Chuck Lever wrote:
> > > > > > nfsd4_release_lockowner() holds clp->cl_lock when it calls
> > > > > > check_for_locks(). However, check_for_locks() calls
> > > > > > nfsd_file_get()
> > > > > > / nfsd_file_put() to access the backing inode's flc_posix
> > > > > > list, and
> > > > > > nfsd_file_put() can sleep if the inode was recently
> > > > > > removed.
> > > > > > 
> > > > > 
> > > > > It might be good to add a might_sleep() to nfsd_file_put?
> > > > 
> > > > I intend to include the patch you reviewed last week that
> > > > adds the might_sleep(), as part of this series.
> > > > 
> > > > 
> > > > > > Let's instead rely on the stateowner's reference count to
> > > > > > gate
> > > > > > whether the release is permitted. This should be a reliable
> > > > > > indication of locks-in-use since file lock operations and
> > > > > > ->lm_get_owner take appropriate references, which are
> > > > > > released
> > > > > > appropriately when file locks are removed.
> > > > > > 
> > > > > > Reported-by: Dai Ngo <dai.ngo@oracle.com>
> > > > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > > > > Cc: stable@vger.kernel.org
> > > > > > ---
> > > > > > fs/nfsd/nfs4state.c |    9 +++------
> > > > > > 1 file changed, 3 insertions(+), 6 deletions(-)
> > > > > > 
> > > > > > This might be a naive approach, but let's start with it.
> > > > > > 
> > > > > > This passes light testing, but it's not clear how much our
> > > > > > existing
> > > > > > fleet of tests exercises this area. I've locally built a
> > > > > > couple of
> > > > > > pynfs tests (one is based on the one Dai posted last week)
> > > > > > and they
> > > > > > pass too.
> > > > > > 
> > > > > > I don't believe that FREE_STATEID needs the same
> > > > > > simplification.
> > > > > > 
> > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > > > > index a280256cbb03..b77894e668a4 100644
> > > > > > --- a/fs/nfsd/nfs4state.c
> > > > > > +++ b/fs/nfsd/nfs4state.c
> > > > > > @@ -7559,12 +7559,9 @@ nfsd4_release_lockowner(struct
> > > > > > svc_rqst *rqstp,
> > > > > > 
> > > > > >                 /* see if there are still any locks
> > > > > > associated with it */
> > > > > >                 lo = lockowner(sop);
> > > > > > -               list_for_each_entry(stp, &sop->so_stateids,
> > > > > > st_perstateowner) {
> > > > > > -                       if (check_for_locks(stp-
> > > > > > >st_stid.sc_file, lo)) {
> > > > > > -                               status = nfserr_locks_held;
> > > > > > -                               spin_unlock(&clp->cl_lock);
> > > > > > -                               return status;
> > > > > > -                       }
> > > > > > +               if (atomic_read(&sop->so_count) > 1) {
> > > > > > +                       spin_unlock(&clp->cl_lock);
> > > > > > +                       return nfserr_locks_held;
> > > > > >                 }
> > > > > > 
> > > > > >                 nfs4_get_stateowner(sop);
> > > > > > 
> > > > > > 
> > > > > 
> > > > > lm_get_owner is called from locks_copy_conflock, so if
> > > > > someone else
> > > > > happens to be doing a LOCKT or F_GETLK call at the same time
> > > > > that
> > > > > RELEASE_LOCKOWNER gets called, then this may end up returning
> > > > > an error
> > > > > inappropriately.
> > > > 
> > > > IMO releasing the lockowner while it's being used for
> > > > _anything_
> > > > seems risky and surprising. If RELEASE_LOCKOWNER succeeds while
> > > > the client is still using the lockowner for any reason, a
> > > > subsequent error will occur if the client tries to use it
> > > > again.
> > > > Heck, I can see the server failing in mid-COMPOUND with this
> > > > kind
> > > > of race. Better I think to just leave the lockowner in place if
> > > > there's any ambiguity.
> > > > 
> > > 
> > > The problem here is not the client itself calling
> > > RELEASE_LOCKOWNER
> > > while it's still in use, but rather a different client altogether
> > > calling LOCKT (or a local process does a F_GETLK) on an inode
> > > where a
> > > lock is held by a client. The LOCKT gets a reference to it (for
> > > the
> > > conflock), while the client that has the lockowner releases the
> > > lock and
> > > then the lockowner while the refcount is still high.
> > > 
> > > The race window for this is probably quite small, but I think
> > > it's
> > > theoretically possible. The point is that an elevated refcount on
> > > the
> > > lockowner doesn't necessarily mean that locks are actually being
> > > held by
> > > it.
> > 
> > Sure, I get that the lockowner's reference count is not 100%
> > reliable. The question is whether it's good enough.
> > 
> > We are looking for a mechanism that can simply count the number
> > of locks held by a lockowner. It sounds like you believe that
> > lm_get_owner / put_owner might not be a reliable way to do
> > that.
> > 
> > 
> > > > The spec language does not say RELEASE_LOCKOWNER must not
> > > > return
> > > > LOCKS_HELD for other reasons, and it does say that there is no
> > > > choice of using another NFSERR value (RFC 7530 Section 13.2).
> > > > 
> > > 
> > > What recourse does the client have if this happens? It released
> > > all of
> > > its locks and tried to release the lockowner, but the server says
> > > "locks
> > > held". Should it just give up at that point? RELEASE_LOCKOWNER is
> > > a sort
> > > of a courtesy by the client, I suppose...
> > 
> > RELEASE_LOCKOWNER is a courtesy for the server. Most clients
> > ignore the return code IIUC.
> > 
> > So the hazard caused by this race would be a small resource
> > leak on the server that would go away once the client's lease
> > was purged.
> > 
> > 
> > > > > My guess is that that would be pretty hard to hit the
> > > > > timing right, but not impossible.
> > > > > 
> > > > > What we may want to do is have the kernel do this check and
> > > > > only if it
> > > > > comes back >1 do the actual check for locks. That won't fix
> > > > > the original
> > > > > problem though.
> > > > > 
> > > > > In other places in nfsd, we've plumbed in a dispose_list head
> > > > > and
> > > > > deferred the sleeping functions until the spinlock can be
> > > > > dropped. I
> > > > > haven't looked closely at whether that's possible here, but
> > > > > it may be a
> > > > > more reliable approach.
> > > > 
> > > > That was proposed by Dai last week.
> > > > 
> > > > https://lore.kernel.org/linux-nfs/1653079929-18283-1-git-send-email-dai.ngo@oracle.com/T/#u
> > > > 
> > > > Trond pointed out that if two separate clients were releasing a
> > > > lockowner on the same inode, there is nothing that protects the
> > > > dispose_list, and it would get corrupted.
> > > > 
> > > > https://lore.kernel.org/linux-nfs/31E87CEF-C83D-4FA8-A774-F2C389011FCE@oracle.com/T/#mf1fc1ae0503815c0a36ae75a95086c3eff892614
> > > > 
> > > 
> > > Yeah, that doesn't look like what's needed.
> > > 
> > > What I was going to suggest is a nfsd_file_put variant that takes
> > > a
> > > list_head. If the refcount goes to zero and the thing ends up
> > > being
> > > unhashed, then you put it on the dispose list rather than doing
> > > the
> > > blocking operations, and then clean it up later.
> > 
> > Trond doesn't like that approach; see the e-mail thread.
> > 
> 
> I didn't see him saying that that would be wrong, per-se, but the
> initial implementation was racy.
> 
> His suggestion was just to keep a counter in the lockowner of how
> many
> locks are associated with it. That seems like a good suggestion,
> though
> you'd probably need to add a parameter to lm_get_owner to indicate
> whether you were adding a new lock or just doing a conflock copy.

I don't think this should be necessary. The posix_lock code doesn't
ever use a struct file_lock that it hasn't allocated itself. We should
always be calling conflock to copy from whatever struct file_lock that
the caller passed as an argument.

IOW: the number of lm_get_owner and lm_put_owner calls should always be
100% balanced once all the locks belonging to a specific lock owner are
removed.

> 
> Checking the object refcount like this patch does seems wrong though.
> 
Yes. This approach does require a separate counter that is only
bumped/decremented in the lock manager callbacks.
Jeff Layton May 23, 2022, 6:04 p.m. UTC | #9
On Mon, 2022-05-23 at 17:43 +0000, Trond Myklebust wrote:
> On Mon, 2022-05-23 at 12:37 -0400, Jeff Layton wrote:
> > On Mon, 2022-05-23 at 15:41 +0000, Chuck Lever III wrote:
> > > 
> > > > On May 23, 2022, at 11:26 AM, Jeff Layton <jlayton@kernel.org>
> > > > wrote:
> > > > 
> > > > On Mon, 2022-05-23 at 15:00 +0000, Chuck Lever III wrote:
> > > > > 
> > > > > > On May 23, 2022, at 9:40 AM, Jeff Layton <jlayton@kernel.org>
> > > > > > wrote:
> > > > > > 
> > > > > > On Sun, 2022-05-22 at 11:38 -0400, Chuck Lever wrote:
> > > > > > > nfsd4_release_lockowner() holds clp->cl_lock when it calls
> > > > > > > check_for_locks(). However, check_for_locks() calls
> > > > > > > nfsd_file_get()
> > > > > > > / nfsd_file_put() to access the backing inode's flc_posix
> > > > > > > list, and
> > > > > > > nfsd_file_put() can sleep if the inode was recently
> > > > > > > removed.
> > > > > > > 
> > > > > > 
> > > > > > It might be good to add a might_sleep() to nfsd_file_put?
> > > > > 
> > > > > I intend to include the patch you reviewed last week that
> > > > > adds the might_sleep(), as part of this series.
> > > > > 
> > > > > 
> > > > > > > Let's instead rely on the stateowner's reference count to
> > > > > > > gate
> > > > > > > whether the release is permitted. This should be a reliable
> > > > > > > indication of locks-in-use since file lock operations and
> > > > > > > ->lm_get_owner take appropriate references, which are
> > > > > > > released
> > > > > > > appropriately when file locks are removed.
> > > > > > > 
> > > > > > > Reported-by: Dai Ngo <dai.ngo@oracle.com>
> > > > > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > > > > > Cc: stable@vger.kernel.org
> > > > > > > ---
> > > > > > > fs/nfsd/nfs4state.c |    9 +++------
> > > > > > > 1 file changed, 3 insertions(+), 6 deletions(-)
> > > > > > > 
> > > > > > > This might be a naive approach, but let's start with it.
> > > > > > > 
> > > > > > > This passes light testing, but it's not clear how much our
> > > > > > > existing
> > > > > > > fleet of tests exercises this area. I've locally built a
> > > > > > > couple of
> > > > > > > pynfs tests (one is based on the one Dai posted last week)
> > > > > > > and they
> > > > > > > pass too.
> > > > > > > 
> > > > > > > I don't believe that FREE_STATEID needs the same
> > > > > > > simplification.
> > > > > > > 
> > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > > > > > index a280256cbb03..b77894e668a4 100644
> > > > > > > --- a/fs/nfsd/nfs4state.c
> > > > > > > +++ b/fs/nfsd/nfs4state.c
> > > > > > > @@ -7559,12 +7559,9 @@ nfsd4_release_lockowner(struct
> > > > > > > svc_rqst *rqstp,
> > > > > > > 
> > > > > > >                 /* see if there are still any locks
> > > > > > > associated with it */
> > > > > > >                 lo = lockowner(sop);
> > > > > > > -               list_for_each_entry(stp, &sop->so_stateids,
> > > > > > > st_perstateowner) {
> > > > > > > -                       if (check_for_locks(stp-
> > > > > > > > st_stid.sc_file, lo)) {
> > > > > > > -                               status = nfserr_locks_held;
> > > > > > > -                               spin_unlock(&clp->cl_lock);
> > > > > > > -                               return status;
> > > > > > > -                       }
> > > > > > > +               if (atomic_read(&sop->so_count) > 1) {
> > > > > > > +                       spin_unlock(&clp->cl_lock);
> > > > > > > +                       return nfserr_locks_held;
> > > > > > >                 }
> > > > > > > 
> > > > > > >                 nfs4_get_stateowner(sop);
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > lm_get_owner is called from locks_copy_conflock, so if
> > > > > > someone else
> > > > > > happens to be doing a LOCKT or F_GETLK call at the same time
> > > > > > that
> > > > > > RELEASE_LOCKOWNER gets called, then this may end up returning
> > > > > > an error
> > > > > > inappropriately.
> > > > > 
> > > > > IMO releasing the lockowner while it's being used for
> > > > > _anything_
> > > > > seems risky and surprising. If RELEASE_LOCKOWNER succeeds while
> > > > > the client is still using the lockowner for any reason, a
> > > > > subsequent error will occur if the client tries to use it
> > > > > again.
> > > > > Heck, I can see the server failing in mid-COMPOUND with this
> > > > > kind
> > > > > of race. Better I think to just leave the lockowner in place if
> > > > > there's any ambiguity.
> > > > > 
> > > > 
> > > > The problem here is not the client itself calling
> > > > RELEASE_LOCKOWNER
> > > > while it's still in use, but rather a different client altogether
> > > > calling LOCKT (or a local process does a F_GETLK) on an inode
> > > > where a
> > > > lock is held by a client. The LOCKT gets a reference to it (for
> > > > the
> > > > conflock), while the client that has the lockowner releases the
> > > > lock and
> > > > then the lockowner while the refcount is still high.
> > > > 
> > > > The race window for this is probably quite small, but I think
> > > > it's
> > > > theoretically possible. The point is that an elevated refcount on
> > > > the
> > > > lockowner doesn't necessarily mean that locks are actually being
> > > > held by
> > > > it.
> > > 
> > > Sure, I get that the lockowner's reference count is not 100%
> > > reliable. The question is whether it's good enough.
> > > 
> > > We are looking for a mechanism that can simply count the number
> > > of locks held by a lockowner. It sounds like you believe that
> > > lm_get_owner / put_owner might not be a reliable way to do
> > > that.
> > > 
> > > 
> > > > > The spec language does not say RELEASE_LOCKOWNER must not
> > > > > return
> > > > > LOCKS_HELD for other reasons, and it does say that there is no
> > > > > choice of using another NFSERR value (RFC 7530 Section 13.2).
> > > > > 
> > > > 
> > > > What recourse does the client have if this happens? It released
> > > > all of
> > > > its locks and tried to release the lockowner, but the server says
> > > > "locks
> > > > held". Should it just give up at that point? RELEASE_LOCKOWNER is
> > > > a sort
> > > > of a courtesy by the client, I suppose...
> > > 
> > > RELEASE_LOCKOWNER is a courtesy for the server. Most clients
> > > ignore the return code IIUC.
> > > 
> > > So the hazard caused by this race would be a small resource
> > > leak on the server that would go away once the client's lease
> > > was purged.
> > > 
> > > 
> > > > > > My guess is that that would be pretty hard to hit the
> > > > > > timing right, but not impossible.
> > > > > > 
> > > > > > What we may want to do is have the kernel do this check and
> > > > > > only if it
> > > > > > comes back >1 do the actual check for locks. That won't fix
> > > > > > the original
> > > > > > problem though.
> > > > > > 
> > > > > > In other places in nfsd, we've plumbed in a dispose_list head
> > > > > > and
> > > > > > deferred the sleeping functions until the spinlock can be
> > > > > > dropped. I
> > > > > > haven't looked closely at whether that's possible here, but
> > > > > > it may be a
> > > > > > more reliable approach.
> > > > > 
> > > > > That was proposed by Dai last week.
> > > > > 
> > > > > https://lore.kernel.org/linux-nfs/1653079929-18283-1-git-send-email-dai.ngo@oracle.com/T/#u
> > > > > 
> > > > > Trond pointed out that if two separate clients were releasing a
> > > > > lockowner on the same inode, there is nothing that protects the
> > > > > dispose_list, and it would get corrupted.
> > > > > 
> > > > > https://lore.kernel.org/linux-nfs/31E87CEF-C83D-4FA8-A774-F2C389011FCE@oracle.com/T/#mf1fc1ae0503815c0a36ae75a95086c3eff892614
> > > > > 
> > > > 
> > > > Yeah, that doesn't look like what's needed.
> > > > 
> > > > What I was going to suggest is a nfsd_file_put variant that takes
> > > > a
> > > > list_head. If the refcount goes to zero and the thing ends up
> > > > being
> > > > unhashed, then you put it on the dispose list rather than doing
> > > > the
> > > > blocking operations, and then clean it up later.
> > > 
> > > Trond doesn't like that approach; see the e-mail thread.
> > > 
> > 
> > I didn't see him saying that that would be wrong, per-se, but the
> > initial implementation was racy.
> > 
> > His suggestion was just to keep a counter in the lockowner of how
> > many
> > locks are associated with it. That seems like a good suggestion,
> > though
> > you'd probably need to add a parameter to lm_get_owner to indicate
> > whether you were adding a new lock or just doing a conflock copy.
> 
> I don't think this should be necessary. The posix_lock code doesn't
> ever use a struct file_lock that it hasn't allocated itself. We should
> always be calling conflock to copy from whatever struct file_lock that
> the caller passed as an argument.
> 
> IOW: the number of lm_get_owner and lm_put_owner calls should always be
> 100% balanced once all the locks belonging to a specific lock owner are
> removed.
> 

We take references to the owner when we go to add a lock record, or when
copying a conflicting lock. You want to keep a count of the former
without counting the latter.

lm_get_owner gets called for both though. I don't see how you can
disambiguate the two situations w/o some way to indicate that. Adding a
bool argument to lm_get_owner/lm_put_owner ops would be pretty simple to
implement, I think.
Trond Myklebust May 23, 2022, 6:21 p.m. UTC | #10
On Mon, 2022-05-23 at 14:04 -0400, Jeff Layton wrote:
> On Mon, 2022-05-23 at 17:43 +0000, Trond Myklebust wrote:
> > On Mon, 2022-05-23 at 12:37 -0400, Jeff Layton wrote:
> > > On Mon, 2022-05-23 at 15:41 +0000, Chuck Lever III wrote:
> > > > 
> > > > > On May 23, 2022, at 11:26 AM, Jeff Layton
> > > > > <jlayton@kernel.org>
> > > > > wrote:
> > > > > 
> > > > > On Mon, 2022-05-23 at 15:00 +0000, Chuck Lever III wrote:
> > > > > > 
> > > > > > > On May 23, 2022, at 9:40 AM, Jeff Layton
> > > > > > > <jlayton@kernel.org>
> > > > > > > wrote:
> > > > > > > 
> > > > > > > On Sun, 2022-05-22 at 11:38 -0400, Chuck Lever wrote:
> > > > > > > > nfsd4_release_lockowner() holds clp->cl_lock when it
> > > > > > > > calls
> > > > > > > > check_for_locks(). However, check_for_locks() calls
> > > > > > > > nfsd_file_get()
> > > > > > > > / nfsd_file_put() to access the backing inode's
> > > > > > > > flc_posix
> > > > > > > > list, and
> > > > > > > > nfsd_file_put() can sleep if the inode was recently
> > > > > > > > removed.
> > > > > > > > 
> > > > > > > 
> > > > > > > It might be good to add a might_sleep() to nfsd_file_put?
> > > > > > 
> > > > > > I intend to include the patch you reviewed last week that
> > > > > > adds the might_sleep(), as part of this series.
> > > > > > 
> > > > > > 
> > > > > > > > Let's instead rely on the stateowner's reference count
> > > > > > > > to
> > > > > > > > gate
> > > > > > > > whether the release is permitted. This should be a
> > > > > > > > reliable
> > > > > > > > indication of locks-in-use since file lock operations
> > > > > > > > and
> > > > > > > > ->lm_get_owner take appropriate references, which are
> > > > > > > > released
> > > > > > > > appropriately when file locks are removed.
> > > > > > > > 
> > > > > > > > Reported-by: Dai Ngo <dai.ngo@oracle.com>
> > > > > > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > > > > > > Cc: stable@vger.kernel.org
> > > > > > > > ---
> > > > > > > > fs/nfsd/nfs4state.c |    9 +++------
> > > > > > > > 1 file changed, 3 insertions(+), 6 deletions(-)
> > > > > > > > 
> > > > > > > > This might be a naive approach, but let's start with
> > > > > > > > it.
> > > > > > > > 
> > > > > > > > This passes light testing, but it's not clear how much
> > > > > > > > our
> > > > > > > > existing
> > > > > > > > fleet of tests exercises this area. I've locally built
> > > > > > > > a
> > > > > > > > couple of
> > > > > > > > pynfs tests (one is based on the one Dai posted last
> > > > > > > > week)
> > > > > > > > and they
> > > > > > > > pass too.
> > > > > > > > 
> > > > > > > > I don't believe that FREE_STATEID needs the same
> > > > > > > > simplification.
> > > > > > > > 
> > > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > > > > > > index a280256cbb03..b77894e668a4 100644
> > > > > > > > --- a/fs/nfsd/nfs4state.c
> > > > > > > > +++ b/fs/nfsd/nfs4state.c
> > > > > > > > @@ -7559,12 +7559,9 @@ nfsd4_release_lockowner(struct
> > > > > > > > svc_rqst *rqstp,
> > > > > > > > 
> > > > > > > >                 /* see if there are still any locks
> > > > > > > > associated with it */
> > > > > > > >                 lo = lockowner(sop);
> > > > > > > > -               list_for_each_entry(stp, &sop-
> > > > > > > > >so_stateids,
> > > > > > > > st_perstateowner) {
> > > > > > > > -                       if (check_for_locks(stp-
> > > > > > > > > st_stid.sc_file, lo)) {
> > > > > > > > -                               status =
> > > > > > > > nfserr_locks_held;
> > > > > > > > -                               spin_unlock(&clp-
> > > > > > > > >cl_lock);
> > > > > > > > -                               return status;
> > > > > > > > -                       }
> > > > > > > > +               if (atomic_read(&sop->so_count) > 1) {
> > > > > > > > +                       spin_unlock(&clp->cl_lock);
> > > > > > > > +                       return nfserr_locks_held;
> > > > > > > >                 }
> > > > > > > > 
> > > > > > > >                 nfs4_get_stateowner(sop);
> > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > > lm_get_owner is called from locks_copy_conflock, so if
> > > > > > > someone else
> > > > > > > happens to be doing a LOCKT or F_GETLK call at the same
> > > > > > > time
> > > > > > > that
> > > > > > > RELEASE_LOCKOWNER gets called, then this may end up
> > > > > > > returning
> > > > > > > an error
> > > > > > > inappropriately.
> > > > > > 
> > > > > > IMO releasing the lockowner while it's being used for
> > > > > > _anything_
> > > > > > seems risky and surprising. If RELEASE_LOCKOWNER succeeds
> > > > > > while
> > > > > > the client is still using the lockowner for any reason, a
> > > > > > subsequent error will occur if the client tries to use it
> > > > > > again.
> > > > > > Heck, I can see the server failing in mid-COMPOUND with
> > > > > > this
> > > > > > kind
> > > > > > of race. Better I think to just leave the lockowner in
> > > > > > place if
> > > > > > there's any ambiguity.
> > > > > > 
> > > > > 
> > > > > The problem here is not the client itself calling
> > > > > RELEASE_LOCKOWNER
> > > > > while it's still in use, but rather a different client
> > > > > altogether
> > > > > calling LOCKT (or a local process does a F_GETLK) on an inode
> > > > > where a
> > > > > lock is held by a client. The LOCKT gets a reference to it
> > > > > (for
> > > > > the
> > > > > conflock), while the client that has the lockowner releases
> > > > > the
> > > > > lock and
> > > > > then the lockowner while the refcount is still high.
> > > > > 
> > > > > The race window for this is probably quite small, but I think
> > > > > it's
> > > > > theoretically possible. The point is that an elevated
> > > > > refcount on
> > > > > the
> > > > > lockowner doesn't necessarily mean that locks are actually
> > > > > being
> > > > > held by
> > > > > it.
> > > > 
> > > > Sure, I get that the lockowner's reference count is not 100%
> > > > reliable. The question is whether it's good enough.
> > > > 
> > > > We are looking for a mechanism that can simply count the number
> > > > of locks held by a lockowner. It sounds like you believe that
> > > > lm_get_owner / put_owner might not be a reliable way to do
> > > > that.
> > > > 
> > > > 
> > > > > > The spec language does not say RELEASE_LOCKOWNER must not
> > > > > > return
> > > > > > LOCKS_HELD for other reasons, and it does say that there is
> > > > > > no
> > > > > > choice of using another NFSERR value (RFC 7530 Section
> > > > > > 13.2).
> > > > > > 
> > > > > 
> > > > > What recourse does the client have if this happens? It
> > > > > released
> > > > > all of
> > > > > its locks and tried to release the lockowner, but the server
> > > > > says
> > > > > "locks
> > > > > held". Should it just give up at that point?
> > > > > RELEASE_LOCKOWNER is
> > > > > a sort
> > > > > of a courtesy by the client, I suppose...
> > > > 
> > > > RELEASE_LOCKOWNER is a courtesy for the server. Most clients
> > > > ignore the return code IIUC.
> > > > 
> > > > So the hazard caused by this race would be a small resource
> > > > leak on the server that would go away once the client's lease
> > > > was purged.
> > > > 
> > > > 
> > > > > > > My guess is that that would be pretty hard to hit the
> > > > > > > timing right, but not impossible.
> > > > > > > 
> > > > > > > What we may want to do is have the kernel do this check
> > > > > > > and
> > > > > > > only if it
> > > > > > > comes back >1 do the actual check for locks. That won't
> > > > > > > fix
> > > > > > > the original
> > > > > > > problem though.
> > > > > > > 
> > > > > > > In other places in nfsd, we've plumbed in a dispose_list
> > > > > > > head
> > > > > > > and
> > > > > > > deferred the sleeping functions until the spinlock can be
> > > > > > > dropped. I
> > > > > > > haven't looked closely at whether that's possible here,
> > > > > > > but
> > > > > > > it may be a
> > > > > > > more reliable approach.
> > > > > > 
> > > > > > That was proposed by Dai last week.
> > > > > > 
> > > > > > https://lore.kernel.org/linux-nfs/1653079929-18283-1-git-send-email-dai.ngo@oracle.com/T/#u
> > > > > > 
> > > > > > Trond pointed out that if two separate clients were
> > > > > > releasing a
> > > > > > lockowner on the same inode, there is nothing that protects
> > > > > > the
> > > > > > dispose_list, and it would get corrupted.
> > > > > > 
> > > > > > https://lore.kernel.org/linux-nfs/31E87CEF-C83D-4FA8-A774-F2C389011FCE@oracle.com/T/#mf1fc1ae0503815c0a36ae75a95086c3eff892614
> > > > > > 
> > > > > 
> > > > > Yeah, that doesn't look like what's needed.
> > > > > 
> > > > > What I was going to suggest is a nfsd_file_put variant that
> > > > > takes
> > > > > a
> > > > > list_head. If the refcount goes to zero and the thing ends up
> > > > > being
> > > > > unhashed, then you put it on the dispose list rather than
> > > > > doing
> > > > > the
> > > > > blocking operations, and then clean it up later.
> > > > 
> > > > Trond doesn't like that approach; see the e-mail thread.
> > > > 
> > > 
> > > I didn't see him saying that that would be wrong, per-se, but the
> > > initial implementation was racy.
> > > 
> > > His suggestion was just to keep a counter in the lockowner of how
> > > many
> > > locks are associated with it. That seems like a good suggestion,
> > > though
> > > you'd probably need to add a parameter to lm_get_owner to
> > > indicate
> > > whether you were adding a new lock or just doing a conflock copy.
> > 
> > I don't think this should be necessary. The posix_lock code doesn't
> > ever use a struct file_lock that it hasn't allocated itself. We
> > should
> > always be calling conflock to copy from whatever struct file_lock
> > that
> > the caller passed as an argument.
> > 
> > IOW: the number of lm_get_owner and lm_put_owner calls should
> > always be
> > 100% balanced once all the locks belonging to a specific lock owner
> > are
> > removed.
> > 
> 
> We take references to the owner when we go to add a lock record, or
> when
> copying a conflicting lock. You want to keep a count of the former
> without counting the latter.
> 
> lm_get_owner gets called for both though. I don't see how you can
> disambiguate the two situations w/o some way to indicate that. Adding
> a
> bool argument to lm_get_owner/lm_put_owner ops would be pretty simple
> to
> implement, I think.
> 

Hmm... That should be an extremely unlikely race, given that the
conflicting lock reference would have to be held for long enough to
cover the unlock + the release_lockowner / free_stateid RPC calls from
the client initially holding the lock, however I agree it is a
theoretical possibility.
Jeff Layton May 23, 2022, 6:30 p.m. UTC | #11
On Mon, 2022-05-23 at 18:21 +0000, Trond Myklebust wrote:
> On Mon, 2022-05-23 at 14:04 -0400, Jeff Layton wrote:
> > On Mon, 2022-05-23 at 17:43 +0000, Trond Myklebust wrote:
> > > On Mon, 2022-05-23 at 12:37 -0400, Jeff Layton wrote:
> > > > On Mon, 2022-05-23 at 15:41 +0000, Chuck Lever III wrote:
> > > > > 
> > > > > > On May 23, 2022, at 11:26 AM, Jeff Layton
> > > > > > <jlayton@kernel.org>
> > > > > > wrote:
> > > > > > 
> > > > > > On Mon, 2022-05-23 at 15:00 +0000, Chuck Lever III wrote:
> > > > > > > 
> > > > > > > > On May 23, 2022, at 9:40 AM, Jeff Layton
> > > > > > > > <jlayton@kernel.org>
> > > > > > > > wrote:
> > > > > > > > 
> > > > > > > > On Sun, 2022-05-22 at 11:38 -0400, Chuck Lever wrote:
> > > > > > > > > nfsd4_release_lockowner() holds clp->cl_lock when it
> > > > > > > > > calls
> > > > > > > > > check_for_locks(). However, check_for_locks() calls
> > > > > > > > > nfsd_file_get()
> > > > > > > > > / nfsd_file_put() to access the backing inode's
> > > > > > > > > flc_posix
> > > > > > > > > list, and
> > > > > > > > > nfsd_file_put() can sleep if the inode was recently
> > > > > > > > > removed.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > It might be good to add a might_sleep() to nfsd_file_put?
> > > > > > > 
> > > > > > > I intend to include the patch you reviewed last week that
> > > > > > > adds the might_sleep(), as part of this series.
> > > > > > > 
> > > > > > > 
> > > > > > > > > Let's instead rely on the stateowner's reference count
> > > > > > > > > to
> > > > > > > > > gate
> > > > > > > > > whether the release is permitted. This should be a
> > > > > > > > > reliable
> > > > > > > > > indication of locks-in-use since file lock operations
> > > > > > > > > and
> > > > > > > > > ->lm_get_owner take appropriate references, which are
> > > > > > > > > released
> > > > > > > > > appropriately when file locks are removed.
> > > > > > > > > 
> > > > > > > > > Reported-by: Dai Ngo <dai.ngo@oracle.com>
> > > > > > > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > > > > > > > Cc: stable@vger.kernel.org
> > > > > > > > > ---
> > > > > > > > > fs/nfsd/nfs4state.c |    9 +++------
> > > > > > > > > 1 file changed, 3 insertions(+), 6 deletions(-)
> > > > > > > > > 
> > > > > > > > > This might be a naive approach, but let's start with
> > > > > > > > > it.
> > > > > > > > > 
> > > > > > > > > This passes light testing, but it's not clear how much
> > > > > > > > > our
> > > > > > > > > existing
> > > > > > > > > fleet of tests exercises this area. I've locally built
> > > > > > > > > a
> > > > > > > > > couple of
> > > > > > > > > pynfs tests (one is based on the one Dai posted last
> > > > > > > > > week)
> > > > > > > > > and they
> > > > > > > > > pass too.
> > > > > > > > > 
> > > > > > > > > I don't believe that FREE_STATEID needs the same
> > > > > > > > > simplification.
> > > > > > > > > 
> > > > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > > > > > > > index a280256cbb03..b77894e668a4 100644
> > > > > > > > > --- a/fs/nfsd/nfs4state.c
> > > > > > > > > +++ b/fs/nfsd/nfs4state.c
> > > > > > > > > @@ -7559,12 +7559,9 @@ nfsd4_release_lockowner(struct
> > > > > > > > > svc_rqst *rqstp,
> > > > > > > > > 
> > > > > > > > >                 /* see if there are still any locks
> > > > > > > > > associated with it */
> > > > > > > > >                 lo = lockowner(sop);
> > > > > > > > > -               list_for_each_entry(stp, &sop-
> > > > > > > > > > so_stateids,
> > > > > > > > > st_perstateowner) {
> > > > > > > > > -                       if (check_for_locks(stp-
> > > > > > > > > > st_stid.sc_file, lo)) {
> > > > > > > > > -                               status =
> > > > > > > > > nfserr_locks_held;
> > > > > > > > > -                               spin_unlock(&clp-
> > > > > > > > > > cl_lock);
> > > > > > > > > -                               return status;
> > > > > > > > > -                       }
> > > > > > > > > +               if (atomic_read(&sop->so_count) > 1) {
> > > > > > > > > +                       spin_unlock(&clp->cl_lock);
> > > > > > > > > +                       return nfserr_locks_held;
> > > > > > > > >                 }
> > > > > > > > > 
> > > > > > > > >                 nfs4_get_stateowner(sop);
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > lm_get_owner is called from locks_copy_conflock, so if
> > > > > > > > someone else
> > > > > > > > happens to be doing a LOCKT or F_GETLK call at the same
> > > > > > > > time
> > > > > > > > that
> > > > > > > > RELEASE_LOCKOWNER gets called, then this may end up
> > > > > > > > returning
> > > > > > > > an error
> > > > > > > > inappropriately.
> > > > > > > 
> > > > > > > IMO releasing the lockowner while it's being used for
> > > > > > > _anything_
> > > > > > > seems risky and surprising. If RELEASE_LOCKOWNER succeeds
> > > > > > > while
> > > > > > > the client is still using the lockowner for any reason, a
> > > > > > > subsequent error will occur if the client tries to use it
> > > > > > > again.
> > > > > > > Heck, I can see the server failing in mid-COMPOUND with
> > > > > > > this
> > > > > > > kind
> > > > > > > of race. Better I think to just leave the lockowner in
> > > > > > > place if
> > > > > > > there's any ambiguity.
> > > > > > > 
> > > > > > 
> > > > > > The problem here is not the client itself calling
> > > > > > RELEASE_LOCKOWNER
> > > > > > while it's still in use, but rather a different client
> > > > > > altogether
> > > > > > calling LOCKT (or a local process does a F_GETLK) on an inode
> > > > > > where a
> > > > > > lock is held by a client. The LOCKT gets a reference to it
> > > > > > (for
> > > > > > the
> > > > > > conflock), while the client that has the lockowner releases
> > > > > > the
> > > > > > lock and
> > > > > > then the lockowner while the refcount is still high.
> > > > > > 
> > > > > > The race window for this is probably quite small, but I think
> > > > > > it's
> > > > > > theoretically possible. The point is that an elevated
> > > > > > refcount on
> > > > > > the
> > > > > > lockowner doesn't necessarily mean that locks are actually
> > > > > > being
> > > > > > held by
> > > > > > it.
> > > > > 
> > > > > Sure, I get that the lockowner's reference count is not 100%
> > > > > reliable. The question is whether it's good enough.
> > > > > 
> > > > > We are looking for a mechanism that can simply count the number
> > > > > of locks held by a lockowner. It sounds like you believe that
> > > > > lm_get_owner / put_owner might not be a reliable way to do
> > > > > that.
> > > > > 
> > > > > 
> > > > > > > The spec language does not say RELEASE_LOCKOWNER must not
> > > > > > > return
> > > > > > > LOCKS_HELD for other reasons, and it does say that there is
> > > > > > > no
> > > > > > > choice of using another NFSERR value (RFC 7530 Section
> > > > > > > 13.2).
> > > > > > > 
> > > > > > 
> > > > > > What recourse does the client have if this happens? It
> > > > > > released
> > > > > > all of
> > > > > > its locks and tried to release the lockowner, but the server
> > > > > > says
> > > > > > "locks
> > > > > > held". Should it just give up at that point?
> > > > > > RELEASE_LOCKOWNER is
> > > > > > a sort
> > > > > > of a courtesy by the client, I suppose...
> > > > > 
> > > > > RELEASE_LOCKOWNER is a courtesy for the server. Most clients
> > > > > ignore the return code IIUC.
> > > > > 
> > > > > So the hazard caused by this race would be a small resource
> > > > > leak on the server that would go away once the client's lease
> > > > > was purged.
> > > > > 
> > > > > 
> > > > > > > > My guess is that that would be pretty hard to hit the
> > > > > > > > timing right, but not impossible.
> > > > > > > > 
> > > > > > > > What we may want to do is have the kernel do this check
> > > > > > > > and
> > > > > > > > only if it
> > > > > > > > comes back >1 do the actual check for locks. That won't
> > > > > > > > fix
> > > > > > > > the original
> > > > > > > > problem though.
> > > > > > > > 
> > > > > > > > In other places in nfsd, we've plumbed in a dispose_list
> > > > > > > > head
> > > > > > > > and
> > > > > > > > deferred the sleeping functions until the spinlock can be
> > > > > > > > dropped. I
> > > > > > > > haven't looked closely at whether that's possible here,
> > > > > > > > but
> > > > > > > > it may be a
> > > > > > > > more reliable approach.
> > > > > > > 
> > > > > > > That was proposed by Dai last week.
> > > > > > > 
> > > > > > > https://lore.kernel.org/linux-nfs/1653079929-18283-1-git-send-email-dai.ngo@oracle.com/T/#u
> > > > > > > 
> > > > > > > Trond pointed out that if two separate clients were
> > > > > > > releasing a
> > > > > > > lockowner on the same inode, there is nothing that protects
> > > > > > > the
> > > > > > > dispose_list, and it would get corrupted.
> > > > > > > 
> > > > > > > https://lore.kernel.org/linux-nfs/31E87CEF-C83D-4FA8-A774-F2C389011FCE@oracle.com/T/#mf1fc1ae0503815c0a36ae75a95086c3eff892614
> > > > > > > 
> > > > > > 
> > > > > > Yeah, that doesn't look like what's needed.
> > > > > > 
> > > > > > What I was going to suggest is a nfsd_file_put variant that
> > > > > > takes
> > > > > > a
> > > > > > list_head. If the refcount goes to zero and the thing ends up
> > > > > > being
> > > > > > unhashed, then you put it on the dispose list rather than
> > > > > > doing
> > > > > > the
> > > > > > blocking operations, and then clean it up later.
> > > > > 
> > > > > Trond doesn't like that approach; see the e-mail thread.
> > > > > 
> > > > 
> > > > I didn't see him saying that that would be wrong, per-se, but the
> > > > initial implementation was racy.
> > > > 
> > > > His suggestion was just to keep a counter in the lockowner of how
> > > > many
> > > > locks are associated with it. That seems like a good suggestion,
> > > > though
> > > > you'd probably need to add a parameter to lm_get_owner to
> > > > indicate
> > > > whether you were adding a new lock or just doing a conflock copy.
> > > 
> > > I don't think this should be necessary. The posix_lock code doesn't
> > > ever use a struct file_lock that it hasn't allocated itself. We
> > > should
> > > always be calling conflock to copy from whatever struct file_lock
> > > that
> > > the caller passed as an argument.
> > > 
> > > IOW: the number of lm_get_owner and lm_put_owner calls should
> > > always be
> > > 100% balanced once all the locks belonging to a specific lock owner
> > > are
> > > removed.
> > > 
> > 
> > We take references to the owner when we go to add a lock record, or
> > when
> > copying a conflicting lock. You want to keep a count of the former
> > without counting the latter.
> > 
> > lm_get_owner gets called for both though. I don't see how you can
> > disambiguate the two situations w/o some way to indicate that. Adding
> > a
> > bool argument to lm_get_owner/lm_put_owner ops would be pretty simple
> > to
> > implement, I think.
> > 
> 
> Hmm... That should be an extremely unlikely race, given that the
> conflicting lock reference would have to be held for long enough to
> cover the unlock + the release_lockowner / free_stateid RPC calls from
> the client initially holding the lock, however I agree it is a
> theoretical possibility.
> 

If we want to live with the possibility of that race, then Chuck's
original patch is fine, since the object refcount would always be
equivalent to the lock count.

That said, I think it'd be better to keep an accurate count of lock
records (sans conflocks) in the owner.
Chuck Lever May 23, 2022, 7:13 p.m. UTC | #12
> On May 23, 2022, at 2:30 PM, Jeff Layton <jlayton@kernel.org> wrote:
> 
> On Mon, 2022-05-23 at 18:21 +0000, Trond Myklebust wrote:
>> On Mon, 2022-05-23 at 14:04 -0400, Jeff Layton wrote:
>>> On Mon, 2022-05-23 at 17:43 +0000, Trond Myklebust wrote:
>>>> On Mon, 2022-05-23 at 12:37 -0400, Jeff Layton wrote:
>>>>> On Mon, 2022-05-23 at 15:41 +0000, Chuck Lever III wrote:
>>>>>> 
>>>>>>> On May 23, 2022, at 11:26 AM, Jeff Layton
>>>>>>> <jlayton@kernel.org>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> On Mon, 2022-05-23 at 15:00 +0000, Chuck Lever III wrote:
>>>>>>>> 
>>>>>>>>> On May 23, 2022, at 9:40 AM, Jeff Layton
>>>>>>>>> <jlayton@kernel.org>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> On Sun, 2022-05-22 at 11:38 -0400, Chuck Lever wrote:
>>>>>>>>>> nfsd4_release_lockowner() holds clp->cl_lock when it
>>>>>>>>>> calls
>>>>>>>>>> check_for_locks(). However, check_for_locks() calls
>>>>>>>>>> nfsd_file_get()
>>>>>>>>>> / nfsd_file_put() to access the backing inode's
>>>>>>>>>> flc_posix
>>>>>>>>>> list, and
>>>>>>>>>> nfsd_file_put() can sleep if the inode was recently
>>>>>>>>>> removed.
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> It might be good to add a might_sleep() to nfsd_file_put?
>>>>>>>> 
>>>>>>>> I intend to include the patch you reviewed last week that
>>>>>>>> adds the might_sleep(), as part of this series.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>>> Let's instead rely on the stateowner's reference count
>>>>>>>>>> to
>>>>>>>>>> gate
>>>>>>>>>> whether the release is permitted. This should be a
>>>>>>>>>> reliable
>>>>>>>>>> indication of locks-in-use since file lock operations
>>>>>>>>>> and
>>>>>>>>>> ->lm_get_owner take appropriate references, which are
>>>>>>>>>> released
>>>>>>>>>> appropriately when file locks are removed.
>>>>>>>>>> 
>>>>>>>>>> Reported-by: Dai Ngo <dai.ngo@oracle.com>
>>>>>>>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>>>>>>>> Cc: stable@vger.kernel.org
>>>>>>>>>> ---
>>>>>>>>>> fs/nfsd/nfs4state.c |    9 +++------
>>>>>>>>>> 1 file changed, 3 insertions(+), 6 deletions(-)
>>>>>>>>>> 
>>>>>>>>>> This might be a naive approach, but let's start with
>>>>>>>>>> it.
>>>>>>>>>> 
>>>>>>>>>> This passes light testing, but it's not clear how much
>>>>>>>>>> our
>>>>>>>>>> existing
>>>>>>>>>> fleet of tests exercises this area. I've locally built
>>>>>>>>>> a
>>>>>>>>>> couple of
>>>>>>>>>> pynfs tests (one is based on the one Dai posted last
>>>>>>>>>> week)
>>>>>>>>>> and they
>>>>>>>>>> pass too.
>>>>>>>>>> 
>>>>>>>>>> I don't believe that FREE_STATEID needs the same
>>>>>>>>>> simplification.
>>>>>>>>>> 
>>>>>>>>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>>>>>>>>>> index a280256cbb03..b77894e668a4 100644
>>>>>>>>>> --- a/fs/nfsd/nfs4state.c
>>>>>>>>>> +++ b/fs/nfsd/nfs4state.c
>>>>>>>>>> @@ -7559,12 +7559,9 @@ nfsd4_release_lockowner(struct
>>>>>>>>>> svc_rqst *rqstp,
>>>>>>>>>> 
>>>>>>>>>>                 /* see if there are still any locks
>>>>>>>>>> associated with it */
>>>>>>>>>>                 lo = lockowner(sop);
>>>>>>>>>> -               list_for_each_entry(stp, &sop-
>>>>>>>>>>> so_stateids,
>>>>>>>>>> st_perstateowner) {
>>>>>>>>>> -                       if (check_for_locks(stp-
>>>>>>>>>>> st_stid.sc_file, lo)) {
>>>>>>>>>> -                               status =
>>>>>>>>>> nfserr_locks_held;
>>>>>>>>>> -                               spin_unlock(&clp-
>>>>>>>>>>> cl_lock);
>>>>>>>>>> -                               return status;
>>>>>>>>>> -                       }
>>>>>>>>>> +               if (atomic_read(&sop->so_count) > 1) {
>>>>>>>>>> +                       spin_unlock(&clp->cl_lock);
>>>>>>>>>> +                       return nfserr_locks_held;
>>>>>>>>>>                 }
>>>>>>>>>> 
>>>>>>>>>>                 nfs4_get_stateowner(sop);
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> lm_get_owner is called from locks_copy_conflock, so if
>>>>>>>>> someone else
>>>>>>>>> happens to be doing a LOCKT or F_GETLK call at the same
>>>>>>>>> time
>>>>>>>>> that
>>>>>>>>> RELEASE_LOCKOWNER gets called, then this may end up
>>>>>>>>> returning
>>>>>>>>> an error
>>>>>>>>> inappropriately.
>>>>>>>> 
>>>>>>>> IMO releasing the lockowner while it's being used for
>>>>>>>> _anything_
>>>>>>>> seems risky and surprising. If RELEASE_LOCKOWNER succeeds
>>>>>>>> while
>>>>>>>> the client is still using the lockowner for any reason, a
>>>>>>>> subsequent error will occur if the client tries to use it
>>>>>>>> again.
>>>>>>>> Heck, I can see the server failing in mid-COMPOUND with
>>>>>>>> this
>>>>>>>> kind
>>>>>>>> of race. Better I think to just leave the lockowner in
>>>>>>>> place if
>>>>>>>> there's any ambiguity.
>>>>>>>> 
>>>>>>> 
>>>>>>> The problem here is not the client itself calling
>>>>>>> RELEASE_LOCKOWNER
>>>>>>> while it's still in use, but rather a different client
>>>>>>> altogether
>>>>>>> calling LOCKT (or a local process does a F_GETLK) on an inode
>>>>>>> where a
>>>>>>> lock is held by a client. The LOCKT gets a reference to it
>>>>>>> (for
>>>>>>> the
>>>>>>> conflock), while the client that has the lockowner releases
>>>>>>> the
>>>>>>> lock and
>>>>>>> then the lockowner while the refcount is still high.
>>>>>>> 
>>>>>>> The race window for this is probably quite small, but I think
>>>>>>> it's
>>>>>>> theoretically possible. The point is that an elevated
>>>>>>> refcount on
>>>>>>> the
>>>>>>> lockowner doesn't necessarily mean that locks are actually
>>>>>>> being
>>>>>>> held by
>>>>>>> it.
>>>>>> 
>>>>>> Sure, I get that the lockowner's reference count is not 100%
>>>>>> reliable. The question is whether it's good enough.
>>>>>> 
>>>>>> We are looking for a mechanism that can simply count the number
>>>>>> of locks held by a lockowner. It sounds like you believe that
>>>>>> lm_get_owner / put_owner might not be a reliable way to do
>>>>>> that.
>>>>>> 
>>>>>> 
>>>>>>>> The spec language does not say RELEASE_LOCKOWNER must not
>>>>>>>> return
>>>>>>>> LOCKS_HELD for other reasons, and it does say that there is
>>>>>>>> no
>>>>>>>> choice of using another NFSERR value (RFC 7530 Section
>>>>>>>> 13.2).
>>>>>>>> 
>>>>>>> 
>>>>>>> What recourse does the client have if this happens? It
>>>>>>> released
>>>>>>> all of
>>>>>>> its locks and tried to release the lockowner, but the server
>>>>>>> says
>>>>>>> "locks
>>>>>>> held". Should it just give up at that point?
>>>>>>> RELEASE_LOCKOWNER is
>>>>>>> a sort
>>>>>>> of a courtesy by the client, I suppose...
>>>>>> 
>>>>>> RELEASE_LOCKOWNER is a courtesy for the server. Most clients
>>>>>> ignore the return code IIUC.
>>>>>> 
>>>>>> So the hazard caused by this race would be a small resource
>>>>>> leak on the server that would go away once the client's lease
>>>>>> was purged.
>>>>>> 
>>>>>> 
>>>>>>>>> My guess is that that would be pretty hard to hit the
>>>>>>>>> timing right, but not impossible.
>>>>>>>>> 
>>>>>>>>> What we may want to do is have the kernel do this check
>>>>>>>>> and
>>>>>>>>> only if it
>>>>>>>>> comes back >1 do the actual check for locks. That won't
>>>>>>>>> fix
>>>>>>>>> the original
>>>>>>>>> problem though.
>>>>>>>>> 
>>>>>>>>> In other places in nfsd, we've plumbed in a dispose_list
>>>>>>>>> head
>>>>>>>>> and
>>>>>>>>> deferred the sleeping functions until the spinlock can be
>>>>>>>>> dropped. I
>>>>>>>>> haven't looked closely at whether that's possible here,
>>>>>>>>> but
>>>>>>>>> it may be a
>>>>>>>>> more reliable approach.
>>>>>>>> 
>>>>>>>> That was proposed by Dai last week.
>>>>>>>> 
>>>>>>>> https://lore.kernel.org/linux-nfs/1653079929-18283-1-git-send-email-dai.ngo@oracle.com/T/#u
>>>>>>>> 
>>>>>>>> Trond pointed out that if two separate clients were
>>>>>>>> releasing a
>>>>>>>> lockowner on the same inode, there is nothing that protects
>>>>>>>> the
>>>>>>>> dispose_list, and it would get corrupted.
>>>>>>>> 
>>>>>>>> https://lore.kernel.org/linux-nfs/31E87CEF-C83D-4FA8-A774-F2C389011FCE@oracle.com/T/#mf1fc1ae0503815c0a36ae75a95086c3eff892614
>>>>>>>> 
>>>>>>> 
>>>>>>> Yeah, that doesn't look like what's needed.
>>>>>>> 
>>>>>>> What I was going to suggest is a nfsd_file_put variant that
>>>>>>> takes
>>>>>>> a
>>>>>>> list_head. If the refcount goes to zero and the thing ends up
>>>>>>> being
>>>>>>> unhashed, then you put it on the dispose list rather than
>>>>>>> doing
>>>>>>> the
>>>>>>> blocking operations, and then clean it up later.
>>>>>> 
>>>>>> Trond doesn't like that approach; see the e-mail thread.
>>>>>> 
>>>>> 
>>>>> I didn't see him saying that that would be wrong, per-se, but the
>>>>> initial implementation was racy.
>>>>> 
>>>>> His suggestion was just to keep a counter in the lockowner of how
>>>>> many
>>>>> locks are associated with it. That seems like a good suggestion,
>>>>> though
>>>>> you'd probably need to add a parameter to lm_get_owner to
>>>>> indicate
>>>>> whether you were adding a new lock or just doing a conflock copy.
>>>> 
>>>> I don't think this should be necessary. The posix_lock code doesn't
>>>> ever use a struct file_lock that it hasn't allocated itself. We
>>>> should
>>>> always be calling conflock to copy from whatever struct file_lock
>>>> that
>>>> the caller passed as an argument.
>>>> 
>>>> IOW: the number of lm_get_owner and lm_put_owner calls should
>>>> always be
>>>> 100% balanced once all the locks belonging to a specific lock owner
>>>> are
>>>> removed.
>>>> 
>>> 
>>> We take references to the owner when we go to add a lock record, or
>>> when
>>> copying a conflicting lock. You want to keep a count of the former
>>> without counting the latter.
>>> 
>>> lm_get_owner gets called for both though. I don't see how you can
>>> disambiguate the two situations w/o some way to indicate that. Adding
>>> a
>>> bool argument to lm_get_owner/lm_put_owner ops would be pretty simple
>>> to
>>> implement, I think.
>>> 
>> 
>> Hmm... That should be an extremely unlikely race, given that the
>> conflicting lock reference would have to be held for long enough to
>> cover the unlock + the release_lockowner / free_stateid RPC calls from
>> the client initially holding the lock, however I agree it is a
>> theoretical possibility.
>> 
> 
> If we want to live with the possibility of that race, then Chuck's
> original patch is fine, since the object refcount would always be
> equivalent to the lock count.
> 
> That said, I think it'd be better to keep an accurate count of lock
> records (sans conflocks) in the owner.

I don't have an objection to maintaining an accurate count of locks
belonging to each lockowner. I feel comfortable (for now) that using
so_count is a good-enough solution, and is an improvement over holding
a spinlock while trying to sleep. If we go with so_count, I can add
a comment that explains the uses of so_count and possible races.

OTOH, if we alter the lm_get/put_owner API, I would like to ensure the
new parameter will be difficult to misuse. A few more random thoughts
that might count towards due diligence:

- Is there a reliable way lm_get/put_owner can distinguish between
  the two cases by looking at their current set of arguments?

- Is it clear why the conflict case needs to take a reference on a
  lockowner that is not involved? Is there a way to avoid taking that
  reference in the first place?


--
Chuck Lever
Chuck Lever May 23, 2022, 7:35 p.m. UTC | #13
> On May 23, 2022, at 1:38 PM, Jeff Layton <jlayton@kernel.org> wrote:
> 
> On Mon, 2022-05-23 at 17:25 +0000, Chuck Lever III wrote:
>> 
>>> On May 23, 2022, at 12:37 PM, Jeff Layton <jlayton@kernel.org> wrote:
>>> 
>>> His suggestion was just to keep a counter in the lockowner of how many
>>> locks are associated with it. That seems like a good suggestion, though
>>> you'd probably need to add a parameter to lm_get_owner to indicate
>>> whether you were adding a new lock or just doing a conflock copy.
>> 
>> locks_copy_conflock() would need to take a boolean parameter
>> that callers would set when they actually manipulate a lock.
>> 
> 
> Yep. You'd also have to add a bool arg to lm_put_owner so that you know
> whether you need to decrement the counter.

It's the lm_put_owner() side that looks less than straightforward.
Suggestions and advice welcome there.


--
Chuck Lever
Jeff Layton May 23, 2022, 7:36 p.m. UTC | #14
On Mon, 2022-05-23 at 19:13 +0000, Chuck Lever III wrote:
> 
> > On May 23, 2022, at 2:30 PM, Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Mon, 2022-05-23 at 18:21 +0000, Trond Myklebust wrote:
> > > On Mon, 2022-05-23 at 14:04 -0400, Jeff Layton wrote:
> > > > On Mon, 2022-05-23 at 17:43 +0000, Trond Myklebust wrote:
> > > > > On Mon, 2022-05-23 at 12:37 -0400, Jeff Layton wrote:
> > > > > > On Mon, 2022-05-23 at 15:41 +0000, Chuck Lever III wrote:
> > > > > > > 
> > > > > > > > On May 23, 2022, at 11:26 AM, Jeff Layton
> > > > > > > > <jlayton@kernel.org>
> > > > > > > > wrote:
> > > > > > > > 
> > > > > > > > On Mon, 2022-05-23 at 15:00 +0000, Chuck Lever III wrote:
> > > > > > > > > 
> > > > > > > > > > On May 23, 2022, at 9:40 AM, Jeff Layton
> > > > > > > > > > <jlayton@kernel.org>
> > > > > > > > > > wrote:
> > > > > > > > > > 
> > > > > > > > > > On Sun, 2022-05-22 at 11:38 -0400, Chuck Lever wrote:
> > > > > > > > > > > nfsd4_release_lockowner() holds clp->cl_lock when it
> > > > > > > > > > > calls
> > > > > > > > > > > check_for_locks(). However, check_for_locks() calls
> > > > > > > > > > > nfsd_file_get()
> > > > > > > > > > > / nfsd_file_put() to access the backing inode's
> > > > > > > > > > > flc_posix
> > > > > > > > > > > list, and
> > > > > > > > > > > nfsd_file_put() can sleep if the inode was recently
> > > > > > > > > > > removed.
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > It might be good to add a might_sleep() to nfsd_file_put?
> > > > > > > > > 
> > > > > > > > > I intend to include the patch you reviewed last week that
> > > > > > > > > adds the might_sleep(), as part of this series.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > > Let's instead rely on the stateowner's reference count
> > > > > > > > > > > to
> > > > > > > > > > > gate
> > > > > > > > > > > whether the release is permitted. This should be a
> > > > > > > > > > > reliable
> > > > > > > > > > > indication of locks-in-use since file lock operations
> > > > > > > > > > > and
> > > > > > > > > > > ->lm_get_owner take appropriate references, which are
> > > > > > > > > > > released
> > > > > > > > > > > appropriately when file locks are removed.
> > > > > > > > > > > 
> > > > > > > > > > > Reported-by: Dai Ngo <dai.ngo@oracle.com>
> > > > > > > > > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > > > > > > > > > Cc: stable@vger.kernel.org
> > > > > > > > > > > ---
> > > > > > > > > > > fs/nfsd/nfs4state.c |    9 +++------
> > > > > > > > > > > 1 file changed, 3 insertions(+), 6 deletions(-)
> > > > > > > > > > > 
> > > > > > > > > > > This might be a naive approach, but let's start with
> > > > > > > > > > > it.
> > > > > > > > > > > 
> > > > > > > > > > > This passes light testing, but it's not clear how much
> > > > > > > > > > > our
> > > > > > > > > > > existing
> > > > > > > > > > > fleet of tests exercises this area. I've locally built
> > > > > > > > > > > a
> > > > > > > > > > > couple of
> > > > > > > > > > > pynfs tests (one is based on the one Dai posted last
> > > > > > > > > > > week)
> > > > > > > > > > > and they
> > > > > > > > > > > pass too.
> > > > > > > > > > > 
> > > > > > > > > > > I don't believe that FREE_STATEID needs the same
> > > > > > > > > > > simplification.
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > > > > > > > > > index a280256cbb03..b77894e668a4 100644
> > > > > > > > > > > --- a/fs/nfsd/nfs4state.c
> > > > > > > > > > > +++ b/fs/nfsd/nfs4state.c
> > > > > > > > > > > @@ -7559,12 +7559,9 @@ nfsd4_release_lockowner(struct
> > > > > > > > > > > svc_rqst *rqstp,
> > > > > > > > > > > 
> > > > > > > > > > >                 /* see if there are still any locks
> > > > > > > > > > > associated with it */
> > > > > > > > > > >                 lo = lockowner(sop);
> > > > > > > > > > > -               list_for_each_entry(stp, &sop-
> > > > > > > > > > > > so_stateids,
> > > > > > > > > > > st_perstateowner) {
> > > > > > > > > > > -                       if (check_for_locks(stp-
> > > > > > > > > > > > st_stid.sc_file, lo)) {
> > > > > > > > > > > -                               status =
> > > > > > > > > > > nfserr_locks_held;
> > > > > > > > > > > -                               spin_unlock(&clp-
> > > > > > > > > > > > cl_lock);
> > > > > > > > > > > -                               return status;
> > > > > > > > > > > -                       }
> > > > > > > > > > > +               if (atomic_read(&sop->so_count) > 1) {
> > > > > > > > > > > +                       spin_unlock(&clp->cl_lock);
> > > > > > > > > > > +                       return nfserr_locks_held;
> > > > > > > > > > >                 }
> > > > > > > > > > > 
> > > > > > > > > > >                 nfs4_get_stateowner(sop);
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > lm_get_owner is called from locks_copy_conflock, so if
> > > > > > > > > > someone else
> > > > > > > > > > happens to be doing a LOCKT or F_GETLK call at the same
> > > > > > > > > > time
> > > > > > > > > > that
> > > > > > > > > > RELEASE_LOCKOWNER gets called, then this may end up
> > > > > > > > > > returning
> > > > > > > > > > an error
> > > > > > > > > > inappropriately.
> > > > > > > > > 
> > > > > > > > > IMO releasing the lockowner while it's being used for
> > > > > > > > > _anything_
> > > > > > > > > seems risky and surprising. If RELEASE_LOCKOWNER succeeds
> > > > > > > > > while
> > > > > > > > > the client is still using the lockowner for any reason, a
> > > > > > > > > subsequent error will occur if the client tries to use it
> > > > > > > > > again.
> > > > > > > > > Heck, I can see the server failing in mid-COMPOUND with
> > > > > > > > > this
> > > > > > > > > kind
> > > > > > > > > of race. Better I think to just leave the lockowner in
> > > > > > > > > place if
> > > > > > > > > there's any ambiguity.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > The problem here is not the client itself calling
> > > > > > > > RELEASE_LOCKOWNER
> > > > > > > > while it's still in use, but rather a different client
> > > > > > > > altogether
> > > > > > > > calling LOCKT (or a local process does a F_GETLK) on an inode
> > > > > > > > where a
> > > > > > > > lock is held by a client. The LOCKT gets a reference to it
> > > > > > > > (for
> > > > > > > > the
> > > > > > > > conflock), while the client that has the lockowner releases
> > > > > > > > the
> > > > > > > > lock and
> > > > > > > > then the lockowner while the refcount is still high.
> > > > > > > > 
> > > > > > > > The race window for this is probably quite small, but I think
> > > > > > > > it's
> > > > > > > > theoretically possible. The point is that an elevated
> > > > > > > > refcount on
> > > > > > > > the
> > > > > > > > lockowner doesn't necessarily mean that locks are actually
> > > > > > > > being
> > > > > > > > held by
> > > > > > > > it.
> > > > > > > 
> > > > > > > Sure, I get that the lockowner's reference count is not 100%
> > > > > > > reliable. The question is whether it's good enough.
> > > > > > > 
> > > > > > > We are looking for a mechanism that can simply count the number
> > > > > > > of locks held by a lockowner. It sounds like you believe that
> > > > > > > lm_get_owner / put_owner might not be a reliable way to do
> > > > > > > that.
> > > > > > > 
> > > > > > > 
> > > > > > > > > The spec language does not say RELEASE_LOCKOWNER must not
> > > > > > > > > return
> > > > > > > > > LOCKS_HELD for other reasons, and it does say that there is
> > > > > > > > > no
> > > > > > > > > choice of using another NFSERR value (RFC 7530 Section
> > > > > > > > > 13.2).
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > What recourse does the client have if this happens? It
> > > > > > > > released
> > > > > > > > all of
> > > > > > > > its locks and tried to release the lockowner, but the server
> > > > > > > > says
> > > > > > > > "locks
> > > > > > > > held". Should it just give up at that point?
> > > > > > > > RELEASE_LOCKOWNER is
> > > > > > > > a sort
> > > > > > > > of a courtesy by the client, I suppose...
> > > > > > > 
> > > > > > > RELEASE_LOCKOWNER is a courtesy for the server. Most clients
> > > > > > > ignore the return code IIUC.
> > > > > > > 
> > > > > > > So the hazard caused by this race would be a small resource
> > > > > > > leak on the server that would go away once the client's lease
> > > > > > > was purged.
> > > > > > > 
> > > > > > > 
> > > > > > > > > > My guess is that that would be pretty hard to hit the
> > > > > > > > > > timing right, but not impossible.
> > > > > > > > > > 
> > > > > > > > > > What we may want to do is have the kernel do this check
> > > > > > > > > > and
> > > > > > > > > > only if it
> > > > > > > > > > comes back >1 do the actual check for locks. That won't
> > > > > > > > > > fix
> > > > > > > > > > the original
> > > > > > > > > > problem though.
> > > > > > > > > > 
> > > > > > > > > > In other places in nfsd, we've plumbed in a dispose_list
> > > > > > > > > > head
> > > > > > > > > > and
> > > > > > > > > > deferred the sleeping functions until the spinlock can be
> > > > > > > > > > dropped. I
> > > > > > > > > > haven't looked closely at whether that's possible here,
> > > > > > > > > > but
> > > > > > > > > > it may be a
> > > > > > > > > > more reliable approach.
> > > > > > > > > 
> > > > > > > > > That was proposed by Dai last week.
> > > > > > > > > 
> > > > > > > > > https://lore.kernel.org/linux-nfs/1653079929-18283-1-git-send-email-dai.ngo@oracle.com/T/#u
> > > > > > > > > 
> > > > > > > > > Trond pointed out that if two separate clients were
> > > > > > > > > releasing a
> > > > > > > > > lockowner on the same inode, there is nothing that protects
> > > > > > > > > the
> > > > > > > > > dispose_list, and it would get corrupted.
> > > > > > > > > 
> > > > > > > > > https://lore.kernel.org/linux-nfs/31E87CEF-C83D-4FA8-A774-F2C389011FCE@oracle.com/T/#mf1fc1ae0503815c0a36ae75a95086c3eff892614
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Yeah, that doesn't look like what's needed.
> > > > > > > > 
> > > > > > > > What I was going to suggest is a nfsd_file_put variant that
> > > > > > > > takes
> > > > > > > > a
> > > > > > > > list_head. If the refcount goes to zero and the thing ends up
> > > > > > > > being
> > > > > > > > unhashed, then you put it on the dispose list rather than
> > > > > > > > doing
> > > > > > > > the
> > > > > > > > blocking operations, and then clean it up later.
> > > > > > > 
> > > > > > > Trond doesn't like that approach; see the e-mail thread.
> > > > > > > 
> > > > > > 
> > > > > > I didn't see him saying that that would be wrong, per-se, but the
> > > > > > initial implementation was racy.
> > > > > > 
> > > > > > His suggestion was just to keep a counter in the lockowner of how
> > > > > > many
> > > > > > locks are associated with it. That seems like a good suggestion,
> > > > > > though
> > > > > > you'd probably need to add a parameter to lm_get_owner to
> > > > > > indicate
> > > > > > whether you were adding a new lock or just doing a conflock copy.
> > > > > 
> > > > > I don't think this should be necessary. The posix_lock code doesn't
> > > > > ever use a struct file_lock that it hasn't allocated itself. We
> > > > > should
> > > > > always be calling conflock to copy from whatever struct file_lock
> > > > > that
> > > > > the caller passed as an argument.
> > > > > 
> > > > > IOW: the number of lm_get_owner and lm_put_owner calls should
> > > > > always be
> > > > > 100% balanced once all the locks belonging to a specific lock owner
> > > > > are
> > > > > removed.
> > > > > 
> > > > 
> > > > We take references to the owner when we go to add a lock record, or
> > > > when
> > > > copying a conflicting lock. You want to keep a count of the former
> > > > without counting the latter.
> > > > 
> > > > lm_get_owner gets called for both though. I don't see how you can
> > > > disambiguate the two situations w/o some way to indicate that. Adding
> > > > a
> > > > bool argument to lm_get_owner/lm_put_owner ops would be pretty simple
> > > > to
> > > > implement, I think.
> > > > 
> > > 
> > > Hmm... That should be an extremely unlikely race, given that the
> > > conflicting lock reference would have to be held for long enough to
> > > cover the unlock + the release_lockowner / free_stateid RPC calls from
> > > the client initially holding the lock, however I agree it is a
> > > theoretical possibility.
> > > 
> > 
> > If we want to live with the possibility of that race, then Chuck's
> > original patch is fine, since the object refcount would always be
> > equivalent to the lock count.
> > 
> > That said, I think it'd be better to keep an accurate count of lock
> > records (sans conflocks) in the owner.
> 
> I don't have an objection to maintaining an accurate count of locks
> belonging to each lockowner. I feel comfortable (for now) that using
> so_count is a good-enough solution, and is an improvement over holding
> a spinlock while trying to sleep. If we go with so_count, I can add
> a comment that explains the uses of so_count and possible races.
> 
> OTOH, if we alter the lm_get/put_owner API, I would like to ensure the
> new parameter will be difficult to misuse. A few more random thoughts
> that might count towards due diligence:
> 
> - Is there a reliable way lm_get/put_owner can distinguish between
>   the two cases by looking at their current set of arguments?
> 

Not really. It just takes a lockowner pointer now so you don't have any
indication of why the reference is being taken or put.

> - Is it clear why the conflict case needs to take a reference on a
>   lockowner that is not involved? Is there a way to avoid taking that
>   reference in the first place?
> 

The other lockowner _is_ involved. It's the one holding the conflicting
lock. nfs4_set_lock_denied copies info from the conflicting lockowner
into the LOCK/LOCKT response. That's safe now because it holds a
reference to the owner. At one point it wasn't (see commit aef9583b234a4
"NFSD: Get reference of lockowner when coping file_lock", which fixed
that).
Jeff Layton May 23, 2022, 7:43 p.m. UTC | #15
On Mon, 2022-05-23 at 19:35 +0000, Chuck Lever III wrote:
> 
> > On May 23, 2022, at 1:38 PM, Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Mon, 2022-05-23 at 17:25 +0000, Chuck Lever III wrote:
> > > 
> > > > On May 23, 2022, at 12:37 PM, Jeff Layton <jlayton@kernel.org> wrote:
> > > > 
> > > > His suggestion was just to keep a counter in the lockowner of how many
> > > > locks are associated with it. That seems like a good suggestion, though
> > > > you'd probably need to add a parameter to lm_get_owner to indicate
> > > > whether you were adding a new lock or just doing a conflock copy.
> > > 
> > > locks_copy_conflock() would need to take a boolean parameter
> > > that callers would set when they actually manipulate a lock.
> > > 
> > 
> > Yep. You'd also have to add a bool arg to lm_put_owner so that you know
> > whether you need to decrement the counter.
> 
> It's the lm_put_owner() side that looks less than straightforward.
> Suggestions and advice welcome there.
> 

Maybe add a new fl_flags value that indicates that a particular lock is
a conflock and not a lock record? Then locks_release_private could use
that to pass the appropriate argument to lm_put_owner.

That's probably simpler overall than trying to audit all of the
locks_free_lock callers.
J. Bruce Fields May 23, 2022, 8:17 p.m. UTC | #16
On Mon, May 23, 2022 at 03:43:28PM -0400, Jeff Layton wrote:
> On Mon, 2022-05-23 at 19:35 +0000, Chuck Lever III wrote:
> > 
> > > On May 23, 2022, at 1:38 PM, Jeff Layton <jlayton@kernel.org> wrote:
> > > 
> > > On Mon, 2022-05-23 at 17:25 +0000, Chuck Lever III wrote:
> > > > 
> > > > > On May 23, 2022, at 12:37 PM, Jeff Layton <jlayton@kernel.org> wrote:
> > > > > 
> > > > > His suggestion was just to keep a counter in the lockowner of how many
> > > > > locks are associated with it. That seems like a good suggestion, though
> > > > > you'd probably need to add a parameter to lm_get_owner to indicate
> > > > > whether you were adding a new lock or just doing a conflock copy.
> > > > 
> > > > locks_copy_conflock() would need to take a boolean parameter
> > > > that callers would set when they actually manipulate a lock.
> > > > 
> > > 
> > > Yep. You'd also have to add a bool arg to lm_put_owner so that you know
> > > whether you need to decrement the counter.
> > 
> > It's the lm_put_owner() side that looks less than straightforward.
> > Suggestions and advice welcome there.
> > 
> 
> Maybe add a new fl_flags value that indicates that a particular lock is
> a conflock and not a lock record? Then locks_release_private could use
> that to pass the appropriate argument to lm_put_owner.
> 
> That's probably simpler overall than trying to audit all of the
> locks_free_lock callers.

Should conflock parameters really be represented by file_lock structures
at all?  It always seemed a little wrong to me.  But, that's a bit of
derail, apologies.

--b.
J. Bruce Fields May 23, 2022, 8:29 p.m. UTC | #17
On Mon, May 23, 2022 at 03:36:27PM -0400, Jeff Layton wrote:
> The other lockowner _is_ involved. It's the one holding the conflicting
> lock. nfs4_set_lock_denied copies info from the conflicting lockowner
> into the LOCK/LOCKT response. That's safe now because it holds a
> reference to the owner. At one point it wasn't (see commit aef9583b234a4
> "NFSD: Get reference of lockowner when coping file_lock", which fixed
> that).

I doubt that commit fixed the whole problem, for what it's worth.  What
if the client holding the conflicting lock expires before we get to
nfs4_set_lock_denied?

--b.
Jeff Layton May 23, 2022, 8:32 p.m. UTC | #18
On Mon, 2022-05-23 at 16:17 -0400, J. Bruce Fields wrote:
> On Mon, May 23, 2022 at 03:43:28PM -0400, Jeff Layton wrote:
> > On Mon, 2022-05-23 at 19:35 +0000, Chuck Lever III wrote:
> > > 
> > > > On May 23, 2022, at 1:38 PM, Jeff Layton <jlayton@kernel.org> wrote:
> > > > 
> > > > On Mon, 2022-05-23 at 17:25 +0000, Chuck Lever III wrote:
> > > > > 
> > > > > > On May 23, 2022, at 12:37 PM, Jeff Layton <jlayton@kernel.org> wrote:
> > > > > > 
> > > > > > His suggestion was just to keep a counter in the lockowner of how many
> > > > > > locks are associated with it. That seems like a good suggestion, though
> > > > > > you'd probably need to add a parameter to lm_get_owner to indicate
> > > > > > whether you were adding a new lock or just doing a conflock copy.
> > > > > 
> > > > > locks_copy_conflock() would need to take a boolean parameter
> > > > > that callers would set when they actually manipulate a lock.
> > > > > 
> > > > 
> > > > Yep. You'd also have to add a bool arg to lm_put_owner so that you know
> > > > whether you need to decrement the counter.
> > > 
> > > It's the lm_put_owner() side that looks less than straightforward.
> > > Suggestions and advice welcome there.
> > > 
> > 
> > Maybe add a new fl_flags value that indicates that a particular lock is
> > a conflock and not a lock record? Then locks_release_private could use
> > that to pass the appropriate argument to lm_put_owner.
> > 
> > That's probably simpler overall than trying to audit all of the
> > locks_free_lock callers.
> 
> Should conflock parameters really be represented by file_lock structures
> at all?  It always seemed a little wrong to me.  But, that's a bit of
> derail, apologies.
> 

Probably not.

Lock requests should also not be represented by struct file_lock, but
that decision was made quite some time ago. We could change these things
these days, but it'd be a lot of churn.

Even if we did use a different struct for conflocks though, it would
still need a way to point at the lockowner (or clone/free its info
somehow). It wouldn't materially change what we need to do here.
Jeff Layton May 23, 2022, 9:15 p.m. UTC | #19
On Mon, 2022-05-23 at 16:29 -0400, J. Bruce Fields wrote:
> On Mon, May 23, 2022 at 03:36:27PM -0400, Jeff Layton wrote:
> > The other lockowner _is_ involved. It's the one holding the conflicting
> > lock. nfs4_set_lock_denied copies info from the conflicting lockowner
> > into the LOCK/LOCKT response. That's safe now because it holds a
> > reference to the owner. At one point it wasn't (see commit aef9583b234a4
> > "NFSD: Get reference of lockowner when coping file_lock", which fixed
> > that).
> 
> I doubt that commit fixed the whole problem, for what it's worth.  What
> if the client holding the conflicting lock expires before we get to
> nfs4_set_lock_denied?
> 

Good point -- stateowners can't hold a client reference.

clientid_t is 64 bits, so one thing we could do is just keep a copy of
that in the lockowner. That way we wouldn't need to dereference
so_client in nfs4_set_lock_denied.

Maybe there are better ways to fix it though.
J. Bruce Fields May 23, 2022, 9:28 p.m. UTC | #20
On Mon, May 23, 2022 at 05:15:55PM -0400, Jeff Layton wrote:
> On Mon, 2022-05-23 at 16:29 -0400, J. Bruce Fields wrote:
> > On Mon, May 23, 2022 at 03:36:27PM -0400, Jeff Layton wrote:
> > > The other lockowner _is_ involved. It's the one holding the conflicting
> > > lock. nfs4_set_lock_denied copies info from the conflicting lockowner
> > > into the LOCK/LOCKT response. That's safe now because it holds a
> > > reference to the owner. At one point it wasn't (see commit aef9583b234a4
> > > "NFSD: Get reference of lockowner when coping file_lock", which fixed
> > > that).
> > 
> > I doubt that commit fixed the whole problem, for what it's worth.  What
> > if the client holding the conflicting lock expires before we get to
> > nfs4_set_lock_denied?
> > 
> 
> Good point -- stateowners can't hold a client reference.
> 
> clientid_t is 64 bits, so one thing we could do is just keep a copy of
> that in the lockowner. That way we wouldn't need to dereference
> so_client in nfs4_set_lock_denied.
> 
> Maybe there are better ways to fix it though.

I kinda feel like lock and testlock should both take a 

	struct conflock {
		void *data
		void (*copier)(struct file_lock, struct conflock)
	}

and the caller should provide a "copier" that they can use to extract
the data they want while the flc_lock is held.

I'm probably overthinking it.

--b.
Chuck Lever May 23, 2022, 10:18 p.m. UTC | #21
> On May 23, 2022, at 1:43 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> IOW: the number of lm_get_owner and lm_put_owner calls should always be
> 100% balanced once all the locks belonging to a specific lock owner are
> removed.

It doesn't look like that's the case. If we rely solely on
lm_get/put_owner, the lockcnt quickly becomes negative, even
without the presence of lock conflicts:

            nfsd-999   [001]   155.888211: nfsd_compound:        xid=0x0000001c opcnt=2
            nfsd-999   [001]   155.888260: nfsd_compound_status: op=1/2 OP_PUTFH status=0
            nfsd-999   [001]   155.888283: nfsd_preprocess:      seqid=2 client 628c04a3:f2adef0d stateid 00000002:00000002
            nfsd-999   [001]   155.888419: bprint:               nfsd4_lm_get_owner: lo=0xffff8881955b1050 conflict=false lockcnt=1 so_count=3
            nfsd-999   [001]   155.888431: bprint:               nfsd4_lm_put_owner.part.0: lo=0xffff8881955b1050 conflict=false lockcnt=1 so_count=3
            nfsd-999   [001]   155.888441: nfsd_file_put:        hash=0x339 inode=0xffff88816770d908 ref=3 flags=HASHED|REFERENCED may=WRITE|READ file=0xffff88811232f7c0
            nfsd-999   [001]   155.888453: nfsd_compound_status: op=2/2 OP_LOCK status=0
            nfsd-999   [003]   155.890388: nfsd_compound:        xid=0x0000001d opcnt=2
            nfsd-999   [003]   155.890457: nfsd_compound_status: op=1/2 OP_PUTFH status=0
            nfsd-999   [003]   155.890459: nfsd_preprocess:      seqid=1 client 628c04a3:f2adef0d stateid 00000003:00000001
            nfsd-999   [003]   155.890479: bprint:               nfsd4_lm_put_owner.part.0: lo=0xffff8881955b1050 conflict=false lockcnt=0 so_count=4
            nfsd-999   [003]   155.890490: nfsd_file_put:        hash=0x339 inode=0xffff88816770d908 ref=3 flags=HASHED|REFERENCED may=WRITE|READ file=0xffff88811232f7c0
            nfsd-999   [003]   155.890492: bprint:               nfsd4_lm_put_owner.part.0: lo=0xffff8881955b1050 conflict=false lockcnt=-1 so_count=3
            nfsd-999   [003]   155.890502: nfsd_compound_status: op=2/2 OP_LOCKU status=0
            nfsd-999   [003]   155.892494: nfsd_compound:        xid=0x0000001e opcnt=5
            nfsd-999   [003]   155.892498: nfsd_compound_status: op=1/5 OP_PUTROOTFH status=0
            nfsd-999   [003]   155.892528: nfsd_compound_status: op=2/5 OP_LOOKUP status=0
            nfsd-999   [003]   155.892552: nfsd_compound_status: op=3/5 OP_LOOKUP status=0
            nfsd-999   [003]   155.892576: nfsd_compound_status: op=4/5 OP_LOOKUP status=0
            nfsd-999   [003]   155.892604: nfsd_file_fsnotify_handle_event: inode=0xffff88816770d908 nlink=0 mode=0100644 mask=0x4
            nfsd-999   [003]   155.892605: nfsd_file_unhash_and_release_locked: hash=0x339 inode=0xffff88816770d908 ref=2 flags=HASHED|REFERENCED may=WRITE|READ file=0xffff88811232f7c0
            nfsd-999   [003]   155.892606: nfsd_file_unhash:     hash=0x339 inode=0xffff88816770d908 ref=2 flags=REFERENCED may=WRITE|READ file=0xffff88811232f7c0
            nfsd-999   [003]   155.892607: nfsd_file_close_inode: hash=0x339 inode=0xffff88816770d908 found=0
            nfsd-999   [003]   155.892610: nfsd_compound_status: op=5/5 OP_REMOVE status=0
            nfsd-999   [001]   155.894663: nfsd_compound:        xid=0x0000001f opcnt=1
            nfsd-999   [001]   155.894667: bprint:               nfsd4_release_lockowner: lo=0xffff8881955b1050 lockcnt=-2 so_count=1
            nfsd-999   [001]   155.894670: nfsd_file_put:        hash=0x339 inode=0xffff88816770d908 ref=2 flags=REFERENCED may=WRITE|READ file=0xffff88811232f7c0
            nfsd-999   [001]   155.894706: nfsd_compound_status: op=1/1 OP_RELEASE_LOCKOWNER status=0

There are other places inside NFSD that call nfs4_get/
put_stateowner on a lockowner. That seems to be what keeps
the so_count above water.

Or, another way to look at it: simply saving the value of
the conflict boolean in file_lock and passing that to
lm_put_owner() doesn't seem to be adequate.


--
Chuck Lever
J. Bruce Fields May 24, 2022, 12:07 a.m. UTC | #22
On Mon, May 23, 2022 at 05:28:16PM -0400, J. Bruce Fields wrote:
> On Mon, May 23, 2022 at 05:15:55PM -0400, Jeff Layton wrote:
> > On Mon, 2022-05-23 at 16:29 -0400, J. Bruce Fields wrote:
> > > On Mon, May 23, 2022 at 03:36:27PM -0400, Jeff Layton wrote:
> > > > The other lockowner _is_ involved. It's the one holding the conflicting
> > > > lock. nfs4_set_lock_denied copies info from the conflicting lockowner
> > > > into the LOCK/LOCKT response. That's safe now because it holds a
> > > > reference to the owner. At one point it wasn't (see commit aef9583b234a4
> > > > "NFSD: Get reference of lockowner when coping file_lock", which fixed
> > > > that).
> > > 
> > > I doubt that commit fixed the whole problem, for what it's worth.  What
> > > if the client holding the conflicting lock expires before we get to
> > > nfs4_set_lock_denied?
> > > 
> > 
> > Good point -- stateowners can't hold a client reference.
> > 
> > clientid_t is 64 bits, so one thing we could do is just keep a copy of
> > that in the lockowner. That way we wouldn't need to dereference
> > so_client in nfs4_set_lock_denied.
> > 
> > Maybe there are better ways to fix it though.
> 
> I kinda feel like lock and testlock should both take a 
> 
> 	struct conflock {
> 		void *data
> 		void (*copier)(struct file_lock, struct conflock)
> 	}
> 
> and the caller should provide a "copier" that they can use to extract
> the data they want while the flc_lock is held.
> 
> I'm probably overthinking it.

But in any case note the problem isn't just the copy of the clientid_t;
nfs4_put_stateowner also dereferences the client.

--b.
diff mbox series

Patch

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index a280256cbb03..b77894e668a4 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -7559,12 +7559,9 @@  nfsd4_release_lockowner(struct svc_rqst *rqstp,
 
 		/* see if there are still any locks associated with it */
 		lo = lockowner(sop);
-		list_for_each_entry(stp, &sop->so_stateids, st_perstateowner) {
-			if (check_for_locks(stp->st_stid.sc_file, lo)) {
-				status = nfserr_locks_held;
-				spin_unlock(&clp->cl_lock);
-				return status;
-			}
+		if (atomic_read(&sop->so_count) > 1) {
+			spin_unlock(&clp->cl_lock);
+			return nfserr_locks_held;
 		}
 
 		nfs4_get_stateowner(sop);