Message ID | 165323344948.2381.7808135229977810927.stgit@bazille.1015granger.net (mailing list archive)
---|---
State | New, archived
Series | [RFC] NFSD: Fix possible sleep during nfsd4_release_lockowner()
On Sun, 2022-05-22 at 11:38 -0400, Chuck Lever wrote:
> nfsd4_release_lockowner() holds clp->cl_lock when it calls
> check_for_locks(). However, check_for_locks() calls nfsd_file_get()
> / nfsd_file_put() to access the backing inode's flc_posix list, and
> nfsd_file_put() can sleep if the inode was recently removed.
>

It might be good to add a might_sleep() to nfsd_file_put?

> Let's instead rely on the stateowner's reference count to gate
> whether the release is permitted. This should be a reliable
> indication of locks-in-use since file lock operations and
> ->lm_get_owner take appropriate references, which are released
> appropriately when file locks are removed.
>
> Reported-by: Dai Ngo <dai.ngo@oracle.com>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> Cc: stable@vger.kernel.org
> ---
>  fs/nfsd/nfs4state.c | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)
>
> This might be a naive approach, but let's start with it.
>
> This passes light testing, but it's not clear how much our existing
> fleet of tests exercises this area. I've locally built a couple of
> pynfs tests (one is based on the one Dai posted last week) and they
> pass too.
>
> I don't believe that FREE_STATEID needs the same simplification.
>
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index a280256cbb03..b77894e668a4 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -7559,12 +7559,9 @@ nfsd4_release_lockowner(struct svc_rqst *rqstp,
>
>  	/* see if there are still any locks associated with it */
>  	lo = lockowner(sop);
> -	list_for_each_entry(stp, &sop->so_stateids, st_perstateowner) {
> -		if (check_for_locks(stp->st_stid.sc_file, lo)) {
> -			status = nfserr_locks_held;
> -			spin_unlock(&clp->cl_lock);
> -			return status;
> -		}
> +	if (atomic_read(&sop->so_count) > 1) {
> +		spin_unlock(&clp->cl_lock);
> +		return nfserr_locks_held;
>  	}
>
>  	nfs4_get_stateowner(sop);

lm_get_owner is called from locks_copy_conflock, so if someone else
happens to be doing a LOCKT or F_GETLK call at the same time that
RELEASE_LOCKOWNER gets called, then this may end up returning an error
inappropriately. My guess is that it would be pretty hard to hit the
timing right, but not impossible.

What we may want to do is have the kernel do this check first, and only
if it comes back >1 do the actual check for locks. That won't fix the
original problem though.

In other places in nfsd, we've plumbed in a dispose_list head and
deferred the sleeping functions until the spinlock can be dropped. I
haven't looked closely at whether that's possible here, but it may be a
more reliable approach.
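[For context on why a conflock copy inflates the refcount: nfsd's lock
manager callbacks take and drop a reference on the embedded stateowner,
and locks_copy_conflock() invokes lm_get_owner when it copies a
conflicting lock for LOCKT/F_GETLK. The callbacks look approximately
like this; a sketch reproduced from memory of the code of this era, so
treat the details as unverified:

	static fl_owner_t
	nfsd4_fl_get_owner(fl_owner_t owner)
	{
		struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;

		/* bumps lo_owner.so_count -- the counter the patch tests */
		nfs4_get_stateowner(&lo->lo_owner);
		return owner;
	}

	static void
	nfsd4_fl_put_owner(fl_owner_t owner)
	{
		struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;

		if (lo)
			nfs4_put_stateowner(&lo->lo_owner);
	}

So a conflock copy held by another thread raises so_count exactly as a
real lock record does, which is the ambiguity discussed below.]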
> On May 23, 2022, at 9:40 AM, Jeff Layton <jlayton@kernel.org> wrote:
>
> It might be good to add a might_sleep() to nfsd_file_put?

I intend to include the patch you reviewed last week that
adds the might_sleep(), as part of this series.

> lm_get_owner is called from locks_copy_conflock, so if someone else
> happens to be doing a LOCKT or F_GETLK call at the same time that
> RELEASE_LOCKOWNER gets called, then this may end up returning an error
> inappropriately.

IMO releasing the lockowner while it's being used for _anything_
seems risky and surprising. If RELEASE_LOCKOWNER succeeds while
the client is still using the lockowner for any reason, a
subsequent error will occur if the client tries to use it again.
Heck, I can see the server failing in mid-COMPOUND with this kind
of race. Better I think to just leave the lockowner in place if
there's any ambiguity.

The spec language does not say RELEASE_LOCKOWNER must not return
LOCKS_HELD for other reasons, and it does say that there is no
choice of using another NFSERR value (RFC 7530, Section 13.2).

> In other places in nfsd, we've plumbed in a dispose_list head and
> deferred the sleeping functions until the spinlock can be dropped. I
> haven't looked closely at whether that's possible here, but it may be a
> more reliable approach.

That was proposed by Dai last week:

https://lore.kernel.org/linux-nfs/1653079929-18283-1-git-send-email-dai.ngo@oracle.com/T/#u

Trond pointed out that if two separate clients were releasing a
lockowner on the same inode, there is nothing that protects the
dispose_list, and it would get corrupted.

https://lore.kernel.org/linux-nfs/31E87CEF-C83D-4FA8-A774-F2C389011FCE@oracle.com/T/#mf1fc1ae0503815c0a36ae75a95086c3eff892614

--
Chuck Lever
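[A minimal sketch of the might_sleep() annotation being referred to;
illustrative only, the actual patch may place it differently:

	void nfsd_file_put(struct nfsd_file *nf)
	{
		/*
		 * With CONFIG_DEBUG_ATOMIC_SLEEP, warn if we are
		 * entered in atomic context -- this would have flagged
		 * the call under clp->cl_lock in
		 * nfsd4_release_lockowner() immediately.
		 */
		might_sleep();

		/* ... existing put logic unchanged ... */
	}
]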
On Mon, 2022-05-23 at 15:00 +0000, Chuck Lever III wrote:
> IMO releasing the lockowner while it's being used for _anything_
> seems risky and surprising. If RELEASE_LOCKOWNER succeeds while
> the client is still using the lockowner for any reason, a
> subsequent error will occur if the client tries to use it again.
> Heck, I can see the server failing in mid-COMPOUND with this kind
> of race. Better I think to just leave the lockowner in place if
> there's any ambiguity.

The problem here is not the client itself calling RELEASE_LOCKOWNER
while it's still in use, but rather a different client altogether
calling LOCKT (or a local process doing an F_GETLK) on an inode where a
lock is held by a client. The LOCKT gets a reference to the lockowner
(for the conflock), while the client that holds the lockowner releases
the lock and then the lockowner while the refcount is still high.

The race window for this is probably quite small, but I think it's
theoretically possible. The point is that an elevated refcount on the
lockowner doesn't necessarily mean that locks are actually being held
by it.

> The spec language does not say RELEASE_LOCKOWNER must not return
> LOCKS_HELD for other reasons, and it does say that there is no
> choice of using another NFSERR value (RFC 7530, Section 13.2).

What recourse does the client have if this happens? It released all of
its locks and tried to release the lockowner, but the server says "locks
held". Should it just give up at that point? RELEASE_LOCKOWNER is a sort
of a courtesy by the client, I suppose...

> That was proposed by Dai last week.
>
> https://lore.kernel.org/linux-nfs/1653079929-18283-1-git-send-email-dai.ngo@oracle.com/T/#u
>
> Trond pointed out that if two separate clients were releasing a
> lockowner on the same inode, there is nothing that protects the
> dispose_list, and it would get corrupted.

Yeah, that doesn't look like what's needed.

What I was going to suggest is a nfsd_file_put variant that takes a
list_head. If the refcount goes to zero and the thing ends up being
unhashed, then you put it on the dispose list rather than doing the
blocking operations, and then clean it up later.

That said, nfsd_file_put has grown significantly in complexity over the
years, so maybe that's not simple to do now.
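[A sketch of the variant described above, with invented names
(nfsd_file_put_list, nfsd_file_dispose_list_sync) and assuming nf_lru
can serve as the list linkage; this is not the actual filecache API:

	static void
	nfsd_file_put_list(struct nfsd_file *nf, struct list_head *dispose)
	{
		/* defer all blocking work; safe while holding a spinlock */
		if (refcount_dec_and_test(&nf->nf_ref))
			list_add(&nf->nf_lru, dispose);
	}

	static void
	nfsd_file_dispose_list_sync(struct list_head *dispose)
	{
		struct nfsd_file *nf;

		while (!list_empty(dispose)) {
			nf = list_first_entry(dispose, struct nfsd_file,
					      nf_lru);
			list_del_init(&nf->nf_lru);
			nfsd_file_free(nf);	/* may block; no locks held */
		}
	}

The caller would collect victims on a local LIST_HEAD(dispose) while
holding clp->cl_lock, drop the lock, then call the dispose helper.]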
> On May 23, 2022, at 11:26 AM, Jeff Layton <jlayton@kernel.org> wrote:
>
> The problem here is not the client itself calling RELEASE_LOCKOWNER
> while it's still in use, but rather a different client altogether
> calling LOCKT (or a local process doing an F_GETLK) on an inode where a
> lock is held by a client. The LOCKT gets a reference to the lockowner
> (for the conflock), while the client that holds the lockowner releases
> the lock and then the lockowner while the refcount is still high.
>
> The race window for this is probably quite small, but I think it's
> theoretically possible. The point is that an elevated refcount on the
> lockowner doesn't necessarily mean that locks are actually being held
> by it.

Sure, I get that the lockowner's reference count is not 100%
reliable. The question is whether it's good enough.

We are looking for a mechanism that can simply count the number
of locks held by a lockowner. It sounds like you believe that
lm_get_owner / lm_put_owner might not be a reliable way to do that.

>> The spec language does not say RELEASE_LOCKOWNER must not return
>> LOCKS_HELD for other reasons, and it does say that there is no
>> choice of using another NFSERR value (RFC 7530, Section 13.2).
>
> What recourse does the client have if this happens? It released all of
> its locks and tried to release the lockowner, but the server says "locks
> held". Should it just give up at that point? RELEASE_LOCKOWNER is a sort
> of a courtesy by the client, I suppose...

RELEASE_LOCKOWNER is a courtesy for the server. Most clients
ignore the return code IIUC.

So the hazard caused by this race would be a small resource
leak on the server that would go away once the client's lease
was purged.

> What I was going to suggest is a nfsd_file_put variant that takes a
> list_head. If the refcount goes to zero and the thing ends up being
> unhashed, then you put it on the dispose list rather than doing the
> blocking operations, and then clean it up later.

Trond doesn't like that approach; see the e-mail thread.

--
Chuck Lever
On Mon, 2022-05-23 at 15:41 +0000, Chuck Lever III wrote:
> > What I was going to suggest is a nfsd_file_put variant that takes a
> > list_head. If the refcount goes to zero and the thing ends up being
> > unhashed, then you put it on the dispose list rather than doing the
> > blocking operations, and then clean it up later.
>
> Trond doesn't like that approach; see the e-mail thread.

I didn't see him saying that that would be wrong, per se, but the
initial implementation was racy.

His suggestion was just to keep a counter in the lockowner of how many
locks are associated with it. That seems like a good suggestion, though
you'd probably need to add a parameter to lm_get_owner to indicate
whether you were adding a new lock or just doing a conflock copy.

Checking the object refcount like this patch does seems wrong though.
> On May 23, 2022, at 12:37 PM, Jeff Layton <jlayton@kernel.org> wrote:
>
> I didn't see him saying that that would be wrong, per se, but the
> initial implementation was racy.

I proposed this for check_for_locks() to use:

> void nfsd_file_put_async(struct nfsd_file *nf)
> {
> 	if (refcount_dec_and_test(&nf->nf_ref))
> 		nfsd_file_close_inode(nf->nf_inode);
> }

Trond's response was:

> That approach moves the sync to the garbage collector, which was
> exactly what we're trying to avoid in the first place.

Basically he's opposed to any flush work being done by the garbage
collector because that has a known negative performance impact.

> His suggestion was just to keep a counter in the lockowner of how many
> locks are associated with it. That seems like a good suggestion, though
> you'd probably need to add a parameter to lm_get_owner to indicate
> whether you were adding a new lock or just doing a conflock copy.

locks_copy_conflock() would need to take a boolean parameter
that callers would set when they actually manipulate a lock.

I would feel more comfortable with that approach if
locks_copy_conflock() was a private API used only in fs/locks.c
so we can audit its usage to ensure that callers are managing
the lock count correctly.

--
Chuck Lever
On Mon, 2022-05-23 at 17:25 +0000, Chuck Lever III wrote:
> Basically he's opposed to any flush work being done by the garbage
> collector because that has a known negative performance impact.

Fair enough. I like his other suggestion better anyway.

> locks_copy_conflock() would need to take a boolean parameter
> that callers would set when they actually manipulate a lock.

Yep. You'd also have to add a bool arg to lm_put_owner so that you know
whether you need to decrement the counter.

> I would feel more comfortable with that approach if
> locks_copy_conflock() was a private API used only in fs/locks.c
> so we can audit its usage to ensure that callers are managing
> the lock count correctly.

It basically is. In fact, I'm not sure why it's exported since nothing
outside of locks.c calls it. Looking at the git history, it probably got
exported by mistake when we split up locks_copy_lock and
locks_copy_conflock, as the former was exported.

I think this approach is quite doable...
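[Putting the pieces of this suggestion together, a sketch might look
like the following. The lo_lock_cnt field and the extra bool parameter
are assumptions about the proposed interface change, not existing
kernel code:

	/* in struct nfs4_lockowner: a counter of actual lock records */
	atomic_t		lo_lock_cnt;

	static fl_owner_t
	nfsd4_fl_get_owner(fl_owner_t owner, bool real_lock)
	{
		struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;

		nfs4_get_stateowner(&lo->lo_owner);
		if (real_lock)	/* new lock record, not a conflock copy */
			atomic_inc(&lo->lo_lock_cnt);
		return owner;
	}

	static void
	nfsd4_fl_put_owner(fl_owner_t owner, bool real_lock)
	{
		struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;

		if (real_lock)
			atomic_dec(&lo->lo_lock_cnt);
		nfs4_put_stateowner(&lo->lo_owner);
	}

locks_copy_conflock() would pass false; the paths that insert a real
lock record would pass true.]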
On Mon, 2022-05-23 at 12:37 -0400, Jeff Layton wrote:
> His suggestion was just to keep a counter in the lockowner of how many
> locks are associated with it. That seems like a good suggestion, though
> you'd probably need to add a parameter to lm_get_owner to indicate
> whether you were adding a new lock or just doing a conflock copy.

I don't think this should be necessary. The posix_lock code doesn't
ever use a struct file_lock that it hasn't allocated itself. We should
always be calling conflock to copy from whatever struct file_lock the
caller passed as an argument.

IOW: the number of lm_get_owner and lm_put_owner calls should always be
100% balanced once all the locks belonging to a specific lock owner are
removed.

> Checking the object refcount like this patch does seems wrong though.

Yes. This approach does require a separate counter that is only
bumped/decremented in the lock manager callbacks.
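[With such a counter in place, the RELEASE_LOCKOWNER check could test
it instead of the object refcount. Relative to the patch at the top of
the thread, the change might look like this (lo_lock_cnt remains an
invented name):

 	lo = lockowner(sop);
-	if (atomic_read(&sop->so_count) > 1) {
+	if (atomic_read(&lo->lo_lock_cnt) > 0) {
 		spin_unlock(&clp->cl_lock);
 		return nfserr_locks_held;
 	}

A conflock copy no longer perturbs the test, because only the lock
manager callbacks that represent real lock records touch lo_lock_cnt.]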
On Mon, 2022-05-23 at 17:43 +0000, Trond Myklebust wrote:
> I don't think this should be necessary. The posix_lock code doesn't
> ever use a struct file_lock that it hasn't allocated itself. We should
> always be calling conflock to copy from whatever struct file_lock the
> caller passed as an argument.
>
> IOW: the number of lm_get_owner and lm_put_owner calls should always be
> 100% balanced once all the locks belonging to a specific lock owner are
> removed.

We take references to the owner when we go to add a lock record, or when
copying a conflicting lock. You want to keep a count of the former
without counting the latter.

lm_get_owner gets called for both though. I don't see how you can
disambiguate the two situations w/o some way to indicate that. Adding a
bool argument to lm_get_owner/lm_put_owner ops would be pretty simple to
implement, I think.
On Mon, 2022-05-23 at 14:04 -0400, Jeff Layton wrote:
> We take references to the owner when we go to add a lock record, or when
> copying a conflicting lock. You want to keep a count of the former
> without counting the latter.
>
> lm_get_owner gets called for both though. I don't see how you can
> disambiguate the two situations w/o some way to indicate that. Adding a
> bool argument to lm_get_owner/lm_put_owner ops would be pretty simple to
> implement, I think.

Hmm... That should be an extremely unlikely race, given that the
conflicting lock reference would have to be held for long enough to
cover the unlock + the RELEASE_LOCKOWNER / FREE_STATEID RPC calls from
the client initially holding the lock. However, I agree it is a
theoretical possibility.
On Mon, 2022-05-23 at 18:21 +0000, Trond Myklebust wrote: > On Mon, 2022-05-23 at 14:04 -0400, Jeff Layton wrote: > > On Mon, 2022-05-23 at 17:43 +0000, Trond Myklebust wrote: > > > On Mon, 2022-05-23 at 12:37 -0400, Jeff Layton wrote: > > > > On Mon, 2022-05-23 at 15:41 +0000, Chuck Lever III wrote: > > > > > > > > > > > On May 23, 2022, at 11:26 AM, Jeff Layton > > > > > > <jlayton@kernel.org> > > > > > > wrote: > > > > > > > > > > > > On Mon, 2022-05-23 at 15:00 +0000, Chuck Lever III wrote: > > > > > > > > > > > > > > > On May 23, 2022, at 9:40 AM, Jeff Layton > > > > > > > > <jlayton@kernel.org> > > > > > > > > wrote: > > > > > > > > > > > > > > > > On Sun, 2022-05-22 at 11:38 -0400, Chuck Lever wrote: > > > > > > > > > nfsd4_release_lockowner() holds clp->cl_lock when it > > > > > > > > > calls > > > > > > > > > check_for_locks(). However, check_for_locks() calls > > > > > > > > > nfsd_file_get() > > > > > > > > > / nfsd_file_put() to access the backing inode's > > > > > > > > > flc_posix > > > > > > > > > list, and > > > > > > > > > nfsd_file_put() can sleep if the inode was recently > > > > > > > > > removed. > > > > > > > > > > > > > > > > > > > > > > > > > It might be good to add a might_sleep() to nfsd_file_put? > > > > > > > > > > > > > > I intend to include the patch you reviewed last week that > > > > > > > adds the might_sleep(), as part of this series. > > > > > > > > > > > > > > > > > > > > > > > Let's instead rely on the stateowner's reference count > > > > > > > > > to > > > > > > > > > gate > > > > > > > > > whether the release is permitted. This should be a > > > > > > > > > reliable > > > > > > > > > indication of locks-in-use since file lock operations > > > > > > > > > and > > > > > > > > > ->lm_get_owner take appropriate references, which are > > > > > > > > > released > > > > > > > > > appropriately when file locks are removed. > > > > > > > > > > > > > > > > > > Reported-by: Dai Ngo <dai.ngo@oracle.com> > > > > > > > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com> > > > > > > > > > Cc: stable@vger.kernel.org > > > > > > > > > --- > > > > > > > > > fs/nfsd/nfs4state.c | 9 +++------ > > > > > > > > > 1 file changed, 3 insertions(+), 6 deletions(-) > > > > > > > > > > > > > > > > > > This might be a naive approach, but let's start with > > > > > > > > > it. > > > > > > > > > > > > > > > > > > This passes light testing, but it's not clear how much > > > > > > > > > our > > > > > > > > > existing > > > > > > > > > fleet of tests exercises this area. I've locally built > > > > > > > > > a > > > > > > > > > couple of > > > > > > > > > pynfs tests (one is based on the one Dai posted last > > > > > > > > > week) > > > > > > > > > and they > > > > > > > > > pass too. > > > > > > > > > > > > > > > > > > I don't believe that FREE_STATEID needs the same > > > > > > > > > simplification. 
> > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > > > > > > > > > index a280256cbb03..b77894e668a4 100644 > > > > > > > > > --- a/fs/nfsd/nfs4state.c > > > > > > > > > +++ b/fs/nfsd/nfs4state.c > > > > > > > > > @@ -7559,12 +7559,9 @@ nfsd4_release_lockowner(struct > > > > > > > > > svc_rqst *rqstp, > > > > > > > > > > > > > > > > > > /* see if there are still any locks > > > > > > > > > associated with it */ > > > > > > > > > lo = lockowner(sop); > > > > > > > > > - list_for_each_entry(stp, &sop- > > > > > > > > > > so_stateids, > > > > > > > > > st_perstateowner) { > > > > > > > > > - if (check_for_locks(stp- > > > > > > > > > > st_stid.sc_file, lo)) { > > > > > > > > > - status = > > > > > > > > > nfserr_locks_held; > > > > > > > > > - spin_unlock(&clp- > > > > > > > > > > cl_lock); > > > > > > > > > - return status; > > > > > > > > > - } > > > > > > > > > + if (atomic_read(&sop->so_count) > 1) { > > > > > > > > > + spin_unlock(&clp->cl_lock); > > > > > > > > > + return nfserr_locks_held; > > > > > > > > > } > > > > > > > > > > > > > > > > > > nfs4_get_stateowner(sop); > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > lm_get_owner is called from locks_copy_conflock, so if > > > > > > > > someone else > > > > > > > > happens to be doing a LOCKT or F_GETLK call at the same > > > > > > > > time > > > > > > > > that > > > > > > > > RELEASE_LOCKOWNER gets called, then this may end up > > > > > > > > returning > > > > > > > > an error > > > > > > > > inappropriately. > > > > > > > > > > > > > > IMO releasing the lockowner while it's being used for > > > > > > > _anything_ > > > > > > > seems risky and surprising. If RELEASE_LOCKOWNER succeeds > > > > > > > while > > > > > > > the client is still using the lockowner for any reason, a > > > > > > > subsequent error will occur if the client tries to use it > > > > > > > again. > > > > > > > Heck, I can see the server failing in mid-COMPOUND with > > > > > > > this > > > > > > > kind > > > > > > > of race. Better I think to just leave the lockowner in > > > > > > > place if > > > > > > > there's any ambiguity. > > > > > > > > > > > > > > > > > > > The problem here is not the client itself calling > > > > > > RELEASE_LOCKOWNER > > > > > > while it's still in use, but rather a different client > > > > > > altogether > > > > > > calling LOCKT (or a local process does a F_GETLK) on an inode > > > > > > where a > > > > > > lock is held by a client. The LOCKT gets a reference to it > > > > > > (for > > > > > > the > > > > > > conflock), while the client that has the lockowner releases > > > > > > the > > > > > > lock and > > > > > > then the lockowner while the refcount is still high. > > > > > > > > > > > > The race window for this is probably quite small, but I think > > > > > > it's > > > > > > theoretically possible. The point is that an elevated > > > > > > refcount on > > > > > > the > > > > > > lockowner doesn't necessarily mean that locks are actually > > > > > > being > > > > > > held by > > > > > > it. > > > > > > > > > > Sure, I get that the lockowner's reference count is not 100% > > > > > reliable. The question is whether it's good enough. > > > > > > > > > > We are looking for a mechanism that can simply count the number > > > > > of locks held by a lockowner. It sounds like you believe that > > > > > lm_get_owner / put_owner might not be a reliable way to do > > > > > that. 
> > > > > > > > > > > > > > > > > The spec language does not say RELEASE_LOCKOWNER must not > > > > > > > return > > > > > > > LOCKS_HELD for other reasons, and it does say that there is > > > > > > > no > > > > > > > choice of using another NFSERR value (RFC 7530 Section > > > > > > > 13.2). > > > > > > > > > > > > > > > > > > > What recourse does the client have if this happens? It > > > > > > released > > > > > > all of > > > > > > its locks and tried to release the lockowner, but the server > > > > > > says > > > > > > "locks > > > > > > held". Should it just give up at that point? > > > > > > RELEASE_LOCKOWNER is > > > > > > a sort > > > > > > of a courtesy by the client, I suppose... > > > > > > > > > > RELEASE_LOCKOWNER is a courtesy for the server. Most clients > > > > > ignore the return code IIUC. > > > > > > > > > > So the hazard caused by this race would be a small resource > > > > > leak on the server that would go away once the client's lease > > > > > was purged. > > > > > > > > > > > > > > > > > > My guess is that that would be pretty hard to hit the > > > > > > > > timing right, but not impossible. > > > > > > > > > > > > > > > > What we may want to do is have the kernel do this check > > > > > > > > and > > > > > > > > only if it > > > > > > > > comes back >1 do the actual check for locks. That won't > > > > > > > > fix > > > > > > > > the original > > > > > > > > problem though. > > > > > > > > > > > > > > > > In other places in nfsd, we've plumbed in a dispose_list > > > > > > > > head > > > > > > > > and > > > > > > > > deferred the sleeping functions until the spinlock can be > > > > > > > > dropped. I > > > > > > > > haven't looked closely at whether that's possible here, > > > > > > > > but > > > > > > > > it may be a > > > > > > > > more reliable approach. > > > > > > > > > > > > > > That was proposed by Dai last week. > > > > > > > > > > > > > > https://lore.kernel.org/linux-nfs/1653079929-18283-1-git-send-email-dai.ngo@oracle.com/T/#u > > > > > > > > > > > > > > Trond pointed out that if two separate clients were > > > > > > > releasing a > > > > > > > lockowner on the same inode, there is nothing that protects > > > > > > > the > > > > > > > dispose_list, and it would get corrupted. > > > > > > > > > > > > > > https://lore.kernel.org/linux-nfs/31E87CEF-C83D-4FA8-A774-F2C389011FCE@oracle.com/T/#mf1fc1ae0503815c0a36ae75a95086c3eff892614 > > > > > > > > > > > > > > > > > > > Yeah, that doesn't look like what's needed. > > > > > > > > > > > > What I was going to suggest is a nfsd_file_put variant that > > > > > > takes > > > > > > a > > > > > > list_head. If the refcount goes to zero and the thing ends up > > > > > > being > > > > > > unhashed, then you put it on the dispose list rather than > > > > > > doing > > > > > > the > > > > > > blocking operations, and then clean it up later. > > > > > > > > > > Trond doesn't like that approach; see the e-mail thread. > > > > > > > > > > > > > I didn't see him saying that that would be wrong, per-se, but the > > > > initial implementation was racy. > > > > > > > > His suggestion was just to keep a counter in the lockowner of how > > > > many > > > > locks are associated with it. That seems like a good suggestion, > > > > though > > > > you'd probably need to add a parameter to lm_get_owner to > > > > indicate > > > > whether you were adding a new lock or just doing a conflock copy. > > > > > > I don't think this should be necessary. 
The posix_lock code doesn't > > > ever use a struct file_lock that it hasn't allocated itself. We > > > should > > > always be calling conflock to copy from whatever struct file_lock > > > that > > > the caller passed as an argument. > > > > > > IOW: the number of lm_get_owner and lm_put_owner calls should > > > always be > > > 100% balanced once all the locks belonging to a specific lock owner > > > are > > > removed. > > > > > > > We take references to the owner when we go to add a lock record, or > > when > > copying a conflicting lock. You want to keep a count of the former > > without counting the latter. > > > > lm_get_owner gets called for both though. I don't see how you can > > disambiguate the two situations w/o some way to indicate that. Adding > > a > > bool argument to lm_get_owner/lm_put_owner ops would be pretty simple > > to > > implement, I think. > > > > Hmm... That should be an extremely unlikely race, given that the > conflicting lock reference would have to be held for long enough to > cover the unlock + the release_lockowner / free_stateid RPC calls from > the client initially holding the lock, however I agree it is a > theoretical possibility. > If we want to live with the possibility of that race, then Chuck's original patch is fine, since the object refcount would always be equivalent to the lock count. That said, I think it'd be better to keep an accurate count of lock records (sans conflocks) in the owner.
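To make the counting proposal concrete, here is one way the bool-argument variant could look. This is only a sketch, not code from any posted patch; the "conflock" parameter and the lo_lockcnt field are invented names for illustration:

/* include/linux/fs.h: tell the owner callbacks why they are invoked */
fl_owner_t (*lm_get_owner)(fl_owner_t owner, bool conflock);
void (*lm_put_owner)(fl_owner_t owner, bool conflock);

/* fs/nfsd/nfs4state.c: count only real lock records */
static fl_owner_t
nfsd4_lm_get_owner(fl_owner_t owner, bool conflock)
{
	struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;

	if (!conflock)
		atomic_inc(&lo->lo_lockcnt);	/* hypothetical atomic_t field */
	nfs4_get_stateowner(&lo->lo_owner);
	return owner;
}

static void
nfsd4_lm_put_owner(fl_owner_t owner, bool conflock)
{
	struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;

	if (!lo)
		return;
	if (!conflock)
		atomic_dec(&lo->lo_lockcnt);
	nfs4_put_stateowner(&lo->lo_owner);
}

nfsd4_release_lockowner() could then test atomic_read(&lo->lo_lockcnt) under clp->cl_lock instead of inferring lock state from so_count.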
> On May 23, 2022, at 2:30 PM, Jeff Layton <jlayton@kernel.org> wrote: > > On Mon, 2022-05-23 at 18:21 +0000, Trond Myklebust wrote: >> On Mon, 2022-05-23 at 14:04 -0400, Jeff Layton wrote: >>> On Mon, 2022-05-23 at 17:43 +0000, Trond Myklebust wrote: >>>> On Mon, 2022-05-23 at 12:37 -0400, Jeff Layton wrote: >>>>> On Mon, 2022-05-23 at 15:41 +0000, Chuck Lever III wrote: >>>>>> >>>>>>> On May 23, 2022, at 11:26 AM, Jeff Layton >>>>>>> <jlayton@kernel.org> >>>>>>> wrote: >>>>>>> >>>>>>> On Mon, 2022-05-23 at 15:00 +0000, Chuck Lever III wrote: >>>>>>>> >>>>>>>>> On May 23, 2022, at 9:40 AM, Jeff Layton >>>>>>>>> <jlayton@kernel.org> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> On Sun, 2022-05-22 at 11:38 -0400, Chuck Lever wrote: >>>>>>>>>> nfsd4_release_lockowner() holds clp->cl_lock when it >>>>>>>>>> calls >>>>>>>>>> check_for_locks(). However, check_for_locks() calls >>>>>>>>>> nfsd_file_get() >>>>>>>>>> / nfsd_file_put() to access the backing inode's >>>>>>>>>> flc_posix >>>>>>>>>> list, and >>>>>>>>>> nfsd_file_put() can sleep if the inode was recently >>>>>>>>>> removed. >>>>>>>>>> >>>>>>>>> >>>>>>>>> It might be good to add a might_sleep() to nfsd_file_put? >>>>>>>> >>>>>>>> I intend to include the patch you reviewed last week that >>>>>>>> adds the might_sleep(), as part of this series. >>>>>>>> >>>>>>>> >>>>>>>>>> Let's instead rely on the stateowner's reference count >>>>>>>>>> to >>>>>>>>>> gate >>>>>>>>>> whether the release is permitted. This should be a >>>>>>>>>> reliable >>>>>>>>>> indication of locks-in-use since file lock operations >>>>>>>>>> and >>>>>>>>>> ->lm_get_owner take appropriate references, which are >>>>>>>>>> released >>>>>>>>>> appropriately when file locks are removed. >>>>>>>>>> >>>>>>>>>> Reported-by: Dai Ngo <dai.ngo@oracle.com> >>>>>>>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> >>>>>>>>>> Cc: stable@vger.kernel.org >>>>>>>>>> --- >>>>>>>>>> fs/nfsd/nfs4state.c | 9 +++------ >>>>>>>>>> 1 file changed, 3 insertions(+), 6 deletions(-) >>>>>>>>>> >>>>>>>>>> This might be a naive approach, but let's start with >>>>>>>>>> it. >>>>>>>>>> >>>>>>>>>> This passes light testing, but it's not clear how much >>>>>>>>>> our >>>>>>>>>> existing >>>>>>>>>> fleet of tests exercises this area. I've locally built >>>>>>>>>> a >>>>>>>>>> couple of >>>>>>>>>> pynfs tests (one is based on the one Dai posted last >>>>>>>>>> week) >>>>>>>>>> and they >>>>>>>>>> pass too. >>>>>>>>>> >>>>>>>>>> I don't believe that FREE_STATEID needs the same >>>>>>>>>> simplification. 
>>>>>>>>>> >>>>>>>>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c >>>>>>>>>> index a280256cbb03..b77894e668a4 100644 >>>>>>>>>> --- a/fs/nfsd/nfs4state.c >>>>>>>>>> +++ b/fs/nfsd/nfs4state.c >>>>>>>>>> @@ -7559,12 +7559,9 @@ nfsd4_release_lockowner(struct >>>>>>>>>> svc_rqst *rqstp, >>>>>>>>>> >>>>>>>>>> /* see if there are still any locks >>>>>>>>>> associated with it */ >>>>>>>>>> lo = lockowner(sop); >>>>>>>>>> - list_for_each_entry(stp, &sop- >>>>>>>>>>> so_stateids, >>>>>>>>>> st_perstateowner) { >>>>>>>>>> - if (check_for_locks(stp- >>>>>>>>>>> st_stid.sc_file, lo)) { >>>>>>>>>> - status = >>>>>>>>>> nfserr_locks_held; >>>>>>>>>> - spin_unlock(&clp- >>>>>>>>>>> cl_lock); >>>>>>>>>> - return status; >>>>>>>>>> - } >>>>>>>>>> + if (atomic_read(&sop->so_count) > 1) { >>>>>>>>>> + spin_unlock(&clp->cl_lock); >>>>>>>>>> + return nfserr_locks_held; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> nfs4_get_stateowner(sop); >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> lm_get_owner is called from locks_copy_conflock, so if >>>>>>>>> someone else >>>>>>>>> happens to be doing a LOCKT or F_GETLK call at the same >>>>>>>>> time >>>>>>>>> that >>>>>>>>> RELEASE_LOCKOWNER gets called, then this may end up >>>>>>>>> returning >>>>>>>>> an error >>>>>>>>> inappropriately. >>>>>>>> >>>>>>>> IMO releasing the lockowner while it's being used for >>>>>>>> _anything_ >>>>>>>> seems risky and surprising. If RELEASE_LOCKOWNER succeeds >>>>>>>> while >>>>>>>> the client is still using the lockowner for any reason, a >>>>>>>> subsequent error will occur if the client tries to use it >>>>>>>> again. >>>>>>>> Heck, I can see the server failing in mid-COMPOUND with >>>>>>>> this >>>>>>>> kind >>>>>>>> of race. Better I think to just leave the lockowner in >>>>>>>> place if >>>>>>>> there's any ambiguity. >>>>>>>> >>>>>>> >>>>>>> The problem here is not the client itself calling >>>>>>> RELEASE_LOCKOWNER >>>>>>> while it's still in use, but rather a different client >>>>>>> altogether >>>>>>> calling LOCKT (or a local process does a F_GETLK) on an inode >>>>>>> where a >>>>>>> lock is held by a client. The LOCKT gets a reference to it >>>>>>> (for >>>>>>> the >>>>>>> conflock), while the client that has the lockowner releases >>>>>>> the >>>>>>> lock and >>>>>>> then the lockowner while the refcount is still high. >>>>>>> >>>>>>> The race window for this is probably quite small, but I think >>>>>>> it's >>>>>>> theoretically possible. The point is that an elevated >>>>>>> refcount on >>>>>>> the >>>>>>> lockowner doesn't necessarily mean that locks are actually >>>>>>> being >>>>>>> held by >>>>>>> it. >>>>>> >>>>>> Sure, I get that the lockowner's reference count is not 100% >>>>>> reliable. The question is whether it's good enough. >>>>>> >>>>>> We are looking for a mechanism that can simply count the number >>>>>> of locks held by a lockowner. It sounds like you believe that >>>>>> lm_get_owner / put_owner might not be a reliable way to do >>>>>> that. >>>>>> >>>>>> >>>>>>>> The spec language does not say RELEASE_LOCKOWNER must not >>>>>>>> return >>>>>>>> LOCKS_HELD for other reasons, and it does say that there is >>>>>>>> no >>>>>>>> choice of using another NFSERR value (RFC 7530 Section >>>>>>>> 13.2). >>>>>>>> >>>>>>> >>>>>>> What recourse does the client have if this happens? It >>>>>>> released >>>>>>> all of >>>>>>> its locks and tried to release the lockowner, but the server >>>>>>> says >>>>>>> "locks >>>>>>> held". Should it just give up at that point? 
>>>>>>> RELEASE_LOCKOWNER is >>>>>>> a sort >>>>>>> of a courtesy by the client, I suppose... >>>>>> >>>>>> RELEASE_LOCKOWNER is a courtesy for the server. Most clients >>>>>> ignore the return code IIUC. >>>>>> >>>>>> So the hazard caused by this race would be a small resource >>>>>> leak on the server that would go away once the client's lease >>>>>> was purged. >>>>>> >>>>>> >>>>>>>>> My guess is that that would be pretty hard to hit the >>>>>>>>> timing right, but not impossible. >>>>>>>>> >>>>>>>>> What we may want to do is have the kernel do this check >>>>>>>>> and >>>>>>>>> only if it >>>>>>>>> comes back >1 do the actual check for locks. That won't >>>>>>>>> fix >>>>>>>>> the original >>>>>>>>> problem though. >>>>>>>>> >>>>>>>>> In other places in nfsd, we've plumbed in a dispose_list >>>>>>>>> head >>>>>>>>> and >>>>>>>>> deferred the sleeping functions until the spinlock can be >>>>>>>>> dropped. I >>>>>>>>> haven't looked closely at whether that's possible here, >>>>>>>>> but >>>>>>>>> it may be a >>>>>>>>> more reliable approach. >>>>>>>> >>>>>>>> That was proposed by Dai last week. >>>>>>>> >>>>>>>> https://lore.kernel.org/linux-nfs/1653079929-18283-1-git-send-email-dai.ngo@oracle.com/T/#u >>>>>>>> >>>>>>>> Trond pointed out that if two separate clients were >>>>>>>> releasing a >>>>>>>> lockowner on the same inode, there is nothing that protects >>>>>>>> the >>>>>>>> dispose_list, and it would get corrupted. >>>>>>>> >>>>>>>> https://lore.kernel.org/linux-nfs/31E87CEF-C83D-4FA8-A774-F2C389011FCE@oracle.com/T/#mf1fc1ae0503815c0a36ae75a95086c3eff892614 >>>>>>>> >>>>>>> >>>>>>> Yeah, that doesn't look like what's needed. >>>>>>> >>>>>>> What I was going to suggest is a nfsd_file_put variant that >>>>>>> takes >>>>>>> a >>>>>>> list_head. If the refcount goes to zero and the thing ends up >>>>>>> being >>>>>>> unhashed, then you put it on the dispose list rather than >>>>>>> doing >>>>>>> the >>>>>>> blocking operations, and then clean it up later. >>>>>> >>>>>> Trond doesn't like that approach; see the e-mail thread. >>>>>> >>>>> >>>>> I didn't see him saying that that would be wrong, per-se, but the >>>>> initial implementation was racy. >>>>> >>>>> His suggestion was just to keep a counter in the lockowner of how >>>>> many >>>>> locks are associated with it. That seems like a good suggestion, >>>>> though >>>>> you'd probably need to add a parameter to lm_get_owner to >>>>> indicate >>>>> whether you were adding a new lock or just doing a conflock copy. >>>> >>>> I don't think this should be necessary. The posix_lock code doesn't >>>> ever use a struct file_lock that it hasn't allocated itself. We >>>> should >>>> always be calling conflock to copy from whatever struct file_lock >>>> that >>>> the caller passed as an argument. >>>> >>>> IOW: the number of lm_get_owner and lm_put_owner calls should >>>> always be >>>> 100% balanced once all the locks belonging to a specific lock owner >>>> are >>>> removed. >>>> >>> >>> We take references to the owner when we go to add a lock record, or >>> when >>> copying a conflicting lock. You want to keep a count of the former >>> without counting the latter. >>> >>> lm_get_owner gets called for both though. I don't see how you can >>> disambiguate the two situations w/o some way to indicate that. Adding >>> a >>> bool argument to lm_get_owner/lm_put_owner ops would be pretty simple >>> to >>> implement, I think. >>> >> >> Hmm... 
That should be an extremely unlikely race, given that the >> conflicting lock reference would have to be held for long enough to >> cover the unlock + the release_lockowner / free_stateid RPC calls from >> the client initially holding the lock, however I agree it is a >> theoretical possibility. >> > > If we want to live with the possibility of that race, then Chuck's > original patch is fine, since the object refcount would always be > equivalent to the lock count. > > That said, I think it'd be better to keep an accurate count of lock > records (sans conflocks) in the owner. I don't have an objection to maintaining an accurate count of locks belonging to each lockowner. I feel comfortable (for now) that using so_count is a good-enough solution, and is an improvement over holding a spinlock while trying to sleep. If we go with so_count, I can add a comment that explains the uses of so_count and possible races. OTOH, if we alter the lm_get/put_owner API, I would like to ensure the new parameter will be difficult to misuse. A few more random thoughts that might count towards due diligence: - Is there a reliable way lm_get/put_owner can distinguish between the two cases by looking at their current set of arguments? - Is it clear why the conflict case needs to take a reference on a lockowner that is not involved? Is there a way to avoid taking that reference in the first place? -- Chuck Lever
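As a concrete rendering of the "so_count plus a comment" option, the check in nfsd4_release_lockowner() might read as follows; the comment text is illustrative wording only, not from a posted patch:

	/*
	 * so_count is a reference count, not a lock count. It is
	 * bumped by ->lm_get_owner for each lock record naming this
	 * owner, but also transiently by conflock copies (LOCKT,
	 * F_GETLK) and by other in-flight users of the stateowner,
	 * so a value above 1 is a conservative "locks may still be
	 * held" test that can occasionally return a false LOCKS_HELD.
	 */
	if (atomic_read(&sop->so_count) > 1) {
		spin_unlock(&clp->cl_lock);
		return nfserr_locks_held;
	}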
> On May 23, 2022, at 1:38 PM, Jeff Layton <jlayton@kernel.org> wrote: > > On Mon, 2022-05-23 at 17:25 +0000, Chuck Lever III wrote: >> >>> On May 23, 2022, at 12:37 PM, Jeff Layton <jlayton@kernel.org> wrote: >>> >>> His suggestion was just to keep a counter in the lockowner of how many >>> locks are associated with it. That seems like a good suggestion, though >>> you'd probably need to add a parameter to lm_get_owner to indicate >>> whether you were adding a new lock or just doing a conflock copy. >> >> locks_copy_conflock() would need to take a boolean parameter >> that callers would set when they actually manipulate a lock. >> > > Yep. You'd also have to add a bool arg to lm_put_owner so that you know > whether you need to decrement the counter. It's the lm_put_owner() side that looks less than straightforward. Suggestions and advice welcome there. -- Chuck Lever
On Mon, 2022-05-23 at 19:13 +0000, Chuck Lever III wrote: > > > On May 23, 2022, at 2:30 PM, Jeff Layton <jlayton@kernel.org> wrote: > > > > On Mon, 2022-05-23 at 18:21 +0000, Trond Myklebust wrote: > > > On Mon, 2022-05-23 at 14:04 -0400, Jeff Layton wrote: > > > > On Mon, 2022-05-23 at 17:43 +0000, Trond Myklebust wrote: > > > > > On Mon, 2022-05-23 at 12:37 -0400, Jeff Layton wrote: > > > > > > On Mon, 2022-05-23 at 15:41 +0000, Chuck Lever III wrote: > > > > > > > > > > > > > > > On May 23, 2022, at 11:26 AM, Jeff Layton > > > > > > > > <jlayton@kernel.org> > > > > > > > > wrote: > > > > > > > > > > > > > > > > On Mon, 2022-05-23 at 15:00 +0000, Chuck Lever III wrote: > > > > > > > > > > > > > > > > > > > On May 23, 2022, at 9:40 AM, Jeff Layton > > > > > > > > > > <jlayton@kernel.org> > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > On Sun, 2022-05-22 at 11:38 -0400, Chuck Lever wrote: > > > > > > > > > > > nfsd4_release_lockowner() holds clp->cl_lock when it > > > > > > > > > > > calls > > > > > > > > > > > check_for_locks(). However, check_for_locks() calls > > > > > > > > > > > nfsd_file_get() > > > > > > > > > > > / nfsd_file_put() to access the backing inode's > > > > > > > > > > > flc_posix > > > > > > > > > > > list, and > > > > > > > > > > > nfsd_file_put() can sleep if the inode was recently > > > > > > > > > > > removed. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It might be good to add a might_sleep() to nfsd_file_put? > > > > > > > > > > > > > > > > > > I intend to include the patch you reviewed last week that > > > > > > > > > adds the might_sleep(), as part of this series. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Let's instead rely on the stateowner's reference count > > > > > > > > > > > to > > > > > > > > > > > gate > > > > > > > > > > > whether the release is permitted. This should be a > > > > > > > > > > > reliable > > > > > > > > > > > indication of locks-in-use since file lock operations > > > > > > > > > > > and > > > > > > > > > > > ->lm_get_owner take appropriate references, which are > > > > > > > > > > > released > > > > > > > > > > > appropriately when file locks are removed. > > > > > > > > > > > > > > > > > > > > > > Reported-by: Dai Ngo <dai.ngo@oracle.com> > > > > > > > > > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com> > > > > > > > > > > > Cc: stable@vger.kernel.org > > > > > > > > > > > --- > > > > > > > > > > > fs/nfsd/nfs4state.c | 9 +++------ > > > > > > > > > > > 1 file changed, 3 insertions(+), 6 deletions(-) > > > > > > > > > > > > > > > > > > > > > > This might be a naive approach, but let's start with > > > > > > > > > > > it. > > > > > > > > > > > > > > > > > > > > > > This passes light testing, but it's not clear how much > > > > > > > > > > > our > > > > > > > > > > > existing > > > > > > > > > > > fleet of tests exercises this area. I've locally built > > > > > > > > > > > a > > > > > > > > > > > couple of > > > > > > > > > > > pynfs tests (one is based on the one Dai posted last > > > > > > > > > > > week) > > > > > > > > > > > and they > > > > > > > > > > > pass too. > > > > > > > > > > > > > > > > > > > > > > I don't believe that FREE_STATEID needs the same > > > > > > > > > > > simplification. 
> > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > > > > > > > > > > > index a280256cbb03..b77894e668a4 100644 > > > > > > > > > > > --- a/fs/nfsd/nfs4state.c > > > > > > > > > > > +++ b/fs/nfsd/nfs4state.c > > > > > > > > > > > @@ -7559,12 +7559,9 @@ nfsd4_release_lockowner(struct > > > > > > > > > > > svc_rqst *rqstp, > > > > > > > > > > > > > > > > > > > > > > /* see if there are still any locks > > > > > > > > > > > associated with it */ > > > > > > > > > > > lo = lockowner(sop); > > > > > > > > > > > - list_for_each_entry(stp, &sop- > > > > > > > > > > > > so_stateids, > > > > > > > > > > > st_perstateowner) { > > > > > > > > > > > - if (check_for_locks(stp- > > > > > > > > > > > > st_stid.sc_file, lo)) { > > > > > > > > > > > - status = > > > > > > > > > > > nfserr_locks_held; > > > > > > > > > > > - spin_unlock(&clp- > > > > > > > > > > > > cl_lock); > > > > > > > > > > > - return status; > > > > > > > > > > > - } > > > > > > > > > > > + if (atomic_read(&sop->so_count) > 1) { > > > > > > > > > > > + spin_unlock(&clp->cl_lock); > > > > > > > > > > > + return nfserr_locks_held; > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > nfs4_get_stateowner(sop); > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > lm_get_owner is called from locks_copy_conflock, so if > > > > > > > > > > someone else > > > > > > > > > > happens to be doing a LOCKT or F_GETLK call at the same > > > > > > > > > > time > > > > > > > > > > that > > > > > > > > > > RELEASE_LOCKOWNER gets called, then this may end up > > > > > > > > > > returning > > > > > > > > > > an error > > > > > > > > > > inappropriately. > > > > > > > > > > > > > > > > > > IMO releasing the lockowner while it's being used for > > > > > > > > > _anything_ > > > > > > > > > seems risky and surprising. If RELEASE_LOCKOWNER succeeds > > > > > > > > > while > > > > > > > > > the client is still using the lockowner for any reason, a > > > > > > > > > subsequent error will occur if the client tries to use it > > > > > > > > > again. > > > > > > > > > Heck, I can see the server failing in mid-COMPOUND with > > > > > > > > > this > > > > > > > > > kind > > > > > > > > > of race. Better I think to just leave the lockowner in > > > > > > > > > place if > > > > > > > > > there's any ambiguity. > > > > > > > > > > > > > > > > > > > > > > > > > The problem here is not the client itself calling > > > > > > > > RELEASE_LOCKOWNER > > > > > > > > while it's still in use, but rather a different client > > > > > > > > altogether > > > > > > > > calling LOCKT (or a local process does a F_GETLK) on an inode > > > > > > > > where a > > > > > > > > lock is held by a client. The LOCKT gets a reference to it > > > > > > > > (for > > > > > > > > the > > > > > > > > conflock), while the client that has the lockowner releases > > > > > > > > the > > > > > > > > lock and > > > > > > > > then the lockowner while the refcount is still high. > > > > > > > > > > > > > > > > The race window for this is probably quite small, but I think > > > > > > > > it's > > > > > > > > theoretically possible. The point is that an elevated > > > > > > > > refcount on > > > > > > > > the > > > > > > > > lockowner doesn't necessarily mean that locks are actually > > > > > > > > being > > > > > > > > held by > > > > > > > > it. > > > > > > > > > > > > > > Sure, I get that the lockowner's reference count is not 100% > > > > > > > reliable. The question is whether it's good enough. 
> > > > > > > > > > > > > > We are looking for a mechanism that can simply count the number > > > > > > > of locks held by a lockowner. It sounds like you believe that > > > > > > > lm_get_owner / put_owner might not be a reliable way to do > > > > > > > that. > > > > > > > > > > > > > > > > > > > > > > > The spec language does not say RELEASE_LOCKOWNER must not > > > > > > > > > return > > > > > > > > > LOCKS_HELD for other reasons, and it does say that there is > > > > > > > > > no > > > > > > > > > choice of using another NFSERR value (RFC 7530 Section > > > > > > > > > 13.2). > > > > > > > > > > > > > > > > > > > > > > > > > What recourse does the client have if this happens? It > > > > > > > > released > > > > > > > > all of > > > > > > > > its locks and tried to release the lockowner, but the server > > > > > > > > says > > > > > > > > "locks > > > > > > > > held". Should it just give up at that point? > > > > > > > > RELEASE_LOCKOWNER is > > > > > > > > a sort > > > > > > > > of a courtesy by the client, I suppose... > > > > > > > > > > > > > > RELEASE_LOCKOWNER is a courtesy for the server. Most clients > > > > > > > ignore the return code IIUC. > > > > > > > > > > > > > > So the hazard caused by this race would be a small resource > > > > > > > leak on the server that would go away once the client's lease > > > > > > > was purged. > > > > > > > > > > > > > > > > > > > > > > > > My guess is that that would be pretty hard to hit the > > > > > > > > > > timing right, but not impossible. > > > > > > > > > > > > > > > > > > > > What we may want to do is have the kernel do this check > > > > > > > > > > and > > > > > > > > > > only if it > > > > > > > > > > comes back >1 do the actual check for locks. That won't > > > > > > > > > > fix > > > > > > > > > > the original > > > > > > > > > > problem though. > > > > > > > > > > > > > > > > > > > > In other places in nfsd, we've plumbed in a dispose_list > > > > > > > > > > head > > > > > > > > > > and > > > > > > > > > > deferred the sleeping functions until the spinlock can be > > > > > > > > > > dropped. I > > > > > > > > > > haven't looked closely at whether that's possible here, > > > > > > > > > > but > > > > > > > > > > it may be a > > > > > > > > > > more reliable approach. > > > > > > > > > > > > > > > > > > That was proposed by Dai last week. > > > > > > > > > > > > > > > > > > https://lore.kernel.org/linux-nfs/1653079929-18283-1-git-send-email-dai.ngo@oracle.com/T/#u > > > > > > > > > > > > > > > > > > Trond pointed out that if two separate clients were > > > > > > > > > releasing a > > > > > > > > > lockowner on the same inode, there is nothing that protects > > > > > > > > > the > > > > > > > > > dispose_list, and it would get corrupted. > > > > > > > > > > > > > > > > > > https://lore.kernel.org/linux-nfs/31E87CEF-C83D-4FA8-A774-F2C389011FCE@oracle.com/T/#mf1fc1ae0503815c0a36ae75a95086c3eff892614 > > > > > > > > > > > > > > > > > > > > > > > > > Yeah, that doesn't look like what's needed. > > > > > > > > > > > > > > > > What I was going to suggest is a nfsd_file_put variant that > > > > > > > > takes > > > > > > > > a > > > > > > > > list_head. If the refcount goes to zero and the thing ends up > > > > > > > > being > > > > > > > > unhashed, then you put it on the dispose list rather than > > > > > > > > doing > > > > > > > > the > > > > > > > > blocking operations, and then clean it up later. > > > > > > > > > > > > > > Trond doesn't like that approach; see the e-mail thread. 
> > > > > > > > > > > > > > > > > > > I didn't see him saying that that would be wrong, per-se, but the > > > > > > initial implementation was racy. > > > > > > > > > > > > His suggestion was just to keep a counter in the lockowner of how > > > > > > many > > > > > > locks are associated with it. That seems like a good suggestion, > > > > > > though > > > > > > you'd probably need to add a parameter to lm_get_owner to > > > > > > indicate > > > > > > whether you were adding a new lock or just doing a conflock copy. > > > > > > > > > > I don't think this should be necessary. The posix_lock code doesn't > > > > > ever use a struct file_lock that it hasn't allocated itself. We > > > > > should > > > > > always be calling conflock to copy from whatever struct file_lock > > > > > that > > > > > the caller passed as an argument. > > > > > > > > > > IOW: the number of lm_get_owner and lm_put_owner calls should > > > > > always be > > > > > 100% balanced once all the locks belonging to a specific lock owner > > > > > are > > > > > removed. > > > > > > > > > > > > > We take references to the owner when we go to add a lock record, or > > > > when > > > > copying a conflicting lock. You want to keep a count of the former > > > > without counting the latter. > > > > > > > > lm_get_owner gets called for both though. I don't see how you can > > > > disambiguate the two situations w/o some way to indicate that. Adding > > > > a > > > > bool argument to lm_get_owner/lm_put_owner ops would be pretty simple > > > > to > > > > implement, I think. > > > > > > > > > > Hmm... That should be an extremely unlikely race, given that the > > > conflicting lock reference would have to be held for long enough to > > > cover the unlock + the release_lockowner / free_stateid RPC calls from > > > the client initially holding the lock, however I agree it is a > > > theoretical possibility. > > > > > > > If we want to live with the possibility of that race, then Chuck's > > original patch is fine, since the object refcount would always be > > equivalent to the lock count. > > > > That said, I think it'd be better to keep an accurate count of lock > > records (sans conflocks) in the owner. > > I don't have an objection to maintaining an accurate count of locks > belonging to each lockowner. I feel comfortable (for now) that using > so_count is a good-enough solution, and is an improvement over holding > a spinlock while trying to sleep. If we go with so_count, I can add > a comment that explains the uses of so_count and possible races. > > OTOH, if we alter the lm_get/put_owner API, I would like to ensure the > new parameter will be difficult to misuse. A few more random thoughts > that might count towards due diligence: > > - Is there a reliable way lm_get/put_owner can distinguish between > the two cases by looking at their current set of arguments? > Not really. It just takes a lockowner pointer now so you don't have any indication of why the reference is being taken or put. > - Is it clear why the conflict case needs to take a reference on a > lockowner that is not involved? Is there a way to avoid taking that > reference in the first place? > The other lockowner _is_ involved. It's the one holding the conflicting lock. nfs4_set_lock_denied copies info from the conflicting lockowner into the LOCK/LOCKT response. That's safe now because it holds a reference to the owner. At one point it wasn't (see commit aef9583b234a4 "NFSD: Get reference of lockowner when coping file_lock", which fixed that).
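For reference, the two callbacks under discussion look roughly like this in mainline at the time of this thread (fs/nfsd/nfs4state.c); the owner pointer really is the only context they receive:

static fl_owner_t
nfsd4_lm_get_owner(fl_owner_t owner)
{
	struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;

	nfs4_get_stateowner(&lo->lo_owner);
	return owner;
}

static void
nfsd4_lm_put_owner(fl_owner_t owner)
{
	struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;

	if (lo)
		nfs4_put_stateowner(&lo->lo_owner);
}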
On Mon, 2022-05-23 at 19:35 +0000, Chuck Lever III wrote: > > > On May 23, 2022, at 1:38 PM, Jeff Layton <jlayton@kernel.org> wrote: > > > > On Mon, 2022-05-23 at 17:25 +0000, Chuck Lever III wrote: > > > > > > > On May 23, 2022, at 12:37 PM, Jeff Layton <jlayton@kernel.org> wrote: > > > > > > > > His suggestion was just to keep a counter in the lockowner of how many > > > > locks are associated with it. That seems like a good suggestion, though > > > > you'd probably need to add a parameter to lm_get_owner to indicate > > > > whether you were adding a new lock or just doing a conflock copy. > > > > > > locks_copy_conflock() would need to take a boolean parameter > > > that callers would set when they actually manipulate a lock. > > > > > > > Yep. You'd also have to add a bool arg to lm_put_owner so that you know > > whether you need to decrement the counter. > > It's the lm_put_owner() side that looks less than straightforward. > Suggestions and advice welcome there. > Maybe add a new fl_flags value that indicates that a particular lock is a conflock and not a lock record? Then locks_release_private could use that to pass the appropriate argument to lm_put_owner. That's probably simpler overall than trying to audit all of the locks_free_lock callers.
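A rough sketch of this fl_flags idea, assuming the bool-argument form of lm_get_owner/lm_put_owner discussed above. FL_CONFLOCK is an invented flag name, and the elided field copies are unchanged from today's code:

#define FL_CONFLOCK	0x2000	/* hypothetical: this file_lock is a conflock copy */

/* fs/locks.c */
void locks_copy_conflock(struct file_lock *new, struct file_lock *fl)
{
	/* ... existing field copies ... */
	new->fl_flags = fl->fl_flags | FL_CONFLOCK;
	if (fl->fl_lmops && fl->fl_lmops->lm_get_owner)
		new->fl_owner = fl->fl_lmops->lm_get_owner(fl->fl_owner, true);
}

void locks_release_private(struct file_lock *fl)
{
	/* ... fl_ops release as today ... */
	if (fl->fl_lmops) {
		if (fl->fl_lmops->lm_put_owner) {
			fl->fl_lmops->lm_put_owner(fl->fl_owner,
						   fl->fl_flags & FL_CONFLOCK);
			fl->fl_owner = NULL;
		}
		fl->fl_lmops = NULL;
	}
}

Because the flag travels with the conflock itself, locks_release_private() knows which kind of reference it is dropping, with no audit of locks_free_lock() callers required.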
On Mon, May 23, 2022 at 03:43:28PM -0400, Jeff Layton wrote: > On Mon, 2022-05-23 at 19:35 +0000, Chuck Lever III wrote: > > > > > > On May 23, 2022, at 1:38 PM, Jeff Layton <jlayton@kernel.org> wrote: > > > > > > > > On Mon, 2022-05-23 at 17:25 +0000, Chuck Lever III wrote: > > > > > > > > > > > On May 23, 2022, at 12:37 PM, Jeff Layton <jlayton@kernel.org> wrote: > > > > > > > > > > > > His suggestion was just to keep a counter in the lockowner of how many > > > > > > locks are associated with it. That seems like a good suggestion, though > > > > > > you'd probably need to add a parameter to lm_get_owner to indicate > > > > > > whether you were adding a new lock or just doing a conflock copy. > > > > > > > > > > locks_copy_conflock() would need to take a boolean parameter > > > > > that callers would set when they actually manipulate a lock. > > > > > > > > > > > > > Yep. You'd also have to add a bool arg to lm_put_owner so that you know > > > > whether you need to decrement the counter. > > > > > > It's the lm_put_owner() side that looks less than straightforward. > > > Suggestions and advice welcome there. > > > > > > > Maybe add a new fl_flags value that indicates that a particular lock is > > a conflock and not a lock record? Then locks_release_private could use > > that to pass the appropriate argument to lm_put_owner. > > > > That's probably simpler overall than trying to audit all of the > > locks_free_lock callers. Should conflock parameters really be represented by file_lock structures at all? It always seemed a little wrong to me. But, that's a bit of a derail, apologies. --b.
On Mon, May 23, 2022 at 03:36:27PM -0400, Jeff Layton wrote: > The other lockowner _is_ involved. It's the one holding the conflicting > lock. nfs4_set_lock_denied copies info from the conflicting lockowner > into the LOCK/LOCKT response. That's safe now because it holds a > reference to the owner. At one point it wasn't (see commit aef9583b234a4 > "NFSD: Get reference of lockowner when coping file_lock", which fixed > that). I doubt that commit fixed the whole problem, for what it's worth. What if the client holding the conflicting lock expires before we get to nfs4_set_lock_denied? --b.
On Mon, 2022-05-23 at 16:17 -0400, J. Bruce Fields wrote: > On Mon, May 23, 2022 at 03:43:28PM -0400, Jeff Layton wrote: > > On Mon, 2022-05-23 at 19:35 +0000, Chuck Lever III wrote: > > > > > > > On May 23, 2022, at 1:38 PM, Jeff Layton <jlayton@kernel.org> wrote: > > > > > > > > On Mon, 2022-05-23 at 17:25 +0000, Chuck Lever III wrote: > > > > > > > > > > > On May 23, 2022, at 12:37 PM, Jeff Layton <jlayton@kernel.org> wrote: > > > > > > > > > > > > His suggestion was just to keep a counter in the lockowner of how many > > > > > > locks are associated with it. That seems like a good suggestion, though > > > > > > you'd probably need to add a parameter to lm_get_owner to indicate > > > > > > whether you were adding a new lock or just doing a conflock copy. > > > > > > > > > > locks_copy_conflock() would need to take a boolean parameter > > > > > that callers would set when they actually manipulate a lock. > > > > > > > > > > > > > Yep. You'd also have to add a bool arg to lm_put_owner so that you know > > > > whether you need to decrement the counter. > > > > > > It's the lm_put_owner() side that looks less than straightforward. > > > Suggestions and advice welcome there. > > > > > > > Maybe add a new fl_flags value that indicates that a particular lock is > > a conflock and not a lock record? Then locks_release_private could use > > that to pass the appropriate argument to lm_put_owner. > > > > That's probably simpler overall than trying to audit all of the > > locks_free_lock callers. > > Should conflock parameters really be represented by file_lock structures > at all? It always seemed a little wrong to me. But, that's a bit of > derail, apologies. > Probably not. Lock requests should also not be represented by struct file_lock, but that decision was made quite some time ago. We could change these things these days, but it'd be a lot of churn. Even if we did use a different struct for conflocks though, it would still need a way to point at the lockowner (or clone/free its info somehow). It wouldn't materially change what we need to do here.
On Mon, 2022-05-23 at 16:29 -0400, J. Bruce Fields wrote: > On Mon, May 23, 2022 at 03:36:27PM -0400, Jeff Layton wrote: > > The other lockowner _is_ involved. It's the one holding the conflicting > > lock. nfs4_set_lock_denied copies info from the conflicting lockowner > > into the LOCK/LOCKT response. That's safe now because it holds a > > reference to the owner. At one point it wasn't (see commit aef9583b234a4 > > "NFSD: Get reference of lockowner when coping file_lock", which fixed > > that). > > I doubt that commit fixed the whole problem, for what it's worth. What > if the client holding the conflicting lock expires before we get to > nfs4_set_lock_denied? > Good point -- stateowners can't hold a client reference. clientid_t is 64 bits, so one thing we could do is just keep a copy of that in the lockowner. That way we wouldn't need to dereference so_client in nfs4_set_lock_denied. Maybe there are better ways to fix it though.
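Sketched out, that suggestion might look like the following; the lo_clientid field is invented for illustration:

/* fs/nfsd/state.h */
struct nfs4_lockowner {
	struct nfs4_stateowner	lo_owner;
	struct list_head	lo_blocked;
	clientid_t		lo_clientid;	/* hypothetical cached copy */
};

/* set once at lockowner creation, while so_client is pinned: */
lo->lo_clientid = clp->cl_clientid;

/* nfs4_set_lock_denied() then needs no so_client dereference: */
deny->ld_clientid = lo->lo_clientid;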
On Mon, May 23, 2022 at 05:15:55PM -0400, Jeff Layton wrote: > On Mon, 2022-05-23 at 16:29 -0400, J. Bruce Fields wrote: > > On Mon, May 23, 2022 at 03:36:27PM -0400, Jeff Layton wrote: > > > The other lockowner _is_ involved. It's the one holding the conflicting > > > lock. nfs4_set_lock_denied copies info from the conflicting lockowner > > > into the LOCK/LOCKT response. That's safe now because it holds a > > > reference to the owner. At one point it wasn't (see commit aef9583b234a4 > > > "NFSD: Get reference of lockowner when coping file_lock", which fixed > > > that). > > > > I doubt that commit fixed the whole problem, for what it's worth. What > > if the client holding the conflicting lock expires before we get to > > nfs4_set_lock_denied? > > > > Good point -- stateowners can't hold a client reference. > > clientid_t is 64 bits, so one thing we could do is just keep a copy of > that in the lockowner. That way we wouldn't need to dereference > so_client in nfs4_set_lock_denied. > > Maybe there are better ways to fix it though. I kinda feel like lock and testlock should both take a

struct conflock {
	void *data;
	void (*copier)(struct file_lock *, struct conflock *);
};

and the caller should provide a "copier" that they can use to extract the data they want while the flc_lock is held. I'm probably overthinking it. --b.
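Filling that idea in slightly (every name here is hypothetical): a testlock path could invoke the caller's copier while flc_lock is still held, so the conflicting owner never needs to outlive the critical section:

/* inside a posix_test_lock()-like helper, under ctx->flc_lock */
list_for_each_entry(cfl, &ctx->flc_posix, fl_list) {
	if (posix_locks_conflict(fl, cfl)) {
		conf->copier(cfl, conf);	/* extract owner info right now */
		break;
	}
}

nfsd's copier would record the clientid and owner name for nfs4_set_lock_denied() on the spot, sidestepping the reference-lifetime questions above.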
> On May 23, 2022, at 1:43 PM, Trond Myklebust <trondmy@hammerspace.com> wrote: > > IOW: the number of lm_get_owner and lm_put_owner calls should always be > 100% balanced once all the locks belonging to a specific lock owner are > removed. It doesn't look like that's the case. If we rely solely on lm_get/put_owner, the lockcnt quickly becomes negative, even without the presence of lock conflicts:

nfsd-999 [001] 155.888211: nfsd_compound: xid=0x0000001c opcnt=2
nfsd-999 [001] 155.888260: nfsd_compound_status: op=1/2 OP_PUTFH status=0
nfsd-999 [001] 155.888283: nfsd_preprocess: seqid=2 client 628c04a3:f2adef0d stateid 00000002:00000002
nfsd-999 [001] 155.888419: bprint: nfsd4_lm_get_owner: lo=0xffff8881955b1050 conflict=false lockcnt=1 so_count=3
nfsd-999 [001] 155.888431: bprint: nfsd4_lm_put_owner.part.0: lo=0xffff8881955b1050 conflict=false lockcnt=1 so_count=3
nfsd-999 [001] 155.888441: nfsd_file_put: hash=0x339 inode=0xffff88816770d908 ref=3 flags=HASHED|REFERENCED may=WRITE|READ file=0xffff88811232f7c0
nfsd-999 [001] 155.888453: nfsd_compound_status: op=2/2 OP_LOCK status=0
nfsd-999 [003] 155.890388: nfsd_compound: xid=0x0000001d opcnt=2
nfsd-999 [003] 155.890457: nfsd_compound_status: op=1/2 OP_PUTFH status=0
nfsd-999 [003] 155.890459: nfsd_preprocess: seqid=1 client 628c04a3:f2adef0d stateid 00000003:00000001
nfsd-999 [003] 155.890479: bprint: nfsd4_lm_put_owner.part.0: lo=0xffff8881955b1050 conflict=false lockcnt=0 so_count=4
nfsd-999 [003] 155.890490: nfsd_file_put: hash=0x339 inode=0xffff88816770d908 ref=3 flags=HASHED|REFERENCED may=WRITE|READ file=0xffff88811232f7c0
nfsd-999 [003] 155.890492: bprint: nfsd4_lm_put_owner.part.0: lo=0xffff8881955b1050 conflict=false lockcnt=-1 so_count=3
nfsd-999 [003] 155.890502: nfsd_compound_status: op=2/2 OP_LOCKU status=0
nfsd-999 [003] 155.892494: nfsd_compound: xid=0x0000001e opcnt=5
nfsd-999 [003] 155.892498: nfsd_compound_status: op=1/5 OP_PUTROOTFH status=0
nfsd-999 [003] 155.892528: nfsd_compound_status: op=2/5 OP_LOOKUP status=0
nfsd-999 [003] 155.892552: nfsd_compound_status: op=3/5 OP_LOOKUP status=0
nfsd-999 [003] 155.892576: nfsd_compound_status: op=4/5 OP_LOOKUP status=0
nfsd-999 [003] 155.892604: nfsd_file_fsnotify_handle_event: inode=0xffff88816770d908 nlink=0 mode=0100644 mask=0x4
nfsd-999 [003] 155.892605: nfsd_file_unhash_and_release_locked: hash=0x339 inode=0xffff88816770d908 ref=2 flags=HASHED|REFERENCED may=WRITE|READ file=0xffff88811232f7c0
nfsd-999 [003] 155.892606: nfsd_file_unhash: hash=0x339 inode=0xffff88816770d908 ref=2 flags=REFERENCED may=WRITE|READ file=0xffff88811232f7c0
nfsd-999 [003] 155.892607: nfsd_file_close_inode: hash=0x339 inode=0xffff88816770d908 found=0
nfsd-999 [003] 155.892610: nfsd_compound_status: op=5/5 OP_REMOVE status=0
nfsd-999 [001] 155.894663: nfsd_compound: xid=0x0000001f opcnt=1
nfsd-999 [001] 155.894667: bprint: nfsd4_release_lockowner: lo=0xffff8881955b1050 lockcnt=-2 so_count=1
nfsd-999 [001] 155.894670: nfsd_file_put: hash=0x339 inode=0xffff88816770d908 ref=2 flags=REFERENCED may=WRITE|READ file=0xffff88811232f7c0
nfsd-999 [001] 155.894706: nfsd_compound_status: op=1/1 OP_RELEASE_LOCKOWNER status=0

There are other places inside NFSD that call nfs4_get/put_stateowner on a lockowner. That seems to be what keeps the so_count above water. Or, another way to look at it: simply saving the value of the conflict boolean in file_lock and passing that to lm_put_owner() doesn't seem to be adequate. -- Chuck Lever
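For context: "bprint" entries in ftrace output come from trace_printk(). The lockcnt/so_count values above were presumably emitted by debug calls of roughly this shape inside the experimentally modified callbacks (a reconstruction, not code from a posted patch):

	trace_printk("lo=%p conflict=%s lockcnt=%d so_count=%d\n",
		     lo, conflict ? "true" : "false",
		     atomic_read(&lo->lo_lockcnt),
		     atomic_read(&lo->lo_owner.so_count));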
On Mon, May 23, 2022 at 05:28:16PM -0400, J. Bruce Fields wrote: > On Mon, May 23, 2022 at 05:15:55PM -0400, Jeff Layton wrote: > > On Mon, 2022-05-23 at 16:29 -0400, J. Bruce Fields wrote: > > > On Mon, May 23, 2022 at 03:36:27PM -0400, Jeff Layton wrote: > > > > The other lockowner _is_ involved. It's the one holding the conflicting > > > > lock. nfs4_set_lock_denied copies info from the conflicting lockowner > > > > into the LOCK/LOCKT response. That's safe now because it holds a > > > > reference to the owner. At one point it wasn't (see commit aef9583b234a4 > > > > "NFSD: Get reference of lockowner when coping file_lock", which fixed > > > > that). > > > > > > I doubt that commit fixed the whole problem, for what it's worth. What > > > if the client holding the conflicting lock expires before we get to > > > nfs4_set_lock_denied? > > > > > > > Good point -- stateowners can't hold a client reference. > > > > clientid_t is 64 bits, so one thing we could do is just keep a copy of > > that in the lockowner. That way we wouldn't need to dereference > > so_client in nfs4_set_lock_denied. > > > > Maybe there are better ways to fix it though. > > I kinda feel like lock and testlock should both take a > > struct conflock { > void *data > void (*copier)(struct file_lock, struct conflock) > } > > and the caller should provide a "copier" that they can use to extract > the data they want while the flc_lock is held. > > I'm probably overthinking it. But in any case note the problem isn't just the copy of the clientid_t; nfs4_put_stateowner also dereferences the client. --b.
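Bruce's point is visible in the put path itself; simplified from fs/nfsd/nfs4state.c as of this thread:

static void nfs4_put_stateowner(struct nfs4_stateowner *sop)
{
	struct nfs4_client *clp = sop->so_client;	/* dereferences the client */

	if (!atomic_dec_and_lock(&sop->so_count, &clp->cl_lock))
		return;
	sop->so_ops->so_unhash(sop);
	spin_unlock(&clp->cl_lock);
	nfs4_free_stateowner(sop);
}

If the client can be torn down while a conflock still pins the stateowner, even the final put touches freed memory through so_client.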