diff mbox series

nfsd: fix delegation_blocked() to block correctly for at least 30 seconds

Message ID 172585839640.4433.13337900639103448371@noble.neil.brown.name (mailing list archive)
State New
Headers show
Series nfsd: fix delegation_blocked() to block correctly for at least 30 seconds | expand

Commit Message

NeilBrown Sept. 9, 2024, 5:06 a.m. UTC
The pair of bloom filtered used by delegation_blocked() was intended to
block delegations on given filehandles for between 30 and 60 seconds.  A
new filehandle would be recorded in the "new" bit set.  That would then
be switch to the "old" bit set between 0 and 30 seconds later, and it
would remain as the "old" bit set for 30 seconds.

Unfortunately the code intended to clear the old bit set once it reached
30 seconds old, preparing it to be the next new bit set, instead cleared
the *new* bit set before switching it to be the old bit set.  This means
that the "old" bit set is always empty and delegations are blocked
between 0 and 30 seconds.

This patch updates bd->new before clearing the set with that index,
instead of afterwards.

Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.")
Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfsd/nfs4state.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

Benjamin Coddington Sept. 9, 2024, 12:09 p.m. UTC | #1
On 9 Sep 2024, at 1:06, NeilBrown wrote:

> The pair of bloom filtered used by delegation_blocked() was intended to
> block delegations on given filehandles for between 30 and 60 seconds.  A
> new filehandle would be recorded in the "new" bit set.  That would then
> be switch to the "old" bit set between 0 and 30 seconds later, and it
> would remain as the "old" bit set for 30 seconds.
>
> Unfortunately the code intended to clear the old bit set once it reached
> 30 seconds old, preparing it to be the next new bit set, instead cleared
> the *new* bit set before switching it to be the old bit set.  This means
> that the "old" bit set is always empty and delegations are blocked
> between 0 and 30 seconds.
>
> This patch updates bd->new before clearing the set with that index,
> instead of afterwards.
>
> Reported-by: Olga Kornievskaia <okorniev@redhat.com>
> Cc: stable@vger.kernel.org
> Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.")
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/nfsd/nfs4state.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 4313addbe756..6f18c1a7af2e 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid)
>   * When a delegation is recalled, the filehandle is stored in the "new"
>   * filter.
>   * Every 30 seconds we swap the filters and clear the "new" one,
> - * unless both are empty of course.
> + * unless both are empty of course.  This results in delegations for a
> + * given filehandle being blocked for between 30 and 60 seconds.
>   *
>   * Each filter is 256 bits.  We hash the filehandle to 32bit and use the
>   * low 3 bytes as hash-table indices.
> @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh)
>  		if (ktime_get_seconds() - bd->swap_time > 30) {
>  			bd->entries -= bd->old_entries;
>  			bd->old_entries = bd->entries;
> +			bd->new = 1-bd->new;
>  			memset(bd->set[bd->new], 0,
>  			       sizeof(bd->set[0]));
> -			bd->new = 1-bd->new;
>  			bd->swap_time = ktime_get_seconds();
>  		}
>  		spin_unlock(&blocked_delegations_lock);

I stared at this code a long time without seeing the problem, but now it's
obvious.  Thanks very much Neil!

Reviewed-by: Benjamin Coddington <bcodding@redhat.com>

Ben
Jeff Layton Sept. 9, 2024, 12:10 p.m. UTC | #2
On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote:
> The pair of bloom filtered used by delegation_blocked() was intended to
> block delegations on given filehandles for between 30 and 60 seconds.  A
> new filehandle would be recorded in the "new" bit set.  That would then
> be switch to the "old" bit set between 0 and 30 seconds later, and it
> would remain as the "old" bit set for 30 seconds.
> 
> Unfortunately the code intended to clear the old bit set once it reached
> 30 seconds old, preparing it to be the next new bit set, instead cleared
> the *new* bit set before switching it to be the old bit set.  This means
> that the "old" bit set is always empty and delegations are blocked
> between 0 and 30 seconds.
> 
> This patch updates bd->new before clearing the set with that index,
> instead of afterwards.
> 
> Reported-by: Olga Kornievskaia <okorniev@redhat.com>
> Cc: stable@vger.kernel.org
> Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.")
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/nfsd/nfs4state.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 4313addbe756..6f18c1a7af2e 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid)
>   * When a delegation is recalled, the filehandle is stored in the "new"
>   * filter.
>   * Every 30 seconds we swap the filters and clear the "new" one,
> - * unless both are empty of course.
> + * unless both are empty of course.  This results in delegations for a
> + * given filehandle being blocked for between 30 and 60 seconds.
>   *
>   * Each filter is 256 bits.  We hash the filehandle to 32bit and use the
>   * low 3 bytes as hash-table indices.
> @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh)
>  		if (ktime_get_seconds() - bd->swap_time > 30) {
>  			bd->entries -= bd->old_entries;
>  			bd->old_entries = bd->entries;
> +			bd->new = 1-bd->new;
>  			memset(bd->set[bd->new], 0,
>  			       sizeof(bd->set[0]));
> -			bd->new = 1-bd->new;
>  			bd->swap_time = ktime_get_seconds();
>  		}
>  		spin_unlock(&blocked_delegations_lock);

Nice catch!

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton Sept. 9, 2024, 12:24 p.m. UTC | #3
On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote:
> The pair of bloom filtered used by delegation_blocked() was intended to
> block delegations on given filehandles for between 30 and 60 seconds.  A
> new filehandle would be recorded in the "new" bit set.  That would then
> be switch to the "old" bit set between 0 and 30 seconds later, and it
> would remain as the "old" bit set for 30 seconds.
> 

Since we're on the subject...

60s seems like an awfully long time to block delegations on an inode.
Recalls generally don't take more than a few seconds when things are
functioning properly.

Should we swap the bloom filters more often?

> Unfortunately the code intended to clear the old bit set once it reached
> 30 seconds old, preparing it to be the next new bit set, instead cleared
> the *new* bit set before switching it to be the old bit set.  This means
> that the "old" bit set is always empty and delegations are blocked
> between 0 and 30 seconds.
> 
> This patch updates bd->new before clearing the set with that index,
> instead of afterwards.
> 
> Reported-by: Olga Kornievskaia <okorniev@redhat.com>
> Cc: stable@vger.kernel.org
> Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.")
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/nfsd/nfs4state.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 4313addbe756..6f18c1a7af2e 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid)
>   * When a delegation is recalled, the filehandle is stored in the "new"
>   * filter.
>   * Every 30 seconds we swap the filters and clear the "new" one,
> - * unless both are empty of course.
> + * unless both are empty of course.  This results in delegations for a
> + * given filehandle being blocked for between 30 and 60 seconds.
>   *
>   * Each filter is 256 bits.  We hash the filehandle to 32bit and use the
>   * low 3 bytes as hash-table indices.
> @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh)
>  		if (ktime_get_seconds() - bd->swap_time > 30) {
>  			bd->entries -= bd->old_entries;
>  			bd->old_entries = bd->entries;
> +			bd->new = 1-bd->new;
>  			memset(bd->set[bd->new], 0,
>  			       sizeof(bd->set[0]));
> -			bd->new = 1-bd->new;
>  			bd->swap_time = ktime_get_seconds();
>  		}
>  		spin_unlock(&blocked_delegations_lock);
Olga Kornievskaia Sept. 9, 2024, 2:17 p.m. UTC | #4
On Mon, Sep 9, 2024 at 8:24 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote:
> > The pair of bloom filtered used by delegation_blocked() was intended to
> > block delegations on given filehandles for between 30 and 60 seconds.  A
> > new filehandle would be recorded in the "new" bit set.  That would then
> > be switch to the "old" bit set between 0 and 30 seconds later, and it
> > would remain as the "old" bit set for 30 seconds.
> >
>
> Since we're on the subject...
>
> 60s seems like an awfully long time to block delegations on an inode.
> Recalls generally don't take more than a few seconds when things are
> functioning properly.
>
> Should we swap the bloom filters more often?

I was also thinking that perhaps we can do 15-30s perhaps? Another
thought was should this be a configurable value (with some
non-negotiable min and max)?

> > Unfortunately the code intended to clear the old bit set once it reached
> > 30 seconds old, preparing it to be the next new bit set, instead cleared
> > the *new* bit set before switching it to be the old bit set.  This means
> > that the "old" bit set is always empty and delegations are blocked
> > between 0 and 30 seconds.
> >
> > This patch updates bd->new before clearing the set with that index,
> > instead of afterwards.
> >
> > Reported-by: Olga Kornievskaia <okorniev@redhat.com>
> > Cc: stable@vger.kernel.org
> > Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.")
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/nfsd/nfs4state.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index 4313addbe756..6f18c1a7af2e 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
> > @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid)
> >   * When a delegation is recalled, the filehandle is stored in the "new"
> >   * filter.
> >   * Every 30 seconds we swap the filters and clear the "new" one,
> > - * unless both are empty of course.
> > + * unless both are empty of course.  This results in delegations for a
> > + * given filehandle being blocked for between 30 and 60 seconds.
> >   *
> >   * Each filter is 256 bits.  We hash the filehandle to 32bit and use the
> >   * low 3 bytes as hash-table indices.
> > @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh)
> >               if (ktime_get_seconds() - bd->swap_time > 30) {
> >                       bd->entries -= bd->old_entries;
> >                       bd->old_entries = bd->entries;
> > +                     bd->new = 1-bd->new;
> >                       memset(bd->set[bd->new], 0,
> >                              sizeof(bd->set[0]));
> > -                     bd->new = 1-bd->new;
> >                       bd->swap_time = ktime_get_seconds();
> >               }
> >               spin_unlock(&blocked_delegations_lock);
>
> --
> Jeff Layton <jlayton@kernel.org>
>
Jeff Layton Sept. 9, 2024, 2:28 p.m. UTC | #5
On Mon, 2024-09-09 at 10:17 -0400, Olga Kornievskaia wrote:
> On Mon, Sep 9, 2024 at 8:24 AM Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote:
> > > The pair of bloom filtered used by delegation_blocked() was intended to
> > > block delegations on given filehandles for between 30 and 60 seconds.  A
> > > new filehandle would be recorded in the "new" bit set.  That would then
> > > be switch to the "old" bit set between 0 and 30 seconds later, and it
> > > would remain as the "old" bit set for 30 seconds.
> > > 
> > 
> > Since we're on the subject...
> > 
> > 60s seems like an awfully long time to block delegations on an inode.
> > Recalls generally don't take more than a few seconds when things are
> > functioning properly.
> > 
> > Should we swap the bloom filters more often?
> 
> I was also thinking that perhaps we can do 15-30s perhaps? Another
> thought was should this be a configurable value (with some
> non-negotiable min and max)?
> 

Even 15-30s sounds like a lot, but that would probably be better. While
I'm not a huge fan of tunables, it's hard to know what the "right"
interval is here without some experimentation.

Maybe we should start with some metrics. For example, how often does
the bloom filter block a delegation vs. allow it to pass? That kind of
thing would be good to know.

> > > Unfortunately the code intended to clear the old bit set once it reached
> > > 30 seconds old, preparing it to be the next new bit set, instead cleared
> > > the *new* bit set before switching it to be the old bit set.  This means
> > > that the "old" bit set is always empty and delegations are blocked
> > > between 0 and 30 seconds.
> > > 
> > > This patch updates bd->new before clearing the set with that index,
> > > instead of afterwards.
> > > 
> > > Reported-by: Olga Kornievskaia <okorniev@redhat.com>
> > > Cc: stable@vger.kernel.org
> > > Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.")
> > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > ---
> > >  fs/nfsd/nfs4state.c | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > index 4313addbe756..6f18c1a7af2e 100644
> > > --- a/fs/nfsd/nfs4state.c
> > > +++ b/fs/nfsd/nfs4state.c
> > > @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid)
> > >   * When a delegation is recalled, the filehandle is stored in the "new"
> > >   * filter.
> > >   * Every 30 seconds we swap the filters and clear the "new" one,
> > > - * unless both are empty of course.
> > > + * unless both are empty of course.  This results in delegations for a
> > > + * given filehandle being blocked for between 30 and 60 seconds.
> > >   *
> > >   * Each filter is 256 bits.  We hash the filehandle to 32bit and use the
> > >   * low 3 bytes as hash-table indices.
> > > @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh)
> > >               if (ktime_get_seconds() - bd->swap_time > 30) {
> > >                       bd->entries -= bd->old_entries;
> > >                       bd->old_entries = bd->entries;
> > > +                     bd->new = 1-bd->new;
> > >                       memset(bd->set[bd->new], 0,
> > >                              sizeof(bd->set[0]));
> > > -                     bd->new = 1-bd->new;
> > >                       bd->swap_time = ktime_get_seconds();
> > >               }
> > >               spin_unlock(&blocked_delegations_lock);
> > 
> > --
> > Jeff Layton <jlayton@kernel.org>
> > 
>
Chuck Lever III Sept. 9, 2024, 2:37 p.m. UTC | #6
On Mon, Sep 09, 2024 at 10:17:42AM -0400, Olga Kornievskaia wrote:
> On Mon, Sep 9, 2024 at 8:24 AM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote:
> > > The pair of bloom filtered used by delegation_blocked() was intended to
> > > block delegations on given filehandles for between 30 and 60 seconds.  A
> > > new filehandle would be recorded in the "new" bit set.  That would then
> > > be switch to the "old" bit set between 0 and 30 seconds later, and it
> > > would remain as the "old" bit set for 30 seconds.
> > >
> >
> > Since we're on the subject...
> >
> > 60s seems like an awfully long time to block delegations on an inode.
> > Recalls generally don't take more than a few seconds when things are
> > functioning properly.
> >
> > Should we swap the bloom filters more often?
> 
> I was also thinking that perhaps we can do 15-30s perhaps? Another
> thought was should this be a configurable value (with some
> non-negotiable min and max)?

If it needs to be configurable, then we haven't done our jobs as
system architects. Temporarily blocking delegation ought to be
effective without human intervention IMHO.

At least let's get some specific usage scenarios that demonstrate a
palpable need for enabling an admin to adjust this behavior (ie, a
need that is not simply an implementation bug), then design for
those specific cases.

Does NFSD have metrics in this area, for example?

Generally speaking, though, I'm not opposed to finessing the behavior
of the Bloom filter. I'd like to apply the patch below for v6.12,
preserving the Cc: stable, but handle the behavioral finessing via
a subsequent patch targeting v6.13 so that can be appropriately
reviewed and tested. Ja?

BTW, nice catch!


> > > Unfortunately the code intended to clear the old bit set once it reached
> > > 30 seconds old, preparing it to be the next new bit set, instead cleared
> > > the *new* bit set before switching it to be the old bit set.  This means
> > > that the "old" bit set is always empty and delegations are blocked
> > > between 0 and 30 seconds.
> > >
> > > This patch updates bd->new before clearing the set with that index,
> > > instead of afterwards.
> > >
> > > Reported-by: Olga Kornievskaia <okorniev@redhat.com>
> > > Cc: stable@vger.kernel.org
> > > Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.")
> > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > ---
> > >  fs/nfsd/nfs4state.c | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > index 4313addbe756..6f18c1a7af2e 100644
> > > --- a/fs/nfsd/nfs4state.c
> > > +++ b/fs/nfsd/nfs4state.c
> > > @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid)
> > >   * When a delegation is recalled, the filehandle is stored in the "new"
> > >   * filter.
> > >   * Every 30 seconds we swap the filters and clear the "new" one,
> > > - * unless both are empty of course.
> > > + * unless both are empty of course.  This results in delegations for a
> > > + * given filehandle being blocked for between 30 and 60 seconds.
> > >   *
> > >   * Each filter is 256 bits.  We hash the filehandle to 32bit and use the
> > >   * low 3 bytes as hash-table indices.
> > > @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh)
> > >               if (ktime_get_seconds() - bd->swap_time > 30) {
> > >                       bd->entries -= bd->old_entries;
> > >                       bd->old_entries = bd->entries;
> > > +                     bd->new = 1-bd->new;
> > >                       memset(bd->set[bd->new], 0,
> > >                              sizeof(bd->set[0]));
> > > -                     bd->new = 1-bd->new;
> > >                       bd->swap_time = ktime_get_seconds();
> > >               }
> > >               spin_unlock(&blocked_delegations_lock);
> >
> > --
> > Jeff Layton <jlayton@kernel.org>
> >
>
Chuck Lever Sept. 9, 2024, 2:45 p.m. UTC | #7
From: Chuck Lever <chuck.lever@oracle.com>

On Mon, 09 Sep 2024 15:06:36 +1000, NeilBrown wrote:                                              
> The pair of bloom filtered used by delegation_blocked() was intended to
> block delegations on given filehandles for between 30 and 60 seconds.  A
> new filehandle would be recorded in the "new" bit set.  That would then
> be switch to the "old" bit set between 0 and 30 seconds later, and it
> would remain as the "old" bit set for 30 seconds.
> 
> Unfortunately the code intended to clear the old bit set once it reached
> 30 seconds old, preparing it to be the next new bit set, instead cleared
> the *new* bit set before switching it to be the old bit set.  This means
> that the "old" bit set is always empty and delegations are blocked
> between 0 and 30 seconds.
> 
> [...]                                                                        

Applied to nfsd-next for v6.12, thanks!                                                                

[1/1] nfsd: fix delegation_blocked() to block correctly for at least 30 seconds
      commit: bd400f0ac66e0a6e1276d6455d73aed6ce6a1453                                                                      

--                                                                              
Chuck Lever
Olga Kornievskaia Sept. 9, 2024, 3:02 p.m. UTC | #8
On Mon, Sep 9, 2024 at 10:38 AM Chuck Lever <chuck.lever@oracle.com> wrote:
>
> On Mon, Sep 09, 2024 at 10:17:42AM -0400, Olga Kornievskaia wrote:
> > On Mon, Sep 9, 2024 at 8:24 AM Jeff Layton <jlayton@kernel.org> wrote:
> > >
> > > On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote:
> > > > The pair of bloom filtered used by delegation_blocked() was intended to
> > > > block delegations on given filehandles for between 30 and 60 seconds.  A
> > > > new filehandle would be recorded in the "new" bit set.  That would then
> > > > be switch to the "old" bit set between 0 and 30 seconds later, and it
> > > > would remain as the "old" bit set for 30 seconds.
> > > >
> > >
> > > Since we're on the subject...
> > >
> > > 60s seems like an awfully long time to block delegations on an inode.
> > > Recalls generally don't take more than a few seconds when things are
> > > functioning properly.
> > >
> > > Should we swap the bloom filters more often?
> >
> > I was also thinking that perhaps we can do 15-30s perhaps? Another
> > thought was should this be a configurable value (with some
> > non-negotiable min and max)?
>
> If it needs to be configurable, then we haven't done our jobs as
> system architects. Temporarily blocking delegation ought to be
> effective without human intervention IMHO.
>
> At least let's get some specific usage scenarios that demonstrate a
> palpable need for enabling an admin to adjust this behavior (ie, a
> need that is not simply an implementation bug), then design for
> those specific cases.
>
> Does NFSD have metrics in this area, for example?
>
> Generally speaking, though, I'm not opposed to finessing the behavior
> of the Bloom filter. I'd like to apply the patch below for v6.12,

100% agreed that we need this patch to go in now. The configuration
was just a thought for after which I should have stated explicitly. I
guess I'm not a big fan of hard coded numbers in the code and naively
thinking that it's always better to have a config over a hardcoded
value.

> preserving the Cc: stable, but handle the behavioral finessing via
> a subsequent patch targeting v6.13 so that can be appropriately
> reviewed and tested. Ja?
>
> BTW, nice catch!
>
>
> > > > Unfortunately the code intended to clear the old bit set once it reached
> > > > 30 seconds old, preparing it to be the next new bit set, instead cleared
> > > > the *new* bit set before switching it to be the old bit set.  This means
> > > > that the "old" bit set is always empty and delegations are blocked
> > > > between 0 and 30 seconds.
> > > >
> > > > This patch updates bd->new before clearing the set with that index,
> > > > instead of afterwards.
> > > >
> > > > Reported-by: Olga Kornievskaia <okorniev@redhat.com>
> > > > Cc: stable@vger.kernel.org
> > > > Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.")
> > > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > > ---
> > > >  fs/nfsd/nfs4state.c | 5 +++--
> > > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > > index 4313addbe756..6f18c1a7af2e 100644
> > > > --- a/fs/nfsd/nfs4state.c
> > > > +++ b/fs/nfsd/nfs4state.c
> > > > @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid)
> > > >   * When a delegation is recalled, the filehandle is stored in the "new"
> > > >   * filter.
> > > >   * Every 30 seconds we swap the filters and clear the "new" one,
> > > > - * unless both are empty of course.
> > > > + * unless both are empty of course.  This results in delegations for a
> > > > + * given filehandle being blocked for between 30 and 60 seconds.
> > > >   *
> > > >   * Each filter is 256 bits.  We hash the filehandle to 32bit and use the
> > > >   * low 3 bytes as hash-table indices.
> > > > @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh)
> > > >               if (ktime_get_seconds() - bd->swap_time > 30) {
> > > >                       bd->entries -= bd->old_entries;
> > > >                       bd->old_entries = bd->entries;
> > > > +                     bd->new = 1-bd->new;
> > > >                       memset(bd->set[bd->new], 0,
> > > >                              sizeof(bd->set[0]));
> > > > -                     bd->new = 1-bd->new;
> > > >                       bd->swap_time = ktime_get_seconds();
> > > >               }
> > > >               spin_unlock(&blocked_delegations_lock);
> > >
> > > --
> > > Jeff Layton <jlayton@kernel.org>
> > >
> >
>
> --
> Chuck Lever
>
Chuck Lever III Sept. 9, 2024, 3:14 p.m. UTC | #9
> On Sep 9, 2024, at 11:02 AM, Olga Kornievskaia <okorniev@redhat.com> wrote:
> 
> On Mon, Sep 9, 2024 at 10:38 AM Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>> On Mon, Sep 09, 2024 at 10:17:42AM -0400, Olga Kornievskaia wrote:
>>> On Mon, Sep 9, 2024 at 8:24 AM Jeff Layton <jlayton@kernel.org> wrote:
>>>> 
>>>> On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote:
>>>>> The pair of bloom filtered used by delegation_blocked() was intended to
>>>>> block delegations on given filehandles for between 30 and 60 seconds.  A
>>>>> new filehandle would be recorded in the "new" bit set.  That would then
>>>>> be switch to the "old" bit set between 0 and 30 seconds later, and it
>>>>> would remain as the "old" bit set for 30 seconds.
>>>>> 
>>>> 
>>>> Since we're on the subject...
>>>> 
>>>> 60s seems like an awfully long time to block delegations on an inode.
>>>> Recalls generally don't take more than a few seconds when things are
>>>> functioning properly.
>>>> 
>>>> Should we swap the bloom filters more often?
>>> 
>>> I was also thinking that perhaps we can do 15-30s perhaps? Another
>>> thought was should this be a configurable value (with some
>>> non-negotiable min and max)?
>> 
>> If it needs to be configurable, then we haven't done our jobs as
>> system architects. Temporarily blocking delegation ought to be
>> effective without human intervention IMHO.
>> 
>> At least let's get some specific usage scenarios that demonstrate a
>> palpable need for enabling an admin to adjust this behavior (ie, a
>> need that is not simply an implementation bug), then design for
>> those specific cases.
>> 
>> Does NFSD have metrics in this area, for example?
>> 
>> Generally speaking, though, I'm not opposed to finessing the behavior
>> of the Bloom filter. I'd like to apply the patch below for v6.12,
> 
> 100% agreed that we need this patch to go in now. The configuration
> was just a thought for after which I should have stated explicitly. I
> guess I'm not a big fan of hard coded numbers in the code and naively
> thinking that it's always better to have a config over a hardcoded
> value.

I wouldn't say naive. But there is always a cost to exposing such
parameters (documentation, long-term maintainability, test matrix,
user confusion, and so on). The costs are sometimes well obscured.

I'd like the benefit to at least match that cost.

But point taken. We can handle magic numbers with more care.


>> preserving the Cc: stable, but handle the behavioral finessing via
>> a subsequent patch targeting v6.13 so that can be appropriately
>> reviewed and tested. Ja?
>> 
>> BTW, nice catch!
>> 
>> 
>>>>> Unfortunately the code intended to clear the old bit set once it reached
>>>>> 30 seconds old, preparing it to be the next new bit set, instead cleared
>>>>> the *new* bit set before switching it to be the old bit set.  This means
>>>>> that the "old" bit set is always empty and delegations are blocked
>>>>> between 0 and 30 seconds.
>>>>> 
>>>>> This patch updates bd->new before clearing the set with that index,
>>>>> instead of afterwards.
>>>>> 
>>>>> Reported-by: Olga Kornievskaia <okorniev@redhat.com>
>>>>> Cc: stable@vger.kernel.org
>>>>> Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.")
>>>>> Signed-off-by: NeilBrown <neilb@suse.de>
>>>>> ---
>>>>> fs/nfsd/nfs4state.c | 5 +++--
>>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>>> 
>>>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>>>>> index 4313addbe756..6f18c1a7af2e 100644
>>>>> --- a/fs/nfsd/nfs4state.c
>>>>> +++ b/fs/nfsd/nfs4state.c
>>>>> @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid)
>>>>>  * When a delegation is recalled, the filehandle is stored in the "new"
>>>>>  * filter.
>>>>>  * Every 30 seconds we swap the filters and clear the "new" one,
>>>>> - * unless both are empty of course.
>>>>> + * unless both are empty of course.  This results in delegations for a
>>>>> + * given filehandle being blocked for between 30 and 60 seconds.
>>>>>  *
>>>>>  * Each filter is 256 bits.  We hash the filehandle to 32bit and use the
>>>>>  * low 3 bytes as hash-table indices.
>>>>> @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh)
>>>>>              if (ktime_get_seconds() - bd->swap_time > 30) {
>>>>>                      bd->entries -= bd->old_entries;
>>>>>                      bd->old_entries = bd->entries;
>>>>> +                     bd->new = 1-bd->new;
>>>>>                      memset(bd->set[bd->new], 0,
>>>>>                             sizeof(bd->set[0]));
>>>>> -                     bd->new = 1-bd->new;
>>>>>                      bd->swap_time = ktime_get_seconds();
>>>>>              }
>>>>>              spin_unlock(&blocked_delegations_lock);
>>>> 
>>>> --
>>>> Jeff Layton <jlayton@kernel.org>
>>>> 
>>> 
>> 
>> --
>> Chuck Lever


--
Chuck Lever
Tom Talpey Sept. 10, 2024, 12:32 p.m. UTC | #10
On 9/9/2024 11:02 AM, Olga Kornievskaia wrote:
> On Mon, Sep 9, 2024 at 10:38 AM Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>> On Mon, Sep 09, 2024 at 10:17:42AM -0400, Olga Kornievskaia wrote:
>>> On Mon, Sep 9, 2024 at 8:24 AM Jeff Layton <jlayton@kernel.org> wrote:
>>>>
>>>> On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote:
>>>>> The pair of bloom filtered used by delegation_blocked() was intended to
>>>>> block delegations on given filehandles for between 30 and 60 seconds.  A
>>>>> new filehandle would be recorded in the "new" bit set.  That would then
>>>>> be switch to the "old" bit set between 0 and 30 seconds later, and it
>>>>> would remain as the "old" bit set for 30 seconds.
>>>>>
>>>>
>>>> Since we're on the subject...
>>>>
>>>> 60s seems like an awfully long time to block delegations on an inode.
>>>> Recalls generally don't take more than a few seconds when things are
>>>> functioning properly.
>>>>
>>>> Should we swap the bloom filters more often?
>>>
>>> I was also thinking that perhaps we can do 15-30s perhaps? Another
>>> thought was should this be a configurable value (with some
>>> non-negotiable min and max)?
>>
>> If it needs to be configurable, then we haven't done our jobs as
>> system architects. Temporarily blocking delegation ought to be
>> effective without human intervention IMHO.
>>
>> At least let's get some specific usage scenarios that demonstrate a
>> palpable need for enabling an admin to adjust this behavior (ie, a
>> need that is not simply an implementation bug), then design for
>> those specific cases.
>>
>> Does NFSD have metrics in this area, for example?
>>
>> Generally speaking, though, I'm not opposed to finessing the behavior
>> of the Bloom filter. I'd like to apply the patch below for v6.12,
> 
> 100% agreed that we need this patch to go in now. The configuration
> was just a thought for after which I should have stated explicitly. I
> guess I'm not a big fan of hard coded numbers in the code and naively
> thinking that it's always better to have a config over a hardcoded
> value.

No constant is ever correct in networking, especially timeouts. So yes,
it should be adjustable. But even then, choosing a number here is
fundamentally difficult.

Delegations can block for perfectly valid long periods, right? Say it
takes a long time to flush a write delegation, or if the network is
partitioned to the (other) client being recalled. 30 seconds to data
corruption is quite the guillotine.

Tom.

>> preserving the Cc: stable, but handle the behavioral finessing via
>> a subsequent patch targeting v6.13 so that can be appropriately
>> reviewed and tested. Ja?
>>
>> BTW, nice catch!
>>
>>
>>>>> Unfortunately the code intended to clear the old bit set once it reached
>>>>> 30 seconds old, preparing it to be the next new bit set, instead cleared
>>>>> the *new* bit set before switching it to be the old bit set.  This means
>>>>> that the "old" bit set is always empty and delegations are blocked
>>>>> between 0 and 30 seconds.
>>>>>
>>>>> This patch updates bd->new before clearing the set with that index,
>>>>> instead of afterwards.
>>>>>
>>>>> Reported-by: Olga Kornievskaia <okorniev@redhat.com>
>>>>> Cc: stable@vger.kernel.org
>>>>> Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.")
>>>>> Signed-off-by: NeilBrown <neilb@suse.de>
>>>>> ---
>>>>>   fs/nfsd/nfs4state.c | 5 +++--
>>>>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>>>>> index 4313addbe756..6f18c1a7af2e 100644
>>>>> --- a/fs/nfsd/nfs4state.c
>>>>> +++ b/fs/nfsd/nfs4state.c
>>>>> @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid)
>>>>>    * When a delegation is recalled, the filehandle is stored in the "new"
>>>>>    * filter.
>>>>>    * Every 30 seconds we swap the filters and clear the "new" one,
>>>>> - * unless both are empty of course.
>>>>> + * unless both are empty of course.  This results in delegations for a
>>>>> + * given filehandle being blocked for between 30 and 60 seconds.
>>>>>    *
>>>>>    * Each filter is 256 bits.  We hash the filehandle to 32bit and use the
>>>>>    * low 3 bytes as hash-table indices.
>>>>> @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh)
>>>>>                if (ktime_get_seconds() - bd->swap_time > 30) {
>>>>>                        bd->entries -= bd->old_entries;
>>>>>                        bd->old_entries = bd->entries;
>>>>> +                     bd->new = 1-bd->new;
>>>>>                        memset(bd->set[bd->new], 0,
>>>>>                               sizeof(bd->set[0]));
>>>>> -                     bd->new = 1-bd->new;
>>>>>                        bd->swap_time = ktime_get_seconds();
>>>>>                }
>>>>>                spin_unlock(&blocked_delegations_lock);
>>>>
>>>> --
>>>> Jeff Layton <jlayton@kernel.org>
>>>>
>>>
>>
>> --
>> Chuck Lever
>>
> 
>
Jeff Layton Sept. 10, 2024, 12:52 p.m. UTC | #11
On Tue, 2024-09-10 at 08:32 -0400, Tom Talpey wrote:
> On 9/9/2024 11:02 AM, Olga Kornievskaia wrote:
> > On Mon, Sep 9, 2024 at 10:38 AM Chuck Lever <chuck.lever@oracle.com> wrote:
> > > 
> > > On Mon, Sep 09, 2024 at 10:17:42AM -0400, Olga Kornievskaia wrote:
> > > > On Mon, Sep 9, 2024 at 8:24 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > > > 
> > > > > On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote:
> > > > > > The pair of bloom filtered used by delegation_blocked() was intended to
> > > > > > block delegations on given filehandles for between 30 and 60 seconds.  A
> > > > > > new filehandle would be recorded in the "new" bit set.  That would then
> > > > > > be switch to the "old" bit set between 0 and 30 seconds later, and it
> > > > > > would remain as the "old" bit set for 30 seconds.
> > > > > > 
> > > > > 
> > > > > Since we're on the subject...
> > > > > 
> > > > > 60s seems like an awfully long time to block delegations on an inode.
> > > > > Recalls generally don't take more than a few seconds when things are
> > > > > functioning properly.
> > > > > 
> > > > > Should we swap the bloom filters more often?
> > > > 
> > > > I was also thinking that perhaps we can do 15-30s perhaps? Another
> > > > thought was should this be a configurable value (with some
> > > > non-negotiable min and max)?
> > > 
> > > If it needs to be configurable, then we haven't done our jobs as
> > > system architects. Temporarily blocking delegation ought to be
> > > effective without human intervention IMHO.
> > > 
> > > At least let's get some specific usage scenarios that demonstrate a
> > > palpable need for enabling an admin to adjust this behavior (ie, a
> > > need that is not simply an implementation bug), then design for
> > > those specific cases.
> > > 
> > > Does NFSD have metrics in this area, for example?
> > > 
> > > Generally speaking, though, I'm not opposed to finessing the behavior
> > > of the Bloom filter. I'd like to apply the patch below for v6.12,
> > 
> > 100% agreed that we need this patch to go in now. The configuration
> > was just a thought for after which I should have stated explicitly. I
> > guess I'm not a big fan of hard coded numbers in the code and naively
> > thinking that it's always better to have a config over a hardcoded
> > value.
> 
> No constant is ever correct in networking, especially timeouts. So yes,
> it should be adjustable. But even then, choosing a number here is
> fundamentally difficult.
> 
> Delegations can block for perfectly valid long periods, right? Say it
> takes a long time to flush a write delegation, or if the network is
> partitioned to the (other) client being recalled. 30 seconds to data
> corruption is quite the guillotine.
> 

I don't think this is danger of data corruption here. The bloom filter
is there to keep the server from handing out a new delegation too
quickly after having to recall one. Allowing no delegations for 30-60s
seems a bit too cautious, IMO. 

Ideally it seems like we'd want this to be some function of the delay
between the server issuing the CB_RECALL, and the client doing the
DELEGRETURN.

That value is obviously highly variable, but it would be an interesting
statistic to collect, and might help inform what the bloom filter delay
should be.

> > > preserving the Cc: stable, but handle the behavioral finessing via
> > > a subsequent patch targeting v6.13 so that can be appropriately
> > > reviewed and tested. Ja?
> > > 
> > > BTW, nice catch!
> > > 
> > > 
> > > > > > Unfortunately the code intended to clear the old bit set once it reached
> > > > > > 30 seconds old, preparing it to be the next new bit set, instead cleared
> > > > > > the *new* bit set before switching it to be the old bit set.  This means
> > > > > > that the "old" bit set is always empty and delegations are blocked
> > > > > > between 0 and 30 seconds.
> > > > > > 
> > > > > > This patch updates bd->new before clearing the set with that index,
> > > > > > instead of afterwards.
> > > > > > 
> > > > > > Reported-by: Olga Kornievskaia <okorniev@redhat.com>
> > > > > > Cc: stable@vger.kernel.org
> > > > > > Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.")
> > > > > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > > > > ---
> > > > > >   fs/nfsd/nfs4state.c | 5 +++--
> > > > > >   1 file changed, 3 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > > > > index 4313addbe756..6f18c1a7af2e 100644
> > > > > > --- a/fs/nfsd/nfs4state.c
> > > > > > +++ b/fs/nfsd/nfs4state.c
> > > > > > @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid)
> > > > > >    * When a delegation is recalled, the filehandle is stored in the "new"
> > > > > >    * filter.
> > > > > >    * Every 30 seconds we swap the filters and clear the "new" one,
> > > > > > - * unless both are empty of course.
> > > > > > + * unless both are empty of course.  This results in delegations for a
> > > > > > + * given filehandle being blocked for between 30 and 60 seconds.
> > > > > >    *
> > > > > >    * Each filter is 256 bits.  We hash the filehandle to 32bit and use the
> > > > > >    * low 3 bytes as hash-table indices.
> > > > > > @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh)
> > > > > >                if (ktime_get_seconds() - bd->swap_time > 30) {
> > > > > >                        bd->entries -= bd->old_entries;
> > > > > >                        bd->old_entries = bd->entries;
> > > > > > +                     bd->new = 1-bd->new;
> > > > > >                        memset(bd->set[bd->new], 0,
> > > > > >                               sizeof(bd->set[0]));
> > > > > > -                     bd->new = 1-bd->new;
> > > > > >                        bd->swap_time = ktime_get_seconds();
> > > > > >                }
> > > > > >                spin_unlock(&blocked_delegations_lock);
> > > > > 
> > > > > --
> > > > > Jeff Layton <jlayton@kernel.org>
> > > > > 
> > > > 
> > > 
> > > --
> > > Chuck Lever
> > > 
> > 
> > 
>
Tom Talpey Sept. 10, 2024, 1:28 p.m. UTC | #12
On 9/10/2024 8:52 AM, Jeff Layton wrote:
> On Tue, 2024-09-10 at 08:32 -0400, Tom Talpey wrote:
>> On 9/9/2024 11:02 AM, Olga Kornievskaia wrote:
>>> On Mon, Sep 9, 2024 at 10:38 AM Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>
>>>> On Mon, Sep 09, 2024 at 10:17:42AM -0400, Olga Kornievskaia wrote:
>>>>> On Mon, Sep 9, 2024 at 8:24 AM Jeff Layton <jlayton@kernel.org> wrote:
>>>>>>
>>>>>> On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote:
>>>>>>> The pair of bloom filtered used by delegation_blocked() was intended to
>>>>>>> block delegations on given filehandles for between 30 and 60 seconds.  A
>>>>>>> new filehandle would be recorded in the "new" bit set.  That would then
>>>>>>> be switch to the "old" bit set between 0 and 30 seconds later, and it
>>>>>>> would remain as the "old" bit set for 30 seconds.
>>>>>>>
>>>>>>
>>>>>> Since we're on the subject...
>>>>>>
>>>>>> 60s seems like an awfully long time to block delegations on an inode.
>>>>>> Recalls generally don't take more than a few seconds when things are
>>>>>> functioning properly.
>>>>>>
>>>>>> Should we swap the bloom filters more often?
>>>>>
>>>>> I was also thinking that perhaps we can do 15-30s perhaps? Another
>>>>> thought was should this be a configurable value (with some
>>>>> non-negotiable min and max)?
>>>>
>>>> If it needs to be configurable, then we haven't done our jobs as
>>>> system architects. Temporarily blocking delegation ought to be
>>>> effective without human intervention IMHO.
>>>>
>>>> At least let's get some specific usage scenarios that demonstrate a
>>>> palpable need for enabling an admin to adjust this behavior (ie, a
>>>> need that is not simply an implementation bug), then design for
>>>> those specific cases.
>>>>
>>>> Does NFSD have metrics in this area, for example?
>>>>
>>>> Generally speaking, though, I'm not opposed to finessing the behavior
>>>> of the Bloom filter. I'd like to apply the patch below for v6.12,
>>>
>>> 100% agreed that we need this patch to go in now. The configuration
>>> was just a thought for after which I should have stated explicitly. I
>>> guess I'm not a big fan of hard coded numbers in the code and naively
>>> thinking that it's always better to have a config over a hardcoded
>>> value.
>>
>> No constant is ever correct in networking, especially timeouts. So yes,
>> it should be adjustable. But even then, choosing a number here is
>> fundamentally difficult.
>>
>> Delegations can block for perfectly valid long periods, right? Say it
>> takes a long time to flush a write delegation, or if the network is
>> partitioned to the (other) client being recalled. 30 seconds to data
>> corruption is quite the guillotine.
>>
> 
> I don't think this is danger of data corruption here. The bloom filter
> is there to keep the server from handing out a new delegation too
> quickly after having to recall one. Allowing no delegations for 30-60s
> seems a bit too cautious, IMO.

Agreed that the server is doing the right thing, it's about when the
hammer comes down, and again, any constant is inevitably going to be
wrong, sometime or somewhere.

> Ideally it seems like we'd want this to be some function of the delay
> between the server issuing the CB_RECALL, and the client doing the
> DELEGRETURN.
> 
> That value is obviously highly variable, but it would be an interesting
> statistic to collect, and might help inform what the bloom filter delay
> should be.

It would, but it seems it would have to be per-client and per-workload.
There is definitely no single value. Some clients may take a caching
read delegation "just because", and return it promptly. Others may
be at the end of a skinny/flaky link and want to buffer writes. These
are going to have wildly different delegreturn latencies.

Tom.

> 
>>>> preserving the Cc: stable, but handle the behavioral finessing via
>>>> a subsequent patch targeting v6.13 so that can be appropriately
>>>> reviewed and tested. Ja?
>>>>
>>>> BTW, nice catch!
>>>>
>>>>
>>>>>>> Unfortunately the code intended to clear the old bit set once it reached
>>>>>>> 30 seconds old, preparing it to be the next new bit set, instead cleared
>>>>>>> the *new* bit set before switching it to be the old bit set.  This means
>>>>>>> that the "old" bit set is always empty and delegations are blocked
>>>>>>> between 0 and 30 seconds.
>>>>>>>
>>>>>>> This patch updates bd->new before clearing the set with that index,
>>>>>>> instead of afterwards.
>>>>>>>
>>>>>>> Reported-by: Olga Kornievskaia <okorniev@redhat.com>
>>>>>>> Cc: stable@vger.kernel.org
>>>>>>> Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.")
>>>>>>> Signed-off-by: NeilBrown <neilb@suse.de>
>>>>>>> ---
>>>>>>>    fs/nfsd/nfs4state.c | 5 +++--
>>>>>>>    1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>>>>>>> index 4313addbe756..6f18c1a7af2e 100644
>>>>>>> --- a/fs/nfsd/nfs4state.c
>>>>>>> +++ b/fs/nfsd/nfs4state.c
>>>>>>> @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid)
>>>>>>>     * When a delegation is recalled, the filehandle is stored in the "new"
>>>>>>>     * filter.
>>>>>>>     * Every 30 seconds we swap the filters and clear the "new" one,
>>>>>>> - * unless both are empty of course.
>>>>>>> + * unless both are empty of course.  This results in delegations for a
>>>>>>> + * given filehandle being blocked for between 30 and 60 seconds.
>>>>>>>     *
>>>>>>>     * Each filter is 256 bits.  We hash the filehandle to 32bit and use the
>>>>>>>     * low 3 bytes as hash-table indices.
>>>>>>> @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh)
>>>>>>>                 if (ktime_get_seconds() - bd->swap_time > 30) {
>>>>>>>                         bd->entries -= bd->old_entries;
>>>>>>>                         bd->old_entries = bd->entries;
>>>>>>> +                     bd->new = 1-bd->new;
>>>>>>>                         memset(bd->set[bd->new], 0,
>>>>>>>                                sizeof(bd->set[0]));
>>>>>>> -                     bd->new = 1-bd->new;
>>>>>>>                         bd->swap_time = ktime_get_seconds();
>>>>>>>                 }
>>>>>>>                 spin_unlock(&blocked_delegations_lock);
>>>>>>
>>>>>> --
>>>>>> Jeff Layton <jlayton@kernel.org>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Chuck Lever
>>>>
>>>
>>>
>>
>
Chuck Lever III Sept. 10, 2024, 2:18 p.m. UTC | #13
> On Sep 10, 2024, at 8:32 AM, Tom Talpey <tom@talpey.com> wrote:
> 
> On 9/9/2024 11:02 AM, Olga Kornievskaia wrote:
>> On Mon, Sep 9, 2024 at 10:38 AM Chuck Lever <chuck.lever@oracle.com> wrote:
>>> 
>>> On Mon, Sep 09, 2024 at 10:17:42AM -0400, Olga Kornievskaia wrote:
>>>> On Mon, Sep 9, 2024 at 8:24 AM Jeff Layton <jlayton@kernel.org> wrote:
>>>>> 
>>>>> On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote:
>>>>>> The pair of bloom filtered used by delegation_blocked() was intended to
>>>>>> block delegations on given filehandles for between 30 and 60 seconds.  A
>>>>>> new filehandle would be recorded in the "new" bit set.  That would then
>>>>>> be switch to the "old" bit set between 0 and 30 seconds later, and it
>>>>>> would remain as the "old" bit set for 30 seconds.
>>>>>> 
>>>>> 
>>>>> Since we're on the subject...
>>>>> 
>>>>> 60s seems like an awfully long time to block delegations on an inode.
>>>>> Recalls generally don't take more than a few seconds when things are
>>>>> functioning properly.
>>>>> 
>>>>> Should we swap the bloom filters more often?
>>>> 
>>>> I was also thinking that perhaps we can do 15-30s perhaps? Another
>>>> thought was should this be a configurable value (with some
>>>> non-negotiable min and max)?
>>> 
>>> If it needs to be configurable, then we haven't done our jobs as
>>> system architects. Temporarily blocking delegation ought to be
>>> effective without human intervention IMHO.
>>> 
>>> At least let's get some specific usage scenarios that demonstrate a
>>> palpable need for enabling an admin to adjust this behavior (ie, a
>>> need that is not simply an implementation bug), then design for
>>> those specific cases.
>>> 
>>> Does NFSD have metrics in this area, for example?
>>> 
>>> Generally speaking, though, I'm not opposed to finessing the behavior
>>> of the Bloom filter. I'd like to apply the patch below for v6.12,
>> 100% agreed that we need this patch to go in now. The configuration
>> was just a thought for after which I should have stated explicitly. I
>> guess I'm not a big fan of hard coded numbers in the code and naively
>> thinking that it's always better to have a config over a hardcoded
>> value.
> 
> No constant is ever correct in networking, especially timeouts.

But a constant can be "good enough". No one has yet clearly
demonstrated, for this mechanism, that one timeout value
is worse than another.


> So yes,
> it should be adjustable. But even then, choosing a number here is
> fundamentally difficult.

I'm not saying it should never be adjustable. We need some
data showing what range of values is useful, and we need to
know where the control needs to be placed: export option?
sysctl? /sys or /proc? Can this knob be abused? Can the setting
be automated? Maybe the control should be "delay a few seconds
vs. put that client in the doghouse until server reboots"?

Indeed, it might be that any delay greater than zero has some
benefit. In which case, the flexibility of changing the filter
reset time starts to look unnecessary.

Point is: we don't know yet. I would like to see a plan for
how to gather more data, first. As I said yesterday, there
is a cost to adding knobs like this. I want to know what
we're getting for the money.

(To be clear, I'm not making a hard objection. I'm just
asking to wait for more information).


> Delegations can block for perfectly valid long periods, right? Say it
> takes a long time to flush a write delegation, or if the network is
> partitioned to the (other) client being recalled. 30 seconds to data
> corruption is quite the guillotine.

The Bloom filter is not about correctness, so data corruption
is not a factor. The purpose of this mechanism is to give a
fair forward progress guarantee in certain corner cases.


> Tom.
> 
>>> preserving the Cc: stable, but handle the behavioral finessing via
>>> a subsequent patch targeting v6.13 so that can be appropriately
>>> reviewed and tested. Ja?
>>> 
>>> BTW, nice catch!
>>> 
>>> 
>>>>>> Unfortunately the code intended to clear the old bit set once it reached
>>>>>> 30 seconds old, preparing it to be the next new bit set, instead cleared
>>>>>> the *new* bit set before switching it to be the old bit set.  This means
>>>>>> that the "old" bit set is always empty and delegations are blocked
>>>>>> between 0 and 30 seconds.
>>>>>> 
>>>>>> This patch updates bd->new before clearing the set with that index,
>>>>>> instead of afterwards.
>>>>>> 
>>>>>> Reported-by: Olga Kornievskaia <okorniev@redhat.com>
>>>>>> Cc: stable@vger.kernel.org
>>>>>> Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.")
>>>>>> Signed-off-by: NeilBrown <neilb@suse.de>
>>>>>> ---
>>>>>>  fs/nfsd/nfs4state.c | 5 +++--
>>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>> 
>>>>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>>>>>> index 4313addbe756..6f18c1a7af2e 100644
>>>>>> --- a/fs/nfsd/nfs4state.c
>>>>>> +++ b/fs/nfsd/nfs4state.c
>>>>>> @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid)
>>>>>>   * When a delegation is recalled, the filehandle is stored in the "new"
>>>>>>   * filter.
>>>>>>   * Every 30 seconds we swap the filters and clear the "new" one,
>>>>>> - * unless both are empty of course.
>>>>>> + * unless both are empty of course.  This results in delegations for a
>>>>>> + * given filehandle being blocked for between 30 and 60 seconds.
>>>>>>   *
>>>>>>   * Each filter is 256 bits.  We hash the filehandle to 32bit and use the
>>>>>>   * low 3 bytes as hash-table indices.
>>>>>> @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh)
>>>>>>               if (ktime_get_seconds() - bd->swap_time > 30) {
>>>>>>                       bd->entries -= bd->old_entries;
>>>>>>                       bd->old_entries = bd->entries;
>>>>>> +                     bd->new = 1-bd->new;
>>>>>>                       memset(bd->set[bd->new], 0,
>>>>>>                              sizeof(bd->set[0]));
>>>>>> -                     bd->new = 1-bd->new;
>>>>>>                       bd->swap_time = ktime_get_seconds();
>>>>>>               }
>>>>>>               spin_unlock(&blocked_delegations_lock);
>>>>> 
>>>>> --
>>>>> Jeff Layton <jlayton@kernel.org>
>>>>> 
>>>> 
>>> 
>>> --
>>> Chuck Lever


--
Chuck Lever
NeilBrown Sept. 11, 2024, midnight UTC | #14
On Mon, 09 Sep 2024, Jeff Layton wrote:
> On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote:
> > The pair of bloom filtered used by delegation_blocked() was intended to
> > block delegations on given filehandles for between 30 and 60 seconds.  A
> > new filehandle would be recorded in the "new" bit set.  That would then
> > be switch to the "old" bit set between 0 and 30 seconds later, and it
> > would remain as the "old" bit set for 30 seconds.
> > 
> 
> Since we're on the subject...
> 
> 60s seems like an awfully long time to block delegations on an inode.
> Recalls generally don't take more than a few seconds when things are
> functioning properly.
> 
> Should we swap the bloom filters more often?

Or should we take a completely different approach?  Or both?
I'm bothered by the fact that this bug in a heuristic caused rename to
misbehave this way.  So I did some exploration.

try_break_deleg / break_deleg_wait are called:

 - notify_change() - chmod_common, chown_common, and vfs_utimes  waits
 - vfs_unlink() - do_unlinkat waits for the delegation to be broken
 - vfs_link() - do_linkat waits
 - vfs_rename() - do_renameat2 waits
 - vfs_set_acl() - waits itself
 - vfs_remove_acl() - ditto
 - __vfs_setxattr_locked() - vfs_setxattr waits
 - __vfs_removexattr_locked() - vfs_removexattr waits

I would argue that *none* of these justify delaying further delegations
once the operation itself has completed.  They aren't the sort of things
that suggest on-going cross-client "sharing" which delegations interfere
with.

I imagine try_break_lease would increment some counter in ->i_flctx
which prevents further leases being taken, and some new call near where
break_deleg_wait() is called will decrement that counter.  If waiting
for a lease to break, call break_deleg_wait() and retry, else
break_deleg_done().

Delegations are also broken by calls break_lease() which comes from
nfsd_open_break_lease() for conflicting nfsd opens.  I think there *are*
indicitive of shares and *should* result in the filehandle being recorded
in the bloom filter.
Maybe if break_lease() in nfsd_open_break_lease() returns -EWOULDBLOCK,
then the filehandle could be added to the bloom filter at that point.

do_dentry_open also calls break_lease() but we still want the filehandle
to go in the bloom-filter in that case ....  maybe the lm_break callback
could check i_writecount and i_readcount and use those to determine if
delaying delegations is appropriate.

wrt the time to block delegation for, I'm not sure that it matters much.
Once we focus the delay on files that are concurrently opened from
multiple clients (not both read-only) I think there are some files
(most) that are never shared, and some files that are often shared.  We
probably don't want delegations for often shared files at all.  So I'd
be comfortable blocking delegations on often-shared files for quite a
while.

I wouldn't advocate for *longer* than 30 seconds because I don't want
the bloom filters to fill up - that significantly reduced their
usefulness.  So maybe we could also swap filters when the 'new' one
becomes 50% full....

NeilBrown
Olga Kornievskaia Sept. 11, 2024, 2:13 p.m. UTC | #15
On Tue, Sep 10, 2024 at 8:00 PM NeilBrown <neilb@suse.de> wrote:
>
> On Mon, 09 Sep 2024, Jeff Layton wrote:
> > On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote:
> > > The pair of bloom filtered used by delegation_blocked() was intended to
> > > block delegations on given filehandles for between 30 and 60 seconds.  A
> > > new filehandle would be recorded in the "new" bit set.  That would then
> > > be switch to the "old" bit set between 0 and 30 seconds later, and it
> > > would remain as the "old" bit set for 30 seconds.
> > >
> >
> > Since we're on the subject...
> >
> > 60s seems like an awfully long time to block delegations on an inode.
> > Recalls generally don't take more than a few seconds when things are
> > functioning properly.
> >
> > Should we swap the bloom filters more often?
>
> Or should we take a completely different approach?  Or both?
> I'm bothered by the fact that this bug in a heuristic caused rename to
> misbehave this way.  So I did some exploration.
>
> try_break_deleg / break_deleg_wait are called:
>
>  - notify_change() - chmod_common, chown_common, and vfs_utimes  waits
>  - vfs_unlink() - do_unlinkat waits for the delegation to be broken
>  - vfs_link() - do_linkat waits
>  - vfs_rename() - do_renameat2 waits
>  - vfs_set_acl() - waits itself
>  - vfs_remove_acl() - ditto
>  - __vfs_setxattr_locked() - vfs_setxattr waits
>  - __vfs_removexattr_locked() - vfs_removexattr waits
>
> I would argue that *none* of these justify delaying further delegations
> once the operation itself has completed.  They aren't the sort of things
> that suggest on-going cross-client "sharing" which delegations interfere
> with.

I wouldn't discount these operations (at least not rename) from being
an operation that can't represent "sharing" of files. An example
workload is where a file gets generated, created, written/read over
the NFS, but then locally then transferred to another filesystem. I
can imagine a pipeline, where then file gets filled up and the
generated data moved to be worked on elsewhere and the same file gets
filled up again. I think this bug was discovered because of an setup
where there was a heavy use of these operations (on various files) and
some got blocked causing problems. For such workload, if we are not
going to block giving out a delegation do we cause too many
cb_recalls?

> I imagine try_break_lease would increment some counter in ->i_flctx
> which prevents further leases being taken, and some new call near where
> break_deleg_wait() is called will decrement that counter.  If waiting
> for a lease to break, call break_deleg_wait() and retry, else
> break_deleg_done().
>
> Delegations are also broken by calls break_lease() which comes from
> nfsd_open_break_lease() for conflicting nfsd opens.  I think there *are*
> indicitive of shares and *should* result in the filehandle being recorded
> in the bloom filter.
> Maybe if break_lease() in nfsd_open_break_lease() returns -EWOULDBLOCK,
> then the filehandle could be added to the bloom filter at that point.

What about vfs_truncate() which also calls break_lease?

> do_dentry_open also calls break_lease() but we still want the filehandle
> to go in the bloom-filter in that case ....  maybe the lm_break callback
> could check i_writecount and i_readcount and use those to determine if
> delaying delegations is appropriate.
>
> wrt the time to block delegation for, I'm not sure that it matters much.
> Once we focus the delay on files that are concurrently opened from
> multiple clients (not both read-only) I think there are some files
> (most) that are never shared, and some files that are often shared.  We
> probably don't want delegations for often shared files at all.  So I'd
> be comfortable blocking delegations on often-shared files for quite a
> while.

So perhaps rename might be an exception among the operations that are
triggering delegation recalls. Though I think unlink might be similar
and truncate. Otherwise, perhaps optimizing other operation would be
useful. However, I would like to ask does the added complexity justify
the benefits? What kind of workload would be greatly penalized? If we
imagine the other operations are low occurrence (ie not representing a
sharing example) then the penalty is just an infrequent 60s block.

> I wouldn't advocate for *longer* than 30 seconds because I don't want
> the bloom filters to fill up - that significantly reduced their
> usefulness.  So maybe we could also swap filters when the 'new' one
> becomes 50% full....

I was suggesting changing that value to 15 because it would mean that
at most the wait would be 30s which is same as what it was before the
fix but we are imposing at least 15s block and it keeps the filter 'at
the same capacity as before'.  But of course in the case of heavy
conflicting operations there will be less blockage and more recalls.
This is where I was hoping a configurable value might be handy. Or, I
would instead argue that we implement a heuristic of detecting that
the server is in a state of frequent delegation recall phase and then
automatically adjust the value for how long the delegations are
blocked. Perhaps  we keep a counter in the bloom_filter when a block
is set and when a threshold is reached we increase the value before we
swap the filters (yes at the risk of having a full bloom filter but
because it's better to be cautious in giving out delegations when
there are frequent recalls?). Then, with perhaps another timer we
adjust the block time down.... Wouldn't that cover either having
different NFS clients having conflicting opens or having some workload
that has NFS/local file system conflicts.


>
> NeilBrown
>
NeilBrown Sept. 16, 2024, 1:17 a.m. UTC | #16
On Thu, 12 Sep 2024, Olga Kornievskaia wrote:
> On Tue, Sep 10, 2024 at 8:00 PM NeilBrown <neilb@suse.de> wrote:
> >
> > On Mon, 09 Sep 2024, Jeff Layton wrote:
> > > On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote:
> > > > The pair of bloom filtered used by delegation_blocked() was intended to
> > > > block delegations on given filehandles for between 30 and 60 seconds.  A
> > > > new filehandle would be recorded in the "new" bit set.  That would then
> > > > be switch to the "old" bit set between 0 and 30 seconds later, and it
> > > > would remain as the "old" bit set for 30 seconds.
> > > >
> > >
> > > Since we're on the subject...
> > >
> > > 60s seems like an awfully long time to block delegations on an inode.
> > > Recalls generally don't take more than a few seconds when things are
> > > functioning properly.
> > >
> > > Should we swap the bloom filters more often?
> >
> > Or should we take a completely different approach?  Or both?
> > I'm bothered by the fact that this bug in a heuristic caused rename to
> > misbehave this way.  So I did some exploration.
> >
> > try_break_deleg / break_deleg_wait are called:
> >
> >  - notify_change() - chmod_common, chown_common, and vfs_utimes  waits
> >  - vfs_unlink() - do_unlinkat waits for the delegation to be broken
> >  - vfs_link() - do_linkat waits
> >  - vfs_rename() - do_renameat2 waits
> >  - vfs_set_acl() - waits itself
> >  - vfs_remove_acl() - ditto
> >  - __vfs_setxattr_locked() - vfs_setxattr waits
> >  - __vfs_removexattr_locked() - vfs_removexattr waits
> >
> > I would argue that *none* of these justify delaying further delegations
> > once the operation itself has completed.  They aren't the sort of things
> > that suggest on-going cross-client "sharing" which delegations interfere
> > with.
> 
> I wouldn't discount these operations (at least not rename) from being
> an operation that can't represent "sharing" of files. An example
> workload is where a file gets generated, created, written/read over
> the NFS, but then locally then transferred to another filesystem. I
> can imagine a pipeline, where then file gets filled up and the
> generated data moved to be worked on elsewhere and the same file gets
> filled up again. I think this bug was discovered because of an setup
> where there was a heavy use of these operations (on various files) and
> some got blocked causing problems. For such workload, if we are not
> going to block giving out a delegation do we cause too many
> cb_recalls?

A pipeline as you describe seem to be a case of serial sharing.
Different applications use the same file, but only at different times.
This sort of sharing isn't hurt by delegations.

The sort of sharing the might trigger excessive cb_recalls if
delegations weren't blocked would almost certainly involve file locking
and an expectation that two separate applications would sometimes access
the file concurrently.  When this is happening, neither should get a
delegation.

The problem you saw was really caused by a delegation being given out
while the rename was still happening.
i.e.:
  - the rename starts
  - the delegation is detected and broken
  - the cb_recall is sent.
  - the client opens the file prior to returning the delegation
  - the client gets a new delegation as part of this open
  - the client returns the original delegation
  - the rename loops around and finds a new delegation which it needs
    to break.

The should only loop once unless the recall takes more than 30 seconds. 
So I'm a bit perplexed that it blocked lock enough to be noticed.  So
maybe there is more going on here than I can see.  Or maybe the recall
is really slow.

In either case I think we want to firmly block delegations while a rename
(unlink etc) is happening, but do not need to continue to block them
after the rename etc completes.

> 
> > I imagine try_break_lease would increment some counter in ->i_flctx
> > which prevents further leases being taken, and some new call near where
> > break_deleg_wait() is called will decrement that counter.  If waiting
> > for a lease to break, call break_deleg_wait() and retry, else
> > break_deleg_done().
> >
> > Delegations are also broken by calls break_lease() which comes from
> > nfsd_open_break_lease() for conflicting nfsd opens.  I think there *are*
> > indicitive of shares and *should* result in the filehandle being recorded
> > in the bloom filter.
> > Maybe if break_lease() in nfsd_open_break_lease() returns -EWOULDBLOCK,
> > then the filehandle could be added to the bloom filter at that point.
> 
> What about vfs_truncate() which also calls break_lease?

vfs_truncate() is quite different, though that isn't a good excuse for
me to leave it out.  It uses break_lease() like open does - not
try_break_deleg() like rename.
It can do this because it has called get_write_access() in the inode
which has incremented i_writecount which will prevent leases from being
granted.  But that DOESN'T prevent delegations from being granted.  I
think it should.

> 
> > do_dentry_open also calls break_lease() but we still want the filehandle
> > to go in the bloom-filter in that case ....  maybe the lm_break callback
> > could check i_writecount and i_readcount and use those to determine if
> > delaying delegations is appropriate.
> >
> > wrt the time to block delegation for, I'm not sure that it matters much.
> > Once we focus the delay on files that are concurrently opened from
> > multiple clients (not both read-only) I think there are some files
> > (most) that are never shared, and some files that are often shared.  We
> > probably don't want delegations for often shared files at all.  So I'd
> > be comfortable blocking delegations on often-shared files for quite a
> > while.
> 
> So perhaps rename might be an exception among the operations that are
> triggering delegation recalls. Though I think unlink might be similar
> and truncate. Otherwise, perhaps optimizing other operation would be
> useful. However, I would like to ask does the added complexity justify
> the benefits? What kind of workload would be greatly penalized? If we
> imagine the other operations are low occurrence (ie not representing a
> sharing example) then the penalty is just an infrequent 60s block.
> 

My concern is that the current design is conceptually wrong.
We need to block delegation while rename etc is happening.  We currently
block for 30-60 seconds.  As rename never takes that long, this works.
But I don't like it.

Some day someone might find a workload for which the current delay is
too long.  If we just reduced the size of the delay, we might make it
take less time than a rename, and something unexpected would break.

So I want to fix the design so that it make sense.  Then I won't really
care how long the delay is for.


> > I wouldn't advocate for *longer* than 30 seconds because I don't want
> > the bloom filters to fill up - that significantly reduced their
> > usefulness.  So maybe we could also swap filters when the 'new' one
> > becomes 50% full....
> 
> I was suggesting changing that value to 15 because it would mean that
> at most the wait would be 30s which is same as what it was before the
> fix but we are imposing at least 15s block and it keeps the filter 'at
> the same capacity as before'.  But of course in the case of heavy
> conflicting operations there will be less blockage and more recalls.
> This is where I was hoping a configurable value might be handy. Or, I
> would instead argue that we implement a heuristic of detecting that
> the server is in a state of frequent delegation recall phase and then
> automatically adjust the value for how long the delegations are
> blocked. Perhaps  we keep a counter in the bloom_filter when a block
> is set and when a threshold is reached we increase the value before we
> swap the filters (yes at the risk of having a full bloom filter but
> because it's better to be cautious in giving out delegations when
> there are frequent recalls?). Then, with perhaps another timer we
> adjust the block time down.... Wouldn't that cover either having
> different NFS clients having conflicting opens or having some workload
> that has NFS/local file system conflicts.
> 

I agree that as we have been use "0-30 seconds" for 10 year with no
complaints, changing to "15-30 seconds" is safer than change to "30-60
seconds".

I think that if we want to fine-tune further, there are other steps we
could take which might more clearly be helpful.
Currently we only block delegation when a recall happens.  This means
the 30-60 second count is started when a client opens a file and sharing
is detected.  I think we should also start that timer when a client
CLOSES a file which some other client has open.  So if it is ever the
case that to clients have the file open at the same time, we block
delegations until 30-60 seconds after the file has been closed by all
clients. 

i.e I don't really want to detect "frequent delegation recall".  I want
to detect "ongoing sharing".  In that case I don't want any delegations
or any recall.

Thanks,
NeilBrown
NeilBrown Sept. 16, 2024, 2:04 a.m. UTC | #17
On Mon, 16 Sep 2024, NeilBrown wrote:
> On Thu, 12 Sep 2024, Olga Kornievskaia wrote:
> > 
> > What about vfs_truncate() which also calls break_lease?
> 
> vfs_truncate() is quite different, though that isn't a good excuse for
> me to leave it out.  It uses break_lease() like open does - not
> try_break_deleg() like rename.
> It can do this because it has called get_write_access() in the inode
> which has incremented i_writecount which will prevent leases from being
> granted.  But that DOESN'T prevent delegations from being granted.  I
> think it should.

Actually it does.  check_conflicting_opens() does cause the delegation
request to fail because it is a no-op for delegations.
But between getting the delegation and hashing the delegation,
nfsd4_check_conflicting_opens() is called and it fails if i_writecount
has been elevated.

So the break_lease() from vfs_truncate() is safe - delegations are
prevented until the truncation completes.

NeilBrown
Benjamin Coddington Sept. 27, 2024, 12:44 p.m. UTC | #18
On 15 Sep 2024, at 21:17, NeilBrown wrote:

> On Thu, 12 Sep 2024, Olga Kornievskaia wrote:
>
>> I wouldn't discount these operations (at least not rename) from being
>> an operation that can't represent "sharing" of files. An example
>> workload is where a file gets generated, created, written/read over
>> the NFS, but then locally then transferred to another filesystem. I
>> can imagine a pipeline, where then file gets filled up and the
>> generated data moved to be worked on elsewhere and the same file gets
>> filled up again. I think this bug was discovered because of an setup
>> where there was a heavy use of these operations (on various files) and
>> some got blocked causing problems. For such workload, if we are not
>> going to block giving out a delegation do we cause too many
>> cb_recalls?
>
> A pipeline as you describe seem to be a case of serial sharing.
> Different applications use the same file, but only at different times.
> This sort of sharing isn't hurt by delegations.
>
> The sort of sharing the might trigger excessive cb_recalls if
> delegations weren't blocked would almost certainly involve file locking
> and an expectation that two separate applications would sometimes access
> the file concurrently.  When this is happening, neither should get a
> delegation.
>
> The problem you saw was really caused by a delegation being given out
> while the rename was still happening.
> i.e.:
>   - the rename starts
>   - the delegation is detected and broken
>   - the cb_recall is sent.
>   - the client opens the file prior to returning the delegation
>   - the client gets a new delegation as part of this open
>   - the client returns the original delegation
>   - the rename loops around and finds a new delegation which it needs
>     to break.
>
> The should only loop once unless the recall takes more than 30 seconds.
> So I'm a bit perplexed that it blocked lock enough to be noticed.  So
> maybe there is more going on here than I can see.  Or maybe the recall
> is really slow.

When the server's local rename process calls __break_lease(), it only calls
fl_lmpops->lm_break() once and sets FL_UNLOCK_PENDING.  After that it will
sleep and wake to check, but never again call ->lm_break() (which will cause
knfsd to recall the delegation).

The check for leases_conflict() is not stateful.

Ben
diff mbox series

Patch

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 4313addbe756..6f18c1a7af2e 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -1078,7 +1078,8 @@  static void nfs4_free_deleg(struct nfs4_stid *stid)
  * When a delegation is recalled, the filehandle is stored in the "new"
  * filter.
  * Every 30 seconds we swap the filters and clear the "new" one,
- * unless both are empty of course.
+ * unless both are empty of course.  This results in delegations for a
+ * given filehandle being blocked for between 30 and 60 seconds.
  *
  * Each filter is 256 bits.  We hash the filehandle to 32bit and use the
  * low 3 bytes as hash-table indices.
@@ -1107,9 +1108,9 @@  static int delegation_blocked(struct knfsd_fh *fh)
 		if (ktime_get_seconds() - bd->swap_time > 30) {
 			bd->entries -= bd->old_entries;
 			bd->old_entries = bd->entries;
+			bd->new = 1-bd->new;
 			memset(bd->set[bd->new], 0,
 			       sizeof(bd->set[0]));
-			bd->new = 1-bd->new;
 			bd->swap_time = ktime_get_seconds();
 		}
 		spin_unlock(&blocked_delegations_lock);