Message ID | 172585839640.4433.13337900639103448371@noble.neil.brown.name (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | nfsd: fix delegation_blocked() to block correctly for at least 30 seconds | expand |
On 9 Sep 2024, at 1:06, NeilBrown wrote: > The pair of bloom filtered used by delegation_blocked() was intended to > block delegations on given filehandles for between 30 and 60 seconds. A > new filehandle would be recorded in the "new" bit set. That would then > be switch to the "old" bit set between 0 and 30 seconds later, and it > would remain as the "old" bit set for 30 seconds. > > Unfortunately the code intended to clear the old bit set once it reached > 30 seconds old, preparing it to be the next new bit set, instead cleared > the *new* bit set before switching it to be the old bit set. This means > that the "old" bit set is always empty and delegations are blocked > between 0 and 30 seconds. > > This patch updates bd->new before clearing the set with that index, > instead of afterwards. > > Reported-by: Olga Kornievskaia <okorniev@redhat.com> > Cc: stable@vger.kernel.org > Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.") > Signed-off-by: NeilBrown <neilb@suse.de> > --- > fs/nfsd/nfs4state.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > index 4313addbe756..6f18c1a7af2e 100644 > --- a/fs/nfsd/nfs4state.c > +++ b/fs/nfsd/nfs4state.c > @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid) > * When a delegation is recalled, the filehandle is stored in the "new" > * filter. > * Every 30 seconds we swap the filters and clear the "new" one, > - * unless both are empty of course. > + * unless both are empty of course. This results in delegations for a > + * given filehandle being blocked for between 30 and 60 seconds. > * > * Each filter is 256 bits. We hash the filehandle to 32bit and use the > * low 3 bytes as hash-table indices. > @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh) > if (ktime_get_seconds() - bd->swap_time > 30) { > bd->entries -= bd->old_entries; > bd->old_entries = bd->entries; > + bd->new = 1-bd->new; > memset(bd->set[bd->new], 0, > sizeof(bd->set[0])); > - bd->new = 1-bd->new; > bd->swap_time = ktime_get_seconds(); > } > spin_unlock(&blocked_delegations_lock); I stared at this code a long time without seeing the problem, but now it's obvious. Thanks very much Neil! Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Ben
On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote: > The pair of bloom filtered used by delegation_blocked() was intended to > block delegations on given filehandles for between 30 and 60 seconds. A > new filehandle would be recorded in the "new" bit set. That would then > be switch to the "old" bit set between 0 and 30 seconds later, and it > would remain as the "old" bit set for 30 seconds. > > Unfortunately the code intended to clear the old bit set once it reached > 30 seconds old, preparing it to be the next new bit set, instead cleared > the *new* bit set before switching it to be the old bit set. This means > that the "old" bit set is always empty and delegations are blocked > between 0 and 30 seconds. > > This patch updates bd->new before clearing the set with that index, > instead of afterwards. > > Reported-by: Olga Kornievskaia <okorniev@redhat.com> > Cc: stable@vger.kernel.org > Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.") > Signed-off-by: NeilBrown <neilb@suse.de> > --- > fs/nfsd/nfs4state.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > index 4313addbe756..6f18c1a7af2e 100644 > --- a/fs/nfsd/nfs4state.c > +++ b/fs/nfsd/nfs4state.c > @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid) > * When a delegation is recalled, the filehandle is stored in the "new" > * filter. > * Every 30 seconds we swap the filters and clear the "new" one, > - * unless both are empty of course. > + * unless both are empty of course. This results in delegations for a > + * given filehandle being blocked for between 30 and 60 seconds. > * > * Each filter is 256 bits. We hash the filehandle to 32bit and use the > * low 3 bytes as hash-table indices. > @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh) > if (ktime_get_seconds() - bd->swap_time > 30) { > bd->entries -= bd->old_entries; > bd->old_entries = bd->entries; > + bd->new = 1-bd->new; > memset(bd->set[bd->new], 0, > sizeof(bd->set[0])); > - bd->new = 1-bd->new; > bd->swap_time = ktime_get_seconds(); > } > spin_unlock(&blocked_delegations_lock); Nice catch! Reviewed-by: Jeff Layton <jlayton@kernel.org>
On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote: > The pair of bloom filtered used by delegation_blocked() was intended to > block delegations on given filehandles for between 30 and 60 seconds. A > new filehandle would be recorded in the "new" bit set. That would then > be switch to the "old" bit set between 0 and 30 seconds later, and it > would remain as the "old" bit set for 30 seconds. > Since we're on the subject... 60s seems like an awfully long time to block delegations on an inode. Recalls generally don't take more than a few seconds when things are functioning properly. Should we swap the bloom filters more often? > Unfortunately the code intended to clear the old bit set once it reached > 30 seconds old, preparing it to be the next new bit set, instead cleared > the *new* bit set before switching it to be the old bit set. This means > that the "old" bit set is always empty and delegations are blocked > between 0 and 30 seconds. > > This patch updates bd->new before clearing the set with that index, > instead of afterwards. > > Reported-by: Olga Kornievskaia <okorniev@redhat.com> > Cc: stable@vger.kernel.org > Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.") > Signed-off-by: NeilBrown <neilb@suse.de> > --- > fs/nfsd/nfs4state.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > index 4313addbe756..6f18c1a7af2e 100644 > --- a/fs/nfsd/nfs4state.c > +++ b/fs/nfsd/nfs4state.c > @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid) > * When a delegation is recalled, the filehandle is stored in the "new" > * filter. > * Every 30 seconds we swap the filters and clear the "new" one, > - * unless both are empty of course. > + * unless both are empty of course. This results in delegations for a > + * given filehandle being blocked for between 30 and 60 seconds. > * > * Each filter is 256 bits. We hash the filehandle to 32bit and use the > * low 3 bytes as hash-table indices. > @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh) > if (ktime_get_seconds() - bd->swap_time > 30) { > bd->entries -= bd->old_entries; > bd->old_entries = bd->entries; > + bd->new = 1-bd->new; > memset(bd->set[bd->new], 0, > sizeof(bd->set[0])); > - bd->new = 1-bd->new; > bd->swap_time = ktime_get_seconds(); > } > spin_unlock(&blocked_delegations_lock);
On Mon, Sep 9, 2024 at 8:24 AM Jeff Layton <jlayton@kernel.org> wrote: > > On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote: > > The pair of bloom filtered used by delegation_blocked() was intended to > > block delegations on given filehandles for between 30 and 60 seconds. A > > new filehandle would be recorded in the "new" bit set. That would then > > be switch to the "old" bit set between 0 and 30 seconds later, and it > > would remain as the "old" bit set for 30 seconds. > > > > Since we're on the subject... > > 60s seems like an awfully long time to block delegations on an inode. > Recalls generally don't take more than a few seconds when things are > functioning properly. > > Should we swap the bloom filters more often? I was also thinking that perhaps we can do 15-30s perhaps? Another thought was should this be a configurable value (with some non-negotiable min and max)? > > Unfortunately the code intended to clear the old bit set once it reached > > 30 seconds old, preparing it to be the next new bit set, instead cleared > > the *new* bit set before switching it to be the old bit set. This means > > that the "old" bit set is always empty and delegations are blocked > > between 0 and 30 seconds. > > > > This patch updates bd->new before clearing the set with that index, > > instead of afterwards. > > > > Reported-by: Olga Kornievskaia <okorniev@redhat.com> > > Cc: stable@vger.kernel.org > > Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.") > > Signed-off-by: NeilBrown <neilb@suse.de> > > --- > > fs/nfsd/nfs4state.c | 5 +++-- > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > > index 4313addbe756..6f18c1a7af2e 100644 > > --- a/fs/nfsd/nfs4state.c > > +++ b/fs/nfsd/nfs4state.c > > @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid) > > * When a delegation is recalled, the filehandle is stored in the "new" > > * filter. > > * Every 30 seconds we swap the filters and clear the "new" one, > > - * unless both are empty of course. > > + * unless both are empty of course. This results in delegations for a > > + * given filehandle being blocked for between 30 and 60 seconds. > > * > > * Each filter is 256 bits. We hash the filehandle to 32bit and use the > > * low 3 bytes as hash-table indices. > > @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh) > > if (ktime_get_seconds() - bd->swap_time > 30) { > > bd->entries -= bd->old_entries; > > bd->old_entries = bd->entries; > > + bd->new = 1-bd->new; > > memset(bd->set[bd->new], 0, > > sizeof(bd->set[0])); > > - bd->new = 1-bd->new; > > bd->swap_time = ktime_get_seconds(); > > } > > spin_unlock(&blocked_delegations_lock); > > -- > Jeff Layton <jlayton@kernel.org> >
On Mon, 2024-09-09 at 10:17 -0400, Olga Kornievskaia wrote: > On Mon, Sep 9, 2024 at 8:24 AM Jeff Layton <jlayton@kernel.org> wrote: > > > > On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote: > > > The pair of bloom filtered used by delegation_blocked() was intended to > > > block delegations on given filehandles for between 30 and 60 seconds. A > > > new filehandle would be recorded in the "new" bit set. That would then > > > be switch to the "old" bit set between 0 and 30 seconds later, and it > > > would remain as the "old" bit set for 30 seconds. > > > > > > > Since we're on the subject... > > > > 60s seems like an awfully long time to block delegations on an inode. > > Recalls generally don't take more than a few seconds when things are > > functioning properly. > > > > Should we swap the bloom filters more often? > > I was also thinking that perhaps we can do 15-30s perhaps? Another > thought was should this be a configurable value (with some > non-negotiable min and max)? > Even 15-30s sounds like a lot, but that would probably be better. While I'm not a huge fan of tunables, it's hard to know what the "right" interval is here without some experimentation. Maybe we should start with some metrics. For example, how often does the bloom filter block a delegation vs. allow it to pass? That kind of thing would be good to know. > > > Unfortunately the code intended to clear the old bit set once it reached > > > 30 seconds old, preparing it to be the next new bit set, instead cleared > > > the *new* bit set before switching it to be the old bit set. This means > > > that the "old" bit set is always empty and delegations are blocked > > > between 0 and 30 seconds. > > > > > > This patch updates bd->new before clearing the set with that index, > > > instead of afterwards. > > > > > > Reported-by: Olga Kornievskaia <okorniev@redhat.com> > > > Cc: stable@vger.kernel.org > > > Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.") > > > Signed-off-by: NeilBrown <neilb@suse.de> > > > --- > > > fs/nfsd/nfs4state.c | 5 +++-- > > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > > > index 4313addbe756..6f18c1a7af2e 100644 > > > --- a/fs/nfsd/nfs4state.c > > > +++ b/fs/nfsd/nfs4state.c > > > @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid) > > > * When a delegation is recalled, the filehandle is stored in the "new" > > > * filter. > > > * Every 30 seconds we swap the filters and clear the "new" one, > > > - * unless both are empty of course. > > > + * unless both are empty of course. This results in delegations for a > > > + * given filehandle being blocked for between 30 and 60 seconds. > > > * > > > * Each filter is 256 bits. We hash the filehandle to 32bit and use the > > > * low 3 bytes as hash-table indices. > > > @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh) > > > if (ktime_get_seconds() - bd->swap_time > 30) { > > > bd->entries -= bd->old_entries; > > > bd->old_entries = bd->entries; > > > + bd->new = 1-bd->new; > > > memset(bd->set[bd->new], 0, > > > sizeof(bd->set[0])); > > > - bd->new = 1-bd->new; > > > bd->swap_time = ktime_get_seconds(); > > > } > > > spin_unlock(&blocked_delegations_lock); > > > > -- > > Jeff Layton <jlayton@kernel.org> > > >
On Mon, Sep 09, 2024 at 10:17:42AM -0400, Olga Kornievskaia wrote: > On Mon, Sep 9, 2024 at 8:24 AM Jeff Layton <jlayton@kernel.org> wrote: > > > > On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote: > > > The pair of bloom filtered used by delegation_blocked() was intended to > > > block delegations on given filehandles for between 30 and 60 seconds. A > > > new filehandle would be recorded in the "new" bit set. That would then > > > be switch to the "old" bit set between 0 and 30 seconds later, and it > > > would remain as the "old" bit set for 30 seconds. > > > > > > > Since we're on the subject... > > > > 60s seems like an awfully long time to block delegations on an inode. > > Recalls generally don't take more than a few seconds when things are > > functioning properly. > > > > Should we swap the bloom filters more often? > > I was also thinking that perhaps we can do 15-30s perhaps? Another > thought was should this be a configurable value (with some > non-negotiable min and max)? If it needs to be configurable, then we haven't done our jobs as system architects. Temporarily blocking delegation ought to be effective without human intervention IMHO. At least let's get some specific usage scenarios that demonstrate a palpable need for enabling an admin to adjust this behavior (ie, a need that is not simply an implementation bug), then design for those specific cases. Does NFSD have metrics in this area, for example? Generally speaking, though, I'm not opposed to finessing the behavior of the Bloom filter. I'd like to apply the patch below for v6.12, preserving the Cc: stable, but handle the behavioral finessing via a subsequent patch targeting v6.13 so that can be appropriately reviewed and tested. Ja? BTW, nice catch! > > > Unfortunately the code intended to clear the old bit set once it reached > > > 30 seconds old, preparing it to be the next new bit set, instead cleared > > > the *new* bit set before switching it to be the old bit set. This means > > > that the "old" bit set is always empty and delegations are blocked > > > between 0 and 30 seconds. > > > > > > This patch updates bd->new before clearing the set with that index, > > > instead of afterwards. > > > > > > Reported-by: Olga Kornievskaia <okorniev@redhat.com> > > > Cc: stable@vger.kernel.org > > > Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.") > > > Signed-off-by: NeilBrown <neilb@suse.de> > > > --- > > > fs/nfsd/nfs4state.c | 5 +++-- > > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > > > index 4313addbe756..6f18c1a7af2e 100644 > > > --- a/fs/nfsd/nfs4state.c > > > +++ b/fs/nfsd/nfs4state.c > > > @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid) > > > * When a delegation is recalled, the filehandle is stored in the "new" > > > * filter. > > > * Every 30 seconds we swap the filters and clear the "new" one, > > > - * unless both are empty of course. > > > + * unless both are empty of course. This results in delegations for a > > > + * given filehandle being blocked for between 30 and 60 seconds. > > > * > > > * Each filter is 256 bits. We hash the filehandle to 32bit and use the > > > * low 3 bytes as hash-table indices. > > > @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh) > > > if (ktime_get_seconds() - bd->swap_time > 30) { > > > bd->entries -= bd->old_entries; > > > bd->old_entries = bd->entries; > > > + bd->new = 1-bd->new; > > > memset(bd->set[bd->new], 0, > > > sizeof(bd->set[0])); > > > - bd->new = 1-bd->new; > > > bd->swap_time = ktime_get_seconds(); > > > } > > > spin_unlock(&blocked_delegations_lock); > > > > -- > > Jeff Layton <jlayton@kernel.org> > > >
From: Chuck Lever <chuck.lever@oracle.com> On Mon, 09 Sep 2024 15:06:36 +1000, NeilBrown wrote: > The pair of bloom filtered used by delegation_blocked() was intended to > block delegations on given filehandles for between 30 and 60 seconds. A > new filehandle would be recorded in the "new" bit set. That would then > be switch to the "old" bit set between 0 and 30 seconds later, and it > would remain as the "old" bit set for 30 seconds. > > Unfortunately the code intended to clear the old bit set once it reached > 30 seconds old, preparing it to be the next new bit set, instead cleared > the *new* bit set before switching it to be the old bit set. This means > that the "old" bit set is always empty and delegations are blocked > between 0 and 30 seconds. > > [...] Applied to nfsd-next for v6.12, thanks! [1/1] nfsd: fix delegation_blocked() to block correctly for at least 30 seconds commit: bd400f0ac66e0a6e1276d6455d73aed6ce6a1453 -- Chuck Lever
On Mon, Sep 9, 2024 at 10:38 AM Chuck Lever <chuck.lever@oracle.com> wrote: > > On Mon, Sep 09, 2024 at 10:17:42AM -0400, Olga Kornievskaia wrote: > > On Mon, Sep 9, 2024 at 8:24 AM Jeff Layton <jlayton@kernel.org> wrote: > > > > > > On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote: > > > > The pair of bloom filtered used by delegation_blocked() was intended to > > > > block delegations on given filehandles for between 30 and 60 seconds. A > > > > new filehandle would be recorded in the "new" bit set. That would then > > > > be switch to the "old" bit set between 0 and 30 seconds later, and it > > > > would remain as the "old" bit set for 30 seconds. > > > > > > > > > > Since we're on the subject... > > > > > > 60s seems like an awfully long time to block delegations on an inode. > > > Recalls generally don't take more than a few seconds when things are > > > functioning properly. > > > > > > Should we swap the bloom filters more often? > > > > I was also thinking that perhaps we can do 15-30s perhaps? Another > > thought was should this be a configurable value (with some > > non-negotiable min and max)? > > If it needs to be configurable, then we haven't done our jobs as > system architects. Temporarily blocking delegation ought to be > effective without human intervention IMHO. > > At least let's get some specific usage scenarios that demonstrate a > palpable need for enabling an admin to adjust this behavior (ie, a > need that is not simply an implementation bug), then design for > those specific cases. > > Does NFSD have metrics in this area, for example? > > Generally speaking, though, I'm not opposed to finessing the behavior > of the Bloom filter. I'd like to apply the patch below for v6.12, 100% agreed that we need this patch to go in now. The configuration was just a thought for after which I should have stated explicitly. I guess I'm not a big fan of hard coded numbers in the code and naively thinking that it's always better to have a config over a hardcoded value. > preserving the Cc: stable, but handle the behavioral finessing via > a subsequent patch targeting v6.13 so that can be appropriately > reviewed and tested. Ja? > > BTW, nice catch! > > > > > > Unfortunately the code intended to clear the old bit set once it reached > > > > 30 seconds old, preparing it to be the next new bit set, instead cleared > > > > the *new* bit set before switching it to be the old bit set. This means > > > > that the "old" bit set is always empty and delegations are blocked > > > > between 0 and 30 seconds. > > > > > > > > This patch updates bd->new before clearing the set with that index, > > > > instead of afterwards. > > > > > > > > Reported-by: Olga Kornievskaia <okorniev@redhat.com> > > > > Cc: stable@vger.kernel.org > > > > Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.") > > > > Signed-off-by: NeilBrown <neilb@suse.de> > > > > --- > > > > fs/nfsd/nfs4state.c | 5 +++-- > > > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > > > > index 4313addbe756..6f18c1a7af2e 100644 > > > > --- a/fs/nfsd/nfs4state.c > > > > +++ b/fs/nfsd/nfs4state.c > > > > @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid) > > > > * When a delegation is recalled, the filehandle is stored in the "new" > > > > * filter. > > > > * Every 30 seconds we swap the filters and clear the "new" one, > > > > - * unless both are empty of course. > > > > + * unless both are empty of course. This results in delegations for a > > > > + * given filehandle being blocked for between 30 and 60 seconds. > > > > * > > > > * Each filter is 256 bits. We hash the filehandle to 32bit and use the > > > > * low 3 bytes as hash-table indices. > > > > @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh) > > > > if (ktime_get_seconds() - bd->swap_time > 30) { > > > > bd->entries -= bd->old_entries; > > > > bd->old_entries = bd->entries; > > > > + bd->new = 1-bd->new; > > > > memset(bd->set[bd->new], 0, > > > > sizeof(bd->set[0])); > > > > - bd->new = 1-bd->new; > > > > bd->swap_time = ktime_get_seconds(); > > > > } > > > > spin_unlock(&blocked_delegations_lock); > > > > > > -- > > > Jeff Layton <jlayton@kernel.org> > > > > > > > -- > Chuck Lever >
> On Sep 9, 2024, at 11:02 AM, Olga Kornievskaia <okorniev@redhat.com> wrote: > > On Mon, Sep 9, 2024 at 10:38 AM Chuck Lever <chuck.lever@oracle.com> wrote: >> >> On Mon, Sep 09, 2024 at 10:17:42AM -0400, Olga Kornievskaia wrote: >>> On Mon, Sep 9, 2024 at 8:24 AM Jeff Layton <jlayton@kernel.org> wrote: >>>> >>>> On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote: >>>>> The pair of bloom filtered used by delegation_blocked() was intended to >>>>> block delegations on given filehandles for between 30 and 60 seconds. A >>>>> new filehandle would be recorded in the "new" bit set. That would then >>>>> be switch to the "old" bit set between 0 and 30 seconds later, and it >>>>> would remain as the "old" bit set for 30 seconds. >>>>> >>>> >>>> Since we're on the subject... >>>> >>>> 60s seems like an awfully long time to block delegations on an inode. >>>> Recalls generally don't take more than a few seconds when things are >>>> functioning properly. >>>> >>>> Should we swap the bloom filters more often? >>> >>> I was also thinking that perhaps we can do 15-30s perhaps? Another >>> thought was should this be a configurable value (with some >>> non-negotiable min and max)? >> >> If it needs to be configurable, then we haven't done our jobs as >> system architects. Temporarily blocking delegation ought to be >> effective without human intervention IMHO. >> >> At least let's get some specific usage scenarios that demonstrate a >> palpable need for enabling an admin to adjust this behavior (ie, a >> need that is not simply an implementation bug), then design for >> those specific cases. >> >> Does NFSD have metrics in this area, for example? >> >> Generally speaking, though, I'm not opposed to finessing the behavior >> of the Bloom filter. I'd like to apply the patch below for v6.12, > > 100% agreed that we need this patch to go in now. The configuration > was just a thought for after which I should have stated explicitly. I > guess I'm not a big fan of hard coded numbers in the code and naively > thinking that it's always better to have a config over a hardcoded > value. I wouldn't say naive. But there is always a cost to exposing such parameters (documentation, long-term maintainability, test matrix, user confusion, and so on). The costs are sometimes well obscured. I'd like the benefit to at least match that cost. But point taken. We can handle magic numbers with more care. >> preserving the Cc: stable, but handle the behavioral finessing via >> a subsequent patch targeting v6.13 so that can be appropriately >> reviewed and tested. Ja? >> >> BTW, nice catch! >> >> >>>>> Unfortunately the code intended to clear the old bit set once it reached >>>>> 30 seconds old, preparing it to be the next new bit set, instead cleared >>>>> the *new* bit set before switching it to be the old bit set. This means >>>>> that the "old" bit set is always empty and delegations are blocked >>>>> between 0 and 30 seconds. >>>>> >>>>> This patch updates bd->new before clearing the set with that index, >>>>> instead of afterwards. >>>>> >>>>> Reported-by: Olga Kornievskaia <okorniev@redhat.com> >>>>> Cc: stable@vger.kernel.org >>>>> Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.") >>>>> Signed-off-by: NeilBrown <neilb@suse.de> >>>>> --- >>>>> fs/nfsd/nfs4state.c | 5 +++-- >>>>> 1 file changed, 3 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c >>>>> index 4313addbe756..6f18c1a7af2e 100644 >>>>> --- a/fs/nfsd/nfs4state.c >>>>> +++ b/fs/nfsd/nfs4state.c >>>>> @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid) >>>>> * When a delegation is recalled, the filehandle is stored in the "new" >>>>> * filter. >>>>> * Every 30 seconds we swap the filters and clear the "new" one, >>>>> - * unless both are empty of course. >>>>> + * unless both are empty of course. This results in delegations for a >>>>> + * given filehandle being blocked for between 30 and 60 seconds. >>>>> * >>>>> * Each filter is 256 bits. We hash the filehandle to 32bit and use the >>>>> * low 3 bytes as hash-table indices. >>>>> @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh) >>>>> if (ktime_get_seconds() - bd->swap_time > 30) { >>>>> bd->entries -= bd->old_entries; >>>>> bd->old_entries = bd->entries; >>>>> + bd->new = 1-bd->new; >>>>> memset(bd->set[bd->new], 0, >>>>> sizeof(bd->set[0])); >>>>> - bd->new = 1-bd->new; >>>>> bd->swap_time = ktime_get_seconds(); >>>>> } >>>>> spin_unlock(&blocked_delegations_lock); >>>> >>>> -- >>>> Jeff Layton <jlayton@kernel.org> >>>> >>> >> >> -- >> Chuck Lever -- Chuck Lever
On 9/9/2024 11:02 AM, Olga Kornievskaia wrote: > On Mon, Sep 9, 2024 at 10:38 AM Chuck Lever <chuck.lever@oracle.com> wrote: >> >> On Mon, Sep 09, 2024 at 10:17:42AM -0400, Olga Kornievskaia wrote: >>> On Mon, Sep 9, 2024 at 8:24 AM Jeff Layton <jlayton@kernel.org> wrote: >>>> >>>> On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote: >>>>> The pair of bloom filtered used by delegation_blocked() was intended to >>>>> block delegations on given filehandles for between 30 and 60 seconds. A >>>>> new filehandle would be recorded in the "new" bit set. That would then >>>>> be switch to the "old" bit set between 0 and 30 seconds later, and it >>>>> would remain as the "old" bit set for 30 seconds. >>>>> >>>> >>>> Since we're on the subject... >>>> >>>> 60s seems like an awfully long time to block delegations on an inode. >>>> Recalls generally don't take more than a few seconds when things are >>>> functioning properly. >>>> >>>> Should we swap the bloom filters more often? >>> >>> I was also thinking that perhaps we can do 15-30s perhaps? Another >>> thought was should this be a configurable value (with some >>> non-negotiable min and max)? >> >> If it needs to be configurable, then we haven't done our jobs as >> system architects. Temporarily blocking delegation ought to be >> effective without human intervention IMHO. >> >> At least let's get some specific usage scenarios that demonstrate a >> palpable need for enabling an admin to adjust this behavior (ie, a >> need that is not simply an implementation bug), then design for >> those specific cases. >> >> Does NFSD have metrics in this area, for example? >> >> Generally speaking, though, I'm not opposed to finessing the behavior >> of the Bloom filter. I'd like to apply the patch below for v6.12, > > 100% agreed that we need this patch to go in now. The configuration > was just a thought for after which I should have stated explicitly. I > guess I'm not a big fan of hard coded numbers in the code and naively > thinking that it's always better to have a config over a hardcoded > value. No constant is ever correct in networking, especially timeouts. So yes, it should be adjustable. But even then, choosing a number here is fundamentally difficult. Delegations can block for perfectly valid long periods, right? Say it takes a long time to flush a write delegation, or if the network is partitioned to the (other) client being recalled. 30 seconds to data corruption is quite the guillotine. Tom. >> preserving the Cc: stable, but handle the behavioral finessing via >> a subsequent patch targeting v6.13 so that can be appropriately >> reviewed and tested. Ja? >> >> BTW, nice catch! >> >> >>>>> Unfortunately the code intended to clear the old bit set once it reached >>>>> 30 seconds old, preparing it to be the next new bit set, instead cleared >>>>> the *new* bit set before switching it to be the old bit set. This means >>>>> that the "old" bit set is always empty and delegations are blocked >>>>> between 0 and 30 seconds. >>>>> >>>>> This patch updates bd->new before clearing the set with that index, >>>>> instead of afterwards. >>>>> >>>>> Reported-by: Olga Kornievskaia <okorniev@redhat.com> >>>>> Cc: stable@vger.kernel.org >>>>> Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.") >>>>> Signed-off-by: NeilBrown <neilb@suse.de> >>>>> --- >>>>> fs/nfsd/nfs4state.c | 5 +++-- >>>>> 1 file changed, 3 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c >>>>> index 4313addbe756..6f18c1a7af2e 100644 >>>>> --- a/fs/nfsd/nfs4state.c >>>>> +++ b/fs/nfsd/nfs4state.c >>>>> @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid) >>>>> * When a delegation is recalled, the filehandle is stored in the "new" >>>>> * filter. >>>>> * Every 30 seconds we swap the filters and clear the "new" one, >>>>> - * unless both are empty of course. >>>>> + * unless both are empty of course. This results in delegations for a >>>>> + * given filehandle being blocked for between 30 and 60 seconds. >>>>> * >>>>> * Each filter is 256 bits. We hash the filehandle to 32bit and use the >>>>> * low 3 bytes as hash-table indices. >>>>> @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh) >>>>> if (ktime_get_seconds() - bd->swap_time > 30) { >>>>> bd->entries -= bd->old_entries; >>>>> bd->old_entries = bd->entries; >>>>> + bd->new = 1-bd->new; >>>>> memset(bd->set[bd->new], 0, >>>>> sizeof(bd->set[0])); >>>>> - bd->new = 1-bd->new; >>>>> bd->swap_time = ktime_get_seconds(); >>>>> } >>>>> spin_unlock(&blocked_delegations_lock); >>>> >>>> -- >>>> Jeff Layton <jlayton@kernel.org> >>>> >>> >> >> -- >> Chuck Lever >> > >
On Tue, 2024-09-10 at 08:32 -0400, Tom Talpey wrote: > On 9/9/2024 11:02 AM, Olga Kornievskaia wrote: > > On Mon, Sep 9, 2024 at 10:38 AM Chuck Lever <chuck.lever@oracle.com> wrote: > > > > > > On Mon, Sep 09, 2024 at 10:17:42AM -0400, Olga Kornievskaia wrote: > > > > On Mon, Sep 9, 2024 at 8:24 AM Jeff Layton <jlayton@kernel.org> wrote: > > > > > > > > > > On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote: > > > > > > The pair of bloom filtered used by delegation_blocked() was intended to > > > > > > block delegations on given filehandles for between 30 and 60 seconds. A > > > > > > new filehandle would be recorded in the "new" bit set. That would then > > > > > > be switch to the "old" bit set between 0 and 30 seconds later, and it > > > > > > would remain as the "old" bit set for 30 seconds. > > > > > > > > > > > > > > > > Since we're on the subject... > > > > > > > > > > 60s seems like an awfully long time to block delegations on an inode. > > > > > Recalls generally don't take more than a few seconds when things are > > > > > functioning properly. > > > > > > > > > > Should we swap the bloom filters more often? > > > > > > > > I was also thinking that perhaps we can do 15-30s perhaps? Another > > > > thought was should this be a configurable value (with some > > > > non-negotiable min and max)? > > > > > > If it needs to be configurable, then we haven't done our jobs as > > > system architects. Temporarily blocking delegation ought to be > > > effective without human intervention IMHO. > > > > > > At least let's get some specific usage scenarios that demonstrate a > > > palpable need for enabling an admin to adjust this behavior (ie, a > > > need that is not simply an implementation bug), then design for > > > those specific cases. > > > > > > Does NFSD have metrics in this area, for example? > > > > > > Generally speaking, though, I'm not opposed to finessing the behavior > > > of the Bloom filter. I'd like to apply the patch below for v6.12, > > > > 100% agreed that we need this patch to go in now. The configuration > > was just a thought for after which I should have stated explicitly. I > > guess I'm not a big fan of hard coded numbers in the code and naively > > thinking that it's always better to have a config over a hardcoded > > value. > > No constant is ever correct in networking, especially timeouts. So yes, > it should be adjustable. But even then, choosing a number here is > fundamentally difficult. > > Delegations can block for perfectly valid long periods, right? Say it > takes a long time to flush a write delegation, or if the network is > partitioned to the (other) client being recalled. 30 seconds to data > corruption is quite the guillotine. > I don't think this is danger of data corruption here. The bloom filter is there to keep the server from handing out a new delegation too quickly after having to recall one. Allowing no delegations for 30-60s seems a bit too cautious, IMO. Ideally it seems like we'd want this to be some function of the delay between the server issuing the CB_RECALL, and the client doing the DELEGRETURN. That value is obviously highly variable, but it would be an interesting statistic to collect, and might help inform what the bloom filter delay should be. > > > preserving the Cc: stable, but handle the behavioral finessing via > > > a subsequent patch targeting v6.13 so that can be appropriately > > > reviewed and tested. Ja? > > > > > > BTW, nice catch! > > > > > > > > > > > > Unfortunately the code intended to clear the old bit set once it reached > > > > > > 30 seconds old, preparing it to be the next new bit set, instead cleared > > > > > > the *new* bit set before switching it to be the old bit set. This means > > > > > > that the "old" bit set is always empty and delegations are blocked > > > > > > between 0 and 30 seconds. > > > > > > > > > > > > This patch updates bd->new before clearing the set with that index, > > > > > > instead of afterwards. > > > > > > > > > > > > Reported-by: Olga Kornievskaia <okorniev@redhat.com> > > > > > > Cc: stable@vger.kernel.org > > > > > > Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.") > > > > > > Signed-off-by: NeilBrown <neilb@suse.de> > > > > > > --- > > > > > > fs/nfsd/nfs4state.c | 5 +++-- > > > > > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > > > > > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c > > > > > > index 4313addbe756..6f18c1a7af2e 100644 > > > > > > --- a/fs/nfsd/nfs4state.c > > > > > > +++ b/fs/nfsd/nfs4state.c > > > > > > @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid) > > > > > > * When a delegation is recalled, the filehandle is stored in the "new" > > > > > > * filter. > > > > > > * Every 30 seconds we swap the filters and clear the "new" one, > > > > > > - * unless both are empty of course. > > > > > > + * unless both are empty of course. This results in delegations for a > > > > > > + * given filehandle being blocked for between 30 and 60 seconds. > > > > > > * > > > > > > * Each filter is 256 bits. We hash the filehandle to 32bit and use the > > > > > > * low 3 bytes as hash-table indices. > > > > > > @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh) > > > > > > if (ktime_get_seconds() - bd->swap_time > 30) { > > > > > > bd->entries -= bd->old_entries; > > > > > > bd->old_entries = bd->entries; > > > > > > + bd->new = 1-bd->new; > > > > > > memset(bd->set[bd->new], 0, > > > > > > sizeof(bd->set[0])); > > > > > > - bd->new = 1-bd->new; > > > > > > bd->swap_time = ktime_get_seconds(); > > > > > > } > > > > > > spin_unlock(&blocked_delegations_lock); > > > > > > > > > > -- > > > > > Jeff Layton <jlayton@kernel.org> > > > > > > > > > > > > > > > -- > > > Chuck Lever > > > > > > > >
On 9/10/2024 8:52 AM, Jeff Layton wrote: > On Tue, 2024-09-10 at 08:32 -0400, Tom Talpey wrote: >> On 9/9/2024 11:02 AM, Olga Kornievskaia wrote: >>> On Mon, Sep 9, 2024 at 10:38 AM Chuck Lever <chuck.lever@oracle.com> wrote: >>>> >>>> On Mon, Sep 09, 2024 at 10:17:42AM -0400, Olga Kornievskaia wrote: >>>>> On Mon, Sep 9, 2024 at 8:24 AM Jeff Layton <jlayton@kernel.org> wrote: >>>>>> >>>>>> On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote: >>>>>>> The pair of bloom filtered used by delegation_blocked() was intended to >>>>>>> block delegations on given filehandles for between 30 and 60 seconds. A >>>>>>> new filehandle would be recorded in the "new" bit set. That would then >>>>>>> be switch to the "old" bit set between 0 and 30 seconds later, and it >>>>>>> would remain as the "old" bit set for 30 seconds. >>>>>>> >>>>>> >>>>>> Since we're on the subject... >>>>>> >>>>>> 60s seems like an awfully long time to block delegations on an inode. >>>>>> Recalls generally don't take more than a few seconds when things are >>>>>> functioning properly. >>>>>> >>>>>> Should we swap the bloom filters more often? >>>>> >>>>> I was also thinking that perhaps we can do 15-30s perhaps? Another >>>>> thought was should this be a configurable value (with some >>>>> non-negotiable min and max)? >>>> >>>> If it needs to be configurable, then we haven't done our jobs as >>>> system architects. Temporarily blocking delegation ought to be >>>> effective without human intervention IMHO. >>>> >>>> At least let's get some specific usage scenarios that demonstrate a >>>> palpable need for enabling an admin to adjust this behavior (ie, a >>>> need that is not simply an implementation bug), then design for >>>> those specific cases. >>>> >>>> Does NFSD have metrics in this area, for example? >>>> >>>> Generally speaking, though, I'm not opposed to finessing the behavior >>>> of the Bloom filter. I'd like to apply the patch below for v6.12, >>> >>> 100% agreed that we need this patch to go in now. The configuration >>> was just a thought for after which I should have stated explicitly. I >>> guess I'm not a big fan of hard coded numbers in the code and naively >>> thinking that it's always better to have a config over a hardcoded >>> value. >> >> No constant is ever correct in networking, especially timeouts. So yes, >> it should be adjustable. But even then, choosing a number here is >> fundamentally difficult. >> >> Delegations can block for perfectly valid long periods, right? Say it >> takes a long time to flush a write delegation, or if the network is >> partitioned to the (other) client being recalled. 30 seconds to data >> corruption is quite the guillotine. >> > > I don't think this is danger of data corruption here. The bloom filter > is there to keep the server from handing out a new delegation too > quickly after having to recall one. Allowing no delegations for 30-60s > seems a bit too cautious, IMO. Agreed that the server is doing the right thing, it's about when the hammer comes down, and again, any constant is inevitably going to be wrong, sometime or somewhere. > Ideally it seems like we'd want this to be some function of the delay > between the server issuing the CB_RECALL, and the client doing the > DELEGRETURN. > > That value is obviously highly variable, but it would be an interesting > statistic to collect, and might help inform what the bloom filter delay > should be. It would, but it seems it would have to be per-client and per-workload. There is definitely no single value. Some clients may take a caching read delegation "just because", and return it promptly. Others may be at the end of a skinny/flaky link and want to buffer writes. These are going to have wildly different delegreturn latencies. Tom. > >>>> preserving the Cc: stable, but handle the behavioral finessing via >>>> a subsequent patch targeting v6.13 so that can be appropriately >>>> reviewed and tested. Ja? >>>> >>>> BTW, nice catch! >>>> >>>> >>>>>>> Unfortunately the code intended to clear the old bit set once it reached >>>>>>> 30 seconds old, preparing it to be the next new bit set, instead cleared >>>>>>> the *new* bit set before switching it to be the old bit set. This means >>>>>>> that the "old" bit set is always empty and delegations are blocked >>>>>>> between 0 and 30 seconds. >>>>>>> >>>>>>> This patch updates bd->new before clearing the set with that index, >>>>>>> instead of afterwards. >>>>>>> >>>>>>> Reported-by: Olga Kornievskaia <okorniev@redhat.com> >>>>>>> Cc: stable@vger.kernel.org >>>>>>> Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.") >>>>>>> Signed-off-by: NeilBrown <neilb@suse.de> >>>>>>> --- >>>>>>> fs/nfsd/nfs4state.c | 5 +++-- >>>>>>> 1 file changed, 3 insertions(+), 2 deletions(-) >>>>>>> >>>>>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c >>>>>>> index 4313addbe756..6f18c1a7af2e 100644 >>>>>>> --- a/fs/nfsd/nfs4state.c >>>>>>> +++ b/fs/nfsd/nfs4state.c >>>>>>> @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid) >>>>>>> * When a delegation is recalled, the filehandle is stored in the "new" >>>>>>> * filter. >>>>>>> * Every 30 seconds we swap the filters and clear the "new" one, >>>>>>> - * unless both are empty of course. >>>>>>> + * unless both are empty of course. This results in delegations for a >>>>>>> + * given filehandle being blocked for between 30 and 60 seconds. >>>>>>> * >>>>>>> * Each filter is 256 bits. We hash the filehandle to 32bit and use the >>>>>>> * low 3 bytes as hash-table indices. >>>>>>> @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh) >>>>>>> if (ktime_get_seconds() - bd->swap_time > 30) { >>>>>>> bd->entries -= bd->old_entries; >>>>>>> bd->old_entries = bd->entries; >>>>>>> + bd->new = 1-bd->new; >>>>>>> memset(bd->set[bd->new], 0, >>>>>>> sizeof(bd->set[0])); >>>>>>> - bd->new = 1-bd->new; >>>>>>> bd->swap_time = ktime_get_seconds(); >>>>>>> } >>>>>>> spin_unlock(&blocked_delegations_lock); >>>>>> >>>>>> -- >>>>>> Jeff Layton <jlayton@kernel.org> >>>>>> >>>>> >>>> >>>> -- >>>> Chuck Lever >>>> >>> >>> >> >
> On Sep 10, 2024, at 8:32 AM, Tom Talpey <tom@talpey.com> wrote: > > On 9/9/2024 11:02 AM, Olga Kornievskaia wrote: >> On Mon, Sep 9, 2024 at 10:38 AM Chuck Lever <chuck.lever@oracle.com> wrote: >>> >>> On Mon, Sep 09, 2024 at 10:17:42AM -0400, Olga Kornievskaia wrote: >>>> On Mon, Sep 9, 2024 at 8:24 AM Jeff Layton <jlayton@kernel.org> wrote: >>>>> >>>>> On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote: >>>>>> The pair of bloom filtered used by delegation_blocked() was intended to >>>>>> block delegations on given filehandles for between 30 and 60 seconds. A >>>>>> new filehandle would be recorded in the "new" bit set. That would then >>>>>> be switch to the "old" bit set between 0 and 30 seconds later, and it >>>>>> would remain as the "old" bit set for 30 seconds. >>>>>> >>>>> >>>>> Since we're on the subject... >>>>> >>>>> 60s seems like an awfully long time to block delegations on an inode. >>>>> Recalls generally don't take more than a few seconds when things are >>>>> functioning properly. >>>>> >>>>> Should we swap the bloom filters more often? >>>> >>>> I was also thinking that perhaps we can do 15-30s perhaps? Another >>>> thought was should this be a configurable value (with some >>>> non-negotiable min and max)? >>> >>> If it needs to be configurable, then we haven't done our jobs as >>> system architects. Temporarily blocking delegation ought to be >>> effective without human intervention IMHO. >>> >>> At least let's get some specific usage scenarios that demonstrate a >>> palpable need for enabling an admin to adjust this behavior (ie, a >>> need that is not simply an implementation bug), then design for >>> those specific cases. >>> >>> Does NFSD have metrics in this area, for example? >>> >>> Generally speaking, though, I'm not opposed to finessing the behavior >>> of the Bloom filter. I'd like to apply the patch below for v6.12, >> 100% agreed that we need this patch to go in now. The configuration >> was just a thought for after which I should have stated explicitly. I >> guess I'm not a big fan of hard coded numbers in the code and naively >> thinking that it's always better to have a config over a hardcoded >> value. > > No constant is ever correct in networking, especially timeouts. But a constant can be "good enough". No one has yet clearly demonstrated, for this mechanism, that one timeout value is worse than another. > So yes, > it should be adjustable. But even then, choosing a number here is > fundamentally difficult. I'm not saying it should never be adjustable. We need some data showing what range of values is useful, and we need to know where the control needs to be placed: export option? sysctl? /sys or /proc? Can this knob be abused? Can the setting be automated? Maybe the control should be "delay a few seconds vs. put that client in the doghouse until server reboots"? Indeed, it might be that any delay greater than zero has some benefit. In which case, the flexibility of changing the filter reset time starts to look unnecessary. Point is: we don't know yet. I would like to see a plan for how to gather more data, first. As I said yesterday, there is a cost to adding knobs like this. I want to know what we're getting for the money. (To be clear, I'm not making a hard objection. I'm just asking to wait for more information). > Delegations can block for perfectly valid long periods, right? Say it > takes a long time to flush a write delegation, or if the network is > partitioned to the (other) client being recalled. 30 seconds to data > corruption is quite the guillotine. The Bloom filter is not about correctness, so data corruption is not a factor. The purpose of this mechanism is to give a fair forward progress guarantee in certain corner cases. > Tom. > >>> preserving the Cc: stable, but handle the behavioral finessing via >>> a subsequent patch targeting v6.13 so that can be appropriately >>> reviewed and tested. Ja? >>> >>> BTW, nice catch! >>> >>> >>>>>> Unfortunately the code intended to clear the old bit set once it reached >>>>>> 30 seconds old, preparing it to be the next new bit set, instead cleared >>>>>> the *new* bit set before switching it to be the old bit set. This means >>>>>> that the "old" bit set is always empty and delegations are blocked >>>>>> between 0 and 30 seconds. >>>>>> >>>>>> This patch updates bd->new before clearing the set with that index, >>>>>> instead of afterwards. >>>>>> >>>>>> Reported-by: Olga Kornievskaia <okorniev@redhat.com> >>>>>> Cc: stable@vger.kernel.org >>>>>> Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.") >>>>>> Signed-off-by: NeilBrown <neilb@suse.de> >>>>>> --- >>>>>> fs/nfsd/nfs4state.c | 5 +++-- >>>>>> 1 file changed, 3 insertions(+), 2 deletions(-) >>>>>> >>>>>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c >>>>>> index 4313addbe756..6f18c1a7af2e 100644 >>>>>> --- a/fs/nfsd/nfs4state.c >>>>>> +++ b/fs/nfsd/nfs4state.c >>>>>> @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid) >>>>>> * When a delegation is recalled, the filehandle is stored in the "new" >>>>>> * filter. >>>>>> * Every 30 seconds we swap the filters and clear the "new" one, >>>>>> - * unless both are empty of course. >>>>>> + * unless both are empty of course. This results in delegations for a >>>>>> + * given filehandle being blocked for between 30 and 60 seconds. >>>>>> * >>>>>> * Each filter is 256 bits. We hash the filehandle to 32bit and use the >>>>>> * low 3 bytes as hash-table indices. >>>>>> @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh) >>>>>> if (ktime_get_seconds() - bd->swap_time > 30) { >>>>>> bd->entries -= bd->old_entries; >>>>>> bd->old_entries = bd->entries; >>>>>> + bd->new = 1-bd->new; >>>>>> memset(bd->set[bd->new], 0, >>>>>> sizeof(bd->set[0])); >>>>>> - bd->new = 1-bd->new; >>>>>> bd->swap_time = ktime_get_seconds(); >>>>>> } >>>>>> spin_unlock(&blocked_delegations_lock); >>>>> >>>>> -- >>>>> Jeff Layton <jlayton@kernel.org> >>>>> >>>> >>> >>> -- >>> Chuck Lever -- Chuck Lever
On Mon, 09 Sep 2024, Jeff Layton wrote: > On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote: > > The pair of bloom filtered used by delegation_blocked() was intended to > > block delegations on given filehandles for between 30 and 60 seconds. A > > new filehandle would be recorded in the "new" bit set. That would then > > be switch to the "old" bit set between 0 and 30 seconds later, and it > > would remain as the "old" bit set for 30 seconds. > > > > Since we're on the subject... > > 60s seems like an awfully long time to block delegations on an inode. > Recalls generally don't take more than a few seconds when things are > functioning properly. > > Should we swap the bloom filters more often? Or should we take a completely different approach? Or both? I'm bothered by the fact that this bug in a heuristic caused rename to misbehave this way. So I did some exploration. try_break_deleg / break_deleg_wait are called: - notify_change() - chmod_common, chown_common, and vfs_utimes waits - vfs_unlink() - do_unlinkat waits for the delegation to be broken - vfs_link() - do_linkat waits - vfs_rename() - do_renameat2 waits - vfs_set_acl() - waits itself - vfs_remove_acl() - ditto - __vfs_setxattr_locked() - vfs_setxattr waits - __vfs_removexattr_locked() - vfs_removexattr waits I would argue that *none* of these justify delaying further delegations once the operation itself has completed. They aren't the sort of things that suggest on-going cross-client "sharing" which delegations interfere with. I imagine try_break_lease would increment some counter in ->i_flctx which prevents further leases being taken, and some new call near where break_deleg_wait() is called will decrement that counter. If waiting for a lease to break, call break_deleg_wait() and retry, else break_deleg_done(). Delegations are also broken by calls break_lease() which comes from nfsd_open_break_lease() for conflicting nfsd opens. I think there *are* indicitive of shares and *should* result in the filehandle being recorded in the bloom filter. Maybe if break_lease() in nfsd_open_break_lease() returns -EWOULDBLOCK, then the filehandle could be added to the bloom filter at that point. do_dentry_open also calls break_lease() but we still want the filehandle to go in the bloom-filter in that case .... maybe the lm_break callback could check i_writecount and i_readcount and use those to determine if delaying delegations is appropriate. wrt the time to block delegation for, I'm not sure that it matters much. Once we focus the delay on files that are concurrently opened from multiple clients (not both read-only) I think there are some files (most) that are never shared, and some files that are often shared. We probably don't want delegations for often shared files at all. So I'd be comfortable blocking delegations on often-shared files for quite a while. I wouldn't advocate for *longer* than 30 seconds because I don't want the bloom filters to fill up - that significantly reduced their usefulness. So maybe we could also swap filters when the 'new' one becomes 50% full.... NeilBrown
On Tue, Sep 10, 2024 at 8:00 PM NeilBrown <neilb@suse.de> wrote: > > On Mon, 09 Sep 2024, Jeff Layton wrote: > > On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote: > > > The pair of bloom filtered used by delegation_blocked() was intended to > > > block delegations on given filehandles for between 30 and 60 seconds. A > > > new filehandle would be recorded in the "new" bit set. That would then > > > be switch to the "old" bit set between 0 and 30 seconds later, and it > > > would remain as the "old" bit set for 30 seconds. > > > > > > > Since we're on the subject... > > > > 60s seems like an awfully long time to block delegations on an inode. > > Recalls generally don't take more than a few seconds when things are > > functioning properly. > > > > Should we swap the bloom filters more often? > > Or should we take a completely different approach? Or both? > I'm bothered by the fact that this bug in a heuristic caused rename to > misbehave this way. So I did some exploration. > > try_break_deleg / break_deleg_wait are called: > > - notify_change() - chmod_common, chown_common, and vfs_utimes waits > - vfs_unlink() - do_unlinkat waits for the delegation to be broken > - vfs_link() - do_linkat waits > - vfs_rename() - do_renameat2 waits > - vfs_set_acl() - waits itself > - vfs_remove_acl() - ditto > - __vfs_setxattr_locked() - vfs_setxattr waits > - __vfs_removexattr_locked() - vfs_removexattr waits > > I would argue that *none* of these justify delaying further delegations > once the operation itself has completed. They aren't the sort of things > that suggest on-going cross-client "sharing" which delegations interfere > with. I wouldn't discount these operations (at least not rename) from being an operation that can't represent "sharing" of files. An example workload is where a file gets generated, created, written/read over the NFS, but then locally then transferred to another filesystem. I can imagine a pipeline, where then file gets filled up and the generated data moved to be worked on elsewhere and the same file gets filled up again. I think this bug was discovered because of an setup where there was a heavy use of these operations (on various files) and some got blocked causing problems. For such workload, if we are not going to block giving out a delegation do we cause too many cb_recalls? > I imagine try_break_lease would increment some counter in ->i_flctx > which prevents further leases being taken, and some new call near where > break_deleg_wait() is called will decrement that counter. If waiting > for a lease to break, call break_deleg_wait() and retry, else > break_deleg_done(). > > Delegations are also broken by calls break_lease() which comes from > nfsd_open_break_lease() for conflicting nfsd opens. I think there *are* > indicitive of shares and *should* result in the filehandle being recorded > in the bloom filter. > Maybe if break_lease() in nfsd_open_break_lease() returns -EWOULDBLOCK, > then the filehandle could be added to the bloom filter at that point. What about vfs_truncate() which also calls break_lease? > do_dentry_open also calls break_lease() but we still want the filehandle > to go in the bloom-filter in that case .... maybe the lm_break callback > could check i_writecount and i_readcount and use those to determine if > delaying delegations is appropriate. > > wrt the time to block delegation for, I'm not sure that it matters much. > Once we focus the delay on files that are concurrently opened from > multiple clients (not both read-only) I think there are some files > (most) that are never shared, and some files that are often shared. We > probably don't want delegations for often shared files at all. So I'd > be comfortable blocking delegations on often-shared files for quite a > while. So perhaps rename might be an exception among the operations that are triggering delegation recalls. Though I think unlink might be similar and truncate. Otherwise, perhaps optimizing other operation would be useful. However, I would like to ask does the added complexity justify the benefits? What kind of workload would be greatly penalized? If we imagine the other operations are low occurrence (ie not representing a sharing example) then the penalty is just an infrequent 60s block. > I wouldn't advocate for *longer* than 30 seconds because I don't want > the bloom filters to fill up - that significantly reduced their > usefulness. So maybe we could also swap filters when the 'new' one > becomes 50% full.... I was suggesting changing that value to 15 because it would mean that at most the wait would be 30s which is same as what it was before the fix but we are imposing at least 15s block and it keeps the filter 'at the same capacity as before'. But of course in the case of heavy conflicting operations there will be less blockage and more recalls. This is where I was hoping a configurable value might be handy. Or, I would instead argue that we implement a heuristic of detecting that the server is in a state of frequent delegation recall phase and then automatically adjust the value for how long the delegations are blocked. Perhaps we keep a counter in the bloom_filter when a block is set and when a threshold is reached we increase the value before we swap the filters (yes at the risk of having a full bloom filter but because it's better to be cautious in giving out delegations when there are frequent recalls?). Then, with perhaps another timer we adjust the block time down.... Wouldn't that cover either having different NFS clients having conflicting opens or having some workload that has NFS/local file system conflicts. > > NeilBrown >
On Thu, 12 Sep 2024, Olga Kornievskaia wrote: > On Tue, Sep 10, 2024 at 8:00 PM NeilBrown <neilb@suse.de> wrote: > > > > On Mon, 09 Sep 2024, Jeff Layton wrote: > > > On Mon, 2024-09-09 at 15:06 +1000, NeilBrown wrote: > > > > The pair of bloom filtered used by delegation_blocked() was intended to > > > > block delegations on given filehandles for between 30 and 60 seconds. A > > > > new filehandle would be recorded in the "new" bit set. That would then > > > > be switch to the "old" bit set between 0 and 30 seconds later, and it > > > > would remain as the "old" bit set for 30 seconds. > > > > > > > > > > Since we're on the subject... > > > > > > 60s seems like an awfully long time to block delegations on an inode. > > > Recalls generally don't take more than a few seconds when things are > > > functioning properly. > > > > > > Should we swap the bloom filters more often? > > > > Or should we take a completely different approach? Or both? > > I'm bothered by the fact that this bug in a heuristic caused rename to > > misbehave this way. So I did some exploration. > > > > try_break_deleg / break_deleg_wait are called: > > > > - notify_change() - chmod_common, chown_common, and vfs_utimes waits > > - vfs_unlink() - do_unlinkat waits for the delegation to be broken > > - vfs_link() - do_linkat waits > > - vfs_rename() - do_renameat2 waits > > - vfs_set_acl() - waits itself > > - vfs_remove_acl() - ditto > > - __vfs_setxattr_locked() - vfs_setxattr waits > > - __vfs_removexattr_locked() - vfs_removexattr waits > > > > I would argue that *none* of these justify delaying further delegations > > once the operation itself has completed. They aren't the sort of things > > that suggest on-going cross-client "sharing" which delegations interfere > > with. > > I wouldn't discount these operations (at least not rename) from being > an operation that can't represent "sharing" of files. An example > workload is where a file gets generated, created, written/read over > the NFS, but then locally then transferred to another filesystem. I > can imagine a pipeline, where then file gets filled up and the > generated data moved to be worked on elsewhere and the same file gets > filled up again. I think this bug was discovered because of an setup > where there was a heavy use of these operations (on various files) and > some got blocked causing problems. For such workload, if we are not > going to block giving out a delegation do we cause too many > cb_recalls? A pipeline as you describe seem to be a case of serial sharing. Different applications use the same file, but only at different times. This sort of sharing isn't hurt by delegations. The sort of sharing the might trigger excessive cb_recalls if delegations weren't blocked would almost certainly involve file locking and an expectation that two separate applications would sometimes access the file concurrently. When this is happening, neither should get a delegation. The problem you saw was really caused by a delegation being given out while the rename was still happening. i.e.: - the rename starts - the delegation is detected and broken - the cb_recall is sent. - the client opens the file prior to returning the delegation - the client gets a new delegation as part of this open - the client returns the original delegation - the rename loops around and finds a new delegation which it needs to break. The should only loop once unless the recall takes more than 30 seconds. So I'm a bit perplexed that it blocked lock enough to be noticed. So maybe there is more going on here than I can see. Or maybe the recall is really slow. In either case I think we want to firmly block delegations while a rename (unlink etc) is happening, but do not need to continue to block them after the rename etc completes. > > > I imagine try_break_lease would increment some counter in ->i_flctx > > which prevents further leases being taken, and some new call near where > > break_deleg_wait() is called will decrement that counter. If waiting > > for a lease to break, call break_deleg_wait() and retry, else > > break_deleg_done(). > > > > Delegations are also broken by calls break_lease() which comes from > > nfsd_open_break_lease() for conflicting nfsd opens. I think there *are* > > indicitive of shares and *should* result in the filehandle being recorded > > in the bloom filter. > > Maybe if break_lease() in nfsd_open_break_lease() returns -EWOULDBLOCK, > > then the filehandle could be added to the bloom filter at that point. > > What about vfs_truncate() which also calls break_lease? vfs_truncate() is quite different, though that isn't a good excuse for me to leave it out. It uses break_lease() like open does - not try_break_deleg() like rename. It can do this because it has called get_write_access() in the inode which has incremented i_writecount which will prevent leases from being granted. But that DOESN'T prevent delegations from being granted. I think it should. > > > do_dentry_open also calls break_lease() but we still want the filehandle > > to go in the bloom-filter in that case .... maybe the lm_break callback > > could check i_writecount and i_readcount and use those to determine if > > delaying delegations is appropriate. > > > > wrt the time to block delegation for, I'm not sure that it matters much. > > Once we focus the delay on files that are concurrently opened from > > multiple clients (not both read-only) I think there are some files > > (most) that are never shared, and some files that are often shared. We > > probably don't want delegations for often shared files at all. So I'd > > be comfortable blocking delegations on often-shared files for quite a > > while. > > So perhaps rename might be an exception among the operations that are > triggering delegation recalls. Though I think unlink might be similar > and truncate. Otherwise, perhaps optimizing other operation would be > useful. However, I would like to ask does the added complexity justify > the benefits? What kind of workload would be greatly penalized? If we > imagine the other operations are low occurrence (ie not representing a > sharing example) then the penalty is just an infrequent 60s block. > My concern is that the current design is conceptually wrong. We need to block delegation while rename etc is happening. We currently block for 30-60 seconds. As rename never takes that long, this works. But I don't like it. Some day someone might find a workload for which the current delay is too long. If we just reduced the size of the delay, we might make it take less time than a rename, and something unexpected would break. So I want to fix the design so that it make sense. Then I won't really care how long the delay is for. > > I wouldn't advocate for *longer* than 30 seconds because I don't want > > the bloom filters to fill up - that significantly reduced their > > usefulness. So maybe we could also swap filters when the 'new' one > > becomes 50% full.... > > I was suggesting changing that value to 15 because it would mean that > at most the wait would be 30s which is same as what it was before the > fix but we are imposing at least 15s block and it keeps the filter 'at > the same capacity as before'. But of course in the case of heavy > conflicting operations there will be less blockage and more recalls. > This is where I was hoping a configurable value might be handy. Or, I > would instead argue that we implement a heuristic of detecting that > the server is in a state of frequent delegation recall phase and then > automatically adjust the value for how long the delegations are > blocked. Perhaps we keep a counter in the bloom_filter when a block > is set and when a threshold is reached we increase the value before we > swap the filters (yes at the risk of having a full bloom filter but > because it's better to be cautious in giving out delegations when > there are frequent recalls?). Then, with perhaps another timer we > adjust the block time down.... Wouldn't that cover either having > different NFS clients having conflicting opens or having some workload > that has NFS/local file system conflicts. > I agree that as we have been use "0-30 seconds" for 10 year with no complaints, changing to "15-30 seconds" is safer than change to "30-60 seconds". I think that if we want to fine-tune further, there are other steps we could take which might more clearly be helpful. Currently we only block delegation when a recall happens. This means the 30-60 second count is started when a client opens a file and sharing is detected. I think we should also start that timer when a client CLOSES a file which some other client has open. So if it is ever the case that to clients have the file open at the same time, we block delegations until 30-60 seconds after the file has been closed by all clients. i.e I don't really want to detect "frequent delegation recall". I want to detect "ongoing sharing". In that case I don't want any delegations or any recall. Thanks, NeilBrown
On Mon, 16 Sep 2024, NeilBrown wrote: > On Thu, 12 Sep 2024, Olga Kornievskaia wrote: > > > > What about vfs_truncate() which also calls break_lease? > > vfs_truncate() is quite different, though that isn't a good excuse for > me to leave it out. It uses break_lease() like open does - not > try_break_deleg() like rename. > It can do this because it has called get_write_access() in the inode > which has incremented i_writecount which will prevent leases from being > granted. But that DOESN'T prevent delegations from being granted. I > think it should. Actually it does. check_conflicting_opens() does cause the delegation request to fail because it is a no-op for delegations. But between getting the delegation and hashing the delegation, nfsd4_check_conflicting_opens() is called and it fails if i_writecount has been elevated. So the break_lease() from vfs_truncate() is safe - delegations are prevented until the truncation completes. NeilBrown
On 15 Sep 2024, at 21:17, NeilBrown wrote: > On Thu, 12 Sep 2024, Olga Kornievskaia wrote: > >> I wouldn't discount these operations (at least not rename) from being >> an operation that can't represent "sharing" of files. An example >> workload is where a file gets generated, created, written/read over >> the NFS, but then locally then transferred to another filesystem. I >> can imagine a pipeline, where then file gets filled up and the >> generated data moved to be worked on elsewhere and the same file gets >> filled up again. I think this bug was discovered because of an setup >> where there was a heavy use of these operations (on various files) and >> some got blocked causing problems. For such workload, if we are not >> going to block giving out a delegation do we cause too many >> cb_recalls? > > A pipeline as you describe seem to be a case of serial sharing. > Different applications use the same file, but only at different times. > This sort of sharing isn't hurt by delegations. > > The sort of sharing the might trigger excessive cb_recalls if > delegations weren't blocked would almost certainly involve file locking > and an expectation that two separate applications would sometimes access > the file concurrently. When this is happening, neither should get a > delegation. > > The problem you saw was really caused by a delegation being given out > while the rename was still happening. > i.e.: > - the rename starts > - the delegation is detected and broken > - the cb_recall is sent. > - the client opens the file prior to returning the delegation > - the client gets a new delegation as part of this open > - the client returns the original delegation > - the rename loops around and finds a new delegation which it needs > to break. > > The should only loop once unless the recall takes more than 30 seconds. > So I'm a bit perplexed that it blocked lock enough to be noticed. So > maybe there is more going on here than I can see. Or maybe the recall > is really slow. When the server's local rename process calls __break_lease(), it only calls fl_lmpops->lm_break() once and sets FL_UNLOCK_PENDING. After that it will sleep and wake to check, but never again call ->lm_break() (which will cause knfsd to recall the delegation). The check for leases_conflict() is not stateful. Ben
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c index 4313addbe756..6f18c1a7af2e 100644 --- a/fs/nfsd/nfs4state.c +++ b/fs/nfsd/nfs4state.c @@ -1078,7 +1078,8 @@ static void nfs4_free_deleg(struct nfs4_stid *stid) * When a delegation is recalled, the filehandle is stored in the "new" * filter. * Every 30 seconds we swap the filters and clear the "new" one, - * unless both are empty of course. + * unless both are empty of course. This results in delegations for a + * given filehandle being blocked for between 30 and 60 seconds. * * Each filter is 256 bits. We hash the filehandle to 32bit and use the * low 3 bytes as hash-table indices. @@ -1107,9 +1108,9 @@ static int delegation_blocked(struct knfsd_fh *fh) if (ktime_get_seconds() - bd->swap_time > 30) { bd->entries -= bd->old_entries; bd->old_entries = bd->entries; + bd->new = 1-bd->new; memset(bd->set[bd->new], 0, sizeof(bd->set[0])); - bd->new = 1-bd->new; bd->swap_time = ktime_get_seconds(); } spin_unlock(&blocked_delegations_lock);
The pair of bloom filtered used by delegation_blocked() was intended to block delegations on given filehandles for between 30 and 60 seconds. A new filehandle would be recorded in the "new" bit set. That would then be switch to the "old" bit set between 0 and 30 seconds later, and it would remain as the "old" bit set for 30 seconds. Unfortunately the code intended to clear the old bit set once it reached 30 seconds old, preparing it to be the next new bit set, instead cleared the *new* bit set before switching it to be the old bit set. This means that the "old" bit set is always empty and delegations are blocked between 0 and 30 seconds. This patch updates bd->new before clearing the set with that index, instead of afterwards. Reported-by: Olga Kornievskaia <okorniev@redhat.com> Cc: stable@vger.kernel.org Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.") Signed-off-by: NeilBrown <neilb@suse.de> --- fs/nfsd/nfs4state.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)