[REPOST,v4,2/5] kernfs: use VFS negative dentry caching

Message ID	162218364554.34379.636306635794792903.stgit@web.messagingengine.com (mailing list archive)
State	New, archived
Headers
Return-Path: <linux-fsdevel-owner@kernel.org>
Subject: [REPOST PATCH v4 2/5] kernfs: use VFS negative dentry caching
From: Ian Kent <raven@themaw.net>
To: Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Tejun Heo <tj@kernel.org>
Cc: Eric Sandeen <sandeen@sandeen.net>, Fox Chen <foxhlchen@gmail.com>, Brice Goglin <brice.goglin@gmail.com>, Al Viro <viro@ZenIV.linux.org.uk>, Rick Lindsley <ricklind@linux.vnet.ibm.com>, David Howells <dhowells@redhat.com>, Miklos Szeredi <miklos@szeredi.hu>, Marcelo Tosatti <mtosatti@redhat.com>, linux-fsdevel <linux-fsdevel@vger.kernel.org>, Kernel Mailing List <linux-kernel@vger.kernel.org>
Date: Fri, 28 May 2021 14:34:05 +0800
Message-ID: <162218364554.34379.636306635794792903.stgit@web.messagingengine.com>
In-Reply-To: <162218354775.34379.5629941272050849549.stgit@web.messagingengine.com>
References: <162218354775.34379.5629941272050849549.stgit@web.messagingengine.com>
User-Agent: StGit/0.23
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8BIT
Precedence: bulk
Series	kernfs: proposed locking and concurrency improvement
[REPOST,v4,0/5] kernfs: proposed locking and concurrency improvement
[REPOST,v4,1/5] kernfs: move revalidate to be near lookup
[REPOST,v4,2/5] kernfs: use VFS negative dentry caching
[REPOST,v4,3/5] kernfs: switch kernfs to use an rwsem
[REPOST,v4,4/5] kernfs: use i_lock to protect concurrent inode updates
[REPOST,v4,5/5] kernfs: add kernfs_need_inode_refresh()

Ian Kent May 28, 2021, 6:34 a.m. UTC

If there are many lookups for non-existent paths these negative lookups
can lead to a lot of overhead during path walks.

The VFS allows dentries to be created as negative and hashed, and caches
them so they can be used to reduce the fairly high overhead alloc/free
cycle that occurs during these lookups.

Signed-off-by: Ian Kent <raven@themaw.net>
---
 fs/kernfs/dir.c |   55 +++++++++++++++++++++++++++++++++----------------------
 1 file changed, 33 insertions(+), 22 deletions(-)

Miklos Szeredi June 1, 2021, 12:41 p.m. UTC | #1

On Fri, 28 May 2021 at 08:34, Ian Kent <raven@themaw.net> wrote:
>
> If there are many lookups for non-existent paths these negative lookups
> can lead to a lot of overhead during path walks.
>
> The VFS allows dentries to be created as negative and hashed, and caches
> them so they can be used to reduce the fairly high overhead alloc/free
> cycle that occurs during these lookups.

Obviously there's a cost associated with negative caching too.  For
normal filesystems it's trivially worth that cost, but in case of
kernfs, not sure...

Can "fairly high" be somewhat substantiated with a microbenchmark for
negative lookups?

More comments inline.

>
> Signed-off-by: Ian Kent <raven@themaw.net>
> ---
>  fs/kernfs/dir.c |   55 +++++++++++++++++++++++++++++++++----------------------
>  1 file changed, 33 insertions(+), 22 deletions(-)
>
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 4c69e2af82dac..5151c712f06f5 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1037,12 +1037,33 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
>         if (flags & LOOKUP_RCU)
>                 return -ECHILD;
>
> -       /* Always perform fresh lookup for negatives */
> -       if (d_really_is_negative(dentry))
> -               goto out_bad_unlocked;
> +       mutex_lock(&kernfs_mutex);
>
>         kn = kernfs_dentry_node(dentry);
> -       mutex_lock(&kernfs_mutex);
> +
> +       /* Negative hashed dentry? */
> +       if (!kn) {
> +               struct kernfs_node *parent;
> +
> +               /* If the kernfs node can be found this is a stale negative
> +                * hashed dentry so it must be discarded and the lookup redone.
> +                */
> +               parent = kernfs_dentry_node(dentry->d_parent);

This doesn't look safe WRT a racing sys_rename().  In this case
d_move() is called only with parent inode locked, but not with
kernfs_mutex while ->d_revalidate() may not have parent inode locked.
After d_move() the old parent dentry can be freed, resulting in use
after free.  Easily fixed by dget_parent().

> +               if (parent) {
> +                       const void *ns = NULL;
> +
> +                       if (kernfs_ns_enabled(parent))
> +                               ns = kernfs_info(dentry->d_sb)->ns;
> +                       kn = kernfs_find_ns(parent, dentry->d_name.name, ns);

Same thing with d_name.  There's
take_dentry_name_snapshot()/release_dentry_name_snapshot() to properly
take care of that.


> +                       if (kn)
> +                               goto out_bad;
> +               }
> +
> +               /* The kernfs node doesn't exist, leave the dentry negative
> +                * and return success.
> +                */
> +               goto out;
> +       }
>
>         /* The kernfs node has been deactivated */
>         if (!kernfs_active_read(kn))
> @@ -1060,12 +1081,11 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
>         if (kn->parent && kernfs_ns_enabled(kn->parent) &&
>             kernfs_info(dentry->d_sb)->ns != kn->ns)
>                 goto out_bad;
> -
> +out:
>         mutex_unlock(&kernfs_mutex);
>         return 1;
>  out_bad:
>         mutex_unlock(&kernfs_mutex);
> -out_bad_unlocked:
>         return 0;
>  }
>
> @@ -1080,33 +1100,24 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
>         struct dentry *ret;
>         struct kernfs_node *parent = dir->i_private;
>         struct kernfs_node *kn;
> -       struct inode *inode;
> +       struct inode *inode = NULL;
>         const void *ns = NULL;
>
>         mutex_lock(&kernfs_mutex);
> -
>         if (kernfs_ns_enabled(parent))
>                 ns = kernfs_info(dir->i_sb)->ns;
>
>         kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
> -
> -       /* no such entry */
> -       if (!kn || !kernfs_active(kn)) {
> -               ret = NULL;
> -               goto out_unlock;
> -       }
> -
>         /* attach dentry and inode */
> -       inode = kernfs_get_inode(dir->i_sb, kn);
> -       if (!inode) {
> -               ret = ERR_PTR(-ENOMEM);
> -               goto out_unlock;
> +       if (kn && kernfs_active(kn)) {
> +               inode = kernfs_get_inode(dir->i_sb, kn);
> +               if (!inode)
> +                       inode = ERR_PTR(-ENOMEM);
>         }
> -
> -       /* instantiate and hash dentry */
> +       /* instantiate and hash (possibly negative) dentry */
>         ret = d_splice_alias(inode, dentry);
> - out_unlock:
>         mutex_unlock(&kernfs_mutex);
> +
>         return ret;
>  }
>
>
>

Ian Kent June 2, 2021, 3:44 a.m. UTC | #2

On Tue, 2021-06-01 at 14:41 +0200, Miklos Szeredi wrote:
> On Fri, 28 May 2021 at 08:34, Ian Kent <raven@themaw.net> wrote:
> > 
> > If there are many lookups for non-existent paths these negative
> > lookups
> > can lead to a lot of overhead during path walks.
> > 
> > The VFS allows dentries to be created as negative and hashed, and
> > caches
> > them so they can be used to reduce the fairly high overhead
> > alloc/free
> > cycle that occurs during these lookups.
> 
> Obviously there's a cost associated with negative caching too.  For
> normal filesystems it's trivially worth that cost, but in case of
> kernfs, not sure...
> 
> Can "fairly high" be somewhat substantiated with a microbenchmark for
> negative lookups?

Well, maybe, but anything we do for a benchmark would be totally
artificial.

The reason I added this is that I saw appreciable contention
on the dentry alloc path in one case. It was a while ago
now but IIRC it was systemd coldplug using at least one path
that didn't exist. I thought that this was done because of some
special case requirement so VFS negative dentry caching was a
sensible countermeasure. I guess there could be lookups for
non-existent paths from other than deterministic programmatic
sources but I still felt it was a sensible thing to do.

> 
> More comments inline.
> 
> > 
> > Signed-off-by: Ian Kent <raven@themaw.net>
> > ---
> >  fs/kernfs/dir.c |   55 +++++++++++++++++++++++++++++++++----------
> > ------------
> >  1 file changed, 33 insertions(+), 22 deletions(-)
> > 
> > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > index 4c69e2af82dac..5151c712f06f5 100644
> > --- a/fs/kernfs/dir.c
> > +++ b/fs/kernfs/dir.c
> > @@ -1037,12 +1037,33 @@ static int kernfs_dop_revalidate(struct
> > dentry *dentry, unsigned int flags)
> >         if (flags & LOOKUP_RCU)
> >                 return -ECHILD;
> > 
> > -       /* Always perform fresh lookup for negatives */
> > -       if (d_really_is_negative(dentry))
> > -               goto out_bad_unlocked;
> > +       mutex_lock(&kernfs_mutex);
> > 
> >         kn = kernfs_dentry_node(dentry);
> > -       mutex_lock(&kernfs_mutex);
> > +
> > +       /* Negative hashed dentry? */
> > +       if (!kn) {
> > +               struct kernfs_node *parent;
> > +
> > +               /* If the kernfs node can be found this is a stale
> > negative
> > +                * hashed dentry so it must be discarded and the
> > lookup redone.
> > +                */
> > +               parent = kernfs_dentry_node(dentry->d_parent);
> 
> This doesn't look safe WRT a racing sys_rename().  In this case
> d_move() is called only with parent inode locked, but not with
> kernfs_mutex while ->d_revalidate() may not have parent inode locked.
> After d_move() the old parent dentry can be freed, resulting in use
> after free.  Easily fixed by dget_parent().

Umm ... I'll need some more explanation here ... 

We are in ref-walk mode so the parent dentry isn't going away.
And this is a negative dentry so rename is going to bail out
with ENOENT way early.

Are you talking about a racing parent rename requiring a
READ_ONCE() and dget_parent() being the safest way to do
that?

> 
> > +               if (parent) {
> > +                       const void *ns = NULL;
> > +
> > +                       if (kernfs_ns_enabled(parent))
> > +                               ns = kernfs_info(dentry->d_sb)->ns;
> > +                       kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
> 
> Same thing with d_name.  There's
> take_dentry_name_snapshot()/release_dentry_name_snapshot() to
> properly
> take care of that.

I don't see that problem either, due to the dentry being negative,
but please explain what you're seeing here.

> 
> 
> > +                       if (kn)
> > +                               goto out_bad;
> > +               }
> > +
> > +               /* The kernfs node doesn't exist, leave the dentry
> > negative
> > +                * and return success.
> > +                */
> > +               goto out;
> > +       }
> > 
> >         /* The kernfs node has been deactivated */
> >         if (!kernfs_active_read(kn))
> > @@ -1060,12 +1081,11 @@ static int kernfs_dop_revalidate(struct
> > dentry *dentry, unsigned int flags)
> >         if (kn->parent && kernfs_ns_enabled(kn->parent) &&
> >             kernfs_info(dentry->d_sb)->ns != kn->ns)
> >                 goto out_bad;
> > -
> > +out:
> >         mutex_unlock(&kernfs_mutex);
> >         return 1;
> >  out_bad:
> >         mutex_unlock(&kernfs_mutex);
> > -out_bad_unlocked:
> >         return 0;
> >  }
> > 
> > @@ -1080,33 +1100,24 @@ static struct dentry
> > *kernfs_iop_lookup(struct inode *dir,
> >         struct dentry *ret;
> >         struct kernfs_node *parent = dir->i_private;
> >         struct kernfs_node *kn;
> > -       struct inode *inode;
> > +       struct inode *inode = NULL;
> >         const void *ns = NULL;
> > 
> >         mutex_lock(&kernfs_mutex);
> > -
> >         if (kernfs_ns_enabled(parent))
> >                 ns = kernfs_info(dir->i_sb)->ns;
> > 
> >         kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
> > -
> > -       /* no such entry */
> > -       if (!kn || !kernfs_active(kn)) {
> > -               ret = NULL;
> > -               goto out_unlock;
> > -       }
> > -
> >         /* attach dentry and inode */
> > -       inode = kernfs_get_inode(dir->i_sb, kn);
> > -       if (!inode) {
> > -               ret = ERR_PTR(-ENOMEM);
> > -               goto out_unlock;
> > +       if (kn && kernfs_active(kn)) {
> > +               inode = kernfs_get_inode(dir->i_sb, kn);
> > +               if (!inode)
> > +                       inode = ERR_PTR(-ENOMEM);
> >         }
> > -
> > -       /* instantiate and hash dentry */
> > +       /* instantiate and hash (possibly negative) dentry */
> >         ret = d_splice_alias(inode, dentry);
> > - out_unlock:
> >         mutex_unlock(&kernfs_mutex);
> > +
> >         return ret;
> >  }
> > 
> > 
> >

Miklos Szeredi June 2, 2021, 8:58 a.m. UTC | #3

On Wed, 2 Jun 2021 at 05:44, Ian Kent <raven@themaw.net> wrote:
>
> On Tue, 2021-06-01 at 14:41 +0200, Miklos Szeredi wrote:
> > On Fri, 28 May 2021 at 08:34, Ian Kent <raven@themaw.net> wrote:
> > >
> > > If there are many lookups for non-existent paths these negative
> > > lookups
> > > can lead to a lot of overhead during path walks.
> > >
> > > The VFS allows dentries to be created as negative and hashed, and
> > > caches
> > > them so they can be used to reduce the fairly high overhead
> > > alloc/free
> > > cycle that occurs during these lookups.
> >
> > Obviously there's a cost associated with negative caching too.  For
> > normal filesystems it's trivially worth that cost, but in case of
> > kernfs, not sure...
> >
> > Can "fairly high" be somewhat substantiated with a microbenchmark for
> > negative lookups?
>
> Well, maybe, but anything we do for a benchmark would be totally
> artificial.
>
> The reason I added this is that I saw appreciable contention
> on the dentry alloc path in one case.

If multiple tasks are trying to look up the same negative dentry in
parallel, then there will be contention on the parent inode lock.
Was this the issue?   This could easily be reproduced with an
artificial benchmark.
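
For illustration, the parallel negative lookup case described above could
be exercised from userspace with something like the sketch below. It is
not part of this series and is not Fox Chen's reproducer; the function
names and the choice of fork() plus stat() are assumptions. It forks
several tasks that all stat() the same non-existent path and reports the
elapsed wall-clock time.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Child body (hypothetical helper): stat() the path repeatedly.
 * Returns 0 only if every call failed with ENOENT.
 */
static int lookup_loop(const char *path, long iterations)
{
	struct stat st;
	long i;

	for (i = 0; i < iterations; i++)
		if (stat(path, &st) == 0 || errno != ENOENT)
			return 1;
	return 0;
}

/*
 * Fork ntasks children that all look up the same path in parallel.
 * Returns the number of children that saw nothing but ENOENT, or -1 if
 * not every child could be started. Elapsed seconds go to *elapsed.
 */
int run_negative_lookups(const char *path, int ntasks, long iterations,
			 double *elapsed)
{
	struct timespec t0, t1;
	int i, status, started = 0, clean = 0;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ntasks; i++) {
		pid_t pid = fork();

		if (pid < 0)
			break;
		if (pid == 0)
			_exit(lookup_loop(path, iterations));
		started++;
	}
	/* Reap every child that was started, even on partial failure. */
	for (i = 0; i < started; i++)
		if (wait(&status) > 0 && WIFEXITED(status) &&
		    WEXITSTATUS(status) == 0)
			clean++;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	*elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	return started == ntasks ? clean : -1;
}
```

Calling run_negative_lookups() with a non-existent name under /sys (a
made-up one such as /sys/kernel/no-such-entry), several tasks and a large
iteration count, on kernels with and without the patch, would give a rough
comparison. It times the whole lookup path, so on its own it cannot
attribute a difference to the dentry alloc/free cycle.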

> > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > > index 4c69e2af82dac..5151c712f06f5 100644
> > > --- a/fs/kernfs/dir.c
> > > +++ b/fs/kernfs/dir.c
> > > @@ -1037,12 +1037,33 @@ static int kernfs_dop_revalidate(struct
> > > dentry *dentry, unsigned int flags)
> > >         if (flags & LOOKUP_RCU)
> > >                 return -ECHILD;
> > >
> > > -       /* Always perform fresh lookup for negatives */
> > > -       if (d_really_is_negative(dentry))
> > > -               goto out_bad_unlocked;
> > > +       mutex_lock(&kernfs_mutex);
> > >
> > >         kn = kernfs_dentry_node(dentry);
> > > -       mutex_lock(&kernfs_mutex);
> > > +
> > > +       /* Negative hashed dentry? */
> > > +       if (!kn) {
> > > +               struct kernfs_node *parent;
> > > +
> > > +               /* If the kernfs node can be found this is a stale
> > > negative
> > > +                * hashed dentry so it must be discarded and the
> > > lookup redone.
> > > +                */
> > > +               parent = kernfs_dentry_node(dentry->d_parent);
> >
> > This doesn't look safe WRT a racing sys_rename().  In this case
> > d_move() is called only with parent inode locked, but not with
> > kernfs_mutex while ->d_revalidate() may not have parent inode locked.
> > After d_move() the old parent dentry can be freed, resulting in use
> > after free.  Easily fixed by dget_parent().
>
> Umm ... I'll need some more explanation here ...
>
> We are in ref-walk mode so the parent dentry isn't going away.

The parent that was used to lookup the dentry in __d_lookup() isn't
going away.  But it's not necessarily equal to dentry->d_parent
anymore.

> And this is a negative dentry so rename is going to bail out
> with ENOENT way early.

You are right.  But note that negative dentry in question could be the
target of a rename.  Current implementation doesn't switch the
target's parent or name, but this wasn't always the case (commit
076515fc9267 ("make non-exchanging __d_move() copy ->d_parent rather
than swap them")), so a backport of this patch could become incorrect
on old enough kernels.

So I still think using dget_parent() is the correct way to do this.
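
The pattern being suggested can be modelled outside the kernel. The sketch
below is a userspace model only: struct node, node_get_parent(),
node_put() and node_move() are hypothetical names, and a mutex plus an
atomic counter stand in for the dcache locking. It shows why pinning the
parent while holding the child's lock keeps that parent alive across a
concurrent move, whereas a bare read of the parent pointer would not.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model of a tree node; not a kernel structure. */
struct node {
	pthread_mutex_t lock;	/* stabilises ->parent in this model */
	struct node *parent;
	atomic_int refcount;	/* one count per holder, including children */
};

/* Drop one reference; a real implementation would free at zero. */
void node_put(struct node *n)
{
	int old = atomic_fetch_sub(&n->refcount, 1);

	assert(old > 0);
}

/*
 * Return the current parent with an extra reference held. The reference
 * is taken under the child's lock, so a racing node_move() cannot drop
 * the last reference between the pointer read and the increment.
 */
struct node *node_get_parent(struct node *child)
{
	struct node *parent;

	pthread_mutex_lock(&child->lock);
	parent = child->parent;
	if (parent)
		atomic_fetch_add(&parent->refcount, 1);
	pthread_mutex_unlock(&child->lock);
	return parent;
}

/* Re-parent the child, dropping the child's reference on the old parent. */
void node_move(struct node *child, struct node *new_parent)
{
	struct node *old;

	pthread_mutex_lock(&child->lock);
	old = child->parent;
	atomic_fetch_add(&new_parent->refcount, 1);
	child->parent = new_parent;
	pthread_mutex_unlock(&child->lock);
	if (old)
		node_put(old);
}
```

A caller that obtained the parent through node_get_parent() still holds a
valid reference after node_move() has switched the child elsewhere, and
releases it with node_put() when done. The trade-off is an extra
reference get and put on every revalidate that takes this path.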

> >
> > > +               if (parent) {
> > > +                       const void *ns = NULL;
> > > +
> > > +                       if (kernfs_ns_enabled(parent))
> > > +                               ns = kernfs_info(dentry->d_sb)->ns;
> > > +                       kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
> >
> > Same thing with d_name.  There's
> > take_dentry_name_snapshot()/release_dentry_name_snapshot() to
> > properly
> > take care of that.
>
> I don't see that problem either, due to the dentry being negative,
> > but please explain what you're seeing here.

Yeah.  Negative dentries' names weren't always stable, but that was a
long time ago (commit 8d85b4845a66 ("Allow sharing external names
after __d_move()")).

Thanks,
Miklos

Ian Kent June 2, 2021, 10:57 a.m. UTC | #4

On Wed, 2021-06-02 at 10:58 +0200, Miklos Szeredi wrote:
> On Wed, 2 Jun 2021 at 05:44, Ian Kent <raven@themaw.net> wrote:
> > 
> > On Tue, 2021-06-01 at 14:41 +0200, Miklos Szeredi wrote:
> > > On Fri, 28 May 2021 at 08:34, Ian Kent <raven@themaw.net> wrote:
> > > > 
> > > > If there are many lookups for non-existent paths these negative
> > > > lookups
> > > > can lead to a lot of overhead during path walks.
> > > > 
> > > > The VFS allows dentries to be created as negative and hashed,
> > > > and
> > > > caches
> > > > them so they can be used to reduce the fairly high overhead
> > > > alloc/free
> > > > cycle that occurs during these lookups.
> > > 
> > > Obviously there's a cost associated with negative caching too. 
> > > For
> > > normal filesystems it's trivially worth that cost, but in case of
> > > kernfs, not sure...
> > > 
> > > Can "fairly high" be somewhat substantiated with a microbenchmark
> > > for
> > > negative lookups?
> > 
> > Well, maybe, but anything we do for a benchmark would be totally
> > artificial.
> > 
> > > The reason I added this is that I saw appreciable contention
> > > on the dentry alloc path in one case.
> 
> If multiple tasks are trying to look up the same negative dentry in
> parallel, then there will be contention on the parent inode lock.
> Was this the issue?   This could easily be reproduced with an
> artificial benchmark.

Not that I remember, I'll need to dig up the sysrq dumps to have a
look and get back to you.

> 
> > > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > > > index 4c69e2af82dac..5151c712f06f5 100644
> > > > --- a/fs/kernfs/dir.c
> > > > +++ b/fs/kernfs/dir.c
> > > > @@ -1037,12 +1037,33 @@ static int kernfs_dop_revalidate(struct
> > > > dentry *dentry, unsigned int flags)
> > > >         if (flags & LOOKUP_RCU)
> > > >                 return -ECHILD;
> > > > 
> > > > -       /* Always perform fresh lookup for negatives */
> > > > -       if (d_really_is_negative(dentry))
> > > > -               goto out_bad_unlocked;
> > > > +       mutex_lock(&kernfs_mutex);
> > > > 
> > > >         kn = kernfs_dentry_node(dentry);
> > > > -       mutex_lock(&kernfs_mutex);
> > > > +
> > > > +       /* Negative hashed dentry? */
> > > > +       if (!kn) {
> > > > +               struct kernfs_node *parent;
> > > > +
> > > > +               /* If the kernfs node can be found this is a
> > > > stale
> > > > negative
> > > > +                * hashed dentry so it must be discarded and
> > > > the
> > > > lookup redone.
> > > > +                */
> > > > +               parent = kernfs_dentry_node(dentry->d_parent);
> > > 
> > > This doesn't look safe WRT a racing sys_rename().  In this case
> > > d_move() is called only with parent inode locked, but not with
> > > kernfs_mutex while ->d_revalidate() may not have parent inode
> > > locked.
> > > After d_move() the old parent dentry can be freed, resulting in
> > > use
> > > after free.  Easily fixed by dget_parent().
> > 
> > Umm ... I'll need some more explanation here ...
> > 
> > We are in ref-walk mode so the parent dentry isn't going away.
> 
> The parent that was used to lookup the dentry in __d_lookup() isn't
> going away.  But it's not necessarily equal to dentry->d_parent
> anymore.
> 
> > And this is a negative dentry so rename is going to bail out
> > with ENOENT way early.
> 
> You are right.  But note that negative dentry in question could be
> the
> target of a rename.  Current implementation doesn't switch the
> target's parent or name, but this wasn't always the case (commit
> 076515fc9267 ("make non-exchanging __d_move() copy ->d_parent rather
> than swap them")), so a backport of this patch could become incorrect
> on old enough kernels.

Right, that __lookup_hash() will find the negative target.

> 
> So I still think using dget_parent() is the correct way to do this.

The rename code does my head in, ;)

The dget_parent() would ensure we had an up to date parent so
yes, that would be the right thing to do regardless.

But now I'm not sure that will be sufficient for kernfs. I'm still
thinking about it.

I'm wondering if there's a missing check in there to account for
what happens with revalidate after ->rename() but before move.
There's already a kernfs node check in there so it's probably ok
...
 
> 
> > > 
> > > > +               if (parent) {
> > > > +                       const void *ns = NULL;
> > > > +
> > > > +                       if (kernfs_ns_enabled(parent))
> > > > +                               ns = kernfs_info(dentry->d_sb)->ns;
> > > > +                       kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
> > > 
> > > Same thing with d_name.  There's
> > > take_dentry_name_snapshot()/release_dentry_name_snapshot() to
> > > properly
> > > take care of that.
> > 
> > I don't see that problem either, due to the dentry being negative,
> > > but please explain what you're seeing here.
> 
> Yeah.  Negative dentries' names weren't always stable, but that was a
> long time ago (commit 8d85b4845a66 ("Allow sharing external names
> after __d_move()")).

Right, I'll make that change too.

> 
> Thanks,
> Miklos

Ian Kent June 3, 2021, 2:15 a.m. UTC | #5

On Wed, 2021-06-02 at 18:57 +0800, Ian Kent wrote:
> On Wed, 2021-06-02 at 10:58 +0200, Miklos Szeredi wrote:
> > On Wed, 2 Jun 2021 at 05:44, Ian Kent <raven@themaw.net> wrote:
> > > 
> > > On Tue, 2021-06-01 at 14:41 +0200, Miklos Szeredi wrote:
> > > > On Fri, 28 May 2021 at 08:34, Ian Kent <raven@themaw.net>
> > > > wrote:
> > > > > 
> > > > > If there are many lookups for non-existent paths these
> > > > > negative
> > > > > lookups
> > > > > can lead to a lot of overhead during path walks.
> > > > > 
> > > > > The VFS allows dentries to be created as negative and hashed,
> > > > > and
> > > > > caches
> > > > > them so they can be used to reduce the fairly high overhead
> > > > > alloc/free
> > > > > cycle that occurs during these lookups.
> > > > 
> > > > Obviously there's a cost associated with negative caching too. 
> > > > For
> > > > normal filesystems it's trivially worth that cost, but in case
> > > > of
> > > > kernfs, not sure...
> > > > 
> > > > Can "fairly high" be somewhat substantiated with a
> > > > microbenchmark
> > > > for
> > > > negative lookups?
> > > 
> > > Well, maybe, but anything we do for a benchmark would be totally
> > > artificial.
> > > 
> > > > The reason I added this is that I saw appreciable contention
> > > > on the dentry alloc path in one case.
> > 
> > If multiple tasks are trying to look up the same negative dentry in
> > parallel, then there will be contention on the parent inode lock.
> > Was this the issue?   This could easily be reproduced with an
> > artificial benchmark.
> 
> Not that I remember, I'll need to dig up the sysrq dumps to have a
> look and get back to you.

After doing that though I could grab Fox Chen's reproducer and give
it varying sysfs paths as well as some percentage of non-existent
sysfs paths and see what I get ...

That should give it a more realistic usage profile and, if I can
get the percentage of non-existent paths right, demonstrate that
case as well ... but nothing is easy, so we'll have to wait and
see, ;)

> 
> > 
> > > > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > > > > index 4c69e2af82dac..5151c712f06f5 100644
> > > > > --- a/fs/kernfs/dir.c
> > > > > +++ b/fs/kernfs/dir.c
> > > > > @@ -1037,12 +1037,33 @@ static int
> > > > > kernfs_dop_revalidate(struct
> > > > > dentry *dentry, unsigned int flags)
> > > > >         if (flags & LOOKUP_RCU)
> > > > >                 return -ECHILD;
> > > > > 
> > > > > -       /* Always perform fresh lookup for negatives */
> > > > > -       if (d_really_is_negative(dentry))
> > > > > -               goto out_bad_unlocked;
> > > > > +       mutex_lock(&kernfs_mutex);
> > > > > 
> > > > >         kn = kernfs_dentry_node(dentry);
> > > > > -       mutex_lock(&kernfs_mutex);
> > > > > +
> > > > > +       /* Negative hashed dentry? */
> > > > > +       if (!kn) {
> > > > > +               struct kernfs_node *parent;
> > > > > +
> > > > > +               /* If the kernfs node can be found this is a
> > > > > stale
> > > > > negative
> > > > > +                * hashed dentry so it must be discarded and
> > > > > the
> > > > > lookup redone.
> > > > > +                */
> > > > > +               parent = kernfs_dentry_node(dentry->d_parent);
> > > > 
> > > > This doesn't look safe WRT a racing sys_rename().  In this case
> > > > d_move() is called only with parent inode locked, but not with
> > > > kernfs_mutex while ->d_revalidate() may not have parent inode
> > > > locked.
> > > > After d_move() the old parent dentry can be freed, resulting in
> > > > use
> > > > after free.  Easily fixed by dget_parent().
> > > 
> > > Umm ... I'll need some more explanation here ...
> > > 
> > > We are in ref-walk mode so the parent dentry isn't going away.
> > 
> > The parent that was used to lookup the dentry in __d_lookup() isn't
> > going away.  But it's not necessarily equal to dentry->d_parent
> > anymore.
> > 
> > > And this is a negative dentry so rename is going to bail out
> > > with ENOENT way early.
> > 
> > You are right.  But note that negative dentry in question could be
> > the
> > target of a rename.  Current implementation doesn't switch the
> > target's parent or name, but this wasn't always the case (commit
> > 076515fc9267 ("make non-exchanging __d_move() copy ->d_parent
> > rather
> > than swap them")), so a backport of this patch could become
> > incorrect
> > on old enough kernels.
> 
> Right, that __lookup_hash() will find the negative target.
> 
> > 
> > So I still think using dget_parent() is the correct way to do this.
> 
> The rename code does my head in, ;)
> 
> The dget_parent() would ensure we had an up to date parent so
> yes, that would be the right thing to do regardless.
> 
> But now I'm not sure that will be sufficient for kernfs. I'm still
> thinking about it.
> 
> I'm wondering if there's a missing check in there to account for
> what happens with revalidate after ->rename() but before move.
> There's already a kernfs node check in there so it's probably ok
> ...
>  
> > 
> > > > 
> > > > > +               if (parent) {
> > > > > +                       const void *ns = NULL;
> > > > > +
> > > > > +                       if (kernfs_ns_enabled(parent))
> > > > > +                               ns = kernfs_info(dentry->d_sb)->ns;
> > > > > +                       kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
> > > > 
> > > > Same thing with d_name.  There's
> > > > take_dentry_name_snapshot()/release_dentry_name_snapshot() to
> > > > properly
> > > > take care of that.
> > > 
> > > I don't see that problem either, due to the dentry being
> > > negative,
> > > > but please explain what you're seeing here.
> > 
> > Yeah.  Negative dentries' names weren't always stable, but that was
> > a
> > long time ago (commit 8d85b4845a66 ("Allow sharing external names
> > after __d_move()")).
> 
> Right, I'll make that change too.
> 
> > 
> > Thanks,
> > Miklos
>

Eric W. Biederman June 3, 2021, 5:26 p.m. UTC | #6

Ian Kent <raven@themaw.net> writes:

> If there are many lookups for non-existent paths these negative lookups
> can lead to a lot of overhead during path walks.
>
> The VFS allows dentries to be created as negative and hashed, and caches
> them so they can be used to reduce the fairly high overhead alloc/free
> cycle that occurs during these lookups.
>
> Signed-off-by: Ian Kent <raven@themaw.net>
> ---
>  fs/kernfs/dir.c |   55 +++++++++++++++++++++++++++++++++----------------------
>  1 file changed, 33 insertions(+), 22 deletions(-)
>
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 4c69e2af82dac..5151c712f06f5 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1037,12 +1037,33 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
>  	if (flags & LOOKUP_RCU)
>  		return -ECHILD;
>  
> -	/* Always perform fresh lookup for negatives */
> -	if (d_really_is_negative(dentry))
> -		goto out_bad_unlocked;
> +	mutex_lock(&kernfs_mutex);
>  
>  	kn = kernfs_dentry_node(dentry);
> -	mutex_lock(&kernfs_mutex);

Why bring kernfs_dentry_node inside the mutex?

The inode lock of the parent should protect negative to positive
transitions not the kernfs_mutex.  So moving the code inside
the mutex looks unnecessary and confusing.

What NFS does is to check to see if the parent has been modified
since the negative dentry was created, can't kernfs do the same
and remove the need for taking the lock until the lookup that
makes the dentry positive?

Doing the lookup twice seems strange.

Perhaps this should happen as two changes.  One change to enable
negative dentries and a second change to optimize d_revalidate
of negative dentries.  That way the issues could be clearly separated
and looked at separately.
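
The idea of checking whether the parent has been modified since the
negative dentry was created can be modelled in userspace as a revision
counter on the directory. This is only a sketch under assumptions:
struct dir_model, struct neg_entry and the three helpers are hypothetical
names, not kernfs or NFS code, and the locking or memory ordering a real
implementation would need is left out.

```c
#include <stdbool.h>

/* Hypothetical model of a directory; not a kernel structure. */
struct dir_model {
	unsigned long rev;	/* bumped on every add, remove or rename */
};

/* Hypothetical model of a cached "name not found" result. */
struct neg_entry {
	const struct dir_model *parent;
	unsigned long seen_rev;	/* parent->rev when the miss was cached */
};

/* Record that a lookup in 'parent' missed at the current revision. */
void neg_entry_init(struct neg_entry *ne, const struct dir_model *parent)
{
	ne->parent = parent;
	ne->seen_rev = parent->rev;
}

/* Any change to the directory invalidates every cached miss under it. */
void dir_changed(struct dir_model *dir)
{
	dir->rev++;
}

/* True while the cached miss can still be trusted without a new lookup. */
bool neg_entry_valid(const struct neg_entry *ne)
{
	return ne->seen_rev == ne->parent->rev;
}
```

With this scheme revalidation is a single comparison and needs no second
name search. The cost is that it is conservative: an unrelated change in
the same directory also discards the cached miss and forces a fresh
lookup.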

> +
> +	/* Negative hashed dentry? */
> +	if (!kn) {
> +		struct kernfs_node *parent;
> +
> +		/* If the kernfs node can be found this is a stale negative
> +		 * hashed dentry so it must be discarded and the lookup redone.
> +		 */
> +		parent = kernfs_dentry_node(dentry->d_parent);
> +		if (parent) {
> +			const void *ns = NULL;
> +
> +			if (kernfs_ns_enabled(parent))
> +				ns = kernfs_info(dentry->d_sb)->ns;
> +			kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
> +			if (kn)
> +				goto out_bad;
> +		}
> +
> +		/* The kernfs node doesn't exist, leave the dentry negative
> +		 * and return success.
> +		 */
> +		goto out;
> +	}
>  
>  	/* The kernfs node has been deactivated */
>  	if (!kernfs_active_read(kn))
> @@ -1060,12 +1081,11 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
>  	if (kn->parent && kernfs_ns_enabled(kn->parent) &&
>  	    kernfs_info(dentry->d_sb)->ns != kn->ns)
>  		goto out_bad;
> -
> +out:
>  	mutex_unlock(&kernfs_mutex);
>  	return 1;
>  out_bad:
>  	mutex_unlock(&kernfs_mutex);
> -out_bad_unlocked:
>  	return 0;
>  }
>  
> @@ -1080,33 +1100,24 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
>  	struct dentry *ret;
>  	struct kernfs_node *parent = dir->i_private;
>  	struct kernfs_node *kn;
> -	struct inode *inode;
> +	struct inode *inode = NULL;
>  	const void *ns = NULL;
>  
>  	mutex_lock(&kernfs_mutex);
> -
>  	if (kernfs_ns_enabled(parent))
>  		ns = kernfs_info(dir->i_sb)->ns;
>  
>  	kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
> -
> -	/* no such entry */
> -	if (!kn || !kernfs_active(kn)) {
> -		ret = NULL;
> -		goto out_unlock;
> -	}
> -
>  	/* attach dentry and inode */
> -	inode = kernfs_get_inode(dir->i_sb, kn);
> -	if (!inode) {
> -		ret = ERR_PTR(-ENOMEM);
> -		goto out_unlock;
> +	if (kn && kernfs_active(kn)) {
> +		inode = kernfs_get_inode(dir->i_sb, kn);
> +		if (!inode)
> +			inode = ERR_PTR(-ENOMEM);
>  	}
> -
> -	/* instantiate and hash dentry */
> +	/* instantiate and hash (possibly negative) dentry */
>  	ret = d_splice_alias(inode, dentry);
> - out_unlock:
>  	mutex_unlock(&kernfs_mutex);
> +
>  	return ret;
>  }
>

Miklos Szeredi June 3, 2021, 6:06 p.m. UTC | #7

On Thu, 3 Jun 2021 at 19:26, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Ian Kent <raven@themaw.net> writes:
>
> > If there are many lookups for non-existent paths these negative lookups
> > can lead to a lot of overhead during path walks.
> >
> > The VFS allows dentries to be created as negative and hashed, and caches
> > them so they can be used to reduce the fairly high overhead alloc/free
> > cycle that occurs during these lookups.
> >
> > Signed-off-by: Ian Kent <raven@themaw.net>
> > ---
> >  fs/kernfs/dir.c |   55 +++++++++++++++++++++++++++++++++----------------------
> >  1 file changed, 33 insertions(+), 22 deletions(-)
> >
> > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > index 4c69e2af82dac..5151c712f06f5 100644
> > --- a/fs/kernfs/dir.c
> > +++ b/fs/kernfs/dir.c
> > @@ -1037,12 +1037,33 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
> >       if (flags & LOOKUP_RCU)
> >               return -ECHILD;
> >
> > -     /* Always perform fresh lookup for negatives */
> > -     if (d_really_is_negative(dentry))
> > -             goto out_bad_unlocked;
> > +     mutex_lock(&kernfs_mutex);
> >
> >       kn = kernfs_dentry_node(dentry);
> > -     mutex_lock(&kernfs_mutex);
>
> Why bring kernfs_dentry_node inside the mutex?
>
> The inode lock of the parent should protect negative to positive
> transitions not the kernfs_mutex.  So moving the code inside
> the mutex looks unnecessary and confusing.

Except that d_revalidate() may or may not be called with parent lock held.

Thanks,
Miklos

Eric W. Biederman June 3, 2021, 10:02 p.m. UTC | #8

Miklos Szeredi <miklos@szeredi.hu> writes:

> On Thu, 3 Jun 2021 at 19:26, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> Ian Kent <raven@themaw.net> writes:
>>
>> > If there are many lookups for non-existent paths these negative lookups
>> > can lead to a lot of overhead during path walks.
>> >
>> > The VFS allows dentries to be created as negative and hashed, and caches
>> > them so they can be used to reduce the fairly high overhead alloc/free
>> > cycle that occurs during these lookups.
>> >
>> > Signed-off-by: Ian Kent <raven@themaw.net>
>> > ---
>> >  fs/kernfs/dir.c |   55 +++++++++++++++++++++++++++++++++----------------------
>> >  1 file changed, 33 insertions(+), 22 deletions(-)
>> >
>> > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
>> > index 4c69e2af82dac..5151c712f06f5 100644
>> > --- a/fs/kernfs/dir.c
>> > +++ b/fs/kernfs/dir.c
>> > @@ -1037,12 +1037,33 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
>> >       if (flags & LOOKUP_RCU)
>> >               return -ECHILD;
>> >
>> > -     /* Always perform fresh lookup for negatives */
>> > -     if (d_really_is_negative(dentry))
>> > -             goto out_bad_unlocked;
>> > +     mutex_lock(&kernfs_mutex);
>> >
>> >       kn = kernfs_dentry_node(dentry);
>> > -     mutex_lock(&kernfs_mutex);
>>
>> Why bring kernfs_dentry_node inside the mutex?
>>
>> The inode lock of the parent should protect negative to positive
>> transitions not the kernfs_mutex.  So moving the code inside
>> the mutex looks unnecessary and confusing.
>
> Except that d_revalidate() may or may not be called with parent lock
> held.

I grant that this works because kernfs_iop_lookup today holds
kernfs_mutex over d_splice_alias.

The problem is that the kernfs_mutex should only be protecting the
kernfs data structures, not the vfs data structures.

Reading through the code history, that looks like a holdover from when
sysfs lived in the dcache before it was reimplemented as a distributed
file system.  So it was probably a complete oversight and something
that did not matter.

The big problem is that if the code starts depending upon the
kernfs_mutex (or the kernfs_rwsem) to provide semantics that the rest of
the filesystems do not, the code will diverge from the rest of the
filesystems and maintenance will become much more difficult.

Diverging from other filesystems and becoming a maintenance pain has
already been seen once in the life of sysfs and I don't think we want to
go back there.

Further extending the scope of the lock, when the problem is that the
locking is causing problems, seems like the opposite of the direction we
want the code to grow.

I really suspect all we want kernfs_dop_revalidate doing for negative
dentries is something as simple as comparing the timestamp of the
negative dentry to the timestamp of the parent dentry, and if the
timestamp has changed perform the lookup.  That is roughly what
nfs does today with negative dentries.

The dentry cache will always lag the kernfs_node data structures, and
that is fundamental.  We should take advantage of that to make the code
as simple and as fast as we can, not perform lots of work that creates
overhead.

Plus the kernfs data structures should not change much so I expect
there will be effectively 0 penalty in always performing the lookup of a
negative dentry when the directory itself has changed.

Eric
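
A minimal userspace sketch of the revalidation idea described above,
using a per-directory change counter in place of a timestamp. The names
dir_node, neg_dentry, gen and the helpers are hypothetical and are not
kernfs or VFS API; the sketch models only the comparison, not the
locking or the dcache itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a kernfs directory node. */
struct dir_node {
	unsigned long gen;	/* bumped whenever a child is added or removed */
};

/* Hypothetical stand-in for a cached negative dentry. */
struct neg_dentry {
	const struct dir_node *parent;
	unsigned long gen_seen;	/* parent->gen at the time the lookup failed */
};

/* Record that the directory contents changed. */
void dir_changed(struct dir_node *dir)
{
	dir->gen++;
}

/* Cache a failed lookup against the directory's current generation. */
void neg_dentry_init(struct neg_dentry *d, const struct dir_node *parent)
{
	d->parent = parent;
	d->gen_seen = parent->gen;
}

/*
 * Returns true while the cached "does not exist" answer can be reused.
 * Once the directory has changed the caller has to redo the lookup.
 */
bool neg_dentry_valid(const struct neg_dentry *d)
{
	return d->gen_seen == d->parent->gen;
}
```

With this shape the negative case needs no search of the directory in
revalidate, only one comparison; any change to the directory forces a
fresh lookup, even one that does not involve the cached name.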

Ian Kent June 3, 2021, 11:57 p.m. UTC | #9

On Thu, 2021-06-03 at 10:15 +0800, Ian Kent wrote:
> On Wed, 2021-06-02 at 18:57 +0800, Ian Kent wrote:
> > On Wed, 2021-06-02 at 10:58 +0200, Miklos Szeredi wrote:
> > > On Wed, 2 Jun 2021 at 05:44, Ian Kent <raven@themaw.net> wrote:
> > > > 
> > > > On Tue, 2021-06-01 at 14:41 +0200, Miklos Szeredi wrote:
> > > > > On Fri, 28 May 2021 at 08:34, Ian Kent <raven@themaw.net>
> > > > > wrote:
> > > > > > 
> > > > > > If there are many lookups for non-existent paths these
> > > > > > negative
> > > > > > lookups
> > > > > > can lead to a lot of overhead during path walks.
> > > > > > 
> > > > > > The VFS allows dentries to be created as negative and
> > > > > > hashed,
> > > > > > and
> > > > > > caches
> > > > > > them so they can be used to reduce the fairly high overhead
> > > > > > alloc/free
> > > > > > cycle that occurs during these lookups.
> > > > > 
> > > > > Obviously there's a cost associated with negative caching
> > > > > too. 
> > > > > For
> > > > > normal filesystems it's trivially worth that cost, but in
> > > > > case
> > > > > of
> > > > > kernfs, not sure...
> > > > > 
> > > > > Can "fairly high" be somewhat substantiated with a
> > > > > microbenchmark
> > > > > for
> > > > > negative lookups?
> > > > 
> > > > Well, maybe, but anything we do for a benchmark would be
> > > > totally
> > > > artificial.
> > > > 
> > > > The reason I added this is because I saw appreciable contention
> > > > on the dentry alloc path in one case I saw.
> > > 
> > > If multiple tasks are trying to look up the same negative dentry
> > > in
> > > parallel, then there will be contention on the parent inode lock.
> > > Was this the issue?   This could easily be reproduced with an
> > > artificial benchmark.
> > 
> > Not that I remember, I'll need to dig up the sysrq dumps to have a
> > look and get back to you.
> 
> After doing that though I could grab Fox Chen's reproducer and give
> it varying sysfs paths as well as some percentage of non-existent
> sysfs paths and see what I get ...
> 
> That should give it a more realistic usage profile and, if I can
> get the percentage of non-existent paths right, demonstrate that
> case as well ... but nothing is easy, so we'll have to wait and
> see, ;)

Ok, so I grabbed Fox's benchmark repo and used a non-existent path
to check the negative dentry contention.

I've taken the baseline readings and the contention I see is the
same as I originally saw. It's with d_alloc_parallel() on lockref.

While I haven't run the patched check I'm pretty sure that using
dget_parent() and taking a snapshot will move the contention to
that. So if I do retain the negative dentry caching change I would
need to use the dentry seq lock for it to be useful.

Thoughts Miklos, anyone?

> 
> > 
> > > 
> > > > > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > > > > > index 4c69e2af82dac..5151c712f06f5 100644
> > > > > > --- a/fs/kernfs/dir.c
> > > > > > +++ b/fs/kernfs/dir.c
> > > > > > @@ -1037,12 +1037,33 @@ static int
> > > > > > kernfs_dop_revalidate(struct
> > > > > > dentry *dentry, unsigned int flags)
> > > > > >         if (flags & LOOKUP_RCU)
> > > > > >                 return -ECHILD;
> > > > > > 
> > > > > > -       /* Always perform fresh lookup for negatives */
> > > > > > -       if (d_really_is_negative(dentry))
> > > > > > -               goto out_bad_unlocked;
> > > > > > +       mutex_lock(&kernfs_mutex);
> > > > > > 
> > > > > >         kn = kernfs_dentry_node(dentry);
> > > > > > -       mutex_lock(&kernfs_mutex);
> > > > > > +
> > > > > > +       /* Negative hashed dentry? */
> > > > > > +       if (!kn) {
> > > > > > +               struct kernfs_node *parent;
> > > > > > +
> > > > > > +               /* If the kernfs node can be found this is
> > > > > > a
> > > > > > stale
> > > > > > negative
> > > > > > +                * hashed dentry so it must be discarded
> > > > > > and
> > > > > > the
> > > > > > lookup redone.
> > > > > > +                */
> > > > > > +               parent = kernfs_dentry_node(dentry-
> > > > > > > d_parent);
> > > > > 
> > > > > This doesn't look safe WRT a racing sys_rename().  In this
> > > > > case
> > > > > d_move() is called only with parent inode locked, but not
> > > > > with
> > > > > kernfs_mutex while ->d_revalidate() may not have parent inode
> > > > > locked.
> > > > > After d_move() the old parent dentry can be freed, resulting
> > > > > in
> > > > > use
> > > > > after free.  Easily fixed by dget_parent().
> > > > 
> > > > Umm ... I'll need some more explanation here ...
> > > > 
> > > > We are in ref-walk mode so the parent dentry isn't going away.
> > > 
> > > The parent that was used to lookup the dentry in __d_lookup()
> > > isn't
> > > going away.  But it's not necessarily equal to dentry->d_parent
> > > anymore.
> > > 
> > > > And this is a negative dentry so rename is going to bail out
> > > > with ENOENT way early.
> > > 
> > > You are right.  But note that negative dentry in question could
> > > be
> > > the
> > > target of a rename.  Current implementation doesn't switch the
> > > target's parent or name, but this wasn't always the case (commit
> > > 076515fc9267 ("make non-exchanging __d_move() copy ->d_parent
> > > rather
> > > than swap them")), so a backport of this patch could become
> > > incorrect
> > > on old enough kernels.
> > 
> > Right, that __lookup_hash() will find the negative target.
> > 
> > > 
> > > So I still think using dget_parent() is the correct way to do
> > > this.
> > 
> > The rename code does my head in, ;)
> > 
> > The dget_parent() would ensure we had an up to date parent so
> > yes, that would be the right thing to do regardless.
> > 
> > But now I'm not sure that will be sufficient for kernfs. I'm still
> > thinking about it.
> > 
> > I'm wondering if there's a missing check in there to account for
> > what happens with revalidate after ->rename() but before move.
> > There's already a kernfs node check in there so it's probably ok
> > ...
> >  
> > > 
> > > > > 
> > > > > > +               if (parent) {
> > > > > > +                       const void *ns = NULL;
> > > > > > +
> > > > > > +                       if (kernfs_ns_enabled(parent))
> > > > > > +                               ns = kernfs_info(dentry-
> > > > > > > d_sb)-
> > > > > > > ns;
> > > > > > +                       kn = kernfs_find_ns(parent, dentry-
> > > > > > > d_name.name, ns);
> > > > > 
> > > > > Same thing with d_name.  There's
> > > > > take_dentry_name_snapshot()/release_dentry_name_snapshot() to
> > > > > properly
> > > > > take care of that.
> > > > 
> > > > I don't see that problem either, due to the dentry being
> > > > negative,
> > > > but please explain what your seeing here.
> > > 
> > > Yeah.  Negative dentries' names weren't always stable, but that
> > > was
> > > a
> > > long time ago (commit 8d85b4845a66 ("Allow sharing external names
> > > after __d_move()")).
> > 
> > Right, I'll make that change too.
> > 
> > > 
> > > Thanks,
> > > Miklos
> > 
>

Ian Kent June 4, 2021, 1:07 a.m. UTC | #10

On Fri, 2021-06-04 at 07:57 +0800, Ian Kent wrote:
> On Thu, 2021-06-03 at 10:15 +0800, Ian Kent wrote:
> > On Wed, 2021-06-02 at 18:57 +0800, Ian Kent wrote:
> > > On Wed, 2021-06-02 at 10:58 +0200, Miklos Szeredi wrote:
> > > > On Wed, 2 Jun 2021 at 05:44, Ian Kent <raven@themaw.net> wrote:
> > > > > 
> > > > > On Tue, 2021-06-01 at 14:41 +0200, Miklos Szeredi wrote:
> > > > > > On Fri, 28 May 2021 at 08:34, Ian Kent <raven@themaw.net>
> > > > > > wrote:
> > > > > > > 
> > > > > > > If there are many lookups for non-existent paths these
> > > > > > > negative
> > > > > > > lookups
> > > > > > > can lead to a lot of overhead during path walks.
> > > > > > > 
> > > > > > > The VFS allows dentries to be created as negative and
> > > > > > > hashed,
> > > > > > > and
> > > > > > > caches
> > > > > > > them so they can be used to reduce the fairly high
> > > > > > > overhead
> > > > > > > alloc/free
> > > > > > > cycle that occurs during these lookups.
> > > > > > 
> > > > > > Obviously there's a cost associated with negative caching
> > > > > > too. 
> > > > > > For
> > > > > > normal filesystems it's trivially worth that cost, but in
> > > > > > case
> > > > > > of
> > > > > > kernfs, not sure...
> > > > > > 
> > > > > > Can "fairly high" be somewhat substantiated with a
> > > > > > microbenchmark
> > > > > > for
> > > > > > negative lookups?
> > > > > 
> > > > > Well, maybe, but anything we do for a benchmark would be
> > > > > totally
> > > > > artificial.
> > > > > 
> > > > > The reason I added this is because I saw appreciable
> > > > > contention
> > > > > on the dentry alloc path in one case I saw.
> > > > 
> > > > If multiple tasks are trying to look up the same negative
> > > > dentry
> > > > in
> > > > parallel, then there will be contention on the parent inode
> > > > lock.
> > > > Was this the issue?   This could easily be reproduced with an
> > > > artificial benchmark.
> > > 
> > > Not that I remember, I'll need to dig up the sysrq dumps to have
> > > a
> > > look and get back to you.
> > 
> > After doing that though I could grab Fox Chen's reproducer and give
> > it varying sysfs paths as well as some percentage of non-existent
> > sysfs paths and see what I get ...
> > 
> > That should give it a more realistic usage profile and, if I can
> > get the percentage of non-existent paths right, demonstrate that
> > case as well ... but nothing is easy, so we'll have to wait and
> > see, ;)
> 
> Ok, so I grabbed Fox's benckmark repo. and used a non-existent path
> to check the negative dentry contention.
> 
> I've taken the baseline readings and the contention is see is the
> same as I originally saw. It's with d_alloc_parallel() on lockref.
> 
> While I haven't run the patched check I'm pretty sure that using
> dget_parent() and taking a snapshot will move the contention to
> that. So if I do retain the negative dentry caching change I would
> need to use the dentry seq lock for it to be useful.
> 
> Thoughts Miklos, anyone?

Mmm ... never mind, I'd still need to take a snapshot anyway and
dget_parent() looks lightweight if there's no conflict. I will
need to test it.

> 
> > 
> > > 
> > > > 
> > > > > > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > > > > > > index 4c69e2af82dac..5151c712f06f5 100644
> > > > > > > --- a/fs/kernfs/dir.c
> > > > > > > +++ b/fs/kernfs/dir.c
> > > > > > > @@ -1037,12 +1037,33 @@ static int
> > > > > > > kernfs_dop_revalidate(struct
> > > > > > > dentry *dentry, unsigned int flags)
> > > > > > >         if (flags & LOOKUP_RCU)
> > > > > > >                 return -ECHILD;
> > > > > > > 
> > > > > > > -       /* Always perform fresh lookup for negatives */
> > > > > > > -       if (d_really_is_negative(dentry))
> > > > > > > -               goto out_bad_unlocked;
> > > > > > > +       mutex_lock(&kernfs_mutex);
> > > > > > > 
> > > > > > >         kn = kernfs_dentry_node(dentry);
> > > > > > > -       mutex_lock(&kernfs_mutex);
> > > > > > > +
> > > > > > > +       /* Negative hashed dentry? */
> > > > > > > +       if (!kn) {
> > > > > > > +               struct kernfs_node *parent;
> > > > > > > +
> > > > > > > +               /* If the kernfs node can be found this
> > > > > > > is
> > > > > > > a
> > > > > > > stale
> > > > > > > negative
> > > > > > > +                * hashed dentry so it must be discarded
> > > > > > > and
> > > > > > > the
> > > > > > > lookup redone.
> > > > > > > +                */
> > > > > > > +               parent = kernfs_dentry_node(dentry-
> > > > > > > > d_parent);
> > > > > > 
> > > > > > This doesn't look safe WRT a racing sys_rename().  In this
> > > > > > case
> > > > > > d_move() is called only with parent inode locked, but not
> > > > > > with
> > > > > > kernfs_mutex while ->d_revalidate() may not have parent
> > > > > > inode
> > > > > > locked.
> > > > > > After d_move() the old parent dentry can be freed,
> > > > > > resulting
> > > > > > in
> > > > > > use
> > > > > > after free.  Easily fixed by dget_parent().
> > > > > 
> > > > > Umm ... I'll need some more explanation here ...
> > > > > 
> > > > > We are in ref-walk mode so the parent dentry isn't going
> > > > > away.
> > > > 
> > > > The parent that was used to lookup the dentry in __d_lookup()
> > > > isn't
> > > > going away.  But it's not necessarily equal to dentry->d_parent
> > > > anymore.
> > > > 
> > > > > And this is a negative dentry so rename is going to bail out
> > > > > with ENOENT way early.
> > > > 
> > > > You are right.  But note that negative dentry in question could
> > > > be
> > > > the
> > > > target of a rename.  Current implementation doesn't switch the
> > > > target's parent or name, but this wasn't always the case
> > > > (commit
> > > > 076515fc9267 ("make non-exchanging __d_move() copy ->d_parent
> > > > rather
> > > > than swap them")), so a backport of this patch could become
> > > > incorrect
> > > > on old enough kernels.
> > > 
> > > Right, that __lookup_hash() will find the negative target.
> > > 
> > > > 
> > > > So I still think using dget_parent() is the correct way to do
> > > > this.
> > > 
> > > The rename code does my head in, ;)
> > > 
> > > The dget_parent() would ensure we had an up to date parent so
> > > yes, that would be the right thing to do regardless.
> > > 
> > > But now I'm not sure that will be sufficient for kernfs. I'm
> > > still
> > > thinking about it.
> > > 
> > > I'm wondering if there's a missing check in there to account for
> > > what happens with revalidate after ->rename() but before move.
> > > There's already a kernfs node check in there so it's probably ok
> > > ...
> > >  
> > > > 
> > > > > > 
> > > > > > > +               if (parent) {
> > > > > > > +                       const void *ns = NULL;
> > > > > > > +
> > > > > > > +                       if (kernfs_ns_enabled(parent))
> > > > > > > +                               ns = kernfs_info(dentry-
> > > > > > > > d_sb)-
> > > > > > > > ns;
> > > > > > > +                       kn = kernfs_find_ns(parent,
> > > > > > > dentry-
> > > > > > > > d_name.name, ns);
> > > > > > 
> > > > > > Same thing with d_name.  There's
> > > > > > take_dentry_name_snapshot()/release_dentry_name_snapshot()
> > > > > > to
> > > > > > properly
> > > > > > take care of that.
> > > > > 
> > > > > I don't see that problem either, due to the dentry being
> > > > > negative,
> > > > > but please explain what your seeing here.
> > > > 
> > > > Yeah.  Negative dentries' names weren't always stable, but that
> > > > was
> > > > a
> > > > long time ago (commit 8d85b4845a66 ("Allow sharing external
> > > > names
> > > > after __d_move()")).
> > > 
> > > Right, I'll make that change too.
> > > 
> > > > 
> > > > Thanks,
> > > > Miklos
> > > 
> > 
>

Ian Kent June 4, 2021, 3:14 a.m. UTC | #11

On Thu, 2021-06-03 at 17:02 -0500, Eric W. Biederman wrote:
> Miklos Szeredi <miklos@szeredi.hu> writes:
> 
> > On Thu, 3 Jun 2021 at 19:26, Eric W. Biederman < 
> > ebiederm@xmission.com> wrote:
> > > 
> > > Ian Kent <raven@themaw.net> writes:
> > > 
> > > > If there are many lookups for non-existent paths these negative
> > > > lookups
> > > > can lead to a lot of overhead during path walks.
> > > > 
> > > > The VFS allows dentries to be created as negative and hashed,
> > > > and caches
> > > > them so they can be used to reduce the fairly high overhead
> > > > alloc/free
> > > > cycle that occurs during these lookups.
> > > > 
> > > > Signed-off-by: Ian Kent <raven@themaw.net>
> > > > ---
> > > >  fs/kernfs/dir.c |   55 +++++++++++++++++++++++++++++++++------
> > > > ----------------
> > > >  1 file changed, 33 insertions(+), 22 deletions(-)
> > > > 
> > > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > > > index 4c69e2af82dac..5151c712f06f5 100644
> > > > --- a/fs/kernfs/dir.c
> > > > +++ b/fs/kernfs/dir.c
> > > > @@ -1037,12 +1037,33 @@ static int kernfs_dop_revalidate(struct
> > > > dentry *dentry, unsigned int flags)
> > > >       if (flags & LOOKUP_RCU)
> > > >               return -ECHILD;
> > > > 
> > > > -     /* Always perform fresh lookup for negatives */
> > > > -     if (d_really_is_negative(dentry))
> > > > -             goto out_bad_unlocked;
> > > > +     mutex_lock(&kernfs_mutex);
> > > > 
> > > >       kn = kernfs_dentry_node(dentry);
> > > > -     mutex_lock(&kernfs_mutex);
> > > 
> > > Why bring kernfs_dentry_node inside the mutex?
> > > 
> > > The inode lock of the parent should protect negative to positive
> > > transitions not the kernfs_mutex.  So moving the code inside
> > > the mutex looks unnecessary and confusing.
> > 
> > Except that d_revalidate() may or may not be called with parent
> > lock
> > held.

Bringing the kernfs_dentry_node() inside taking the mutex is probably
wasteful, as you say. Oddly, the reason I did it is that conceptually it
makes sense to me since the kernfs node is being grabbed. But a
concurrent unlink probably isn't possible, so it is not necessary.

Since you feel strongly about it I can change it.

> 
> I grant that this works because kernfs_io_lookup today holds
> kernfs_mutex over d_splice_alias.

Changing that will require some thought but your points about
maintainability are well taken.

> 
> The problem is that the kernfs_mutex only should be protecting the
> kernfs data structures not the vfs data structures.
> 
> Reading through the code history that looks like a hold over from
> when
> sysfs lived in the dcache before it was reimplemented as a
> distributed
> file system.  So it was probably a complete over sight and something
> that did not matter.
> 
> The big problem is that if the code starts depending upon the
> kernfs_mutex (or the kernfs_rwsem) to provide semantics the rest of
> the
> filesystems does not the code will diverge from the rest of the
> filesystems and maintenance will become much more difficult.
> 
> Diverging from other filesystems and becoming a maintenance pain has
> already been seen once in the life of sysfs and I don't think we want
> to
> go back there.
> 
> Further extending the scope of lock, when the problem is that the
> locking is causing problems seems like the opposite of the direction
> we
> want the code to grow.
> 
> I really suspect all we want kernfs_dop_revalidate doing for negative
> dentries is something as simple as comparing the timestamp of the
> negative dentry to the timestamp of the parent dentry, and if the
> timestamp has changed perform the lookup.  That is roughly what
> nfs does today with negative dentries.
> 
> The dentry cache will always lag the kernfs_node data structures, and
> that is fundamental.  We should take advantage of that to make the
> code
> as simple and as fast as we can not to perform lots of work that
> creates
> overhead.
> 
> Plus the kernfs data structures should not change much so I expect
> there will be effectively 0 penalty in always performing the lookup
> of a
> negative dentry when the directory itself has changed.

This sounds good to me.

In fact this approach should be able to be used to resolve the
potential race Miklos pointed out in a much simpler way, not to
mention the revalidate simplification itself.

But isn't knowing whether the directory has changed harder to
do than checking a time stamp?

Look at kernfs_refresh_inode() and its callers for example.

I suspect that would require bringing back the series patch to use
a generation number to identify directory changes (and also getting
rid of the search in revalidate).

Ian

Eric W. Biederman June 4, 2021, 2:28 p.m. UTC | #12

Ian Kent <raven@themaw.net> writes:

> On Thu, 2021-06-03 at 17:02 -0500, Eric W. Biederman wrote:
>> Miklos Szeredi <miklos@szeredi.hu> writes:
>> 
>> > On Thu, 3 Jun 2021 at 19:26, Eric W. Biederman < 
>> > ebiederm@xmission.com> wrote:
>> > > 
>> > > Ian Kent <raven@themaw.net> writes:
>> > > 
>> > > > If there are many lookups for non-existent paths these negative
>> > > > lookups
>> > > > can lead to a lot of overhead during path walks.
>> > > > 
>> > > > The VFS allows dentries to be created as negative and hashed,
>> > > > and caches
>> > > > them so they can be used to reduce the fairly high overhead
>> > > > alloc/free
>> > > > cycle that occurs during these lookups.
>> > > > 
>> > > > Signed-off-by: Ian Kent <raven@themaw.net>
>> > > > ---
>> > > >  fs/kernfs/dir.c |   55 +++++++++++++++++++++++++++++++++------
>> > > > ----------------
>> > > >  1 file changed, 33 insertions(+), 22 deletions(-)
>> > > > 
>> > > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
>> > > > index 4c69e2af82dac..5151c712f06f5 100644
>> > > > --- a/fs/kernfs/dir.c
>> > > > +++ b/fs/kernfs/dir.c
>> > > > @@ -1037,12 +1037,33 @@ static int kernfs_dop_revalidate(struct
>> > > > dentry *dentry, unsigned int flags)
>> > > >       if (flags & LOOKUP_RCU)
>> > > >               return -ECHILD;
>> > > > 
>> > > > -     /* Always perform fresh lookup for negatives */
>> > > > -     if (d_really_is_negative(dentry))
>> > > > -             goto out_bad_unlocked;
>> > > > +     mutex_lock(&kernfs_mutex);
>> > > > 
>> > > >       kn = kernfs_dentry_node(dentry);
>> > > > -     mutex_lock(&kernfs_mutex);
>> > > 
>> > > Why bring kernfs_dentry_node inside the mutex?
>> > > 
>> > > The inode lock of the parent should protect negative to positive
>> > > transitions not the kernfs_mutex.  So moving the code inside
>> > > the mutex looks unnecessary and confusing.
>> > 
>> > Except that d_revalidate() may or may not be called with parent
>> > lock
>> > held.
>
> Bringing the kernfs_dentry_node() inside taking the mutex is probably
> wasteful, as you say, oddly the reason I did it that conceptually it
> makes sense to me since the kernfs node is being grabbed. But it
> probably isn't possible for a concurrent unlink so is not necessary.
>
> Since you feel strongly about I can change it.
>
>> 
>> I grant that this works because kernfs_io_lookup today holds
>> kernfs_mutex over d_splice_alias.
>
> Changing that will require some thought but your points about
> maintainability are well taken.
>
>> 
>> The problem is that the kernfs_mutex only should be protecting the
>> kernfs data structures not the vfs data structures.
>> 
>> Reading through the code history that looks like a hold over from
>> when
>> sysfs lived in the dcache before it was reimplemented as a
>> distributed
>> file system.  So it was probably a complete over sight and something
>> that did not matter.
>> 
>> The big problem is that if the code starts depending upon the
>> kernfs_mutex (or the kernfs_rwsem) to provide semantics the rest of
>> the
>> filesystems does not the code will diverge from the rest of the
>> filesystems and maintenance will become much more difficult.
>> 
>> Diverging from other filesystems and becoming a maintenance pain has
>> already been seen once in the life of sysfs and I don't think we want
>> to
>> go back there.
>> 
>> Further extending the scope of lock, when the problem is that the
>> locking is causing problems seems like the opposite of the direction
>> we
>> want the code to grow.
>> 
>> I really suspect all we want kernfs_dop_revalidate doing for negative
>> dentries is something as simple as comparing the timestamp of the
>> negative dentry to the timestamp of the parent dentry, and if the
>> timestamp has changed perform the lookup.  That is roughly what
>> nfs does today with negative dentries.
>> 
>> The dentry cache will always lag the kernfs_node data structures, and
>> that is fundamental.  We should take advantage of that to make the
>> code
>> as simple and as fast as we can not to perform lots of work that
>> creates
>> overhead.
>> 
>> Plus the kernfs data structures should not change much so I expect
>> there will be effectively 0 penalty in always performing the lookup
>> of a
>> negative dentry when the directory itself has changed.
>
> This sounds good to me.
>
> In fact this approach should be able to be used to resolve the
> potential race Miklos pointed out in a much simpler way, not to
> mention the revalidate simplification itself.
>
> But isn't knowing whether the directory has changed harder to
> do than checking a time stamp?
>
> Look at kernfs_refresh_inode() and it's callers for example.
>
> I suspect that would require bringing back the series patch to use
> a generation number to identify directory changes (and also getting
> rid of the search in revalidate).

In essence it is as simple as looking at a sequence number or a timestamp
to detect that the directory has changed.

In practice there are always details that make things more complicated.

I was actually wondering if the approach should be to have a seqlock
around an individual directory's rbtree.  I think that would give a lot
of potential for rcu style optimization during lookups.



All of the little details and choices on how to optimize this are why I
was suggesting splitting the patch in two: starting first with
something that allows negative dentries, then adding the tests so that
the negative dentries are not always invalidated.  That should allow
focusing on the tricky bits.

Eric
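
A rough userspace sketch of the per-directory sequence counter idea
mentioned above, written with C11 atomics rather than the kernel's
seqlock API. The struct dir_seq, its nr_children field (standing in for
the rbtree contents) and all helper names are hypothetical; writers are
assumed to be serialized by some other lock, and default sequentially
consistent atomics are used for simplicity rather than speed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-directory state; an odd seq means a writer is active. */
struct dir_seq {
	atomic_uint seq;
	atomic_int nr_children;	/* stands in for the directory's rbtree */
};

/* Start a read section: wait until no writer is active, return the count. */
unsigned int dir_read_begin(struct dir_seq *d)
{
	unsigned int s;

	while ((s = atomic_load(&d->seq)) & 1)
		;
	return s;
}

/* True if the directory changed since dir_read_begin() returned start. */
bool dir_read_retry(struct dir_seq *d, unsigned int start)
{
	return atomic_load(&d->seq) != start;
}

/* Writers bracket every modification; callers must serialize writers. */
void dir_write_begin(struct dir_seq *d)
{
	atomic_fetch_add(&d->seq, 1);
}

void dir_write_end(struct dir_seq *d)
{
	atomic_fetch_add(&d->seq, 1);
}

/* Example writer: add one child under the sequence counter. */
void dir_add_child(struct dir_seq *d)
{
	dir_write_begin(d);
	atomic_fetch_add(&d->nr_children, 1);
	dir_write_end(d);
}

/* Example lockless reader: retry until a consistent value is seen. */
int dir_count_children(struct dir_seq *d)
{
	unsigned int start;
	int n;

	do {
		start = dir_read_begin(d);
		n = atomic_load(&d->nr_children);
	} while (dir_read_retry(d, start));
	return n;
}
```

The same counter could double as the "has the directory changed" test
for a cached negative dentry: remember the value returned by
dir_read_begin() when the lookup fails and compare it again in
revalidate.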

Ian Kent June 5, 2021, 3:19 a.m. UTC | #13

On Fri, 2021-06-04 at 09:28 -0500, Eric W. Biederman wrote:
> Ian Kent <raven@themaw.net> writes:
> 
> > On Thu, 2021-06-03 at 17:02 -0500, Eric W. Biederman wrote:
> > > Miklos Szeredi <miklos@szeredi.hu> writes:
> > > 
> > > > On Thu, 3 Jun 2021 at 19:26, Eric W. Biederman < 
> > > > ebiederm@xmission.com> wrote:
> > > > > 
> > > > > Ian Kent <raven@themaw.net> writes:
> > > > > 
> > > > > > If there are many lookups for non-existent paths these
> > > > > > negative
> > > > > > lookups
> > > > > > can lead to a lot of overhead during path walks.
> > > > > > 
> > > > > > The VFS allows dentries to be created as negative and
> > > > > > hashed,
> > > > > > and caches
> > > > > > them so they can be used to reduce the fairly high overhead
> > > > > > alloc/free
> > > > > > cycle that occurs during these lookups.
> > > > > > 
> > > > > > Signed-off-by: Ian Kent <raven@themaw.net>
> > > > > > ---
> > > > > >  fs/kernfs/dir.c |   55 +++++++++++++++++++++++++++++++++--
> > > > > > ----
> > > > > > ----------------
> > > > > >  1 file changed, 33 insertions(+), 22 deletions(-)
> > > > > > 
> > > > > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > > > > > index 4c69e2af82dac..5151c712f06f5 100644
> > > > > > --- a/fs/kernfs/dir.c
> > > > > > +++ b/fs/kernfs/dir.c
> > > > > > @@ -1037,12 +1037,33 @@ static int
> > > > > > kernfs_dop_revalidate(struct
> > > > > > dentry *dentry, unsigned int flags)
> > > > > >       if (flags & LOOKUP_RCU)
> > > > > >               return -ECHILD;
> > > > > > 
> > > > > > -     /* Always perform fresh lookup for negatives */
> > > > > > -     if (d_really_is_negative(dentry))
> > > > > > -             goto out_bad_unlocked;
> > > > > > +     mutex_lock(&kernfs_mutex);
> > > > > > 
> > > > > >       kn = kernfs_dentry_node(dentry);
> > > > > > -     mutex_lock(&kernfs_mutex);
> > > > > 
> > > > > Why bring kernfs_dentry_node inside the mutex?
> > > > > 
> > > > > The inode lock of the parent should protect negative to
> > > > > positive
> > > > > transitions not the kernfs_mutex.  So moving the code inside
> > > > > the mutex looks unnecessary and confusing.
> > > > 
> > > > Except that d_revalidate() may or may not be called with parent
> > > > lock
> > > > held.
> > 
> > Bringing the kernfs_dentry_node() inside taking the mutex is
> > probably
> > wasteful, as you say, oddly the reason I did it that conceptually
> > it
> > makes sense to me since the kernfs node is being grabbed. But it
> > probably isn't possible for a concurrent unlink so is not
> > necessary.
> > 
> > Since you feel strongly about I can change it.
> > 
> > > 
> > > I grant that this works because kernfs_io_lookup today holds
> > > kernfs_mutex over d_splice_alias.
> > 
> > Changing that will require some thought but your points about
> > maintainability are well taken.
> > 
> > > 
> > > The problem is that the kernfs_mutex only should be protecting
> > > the
> > > kernfs data structures not the vfs data structures.
> > > 
> > > Reading through the code history that looks like a hold over from
> > > when
> > > sysfs lived in the dcache before it was reimplemented as a
> > > distributed
> > > file system.  So it was probably a complete over sight and
> > > something
> > > that did not matter.
> > > 
> > > The big problem is that if the code starts depending upon the
> > > kernfs_mutex (or the kernfs_rwsem) to provide semantics the rest
> > > of
> > > the
> > > filesystems does not the code will diverge from the rest of the
> > > filesystems and maintenance will become much more difficult.
> > > 
> > > Diverging from other filesystems and becoming a maintenance pain
> > > has
> > > already been seen once in the life of sysfs and I don't think we
> > > want
> > > to
> > > go back there.
> > > 
> > > Further extending the scope of lock, when the problem is that the
> > > locking is causing problems seems like the opposite of the
> > > direction
> > > we
> > > want the code to grow.
> > > 
> > > I really suspect all we want kernfs_dop_revalidate doing for
> > > negative
> > > dentries is something as simple as comparing the timestamp of the
> > > negative dentry to the timestamp of the parent dentry, and if the
> > > timestamp has changed perform the lookup.  That is roughly what
> > > nfs does today with negative dentries.
> > > 
> > > The dentry cache will always lag the kernfs_node data structures,
> > > and
> > > that is fundamental.  We should take advantage of that to make
> > > the
> > > code
> > > as simple and as fast as we can not to perform lots of work that
> > > creates
> > > overhead.
> > > 
> > > Plus the kernfs data structures should not change much so I
> > > expect
> > > there will be effectively 0 penalty in always performing the
> > > lookup
> > > of a
> > > negative dentry when the directory itself has changed.
> > 
> > This sounds good to me.
> > 
> > In fact this approach should be able to be used to resolve the
> > potential race Miklos pointed out in a much simpler way, not to
> > mention the revalidate simplification itself.
> > 
> > But isn't knowing whether the directory has changed harder to
> > do than checking a time stamp?
> > 
> > Look at kernfs_refresh_inode() and it's callers for example.
> > 
> > I suspect that would require bringing back the series patch to use
> > a generation number to identify directory changes (and also getting
> > rid of the search in revalidate).
> 
> In essence it is a simple as looking at a sequence number or a
> timestamp
> to detect the directory has changed.

Yes, both Miklos and Al suggested using a simple revision to detect
changes to the parent. I did that early on, but I don't think I grokked
what Al recommended, and I ended up with something more complex than
was needed. So I dropped it because I wanted to keep the changes to a
minimum.

But a quick test, bringing that patch back and getting rid of the
search in revalidate, works well. For the case of a lot of
deterministic accesses to the same non-existent file, it's as
effective at eliminating the contention I saw with d_alloc_parallel()
as the racy search method I had there, perhaps a bit better, and it's
certainly more straightforward.
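
To make the revision idea concrete, here is a minimal userspace sketch.
It is not code from the patch series: the struct and function names are
hypothetical stand-ins, and the locking and memory ordering that real
kernel code would need are left out. The directory keeps a counter that
is bumped whenever a child is added or removed, and a cached negative
entry stays valid only while the counter still matches the value seen
when the lookup failed.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a directory node (not the real kernfs_node). */
struct dir_node {
	unsigned long rev;	/* bumped on every add/remove of a child */
};

/* Hypothetical stand-in for a cached negative dentry. */
struct neg_entry {
	const struct dir_node *parent;
	unsigned long cached_rev;	/* parent->rev when the lookup failed */
};

/* Any change to the directory's children advances its revision. */
static inline void dir_changed(struct dir_node *dir)
{
	dir->rev++;
}

/* Record a failed lookup against the parent's current revision. */
static inline void neg_entry_init(struct neg_entry *e,
				  const struct dir_node *parent)
{
	e->parent = parent;
	e->cached_rev = parent->rev;
}

/*
 * Revalidate without searching the directory: the negative entry is
 * still good if the parent has not changed since it was cached.
 * Returns false when a fresh lookup is needed.
 */
static inline bool neg_entry_revalidate(const struct neg_entry *e)
{
	return e->cached_rev == e->parent->rev;
}
```

The point of the design is that revalidate becomes a single comparison,
at the cost of re-doing the lookup for every cached negative entry in a
directory whenever that directory changes.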

> 
> In practice there are always details that make things more
> complicated.
> 
> I was actually wondering if the approach should be to have an seqlock
> around an individual directories rbtree.  I think that would give a
> lot
> of potential for rcu style optimization during lookups.

Yeah, it's tempting, but another constraint I had was to not increase
the size of the kernfs_node struct (Greg and Tejun), and there's a
hole in the node union variant kernfs_elem_dir at least big enough
for sizeof(pointer), so I can put the revision there. And, given the
simplification in revalidate, as well as that extra code being pretty
straightforward itself, it's not too bad from the minimal change
POV.

So I'd like to go with using a revision for now.
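
As an illustration of the "hole in the union variant" point, the sketch
below uses made-up layouts, not the real kernfs definitions. When one
variant of a union is smaller than the largest variant, a field can be
added to the smaller one without growing the union, and so without
growing the enclosing struct.

```c
#include <stddef.h>

/* Hypothetical mock-ups, only loosely modelled on per-type node variants. */
struct dir_before {
	unsigned long subdirs;
	void *children;
	void *root;
};

struct dir_after {
	unsigned long subdirs;
	void *children;
	void *root;
	unsigned long rev;	/* new field placed in the spare space */
};

struct attr_variant {
	const void *ops;
	void *open;
	long long size;
	void *notify_next;
};

union elem_before {
	struct dir_before dir;
	struct attr_variant attr;
};

union elem_after {
	struct dir_after dir;
	struct attr_variant attr;
};

/* The union is sized by its largest member, so adding rev costs nothing. */
_Static_assert(sizeof(union elem_after) == sizeof(union elem_before),
	       "adding rev to the dir variant must not grow the union");

/* Spare bytes in the dir variant before rev was added. */
static inline size_t dir_hole_bytes(void)
{
	return sizeof(union elem_before) - sizeof(struct dir_before);
}
```

A compile-time assertion like the one above is one way to keep a
"do not grow the struct" constraint from regressing silently.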

> 
> 
> 
> All of the little details and choices on how to optimize this is why
> I
> was suggesting splitting the patch in two.  Starting first with
> something that allows negative dentries.  Then adds the tests so that
> the negative dentries are not always invalidated.  That should allow
> focusing on the tricky bits.
> 
> Eric

Eric W. Biederman June 5, 2021, 8:52 p.m. UTC | #14

Ian Kent <raven@themaw.net> writes:

> On Fri, 2021-06-04 at 09:28 -0500, Eric W. Biederman wrote:
>> Ian Kent <raven@themaw.net> writes:
>> 
>> > On Thu, 2021-06-03 at 17:02 -0500, Eric W. Biederman wrote:
>> > > Miklos Szeredi <miklos@szeredi.hu> writes:
>> > > 
>> > > > On Thu, 3 Jun 2021 at 19:26, Eric W. Biederman < 
>> > > > ebiederm@xmission.com> wrote:
>> > > > > 
>> > > > > Ian Kent <raven@themaw.net> writes:
>> > > > > 
>> > > > > > If there are many lookups for non-existent paths these negative
>> > > > > > lookups can lead to a lot of overhead during path walks.
>> > > > > > 
>> > > > > > The VFS allows dentries to be created as negative and hashed,
>> > > > > > and caches them so they can be used to reduce the fairly high
>> > > > > > overhead alloc/free cycle that occurs during these lookups.
>> > > > > > 
>> > > > > > Signed-off-by: Ian Kent <raven@themaw.net>
>> > > > > > ---
>> > > > > >  fs/kernfs/dir.c |   55 +++++++++++++++++++++++++++++++++----------------------
>> > > > > >  1 file changed, 33 insertions(+), 22 deletions(-)
>> > > > > > 
>> > > > > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
>> > > > > > index 4c69e2af82dac..5151c712f06f5 100644
>> > > > > > --- a/fs/kernfs/dir.c
>> > > > > > +++ b/fs/kernfs/dir.c
>> > > > > > @@ -1037,12 +1037,33 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
>> > > > > >       if (flags & LOOKUP_RCU)
>> > > > > >               return -ECHILD;
>> > > > > > 
>> > > > > > -     /* Always perform fresh lookup for negatives */
>> > > > > > -     if (d_really_is_negative(dentry))
>> > > > > > -             goto out_bad_unlocked;
>> > > > > > +     mutex_lock(&kernfs_mutex);
>> > > > > > 
>> > > > > >       kn = kernfs_dentry_node(dentry);
>> > > > > > -     mutex_lock(&kernfs_mutex);
>> > > > > 
>> > > > > Why bring kernfs_dentry_node inside the mutex?
>> > > > > 
>> > > > > The inode lock of the parent should protect negative to
>> > > > > positive
>> > > > > transitions not the kernfs_mutex.  So moving the code inside
>> > > > > the mutex looks unnecessary and confusing.
>> > > > 
>> > > > Except that d_revalidate() may or may not be called with parent
>> > > > lock
>> > > > held.
>> > 
>> > Bringing the kernfs_dentry_node() inside taking the mutex is
>> > probably
>> > wasteful, as you say, oddly the reason I did it that conceptually
>> > it
>> > makes sense to me since the kernfs node is being grabbed. But it
>> > probably isn't possible for a concurrent unlink so is not
>> > necessary.
>> > 
>> > Since you feel strongly about I can change it.
>> > 
>> > > 
>> > > I grant that this works because kernfs_io_lookup today holds
>> > > kernfs_mutex over d_splice_alias.
>> > 
>> > Changing that will require some thought but your points about
>> > maintainability are well taken.
>> > 
>> > > 
>> > > The problem is that the kernfs_mutex only should be protecting
>> > > the
>> > > kernfs data structures not the vfs data structures.
>> > > 
>> > > Reading through the code history that looks like a hold over from
>> > > when
>> > > sysfs lived in the dcache before it was reimplemented as a
>> > > distributed
>> > > file system.  So it was probably a complete over sight and
>> > > something
>> > > that did not matter.
>> > > 
>> > > The big problem is that if the code starts depending upon the
>> > > kernfs_mutex (or the kernfs_rwsem) to provide semantics the rest
>> > > of
>> > > the
>> > > filesystems does not the code will diverge from the rest of the
>> > > filesystems and maintenance will become much more difficult.
>> > > 
>> > > Diverging from other filesystems and becoming a maintenance pain
>> > > has
>> > > already been seen once in the life of sysfs and I don't think we
>> > > want
>> > > to
>> > > go back there.
>> > > 
>> > > Further extending the scope of lock, when the problem is that the
>> > > locking is causing problems seems like the opposite of the
>> > > direction
>> > > we
>> > > want the code to grow.
>> > > 
>> > > I really suspect all we want kernfs_dop_revalidate doing for
>> > > negative
>> > > dentries is something as simple as comparing the timestamp of the
>> > > negative dentry to the timestamp of the parent dentry, and if the
>> > > timestamp has changed perform the lookup.  That is roughly what
>> > > nfs does today with negative dentries.
>> > > 
>> > > The dentry cache will always lag the kernfs_node data structures,
>> > > and
>> > > that is fundamental.  We should take advantage of that to make
>> > > the
>> > > code
>> > > as simple and as fast as we can not to perform lots of work that
>> > > creates
>> > > overhead.
>> > > 
>> > > Plus the kernfs data structures should not change much so I
>> > > expect
>> > > there will be effectively 0 penalty in always performing the
>> > > lookup
>> > > of a
>> > > negative dentry when the directory itself has changed.
>> > 
>> > This sounds good to me.
>> > 
>> > In fact this approach should be able to be used to resolve the
>> > potential race Miklos pointed out in a much simpler way, not to
>> > mention the revalidate simplification itself.
>> > 
>> > But isn't knowing whether the directory has changed harder to
>> > do than checking a time stamp?
>> > 
>> > Look at kernfs_refresh_inode() and it's callers for example.
>> > 
>> > I suspect that would require bringing back the series patch to use
>> > a generation number to identify directory changes (and also getting
>> > rid of the search in revalidate).
>> 
>> In essence it is a simple as looking at a sequence number or a
>> timestamp
>> to detect the directory has changed.
>
> Yes, both Miklos and Al suggested using a simple revision to detect
> changes to the parent. I did that early on and I don't think I grokked
> what Al recommended and ended up with something more complex than was
> needed. So I dropped it because I wanted to keep the changes to a
> minimum.
>
> But a quick test, bringing that patch back, and getting rid of the
> search in revalidate works well. It's as effective at eliminating
> contention I saw with d_alloc_parallel() for the case of a lot of
> deterministic accesses to the same non-existent file as the racy
> search method I had there, perhaps a bit better, it's certainly
> more straight forward.
>
>> 
>> In practice there are always details that make things more
>> complicated.
>> 
>> I was actually wondering if the approach should be to have an seqlock
>> around an individual directories rbtree.  I think that would give a
>> lot
>> of potential for rcu style optimization during lookups.
>
> Yeah, it's tempting, but another constraint I had is to not increase
> the size of the kernfs_node struct (Greg and Tejun) and there's a
> hole in the node union variant kernfs_elem_dir at least big enough
> for sizeof(pointer) so I can put the revision there. And, given the
> simplification in revalidate, as well as that extra code being pretty
> straight forward itself, it's not too bad from the minimal change
> POV.
>
> So I'd like to go with using a revision for now.

No objection from me.

Eric
