diff mbox series

[RFC] vfs: Introduce a new open flag to imply dentry deletion on file removal

Message ID 20240912091548.98132-1-laoar.shao@gmail.com (mailing list archive)
State New
Headers show
Series [RFC] vfs: Introduce a new open flag to imply dentry deletion on file removal | expand

Commit Message

Yafang Shao Sept. 12, 2024, 9:15 a.m. UTC
Commit 681ce8623567 ("vfs: Delete the associated dentry when deleting a
file") introduced an unconditional deletion of the associated dentry when a
file is removed. However, this led to performance regressions in specific
benchmarks, such as ilebench.sum_operations/s [0], prompting a revert in
commit 4a4be1ad3a6e ("Revert 'vfs: Delete the associated dentry when
deleting a file'").

This patch seeks to reintroduce the concept conditionally, where the
associated dentry is deleted only when the user explicitly opts for it
during file removal.

There are practical use cases for this proactive dentry reclamation.
Besides the Elasticsearch use case mentioned in commit 681ce8623567,
additional examples have surfaced in our production environment. For
instance, in video rendering services that continuously generate temporary
files, upload them to persistent storage servers, and then delete them, a
large number of negative dentries—serving no useful purpose—accumulate.
Users in such cases would benefit from proactively reclaiming these
negative dentries. This patch provides an API allowing users to actively
delete these unnecessary negative dentries.

Link: https://lore.kernel.org/linux-fsdevel/202405291318.4dfbb352-oliver.sang@intel.com [0]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
---
 fs/dcache.c                      | 7 ++++++-
 fs/open.c                        | 9 ++++++++-
 include/linux/dcache.h           | 2 +-
 include/linux/sched.h            | 2 +-
 include/uapi/asm-generic/fcntl.h | 4 ++++
 5 files changed, 20 insertions(+), 4 deletions(-)

Comments

Jan Kara Sept. 12, 2024, 10:53 a.m. UTC | #1
On Thu 12-09-24 17:15:48, Yafang Shao wrote:
> Commit 681ce8623567 ("vfs: Delete the associated dentry when deleting a
> file") introduced an unconditional deletion of the associated dentry when a
> file is removed. However, this led to performance regressions in specific
> benchmarks, such as ilebench.sum_operations/s [0], prompting a revert in
> commit 4a4be1ad3a6e ("Revert 'vfs: Delete the associated dentry when
> deleting a file'").
> 
> This patch seeks to reintroduce the concept conditionally, where the
> associated dentry is deleted only when the user explicitly opts for it
> during file removal.
> 
> There are practical use cases for this proactive dentry reclamation.
> Besides the Elasticsearch use case mentioned in commit 681ce8623567,
> additional examples have surfaced in our production environment. For
> instance, in video rendering services that continuously generate temporary
> files, upload them to persistent storage servers, and then delete them, a
> large number of negative dentries—serving no useful purpose—accumulate.
> Users in such cases would benefit from proactively reclaiming these
> negative dentries. This patch provides an API allowing users to actively
> delete these unnecessary negative dentries.
> 
> Link: https://lore.kernel.org/linux-fsdevel/202405291318.4dfbb352-oliver.sang@intel.com [0]
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Jan Kara <jack@suse.cz>

Umm, I don't think we want to burn a FMODE flag and expose these details of
dentry reclaim to userspace. BTW, if we wanted to do this, we already have
d_mark_dontcache() for in-kernel users which we could presumably reuse.

But I would not completely give up on trying to handle this in an automated
way inside the kernel. The original solution you mention above was perhaps
too aggressive but maybe d_delete() could just mark the dentry with a
"deleted" flag, retain_dentry() would move such dentries into a special LRU
list which we'd prune once in a while (say once per 5 s). Also this list
would be automatically pruned from prune_dcache_sb(). This way if there's
quick reuse of a dentry, it will get reused and no harm is done, if it
isn't quickly reused, we'll free them to not waste memory.

What do people think about such scheme?

								Honza

> ---
>  fs/dcache.c                      | 7 ++++++-
>  fs/open.c                        | 9 ++++++++-
>  include/linux/dcache.h           | 2 +-
>  include/linux/sched.h            | 2 +-
>  include/uapi/asm-generic/fcntl.h | 4 ++++
>  5 files changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 3d8daaecb6d1..6d744b5e5a6c 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1667,7 +1667,10 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
>  	smp_store_release(&dentry->d_name.name, dname); /* ^^^ */
>  
>  	dentry->d_lockref.count = 1;
> -	dentry->d_flags = 0;
> +	if (current->flags & PF_REMOVE_DENTRY)
> +		dentry->d_flags = DCACHE_FILE_REMOVE;
> +	else
> +		dentry->d_flags = 0;
>  	spin_lock_init(&dentry->d_lock);
>  	seqcount_spinlock_init(&dentry->d_seq, &dentry->d_lock);
>  	dentry->d_inode = NULL;
> @@ -2394,6 +2397,8 @@ void d_delete(struct dentry * dentry)
>  	 * Are we the only user?
>  	 */
>  	if (dentry->d_lockref.count == 1) {
> +		if (dentry->d_flags & DCACHE_FILE_REMOVE)
> +			__d_drop(dentry);
>  		dentry->d_flags &= ~DCACHE_CANT_MOUNT;
>  		dentry_unlink_inode(dentry);
>  	} else {
> diff --git a/fs/open.c b/fs/open.c
> index 22adbef7ecc2..3441a004a841 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -1428,7 +1428,14 @@ static long do_sys_openat2(int dfd, const char __user *filename,
>  long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
>  {
>  	struct open_how how = build_open_how(flags, mode);
> -	return do_sys_openat2(dfd, filename, &how);
> +	long err;
> +
> +	if (flags & O_NODENTRY)
> +		current->flags |= PF_REMOVE_DENTRY;
> +	err = do_sys_openat2(dfd, filename, &how);
> +	if (flags & O_NODENTRY)
> +		current->flags &= ~PF_REMOVE_DENTRY;
> +	return err;
>  }
>  
>  
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index bff956f7b2b9..82ba79bc0072 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -215,7 +215,7 @@ struct dentry_operations {
>  
>  #define DCACHE_NOKEY_NAME		BIT(25) /* Encrypted name encoded without key */
>  #define DCACHE_OP_REAL			BIT(26)
> -
> +#define DCACHE_FILE_REMOVE		BIT(27) /* remove this dentry when file is removed */
>  #define DCACHE_PAR_LOOKUP		BIT(28) /* being looked up (with parent locked shared) */
>  #define DCACHE_DENTRY_CURSOR		BIT(29)
>  #define DCACHE_NORCU			BIT(30) /* No RCU delay for freeing */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f8d150343d42..f931a3a882e0 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1649,7 +1649,7 @@ extern struct pid *cad_pid;
>  #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
>  #define PF_USER_WORKER		0x00004000	/* Kernel thread cloned from userspace thread */
>  #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
> -#define PF__HOLE__00010000	0x00010000
> +#define PF_REMOVE_DENTRY	0x00010000      /* Remove the dentry when the file is removed */
>  #define PF_KSWAPD		0x00020000	/* I am kswapd */
>  #define PF_MEMALLOC_NOFS	0x00040000	/* All allocations inherit GFP_NOFS. See memalloc_nfs_save() */
>  #define PF_MEMALLOC_NOIO	0x00080000	/* All allocations inherit GFP_NOIO. See memalloc_noio_save() */
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 80f37a0d40d7..ca5f402d5e7d 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -89,6 +89,10 @@
>  #define __O_TMPFILE	020000000
>  #endif
>  
> +#ifndef O_NODENTRY
> +#define O_NODENTRY     040000000
> +#endif
> +
>  /* a horrid kludge trying to make sure that this will fail on old kernels */
>  #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
>  
> -- 
> 2.43.5
>
Mateusz Guzik Sept. 12, 2024, 11:36 a.m. UTC | #2
On Thu, Sep 12, 2024 at 12:53:40PM +0200, Jan Kara wrote:
> On Thu 12-09-24 17:15:48, Yafang Shao wrote:
> > This patch seeks to reintroduce the concept conditionally, where the
> > associated dentry is deleted only when the user explicitly opts for it
> > during file removal.
> > 
>
> Umm, I don't think we want to burn a FMODE flag and expose these details of
> dentry reclaim to userspace. BTW, if we wanted to do this, we already have
> d_mark_dontcache() for in-kernel users which we could presumably reuse.
> 

I don't believe any mechanism letting userspace hint at what to do with
a dentry is warranted at this point.

> But I would not completely give up on trying to handle this in an automated
> way inside the kernel. The original solution you mention above was perhaps
> too aggressive but maybe d_delete() could just mark the dentry with a
> "deleted" flag, retain_dentry() would move such dentries into a special LRU
> list which we'd prune once in a while (say once per 5 s). Also this list
> would be automatically pruned from prune_dcache_sb(). This way if there's
> quick reuse of a dentry, it will get reused and no harm is done, if it
> isn't quickly reused, we'll free them to not waste memory.
> 
> What do people think about such scheme?
>

I have to note what to do with a dentry after unlink is merely a subset
of the general problem of what to do about negative entries.  I had a
look at it $elsewhere some years back and as one might suspect userspace
likes to do counterproductive shit. For example it is going to stat a
non-existent path 2-3 times and then open(..., O_CREAT) on it.

I don't have numbers handy and someone(tm) will need to re-evaluate, but
crux of the findings was as follows:
- there is a small subset of negative entries which keep getting tons of
  hits
- a sizeable count literally does not get any hits after being created
  (aka wastes memory)
- some negative entries get 2-3 hits and get converted into a positive
  entry afterwards (see that stat shitter)
- some flip flop with deletion/creation

So whatever magic mechanism, if it wants to mostly not get in the way in
terms of performance, will have to account for the above.

I ended up with a kludge where negative entries hang out on some number
of LRU lists and get promoted to a hot list if they manage to get some
number of hits. The hot list is merely a FIFO and entries there no
longer count any hits. Removal from the cold LRU also demotes an entry
from the hot list.

The total count is limited and if you want to create a negative dentry
you have to whack one from the LRU.

This is not perfect by any means but manages to succesfully separate the
high churn entries from the one which are likely to stay in the long
run. Definitely something to tinker with.

If I read the original problem correctly this would be sorted out as a
side effect by limiting how many entries are there to evict to begin
with.

I'm not signing up to do squat though. :)
Christian Brauner Sept. 12, 2024, 12:04 p.m. UTC | #3
On Thu, Sep 12, 2024 at 12:53:40PM GMT, Jan Kara wrote:
> On Thu 12-09-24 17:15:48, Yafang Shao wrote:
> > Commit 681ce8623567 ("vfs: Delete the associated dentry when deleting a
> > file") introduced an unconditional deletion of the associated dentry when a
> > file is removed. However, this led to performance regressions in specific
> > benchmarks, such as ilebench.sum_operations/s [0], prompting a revert in
> > commit 4a4be1ad3a6e ("Revert 'vfs: Delete the associated dentry when
> > deleting a file'").
> > 
> > This patch seeks to reintroduce the concept conditionally, where the
> > associated dentry is deleted only when the user explicitly opts for it
> > during file removal.
> > 
> > There are practical use cases for this proactive dentry reclamation.
> > Besides the Elasticsearch use case mentioned in commit 681ce8623567,
> > additional examples have surfaced in our production environment. For
> > instance, in video rendering services that continuously generate temporary
> > files, upload them to persistent storage servers, and then delete them, a
> > large number of negative dentries—serving no useful purpose—accumulate.
> > Users in such cases would benefit from proactively reclaiming these
> > negative dentries. This patch provides an API allowing users to actively
> > delete these unnecessary negative dentries.
> > 
> > Link: https://lore.kernel.org/linux-fsdevel/202405291318.4dfbb352-oliver.sang@intel.com [0]
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Al Viro <viro@zeniv.linux.org.uk>
> > Cc: Christian Brauner <brauner@kernel.org>
> > Cc: Jan Kara <jack@suse.cz>
> 
> Umm, I don't think we want to burn a FMODE flag and expose these details of
> dentry reclaim to userspace. BTW, if we wanted to do this, we already have
> d_mark_dontcache() for in-kernel users which we could presumably reuse.
> 
> But I would not completely give up on trying to handle this in an automated
> way inside the kernel. The original solution you mention above was perhaps
> too aggressive but maybe d_delete() could just mark the dentry with a
> "deleted" flag, retain_dentry() would move such dentries into a special LRU
> list which we'd prune once in a while (say once per 5 s). Also this list
> would be automatically pruned from prune_dcache_sb(). This way if there's
> quick reuse of a dentry, it will get reused and no harm is done, if it
> isn't quickly reused, we'll free them to not waste memory.
> 
> What do people think about such scheme?

I agree that a new open flag is not the right way to go and it also
wastes a PF_* flag.

I think it'll probably be rather difficult to ensure that we don't see
performance regressions for some benchmark. Plus there will be users
that want aggressive negative dentry reclaim being fine with possibly
increased lookup cost independent of the directory case.

So imo the simple option is to add 681ce8623567 back behind a sysctl
that users that need aggressive negative dentry reclaim can simply turn on.
Yafang Shao Sept. 12, 2024, 1:32 p.m. UTC | #4
On Thu, Sep 12, 2024 at 8:04 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Thu, Sep 12, 2024 at 12:53:40PM GMT, Jan Kara wrote:
> > On Thu 12-09-24 17:15:48, Yafang Shao wrote:
> > > Commit 681ce8623567 ("vfs: Delete the associated dentry when deleting a
> > > file") introduced an unconditional deletion of the associated dentry when a
> > > file is removed. However, this led to performance regressions in specific
> > > benchmarks, such as ilebench.sum_operations/s [0], prompting a revert in
> > > commit 4a4be1ad3a6e ("Revert 'vfs: Delete the associated dentry when
> > > deleting a file'").
> > >
> > > This patch seeks to reintroduce the concept conditionally, where the
> > > associated dentry is deleted only when the user explicitly opts for it
> > > during file removal.
> > >
> > > There are practical use cases for this proactive dentry reclamation.
> > > Besides the Elasticsearch use case mentioned in commit 681ce8623567,
> > > additional examples have surfaced in our production environment. For
> > > instance, in video rendering services that continuously generate temporary
> > > files, upload them to persistent storage servers, and then delete them, a
> > > large number of negative dentries—serving no useful purpose—accumulate.
> > > Users in such cases would benefit from proactively reclaiming these
> > > negative dentries. This patch provides an API allowing users to actively
> > > delete these unnecessary negative dentries.
> > >
> > > Link: https://lore.kernel.org/linux-fsdevel/202405291318.4dfbb352-oliver.sang@intel.com [0]
> > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > > Cc: Al Viro <viro@zeniv.linux.org.uk>
> > > Cc: Christian Brauner <brauner@kernel.org>
> > > Cc: Jan Kara <jack@suse.cz>
> >
> > Umm, I don't think we want to burn a FMODE flag and expose these details of
> > dentry reclaim to userspace. BTW, if we wanted to do this, we already have
> > d_mark_dontcache() for in-kernel users which we could presumably reuse.
> >
> > But I would not completely give up on trying to handle this in an automated
> > way inside the kernel. The original solution you mention above was perhaps
> > too aggressive but maybe d_delete() could just mark the dentry with a
> > "deleted" flag, retain_dentry() would move such dentries into a special LRU
> > list which we'd prune once in a while (say once per 5 s). Also this list
> > would be automatically pruned from prune_dcache_sb(). This way if there's
> > quick reuse of a dentry, it will get reused and no harm is done, if it
> > isn't quickly reused, we'll free them to not waste memory.
> >
> > What do people think about such scheme?
>
> I agree that a new open flag is not the right way to go and it also
> wastes a PF_* flag.
>
> I think it'll probably be rather difficult to ensure that we don't see
> performance regressions for some benchmark. Plus there will be users
> that want aggressive negative dentry reclaim being fine with possibly
> increased lookup cost independent of the directory case.
>
> So imo the simple option is to add 681ce8623567 back behind a sysctl
> that users that need aggressive negative dentry reclaim can simply turn on.

We are currently using the sysctl method on our production servers.
This feature has been deployed across many of our production
environments for several months without any reported regressions.
Matthew Wilcox Sept. 15, 2024, 10:50 p.m. UTC | #5
On Thu, Sep 12, 2024 at 01:36:45PM +0200, Mateusz Guzik wrote:
> I have to note what to do with a dentry after unlink is merely a subset
> of the general problem of what to do about negative entries.  I had a
> look at it $elsewhere some years back and as one might suspect userspace
> likes to do counterproductive shit. For example it is going to stat a
> non-existent path 2-3 times and then open(..., O_CREAT) on it.
> 
> I don't have numbers handy and someone(tm) will need to re-evaluate, but
> crux of the findings was as follows:
> - there is a small subset of negative entries which keep getting tons of
>   hits
> - a sizeable count literally does not get any hits after being created
>   (aka wastes memory)
> - some negative entries get 2-3 hits and get converted into a positive
>   entry afterwards (see that stat shitter)
> - some flip flop with deletion/creation
> 
> So whatever magic mechanism, if it wants to mostly not get in the way in
> terms of performance, will have to account for the above.
> 
> I ended up with a kludge where negative entries hang out on some number
> of LRU lists and get promoted to a hot list if they manage to get some
> number of hits. The hot list is merely a FIFO and entries there no
> longer count any hits. Removal from the cold LRU also demotes an entry
> from the hot list.

This all reminds me of that paper you pointed me at.

https://arxiv.org/pdf/1512.00727

Summary for the impatient: Use 10% of the memory for a "not yet proven
to be useful" entries, and use multiple Bloom filters to decide which
ones are sufficiently useful to be added to the "more permanent" cache.

I don't think it solves every problem our current dcache implementation
has, but I'm 90% sure it'll be a huge improvement.
diff mbox series

Patch

diff --git a/fs/dcache.c b/fs/dcache.c
index 3d8daaecb6d1..6d744b5e5a6c 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1667,7 +1667,10 @@  static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 	smp_store_release(&dentry->d_name.name, dname); /* ^^^ */
 
 	dentry->d_lockref.count = 1;
-	dentry->d_flags = 0;
+	if (current->flags & PF_REMOVE_DENTRY)
+		dentry->d_flags = DCACHE_FILE_REMOVE;
+	else
+		dentry->d_flags = 0;
 	spin_lock_init(&dentry->d_lock);
 	seqcount_spinlock_init(&dentry->d_seq, &dentry->d_lock);
 	dentry->d_inode = NULL;
@@ -2394,6 +2397,8 @@  void d_delete(struct dentry * dentry)
 	 * Are we the only user?
 	 */
 	if (dentry->d_lockref.count == 1) {
+		if (dentry->d_flags & DCACHE_FILE_REMOVE)
+			__d_drop(dentry);
 		dentry->d_flags &= ~DCACHE_CANT_MOUNT;
 		dentry_unlink_inode(dentry);
 	} else {
diff --git a/fs/open.c b/fs/open.c
index 22adbef7ecc2..3441a004a841 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1428,7 +1428,14 @@  static long do_sys_openat2(int dfd, const char __user *filename,
 long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
 {
 	struct open_how how = build_open_how(flags, mode);
-	return do_sys_openat2(dfd, filename, &how);
+	long err;
+
+	if (flags & O_NODENTRY)
+		current->flags |= PF_REMOVE_DENTRY;
+	err = do_sys_openat2(dfd, filename, &how);
+	if (flags & O_NODENTRY)
+		current->flags &= ~PF_REMOVE_DENTRY;
+	return err;
 }
 
 
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index bff956f7b2b9..82ba79bc0072 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -215,7 +215,7 @@  struct dentry_operations {
 
 #define DCACHE_NOKEY_NAME		BIT(25) /* Encrypted name encoded without key */
 #define DCACHE_OP_REAL			BIT(26)
-
+#define DCACHE_FILE_REMOVE		BIT(27) /* remove this dentry when file is removed */
 #define DCACHE_PAR_LOOKUP		BIT(28) /* being looked up (with parent locked shared) */
 #define DCACHE_DENTRY_CURSOR		BIT(29)
 #define DCACHE_NORCU			BIT(30) /* No RCU delay for freeing */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f8d150343d42..f931a3a882e0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1649,7 +1649,7 @@  extern struct pid *cad_pid;
 #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
 #define PF_USER_WORKER		0x00004000	/* Kernel thread cloned from userspace thread */
 #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
-#define PF__HOLE__00010000	0x00010000
+#define PF_REMOVE_DENTRY	0x00010000      /* Remove the dentry when the file is removed */
 #define PF_KSWAPD		0x00020000	/* I am kswapd */
 #define PF_MEMALLOC_NOFS	0x00040000	/* All allocations inherit GFP_NOFS. See memalloc_nfs_save() */
 #define PF_MEMALLOC_NOIO	0x00080000	/* All allocations inherit GFP_NOIO. See memalloc_noio_save() */
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 80f37a0d40d7..ca5f402d5e7d 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -89,6 +89,10 @@ 
 #define __O_TMPFILE	020000000
 #endif
 
+#ifndef O_NODENTRY
+#define O_NODENTRY     040000000
+#endif
+
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)