[RFC] fs: dcache: Delete the associated dentry when deleting a file

Message ID 20240511022729.35144-1-laoar.shao@gmail.com (mailing list archive)
State New
Series [RFC] fs: dcache: Delete the associated dentry when deleting a file

Commit Message

Yafang Shao May 11, 2024, 2:27 a.m. UTC
Our applications, built on Elasticsearch[0], frequently create and delete
files. These applications operate within containers, some with a memory
limit exceeding 100GB. Over prolonged periods, the accumulation of negative
dentries within these containers can amount to tens of gigabytes.

Upon container exit, directories are deleted. However, due to the numerous
associated dentries, this process can be time-consuming. Our users have
expressed frustration with this prolonged exit duration, which constitutes
our first issue.

Simultaneously, other processes may attempt to access the parent directory
of the Elasticsearch directories. Since the task responsible for deleting
the dentries holds the inode lock, processes attempting directory lookup
experience significant delays. This issue, our second problem, is easily
demonstrated:

  - Task 1 generates negative dentries: 
  $ pwd
  ~/test
  $ mkdir es && cd es/ && ./create_and_delete_files.sh

  [ After generating tens of GB of dentries ]

  $ cd ~/test && rm -rf es

  [ It will take a long time to finish ]

  - Task 2 attempts to lookup the 'test/' directory
  $ pwd
  ~/test
  $ ls

  The 'ls' command in Task 2 is blocked for a long time while Task 1
  is deleting the dentries.
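
The script itself is not included here; it is presumably just a tight
create-and-unlink loop. A hypothetical C equivalent is sketched below;
each create+unlink pair leaves one negative dentry cached in the parent
directory:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-in for create_and_delete_files.sh: create and
 * immediately unlink many uniquely named files in the current directory,
 * leaving one negative dentry behind per iteration. */
static void churn_files(long count)
{
    char name[64];

    for (long i = 0; i < count; i++) {
        int fd;

        snprintf(name, sizeof(name), "file-%ld", i);
        fd = open(name, O_CREAT | O_WRONLY, 0644);
        if (fd < 0)
            break;
        close(fd);
        unlink(name);
    }
}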

We've devised a solution to address both issues by deleting associated
dentry when removing a file. Interestingly, we've noted that a similar
patch was proposed years ago[1], although it was rejected citing the
absence of tangible issues caused by negative dentries. Given our current
challenges, we're resubmitting the proposal. All relevant stakeholders from
previous discussions have been included for reference.

[0]. https://github.com/elastic/elasticsearch
[1]. https://patchwork.kernel.org/project/linux-fsdevel/patch/1502099673-31620-1-git-send-email-wangkai86@huawei.com

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Wangkai <wangkai86@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Colin Walters <walters@verbum.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/dcache.c            | 4 ++++
 include/linux/dcache.h | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

Comments

Linus Torvalds May 11, 2024, 2:53 a.m. UTC | #1
On Fri, 10 May 2024 at 19:28, Yafang Shao <laoar.shao@gmail.com> wrote:
>
> We've devised a solution to address both issues by deleting associated
> dentry when removing a file.

This patch is buggy. You are modifying d_flags outside the locked region.

So at a minimum, the DCACHE_FILE_DELETED bit setting would need to
just go into the

        if (dentry->d_lockref.count == 1) {

side of the conditional, since the other side of that conditional
already unhashes the dentry which makes this all moot anyway.

That said, I think it's buggy in another way too: what if somebody
else looks up the dentry before it actually gets unhashed? Then you
have another ref to it, and the dentry might live long enough that it
then gets re-used for a newly created file (which is why we have those
negative dentries in the first place).

So you'd have to clear the DCACHE_FILE_DELETED if the dentry is then
made live by a file creation or rename or whatever.

So that d_flags thing is actually pretty complicated.

But since you made all this unconditional anyway, I think having a new
dentry flag is unnecessary in the first place, and I suspect you are
better off just unhashing the dentry unconditionally instead.

IOW, I think the simpler patch is likely just something like this:

  --- a/fs/dcache.c
  +++ b/fs/dcache.c
  @@ -2381,6 +2381,7 @@ void d_delete(struct dentry * dentry)

        spin_lock(&inode->i_lock);
        spin_lock(&dentry->d_lock);
  +     __d_drop(dentry);
        /*
         * Are we the only user?
         */
  @@ -2388,7 +2389,6 @@ void d_delete(struct dentry * dentry)
                dentry->d_flags &= ~DCACHE_CANT_MOUNT;
                dentry_unlink_inode(dentry);
        } else {
  -             __d_drop(dentry);
                spin_unlock(&dentry->d_lock);
                spin_unlock(&inode->i_lock);
        }

although I think Al needs to ACK this, and I suspect that unhashing
the dentry also makes that

                dentry->d_flags &= ~DCACHE_CANT_MOUNT;

pointless (because the dentry won't be reused, so DCACHE_CANT_MOUNT
just won't matter).

I do worry that there are loads that actually love our current
behavior, but maybe it's worth doing the simple unconditional "make
d_delete() always unhash" and only worry about whether that causes
performance problems for people who commonly create a new file in its
place when we get such a report.

IOW, the more complex thing might be to actually take other behavior
into account (eg "do we have so many negative dentries that we really
don't want to create new ones").

Al - can you please step in and tell us what else I've missed, and why
my suggested version of the patch is also broken garbage?

             Linus
Yafang Shao May 11, 2024, 3:35 a.m. UTC | #2
On Sat, May 11, 2024 at 10:54 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, 10 May 2024 at 19:28, Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > We've devised a solution to address both issues by deleting associated
> > dentry when removing a file.
>
> This patch is buggy. You are modifying d_flags outside the locked region.
>
> So at a minimum, the DCACHE_FILE_DELETED bit setting would need to
> just go into the
>
>         if (dentry->d_lockref.count == 1) {
>
> side of the conditional, since the other side of that conditional
> already unhashes the dentry which makes this all moot anyway.
>
> That said, I think it's buggy in another way too: what if somebody
> else looks up the dentry before it actually gets unhashed? Then you
> have another ref to it, and the dentry might live long enough that it
> then gets re-used for a newly created file (which is why we have those
> negative dentries in the first place).
>
> So you'd have to clear the DCACHE_FILE_DELETED if the dentry is then
> made live by a file creation or rename or whatever.
>
> So that d_flags thing is actually pretty complicated.
>
> But since you made all this unconditional anyway, I think having a new
> dentry flag is unnecessary in the first place, and I suspect you are
> better off just unhashing the dentry unconditionally instead.
>
> IOW, I think the simpler patch is likely just something like this:

It's simpler. I had contemplated handling it that way, but lacked the
knowledge and courage to proceed, hence I opted for the d_flags
solution.
I'll conduct tests on the revised change. I appreciate your suggestion.

>
>   --- a/fs/dcache.c
>   +++ b/fs/dcache.c
>   @@ -2381,6 +2381,7 @@ void d_delete(struct dentry * dentry)
>
>         spin_lock(&inode->i_lock);
>         spin_lock(&dentry->d_lock);
>   +     __d_drop(dentry);
>         /*
>          * Are we the only user?
>          */
>   @@ -2388,7 +2389,6 @@ void d_delete(struct dentry * dentry)
>                 dentry->d_flags &= ~DCACHE_CANT_MOUNT;
>                 dentry_unlink_inode(dentry);
>         } else {
>   -             __d_drop(dentry);
>                 spin_unlock(&dentry->d_lock);
>                 spin_unlock(&inode->i_lock);
>         }
>
> although I think Al needs to ACK this, and I suspect that unhashing
> the dentry also makes that
>
>                 dentry->d_flags &= ~DCACHE_CANT_MOUNT;
>
> pointless (because the dentry won't be reused, so DCACHE_CANT_MOUNT
> just won't matter).
>
> I do worry that there are loads that actually love our current
> behavior, but maybe it's worth doing the simple unconditional "make
> d_delete() always unhash" and only worry about whether that causes
> performance problems for people who commonly create a new file in its
> place when we get such a report.
>
> IOW, the more complex thing might be to actually take other behavior
> into account (eg "do we have so many negative dentries that we really
> don't want to create new ones").

This poses a substantial challenge. Despite recurrent discussions
within the community about improving negative dentry handling,
there hasn't been a consensus on how to address it.

>
> Al - can you please step in and tell us what else I've missed, and why
> my suggested version of the patch is also broken garbage?
>
>              Linus
Al Viro May 11, 2024, 3:36 a.m. UTC | #3
On Fri, May 10, 2024 at 07:53:49PM -0700, Linus Torvalds wrote:

> although I think Al needs to ACK this, and I suspect that unhashing
> the dentry also makes that
> 
>                 dentry->d_flags &= ~DCACHE_CANT_MOUNT;
> 
> pointless (because the dentry won't be reused, so DCACHE_CANT_MOUNT
> just won't matter).
> 
> I do worry that there are loads that actually love our current
> behavior, but maybe it's worth doing the simple unconditional "make
> d_delete() always unhash" and only worry about whether that causes
> performance problems for people who commonly create a new file in its
> place when we get such a report.
> 
> IOW, the more complex thing might be to actually take other behavior
> into account (eg "do we have so many negative dentries that we really
> don't want to create new ones").
> 
> Al - can you please step in and tell us what else I've missed, and why
> my suggested version of the patch is also broken garbage?

Need to RTFS and think for a while; I think it should be OK, but I'll need
to dig through the tree to tell if there's anything nasty for e.g. network
filesystems.

Said that, I seriously suspect that there are loads where it would become
painful.  unlink() + creat() is _not_ a rare sequence, and this would
shove an extra negative lookup into each of those.

I would like to see the details on original posters' setup.  Note that
successful rmdir() evicts all children, so that it would seem that their
arseloads of negative dentries come from a bunch of unlinks in surviving
directories.

It would be interesting to see if using something like
	mkdir graveyard
	rename victim over there, then unlink in new place
	rename next victim over there, then unlink in new place
	....
	rmdir graveyard
would change the situation with memory pressure - it would trigger
eviction of all those negatives at controlled point(s) (rmdir).
I'm not saying that it's a good mitigation, but it would get more
details on that memory pressure.
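
In userspace terms, a rough sketch of that pattern (illustrative only,
error handling mostly omitted) would be:

#include <stdio.h>      /* rename(), snprintf() */
#include <sys/stat.h>   /* mkdir() */
#include <unistd.h>     /* unlink(), rmdir() */

/* Unlink files via a temporary "graveyard" directory; the final rmdir()
 * is the controlled point where all the negative dentries left behind
 * by the unlinks get evicted. */
static int unlink_via_graveyard(const char *victims[], int n)
{
    char grave[4096];

    if (mkdir("graveyard", 0700))
        return -1;
    for (int i = 0; i < n; i++) {
        snprintf(grave, sizeof(grave), "graveyard/%d", i);
        if (rename(victims[i], grave) || unlink(grave))
            return -1;
    }
    return rmdir("graveyard");
}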
Yafang Shao May 11, 2024, 3:51 a.m. UTC | #4
On Sat, May 11, 2024 at 11:36 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, May 10, 2024 at 07:53:49PM -0700, Linus Torvalds wrote:
>
> > although I think Al needs to ACK this, and I suspect that unhashing
> > the dentry also makes that
> >
> >                 dentry->d_flags &= ~DCACHE_CANT_MOUNT;
> >
> > pointless (because the dentry won't be reused, so DCACHE_CANT_MOUNT
> > just won't matter).
> >
> > I do worry that there are loads that actually love our current
> > behavior, but maybe it's worth doing the simple unconditional "make
> > d_delete() always unhash" and only worry about whether that causes
> > performance problems for people who commonly create a new file in its
> > place when we get such a report.
> >
> > IOW, the more complex thing might be to actually take other behavior
> > into account (eg "do we have so many negative dentries that we really
> > don't want to create new ones").
> >
> > Al - can you please step in and tell us what else I've missed, and why
> > my suggested version of the patch is also broken garbage?
>
> Need to RTFS and think for a while; I think it should be OK, but I'll need
> to dig through the tree to tell if there's anything nasty for e.g. network
> filesystems.
>
> Said that, I seriously suspect that there are loads where it would become
> painful.  unlink() + creat() is _not_ a rare sequence, and this would
> shove an extra negative lookup into each of those.
>
> I would like to see the details on original posters' setup.  Note that
> successful rmdir() evicts all children, so that it would seem that their
> arseloads of negative dentries come from a bunch of unlinks in surviving
> directories.

Right, the directories are fixed. We've engaged in discussions with ES
users regarding changing the directory, but it would entail a
significant adjustment for them.

>
> It would be interesting to see if using something like
>         mkdir graveyard
>         rename victim over there, then unlink in new place
>         rename next victim over there, then unlink in new place
>         ....
>         rmdir graveyard
> would change the situation with memory pressure - it would trigger
> eviction of all those negatives at controlled point(s) (rmdir).
> I'm not saying that it's a good mitigation, but it would get more
> details on that memory pressure.
Waiman Long May 11, 2024, 4:54 a.m. UTC | #5
On 5/10/24 23:35, Yafang Shao wrote:
>> pointless (because the dentry won't be reused, so DCACHE_CANT_MOUNT
>> just won't matter).
>>
>> I do worry that there are loads that actually love our current
>> behavior, but maybe it's worth doing the simple unconditional "make
>> d_delete() always unhash" and only worry about whether that causes
>> performance problems for people who commonly create a new file in its
>> place when we get such a report.
>>
>> IOW, the more complex thing might be to actually take other behavior
>> into account (eg "do we have so many negative dentries that we really
>> don't want to create new ones").
> This poses a substantial challenge. Despite recurrent discussions
> within the community about improving negative dentry handling,
> there hasn't been a consensus on how to address it.

I had suggested in the past to have a sysctl parameter to set a 
threshold on how many negative dentries (as a percentage of total system 
memory) are regarded as too many and activate some processes to control 
dentry creation. The hard part is discarding the oldest negative
dentries, as we can have many memcg LRU lists to look at and it can be a
time-consuming process. We could theoretically start removing dentries
on file removal when 50% of the threshold is reached. At 100%, we start
discarding old negative dentries. At 150%, we stop negative dentry
creation altogether.
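
In rough pseudo-C, that staged policy would be something like the
following (purely illustrative; the names and the threshold plumbing are
made up):

enum neg_dentry_action {
    NEG_KEEP,            /* below 50% of the threshold: business as usual */
    NEG_DROP_ON_UNLINK,  /* >= 50%: drop the dentry when its file is removed */
    NEG_RECLAIM_OLD,     /* >= 100%: start discarding old negative dentries */
    NEG_BLOCK_NEW,       /* >= 150%: stop creating negative dentries */
};

static enum neg_dentry_action neg_dentry_policy(unsigned long nr_negative,
                                                unsigned long threshold)
{
    if (nr_negative >= threshold + threshold / 2)
        return NEG_BLOCK_NEW;
    if (nr_negative >= threshold)
        return NEG_RECLAIM_OLD;
    if (nr_negative >= threshold / 2)
        return NEG_DROP_ON_UNLINK;
    return NEG_KEEP;
}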

My 2 cents.

Cheers,
Longman
Al Viro May 11, 2024, 5:18 a.m. UTC | #6
On Sat, May 11, 2024 at 04:36:19AM +0100, Al Viro wrote:

> Said that, I seriously suspect that there are loads where it would become
> painful.  unlink() + creat() is _not_ a rare sequence, and this would
> shove an extra negative lookup into each of those.
> 
> I would like to see the details on original posters' setup.  Note that
> successful rmdir() evicts all children, so that it would seem that their
> arseloads of negative dentries come from a bunch of unlinks in surviving
> directories.
> 
> It would be interesting to see if using something like
> 	mkdir graveyard
> 	rename victim over there, then unlink in new place
> 	rename next victim over there, then unlink in new place
> 	....
> 	rmdir graveyard
> would change the situation with memory pressure - it would trigger
> eviction of all those negatives at controlled point(s) (rmdir).
> I'm not saying that it's a good mitigation, but it would get more
> details on that memory pressure.

BTW, how about adding to do_vfs_ioctl() something like
	case FS_IOC_FORGET:
		shrink_dcache_parent(file->f_path.dentry);
		return 0;
possibly with option for dropping only negatives?

Even in the minimal form it would allow userland to force eviction
in given directory tree, without disrupting things anywhere else.
That's a trivial (completely untested) patch, really -

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 1d5abfdf0f22..342bb71cf76c 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -878,6 +878,12 @@ static int do_vfs_ioctl(struct file *filp, unsigned int fd,
 	case FS_IOC_GETFSSYSFSPATH:
 		return ioctl_get_fs_sysfs_path(filp, argp);
 
+	case FS_IOC_FORGET:
+		if (arg != 0)	// only 0 for now
+			break;
+		shrink_dcache_parent(filp->f_path.dentry);
+		return 0;
+
 	default:
 		if (S_ISREG(inode->i_mode))
 			return file_ioctl(filp, cmd, argp);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 45e4e64fd664..143129510289 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -222,6 +222,7 @@ struct fsxattr {
 
 #define	FS_IOC_GETFLAGS			_IOR('f', 1, long)
 #define	FS_IOC_SETFLAGS			_IOW('f', 2, long)
+#define	FS_IOC_FORGET			_IOW('f', 255, int)
 #define	FS_IOC_GETVERSION		_IOR('v', 1, long)
 #define	FS_IOC_SETVERSION		_IOW('v', 2, long)
 #define FS_IOC_FIEMAP			_IOWR('f', 11, struct fiemap)
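
From userspace, assuming the FS_IOC_FORGET proposed above were actually
merged (it is only a draft here), forcing eviction under one directory
tree would look roughly like:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Mirrors the definition in the draft above; not in mainline headers. */
#ifndef FS_IOC_FORGET
#define FS_IOC_FORGET _IOW('f', 255, int)
#endif

int main(int argc, char **argv)
{
    /* The draft only accepts arg == 0 for now: drop the whole subtree. */
    int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);

    if (fd < 0 || ioctl(fd, FS_IOC_FORGET, 0)) {
        perror("FS_IOC_FORGET");
        return 1;
    }
    close(fd);
    return 0;
}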
Linus Torvalds May 11, 2024, 5:32 a.m. UTC | #7
On Fri, 10 May 2024 at 20:36, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Said that, I seriously suspect that there are loads where it would become
> painful.  unlink() + creat() is _not_ a rare sequence, and this would
> shove an extra negative lookup into each of those.

The other side of this is that the "lots of negative dentries after
people have removed files" is definitely something that has come up
before as a latency problem.

So I do wonder if we've just been too worried about not changing the
status quo, and maybe the whole "make unlink() turn a positive dentry
into a negative one" was always a mis-optimization.

I do agree that we might have to do something more complicated, but I
think that before we just _assume_ we have to do that, maybe we should
just do the simple and stupid thing.

Because while "unlink and create" is most definitely a very real
pattern, maybe it's not really _so_ common that we should care about
it as a primary issue?

The reason to keep the negative dentry around is to avoid the
unnecessary "->lookup()" followed by "->create()" directory
operations.

But network filesystems - where that is _particularly_ expensive - end
up having done the whole ->atomic_open thing anyway, so maybe that
reason isn't as big of a deal as it used to be?

And I don't particularly like your "give people a flush ioctl" either.
There are security concerns with that one - both for timing and for
just basically messing with performance.

At the very least, I think it should require write permissions on the
directory you are flushing, but I really think we should instead aim
to be better about this in the first place.

Anyway, in how many loads is that "unlink -> create" pattern more than
a very occasional thing? Which is why I think maybe we're just
overthinking this and being too timid when saying "we should count
negative dentries or something".

                 Linus
Matthew Wilcox May 11, 2024, 3:58 p.m. UTC | #8
On Sat, May 11, 2024 at 12:54:18AM -0400, Waiman Long wrote:
> I had suggested in the past to have a sysctl parameter to set a threshold on
> how many negative dentries (as a percentage of total system memory) are

Yes, but that's obviously bogus.  System memory is completely
uncorrelated with the optimum number of negative dentries to keep
around for workload performance.

Some multiple of "number of positive dentries" might make sense.
Linus Torvalds May 11, 2024, 4:07 p.m. UTC | #9
On Sat, 11 May 2024 at 08:59, Matthew Wilcox <willy@infradead.org> wrote:
>
> Some multiple of "number of positive dentries" might make sense.

The thing is, it would probably have to be per parent-dentry to be
useful, since that is typically what causes latency concerns: not
"global negative dentries", but "negative dentries that have to be
flushed for this parent as it is deleted".

And you'd probably need to make the child list be some kind of LRU
list to make this all effective.

Which is all likely rather expensive in both CPU time (stats tend to
cause lots of dirty caches) and in memory (the dentry would grow for
both stats and for the d_children list - it would almost certainly
have to change from a 'struct hlist_head' to a 'struct list_head' so
that you can put things either at the beginning or the end).

I dunno. I haven't looked at just *how* nasty it would be. Maybe it
wouldn't be as bad as I suspect it would be.

Now, the other option might be to just make the latency concerns
smaller. It's not like removing negative dentries is very costly per
se. I think the issue has always been the dcache_lock, not the work to
remove the dentries themselves.

                Linus
Linus Torvalds May 11, 2024, 4:13 p.m. UTC | #10
On Sat, 11 May 2024 at 09:07, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Now, the other option might be to just make the latency concerns
> smaller. It's not like removing negative dentries is very costly per
> se. I think the issue has always been the dcache_lock, not the work to
> remove the dentries themselves.

Actually, going back to re-read this particular report, at least this
time it was the inode lock of the parent, not the dcache_lock.

But the point ends up being the same - lots of negative dentries
aren't necessarily a problem in themselves, because the common
operation that matters is the hash lookup, which scales fairly well.

They mainly tend to become a problem when they hold up something else.

So better batching, or maybe just walking the negative child dentry
list after having marked the parent dead and then released the lock,
might also be the solution.

                      Linus
Linus Torvalds May 11, 2024, 6:05 p.m. UTC | #11
On Sat, 11 May 2024 at 09:13, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So better batching, or maybe just walking the negative child dentry
> list after having marked the parent dead and then released the lock,
> might also be the solution.

IOW, maybe the solution is something as simple as this UNTESTED patch instead?

  --- a/fs/namei.c
  +++ b/fs/namei.c
  @@ -4207,16 +4207,19 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
        if (error)
                goto out;

  -     shrink_dcache_parent(dentry);
        dentry->d_inode->i_flags |= S_DEAD;
        dont_mount(dentry);
        detach_mounts(dentry);
  +     inode_unlock(dentry->d_inode);
  +
  +     shrink_dcache_parent(dentry);
  +     dput(dentry);
  +     d_delete_notify(dir, dentry);
  +     return 0;

   out:
        inode_unlock(dentry->d_inode);
        dput(dentry);
  -     if (!error)
  -             d_delete_notify(dir, dentry);
        return error;
   }
   EXPORT_SYMBOL(vfs_rmdir);

where that "shrink_dcache_parent()" will still be quite expensive if
the directory has a ton of negative dentries, but because we now free
them after we've marked the dentry dead and released the inode, nobody
much cares any more?

Note (as usual) that this is untested, and maybe I haven't thought
everything through and I might be missing some important detail.

But I think shrinking the children later should be just fine - there's
nothing people can *do* with them. At worst they are reachable for
some lookup that doesn't take the locks, but that will just result in
a negative dentry which is all good, since they cannot become positive
dentries any more at this point.

So I think the inode lock is irrelevant for negative dentries, and
shrinking them outside the lock feels like a very natural thing to do.

Again: this is more of a "brainstorming patch" than an actual
suggestion. Yafang - it might be worth testing for your load, but
please do so knowing that it might have some consistency issues, so
don't test it on any production machinery, please ;)

Can anybody point out why I'm being silly, and the above change is
completely broken garbage? Please use small words to point out my
mental deficiencies to make sure I get it.

Just to clarify: this obviously will *not* speed up the actual rmdir
itself. *ALL* it does is to avoid holding the lock over the
potentially long cleanup operation. So the latency of the rmdir is
still the same, but now it should no longer matter for anything else.

            Linus
Yafang Shao May 15, 2024, 2:18 a.m. UTC | #12
On Sat, May 11, 2024 at 11:35 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Sat, May 11, 2024 at 10:54 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Fri, 10 May 2024 at 19:28, Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > We've devised a solution to address both issues by deleting associated
> > > dentry when removing a file.
> >
> > This patch is buggy. You are modifying d_flags outside the locked region.
> >
> > So at a minimum, the DCACHE_FILE_DELETED bit setting would need to
> > just go into the
> >
> >         if (dentry->d_lockref.count == 1) {
> >
> > side of the conditional, since the other side of that conditional
> > already unhashes the dentry which makes this all moot anyway.
> >
> > That said, I think it's buggy in another way too: what if somebody
> > else looks up the dentry before it actually gets unhashed? Then you
> > have another ref to it, and the dentry might live long enough that it
> > then gets re-used for a newly created file (which is why we have those
> > negative dentries in the first place).
> >
> > So you'd have to clear the DCACHE_FILE_DELETED if the dentry is then
> > made live by a file creation or rename or whatever.
> >
> > So that d_flags thing is actually pretty complicated.
> >
> > But since you made all this unconditional anyway, I think having a new
> > dentry flag is unnecessary in the first place, and I suspect you are
> > better off just unhashing the dentry unconditionally instead.
> >
> > IOW, I think the simpler patch is likely just something like this:
>
> It's simpler. I had contemplated handling it that way, but lacked the
> knowledge and courage to proceed, hence I opted for the d_flags
> solution.
> I'll conduct tests on the revised change. I appreciate your suggestion.
>

We have successfully applied a hotfix to a subset of our production
servers, totaling several thousand. The hotfix is as follows:

diff --git a/fs/dcache.c b/fs/dcache.c
index 52e6d5f..30eb733 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2557,14 +2557,14 @@ void d_delete(struct dentry * dentry)

        spin_lock(&inode->i_lock);
        spin_lock(&dentry->d_lock);
+       __d_drop(dentry);
+
        /*
         * Are we the only user?
         */
        if (dentry->d_lockref.count == 1) {
-               dentry->d_flags &= ~DCACHE_CANT_MOUNT;
                dentry_unlink_inode(dentry);
        } else {
-               __d_drop(dentry);
                spin_unlock(&dentry->d_lock);
                spin_unlock(&inode->i_lock);
        }

So far, it has been functioning well without any regressions. We are
planning to roll this update out to our entire fleet, which consists
of hundreds of thousands of servers.

I believe this change is still necessary. Would you prefer to commit
it directly, or should I send an official patch?

If the "unlink-create" issue is a concern, perhaps we can address it
by adding a /sys/kernel/debug/vfs/delete_file_legacy entry?

> >
> >   --- a/fs/dcache.c
> >   +++ b/fs/dcache.c
> >   @@ -2381,6 +2381,7 @@ void d_delete(struct dentry * dentry)
> >
> >         spin_lock(&inode->i_lock);
> >         spin_lock(&dentry->d_lock);
> >   +     __d_drop(dentry);
> >         /*
> >          * Are we the only user?
> >          */
> >   @@ -2388,7 +2389,6 @@ void d_delete(struct dentry * dentry)
> >                 dentry->d_flags &= ~DCACHE_CANT_MOUNT;
> >                 dentry_unlink_inode(dentry);
> >         } else {
> >   -             __d_drop(dentry);
> >                 spin_unlock(&dentry->d_lock);
> >                 spin_unlock(&inode->i_lock);
> >         }
> >
> > although I think Al needs to ACK this, and I suspect that unhashing
> > the dentry also makes that
> >
> >                 dentry->d_flags &= ~DCACHE_CANT_MOUNT;
> >
> > pointless (because the dentry won't be reused, so DCACHE_CANT_MOUNT
> > just won't matter).
> >
> > I do worry that there are loads that actually love our current
> > behavior, but maybe it's worth doing the simple unconditional "make
> > d_delete() always unhash" and only worry about whether that causes
> > performance problems for people who commonly create a new file in its
> > place when we get such a report.
> >
> > IOW, the more complex thing might be to actually take other behavior
> > into account (eg "do we have so many negative dentries that we really
> > don't want to create new ones").
>
> This poses a substantial challenge. Despite recurrent discussions
> within the community about improving negative dentry handling,
> there hasn't been a consensus on how to address it.
>
> >
> > Al - can you please step in and tell us what else I've missed, and why
> > my suggested version of the patch is also broken garbage?
> >
> >              Linus
>
>
> --
> Regards
> Yafang



--
Regards
Yafang
Linus Torvalds May 15, 2024, 2:36 a.m. UTC | #13
On Tue, 14 May 2024 at 19:19, Yafang Shao <laoar.shao@gmail.com> wrote:
>
> I believe this change is still necessary. Would you prefer to commit
> it directly, or should I send an official patch?

Sending an official patch just for people to try out sounds good
regardless, but I'd really like some real performance testing on other
loads too before just taking it.

It *may* be acceptable, and it's certainly simple. It's also almost
certainly safe from a correctness angle.

But it might regress performance on other important loads even if it
fixes an issue on your load.

That's why we had the whole discussion about alternatives.

I'm still of the opinion that we should probably *try* this simple
approach, but I really would hope we could have some test tree that
people run a lot of benchmarks on.

Does anybody do filesystem benchmarking on the mm tree? Or do we have
some good tree to just give it a good testing? It feels a bit
excessive to put it in my development tree just to get some
performance testing coverage, but maybe that's what we have to do..

                 Linus

Patch

diff --git a/fs/dcache.c b/fs/dcache.c
index 71a8e943a0fa..4b97f60f0e64 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -701,6 +701,9 @@  static inline bool retain_dentry(struct dentry *dentry, bool locked)
 	if (unlikely(d_flags & DCACHE_DONTCACHE))
 		return false;
 
+	if (unlikely(dentry->d_flags & DCACHE_FILE_DELETED))
+		return false;
+
 	// At this point it looks like we ought to keep it.  We also might
 	// need to do something - put it on LRU if it wasn't there already
 	// and mark it referenced if it was on LRU, but not marked yet.
@@ -2392,6 +2395,7 @@  void d_delete(struct dentry * dentry)
 		spin_unlock(&dentry->d_lock);
 		spin_unlock(&inode->i_lock);
 	}
+	dentry->d_flags |= DCACHE_FILE_DELETED;
 }
 EXPORT_SYMBOL(d_delete);
 
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index bf53e3894aae..55a69682918c 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -210,7 +210,7 @@  struct dentry_operations {
 
 #define DCACHE_NOKEY_NAME		BIT(25) /* Encrypted name encoded without key */
 #define DCACHE_OP_REAL			BIT(26)
-
+#define DCACHE_FILE_DELETED		BIT(27) /* File is deleted */
 #define DCACHE_PAR_LOOKUP		BIT(28) /* being looked up (with parent locked shared) */
 #define DCACHE_DENTRY_CURSOR		BIT(29)
 #define DCACHE_NORCU			BIT(30) /* No RCU delay for freeing */