
eventfs: Have inodes have unique inode numbers

Message ID 20240126150209.367ff402@gandalf.local.home (mailing list archive)
State Handled Elsewhere

Commit Message

Steven Rostedt Jan. 26, 2024, 8:02 p.m. UTC
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

Linus suggested using the same inode number for all files and directories to
make getdents() easier to implement, as eventfs was creating inodes just to
generate unique and consistent inode numbers.

Later it was discovered that having directories with the same inode number
would mess up the "find" command, but Linus found that on 64-bit machines,
there was a hole in the eventfs_inode structure due to alignment that
could be used to store the inode numbers for directories. That fixed the
directory issue, but the files still shared a single inode number.

The 'tar' command uses inode numbers to determine uniqueness between
files, which sharing an inode number would break. Currently, tar is broken
with tracefs because all files show a stat of zero size and tar doesn't copy
anything. But because tar cares about inode numbers, there could be other
applications that do too. It's best to have all files have unique inode
numbers.

Copy get_next_ino() to a new tracefs_get_next_ino() that takes a "files"
parameter. As eventfs directories have a fixed number of files within
them, the number of inodes needed for the eventfs directory files is known
when the directory is created. The tracefs_get_next_ino() will return a
new inode number but also reserve the next "files" inode numbers that the
caller is free to use. Then when an inode for a file is created, its inode
number will be its parent directory's inode number plus the index into the
file array of that directory, giving each file a unique inode number that
can be retrieved at any time.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 fs/tracefs/event_inode.c | 31 +++++++++++++++----------------
 fs/tracefs/inode.c       | 37 ++++++++++++++++++++++++++++++++++---
 fs/tracefs/internal.h    |  1 +
 3 files changed, 50 insertions(+), 19 deletions(-)

Comments

Linus Torvalds Jan. 26, 2024, 8:25 p.m. UTC | #1
Steven,
 stop making things more complicated than they need to be.

And dammit, STOP COPYING VFS LAYER FUNCTIONS.

It was a bad idea last time, it's a horribly bad idea this time too.

I'm not taking this kind of crap.

The whole "get_next_ino()" should be "atomic64_add_return()". End of story.

You aren't special. If the VFS functions don't work for you, you don't
use them, but dammit, you also don't then steal them without
understanding what they do, and why they were necessary.

The reason get_next_ino() is critical is because it's used by things
like pipes and sockets etc that get created at high rates, and the
inode numbers most definitely do not get cached.

You copied that function without understanding why it does what it
does, and as a result your code IS GARBAGE.

AGAIN.

Honestly, kill this thing with fire. It was a bad idea. I'm putting my
foot down, and you are *NOT* doing unique regular file inode numbers
until somebody points to a real problem.

Because this whole "I make up problems, and then I write overly
complicated crap code to solve them" has to stop.

No more. This stops here.

I don't want to see a single eventfs patch that doesn't have a real
bug report associated with it. And the next time I see you copying VFS
functions (or any other core functions) without understanding what the
f*ck they do, and why they do it, I'm going to put you in my
spam-filter for a week.

I'm done. I'm really *really* tired of having to look at eventfs garbage.

              Linus
Steven Rostedt Jan. 26, 2024, 9:26 p.m. UTC | #2
On Fri, 26 Jan 2024 12:25:05 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Steven,
>  stop making things more complicated than they need to be.
> 
> And dammit, STOP COPYING VFS LAYER FUNCTIONS.
> 
> It was a bad idea last time, it's a horribly bad idea this time too.
> 
> I'm not taking this kind of crap.
> 
> The whole "get_next_ino()" should be "atomic64_add_return()". End of story.

I originally wrote it that way, and thought to myself that the VFS version
is "faster" and switched to that.

My fault for being too much into micro-optimizations.

> 
> You aren't special. If the VFS functions don't work for you, you don't
> use them, but dammit, you also don't then steal them without
> understanding what they do, and why they were necessary.
> 
> The reason get_next_ino() is critical is because it's used by things
> like pipes and sockets etc that get created at high rates, and the
> inode numbers most definitely do not get cached.

Yes, I understood why it's optimized, and took it because it's been there
since 2010 and figured it's pretty solid.

> 
> You copied that function without understanding why it does what it
> does, and as a result your code IS GARBAGE.
> 
> AGAIN.
> 
> Honestly, kill this thing with fire. It was a bad idea. I'm putting my
> foot down, and you are *NOT* doing unique regular file inode numbers
> until somebody points to a real problem.
> 
> Because this whole "I make up problems, and then I write overly
> complicated crap code to solve them" has to stop.

If I had just used the atomic_add_return(), is it really that overly
complicated? Yes, I copied from VFS because I figured if they put in the
effort to make it faster, then why not use that, even though it was overkill.

> 
> No more. This stops here.
> 
> I don't want to see a single eventfs patch that doesn't have a real
> bug report associated with it. And the next time I see you copying VFS
> functions (or any other core functions) without understanding what the
> f*ck they do, and why they do it, I'm going to put you in my
> spam-filter for a week.
> 
> I'm done. I'm really *really* tired of having to look at eventfs garbage.

So we keep the same inode number until something breaks with it, even
though using unique ones is not that complicated?

I'd be happy to change that patch to what I originally did before deciding
to copy get_next_ino():

unsigned int tracefs_get_next_ino(int files)
{
	static atomic_t next_inode;
	unsigned int res;

	do {
		res = atomic_add_return(files + 1, &next_inode);

		/* Check for overflow */
	} while (unlikely(res < files + 1));

	return res - files;
}

If I knew going back and copying over get_next_ino() was going to piss you
off so much, I wouldn't have done that.

Not to mention that the current code calls into get_next_ino() and then
throws the result away. That is, eventfs gets its inode structure from
tracefs, which assigns the inode number using the VFS get_next_ino(). That
number gets thrown away by the single inode assignment. This just makes it
more likely that the global get_next_ino() counter will overflow due to
eventfs, even though eventfs isn't even using the numbers.

I only did the one inode number because that's what you wanted. Is it that
you want to move away from having inode numbers completely? At least for
pseudo file systems? If that's the case, then we can look to get people to
start doing that. First it would be fixing tools like 'tar' to ignore the
inode numbers.

Otherwise, I'd really rather keep it the way it has always been. That is,
each file has its own unique inode number, and not have to deal with some
strange bug report because it's not. Is there another file system that has
just one inode number?

-- Steve
Linus Torvalds Jan. 26, 2024, 9:31 p.m. UTC | #3
On Fri, 26 Jan 2024 at 13:26, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> So we keep the same inode number until something breaks with it, even
> though, using unique ones is not that complicated?

Using unique ones for directories was a trivial cleanup.

The file case is clearly different. I thought it would be the same
trivial one-liner, but nope.

When you have to add 30 lines of code just to get unique inode numbers
that nobody has shown any interest in, it's 30 lines too much.

And when it happens in a filesystem that has a history of copying code
from the VFS layer and having nasty bugs, it's *definitely* too much.

Simplify. If you can clean things up and we have a few releases of
not-horrendous-bugs-every-other-day, I may change my mind.

As it is, I feel like I have to waste my time checking all your
patches, and I'm saying "it's not worth it".

               Linus
Linus Torvalds Jan. 26, 2024, 9:36 p.m. UTC | #4
On Fri, 26 Jan 2024 at 13:26, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I'd be happy to change that patch to what I originally did before deciding
> to copy get_next_ino():
>
> unsigned int tracefs_get_next_ino(int files)
> {
>         static atomic_t next_inode;
>         unsigned int res;
>
>         do {
>                 res = atomic_add_return(files + 1, &next_inode);
>
>                 /* Check for overflow */
>         } while (unlikely(res < files + 1));
>
>         return res - files;

Still entirely pointless.

If you have more than 4 billion inodes, something is really really wrong.

So why is it ten lines instead of one?

Dammit, this is a number that NOBODY HAS SHOWN IS EVEN WORTH EXISTING
IN THE FIRST PLACE.

So no. I'm not taking this. End of discussion. My point stands: I want
this filesystem *stabilized*, and in a sane format.

Look to *simplify* things. Send me patches that *remove* complexity,
not add new complexity that you have zero evidence is worth it.

Face it, eventfs isn't some kind of "real filesystem". It shouldn't
even attempt to look like one.

If somebody goes "I want to tar this thing up", you should laugh in
their face and call them names, not say "sure, let me whip up a
50-line patch to make this fragile thing even more complex".

            Linus
Steven Rostedt Jan. 26, 2024, 9:42 p.m. UTC | #5
On Fri, 26 Jan 2024 13:36:20 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 26 Jan 2024 at 13:26, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > I'd be happy to change that patch to what I originally did before deciding
> > to copy get_next_ino():
> >
> > unsigned int tracefs_get_next_ino(int files)
> > {
> >         static atomic_t next_inode;
> >         unsigned int res;
> >
> >         do {
> >                 res = atomic_add_return(files + 1, &next_inode);
> >
> >                 /* Check for overflow */
> >         } while (unlikely(res < files + 1));
> >
> >         return res - files;  
> 
> Still entirely pointless.
> 
> If you have more than 4 billion inodes, something is really really wrong.

No, but you can trivially make a loop that creates and destroys
directories that will eventually overflow the count.

-- Steve
Steven Rostedt Jan. 26, 2024, 9:43 p.m. UTC | #6
On Fri, 26 Jan 2024 13:31:01 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> As it is, I feel like I have to waste my time checking all your
> patches, and I'm saying "it's not worth it".

I'm basically done with this. I never said I was a VFS guy and I
learned a lot doing this. I had really nobody to look at my code even
though most of it went to the fsdevel list. Nobody said I was doing it
wrong.

Sorry to have wasted your time.

-- Steve
Linus Torvalds Jan. 26, 2024, 9:49 p.m. UTC | #7
On Fri, 26 Jan 2024 at 13:36, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> If you have more than 4 billion inodes, something is really really wrong.

Btw, once again, the vfs layer function you took this from *does* have
some reason to worry. Somebody might be doing 'pipe()' in a loop.

Also, if your worry is "what if somebody mounts that thing a million
times", the solution to *that* would have been to make it a per-sb
counter, which I think would be cleaner anyway.

But my real issue is that I think you would be *much* better off just
deleting code, instead of adding new code.

For example, what purpose does 'ei->dentry' and 'ei->d_children[]' have?
Isn't that entirely a left-over from the bad old days?

So please try to look at things to *fix* and simplify, not at things
to mess around with and make more complicated.

              Linus
Steven Rostedt Jan. 26, 2024, 10:08 p.m. UTC | #8
On January 26, 2024 4:49:13 PM EST, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Fri, 26 Jan 2024 at 13:36, Linus Torvalds
><torvalds@linux-foundation.org> wrote:
>>
>> If you have more than 4 billion inodes, something is really really wrong.
>
>Btw, once again, the vfs layer function you took this from *does* have
>some reason to worry. Somebody might be doing 'pipe()' in a loop.
>
>Also, if your worry is "what if somebody mounts that thing a million
>times", the solution to *that* would have been to make it a per-sb
>counter, which I think would be cleaner anyway.
>

I'm more worried about a loop of:

cd /sys/kernel/tracing/instances
while :; do mkdir foo; rmdir foo; done

Which is what my tests do. And I have run that for over a weekend.


>But my real issue is that I think you would be *much* better off just
>deleting code, instead of adding new code.
>
>For example, what purpose does 'ei->dentry' and 'ei->d_children[]' have?
>Isn't that entirely a left-over from the bad old days?
>

I'm not at my computer, but when I tried deleting that, it caused issues with the lookup code.

-- Steve 

>So please try to look at things to *fix* and simplify, not at things
>to mess around with and make more complicated.
>
>              Linus
Mathieu Desnoyers Jan. 26, 2024, 10:14 p.m. UTC | #9
On 2024-01-26 16:49, Linus Torvalds wrote:
> On Fri, 26 Jan 2024 at 13:36, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
[...]
> So please try to look at things to *fix* and simplify, not at things
> to mess around with and make more complicated.

Hi Linus,

I'm all aboard with making things as simple as possible and
making sure no complexity is added for the sake of micro-optimization
of slow-paths.

I do however have a concern with the approach of using the same
inode number for various files on the same filesystem: AFAIU it
breaks userspace ABI expectations. See inode(7) for instance:

        Inode number
               stat.st_ino; statx.stx_ino

               Each  file in a filesystem has a unique inode number.  Inode numbers
               are guaranteed to be unique only within a filesystem (i.e., the same
               inode  numbers  may  be  used by different filesystems, which is the
               reason that hard links may not cross filesystem  boundaries).   This
               field contains the file's inode number.

So user-space expecting inode numbers to be unique within a filesystem
is not "legacy" in any way. Userspace is allowed to expect this from the
ABI.

I think that a safe approach to prevent ABI regressions, and just to prevent
adding more ABI corner cases that userspace will have to work around, would
be to issue unique numbers to files within eventfs, but in the
simplest/obviously-correct implementation possible. It is, after all, a slow
path.

The issue with a bare atomic_add_return without any kind of check is the
scenario of a userspace loop that creates/deletes directories endlessly,
thus causing inode re-use. That approach is simple, but it's unfortunately
not obviously correct. Because eventfs allows userspace to do mkdir/rmdir,
this is unfortunately possible. It would be OK if only the kernel had control
over directory creation/removal, but that's not the case here.

I would suggest this straightforward solution to this:

a) define a EVENTFS_MAX_INODES (e.g. 4096 * 8),

b) keep track of inode allocation in a bitmap (within a single page),

c) disallow allocating more than "EVENTFS_MAX_INODES" in eventfs.

This way even the mkdir/rmdir loop will work fine, but it will prevent
keeping too many inodes alive at any given time. The cost is a single
page (4K) per eventfs instance.

Thanks,

Mathieu
Linus Torvalds Jan. 26, 2024, 10:26 p.m. UTC | #10
On Fri, 26 Jan 2024 at 14:09, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I'm not at my computer, but when I tried deleting that, it caused issues with the lookup code.

The VFS layer should be serializing all lookups of the same name. If
it didn't, we'd have serious issues on other filesystems.

So you should never get more than one concurrent lookup of one
particular entry, and as long as the dentry exists, you should then
not get a new one. It's one of the things that the VFS layer does to
make things simple for the filesystem.

But it's worth noting that that is about *one* entry. You can get
concurrent lookups in the same directory for different names.

Another thing that worries me is that odd locking that releases the
lock in the middle. I don't understand why you release the
tracefs_mutex() over create_file(), for example. There's a lot of
"take, drop, re-take, re-drop" of that mutex that seems strange.

           Linus
Linus Torvalds Jan. 26, 2024, 10:29 p.m. UTC | #11
On Fri, 26 Jan 2024 at 14:14, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> I do however have a concern with the approach of using the same
> inode number for various files on the same filesystem: AFAIU it
> breaks userspace ABI expectations.

Virtual filesystems have always done that in various ways.

Look at the whole discussion about the size of the file. Then look at /proc.

And honestly, eventfs needs to be simplified. It's a mess. It's less
of a mess than it used to be, but people should *NOT* think that it's
a real filesystem.

Don't use some POSIX standard as an expectation for things like /proc,
/sys or tracefs.

              Linus
Matthew Wilcox Jan. 26, 2024, 10:34 p.m. UTC | #12
On Fri, Jan 26, 2024 at 05:14:12PM -0500, Mathieu Desnoyers wrote:
> I would suggest this straightforward solution to this:
> 
> a) define a EVENTFS_MAX_INODES (e.g. 4096 * 8),
> 
> b) keep track of inode allocation in a bitmap (within a single page),
> 
> c) disallow allocating more than "EVENTFS_MAX_INODES" in eventfs.

... reinventing the IDA?
Mathieu Desnoyers Jan. 26, 2024, 10:40 p.m. UTC | #13
On 2024-01-26 17:34, Matthew Wilcox wrote:
> On Fri, Jan 26, 2024 at 05:14:12PM -0500, Mathieu Desnoyers wrote:
>> I would suggest this straightforward solution to this:
>>
>> a) define a EVENTFS_MAX_INODES (e.g. 4096 * 8),
>>
>> b) keep track of inode allocation in a bitmap (within a single page),
>>
>> c) disallow allocating more than "EVENTFS_MAX_INODES" in eventfs.
> 
> ... reinventing the IDA?

Looking at include/linux/idr.h now that you mention it, yes,
you are absolutely right!

Thanks,

Mathieu
Mathieu Desnoyers Jan. 26, 2024, 10:41 p.m. UTC | #14
On 2024-01-26 17:29, Linus Torvalds wrote:
> On Fri, 26 Jan 2024 at 14:14, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> I do however have a concern with the approach of using the same
>> inode number for various files on the same filesystem: AFAIU it
>> breaks userspace ABI expectations.
> 
> Virtual filesystems have always done that in various ways.
> 
> Look at the whole discussion about the size of the file. Then look at /proc.

Yes, there is even a note about stat.st_size in inode(7) explaining
this:

NOTES
        For pseudofiles that are autogenerated by the  kernel,  the  file  size
        (stat.st_size;  statx.stx_size) reported by the kernel is not accurate.
        For example, the value 0 is returned for many files under the /proc di‐
        rectory,  while  various  files under /sys report a size of 4096 bytes,
        even though the file content is smaller.  For such  files,  one  should
        simply  try  to  read as many bytes as possible (and append '\0' to the
        returned buffer if it is to be interpreted as a string).

But having a pseudo-filesystem entirely consisting of duplicated inodes
which are not hard links to the same file is something new/unexpected.

> And honestly, eventfs needs to be simplified. It's a mess. It's less
> of a mess than it used to be, but people should *NOT* think that it's
> a real filesystem.

I agree with simplifying it, but would rather not introduce userspace ABI
regressions in the process, which will cause yet another kind of mess.

> Don't use some POSIX standard as an expectation for things like /proc,
> /sys or tracefs.

If those filesystems choose to do things differently from POSIX, then it
should be documented with the relevant ABIs, because userspace should be
able to know (rather than guess) what it can expect.

Thanks,

Mathieu
Linus Torvalds Jan. 26, 2024, 10:48 p.m. UTC | #15
On Fri, 26 Jan 2024 at 14:34, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Jan 26, 2024 at 05:14:12PM -0500, Mathieu Desnoyers wrote:
> > I would suggest this straightforward solution to this:
> >
> > a) define a EVENTFS_MAX_INODES (e.g. 4096 * 8),
> >
> > b) keep track of inode allocation in a bitmap (within a single page),
> >
> > c) disallow allocating more than "EVENTFS_MAX_INODES" in eventfs.
>
> ... reinventing the IDA?

Guys, this is a random number that is *so* interesting that I
seriously think we shouldn't have it at all.

End result: nobody should care. Even the general VFS layer doesn't care.

It literally avoids inode number zero, not because it would be a bad
inode number, but simply because of some random historical oddity.

In fact, I don't think we even have a reason for it. We have a commit
2adc376c5519 ("vfs: avoid creation of inode number 0 in get_next_ino")
and that one calls out glibc for not deleting them. That makes no
sense to me, but whatever.

But note how the generic function does *not* try to make them unique,
for example. They are just "unique enough".

The generic function *does* care about being scalable in an SMP
environment. To a disturbing degree. Oh well.

                Linus
Linus Torvalds Jan. 26, 2024, 10:49 p.m. UTC | #16
On Fri, 26 Jan 2024 at 14:41, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> Yes, there is even a note about stat.st_size in inode(7) explaining
> this:

Good. Send a patch to do the same for st_ino.

             Linus
Matthew Wilcox Jan. 26, 2024, 11:04 p.m. UTC | #17
On Fri, Jan 26, 2024 at 02:48:45PM -0800, Linus Torvalds wrote:
> On Fri, 26 Jan 2024 at 14:34, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Jan 26, 2024 at 05:14:12PM -0500, Mathieu Desnoyers wrote:
> > > I would suggest this straightforward solution to this:
> > >
> > > a) define a EVENTFS_MAX_INODES (e.g. 4096 * 8),
> > >
> > > b) keep track of inode allocation in a bitmap (within a single page),
> > >
> > > c) disallow allocating more than "EVENTFS_MAX_INODES" in eventfs.
> >
> > ... reinventing the IDA?
> 
> Guys, this is a random number that is *so* interesting that I
> seriously think we shouldn't have it at all.
> 
> End result: nobody should care. Even the general VFS layer doesn't care.
> 
> It literally avoids inode number zero, not because it would be a bad
> inode number, but simply because of some random historical oddity.
> 
> In fact, I don't think we even have a reason for it. We have a commit
> 2adc376c5519 ("vfs: avoid creation of inode number 0 in get_next_ino")
> and that one calls out glibc for not deleting them. That makes no
> sense to me, but whatever.

Maybe we should take advantage of that historical oddity.  All files
in eventfs have inode number 0, problem solved.
Linus Torvalds Jan. 26, 2024, 11:11 p.m. UTC | #18
On Fri, 26 Jan 2024 at 15:04, Matthew Wilcox <willy@infradead.org> wrote:
>
> Maybe we should take advantage of that historical oddity.  All files
> in eventfs have inode number 0, problem solved.

That might not be a horrible idea.

The same way "a directory with st_nlink 1 clearly cannot have a
subdirectory count" (which tools like 'find' know about), maybe it
would be a good idea to use a zero inode number for 'this file doesn't
have an inode number'.

Now, presumably no tool knows that, but we could try to aim for that
being some future standard thing.

(And by "standard", I mean a practical one, not some POSIX thing. I
think POSIX just mentions "number of hard links", and then the whole
"a value of one means that we can't count subdirectories" is just a
practical reality for things like FAT).

               Linus
Linus Torvalds Jan. 26, 2024, 11:17 p.m. UTC | #19
On Fri, 26 Jan 2024 at 15:11, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, 26 Jan 2024 at 15:04, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Maybe we should take advantage of that historical oddity.  All files
> > in eventfs have inode number 0, problem solved.
>
> That might not be a horrible idea.

Note the "might". I don't know why glibc would have special-cased
st_ino of 0, but I suspect it's some internal oddity in the readdir()
implementation.

So considering that we do have that commit 2adc376c5519, I suspect it
would just be more pain than it's worth to try to teach user space
about the whole "no inode number" thing.

It might be safer to pick something like -1 instead.

               Linus
Andreas Schwab Jan. 27, 2024, 9:36 a.m. UTC | #20
On Jan 26 2024, Linus Torvalds wrote:

> Note the "might". I don't know why glibc would have special-cased
> st_ino of 0, but I suspect it's some internal oddity in the readdir()
> implementation.

It was traditionally used for deleted entries in a directory.
Steven Rostedt Jan. 27, 2024, 2:47 p.m. UTC | #21
On Fri, 26 Jan 2024 14:26:08 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Another thing that worries me is that odd locking that releases the
> lock in the middle. I don't understand why you release the
> tracefs_mutex() over create_file(), for example. There's a lot of
> "take, drop, re-take, re-drop" of that mutex that seems strange.

This was because create_file/dir() would call into the VFS, which
would grab locks, and the final dput() on an ei dentry that is to be
freed calls back into eventfs_set_ei_status_free(), which also grabs
the eventfs_mutex. But it gets called with the same VFS locks that are
taken by the create_file/dir() VFS calls. This was caught by lockdep,
hence the dropping of those locks.

The eventfs_mutex is just protecting the ei list and also assigning and
clearing the ei->dentry. Now that dentry is used to synchronize the last
close, and also to know if the ei was ever referenced. If ei->dentry is
NULL it can be freed immediately (after SRCU) when the directory is
deleted. But if ei->dentry is set, it means that something may still
have a reference to it and must be freed after the last dput() and SRCU.

Now some of this was needed due to the way the dir wrapper worked so I
may be able to revisit this and possibly just use an ei->ref counter.
But I wasted enough time on this and I'm way behind in my other
responsibilities, so this is not something I can work on now.

-- Steve
David Laight Jan. 27, 2024, 3:26 p.m. UTC | #22
From: Linus Torvalds
> Sent: 26 January 2024 21:36
> 
> On Fri, 26 Jan 2024 at 13:26, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > I'd be happy to change that patch to what I originally did before deciding
> > to copy get_next_ino():
> >
> > unsigned int tracefs_get_next_ino(int files)
> > {
> >         static atomic_t next_inode;
> >         unsigned int res;
> >
> >         do {
> >                 res = atomic_add_return(files + 1, &next_inode);
> >
> >                 /* Check for overflow */
> >         } while (unlikely(res < files + 1));
> >
> >         return res - files;
> 
> Still entirely pointless.
> 
> If you have more than 4 billion inodes, something is really really wrong.
> 
> So why is it ten lines instead of one?

Doesn't Linux support 64-bit inode numbers?
They solve the wrap problem...

I also don't know what filesystems like NTFS use - they don't have
the concept of inode.

IIRC NFS used to use the inode number for its 'file handle'.
Rather a pain when trying to write code to export a layered FS
(especially if a layer might be an NFS mount!).

	David

Linus Torvalds Jan. 27, 2024, 8:01 p.m. UTC | #23
On Sat, 27 Jan 2024 at 07:27, David Laight <David.Laight@aculab.com> wrote:
>
> Doesn't Linux support 64-bit inode numbers?
> They solve the wrap problem...

Yes, but we've historically had issues with actually exposing them.

The legacy stat() code has things like this:

        tmp.st_ino = stat->ino;
        if (sizeof(tmp.st_ino) < sizeof(stat->ino) && tmp.st_ino != stat->ino)
                return -EOVERFLOW;

so if you have really old 32-bit user space, you generally do not
actually want to have 64-bit inode numbers.

This is why "get_next_ino()" returns a strictly 32-bit only inode
number. You don't want to error out on a 'fstat()' just because you're
on a big system that has been running for a long time.

Now, 'stat64' was introduced for this reason back in 2.3.34, so back
in 1999. So any half-way modern 32-bit environment doesn't have that
issue, and maybe it's time to just sunset all the old stat() calls.

Of course, the *really* old stat has a 16-bit inode number. Search for
__old_kernel_stat to still see that. That's more of a curiosity than
anything else.

          Linus
Linus Torvalds Jan. 27, 2024, 9:47 p.m. UTC | #24
On Fri, 26 Jan 2024 at 13:49, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> For example, what purpose does 'ei->dentry' and 'ei->d_children[]' have?
> Isn't that entirely a left-over from the bad old days?
>
> So please try to look at things to *fix* and simplify, not at things
> to mess around with and make more complicated.

So here's an attempt at some fairly trivial but entirely untested cleanup.

I have *not* tested this at all, and I assume you have some extensive
test-suite that you run. So these are "signed-off" in the sense that
the patch looks fine, it builds in one configuration for me, but maybe
there's something really odd going on.

The first patch is trivial dead code removal.

The second patch is because I really do not understand the reason for
the 'ei->dentry' pointer, and it just looks messy.

It looks _particularly_ messy when it is mixed up in operations that
really do not need it and really shouldn't use it.

The eventfs_find_events() code was using the dentry parent pointer to
find the parent (fine, and simple), then looking up the tracefs inode
using that (fine so far), but then it looked up the dentry using
*that*. But it already *had* the dentry - it's that parent dentry it
just used to find the tracefs inode. The code looked nonsensical.

Similarly, it then (in the set_top_events_ownership() helper) used
'ei->dentry' to update the events attr, but all that really wants is
the superblock root. So instead of passing a dentry, just pass the
superblock pointer, which you can find in either the dentry or in the
VFS inode, depending on which level you're working at.

There are tons of other 'ei->dentry' uses, and I didn't look at those.
Baby steps. But this *seems* like an obvious cleanup, and many small
obvious cleanups later and perhaps the 'ei->dentry' pointer (and the
'->d_children[]' array) can eventually go away. They should all be
entirely useless - there's really no reason for a filesystem to hold
on to back-pointers of dentries.

Anybody willing to run the test-suite on this?

                    Linus
Steven Rostedt Jan. 28, 2024, 2:42 p.m. UTC | #25
On Sat, 27 Jan 2024 09:47:17 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> Now some of this was needed due to the way the dir wrapper worked so I
> may be able to revisit this and possibly just use an ei->ref counter.
> But I wasted enough time on this and I'm way behind in my other
> responsibilities, so this is not something I can work on now.

Ironically, one of the responsibilities that I've been putting off to
fix up eventfs was writing that document on a support group for
maintainer burnout. :-p

-- Steve
Steven Rostedt Jan. 28, 2024, 8:15 p.m. UTC | #26
On Sat, 27 Jan 2024 13:47:32 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> So here's an attempt at some fairly trivial but entirely untested cleanup.
> 
> I have *not* tested this at all, and I assume you have some extensive
> test-suite that you run. So these are "signed-off" in the sense that
> the patch looks fine, it builds in one configuration for me, but maybe
> there's something really odd going on.
> 
> The first patch is trivial dead code removal.

Ah yes. That was leftover code from the mount dentry walk that I
removed. I missed that code removal.

> 
> The second patch is because I really do not understand the reason for
> the 'ei->dentry' pointer, and it just looks messy.

I have to understand how the dentry lookup works. Basically, when the
ei gets deleted, it can't be freed until all dentries it references
(including its children) are no longer being accessed. Does that lookup
get called only when a dentry with the name doesn't already exist?

That is, can I assume that eventfs_root_lookup() is only called when
the VFS file system could not find an existing dentry and it has to
create one?

If that's the case, then I can remove the ei->dentry and just add a ref
counter that it was accessed. Then the final dput() should call
eventfs_set_ei_status_free() (I hate that name and need to change it),
and if the ei->is_freed is set, it can free the ei.

The eventfs_remove_dir() can free the ei (after SRCU) if it has no
references, otherwise it needs to wait for the final dput() to do the
free.

Again, if the ->lookup() only gets called if no dentry exists (where
ei->dentry would already be NULL), then I can do this.

I think the ei->dentry was required for the dir wrapper logic that we
removed.

> 
> It looks _particularly_ messy when it is mixed up in operations that
> really do not need it and really shouldn't use it.
> 
> The eventfs_find_events() code was using the dentry parent pointer to
> find the parent (fine, and simple), then looking up the tracefs inode
> using that (fine so far), but then it looked up the dentry using
> *that*. But it already *had* the dentry - it's that parent dentry it
> just used to find the tracefs inode. The code looked nonsensical.

That was probably from rewriting that function a few different ways.

> 
> Similarly, it then (in the set_top_events_ownership() helper) used
> 'ei->dentry' to update the events attr, but all that really wants is
> the superblock root. So instead of passing a dentry, just pass the
> superblock pointer, which you can find in either the dentry or in the
> VFS inode, depending on which level you're working at.
> 
> There are tons of other 'ei->dentry' uses, and I didn't look at those.
> Baby steps. But this *seems* like an obvious cleanup, and many small
> obvious cleanups later and perhaps the 'ei->dentry' pointer (and the
> '->d_children[]' array) can eventually go away. They should all be
> entirely useless - there's really no reason for a filesystem to hold
> on to back-pointers of dentries.
> 
> Anybody willing to run the test-suite on this?

It passed the ftrace selftests that are in the kernel, although the
ownership test fails if you run it a second time. That fails before
this patch too. It's because the test assumes that there's been no
chgrp done on any of the files/directories, which then permanently
assigns ownership and ignores the default. The test needs to be fixed to
handle this case, as it calls chgrp and chown, so it is itself
permanently assigning ownership.

I'll have to run this on my entire test suite.

-- Steve
Linus Torvalds Jan. 28, 2024, 8:53 p.m. UTC | #27
On Sun, 28 Jan 2024 at 12:15, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I have to understand how the dentry lookup works. Basically, when the
> ei gets deleted, it can't be freed until all dentries it references
> (including its children) are no longer being accessed. Does that lookup
> get called only when a dentry with the name doesn't already exist?

Dentry lookup gets called with the parent inode locked for reading, so
a lookup can happen in parallel with readdir and other dentry lookups.

BUT.

Each dentry is also "self-serializing", so you will never see a lookup
on the same name (in the same directory) concurrently.

The implementation is very much internal to the VFS layer, and it's
all kinds of nasty, with a new potential lookup waiting for the old
one, verifying that the old one is still usable, and maybe repeating
it all until we find a successful previous lookup or we're the only
dentry remaining.

It's nasty code that is very much in the "Al Viro" camp, but the point
is that any normal filesystem should treat lookups as concurrent with
non-creation events, but never as concurrent duplicates of the same name.

There *is* some common code with "atomic_open()", where filesystems
that implement that then want to know if it's the *first* lookup, or a
use of a previously looked up dentry, and they'll use the
"d_in_lookup()" thing to determine that. So this whole "keep track of
which dentries are *currently* being looked up" is actually exposed,
but any normal filesystem should never care.

But if you think you have that issue (tracefs does not), you really
want to talk to Al extensively.

> That is, can I assume that eventfs_root_lookup() is only called when
> the VFS file system could not find an existing dentry and it has to
> create one?

Correct. For any _particular_ name, you should think of lookup as serialized.

> If that's the case, then I can remove the ei->dentry and just add a ref
> counter that it was accessed. Then the final dput() should call
> eventfs_set_ei_status_free() (I hate that name and need to change it),
> and if the ei->is_freed is set, it can free the ei.

Note that the final 'dput()' will happen *after* the dentry has been
removed, so what can happen is

   lookup("name", d1);
   ... lookup successful, dentry is used ..
   ... dentry at some point has no more users ..
   ... memory pressure prunes unused dentries ..
   ... dentry gets unhashed and is no longer visible ..
   lookup("name", d2);
   ... new dentry is created ..
   final dput(d1);
   .. old dentry - that wasn't accessible any more is freed ..

and this is actually one of the *reasons* that virtual filesystems
must not try to cache dentry pointers in their internal data
structures. Because note how the filesystem saw the new lookup(d2) of
the same name *before* it saw the ->d_release(d1) of the old dentry.

And the above is fundamental: we obviously cannot call '->d_release()'
until the old dentry is all dead and buried (lockref_mark_dead() etc),
so pretty much by definition you'll have that ordering being possible.

It's extremely unlikely, of course. I'll bet you'll never hit it in testing.

So if you associate some internal data structure with a dentry,
just *what* happens when you haven't been told about the old dentry
being dead when the new one happens?

See why I say that it's fundamentally wrong for a filesystem to try to
track dentries? All the operations that can use a dentry will get one
passed down to them by the VFS layer. The filesystem has no business
trying to remember some dentry from a previous operation, and the
filesystem *will* get it wrong.

But also note how refcounting works fine. In fact, refcounting is
pretty much the *only* thing that works fine. So what you *should* do
is

 - at lookup(), when you save your filesystem data in "->d_fsdata",
you increment a refcount

 - at ->d_release(), you decrement a refcount

and now you're fine. Yes, when the above (very very unlikely)
situation happens, you'll temporarily have a refcount incremented
twice, but that's kind of the *point* of refcounts.

Side note: this is pretty much true of any kernel data structure. If
you have a kernel data structure that isn't just used within one
thread, it must be refcounted. But it's *doubly* true when you save
references to something that the VFS maintains, because you DO NOT
CONTROL the lifetime of that entity.

> The eventfs_remove_dir() can free the ei (after SRCU) if it has no
> references, otherwise it needs to wait for the final dput() to do the
> free.

Honestly, you should just *always* do refcounting. No "free after RCU
delay" as an alternative. Just refcount it.

Now, the RCU delay may be needed if the lookup of said structure
happens under RCU, but no, saying "I use SRCU to make sure the
lifetime is at least X" is just broken.

The refcount is what gives the lifetime. Any form of RCU-delaying
should then be purely about non-refcounting RCU lookups that may
happen as the thing is dying (and said lookup should *look* at the
refcount and say "oh, this is dead, I'm not returning this").

> I think the ei->dentry was required for the dir wrapper logic that we
> removed.

I think all of this was due to the bogus readdir that created dentries
willy-nilly and without the required serialization.

And that was all horribly broken.  It wasn't even the above kind of
"really subtle race that you'll never hit in practice" broken. It was
just absolutely broken with readdir and lookup racing on the same name
and creating an unholy dentry mess.

            Linus
Linus Torvalds Jan. 28, 2024, 9:08 p.m. UTC | #28
On Sun, 28 Jan 2024 at 12:53, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Now, the RCU delay may be needed if the lookup of said structure
> happens under RCU, but no, saying "I use SRCU to make sure the
> lifetime is at least X" is just broken.

Put another way, the only reason for any RCU should be that you don't
use locking at lookup, and the normal lookup routine should follow a
pattern something like this:

    rcu_read_lock();
    entry = find_entry(...);
    if (entry && !atomic_inc_not_zero(&entry->refcount))
        entry = NULL;
    rcu_read_unlock();

and the freeing should basically follow a pattern like

    if (atomic_dec_and_test(&entry->refcount))
        rcu_free(entry);

IOW, the *lifetime* is entirely about the refcount. No "I have killed
this entry" stuff. The RCU is purely about "look, we have to look up
the entry while it's being torn down, so I can fundamentally race with
the teardown, and so I need to be able to see that zero refcount".

Of course, the "remove it from whatever hash lists or other data
structures that can reach it" happens before the freeing.

*One* such thing would be the "->d_release()" of a dentry that has a
ref to it in d_fsdata, but presumably there are then other
subsystem-specific hash tables etc that have their own refcounts.

And a side note - I personally happen to believe that if you think you
need SRCU rather than regular RCU, you've already done something
wrong.

And the reason for that is possibly because you've mixed up the
refcount logic with some other subsystem locking logic, so you're
using sleeping locks to protect a refcount. That's a mistake of its
own. The refcounts are generally better just done using atomics (maybe
krefs).

               Linus
Steven Rostedt Jan. 28, 2024, 9:11 p.m. UTC | #29
On Sun, 28 Jan 2024 12:53:31 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sun, 28 Jan 2024 at 12:15, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > I have to understand how the dentry lookup works. Basically, when the
> > ei gets deleted, it can't be freed until all dentries it references
> > (including its children) are no longer being accessed. Does that lookup
> > get called only when a dentry with the name doesn't already exist?  
> 
> Dentry lookup gets called with the parent inode locked for reading, so
> a lookup can happen in parallel with readdir and other dentry lookups.
> 
> BUT.
> 
> Each dentry is also "self-serializing", so you will never see a lookup
> on the same name (in the same directory) concurrently.

The above is what I wanted to know.

> 
> The implementation is very much internal to the VFS layer, and it's
> all kinds of nasty, with a new potential lookup waiting for the old
> one, verifying that the old one is still usable, and maybe repeating
> it all until we find a successful previous lookup or we're the only
> dentry remaining.
> 
> It's nasty code that is very much in the "Al Viro" camp, but the point
> is that any normal filesystem should treat lookups as concurrent with
> non-creation events, but never as concurrent duplicates of the same name.
> 
> There *is* some common code with "atomic_open()", where filesystems
> that implement that then want to know if it's the *first* lookup, or a
> use of a previously looked up dentry, and they'll use the
> "d_in_lookup()" thing to determine that. So this whole "keep track of
> which dentries are *currently* being looked up" is actually exposed,
> but any normal filesystem should never care.
> 
> But if you think you have that issue (tracefs does not), you really
> want to talk to Al extensively.
> 
> > That is, can I assume that eventfs_root_lookup() is only called when
> > the VFS file system could not find an existing dentry and it has to
> > create one?  
> 
> Correct. For any _particular_ name, you should think of lookup as serialized.
> 
> > If that's the case, then I can remove the ei->dentry and just add a ref
> > counter that it was accessed. Then the final dput() should call
> > eventfs_set_ei_status_free() (I hate that name and need to change it),
> > and if the ei->is_freed is set, it can free the ei.  
> 
> Note that the final 'dput()' will happen *after* the dentry has been
> removed, so what can happen is
> 
>    lookup("name", d1);
>    ... lookup successful, dentry is used ..
>    ... dentry at some point has no more users ..
>    ... memory pressure prunes unused dentries ..
>    ... dentry gets unhashed and is no longer visible ..
>    lookup("name", d2);
>    ... new dentry is created ..
>    final dput(d1);
>    .. old dentry - that wasn't accessible any more is freed ..

Actually I was mistaken. I'm looking at the final iput(), not dput().

> 
> and this is actually one of the *reasons* that virtual filesystems
> must not try to cache dentry pointers in their internal data
> structures. Because note how the filesystem saw the new lookup(d2) of
> the same name *before* it saw the ->d_release(d1) of the old dentry.
> 
> And the above is fundamental: we obviously cannot call '->d_release()'
> until the old dentry is all dead and buried (lockref_mark_dead() etc),
> so pretty much by definition you'll have that ordering being possible.
> 
> It's extremely unlikely, of course. I'll bet you'll never hit it in testing.
> 
> So if you associate some internal data structure with a dentry,
> just *what* happens when you haven't been told about the old dentry
> being dead when the new one happens?
> 
> See why I say that it's fundamentally wrong for a filesystem to try to
> track dentries? All the operations that can use a dentry will get one
> passed down to them by the VFS layer. The filesystem has no business
> trying to remember some dentry from a previous operation, and the
> filesystem *will* get it wrong.
> 
> But also note how refcounting works fine. In fact, refcounting is
> pretty much the *only* thing that works fine. So what you *should* do
> is
> 
>  - at lookup(), when you save your filesystem data in "->d_fsdata",
> you increment a refcount
> 
>  - at ->d_release(), you decrement a refcount
> 
> and now you're fine. Yes, when the above (very very unlikely)
> situation happens, you'll temporarily have a refcount incremented
> twice, but that's kind of the *point* of refcounts.
> 
> Side note: this is pretty much true of any kernel data structure. If
> you have a kernel data structure that isn't just used within one
> thread, it must be refcounted. But it's *doubly* true when you save
> references to something that the VFS maintains, because you DO NOT
> CONTROL the lifetime of that entity.
> 
> > The eventfs_remove_dir() can free the ei (after SRCU) if it has no
> > references, otherwise it needs to wait for the final dput() to do the
> > free.  
> 
> Honestly, you should just *always* do refcounting. No "free after RCU
> delay" as an alternative. Just refcount it.
> 
> Now, the RCU delay may be needed if the lookup of said structure
> happens under RCU, but no, saying "I use SRCU to make sure the
> lifetime is at least X" is just broken.
> 
> The refcount is what gives the lifetime. Any form of RCU-delaying
> should then be purely about non-refcounting RCU lookups that may
> happen as the thing is dying (and said lookup should *look* at the
> refcount and say "oh, this is dead, I'm not returning this").
> 
> > I think the ei->dentry was required for the dir wrapper logic that we
> > removed.  
> 
> I think all of this was due to the bogus readdir that created dentries
> willy-nilly and without the required serialization.
> 
> And that was all horribly broken.  It wasn't even the above kind of
> "really subtle race that you'll never hit in practice" broken. It was
> just absolutely broken with readdir and lookup racing on the same name
> and creating an unholy dentry mess.

Hmm, if I understand the above, I could get rid of keeping around
dentry and even remove the eventfs_set_ei_status_free().

I can try something to see if it works.

-- Steve
Steven Rostedt Jan. 28, 2024, 9:19 p.m. UTC | #30
On Sun, 28 Jan 2024 12:53:31 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Honestly, you should just *always* do refcounting. No "free after RCU
> delay" as an alternative. Just refcount it.
> 
> Now, the RCU delay may be needed if the lookup of said structure
> happens under RCU, but no, saying "I use SRCU to make sure the
> lifetime is at least X" is just broken.
> 
> The refcount is what gives the lifetime. Any form of RCU-delaying
> should then be purely about non-refcounting RCU lookups that may
> happen as the thing is dying (and said lookup should *look* at the
> refcount and say "oh, this is dead, I'm not returning this".

The deleting of the ei is done outside the VFS logic. I use SRCU to
synchronize looking at the ei children in the lookup. On deletion, I
grab the eventfs_mutex, set ei->is_freed and then wait for SRCU to
finish before freeing.

The lookup checks ei->is_freed and doesn't do anything if set, but most
of that logic is under the SRCU, which is what I want to make sure is
finished before the ei is deleted.

Hmm, I still need the logic for iput(), as dentry->d_fsdata can still
access the ei. That's where I need to have the ref counters. For a
lookup, I need to up the ref count when I create a new inode for the ei
or its children. Then in the iput() I decrement the ei ref count. I can
only free the ei if the ref count is zero.

The ref count is for knowing if an ei is referenced by a
dentry->d_fsdata, and the SRCU is to make sure there's no lookups
accessing an ei.

-- Steve
Linus Torvalds Jan. 28, 2024, 9:43 p.m. UTC | #31
On Sun, 28 Jan 2024 at 13:19, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> The deleting of the ei is done outside the VFS logic.

No.

You're fundamentally doing it wrong.

What you call "deletion" is just "remove from my hashes" or whatever.

The lifetime of the object remains entirely unrelated to that. It is
not free'd - removing it from the hashes should just be a reference
counter decrement.

> I use SRCU to synchronize looking at the ei children in the lookup.

That's just wrong.

Either you look things up under your own locks, in which case the SRCU
dance is unnecessary and pointless.

Or you use refcounts.

In which case SRCU is also unnecessary and pointless.

> On deletion, I
> grab the eventfs_mutex, set ei->is_freed and then wait for SRCU to
> finish before freeing.

Again, bogus.

Sure, you could do is "set ei->is_freed" to let any other users know
(if they even care - why would they?). You'd use *locking* to
serialize that.

But that has *NOTHING* to do with actually freeing the data structure,
and it has nothing to do with SRCU - even if the locking might be
blocking.

Because *after* you have changed your data structures, and preferably
after you have already dropped your locks (to not hold them
unnecessarily over any memory management) then you just do the normal
"free the reference count", because you've removed the ref from your
own data structures.

You don't use "use SRCU before freeing". You use the pattern I showed:

    if (atomic_dec_and_test(&entry->refcount))
        rcu_free(entry);

in a "put_entry()" function, and EVERYBODY uses that function when
they are done with it.

In fact, the "rcu_free()" is likely entirely unnecessary, since I
don't see that you ever look anything up under RCU.

If all your lookups are done under the eventfs_mutex lock you have, just do

    if (atomic_dec_and_test(&entry->refcount))
        kfree(entry);

and you're done. By definition, once the refcount goes down to zero,
there are no users, and if all your own data structures are maintained
with a lock, there is never ever any reason to use a RCU delay.

Sure, you'll have things like "dentry->d_fsdata" accesses that happen
before you even take the lock, but that's fine - the d_fsdata pointer
has a ref to it, so there's no locking needed for that lookup. It's
just a direct pointer dereference, and it's protected by the refcount.

No special cases. The user that sets "is_freed" is not special. Never
will be. It's just one reference among many others, and YOU DO NOT
CONTROL THE OTHER REFERENCES.

If you've given a ref to dentry->d_fsdata, it's no longer yours to
mess around with. All you can do is wait for the dentry to go away, at
which point you do the same "put_dentry()" because exactly like your
own data structures, it's JUST ANOTHER REFERENCE.

See what I'm saying?

                Linus
Steven Rostedt Jan. 28, 2024, 10:01 p.m. UTC | #32
On Sun, 28 Jan 2024 13:08:55 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sun, 28 Jan 2024 at 12:53, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Now, the RCU delay may be needed if the lookup of said structure
> > happens under RCU, but no, saying "I use SRCU to make sure the
> > lifetime is at least X" is just broken.  
> 
> Put another way, the only reason for any RCU should be that you don't
> use locking at lookup, and the normal lookup routine should follow a
> pattern something like this:
> 
>     rcu_read_lock();
>     entry = find_entry(...);
>     if (entry && !atomic_inc_not_zero(&entry->refcount))
>         entry = NULL;
>     rcu_read_unlock();
> 
> and the freeing should basically follow a pattern like
> 
>     if (atomic_dec_and_test(&entry->refcount))
>         rcu_free(entry);

Basically you are saying that when the ei is created, it should have a
ref count of 1. If the lookup happens and does the
atomic_inc_not_zero() it will only increment if the ref count is not 0
(which is basically equivalent to is_freed).

And then we can have deletion of the object happen both where the
caller (kprobes) deletes the directory and at the final iput()
reference (can I use iput() and not d_release()?), with both doing the
same thing.

And whichever one sees the refcount hit zero calls rcu_free()?

-- Steve
Linus Torvalds Jan. 28, 2024, 10:07 p.m. UTC | #33
On Sun, 28 Jan 2024 at 13:43, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> That's just wrong.
>
> Either you look things up under your own locks, in which case the SRCU
> dance is unnecessary and pointless.
>
> Or you use refcounts.
>
> In which case SRCU is also unnecessary and pointless.

So from what I can see, you actually protect almost everything with
the eventfs_mutex, but the problem is that you then occasionally drop
that mutex in the middle.

The one valid reason for dropping it is the readdir callback, which
does need to write to user space memory.

But no, that's not a valid reason to use SRCU. It's a very *bad*
reason to use SRCU.

The thing is, you can fix it two ways:

 - either refcount things properly, ie when you do that lookup under your lock:

        mutex_lock(&eventfs_mutex);
        ei = READ_ONCE(ti->private);
        if (ei && ei->is_freed)
                ei = NULL;
        mutex_unlock(&eventfs_mutex);

   you just go "I now have a ref" to the ei, and you increment the
refcount like you should, and then you decrement it at the end when
you're done.

Btw, what's with the READ_ONCE()? You have locking.

The other option is to simply re-lookup the ei when you re-get the
eventfs_mutex anyway.

In either of those cases, the SRCU is entirely pointless. It really
looks wrong, because you seem to take that eventfs_mutex everywhere
anyway.

             Linus
Steven Rostedt Jan. 28, 2024, 10:17 p.m. UTC | #34
On Sun, 28 Jan 2024 14:07:49 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sun, 28 Jan 2024 at 13:43, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > That's just wrong.
> >
> > Either you look things up under your own locks, in which case the SRCU
> > dance is unnecessary and pointless.
> >
> > Or you use refcounts.
> >
> > In which case SRCU is also unnecessary and pointless.  
> 
> So from what I can see, you actually protect almost everything with
> the eventfs_mutex, but the problem is that you then occasionally drop
> that mutex in the middle.
> 
> The one valid reason for dropping it is the readdir callback, which
> does need to write to user space memory.
> 
> But no, that's not a valid reason to use SRCU. It's a very *bad*
> reason to use SRCU.
> 
> The thing is, you can fix it two ways:
> 
>  - either refcount things properly, ie when you do that lookup under your lock:
> 
>         mutex_lock(&eventfs_mutex);
>         ei = READ_ONCE(ti->private);
>         if (ei && ei->is_freed)
>                 ei = NULL;
>         mutex_unlock(&eventfs_mutex);
> 
>    you just go "I now have a ref" to the ei, and you increment the
> refcount like you should, and then you decrement it at the end when
> you're done.
> 
> Btw, what's with the READ_ONCE()? You have locking.
> 
> The other option is to simply re-lookup the ei when you re-get the
> eventfs_mutex anyway.
> 
> In either of those cases, the SRCU is entirely pointless. It really
> looks wrong, because you seem to take that eventfs_mutex everywhere
> anyway.

The original code just used the mutex, but then we were hitting
deadlocks because we used the mutex in the iput() logic. But this could
have been due to the readdir logic causing the deadlocks.

A lot of the design decisions were based on doing the dentry creation
in the readdir code. Now that it's no longer there, I could go back and
try taking the eventfs_mutex for the entirety of the lookup and see if
lockdep complains again about also using it in the iput logic.

Then yes, we can get rid of the SRCU as that was added as a way to get
out of that deadlock.

-- Steve
Linus Torvalds Jan. 28, 2024, 10:17 p.m. UTC | #35
On Sun, 28 Jan 2024 at 14:01, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Basically you are saying that when the ei is created, it should have a
> ref count of 1. If the lookup happens and does the
> atomic_inc_not_zero() it will only increment if the ref count is not 0
> (which is basically equivalent to is_freed).

Exactly.

That's what we do with almost all our data structures.

You can use the kref() infrastructure to give this a bit more
structure, and avoid doing the atomics by hand. So then "get a ref" is
literally

    kref_get(&ei->kref);

and releasing it is

    kref_put(&ei->kref, ei_release);

so you don't have to write out the "if (atomic_dec_and_test(...)) kfree();" by hand.

And "ei_release()" looks just something like this:

    static void ei_release(struct kref *ref)
    {
        kfree(container_of(ref, struct eventfs_inode, kref));
    }

and that's literally all you need (ok, you need to add the 'kref' to
the eventfs_inode, and initialize it at allocation time, of course).

> And then we can have deletion of the object happen both where the
> caller (kprobes) deletes the directory and at the final iput()
> reference (can I use iput() and not d_release()?), with both doing the
> same thing.
>
> And whichever one sees the refcount hit zero calls rcu_free()?

Having looked more at your code, I actually don't see you even needing
rcu_free().

It's not just that you don't need SRCU (which is *much* heavier than
RCU) - you don't even need the regular, quicker RCU at all.

As far as I can see, you do all the lookups under eventfs_mutex, so
there are actually no RCU users.

And the VFS code (that obviously uses RCU internally) has a ref to it
and just a direct pointer, so again, there's no real RCU involved.

You seem to have used SRCU as a "I don't want to do refcounts" thing.
I bet you'll notice that it clarifies things *enormously* to just use
refcounts.

                Linus
Linus Torvalds Jan. 28, 2024, 10:25 p.m. UTC | #36
On Sun, 28 Jan 2024 at 14:17, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> The original code just used the mutex, but then we were hitting
> deadlocks because we used the mutex in the iput() logic. But this could
> have been due to the readdir logic causing the deadlocks.

I agree that it's likely that the readdir logic caused the deadlocks
on the eventfs_mutex, but at the same time it really does tend to be a
good idea to drop any locks when dealing with readdir().

The issue with the readdir iterator is that it needs to access user
space memory for every dirent it fills in, and any time you hold a
lock across a user space access, you are now opening yourself up to
having the lock have really long hold times. It's basically a great
way to cause a mini-DoS.

And even if you now can do without the mutex in the release paths etc
by just using refcounts, and even if you thus get rid of the deadlock
itself, it's typically a very good idea to have the 'iterate_shared()'
function drop all locks before writing to user space.

The same is obviously true of 'read()' etc that also writes to user
space, but you tend to see the issue much more with the readdir code,
because it iterates over all these small things, and readdir()
typically wants the lock more too (because there's all that directory
metadata).

So dropping the lock might not be something you *have* to do in
iterate_shared, but it's often a good idea.

But dropping the lock also doesn't tend to be a big problem, if you
just have refcounted data structures. Sure, the data might change
(because you dropped the lock), but at least the data structure itself
is still there when you get the lock, so at most it might be a "I will
need to re-check that the entry hasn't been removed" kind of thing.

                 Linus
Steven Rostedt Jan. 28, 2024, 10:26 p.m. UTC | #37
On Sun, 28 Jan 2024 14:17:43 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> You seem to have used SRCU as a "I don't want to do refcounts" thing.
> I bet you'll notice that it clarifies things *enormously* to just use
> refcounts.

Well, removing the creation of dentries in the readdir() logic is what
opened the door to a lot of simplification. Thanks for helping me with
that. I believe that may have been the source of most of the deadlocks
we were struggling with.

But yeah, kref probably could have fixed that too.

-- Steve
Steven Rostedt Jan. 28, 2024, 10:51 p.m. UTC | #38
On Sat, 27 Jan 2024 13:47:32 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> There are tons of other 'ei->dentry' uses, and I didn't look at those.
> Baby steps. But this *seems* like an obvious cleanup, and many small
> obvious cleanups later and perhaps the 'ei->dentry' pointer (and the
> '->d_children[]' array) can eventually go away. They should all be
> entirely useless - there's really no reason for a filesystem to hold
> on to back-pointers of dentries.

I was working on getting rid of ei->dentry, but then I hit:

void eventfs_remove_dir(struct eventfs_inode *ei)
{
	struct dentry *dentry;

	if (!ei)
		return;

	mutex_lock(&eventfs_mutex);
	dentry = ei->dentry;
	eventfs_remove_rec(ei, 0);
	mutex_unlock(&eventfs_mutex);

	/*
	 * If any of the ei children has a dentry, then the ei itself
	 * must have a dentry.
	 */
	if (dentry)
		simple_recursive_removal(dentry, NULL);
}

Where it deletes all the existing dentries in a tree. Is this a
valid place to keep ei->dentry? I believe this is what makes the
directory disappear from the user's view. But the ei->dentry is there to
know that the directory is in the user's view to begin with.

-- Steve
Linus Torvalds Jan. 28, 2024, 11:24 p.m. UTC | #39
On Sun, 28 Jan 2024 at 14:51, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I was working on getting rid of ei->dentry, but then I hit:
>
> void eventfs_remove_dir(struct eventfs_inode *ei)
> { [..]
>
> Where it deletes all the existing dentries in a tree. Is this a
> valid place to keep ei->dentry?

No, when you remove the directory, just leave the child dentries
alone. Remember: they are purely caches, and you either have

 - somebody is still using it (you can 'rmdir()' a directory that some
other process has as its cwd, for example), which keeps it alive and
active anyway

 - when the last user is done, the dcache code will just free the
dentries anyway

so there's no reason to remove any of the dentries by hand - and in
fact simple_recursive_removal() never did that anyway for anything
that was still busy.

For a pure cached set of dentries (that have no users), doing the last
"dput()" on a directory will free that directory dentry, but it will
also automatically free all the unreachable children recursively.

Sure, simple_recursive_removal() does other things (sets inode flags
to S_DEAD, does fsnotify etc), but none of those should actually
matter.

I think that whole logic is simply left-over from when the dentries
weren't a filesystem cache, but were the *actual* filesystem. So it
actually became actively wrong when you started doing your own backing
store, but it just didn't hurt (except for code legibility).

Of course, eventfs is slightly odd and special in that this isn't a
normal "rmdir()", so it can happen with files still populated. And
those children will stick around  and be useless baggage until they
are shrunk under memory pressure.

But I don't think it should *semantically* matter, exactly because
they always could stay around anyway due to having users.

There are some cacheability knobs like

        .d_delete = always_delete_dentry,

which probably makes sense for any virtual filesystem (it limits
caching of dentries with no more users - normally the dcache will
happily keep caching dentries for the *next* user, but that may or may
not make sense for virtual filesystems)

So there is tuning that can be done, but in general, I think you
should actively think of the dcache as just "it's just a cache, leave
stale stuff alone"

                  Linus
Steven Rostedt Jan. 28, 2024, 11:59 p.m. UTC | #40
On Sun, 28 Jan 2024 15:24:09 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sun, 28 Jan 2024 at 14:51, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > I was working on getting rid of ei->dentry, but then I hit:
> >
> > void eventfs_remove_dir(struct eventfs_inode *ei)
> > { [..]
> >
> > Where it deletes the all the existing dentries in a tree. Is this a
> > valid place to keep ei->dentry?  
> 
> No, when you remove the directory, just leave the child dentries
> alone. Remember: they are purely caches, and you either have

My fear is that they can be accessed again when there's no file around.

Thus, the dentry and the inode are created on access, with the fops
loaded that call into the tracing code.

For example, when someone creates a kprobe:

 # cd /sys/kernel/tracing
 # echo 'p:sched schedule' > kprobe_events
 # ls events/kprobes/sched/enable
enable

Now I do:

  # echo > kprobe_events

Currently if I do:

 # ls events/kprobes/sched/enable
ls: cannot access 'events/kprobes/sched/enable'

But because the dentry doesn't go away immediately, it can cause
issues. I'd rather not force them to be freed every time the dentry's
ref count goes to zero, as that could cause more overhead in tracing,
where the tooling already does multiple scans of the eventfs directory
for various reasons.

If that dentry is still there, and someone does:

 echo 1 > events/kprobes/sched/enable

after the ei was removed, wouldn't it call back into the open callback
of the inode represented by that dentry?

If that happens, it will call into the kprobe code and reference
structures that have already been freed.

Note, before adding that simple_recursive_removal() or my open coded
version I had earlier, the kprobe files would stick around after the
kprobe was freed, and I was able to crash the kernel.

> 
>  - somebody is still using it (you can 'rmdir()' a directory that some
> other process has as its cwd, for example), which keeps it alive and
> active anyway

Wouldn't it be bad if the dentry hung around after the rmdir? You don't
want to be able to access files after rmdir has finished.

> 
>  - when the last user is done, the dcache code will just free the
> dentries anyway

But it doesn't. The debug code I had before, which showed the ei->dentry
in the show_event_dentries file, showed that the dentries exist for some
time after their ref count hits zero. They only got freed on
drop_caches or memory reclaim.

I like the fact that they hang around, so that we are not constantly
allocating them for every reference.

> 
> so there's no reason to remove any of the dentries by hand - and in
> fact simple_recursive_removal() never did that anyway for anything
> that was still busy.
> 
> For a pure cached set of dentries (that have no users), doing the last
> "dput()" on a directory will free that directory dentry, but it will
> also automatically free all the unreachable children recursively.

But it doesn't free it. It just makes it available to be freed.

> 
> Sure, simple_recursive_removal() does other things (sets inode flags
> to S_DEAD, does fsnotify etc), but none of those should actually
> matter.
> 
> I think that whole logic is simply left-over from when the dentries
> weren't a filesystem cache, but were the *actual* filesystem. So it
> actually became actively wrong when you started doing your own backing
> store, but it just didn't hurt (except for code legibility).
> 
> Of course, eventfs is slightly odd and special in that this isn't a
> normal "rmdir()", so it can happen with files still populated. And
> those children will stick around  and be useless baggage until they
> are shrunk under memory pressure.
> 
> But I don't think it should *semantically* matter, exactly because
> they always could stay around anyway due to having users.

It does, because things like kprobes will call into tracefs to free
their files so that they can free their own resources, as the open
callback will reference those resources. If those dentries stick around
and allow the user to open the file again, it will cause a
use-after-free bug.

Kprobes does keep track of opens, so its data will not be freed if a
task already has a reference. You get an -EBUSY on the removal of a
kprobe if something has a reference to it.

But on deletion, once the eventfs code that deletes the directory
returns, it is expected that there will be no more opens on the kprobe
files. With stale dentries hanging around after the directory is
deleted, that is not the case.

> 
> There are some cacheability knobs like
> 
>         .d_delete = always_delete_dentry,
> 
> which probably makes sense for any virtual filesystem (it limits
> caching of dentries with no more users - normally the dcache will
> happily keep caching dentries for the *next* user, but that may or may
> not make sense for virtual filesystems)

But if it were freed every time there are no more users, then when you
do a stat() it will get referenced and then freed (as the dentry ref
count goes to zero after the stat() is completed). Then the
open/write/close will create and delete it, and this could be done
several times for the same file. If the dentry is freed every time its
ref count goes to zero, then it will need to be created for every
operation.

> 
> So there is tuning that can be done, but in general, I think you
> should actively think of the dcache as just "it's just a cache, leave
> stale stuff alone"

I would like to keep it as a cache, but deleting every time the ref
count goes to zero is not much of a cache.

-- Steve
Steven Rostedt Jan. 29, 2024, 12:21 a.m. UTC | #41
On Sun, 28 Jan 2024 18:59:43 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> >  - somebody is still using it (you can 'rmdir()' a directory that some
> > other process has as its cwd, for example), which keeps it alive and
> > active anyway  
> 
> Wouldn't it be bad if the dentry hung around after the rmdir? You don't
> want to be able to access files after rmdir has finished.

And thinking about this more, this is one thing that is different with
eventfs than a normal file system. The rmdir in most cases where
directories are deleted in eventfs will fail if there's any open files
within it.

eventfs doesn't itself enforce this, but the users of eventfs do.

-- Steve
Linus Torvalds Jan. 29, 2024, 1 a.m. UTC | #42
On Sun, 28 Jan 2024 at 16:21, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> >
> > Wouldn't it be bad if the dentry hung around after the rmdir? You don't
> > want to be able to access files after rmdir has finished.

Steven, I already told you that that is NORMAL.

This is how UNIX filesystems work. Try this:

   mkdir dummy
   cd dummy
   echo "Hello" > hello
   ( sleep 10; cat ) < hello &
   rm hello
   cd ..
   rmdir dummy

and guess what? It will print "hello" after that file has been
removed, and the whole directory is gone.

YOU NEED TO DEAL WITH THIS.

> And thinking about this more, this is one thing that is different with
> eventfs than a normal file system. The rmdir in most cases where
> directories are deleted in eventfs will fail if there's any open files
> within it.

No.

Stop thinking that eventfs is special. It's not.

You need to deal with the realities of having made a filesystem. And
one of those realities is that you don't control the dentries, and you
can't randomly cache dentry state and then do things behind the VFS
layer's back.

So remove that broken function. Really.  You did a filesystem, and
that means that you had better play by the VFS rules.

End of discussion.

Now, you can then make your own "read" and "lookup" etc functions say
"if the backing store data has been marked dead, I'll not do this".
That's *YOUR* data structures, and that's your choice.

But you need to STOP thinking you can play games with dentries.  And
you need to stop making up BS arguments for why  you should be able
to.

So if you are thinking of a "Here's how to do a virtual filesystem"
talk, I would suggest you start with one single word: "Don't".

I'm convinced that we have made it much too easy to do a half-arsed
virtual filesystem. And eventfs is way beyond half-arsed.

It's now gone from half-arsed to "I told you how to do this right, and
you are still arguing". That makes it full-arsed.

So just stop the arsing around, and just get rid of those _broken_ dentry games.

               Linus
Linus Torvalds Jan. 29, 2024, 1:42 a.m. UTC | #43
On Sun, 28 Jan 2024 at 17:00, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>    mkdir dummy
>    cd dummy
>    echo "Hello" > hello
>    ( sleep 10; cat ) < hello &
>    rm hello
>    cd ..
>    rmdir dummy

Note that it's worth repeating that simple_recursive_removal()
wouldn't change any of the above. It only unhashes things and makes
them *look* gone, doing things like clearing i_nlink etc.

But those VFS data structures would still exist, and the files that
had them open would still continue to be open.

So if you thought that simple_recursive_removal() would make the above
kind of thing not able to happen, and that eventfs wouldn't have to
deal with dentries that point to event_inodes that are dead, you were
always wrong.

simple_recursive_removal() is mostly just lipstick on a pig. It does
cause the cached dentries that have no active use be removed earlier,
so it has that "memory pressure" kind of effect, but it has no real
fundamental semantic effect.

Of course, for a filesystem where the dentry tree *is* the underlying
data (ie the 'tmpfs' kind, but also things like debugfs or ipathfs,
for example), then things are different.

There the dentries are the primary thing, and not just a cache in
front of the backing store.

But you didn't want that, and those days are long gone as far as
tracefs is concerned.

              Linus
Steven Rostedt Jan. 29, 2024, 2:09 a.m. UTC | #44
On Sun, 28 Jan 2024 17:00:08 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sun, 28 Jan 2024 at 16:21, Steven Rostedt <rostedt@goodmis.org> wrote:
> >  
> > >
> > > Wouldn't it be bad if the dentry hung around after the rmdir? You don't
> > > want to be able to access files after rmdir has finished.
> 
> Steven, I already told you that that is NORMAL.
> 
> This is how UNIX filesystems work. Try this:
> 
>    mkdir dummy
>    cd dummy
>    echo "Hello" > hello
>    ( sleep 10; cat ) < hello &

Running strace on the above we have:

openat(AT_FDCWD, "hello", O_RDONLY)     = 3
dup2(3, 0)                              = 0
close(3)                                = 0
newfstatat(AT_FDCWD, "/usr/local/sbin/sleep", 0x7ffee0e44a60, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/usr/local/bin/sleep", 0x7ffee0e44a60, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/usr/sbin/sleep", 0x7ffee0e44a60, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/usr/bin/sleep", {st_mode=S_IFREG|0755, st_size=43888, ...}, 0) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], NULL, 8) = 0

So the file is ***opened*** and gets a reference. YES, I DEAL WITH THIS!!!

This works fine! I have no problems with this.


>    rm hello
>    cd ..
>    rmdir dummy
> 
> and guess what? It will print "hello" after that file has been
> removed, and the whole directory is gone.
> 
> YOU NEED TO DEAL WITH THIS.

And I do very well THANK YOU!

But if eventfs does not call that simple_recursive_removal(), the dentry
*STAYS AROUND* and things CAN OPEN IT AFTER IT HAS BEEN REMOVED!
That's equivalent to doing an ls on a directory you just deleted with
rmdir and still seeing the files.

Note, eventfs has no call to rmdir here. It's a virtual file system. The
directories disappear without the user ever accessing them.

Same for /proc. The directories for pids come and go when processes
fork and exit. You don't want someone to be able to access /proc/1234
*after* the task 1234 exited and was cleaned up by its parent. Do you?

And I'm not talking about if something opened the files in /proc/1234
before the task disappeared. That is dealt with. Just like if a file in
eventfs is opened and the backing data is to be deleted. It either
prevents the deletion, or in some cases it uses ref counts to know that
something still has one of its files open. And it won't delete the data
until everything has closed it. But after a file or directory has been
deleted, NO file system allows it to be opened again.

This isn't about something opening a file in eventfs and getting a
reference to it, and then the file or directory is being deleted.
That's covered. I'm talking about the directory being deleted and then
allowing something to open a file within it AFTER the deletion has
occurred. If a dentry is still around, THAT CAN HAPPEN! 

If a dentry is still around with nothing accessing it, and you remove
the data it represents without making that dentry invisible to user
space, it can be opened AFTER it has been deleted. Without calling
d_invalidate() (which calls shrink_dcache_parent()) on the dentry, it is
still visible, even with a ref count of zero and nothing having it open.
That means you can open that file again AFTER it has been deleted.
vfs_rmdir() calls shrink_dcache_parent(), which prunes the dcache so the
entries are not visible any more. But vfs_rmdir() is never called for
eventfs.

procfs calls d_invalidate(), which removes the dentry from being visible
to the file system. I *used* to do that too, until Al Viro suggested that
I use the simple_recursive_removal() call that does all of that for me.

> 
> > And thinking about this more, this is one thing that is different with
> > eventfs than a normal file system. The rmdir in most cases where
> > directories are deleted in eventfs will fail if there's any open files
> > within it.  
> 
> No.
> 
> Stop thinking that eventfs is special. It's not.

It's not special with respect to other virtual file systems, but
virtual file systems *are* special compared to regular file systems.
Why? Because regular file systems go through the VFS layer for pretty
much *all* interactions with them.

Virtual file systems interact with the kernel without going through the
VFS layer. In normal file systems, to remove a directory you have to go
through rmdir, which does all the nice things you are telling me about.

But virtual file systems (except for tmpfs) have their directories
removed by other means. The VFS layer *has no idea* that a directory is
removed. With eventfs, calling simple_recursive_removal() tells the VFS
layer this directory is being deleted, just as if someone had called
rmdir(). If I don't call that function, the VFS will think the
directory is still around and be happy to allow users to open files in
it AFTER the directory has been deleted.

Your example above does not do what I'm talking about here. It shows
something OPENING a file and then deleting the directory. Yes, if you
have an opened reference to something and it gets deleted, you still
have access to that reference. But you should not be able to get a new
reference to something after it has been deleted.

> 
> You need to deal with the realities of having made a filesystem. And
> one of those realities is that you don't control the dentries, and you
> can't randomly cache dentry state and then do things behind the VFS
> layer's back.

I'm not. I'm trying to let the VFS know a directory is deleted. Because
when you delete a kprobe, the directory that has the control files for
that kprobe (like enabling it) goes away too. I have to let the VFS know
that the directory is deleted, just like procfs has to tell it when a
directory for a process id is no more.

You don't kill tasks with: rmdir /proc/1234

And you don't delete kprobes with: rmdir events/kprobe/sched

> 
> So remove that broken function. Really.  You did a filesystem, and
> that means that you had better play by the VFS rules.
> 
> End of discussion.

And I do it just like debugfs and procfs do when they delete files
outside of the VFS, and pretty much like most virtual file systems.

> 
> Now, you can then make your own "read" and "lookup" etc functions say
> "if the backing store data has been marked dead, I'll not do this".
> That's *YOUR* data structures, and that's your choice.
> 
> But you need to STOP thinking you can play games with dentries.  And
> you need to stop making up BS arguments for why  you should be able
> to.
> 
> So if you are thinking of a "Here's how to do a virtual filesystem"
> talk, I would suggest you start with one single word: "Don't".
> 
> I'm convinced that we have made it much too easy to do a half-arsed
> virtual filesystem. And eventfs is way beyond half-arsed.
> 
> It's now gone from half-arsed to "I told you how to do this right, and
> you are still arguing". That makes it full-arsed.
> 
> So just stop the arsing around, and just get rid of those _broken_ dentry games.

Sorry, but you didn't prove your point. The example you gave me is
already covered. Please tell me when a kprobe goes away, how do I let
VFS know? Currently the tracing code (like kprobes and synthetic
events) calls eventfs_remove_dir() with just a reference to that ei
eventfs_inode structure. I currently use the ei->dentry to tell VFS
"this directory is being deleted". What other means do I have to
accomplish the same thing?

-- Steve
Steven Rostedt Jan. 29, 2024, 2:32 a.m. UTC | #45
On Sun, 28 Jan 2024 17:42:30 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sun, 28 Jan 2024 at 17:00, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> >    mkdir dummy
> >    cd dummy
> >    echo "Hello" > hello
> >    ( sleep 10; cat ) < hello &
> >    rm hello
> >    cd ..
> >    rmdir dummy  
> 
> Note that it's worth repeating that simple_recursive_removal()
> wouldn't change any of the above. It only unhashes things and makes
> them *look* gone, doing things like clearing i_nlink etc.

I know, but I already cover the above case. And that case is not what
simple_recursive_removal() is covering.

I'm worried about what can be opened after a deletion, not what has
already been opened. The simple_recursive_removal() is the way to clear
the dcache of the files and directories that are being removed, so that
no new references can happen on them.

So, I removed the simple_recursive_removal() from the code to see what
would happen. Interestingly, the opposite occurred.

 # cd /sys/kernel/tracing
 # echo 'p:sched schedule' > kprobe_events
 # ls events/kprobes
enable  filter  sched
 # ls events/kprobes/sched
enable  filter  format  hist  hist_debug  id  inject  trigger
 # cat events/kprobes/sched/enable
0

 # echo 'p:timer read_current_timer' >> kprobe_events
 # ls events/kprobes
enable  filter  sched  timer

Now delete just one kprobe (keeping the kprobes directory around)

 # echo '-:sched schedule' >> kprobe_events
 # ls events/kprobes/
enable  filter  timer

Now recreate that kprobe

 # echo 'p:sched schedule' >> kprobe_events
 # ls events/kprobes
enable  filter  sched  timer

 # ls events/kprobes/sched/
ls: reading directory 'events/kprobes/sched/': Invalid argument

I have no access to the directory that was deleted and recreated.

> 
> But those VFS data structures would still exist, and the files that
> had them open would still continue to be open.
> 
> So if you thought that simple_recursive_removal() would make the above
> kind of thing not able to happen, and that eventfs wouldn't have to
> deal with dentries that point to event_inodes that are dead, you were
> always wrong.

No, but I want to shrink the dentries after the directory is removed.

Perhaps something else is the error here.

> 
> simple_recursive_removal() is mostly just lipstick on a pig. It does
> cause the cached dentries that have no active use be removed earlier,
> so it has that "memory pressure" kind of effect, but it has no real
> fundamental semantic effect.

I was using it to "flush" the cache on that directory. Nothing more.

> 
> Of course, for a filesystem where the dentry tree *is* the underlying
> data (ie the 'tmpfs' kind, but also things like debugfs or ipathfs,
> for example), then things are different.

Note, tracefs was built on debugfs. Only the "events" directory is
"different". The rest of /sys/kernel/tracing behaves exactly like
debugfs.

> 
> There the dentries are the primary thing, and not just a cache in
> front of the backing store.
> 
> But you didn't want that, and those days are long gone as far as
> tracefs is concerned.

Well, as long as eventfs is ;-)

-- Steve
Steven Rostedt Jan. 29, 2024, 3:40 a.m. UTC | #46
On Sun, 28 Jan 2024 21:32:49 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

>  # echo 'p:sched schedule' >> kprobe_events
>  # ls events/kprobes
> enable  filter  sched  timer
> 
>  # ls events/kprobes/sched/
> ls: reading directory 'events/kprobes/sched/': Invalid argument
> 
> I have no access to the directory that was deleted and recreated.

Ah, this was because the final iput() does dentry->d_fsdata = NULL, and
in the lookup code I have:


	mutex_lock(&eventfs_mutex);
	ei = READ_ONCE(ti->private);
	if (ei && ei->is_freed)
		ei = NULL;
	mutex_unlock(&eventfs_mutex);

	if (!ei) {
		printk("HELLO no ei\n");
		goto out;
	}

Where that printk() was triggering.

So at least it's not calling back into the tracing code ;-)

Interesting that it still did the lookup, even though it was already
referenced.

I'm still learning the internals of VFS.

Anyway, after keeping d_fsdata untouched (not setting it to NULL), just
to see what would happen, I ran it again with KASAN and it did trigger:
[  106.255468] ==================================================================
[  106.258400] BUG: KASAN: slab-use-after-free in tracing_open_file_tr+0x3a/0x120
[  106.261228] Read of size 8 at addr ffff8881136f27b8 by task cat/868

[  106.264506] CPU: 2 PID: 868 Comm: cat Not tainted 6.8.0-rc1-test-00008-gbee668990ac4-dirty #454
[  106.267810] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[  106.271337] Call Trace:
[  106.272406]  <TASK>
[  106.273317]  dump_stack_lvl+0x5c/0xc0
[  106.274750]  print_report+0xcf/0x670
[  106.276173]  ? __virt_addr_valid+0x15a/0x330
[  106.278807]  kasan_report+0xd8/0x110
[  106.280172]  ? tracing_open_file_tr+0x3a/0x120
[  106.281745]  ? tracing_open_file_tr+0x3a/0x120
[  106.283343]  tracing_open_file_tr+0x3a/0x120
[  106.284887]  do_dentry_open+0x3b7/0x950
[  106.286284]  ? __pfx_tracing_open_file_tr+0x10/0x10
[  106.287992]  path_openat+0xea8/0x11d0


That was with just these commands:

  cd /sys/kernel/tracing/
  echo 'p:sched schedule' >> /sys/kernel/tracing/kprobe_events 
  echo 'p:timer read_current_timer' >> kprobe_events 
  ls events/kprobes/
  cat events/kprobes/sched/enable
  ls events/kprobes/sched
  echo '-:sched schedule' >> /sys/kernel/tracing/kprobe_events 
  ls events/kprobes/sched/enable
  cat events/kprobes/sched/enable

BTW, the ls after the deletion returned:

 # ls events/kprobes/sched/enable
 events/kprobes/sched/enable

In a normal file system that would be equivalent to:

 # mkdir events/kprobes/sched
 # touch events/kprobes/sched/enable
 # rm -rf events/kprobes/sched
 # ls events/kprobes/sched/enable
 events/kprobes/sched/enable

-- Steve
Linus Torvalds Jan. 29, 2024, 4:01 a.m. UTC | #47
On Sun, 28 Jan 2024 at 19:40, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> [  106.258400] BUG: KASAN: slab-use-after-free in tracing_open_file_tr+0x3a/0x120
> [  106.261228] Read of size 8 at addr ffff8881136f27b8 by task cat/868

Are you refcounting the pointers that you have in the dentries (and
inodes)? Like we talked about you needing to do?

Every time you assign a pointer to d_fsdata, you need to kref_get() it.

You try to work around the tracefs weaknesses by trying to clean up
the dentry data, but it's WRONG.

You should refcount the data properly, so that you don't NEED to clean it out.

               Linus
Amir Goldstein Jan. 29, 2024, 6:44 a.m. UTC | #48
> >
> > You need to deal with the realities of having made a filesystem. And
> > one of those realities is that you don't control the dentries, and you
> > can't randomly cache dentry state and then do things behind the VFS
> > layer's back.
>
> I'm not. I'm trying to let VFS know a directory is deleted. Because
> when you delete a kprobe, the directory that has the control files for
> that kprobe (like enabling it) go away too. I have to let VFS know that
> the directory is deleted, just like procfs has to tell it when a
> directory for a process id is no more.
>
> You don't kill tasks with: rmdir /proc/1234
>
> And you don't delete kprobes with: rmdir events/kprobe/sched
>
> >
> > So remove that broken function. Really.  You did a filesystem, and
> > that means that you had better play by the VFS rules.
> >
> > End of discussion.
>
> And I do it just like debugfs and procfs do when they delete files
> outside of the VFS, and pretty much like most virtual file systems.
>

I think it is better if we used the term "pseudo" file systems, because
to me VFS already stands for "virtual file system".

> >
> > Now, you can then make your own "read" and "lookup" etc functions say
> > "if the backing store data has been marked dead, I'll not do this".
> > That's *YOUR* data structures, and that's your choice.
> >
> > But you need to STOP thinking you can play games with dentries.  And
> > you need to stop making up BS arguments for why  you should be able
> > to.
> >
> > So if you are thinking of a "Here's how to do a virtual filesystem"
> > talk, I would suggest you start with one single word: "Don't".
> >
> > I'm convinced that we have made it much too easy to do a half-arsed
> > virtual filesystem. And eventfs is way beyond half-arsed.
> >
> > It's now gone from half-arsed to "I told you how to do this right, and
> > you are still arguing". That makes it full-arsed.
> >
> > So just stop the arsing around, and just get rid of those _broken_ dentry games.
>
> Sorry, but you didn't prove your point. The example you gave me is
> already covered. Please tell me when a kprobe goes away, how do I let
> VFS know? Currently the tracing code (like kprobes and synthetic
> events) calls eventfs_remove_dir() with just a reference to that ei
> eventfs_inode structure. I currently use the ei->dentry to tell VFS
> "this directory is being deleted". What other means do I have to
> accomplish the same thing?
>

Would kernfs_node_dentry() have been useful in that case?
Just looking at kernfs_node and eventfs_inode gives a very strong
smell of reinventing.

Note that the fact that eventfs has no rename makes the whole dentry
business a lot less complicated than it is in the general VFS case -
IIUC, an eventfs path either exists or it does not exist, but if it exists,
it conceptually always refers to the same underlying object (kprobe).

I am not claiming that kernfs can do everything that eventfs needs  -
I don't know, because I did not look into it.

But the concept of detaching the pseudo filesystem backing objects
from the vfs objects was already invented once and Greg has also
said the same thing.

My *feeling* at this point is that the best course of action is to use
kernfs and to improve kernfs to meet eventfs needs, if anything needs
improving at all.

IMO, the length and content of this correspondence in itself
is proof that virtual file systems should use an abstraction that
is much less flexible than the VFS.

Think of it this way - kernfs and VFS are both under-documented,
but the chances of getting kernfs well documented are far better
than ever being able to document all the subtleties of VFS for mortals.

IOW, I would be happy if instead of the LSFMM topic
"Making pseudo file systems inodes/dentries more like normal file systems"
We would be talking about
"Best practices for writing a pseudo filesystem" and/or
"Missing kernfs functionality for writing pseudo filesystems"

Thanks,
Amir.
Steven Rostedt Jan. 29, 2024, 9:32 a.m. UTC | #49
On Mon, 29 Jan 2024 08:44:29 +0200
Amir Goldstein <amir73il@gmail.com> wrote:

Hi Amir,

[ Suffering a bit of insomnia, I made the mistake of reading email from
  bed, and now I have to reply :-p ]

> >
> > And I do it just like debugfs when it deletes files outside of VFS or
> > procfs, and pretty much most virtual file systems.
> >  
> 
> I think it is better if we used the term "pseudo" file systems, because
> to me VFS already stands for "virtual file system".

Oops, you are absolutely correct. I meant "pseudo" but somehow switched
to saying "virtual". :-p

> >
> > Sorry, but you didn't prove your point. The example you gave me is
> > already covered. Please tell me when a kprobe goes away, how do I let
> > VFS know? Currently the tracing code (like kprobes and synthetic
> > events) calls eventfs_remove_dir() with just a reference to that ei
> > eventfs_inode structure. I currently use the ei->dentry to tell VFS
> > "this directory is being deleted". What other means do I have to
> > accomplish the same thing?
> >  
> 
> Would kernfs_node_dentry() have been useful in that case?
> Just looking at kernfs_node and eventfs_inode gives a very strong
> smell of reinventing.

Close, but looking at kernfs real quick, I think I see the difference
and why eventfs relies on the dentry, and why it doesn't need to.

> 
> Note that the fact that eventfs has no rename makes the whole dentry
> business a lot less complicated than it is in the general VFS case -
> IIUC, an eventfs path either exists or it does not exist, but if it exists,
> it conceptually always refers to the same underlying object (kprobe).
> 
> I am not claiming that kernfs can do everything that eventfs needs  -
> I don't know, because I did not look into it.

eventfs has one other major simplicity that kernfs does not. Not only
does it not have renames, but files are not created or deleted
individually. That is, when a directory is created, it will have a
fixed number of files. It will have those same files until the
directory is deleted.

That means I have no metadata saved about the files, except a fixed
array that holds only a name and a callback for each file in that
directory.

> 
> But the concept of detaching the pseudo filesystem backing objects
> from the vfs objects was already invented once and Greg has also
> said the same thing.
> 
> My *feeling* at this point is that the best course of action is to use
> kernfs and to improve kernfs to meet eventfs needs, if anything needs
> improving at all.
> 
> IMO, the length and content of this correspondence in itself
> is proof that virtual file systems should use an abstraction that
> is much less flexible than the VFS.
> 
> Think of it this way - kernfs and VFS are both under-documented,
> but the chances of getting kernfs well documented are far better
> than ever being able to document all the subtleties of VFS for mortals.
> 
> IOW, I would be happy if instead of the LSFMM topic
> "Making pseudo file systems inodes/dentries more like normal file systems"
> We would be talking about
> "Best practices for writing a pseudo filesystem" and/or
> "Missing kernfs functionality for writing pseudo filesystems"

I'm fine with that, and I think I also understand how to solve the
issue with Linus here.

Linus hates the fact that eventfs_inode has a reference to the dentry.
The reason I do that is because when the dentry is created, it happens
in lookup with a given file name. The registered callback for the file
name will return back a "data" pointer that gets assigned to the
dentry's inode->i_private data. This gets returned directly to the
open function calls.

That is, the kprobe will register a handle that gets assigned to the
inode->i_private data just like it did when it was in debugfs. Then the
open() call would get the kprobe data via the inode->i_private.

This is why reference counters alone do not work here. If I don't make
the dentry and inode go away after the last reference, it is possible
for that open function to be called with stale i_private data.

I noticed that kernfs assigns the kernfs_inode to the i_private data.
Thus it has some control over what's in the i_private.

I use that simple_recursive_removal() to clear out any inodes that have
a dentry ref count of zero so that there will be no inode->i_private
references in the open.

I'll have to look more at kernfs to see how it works. Perhaps it can
give me some ideas on not having to use dentry.

Hmm, I may have been looking at this all wrong. I don't need the
dentry. I need the inode! Instead of keeping a pointer to the dentry in
the eventfs_inode, I should just be keeping a pointer to the inode.
Then I could set the inode->i_private to NULL on the last reference.
The open() calls will have to check for NULL in that case.

Are inodes part of the VFS system like dentry is?

I would think not as tracefs has a "container_of" around the inode to
get to the tracefs_inode, just like many other file systems do.

That would remove my need to have to keep around any dentry.

I first need to know why I'm seeing "ghost" files. That is, without
that simple_recursive_removal() call, I get:

 # echo 'p:sched schedule' >> /sys/kernel/tracing/kprobe_events 
 # echo 'p:timer read_current_timer' >> kprobe_events 
 # ls events/kprobes/
enable  filter  sched  timer

 # cat events/kprobes/sched/enable
0

 # ls events/kprobes/sched
enable  filter  format  hist  hist_debug  id  inject  trigger

 # echo '-:sched schedule' >> /sys/kernel/tracing/kprobe_events 
 # ls events/kprobes
enable  filter  timer

 # ls events/kprobes/sched/enable
events/kprobes/sched/enable

  ??

The sched directory has been deleted but I can still "ls" the files
within it.

-- Steve
Mathieu Desnoyers Jan. 29, 2024, 4 p.m. UTC | #50
On 2024-01-26 17:49, Linus Torvalds wrote:
> On Fri, 26 Jan 2024 at 14:41, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> Yes, there is even a note about stat.st_size in inode(7) explaining
>> this:
> 
> Good. Send a patch to do the same for st_ino.

This breaks "cp -aH" and "cp -aL". Even setting st_nlink={0,1}
does not help there, from coreutils 9.1:

copy_internal():

[...]
   else if (x->preserve_links
            && !x->hard_link
            && (1 < src_sb.st_nlink
                || (command_line_arg
                    && x->dereference == DEREF_COMMAND_LINE_ARGUMENTS)
                || x->dereference == DEREF_ALWAYS))
     {
       earlier_file = remember_copied (dst_relname,
                                       src_sb.st_ino, src_sb.st_dev);
     }

Thanks,

Mathieu
Linus Torvalds Jan. 29, 2024, 6:58 p.m. UTC | #51
On Mon, 29 Jan 2024 at 08:00, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> This breaks "cp -aH" and "cp -aL".

Do we care? Do we have a user that cares? Has anybody ever hit it?

Why would you ever do anything like that to tracefs filesystem?

In other words: my point is that tracefs just isn't a regular
filesystem. Never was, never will be.

And people should be *aware* of that. We should not say "hey, if it
doesn't work like a normal filesystem, it's a bug".

Try "cp -aL" on /proc, and guess what? It won't work all that well
either. For entirely *different* reasons. You'll get some variation of
"Input/output error"s, and insanely big files and quite possibly
you'll end up with recursive copying as you try to copy the file that
is /proc/self/fd/<output>.

It's just a nonsensical operation to do, and if somebody says "I can't
copy /proc on my system" it's a PEBKAC, not a kernel problem.

The "no regressions" rule is not about made-up "if I do this, behavior changes".

The "no regressions" rule is about *users*.

If you have an actual user that has been doing insane things, and we
change something, and now the insane thing no longer works, at that
point it's a regression, and we'll sigh, and go "Users are insane" and
have to fix it.

But if you have some random test that now behaves differently, it's
not a regression. It's a *warning* sign, sure: tests are useful.

So tests can show when something user-visible changed, and as such
they are a "there be monsters here" sign that maybe some user
experience will hit it too.

So I agree that "just use the same inode number" changes behavior. I
agree that it can be a bit of a danger. But these "look, I can see a
difference" isn't an argument.

And honestly, I have now spent *days* looking at tracefs, and I'm
finding core fundamental bugs that would cause actual oopses and/or
wild pointer accesses.

All of which makes me go "this code needs to be simpler and *cleaner*
and stop making problems".

In other words: tracefs is such a complete mess that I do not care one
*whit* about "cp -aL". I care about "this is actual kernel
instability".

           Linus
Patch

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 6b211522a13e..7be7a694b106 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -32,14 +32,11 @@ 
  */
 static DEFINE_MUTEX(eventfs_mutex);
 
-/* Choose something "unique" ;-) */
-#define EVENTFS_FILE_INODE_INO		0x12c4e37
-
 /* Just try to make something consistent and unique */
-static int eventfs_dir_ino(struct eventfs_inode *ei)
+static int eventfs_dir_ino(struct eventfs_inode *ei, int nr_files)
 {
 	if (!ei->ino)
-		ei->ino = get_next_ino();
+		ei->ino = tracefs_get_next_ino(nr_files);
 
 	return ei->ino;
 }
@@ -327,6 +324,7 @@  void eventfs_update_gid(struct dentry *dentry, kgid_t gid)
  * @parent: parent dentry for this file.
  * @data: something that the caller will want to get to later on.
  * @fop: struct file_operations that should be used for this file.
+ * @ino: inode number for this file
  *
  * This function creates a dentry that represents a file in the eventsfs_inode
  * directory. The inode.i_private pointer will point to @data in the open()
@@ -335,7 +333,8 @@  void eventfs_update_gid(struct dentry *dentry, kgid_t gid)
 static struct dentry *create_file(const char *name, umode_t mode,
 				  struct eventfs_attr *attr,
 				  struct dentry *parent, void *data,
-				  const struct file_operations *fop)
+				  const struct file_operations *fop,
+				  unsigned int ino)
 {
 	struct tracefs_inode *ti;
 	struct dentry *dentry;
@@ -363,9 +362,7 @@  static struct dentry *create_file(const char *name, umode_t mode,
 	inode->i_op = &eventfs_file_inode_operations;
 	inode->i_fop = fop;
 	inode->i_private = data;
-
-	/* All files will have the same inode number */
-	inode->i_ino = EVENTFS_FILE_INODE_INO;
+	inode->i_ino = ino;
 
 	ti = get_tracefs(inode);
 	ti->flags |= TRACEFS_EVENT_INODE;
@@ -377,12 +374,14 @@  static struct dentry *create_file(const char *name, umode_t mode,
 /**
  * create_dir - create a dir in the tracefs filesystem
  * @ei: the eventfs_inode that represents the directory to create
- * @parent: parent dentry for this file.
+ * @parent: parent dentry for this directory.
+ * @nr_files: The number of files (not directories) this directory has
  *
  * This function will create a dentry for a directory represented by
  * a eventfs_inode.
  */
-static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent)
+static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent,
+				 int nr_files)
 {
 	struct tracefs_inode *ti;
 	struct dentry *dentry;
@@ -404,7 +403,7 @@  static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
 	inode->i_fop = &eventfs_file_operations;
 
 	/* All directories will have the same inode number */
-	inode->i_ino = eventfs_dir_ino(ei);
+	inode->i_ino = eventfs_dir_ino(ei, nr_files);
 
 	ti = get_tracefs(inode);
 	ti->flags |= TRACEFS_EVENT_INODE;
@@ -504,7 +503,7 @@  create_file_dentry(struct eventfs_inode *ei, int idx,
 
 	mutex_unlock(&eventfs_mutex);
 
-	dentry = create_file(name, mode, attr, parent, data, fops);
+	dentry = create_file(name, mode, attr, parent, data, fops, ei->ino + idx + 1);
 
 	mutex_lock(&eventfs_mutex);
 
@@ -598,7 +597,7 @@  create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei,
 	}
 	mutex_unlock(&eventfs_mutex);
 
-	dentry = create_dir(ei, parent);
+	dentry = create_dir(ei, parent, ei->nr_entries);
 
 	mutex_lock(&eventfs_mutex);
 
@@ -786,7 +785,7 @@  static int eventfs_iterate(struct file *file, struct dir_context *ctx)
 		if (r <= 0)
 			continue;
 
-		ino = EVENTFS_FILE_INODE_INO;
+		ino = ei->ino + i + 1;
 
 		if (!dir_emit(ctx, name, strlen(name), ino, DT_REG))
 			goto out;
@@ -810,7 +809,7 @@  static int eventfs_iterate(struct file *file, struct dir_context *ctx)
 
 		name = ei_child->name;
 
-		ino = eventfs_dir_ino(ei_child);
+		ino = eventfs_dir_ino(ei_child, ei_child->nr_entries);
 
 		if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR))
 			goto out_dec;
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index e1b172c0e091..2187be6d7b23 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -223,13 +223,41 @@  static const struct inode_operations tracefs_file_inode_operations = {
 	.setattr	= tracefs_setattr,
 };
 
+/* Copied from get_next_ino() but adds allocation for multiple inodes */
+#define LAST_INO_BATCH 1024
+#define LAST_INO_MASK (~(LAST_INO_BATCH - 1))
+static DEFINE_PER_CPU(unsigned int, last_ino);
+
+unsigned int tracefs_get_next_ino(int files)
+{
+	unsigned int *p = &get_cpu_var(last_ino);
+	unsigned int res = *p;
+
+#ifdef CONFIG_SMP
+	/* Check if adding files+1 overflows */
+	if (unlikely(!res || (res & LAST_INO_MASK) != ((res + files + 1) & LAST_INO_MASK))) {
+		static atomic_t shared_last_ino;
+		int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);
+
+		res = next - LAST_INO_BATCH;
+	}
+#endif
+
+	res++;
+	/* get_next_ino should not provide a 0 inode number */
+	if (unlikely(!res))
+		res++;
+	*p = res + files;
+	put_cpu_var(last_ino);
+	return res;
+}
+
 struct inode *tracefs_get_inode(struct super_block *sb)
 {
 	struct inode *inode = new_inode(sb);
-	if (inode) {
-		inode->i_ino = get_next_ino();
+	if (inode)
 		simple_inode_init_ts(inode);
-	}
+
 	return inode;
 }
 
@@ -644,6 +672,8 @@  struct dentry *tracefs_create_file(const char *name, umode_t mode,
 	inode->i_private = data;
 	inode->i_uid = d_inode(dentry->d_parent)->i_uid;
 	inode->i_gid = d_inode(dentry->d_parent)->i_gid;
+	inode->i_ino = tracefs_get_next_ino(0);
+
 	d_instantiate(dentry, inode);
 	fsnotify_create(d_inode(dentry->d_parent), dentry);
 	return tracefs_end_creating(dentry);
@@ -669,6 +699,7 @@  static struct dentry *__create_dir(const char *name, struct dentry *parent,
 	inode->i_fop = &simple_dir_operations;
 	inode->i_uid = d_inode(dentry->d_parent)->i_uid;
 	inode->i_gid = d_inode(dentry->d_parent)->i_gid;
+	inode->i_ino = tracefs_get_next_ino(0);
 
 	ti = get_tracefs(inode);
 	ti->private = instance_inode(parent, inode);
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 45397df9bb65..7dd6678229d0 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -75,6 +75,7 @@  static inline struct tracefs_inode *get_tracefs(const struct inode *inode)
 	return container_of(inode, struct tracefs_inode, vfs_inode);
 }
 
+unsigned int tracefs_get_next_ino(int files);
 struct dentry *tracefs_start_creating(const char *name, struct dentry *parent);
 struct dentry *tracefs_end_creating(struct dentry *dentry);
 struct dentry *tracefs_failed_creating(struct dentry *dentry);