diff mbox series

[6/6] eventfs: clean up dentry ops and add revalidate function

Message ID 20240130190355.11486-6-torvalds@linux-foundation.org (mailing list archive)
State Superseded
Headers show
Series [1/6] tracefs: avoid using the ei->dentry pointer unnecessarily | expand

Commit Message

Linus Torvalds Jan. 30, 2024, 7:03 p.m. UTC
In order for the dentries to stay up-to-date with the eventfs changes,
just add a 'd_revalidate' function that checks the 'is_freed' bit.

Also, clean up the dentry release to actually use d_release() rather
than the slightly odd d_iput() function.  We don't care about the inode,
all we want to do is to get rid of the refcount to the eventfs data
added by dentry->d_fsdata.

It would probably be cleaner to make eventfs its own filesystem, or at
least set its own dentry ops when looking up eventfs files.  But as it
is, only eventfs dentries use d_fsdata, so we don't really need to split
these things up by use.

Another thing that might be worth doing is to make all eventfs lookups
mark their dentries as not worth caching.  We could do that with
d_delete(), but the DCACHE_DONTCACHE flag would likely be even better.

As it is, the dentries are all freeable, but they only tend to get freed
at memory pressure rather than more proactively.  But that's a separate
issue.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/tracefs/event_inode.c | 21 +++++----------------
 fs/tracefs/inode.c       | 27 ++++++++++++++++++---------
 fs/tracefs/internal.h    |  3 ++-
 3 files changed, 25 insertions(+), 26 deletions(-)

Comments

Steven Rostedt Jan. 30, 2024, 9:25 p.m. UTC | #1
On Tue, 30 Jan 2024 11:03:55 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Another thing that might be worth doing is to make all eventfs lookups
> mark their dentries as not worth caching.  We could do that with
> d_delete(), but the DCACHE_DONTCACHE flag would likely be even better.
> 
> As it is, the dentries are all freeable, but they only tend to get freed
> at memory pressure rather than more proactively.  But that's a separate
> issue.

I actually find the dentry's sticking around when their ref counts go to
zero a feature. Tracing applications will open and close the eventfs files
many times. If the dentry were to be deleted on every close, it would need
to be create over again in short spurts.

There's some operations that will traverse the tree many times. One
operation is on reset, as there's some dependencies in the order of
disabling triggers. If there's a dependency, and a trigger is to be
disabled out of order, the disabling will fail with -EBUSY. Thus the tools
will iterate the trigger files several times until it there's no more
changes. It's not a critical path, but I rather not add more overhead to it.

-- Steve
Linus Torvalds Jan. 30, 2024, 9:36 p.m. UTC | #2
On Tue, 30 Jan 2024 at 13:25, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I actually find the dentry's sticking around when their ref counts go to
> zero a feature. Tracing applications will open and close the eventfs files
> many times. If the dentry were to be deleted on every close, it would need
> to be create over again in short spurts.

Sure. I think this is literally a tuning thing, and we'll see if
anybody ever says "the dentry cache grows too much with tracefs".

               Linus
Al Viro Jan. 31, 2024, 1:12 a.m. UTC | #3
On Tue, Jan 30, 2024 at 11:03:55AM -0800, Linus Torvalds wrote:

> +void eventfs_d_release(struct dentry *dentry)
>  {
> -	struct eventfs_inode *ei;
> -
> -	mutex_lock(&eventfs_mutex);
> -	ei = dentry->d_fsdata;
> -	if (ei) {
> -		dentry->d_fsdata = NULL;
> -		put_ei(ei);
> -	}
> -	mutex_unlock(&eventfs_mutex);
> +	put_ei(dentry->d_fsdata);
>  }

I'd rather pass ->d_fsdata to that sucker (or exposed put_ei(),
for that matter).

> @@ -857,6 +847,5 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei)
>  	 * sticks around while the other ei->dentry are created
>  	 * and destroyed dynamically.
>  	 */
> -	simple_recursive_removal(dentry, NULL);

That also needs to move earlier in the series - bisect hazard.

> + *
> + * Note that d_revalidate is called potentially under RCU,
> + * so it can't take the eventfs mutex etc. It's fine - if
> + * we open a file just as it's marked dead, things will
> + * still work just fine, and just see the old stale case.

Looks like use after free, unless freeing ei is RCU-delayed...

> +	return !(ei && ei->is_freed);
Linus Torvalds Jan. 31, 2024, 2:37 a.m. UTC | #4
On Tue, 30 Jan 2024 at 17:12, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> > + *
> > + * Note that d_revalidate is called potentially under RCU,
> > + * so it can't take the eventfs mutex etc. It's fine - if
> > + * we open a file just as it's marked dead, things will
> > + * still work just fine, and just see the old stale case.
>
> Looks like use after free, unless freeing ei is RCU-delayed...

We hold the ref to the ei in the very dentry that is doing d_revalidate().

So it should be fine. The race is with eventfs marking the ei
'is_freed' (under the mutex that we don't hold here), but when that
happens and we end up still using the dentry, the ei is still there,
all the operations are just going to fail.

             Linus
Al Viro Jan. 31, 2024, 2:46 a.m. UTC | #5
On Tue, Jan 30, 2024 at 06:37:49PM -0800, Linus Torvalds wrote:
> On Tue, 30 Jan 2024 at 17:12, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > > + *
> > > + * Note that d_revalidate is called potentially under RCU,
> > > + * so it can't take the eventfs mutex etc. It's fine - if
> > > + * we open a file just as it's marked dead, things will
> > > + * still work just fine, and just see the old stale case.
> >
> > Looks like use after free, unless freeing ei is RCU-delayed...
> 
> We hold the ref to the ei in the very dentry that is doing d_revalidate().

What's to stop ->d_revalidate() from being called in parallel with
__dentry_kill()?
Linus Torvalds Jan. 31, 2024, 3:39 a.m. UTC | #6
On Tue, 30 Jan 2024 at 18:46, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> What's to stop ->d_revalidate() from being called in parallel with
> __dentry_kill()?

Oh, you're right.

For some reason I thought we did the d_release() _after_ the RCU grace
period, but we don't.

Why don't we, btw? It would be so much better if we did the
d_release() from __d_free(). We have all that smarts in fs/dcache.c to
decide if we need to RCU-delay it or not, and then we don't let
filesystems use it.

I assume the reason is that some 'd_delete' cases might want to sleep
etc. Still, for things like this that just want to release memory, it
would be *much* better to have d_release called when the dentry is
actually released.

Hmm. Not very many d_delete cases, but I did see a couple that
definitely want process context (dma_buf_release goes to things that
do vfree() etc).

So I guess the "make low-level filesystems do their own kfree_rcu() is
what we're doing.

In this case it's as simple as doing that

-       kfree(ei);
+       kfree_rcu(ei, rcu);

and we'd just make the rcu entry a union with something that isn't
that 'is_freed' field so that it doesn't take more space.

             Linus
Al Viro Jan. 31, 2024, 4:28 a.m. UTC | #7
On Tue, Jan 30, 2024 at 07:39:55PM -0800, Linus Torvalds wrote:

> Why don't we, btw? It would be so much better if we did the
> d_release() from __d_free(). We have all that smarts in fs/dcache.c to
> decide if we need to RCU-delay it or not, and then we don't let
> filesystems use it.

Because we want to give filesystems a chance to do something useful
in it?  Something that might need access to the parent.  Or superblock,
for that matter...  By the time we are past the RCU delay there's
very little left; if you want to look at some per-superblock data,
you would have to do something like putting it into a refcounted
structure, with each dentry holding a counting reference, with obvious
effects...

We could add a separate method just for freeing stuff, but... we already
have 4 of them related to that path (->d_delete(), ->d_prune(), ->d_iput(),
->d_release()) and the things are pretty confusing as it is, without
throwing another one into the mix.

I'll look through the RCU pathwalk fixes (will repost the rebased set in
a couple of days, need to finish the re-audit of that stuff) and see how
much would such a method simplify, but I don't think it would buy us
a lot.

> So I guess the "make low-level filesystems do their own kfree_rcu() is
> what we're doing.
> 
> In this case it's as simple as doing that
> 
> -       kfree(ei);
> +       kfree_rcu(ei, rcu);
> 
> and we'd just make the rcu entry a union with something that isn't
> that 'is_freed' field so that it doesn't take more space.
Steven Rostedt Jan. 31, 2024, 6 p.m. UTC | #8
On Tue, 30 Jan 2024 11:03:55 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> It would probably be cleaner to make eventfs its own filesystem, or at
> least set its own dentry ops when looking up eventfs files.  But as it
> is, only eventfs dentries use d_fsdata, so we don't really need to split
> these things up by use.

BTW, I have been thinking about making eventfs a completely separate file
system that could be mounted outside of tracefs. One that is readonly that
only contains the "id" and "format" files and nothing more.

Why? Because perf and powertop both use those files to know how to parse
the raw event formats. I don't think there's anything in there that
requires root privileges to read. They should not be exposing any internal
kernel information besides the event format layouts, and it would be nice
to have a /sys/kernel/events directory that only had that.

Making eventfs a separate file system where, when added to tracefs, has the
control files for the specific trace_array, but for the /sys/kernel
directory, only cares about the trace format files.

Then tracefs could be nicely converted over to kernfs, and eventfs would be
its own entity.

-- Steve
Masami Hiramatsu (Google) Jan. 31, 2024, 11:07 p.m. UTC | #9
On Wed, 31 Jan 2024 13:00:39 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 30 Jan 2024 11:03:55 -0800
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > It would probably be cleaner to make eventfs its own filesystem, or at
> > least set its own dentry ops when looking up eventfs files.  But as it
> > is, only eventfs dentries use d_fsdata, so we don't really need to split
> > these things up by use.
> 
> BTW, I have been thinking about making eventfs a completely separate file
> system that could be mounted outside of tracefs. One that is readonly that
> only contains the "id" and "format" files and nothing more.
> 
> Why? Because perf and powertop both use those files to know how to parse
> the raw event formats. I don't think there's anything in there that
> requires root privileges to read. They should not be exposing any internal
> kernel information besides the event format layouts, and it would be nice
> to have a /sys/kernel/events directory that only had that.

That's a good idea! So maybe we can allow perf to read it without root user.

> 
> Making eventfs a separate file system where, when added to tracefs, has the
> control files for the specific trace_array, but for the /sys/kernel
> directory, only cares about the trace format files.
> 
> Then tracefs could be nicely converted over to kernfs, and eventfs would be
> its own entity.

If so, maybe we can just make symlinks to the 'id' and 'format' from events
under tracefs? :)

Thank you,

> 
> -- Steve
Steven Rostedt Jan. 31, 2024, 11:23 p.m. UTC | #10
On Thu, 1 Feb 2024 08:07:15 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> > Then tracefs could be nicely converted over to kernfs, and eventfs would be
> > its own entity.  
> 
> If so, maybe we can just make symlinks to the 'id' and 'format' from events
> under tracefs? :)

I don't think that will save anything. The files currently do not allocate
any memory. If we make symlinks, we need to allocate a path, to them. I
think that would be rather difficult to do. Not to mention, that could
cause a lot of breakage. What do you do if the other filesystem isn't
mounted?

I could possibly make a light way handle to pass back to the callbacks.

struct trace_event_light {
	unsigned long			flags
	struct trace_event_call		*event_call;
};

struct trace_event_file {
	struct trace_event_light	call;
	[..]
	// Remove he flags and event_call and have it above
};

if the callback data has:

 callback(..., void **data)
 {
	struct trace_event_light *call = *data;
	struct trace_event_file *file;

	If (strcmp(name, "id") == 0 || strcmp(name, "format") == 0) {
		*data = call->event_call;
		return 1;
	}

	/* Return if this is just a light data entry */
	if (!(data->flags & TRACE_EVENT_FULL))
		return 0;

	file = container_of(data, struct trace_event_file, call);

	// continue processing the full data
}

This way the lonely eventfs could still share a lot of the code.

-- Steve
diff mbox series

Patch

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index a37db0dac302..acdc797bd9c0 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -414,23 +414,14 @@  static inline struct eventfs_inode *alloc_ei(const char *name)
 
 
 /**
- * eventfs_set_ei_status_free - remove the dentry reference from an eventfs_inode
- * @ti: the tracefs_inode of the dentry
+ * eventfs_d_release - dentry is going away
  * @dentry: dentry which has the reference to remove.
  *
  * Remove the association between a dentry from an eventfs_inode.
  */
-void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
+void eventfs_d_release(struct dentry *dentry)
 {
-	struct eventfs_inode *ei;
-
-	mutex_lock(&eventfs_mutex);
-	ei = dentry->d_fsdata;
-	if (ei) {
-		dentry->d_fsdata = NULL;
-		put_ei(ei);
-	}
-	mutex_unlock(&eventfs_mutex);
+	put_ei(dentry->d_fsdata);
 }
 
 /**
@@ -517,9 +508,8 @@  static struct dentry *eventfs_root_lookup(struct inode *dir,
 	}
 
  enoent:
-	/* Nothing found? */
-	d_add(dentry, NULL);
-	result = NULL;
+	/* Don't cache negative lookups, just return an error */
+	result = ERR_PTR(-ENOENT);
 
  out:
 	mutex_unlock(&eventfs_mutex);
@@ -857,6 +847,5 @@  void eventfs_remove_events_dir(struct eventfs_inode *ei)
 	 * sticks around while the other ei->dentry are created
 	 * and destroyed dynamically.
 	 */
-	simple_recursive_removal(dentry, NULL);
 	dput(dentry);
 }
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index e1b172c0e091..64122787e5d0 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -379,21 +379,30 @@  static const struct super_operations tracefs_super_operations = {
 	.show_options	= tracefs_show_options,
 };
 
-static void tracefs_dentry_iput(struct dentry *dentry, struct inode *inode)
+/*
+ * It would be cleaner if eventfs had its own dentry ops.
+ *
+ * Note that d_revalidate is called potentially under RCU,
+ * so it can't take the eventfs mutex etc. It's fine - if
+ * we open a file just as it's marked dead, things will
+ * still work just fine, and just see the old stale case.
+ */
+static void tracefs_d_release(struct dentry *dentry)
 {
-	struct tracefs_inode *ti;
+	if (dentry->d_fsdata)
+		eventfs_d_release(dentry);
+}
 
-	if (!dentry || !inode)
-		return;
+static int tracefs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct eventfs_inode *ei = dentry->d_fsdata;
 
-	ti = get_tracefs(inode);
-	if (ti && ti->flags & TRACEFS_EVENT_INODE)
-		eventfs_set_ei_status_free(ti, dentry);
-	iput(inode);
+	return !(ei && ei->is_freed);
 }
 
 static const struct dentry_operations tracefs_dentry_operations = {
-	.d_iput = tracefs_dentry_iput,
+	.d_revalidate = tracefs_d_revalidate,
+	.d_release = tracefs_d_release,
 };
 
 static int trace_fill_super(struct super_block *sb, void *data, int silent)
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 72db3bdc4dfb..d4194466b643 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -79,6 +79,7 @@  struct inode *tracefs_get_inode(struct super_block *sb);
 struct dentry *eventfs_start_creating(const char *name, struct dentry *parent);
 struct dentry *eventfs_failed_creating(struct dentry *dentry);
 struct dentry *eventfs_end_creating(struct dentry *dentry);
-void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry);
+
+void eventfs_d_release(struct dentry *dentry);
 
 #endif /* _TRACEFS_INTERNAL_H */