Message ID | 20240130190355.11486-6-torvalds@linux-foundation.org (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [1/6] tracefs: avoid using the ei->dentry pointer unnecessarily | expand |
On Tue, 30 Jan 2024 11:03:55 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > Another thing that might be worth doing is to make all eventfs lookups > mark their dentries as not worth caching. We could do that with > d_delete(), but the DCACHE_DONTCACHE flag would likely be even better. > > As it is, the dentries are all freeable, but they only tend to get freed > at memory pressure rather than more proactively. But that's a separate > issue. I actually find the dentry's sticking around when their ref counts go to zero a feature. Tracing applications will open and close the eventfs files many times. If the dentry were to be deleted on every close, it would need to be create over again in short spurts. There's some operations that will traverse the tree many times. One operation is on reset, as there's some dependencies in the order of disabling triggers. If there's a dependency, and a trigger is to be disabled out of order, the disabling will fail with -EBUSY. Thus the tools will iterate the trigger files several times until it there's no more changes. It's not a critical path, but I rather not add more overhead to it. -- Steve
On Tue, 30 Jan 2024 at 13:25, Steven Rostedt <rostedt@goodmis.org> wrote: > > I actually find the dentry's sticking around when their ref counts go to > zero a feature. Tracing applications will open and close the eventfs files > many times. If the dentry were to be deleted on every close, it would need > to be create over again in short spurts. Sure. I think this is literally a tuning thing, and we'll see if anybody ever says "the dentry cache grows too much with tracefs". Linus
On Tue, Jan 30, 2024 at 11:03:55AM -0800, Linus Torvalds wrote: > +void eventfs_d_release(struct dentry *dentry) > { > - struct eventfs_inode *ei; > - > - mutex_lock(&eventfs_mutex); > - ei = dentry->d_fsdata; > - if (ei) { > - dentry->d_fsdata = NULL; > - put_ei(ei); > - } > - mutex_unlock(&eventfs_mutex); > + put_ei(dentry->d_fsdata); > } I'd rather pass ->d_fsdata to that sucker (or exposed put_ei(), for that matter). > @@ -857,6 +847,5 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei) > * sticks around while the other ei->dentry are created > * and destroyed dynamically. > */ > - simple_recursive_removal(dentry, NULL); That also needs to move earlier in the series - bisect hazard. > + * > + * Note that d_revalidate is called potentially under RCU, > + * so it can't take the eventfs mutex etc. It's fine - if > + * we open a file just as it's marked dead, things will > + * still work just fine, and just see the old stale case. Looks like use after free, unless freeing ei is RCU-delayed... > + return !(ei && ei->is_freed);
On Tue, 30 Jan 2024 at 17:12, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > + * > > + * Note that d_revalidate is called potentially under RCU, > > + * so it can't take the eventfs mutex etc. It's fine - if > > + * we open a file just as it's marked dead, things will > > + * still work just fine, and just see the old stale case. > > Looks like use after free, unless freeing ei is RCU-delayed... We hold the ref to the ei in the very dentry that is doing d_revalidate(). So it should be fine. The race is with eventfs marking the ei 'is_freed' (under the mutex that we don't hold here), but when that happens and we end up still using the dentry, the ei is still there, all the operations are just going to fail. Linus
On Tue, Jan 30, 2024 at 06:37:49PM -0800, Linus Torvalds wrote: > On Tue, 30 Jan 2024 at 17:12, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > > + * > > > + * Note that d_revalidate is called potentially under RCU, > > > + * so it can't take the eventfs mutex etc. It's fine - if > > > + * we open a file just as it's marked dead, things will > > > + * still work just fine, and just see the old stale case. > > > > Looks like use after free, unless freeing ei is RCU-delayed... > > We hold the ref to the ei in the very dentry that is doing d_revalidate(). What's to stop ->d_revalidate() from being called in parallel with __dentry_kill()?
On Tue, 30 Jan 2024 at 18:46, Al Viro <viro@zeniv.linux.org.uk> wrote: > > What's to stop ->d_revalidate() from being called in parallel with > __dentry_kill()? Oh, you're right. For some reason I thought we did the d_release() _after_ the RCU grace period, but we don't. Why don't we, btw? It would be so much better if we did the d_release() from __d_free(). We have all that smarts in fs/dcache.c to decide if we need to RCU-delay it or not, and then we don't let filesystems use it. I assume the reason is that some 'd_delete' cases might want to sleep etc. Still, for things like this that just want to release memory, it would be *much* better to have d_release called when the dentry is actually released. Hmm. Not very many d_delete cases, but I did see a couple that definitely want process context (dma_buf_release goes to things that do vfree() etc). So I guess the "make low-level filesystems do their own kfree_rcu() is what we're doing. In this case it's as simple as doing that - kfree(ei); + kfree_rcu(ei, rcu); and we'd just make the rcu entry a union with something that isn't that 'is_freed' field so that it doesn't take more space. Linus
On Tue, Jan 30, 2024 at 07:39:55PM -0800, Linus Torvalds wrote: > Why don't we, btw? It would be so much better if we did the > d_release() from __d_free(). We have all that smarts in fs/dcache.c to > decide if we need to RCU-delay it or not, and then we don't let > filesystems use it. Because we want to give filesystems a chance to do something useful in it? Something that might need access to the parent. Or superblock, for that matter... By the time we are past the RCU delay there's very little left; if you want to look at some per-superblock data, you would have to do something like putting it into a refcounted structure, with each dentry holding a counting reference, with obvious effects... We could add a separate method just for freeing stuff, but... we already have 4 of them related to that path (->d_delete(), ->d_prune(), ->d_iput(), ->d_release()) and the things are pretty confusing as it is, without throwing another one into the mix. I'll look through the RCU pathwalk fixes (will repost the rebased set in a couple of days, need to finish the re-audit of that stuff) and see how much would such a method simplify, but I don't think it would buy us a lot. > So I guess the "make low-level filesystems do their own kfree_rcu() is > what we're doing. > > In this case it's as simple as doing that > > - kfree(ei); > + kfree_rcu(ei, rcu); > > and we'd just make the rcu entry a union with something that isn't > that 'is_freed' field so that it doesn't take more space.
On Tue, 30 Jan 2024 11:03:55 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > It would probably be cleaner to make eventfs its own filesystem, or at > least set its own dentry ops when looking up eventfs files. But as it > is, only eventfs dentries use d_fsdata, so we don't really need to split > these things up by use. BTW, I have been thinking about making eventfs a completely separate file system that could be mounted outside of tracefs. One that is readonly that only contains the "id" and "format" files and nothing more. Why? Because perf and powertop both use those files to know how to parse the raw event formats. I don't think there's anything in there that requires root privileges to read. They should not be exposing any internal kernel information besides the event format layouts, and it would be nice to have a /sys/kernel/events directory that only had that. Making eventfs a separate file system where, when added to tracefs, has the control files for the specific trace_array, but for the /sys/kernel directory, only cares about the trace format files. Then tracefs could be nicely converted over to kernfs, and eventfs would be its own entity. -- Steve
On Wed, 31 Jan 2024 13:00:39 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > On Tue, 30 Jan 2024 11:03:55 -0800 > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > It would probably be cleaner to make eventfs its own filesystem, or at > > least set its own dentry ops when looking up eventfs files. But as it > > is, only eventfs dentries use d_fsdata, so we don't really need to split > > these things up by use. > > BTW, I have been thinking about making eventfs a completely separate file > system that could be mounted outside of tracefs. One that is readonly that > only contains the "id" and "format" files and nothing more. > > Why? Because perf and powertop both use those files to know how to parse > the raw event formats. I don't think there's anything in there that > requires root privileges to read. They should not be exposing any internal > kernel information besides the event format layouts, and it would be nice > to have a /sys/kernel/events directory that only had that. That's a good idea! So maybe we can allow perf to read it without root user. > > Making eventfs a separate file system where, when added to tracefs, has the > control files for the specific trace_array, but for the /sys/kernel > directory, only cares about the trace format files. > > Then tracefs could be nicely converted over to kernfs, and eventfs would be > its own entity. If so, maybe we can just make symlinks to the 'id' and 'format' from events under tracefs? :) Thank you, > > -- Steve
On Thu, 1 Feb 2024 08:07:15 +0900 Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote: > > Then tracefs could be nicely converted over to kernfs, and eventfs would be > > its own entity. > > If so, maybe we can just make symlinks to the 'id' and 'format' from events > under tracefs? :) I don't think that will save anything. The files currently do not allocate any memory. If we make symlinks, we need to allocate a path, to them. I think that would be rather difficult to do. Not to mention, that could cause a lot of breakage. What do you do if the other filesystem isn't mounted? I could possibly make a light way handle to pass back to the callbacks. struct trace_event_light { unsigned long flags struct trace_event_call *event_call; }; struct trace_event_file { struct trace_event_light call; [..] // Remove he flags and event_call and have it above }; if the callback data has: callback(..., void **data) { struct trace_event_light *call = *data; struct trace_event_file *file; If (strcmp(name, "id") == 0 || strcmp(name, "format") == 0) { *data = call->event_call; return 1; } /* Return if this is just a light data entry */ if (!(data->flags & TRACE_EVENT_FULL)) return 0; file = container_of(data, struct trace_event_file, call); // continue processing the full data } This way the lonely eventfs could still share a lot of the code. -- Steve
diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index a37db0dac302..acdc797bd9c0 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -414,23 +414,14 @@ static inline struct eventfs_inode *alloc_ei(const char *name) /** - * eventfs_set_ei_status_free - remove the dentry reference from an eventfs_inode - * @ti: the tracefs_inode of the dentry + * eventfs_d_release - dentry is going away * @dentry: dentry which has the reference to remove. * * Remove the association between a dentry from an eventfs_inode. */ -void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry) +void eventfs_d_release(struct dentry *dentry) { - struct eventfs_inode *ei; - - mutex_lock(&eventfs_mutex); - ei = dentry->d_fsdata; - if (ei) { - dentry->d_fsdata = NULL; - put_ei(ei); - } - mutex_unlock(&eventfs_mutex); + put_ei(dentry->d_fsdata); } /** @@ -517,9 +508,8 @@ static struct dentry *eventfs_root_lookup(struct inode *dir, } enoent: - /* Nothing found? */ - d_add(dentry, NULL); - result = NULL; + /* Don't cache negative lookups, just return an error */ + result = ERR_PTR(-ENOENT); out: mutex_unlock(&eventfs_mutex); @@ -857,6 +847,5 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei) * sticks around while the other ei->dentry are created * and destroyed dynamically. */ - simple_recursive_removal(dentry, NULL); dput(dentry); } diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index e1b172c0e091..64122787e5d0 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -379,21 +379,30 @@ static const struct super_operations tracefs_super_operations = { .show_options = tracefs_show_options, }; -static void tracefs_dentry_iput(struct dentry *dentry, struct inode *inode) +/* + * It would be cleaner if eventfs had its own dentry ops. + * + * Note that d_revalidate is called potentially under RCU, + * so it can't take the eventfs mutex etc. It's fine - if + * we open a file just as it's marked dead, things will + * still work just fine, and just see the old stale case. + */ +static void tracefs_d_release(struct dentry *dentry) { - struct tracefs_inode *ti; + if (dentry->d_fsdata) + eventfs_d_release(dentry); +} - if (!dentry || !inode) - return; +static int tracefs_d_revalidate(struct dentry *dentry, unsigned int flags) +{ + struct eventfs_inode *ei = dentry->d_fsdata; - ti = get_tracefs(inode); - if (ti && ti->flags & TRACEFS_EVENT_INODE) - eventfs_set_ei_status_free(ti, dentry); - iput(inode); + return !(ei && ei->is_freed); } static const struct dentry_operations tracefs_dentry_operations = { - .d_iput = tracefs_dentry_iput, + .d_revalidate = tracefs_d_revalidate, + .d_release = tracefs_d_release, }; static int trace_fill_super(struct super_block *sb, void *data, int silent) diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h index 72db3bdc4dfb..d4194466b643 100644 --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -79,6 +79,7 @@ struct inode *tracefs_get_inode(struct super_block *sb); struct dentry *eventfs_start_creating(const char *name, struct dentry *parent); struct dentry *eventfs_failed_creating(struct dentry *dentry); struct dentry *eventfs_end_creating(struct dentry *dentry); -void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry); + +void eventfs_d_release(struct dentry *dentry); #endif /* _TRACEFS_INTERNAL_H */
In order for the dentries to stay up-to-date with the eventfs changes, just add a 'd_revalidate' function that checks the 'is_freed' bit. Also, clean up the dentry release to actually use d_release() rather than the slightly odd d_iput() function. We don't care about the inode, all we want to do is to get rid of the refcount to the eventfs data added by dentry->d_fsdata. It would probably be cleaner to make eventfs its own filesystem, or at least set its own dentry ops when looking up eventfs files. But as it is, only eventfs dentries use d_fsdata, so we don't really need to split these things up by use. Another thing that might be worth doing is to make all eventfs lookups mark their dentries as not worth caching. We could do that with d_delete(), but the DCACHE_DONTCACHE flag would likely be even better. As it is, the dentries are all freeable, but they only tend to get freed at memory pressure rather than more proactively. But that's a separate issue. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- fs/tracefs/event_inode.c | 21 +++++---------------- fs/tracefs/inode.c | 27 ++++++++++++++++++--------- fs/tracefs/internal.h | 3 ++- 3 files changed, 25 insertions(+), 26 deletions(-)