diff mbox series

eventfs: Stop using dcache_readdir() for getdents()

Message ID 20240103102553.17a19cea@gandalf.local.home (mailing list archive)
State Accepted
Headers show
Series eventfs: Stop using dcache_readdir() for getdents() | expand

Commit Message

Steven Rostedt Jan. 3, 2024, 3:25 p.m. UTC
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

The eventfs creates dynamically allocated dentries and inodes. Using the
dcache_readdir() logic for its own directory lookups requires hiding the
cursor of the dcache logic and playing games to allow the dcache_readdir()
to still have access to the cursor while the eventfs saved what it created
and what it needs to release.

Instead, just have eventfs have its own iterate_shared callback function
that will fill in the dent entries. This simplifies the code quite a bit.

Also, remove the "lookup" parameter to the create_file/dir_dentry() and
always have it return a dentry that has its ref count incremented, and
have the caller call the dput. This simplifies that code as well.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 fs/tracefs/event_inode.c | 237 +++++++++++++--------------------------
 1 file changed, 78 insertions(+), 159 deletions(-)

Comments

Linus Torvalds Jan. 3, 2024, 6:12 p.m. UTC | #1
On Wed, 3 Jan 2024 at 07:24, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Instead, just have eventfs have its own iterate_shared callback function
> that will fill in the dent entries. This simplifies the code quite a bit.

Much better. Now eventfs looks more like a real filesystem, and less
like an eldritch horror monster that is parts of dcache tackled onto a
pseudo-filesystem.

However, one request, and one nit:

> Also, remove the "lookup" parameter to the create_file/dir_dentry() and
> always have it return a dentry that has its ref count incremented, and
> have the caller call the dput. This simplifies that code as well.

Can you please do that as a separate patch, where the first patch just
cleans up the directory iteration, and the second patch then goes "now
there are no more callers that have the 'lookup' argument set to
false".

Because as-is, the patch is kind of two things mixed up.

The small nit is this:

> +static int eventfs_iterate(struct file *file, struct dir_context *ctx)
>  {
> +       /*
> +        * Need to create the dentries and inodes to have a consistent
> +        * inode number.
> +        */
>         list_for_each_entry_srcu(ei_child, &ei->children, list,
>                                  srcu_read_lock_held(&eventfs_srcu)) {
> -               d = create_dir_dentry(ei, ei_child, parent, false);
> -               if (d) {
> -                       ret = add_dentries(&dentries, d, cnt);
> -                       if (ret < 0)
> -                               break;
> -                       cnt++;
> +
> +               if (ei_child->is_freed)
> +                       continue;
> +
> +               name = ei_child->name;
> +
> +               dentry = create_dir_dentry(ei, ei_child, ei_dentry);
> +               if (!dentry)
> +                       goto out;
> +               ino = dentry->d_inode->i_ino;
> +               dput(dentry);
> +
> +               if (c > 0) {
> +                       c--;
> +                       continue;
>                 }

Just move this "is the position before this name" up to the top of the
loop. Even above the "is_freed" part.

Let's just always count all the entries in the child list.

And same for the ei->nr_entries loop:

>         for (i = 0; i < ei->nr_entries; i++) {

where there's no point in creating that dentry just to look up the
inode number, only to then decide "Oh, we already iterated past this
part, so let's not do anything with it".

This wouldn't seem to matter much with a big enough getdents buffer
(which is the normal user level behavior), but it actually does,
because we don't keep track of "we have read to the end of the
directory".

So every readdir ends up effectively doing getdents at least twice:
once to read the directory entries, and then once to just be told
"that was all".

End result: you should strive very hard to *not* waste time on the
directory entries that have already been read, and are less than
'ctx->pos'.

             Linus
Steven Rostedt Jan. 3, 2024, 6:27 p.m. UTC | #2
On Wed, 3 Jan 2024 10:12:08 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, 3 Jan 2024 at 07:24, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > Instead, just have eventfs have its own iterate_shared callback function
> > that will fill in the dent entries. This simplifies the code quite a bit.  
> 
> Much better. Now eventfs looks more like a real filesystem, and less
> like an eldritch horror monster that is parts of dcache tackled onto a
> pseudo-filesystem.

Thanks.

> 
> However, one request, and one nit:
> 
> > Also, remove the "lookup" parameter to the create_file/dir_dentry() and
> > always have it return a dentry that has its ref count incremented, and
> > have the caller call the dput. This simplifies that code as well.  
> 
> Can you please do that as a separate patch, where the first patch just
> cleans up the directory iteration, and the second patch then goes "now
> there are no more callers that have the 'lookup' argument set to
> false".

Yeah, I was thinking of doing it as two patches and figured I'd merge them
into one because I deleted one of the users of it. As I was on the fence
with doing two patches, I'm happy to change that.

> 
> Because as-is, the patch is kind of two things mixed up.
> 
> The small nit is this:
> 
> > +static int eventfs_iterate(struct file *file, struct dir_context *ctx)
> >  {
> > +       /*
> > +        * Need to create the dentries and inodes to have a consistent
> > +        * inode number.
> > +        */
> >         list_for_each_entry_srcu(ei_child, &ei->children, list,
> >                                  srcu_read_lock_held(&eventfs_srcu)) {
> > -               d = create_dir_dentry(ei, ei_child, parent, false);
> > -               if (d) {
> > -                       ret = add_dentries(&dentries, d, cnt);
> > -                       if (ret < 0)
> > -                               break;
> > -                       cnt++;
> > +
> > +               if (ei_child->is_freed)
> > +                       continue;
> > +
> > +               name = ei_child->name;
> > +
> > +               dentry = create_dir_dentry(ei, ei_child, ei_dentry);
> > +               if (!dentry)
> > +                       goto out;
> > +               ino = dentry->d_inode->i_ino;
> > +               dput(dentry);
> > +
> > +               if (c > 0) {
> > +                       c--;
> > +                       continue;
> >                 }  
> 
> Just move this "is the position before this name" up to the top of the
> loop. Even above the "is_freed" part.
> 
> Let's just always count all the entries in the child list.
> 
> And same for the ei->nr_entries loop:
> 
> >         for (i = 0; i < ei->nr_entries; i++) {  
> 
> where there's no point in creating that dentry just to look up the
> inode number, only to then decide "Oh, we already iterated past this
> part, so let's not do anything with it".
> 
> This wouldn't seem to matter much with a big enough getdents buffer
> (which is the normal user level behavior), but it actually does,
> because we don't keep track of "we have read to the end of the
> directory".
> 
> So every readdir ends up effectively doing getdents at least twice:
> once to read the directory entries, and then once to just be told
> "that was all".
> 
> End result: you should strive very hard to *not* waste time on the
> directory entries that have already been read, and are less than
> 'ctx->pos'.

My patch originally did that, but then I was worried about counting something
that doesn't exist.

If it is done twice, there's a good chance the dentry will still be around
anyway, so it doesn't slow it down that much. The dput() only decrements
the entry and doesn't free it. I added back my "show_events_dentries" file
to test this. They sit with refcount equal to zero waiting to be reclaimed.
But if they get referenced again, the refcount goes up again.

That is, the first time it is called, where ctx->pos is likely zero, it
creates the dentry, but that is also added to the list. The next time, with
ctx->pos greater than zero, the create_dir_dentry() starts with:

static struct dentry *
create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei,
		  struct dentry *parent)
{
	struct dentry *dentry = NULL;

	WARN_ON_ONCE(!inode_is_locked(parent->d_inode));

	mutex_lock(&eventfs_mutex);
	if (pei->is_freed || ei->is_freed) {
		mutex_unlock(&eventfs_mutex);
		return NULL;
	}
	if (ei->dentry) {
		/* If the dentry already has a dentry, use it */
		dentry = ei->dentry;
		dget(dentry);
		mutex_unlock(&eventfs_mutex);
		return dentry;
	}
	mutex_unlock(&eventfs_mutex);

Where the already created dentry is returned. (hmm, I just noticed that
comment should be "if the eventfs_inode already has a dentry" and not "If
the dentry already has a dentry" :-p ).

It does require taking a mutex, but that's actually quite fast too.

If you don't think it will cause any inconsistencies to count something
that perhaps doesn't exist anymore, then I can move the ctx->pos check up.

-- Steve
Linus Torvalds Jan. 3, 2024, 6:38 p.m. UTC | #3
On Wed, 3 Jan 2024 at 10:12, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Much better. Now eventfs looks more like a real filesystem, and less
> like an eldritch horror monster that is parts of dcache tackled onto a
> pseudo-filesystem.

Oh, except I think you still need to just remove the 'set_gid()' mess.

It's disgusting and it's wrong, and it's not even what the 'uid'
option does (it only sets the root inode uid).

If you remount the filesystem with different gid values, you get to
keep both broken pieces. And if it isn't a remount, then setting the
root uid is sufficient.

I think the whole thing was triggered by commit 49d67e445742, and
maybe the fix is to just revert that commit.

That commit makes no sense in general, since the default mounting
position for tracefs that the kernel sets up is only accessible to
root anyway.

Alternatively, just do the ->permissions() thing, and allow access to
the group in the mount options.

Getting rid of set_gid() would be this attached lovely patch:

 fs/tracefs/inode.c | 83 ++----------------------------------------------------
 1 file changed, 2 insertions(+), 81 deletions(-)

and would get rid of the final (?) piece of disgusting dcache hackery
that tracefs most definitely should not have.

             Linus
Steven Rostedt Jan. 3, 2024, 6:51 p.m. UTC | #4
On Wed, 3 Jan 2024 10:38:09 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, 3 Jan 2024 at 10:12, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Much better. Now eventfs looks more like a real filesystem, and less
> > like an eldritch horror monster that is parts of dcache tackled onto a
> > pseudo-filesystem.  
> 
> Oh, except I think you still need to just remove the 'set_gid()' mess.
> 
> It's disgusting and it's wrong, and it's not even what the 'uid'
> option does (it only sets the root inode uid).
> 
> If you remount the filesystem with different gid values, you get to
> keep both broken pieces. And if it isn't a remount, then setting the
> root uid is sufficient.
> 
> I think the whole thing was triggered by commit 49d67e445742, and
> maybe the fix is to just revert that commit.
> 
> That commit makes no sense in general, since the default mounting
> position for tracefs that the kernel sets up is only accessible to
> root anyway.
> 
> Alternatively, just do the ->permissions() thing, and allow access to
> the group in the mount options.
> 
> Getting rid of set_gid() would be this attached lovely patch:
> 
>  fs/tracefs/inode.c | 83 ++----------------------------------------------------
>  1 file changed, 2 insertions(+), 81 deletions(-)
> 
> and would get rid of the final (?) piece of disgusting dcache hackery
> that tracefs most definitely should not have.
>

I'll look at that and play with it. I understand VFS much better now that I
spent so much time with eventfs. That commit had to do with allowing OTH
read access, which is a security issue as the trace files expose a lot of
the kernel internals.

I think these changes are a bit much for -rc8, don't you? Or do you want
all this in before v6.7 is released. I'd be more comfortable with adding
these changes in the upcoming merge window, where I can have more time
playing with them.

-- Steve
Linus Torvalds Jan. 3, 2024, 7:04 p.m. UTC | #5
On Wed, 3 Jan 2024 at 10:50, Steven Rostedt <rostedt@goodmis.org> wrote:
> I think these changes are a bit much for -rc8, don't you?

Oh, absolutely.

None of this is rc8 material apart from the oops fix in your pull
request (which my patch that then removes the whole function did *not*
have - so that patch won't apply as-is to your tree).

But let's aim for a tracefs that doesn't play games with the dcache.

Basically, the dcache is *much* too subtle for a filesystem to mess
with. You should either:

 - be a fully virtual filesystem where the dcache just maintains
everything, and you don't mess with it because you don't need to (eg
tmpfs etc). Everything is in the dcache, and you don't need to touch
it, because you don't even care - the dcache is doing everything for
you.

 - be a "normal" filesystem where the dcache is just a cache, and you
maintain your *own* file data structures, and just get normal lookup
etc ops, and you don't mess with the dcache because it is just doing
its caching thing that you as a filesystem don't care about.

and in both of those cases the filesystem just never has to really
delve into it. But tracefs had this abomination where it maintained
its own data structures, _and_ it tried to make them coherent with the
dcache that maintained part of it. That's the part I hated.

               Linus
Steven Rostedt Jan. 3, 2024, 7:53 p.m. UTC | #6
On Wed, 3 Jan 2024 10:38:09 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> @@ -332,10 +255,8 @@ static int tracefs_apply_options(struct super_block *sb, bool remount)
>  	if (!remount || opts->opts & BIT(Opt_uid))
>  		inode->i_uid = opts->uid;
>  
> -	if (!remount || opts->opts & BIT(Opt_gid)) {
> -		/* Set all the group ids to the mount option */
> -		set_gid(sb->s_root, opts->gid);
> -	}
> +	if (!remount || opts->opts & BIT(Opt_gid))
> +		inode->i_gid = opts->gid;
>  
>  	return 0;
>  }

This doesn't work because for tracefs (not eventfs) the dentries are
created at boot up and before the file system is mounted. This means you
can't even set a gid in /etc/fstab. This will cause a regression.

tracefs was designed after debugfs, which also ignores gid. But because
there's users out there that want non-root accounts to have access to
tracing, it is documented to set the gid to a group that you can then add
users to. And that's the reason behind the set_gid() walk.

Reverting that one commit won't fix things either, because it only blocked
OTH to be read, but the creation of the files changed their mode's passed
to block OTH as well, so all those would need to be changed too. And I
don't think making the trace files open to OTH is a good solution, even if
the tracefs top level directory itself blocks other. The issue was that the
user use to just mount the top level to allow the group access to the files
below, which allowed all users access. But this is weak control of the file
system.

Even my non-test machines have me in the tracing group so my user account
has access to tracefs.

On boot up, all the tracefs files are created via tracefs_create_file() and
directories by tracefs_create_dir() which was copied from
debugfs_create_file/dir(). At this moment, the dentry is created with the
permissions set. There's no looking at the super block.

So we need a way to change the permissions at mount time.

The only solution I can think of that doesn't include walking the current
dentries, is to convert all of tracefs to be more like eventfs, and have
the dentries created on demand. But perhaps, different than eventfs, they
do not need to be freed when they are no longer referenced, which should
make it easier to implement. And there's not nearly as many files and
directories, so keeping meta data around isn't as much of an issue.

Instead of creating the inode and dentry in the tracefs_create_file/dir(),
it could just create a descriptor that holds the fops, data and mode. Then
on lookup, it would create the inodes and dentries similar to eventfs.

It would need its own iterate_shared as well.

-- Steve
Linus Torvalds Jan. 3, 2024, 7:57 p.m. UTC | #7
On Wed, 3 Jan 2024 at 11:52, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> This doesn't work because for tracefs (not eventfs) the dentries are
> created at boot up and before the file system is mounted. This means you
> can't even set a gid in /etc/fstab. This will cause a regression.

Which is why I suggested

   "I think the whole thing was triggered by commit 49d67e445742, and
    maybe the fix is to just revert that commit"

there was never any coherent reason for that commit, since the
permissions are dealt with at the mount point.

So this all was triggered by that original change that makes little
sense. The fact that you then apparently changed other things
afterwards too might need fixing.

Or, you know, you could do what I've told you to do at least TEN TIMES
already, which is to not mess with any of this, and just implement the
'->permission()' callback (and getattr() to just make 'ls' look sane
too, rather than silently saying "we'll act as if gid is set right,
but not show it").

Why do you keep bringing up things that I've told you solutions for many times?

                 Linus
Linus Torvalds Jan. 3, 2024, 9:54 p.m. UTC | #8
On Wed, 3 Jan 2024 at 11:57, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Or, you know, you could do what I've told you to do at least TEN TIMES
> already, which is to not mess with any of this, and just implement the
> '->permission()' callback (and getattr() to just make 'ls' look sane
> too, rather than silently saying "we'll act as if gid is set right,
> but not show it").

Actually, an even simpler option might be to just do this all at
d_revalidate() time.

Here's an updated patch that builds, and is PURELY AN EXAMPLE. I think
it "works", but it currently always resets the inode mode/uid/gid
unconditionally, which is wrong - it should not do so if the inode has
been manually set.

So take this as a "this might work", but it probably needs a bit more
work - eventfs_set_attr() should set some bit in the inode to say
"these have been set manually", and then revalidate would say "I'll
not touch inodes that have that bit set".

Or something.

Anyway, this patch is nwo relative to your latest pull request, so it
has the check for dentry->d_inode in set_gid() (and still removes the
whole function).

Again: UNTESTED, and meant as a "this is another way to avoid messing
with the dentry tree manually, and just using the VFS interfaces we
already have"

               Linus
Steven Rostedt Jan. 3, 2024, 10:05 p.m. UTC | #9
On Wed, 3 Jan 2024 13:54:36 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, 3 Jan 2024 at 11:57, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Or, you know, you could do what I've told you to do at least TEN TIMES
> > already, which is to not mess with any of this, and just implement the
> > '->permission()' callback (and getattr() to just make 'ls' look sane
> > too, rather than silently saying "we'll act as if gid is set right,
> > but not show it").
> 
> Actually, an even simpler option might be to just do this all at
> d_revalidate() time.
> 
> Here's an updated patch that builds, and is PURELY AN EXAMPLE. I think
> it "works", but it currently always resets the inode mode/uid/gid
> unconditionally, which is wrong - it should not do so if the inode has
> been manually set.
> 
> So take this as a "this might work", but it probably needs a bit more
> work - eventfs_set_attr() should set some bit in the inode to say
> "these have been set manually", and then revalidate would say "I'll
> not touch inodes that have that bit set".
> 
> Or something.
> 
> Anyway, this patch is nwo relative to your latest pull request, so it
> has the check for dentry->d_inode in set_gid() (and still removes the
> whole function).
> 
> Again: UNTESTED, and meant as a "this is another way to avoid messing
> with the dentry tree manually, and just using the VFS interfaces we
> already have"
> 

I actually have something almost working too. Here's the WIP. It only works
for tracefs, and now eventfs needs to be updated as the "events" directory
no longer has the right ownership. So I need a way to link the eventfs
entries to use the tracefs default conditionally.

The issue I'm currently working on is to make the files of:

  /sys/kernel/tracing/events

default to the root of tracefs, but

 /sys/kernel/tracing/instances/foo/events

to default to what events was when it was created by "mkdir instances/foo".

But other than that, the tracefs part seems to work.

-- Steve

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index ae648deed019..9c55dc903d7d 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -141,10 +141,76 @@ static int tracefs_syscall_rmdir(struct inode *inode, struct dentry *dentry)
 	return ret;
 }
 
-static const struct inode_operations tracefs_dir_inode_operations = {
+static void set_tracefs_inode_owner(struct inode *inode)
+{
+	struct inode *root_inode = d_inode(inode->i_sb->s_root);
+	struct tracefs_inode *ti = get_tracefs(inode);
+
+	/*
+	 * If this inode has never been referenced, then update
+	 * the permissions to the superblock.
+	 */
+	if (!(ti->flags & TRACEFS_EVENT_UID_PERM_SET))
+		inode->i_uid = root_inode->i_uid;
+
+	if (!(ti->flags & TRACEFS_EVENT_GID_PERM_SET))
+		inode->i_gid = root_inode->i_gid;
+}
+
+static int tracefs_permission(struct mnt_idmap *idmap,
+			      struct inode *inode, int mask)
+{
+	set_tracefs_inode_owner(inode);
+	return generic_permission(idmap, inode, mask);
+}
+
+static int tracefs_getattr(struct mnt_idmap *idmap,
+			   const struct path *path, struct kstat *stat,
+			   u32 request_mask, unsigned int flags)
+{
+	struct inode *inode = d_backing_inode(path->dentry);
+
+	set_tracefs_inode_owner(inode);
+	generic_fillattr(idmap, request_mask, inode, stat);
+	return 0;
+}
+
+static int tracefs_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
+			   struct iattr *attr)
+{
+	unsigned int ia_valid = attr->ia_valid;
+	struct inode *inode = d_inode(dentry);
+	struct tracefs_inode *ti = get_tracefs(inode);
+
+	if (ia_valid & ATTR_UID)
+		ti->flags |= TRACEFS_EVENT_UID_PERM_SET;
+
+	if (ia_valid & ATTR_GID)
+		ti->flags |= TRACEFS_EVENT_GID_PERM_SET;
+
+	return simple_setattr(idmap, dentry, attr);
+}
+
+static const struct inode_operations tracefs_instance_dir_inode_operations = {
 	.lookup		= simple_lookup,
 	.mkdir		= tracefs_syscall_mkdir,
 	.rmdir		= tracefs_syscall_rmdir,
+	.permission	= tracefs_permission,
+	.getattr	= tracefs_getattr,
+	.setattr	= tracefs_setattr,
+};
+
+static const struct inode_operations tracefs_dir_inode_operations = {
+	.lookup		= simple_lookup,
+	.permission	= tracefs_permission,
+	.getattr	= tracefs_getattr,
+	.setattr	= tracefs_setattr,
+};
+
+static const struct inode_operations tracefs_file_inode_operations = {
+	.permission	= tracefs_permission,
+	.getattr	= tracefs_getattr,
+	.setattr	= tracefs_setattr,
 };
 
 struct inode *tracefs_get_inode(struct super_block *sb)
@@ -183,77 +249,6 @@ struct tracefs_fs_info {
 	struct tracefs_mount_opts mount_opts;
 };
 
-static void change_gid(struct dentry *dentry, kgid_t gid)
-{
-	if (!dentry->d_inode)
-		return;
-	dentry->d_inode->i_gid = gid;
-}
-
-/*
- * Taken from d_walk, but without he need for handling renames.
- * Nothing can be renamed while walking the list, as tracefs
- * does not support renames. This is only called when mounting
- * or remounting the file system, to set all the files to
- * the given gid.
- */
-static void set_gid(struct dentry *parent, kgid_t gid)
-{
-	struct dentry *this_parent;
-	struct list_head *next;
-
-	this_parent = parent;
-	spin_lock(&this_parent->d_lock);
-
-	change_gid(this_parent, gid);
-repeat:
-	next = this_parent->d_subdirs.next;
-resume:
-	while (next != &this_parent->d_subdirs) {
-		struct list_head *tmp = next;
-		struct dentry *dentry = list_entry(tmp, struct dentry, d_child);
-		next = tmp->next;
-
-		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
-
-		change_gid(dentry, gid);
-
-		if (!list_empty(&dentry->d_subdirs)) {
-			spin_unlock(&this_parent->d_lock);
-			spin_release(&dentry->d_lock.dep_map, _RET_IP_);
-			this_parent = dentry;
-			spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
-			goto repeat;
-		}
-		spin_unlock(&dentry->d_lock);
-	}
-	/*
-	 * All done at this level ... ascend and resume the search.
-	 */
-	rcu_read_lock();
-ascend:
-	if (this_parent != parent) {
-		struct dentry *child = this_parent;
-		this_parent = child->d_parent;
-
-		spin_unlock(&child->d_lock);
-		spin_lock(&this_parent->d_lock);
-
-		/* go into the first sibling still alive */
-		do {
-			next = child->d_child.next;
-			if (next == &this_parent->d_subdirs)
-				goto ascend;
-			child = list_entry(next, struct dentry, d_child);
-		} while (unlikely(child->d_flags & DCACHE_DENTRY_KILLED));
-		rcu_read_unlock();
-		goto resume;
-	}
-	rcu_read_unlock();
-	spin_unlock(&this_parent->d_lock);
-	return;
-}
-
 static int tracefs_parse_options(char *data, struct tracefs_mount_opts *opts)
 {
 	substring_t args[MAX_OPT_ARGS];
@@ -326,10 +321,8 @@ static int tracefs_apply_options(struct super_block *sb, bool remount)
 	if (!remount || opts->opts & BIT(Opt_uid))
 		inode->i_uid = opts->uid;
 
-	if (!remount || opts->opts & BIT(Opt_gid)) {
-		/* Set all the group ids to the mount option */
-		set_gid(sb->s_root, opts->gid);
-	}
+	if (!remount || opts->opts & BIT(Opt_gid))
+		inode->i_gid = opts->gid;
 
 	return 0;
 }
@@ -612,6 +605,7 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
 		return tracefs_failed_creating(dentry);
 
 	inode->i_mode = mode;
+	inode->i_op = &tracefs_file_inode_operations;
 	inode->i_fop = fops ? fops : &tracefs_file_operations;
 	inode->i_private = data;
 	inode->i_uid = d_inode(dentry->d_parent)->i_uid;
@@ -671,7 +665,7 @@ struct dentry *tracefs_create_dir(const char *name, struct dentry *parent)
 	if (security_locked_down(LOCKDOWN_TRACEFS))
 		return NULL;
 
-	return __create_dir(name, parent, &simple_dir_inode_operations);
+	return __create_dir(name, parent, &tracefs_dir_inode_operations);
 }
 
 /**
@@ -702,7 +696,7 @@ __init struct dentry *tracefs_create_instance_dir(const char *name,
 	if (WARN_ON(tracefs_ops.mkdir || tracefs_ops.rmdir))
 		return NULL;
 
-	dentry = __create_dir(name, parent, &tracefs_dir_inode_operations);
+	dentry = __create_dir(name, parent, &tracefs_instance_dir_inode_operations);
 	if (!dentry)
 		return NULL;
 
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index ccee18ca66c7..20a021bd5acb 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -5,6 +5,8 @@
 enum {
 	TRACEFS_EVENT_INODE		= BIT(1),
 	TRACEFS_EVENT_TOP_INODE		= BIT(2),
+	TRACEFS_EVENT_GID_PERM_SET	= BIT(3),
+	TRACEFS_EVENT_UID_PERM_SET	= BIT(4),
 };
 
 struct tracefs_inode {
Linus Torvalds Jan. 3, 2024, 10:06 p.m. UTC | #10
On Wed, 3 Jan 2024 at 13:54, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Here's an updated patch that builds, and is PURELY AN EXAMPLE.

Oh, and while doing this patch, I found another bug in tracefs,
although it happily is one that doesn't have any way to trigger.

Tracefs has code like this:

        if (dentry->d_inode->i_mode & S_IFDIR) {

and that's very wrong. S_IFDIR is not a bitmask, it's a value that is
part of S_IFMT.

The reason this bug doesn't have any way to trigger is that I think
tracefs can only have S_IFMT values of S_IFDIR and S_IFREG, and those
happen to not have any bits in common, so doing it as a bit test is
wrong, but happens to work.

The test *should* be done as

        if (S_ISDIR(dentry->d_inode->i_mode)) {

(note "IS" vs "IF" - not the greatest user experience ever, but hey,
it harkens back to Ye Olden Times).

                Linus
Linus Torvalds Jan. 3, 2024, 10:14 p.m. UTC | #11
On Wed, 3 Jan 2024 at 14:04, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I actually have something almost working too. Here's the WIP. It only works
> for tracefs, and now eventfs needs to be updated as the "events" directory
> no longer has the right ownership. So I need a way to link the eventfs
> entries to use the tracefs default conditionally.

Ack. So the ->getattr() and ->permission() thing is a bit more
targeted to "look at modes", and is probably better just for that
reason.

Doing it in d_revalidate() is a bit hacky, and impacts path lookup a
bit even when not necessary. But it's still a lot less evil than
walking the dentry tree manually.

So that d_revalidate() patch was more of a "I think you can make it
smaller by just hooking in at this layer"). So d_revalidate ends up
with a smaller patch, I think, but it has the problem that now you
*have* to be able to deal with things in RCU context.

In contrast, doing it in ->getattr() and ->permission() ends up
meaning you can use sleeping locks etc if you need to serialize, for
example. So it's a bit more specific to the whole issue of "deal with
modes and owndership", but it is *also* a bit more flexible in how you
can then do it.

Anyway, your patch looks fine from a quick scan.

                  Linus
Al Viro Jan. 3, 2024, 10:14 p.m. UTC | #12
On Wed, Jan 03, 2024 at 01:54:36PM -0800, Linus Torvalds wrote:

> Again: UNTESTED, and meant as a "this is another way to avoid messing
> with the dentry tree manually, and just using the VFS interfaces we
> already have"

That would break chown(), though.  From conversation back in November:

17:50 #kernel: < viro> while we are at it, why not simply supply ->permission() and ->getattr() that would pick gid from superblock 
 and shove them into ->i_gid?
17:50 #kernel: < viro> and called the default variants
17:50 #kernel: < viro> no need to scan the tree, etc.
17:51 #kernel: < viro> how many place in VFS or VM give a fuck about GID of inode?
17:53 #kernel: < viro> stat() and permission checks
17:56 #kernel: < viro> but that boils down to "well, generic getattr and permission use that field and on-disk filesystems use it to keep track of what value to put on disk"
17:56 #kernel: < viro> you can trivially override the defaults for ->permission() and ->getattr()
17:57 #kernel: < viro> and have them set the right ->i_gid whenever called

17:58 #kernel: < viro> what do you want to happen for chown() + remount?
17:58 #kernel: < viro> any group changes from the former lost on the latter?
18:00 #kernel: < viro> if you want the current semantics, slap generation counter in superblock (bumped on remount)
18:00 #kernel: < viro> sample it into inode on ->setattr()
18:01 #kernel: < viro> and have ->permission() and ->getattr() compare inode and superblock gen counts, picking ->i_gid from superblock if it's more recent there
18:02 #kernel: < viro> if you want the result of chown() to stick, have it stuff ~0U into inode's gen counter instead of sampling the superblock's counter there

18:17 #kernel: < viro> OK... so we need to filter SB_I_VERSION out of flags on mount/remount, lest the timestamp updates start playing silly buggers with it
18:18 #kernel: < viro> and use inode_..._iversion_raw() for access
18:19 #kernel: < viro> or use ->i_generation, perhaps...
18:20 #kernel: < viro> 32bit, but if somebody does 4G mount -o remount, they are deliberately asking for trouble

21:37 #kernel: < viro> hmm...
21:37 #kernel: < viro> ->d_revalidate() as well, probably
21:39 #kernel: < viro> rostedt: my apologies, looks like I had been too optimistic

I have the beginnings of patch along those lines stuck in the local tree, but
the problem had been that ->d_revalidate() is not always called on the way to
some places where ->i_uid/->i_gid is accessed ;-/

I can resurrect the analysis, but that'll take a few hours.
Linus Torvalds Jan. 3, 2024, 10:17 p.m. UTC | #13
On Wed, 3 Jan 2024 at 14:14, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Jan 03, 2024 at 01:54:36PM -0800, Linus Torvalds wrote:
>
> > Again: UNTESTED, and meant as a "this is another way to avoid messing
> > with the dentry tree manually, and just using the VFS interfaces we
> > already have"
>
> That would break chown(), though.

Right,. That's why I had that note about

   So take this as a "this might work", but it probably needs a bit more
   work - eventfs_set_attr() should set some bit in the inode to say
   "these have been set manually", and then revalidate would say "I'll
   not touch inodes that have that bit set".

and how my example patch overrides things a bit *too* aggressively.

That said, I think the patch that Steven just sent out is the right
direction to go regardless. The d_revalidate() thing was literally
just a "we can do this many different ways".

                     Linus
diff mbox series

Patch

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index f0677ea0ec24..53d34a4b5a2b 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -52,9 +52,7 @@  enum {
 static struct dentry *eventfs_root_lookup(struct inode *dir,
 					  struct dentry *dentry,
 					  unsigned int flags);
-static int dcache_dir_open_wrapper(struct inode *inode, struct file *file);
-static int dcache_readdir_wrapper(struct file *file, struct dir_context *ctx);
-static int eventfs_release(struct inode *inode, struct file *file);
+static int eventfs_iterate(struct file *file, struct dir_context *ctx);
 
 static void update_attr(struct eventfs_attr *attr, struct iattr *iattr)
 {
@@ -148,11 +146,9 @@  static const struct inode_operations eventfs_file_inode_operations = {
 };
 
 static const struct file_operations eventfs_file_operations = {
-	.open		= dcache_dir_open_wrapper,
 	.read		= generic_read_dir,
-	.iterate_shared	= dcache_readdir_wrapper,
+	.iterate_shared	= eventfs_iterate,
 	.llseek		= generic_file_llseek,
-	.release	= eventfs_release,
 };
 
 /* Return the evenfs_inode of the "events" directory */
@@ -390,16 +386,14 @@  void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
  * @mode: The mode of the file.
  * @data: The data to use to set the inode of the file with on open()
  * @fops: The fops of the file to be created.
- * @lookup: If called by the lookup routine, in which case, dput() the created dentry.
  *
  * Create a dentry for a file of an eventfs_inode @ei and place it into the
- * address located at @e_dentry. If the @e_dentry already has a dentry, then
- * just do a dget() on it and return. Otherwise create the dentry and attach it.
+ * address located at @e_dentry.
  */
 static struct dentry *
 create_file_dentry(struct eventfs_inode *ei, int idx,
 		   struct dentry *parent, const char *name, umode_t mode, void *data,
-		   const struct file_operations *fops, bool lookup)
+		   const struct file_operations *fops)
 {
 	struct eventfs_attr *attr = NULL;
 	struct dentry **e_dentry = &ei->d_children[idx];
@@ -414,9 +408,7 @@  create_file_dentry(struct eventfs_inode *ei, int idx,
 	}
 	/* If the e_dentry already has a dentry, use it */
 	if (*e_dentry) {
-		/* lookup does not need to up the ref count */
-		if (!lookup)
-			dget(*e_dentry);
+		dget(*e_dentry);
 		mutex_unlock(&eventfs_mutex);
 		return *e_dentry;
 	}
@@ -441,13 +433,12 @@  create_file_dentry(struct eventfs_inode *ei, int idx,
 		 * way to being freed, don't return it. If e_dentry is NULL
 		 * it means it was already freed.
 		 */
-		if (ei->is_freed)
+		if (ei->is_freed) {
 			dentry = NULL;
-		else
+		} else {
 			dentry = *e_dentry;
-		/* The lookup does not need to up the dentry refcount */
-		if (dentry && !lookup)
 			dget(dentry);
+		}
 		mutex_unlock(&eventfs_mutex);
 		return dentry;
 	}
@@ -465,9 +456,6 @@  create_file_dentry(struct eventfs_inode *ei, int idx,
 	}
 	mutex_unlock(&eventfs_mutex);
 
-	if (lookup)
-		dput(dentry);
-
 	return dentry;
 }
 
@@ -500,13 +488,12 @@  static void eventfs_post_create_dir(struct eventfs_inode *ei)
  * @pei: The eventfs_inode parent of ei.
  * @ei: The eventfs_inode to create the directory for
  * @parent: The dentry of the parent of this directory
- * @lookup: True if this is called by the lookup code
  *
  * This creates and attaches a directory dentry to the eventfs_inode @ei.
  */
 static struct dentry *
 create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei,
-		  struct dentry *parent, bool lookup)
+		  struct dentry *parent)
 {
 	struct dentry *dentry = NULL;
 
@@ -520,9 +507,7 @@  create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei,
 	if (ei->dentry) {
 		/* If the dentry already has a dentry, use it */
 		dentry = ei->dentry;
-		/* lookup does not need to up the ref count */
-		if (!lookup)
-			dget(dentry);
+		dget(dentry);
 		mutex_unlock(&eventfs_mutex);
 		return dentry;
 	}
@@ -542,7 +527,7 @@  create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei,
 		 * way to being freed.
 		 */
 		dentry = ei->dentry;
-		if (dentry && !lookup)
+		if (dentry)
 			dget(dentry);
 		mutex_unlock(&eventfs_mutex);
 		return dentry;
@@ -562,9 +547,6 @@  create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei,
 	}
 	mutex_unlock(&eventfs_mutex);
 
-	if (lookup)
-		dput(dentry);
-
 	return dentry;
 }
 
@@ -589,8 +571,8 @@  static struct dentry *eventfs_root_lookup(struct inode *dir,
 	struct eventfs_inode *ei;
 	struct dentry *ei_dentry = NULL;
 	struct dentry *ret = NULL;
+	struct dentry *d;
 	const char *name = dentry->d_name.name;
-	bool created = false;
 	umode_t mode;
 	void *data;
 	int idx;
@@ -626,13 +608,10 @@  static struct dentry *eventfs_root_lookup(struct inode *dir,
 		ret = simple_lookup(dir, dentry, flags);
 		if (IS_ERR(ret))
 			goto out;
-		create_dir_dentry(ei, ei_child, ei_dentry, true);
-		created = true;
-		break;
-	}
-
-	if (created)
+		d = create_dir_dentry(ei, ei_child, ei_dentry);
+		dput(d);
 		goto out;
+	}
 
 	for (i = 0; i < ei->nr_entries; i++) {
 		entry = &ei->entries[i];
@@ -650,8 +629,8 @@  static struct dentry *eventfs_root_lookup(struct inode *dir,
 			ret = simple_lookup(dir, dentry, flags);
 			if (IS_ERR(ret))
 				goto out;
-			create_file_dentry(ei, i, ei_dentry, name, mode, cdata,
-					   fops, true);
+			d = create_file_dentry(ei, i, ei_dentry, name, mode, cdata, fops);
+			dput(d);
 			break;
 		}
 	}
@@ -660,127 +639,82 @@  static struct dentry *eventfs_root_lookup(struct inode *dir,
 	return ret;
 }
 
-struct dentry_list {
-	void			*cursor;
-	struct dentry		**dentries;
-};
-
-/**
- * eventfs_release - called to release eventfs file/dir
- * @inode: inode to be released
- * @file: file to be released (not used)
- */
-static int eventfs_release(struct inode *inode, struct file *file)
-{
-	struct tracefs_inode *ti;
-	struct dentry_list *dlist = file->private_data;
-	void *cursor;
-	int i;
-
-	ti = get_tracefs(inode);
-	if (!(ti->flags & TRACEFS_EVENT_INODE))
-		return -EINVAL;
-
-	if (WARN_ON_ONCE(!dlist))
-		return -EINVAL;
-
-	for (i = 0; dlist->dentries && dlist->dentries[i]; i++) {
-		dput(dlist->dentries[i]);
-	}
-
-	cursor = dlist->cursor;
-	kfree(dlist->dentries);
-	kfree(dlist);
-	file->private_data = cursor;
-	return dcache_dir_close(inode, file);
-}
-
-static int add_dentries(struct dentry ***dentries, struct dentry *d, int cnt)
-{
-	struct dentry **tmp;
-
-	tmp = krealloc(*dentries, sizeof(d) * (cnt + 2), GFP_NOFS);
-	if (!tmp)
-		return -1;
-	tmp[cnt] = d;
-	tmp[cnt + 1] = NULL;
-	*dentries = tmp;
-	return 0;
-}
-
-/**
- * dcache_dir_open_wrapper - eventfs open wrapper
- * @inode: not used
- * @file: dir to be opened (to create it's children)
- *
- * Used to dynamic create file/dir with-in @file, all the
- * file/dir will be created. If already created then references
- * will be increased
+/*
+ * Walk the children of a eventfs_inode to fill in getdents().
  */
-static int dcache_dir_open_wrapper(struct inode *inode, struct file *file)
+static int eventfs_iterate(struct file *file, struct dir_context *ctx)
 {
 	const struct file_operations *fops;
+	struct inode *f_inode = file_inode(file);
 	const struct eventfs_entry *entry;
 	struct eventfs_inode *ei_child;
 	struct tracefs_inode *ti;
 	struct eventfs_inode *ei;
-	struct dentry_list *dlist;
-	struct dentry **dentries = NULL;
-	struct dentry *parent = file_dentry(file);
-	struct dentry *d;
-	struct inode *f_inode = file_inode(file);
-	const char *name = parent->d_name.name;
+	struct dentry *ei_dentry = NULL;
+	struct dentry *dentry;
+	const char *name;
 	umode_t mode;
-	void *data;
-	int cnt = 0;
 	int idx;
-	int ret;
-	int i;
-	int r;
+	int ret = -EINVAL;
+	int ino;
+	int i, r, c;
+
+	if (!dir_emit_dots(file, ctx))
+		return 0;
 
 	ti = get_tracefs(f_inode);
 	if (!(ti->flags & TRACEFS_EVENT_INODE))
 		return -EINVAL;
 
-	if (WARN_ON_ONCE(file->private_data))
-		return -EINVAL;
+	c = ctx->pos - 2;
 
 	idx = srcu_read_lock(&eventfs_srcu);
 
 	mutex_lock(&eventfs_mutex);
 	ei = READ_ONCE(ti->private);
+	if (ei && !ei->is_freed)
+		ei_dentry = READ_ONCE(ei->dentry);
 	mutex_unlock(&eventfs_mutex);
 
-	if (!ei) {
-		srcu_read_unlock(&eventfs_srcu, idx);
-		return -EINVAL;
-	}
-
-
-	data = ei->data;
+	if (!ei || !ei_dentry)
+		goto out;
 
-	dlist = kmalloc(sizeof(*dlist), GFP_KERNEL);
-	if (!dlist) {
-		srcu_read_unlock(&eventfs_srcu, idx);
-		return -ENOMEM;
-	}
+	ret = 0;
 
-	inode_lock(parent->d_inode);
+	/*
+	 * Need to create the dentries and inodes to have a consistent
+	 * inode number.
+	 */
 	list_for_each_entry_srcu(ei_child, &ei->children, list,
 				 srcu_read_lock_held(&eventfs_srcu)) {
-		d = create_dir_dentry(ei, ei_child, parent, false);
-		if (d) {
-			ret = add_dentries(&dentries, d, cnt);
-			if (ret < 0)
-				break;
-			cnt++;
+
+		if (ei_child->is_freed)
+			continue;
+
+		name = ei_child->name;
+
+		dentry = create_dir_dentry(ei, ei_child, ei_dentry);
+		if (!dentry)
+			goto out;
+		ino = dentry->d_inode->i_ino;
+		dput(dentry);
+
+		if (c > 0) {
+			c--;
+			continue;
 		}
+
+		if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR))
+			goto out;
+		ctx->pos++;
 	}
 
 	for (i = 0; i < ei->nr_entries; i++) {
-		void *cdata = data;
+		void *cdata = ei->data;
+
 		entry = &ei->entries[i];
 		name = entry->name;
+
 		mutex_lock(&eventfs_mutex);
 		/* If ei->is_freed, then the event itself may be too */
 		if (!ei->is_freed)
@@ -790,41 +724,26 @@  static int dcache_dir_open_wrapper(struct inode *inode, struct file *file)
 		mutex_unlock(&eventfs_mutex);
 		if (r <= 0)
 			continue;
-		d = create_file_dentry(ei, i, parent, name, mode, cdata, fops, false);
-		if (d) {
-			ret = add_dentries(&dentries, d, cnt);
-			if (ret < 0)
-				break;
-			cnt++;
+
+		dentry = create_file_dentry(ei, i, ei_dentry, name, mode, cdata, fops);
+		if (!dentry)
+			goto out;
+		ino = dentry->d_inode->i_ino;
+		dput(dentry);
+
+		if (c > 0) {
+			c--;
+			continue;
 		}
+
+		if (!dir_emit(ctx, name, strlen(name), ino, DT_REG))
+			goto out;
+		ctx->pos++;
 	}
-	inode_unlock(parent->d_inode);
+	ret = 1;
+ out:
 	srcu_read_unlock(&eventfs_srcu, idx);
-	ret = dcache_dir_open(inode, file);
-
-	/*
-	 * dcache_dir_open() sets file->private_data to a dentry cursor.
-	 * Need to save that but also save all the dentries that were
-	 * opened by this function.
-	 */
-	dlist->cursor = file->private_data;
-	dlist->dentries = dentries;
-	file->private_data = dlist;
-	return ret;
-}
-
-/*
- * This just sets the file->private_data back to the cursor and back.
- */
-static int dcache_readdir_wrapper(struct file *file, struct dir_context *ctx)
-{
-	struct dentry_list *dlist = file->private_data;
-	int ret;
 
-	file->private_data = dlist->cursor;
-	ret = dcache_readdir(file, ctx);
-	dlist->cursor = file->private_data;
-	file->private_data = dlist;
 	return ret;
 }