diff mbox series

tracefs/eventfs: Use root and instance inodes as default ownership

Message ID 20240103203246.115732ec@gandalf.local.home (mailing list archive)
State Superseded
Headers show
Series tracefs/eventfs: Use root and instance inodes as default ownership | expand

Commit Message

Steven Rostedt Jan. 4, 2024, 1:32 a.m. UTC
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

Instead of walking the dentries on mount/remount to update the gid values of
all the dentries if a gid option is specified on mount, just update the root
inode. Add .getattr, .setattr, and .permissions on the tracefs inode
operations to update the permissions of the files and directories.

For all files and directories in the top level instance:

 /sys/kernel/tracing/*

It will use the root inode as the default permissions. The inode that
represents: /sys/kernel/tracing (or wherever it is mounted).

When an instance is created:

 mkdir /sys/kernel/tracing/instance/foo

The directory "foo" and all its files and directories underneath will use
the default of what foo is when it was created. A remount of tracefs will
not affect it.

If a user were to modify the permissions of any file or directory in
tracefs, it will also no longer be modified by a change in ownership of a
remount.

The events directory, if it is in the top level instance, will use the
tracefs root inode as the default ownership for itself and all the files and
directories below it.

For the events directory in an instance ("foo"), it will keep the ownership
of what it was when it was created, and that will be used as the default
ownership for the files and directories beneath it.

Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wjVdGkjDXBbvLn2wbZnqP4UsH46E3gqJ9m7UG6DpX2+WA@mail.gmail.com/

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 fs/tracefs/event_inode.c |  80 ++++++++++++++-
 fs/tracefs/inode.c       | 205 ++++++++++++++++++++++-----------------
 fs/tracefs/internal.h    |   3 +
 3 files changed, 198 insertions(+), 90 deletions(-)

Comments

Al Viro Jan. 4, 2024, 1:48 a.m. UTC | #1
On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:

> +	/* Get the tracefs root from the parent */
> +	inode = d_inode(dentry->d_parent);
> +	inode = d_inode(inode->i_sb->s_root);

That makes no sense.  First of all, for any positive dentry we have
dentry->d_sb == dentry->d_inode->i_sb.  And it's the same for all
dentries on given superblock.  So what's the point of that dance?
If you want the root inode, just go for d_inode(dentry->d_sb->s_root)
and be done with that...
Al Viro Jan. 4, 2024, 1:59 a.m. UTC | #2
On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:

> +static struct inode *instance_inode(struct dentry *parent, struct inode *inode)
> +{
> +	struct tracefs_inode *ti;
> +	struct inode *root_inode;
> +
> +	root_inode = d_inode(inode->i_sb->s_root);
> +
> +	/* If parent is NULL then use root inode */
> +	if (!parent)
> +		return root_inode;
> +
> +	/* Find the inode that is flagged as an instance or the root inode */
> +	do {
> +		inode = d_inode(parent);
> +		if (inode == root_inode)
> +			return root_inode;
> +
> +		ti = get_tracefs(inode);
> +
> +		if (ti->flags & TRACEFS_INSTANCE_INODE)
> +			return inode;
> +	} while ((parent = parent->d_parent));

*blink*

This is equivalent to
		...
		parent = parent->d_parent;
	} while (true);

->d_parent is *never* NULL.  And what the hell is that loop supposed to do,
anyway?  Find the nearest ancestor tagged with TRACEFS_INSTANCE_INODE?

If root is not marked that way, I would suggest
	if (!parent)
		parent = inode->i_sb->s_root;
	while (!IS_ROOT(parent)) {
		struct tracefs_inode *ti = get_tracefs(parent->d_inode);
		if (ti->flags & TRACEFS_INSTANCE_INODE)
			break;
		parent = parent->d_parent;
	}
	return parent->d_inode;
Steven Rostedt Jan. 4, 2024, 2:17 a.m. UTC | #3
On Thu, 4 Jan 2024 01:59:10 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:

> On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:
> 
> > +static struct inode *instance_inode(struct dentry *parent, struct inode *inode)
> > +{
> > +	struct tracefs_inode *ti;
> > +	struct inode *root_inode;
> > +
> > +	root_inode = d_inode(inode->i_sb->s_root);
> > +
> > +	/* If parent is NULL then use root inode */
> > +	if (!parent)
> > +		return root_inode;
> > +
> > +	/* Find the inode that is flagged as an instance or the root inode */
> > +	do {
> > +		inode = d_inode(parent);
> > +		if (inode == root_inode)
> > +			return root_inode;
> > +
> > +		ti = get_tracefs(inode);
> > +
> > +		if (ti->flags & TRACEFS_INSTANCE_INODE)
> > +			return inode;
> > +	} while ((parent = parent->d_parent));  
> 
> *blink*
> 
> This is equivalent to
> 		...
> 		parent = parent->d_parent;
> 	} while (true);

Yeah, that loop went through a few iterations as I first thought that root
was a tracefs_inode and the get_tracefs() would work on it. No, it was not,
and it caused a cash. But I didn't rewrite the loop well after fixing that.

I was also not sure if parent could be NULL, and wanted to protect against it.

> 
> ->d_parent is *never* NULL.  And what the hell is that loop supposed to do,  
> anyway?  Find the nearest ancestor tagged with TRACEFS_INSTANCE_INODE?
> 
> If root is not marked that way, I would suggest
> 	if (!parent)
> 		parent = inode->i_sb->s_root;
> 	while (!IS_ROOT(parent)) {
> 		struct tracefs_inode *ti = get_tracefs(parent->d_inode);
> 		if (ti->flags & TRACEFS_INSTANCE_INODE)
> 			break;
> 		parent = parent->d_parent;
> 	}
> 	return parent->d_inode;

Sure, I could rewrite it that way too.

Thanks,

-- Steve
Steven Rostedt Jan. 4, 2024, 2:25 a.m. UTC | #4
On Thu, 4 Jan 2024 01:48:37 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:

> On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:
> 
> > +	/* Get the tracefs root from the parent */
> > +	inode = d_inode(dentry->d_parent);
> > +	inode = d_inode(inode->i_sb->s_root);  
> 
> That makes no sense.  First of all, for any positive dentry we have
> dentry->d_sb == dentry->d_inode->i_sb.  And it's the same for all
> dentries on given superblock.  So what's the point of that dance?
> If you want the root inode, just go for d_inode(dentry->d_sb->s_root)
> and be done with that...

That was more of thinking that the dentry and dentry->d_parent are
different. As dentry is part of eventfs and dentry->d_parent is part of
tracefs. Currently they both have the same superblock so yeah, I could just
write it that way too and it would work. But in my head, I was thinking
that they behave differently and maybe one day eventfs would get its own
superblock which would not work.

To explain this better:

  /sys/kernel/tracing/ is the parent of /sys/kernel/tracing/events

But everything but "events" in /sys/kernel/tracing/* is part of tracefs.
Everything in /sys/kernel/tracing/events is part of eventfs.

That was my thought process. But as both tracefs and eventfs still use
tracefs_get_inode(), it would work as you state.

I'll update that, as I don't foresee that eventfs will become its own file
system.

Thanks,

-- Steve
Al Viro Jan. 4, 2024, 4:39 a.m. UTC | #5
On Wed, Jan 03, 2024 at 09:25:06PM -0500, Steven Rostedt wrote:
> On Thu, 4 Jan 2024 01:48:37 +0000
> Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:
> > 
> > > +	/* Get the tracefs root from the parent */
> > > +	inode = d_inode(dentry->d_parent);
> > > +	inode = d_inode(inode->i_sb->s_root);  
> > 
> > That makes no sense.  First of all, for any positive dentry we have
> > dentry->d_sb == dentry->d_inode->i_sb.  And it's the same for all
> > dentries on given superblock.  So what's the point of that dance?
> > If you want the root inode, just go for d_inode(dentry->d_sb->s_root)
> > and be done with that...
> 
> That was more of thinking that the dentry and dentry->d_parent are
> different. As dentry is part of eventfs and dentry->d_parent is part of
> tracefs.

???

>Currently they both have the same superblock so yeah, I could just
> write it that way too and it would work. But in my head, I was thinking
> that they behave differently and maybe one day eventfs would get its own
> superblock which would not work.

->d_parent *always* points to the same filesystem; if you get an (automounted?)
mountpoint there, ->d_parent simply won't work - it will point to dentry itself.

> To explain this better:
> 
>   /sys/kernel/tracing/ is the parent of /sys/kernel/tracing/events
> 
> But everything but "events" in /sys/kernel/tracing/* is part of tracefs.
> Everything in /sys/kernel/tracing/events is part of eventfs.
> 
> That was my thought process. But as both tracefs and eventfs still use
> tracefs_get_inode(), it would work as you state.
> 
> I'll update that, as I don't foresee that eventfs will become its own file
> system.

There is no way to get to underlying mountpoint by dentry - simply because
the same fs instance can be mounted in any number of places.

A crude overview of taxonomy:

file_system_type: what filesystem instances belong to.  Not quite the same
thing as fs driver (one driver can provide several of those).  Usually
it's 1-to-1, but that's not required (e.g. NFS vs NFSv4, or ext[234], or...).

super_block: individual filesystem instance.  Hosts dentry tree (connected or
several disconnected parts - think NFSv4 or the state while trying to get
a dentry by fhandle, etc.).

dentry: object in a filesystem's directory tree(s).  Always belongs to
specific filesystem instance - that relationship never changes.  Tree
structure (and names) _within_ _filesystem_ belong on that level.
->d_parent is part of that tree structure; never NULL, root of a (sub)tree
has it pointing to itself.  Might be negative, might refer to a filesystem object
(file, directory, symlink, etc.).

inode: filesystem object (file, directory, etc.).  Always belongs to
specific filesystem instance.  Non-directory inodes might have any
number of dentry instances (aliases) refering to it; a directory one - no 
more than one.  Filesystem object contents belongs here; multiple hardlinks
have different dentries and the same inode.  Of course, filesystem type in
question might have no such thing as multiple hardlinks - that's up to
filesystem.  In general there is no way to find (or enumerate) such links;
e.g. a local filesystem might have an extra hardlink somewhere we had
never looked at and there won't be any dentries for such hardlinks and
no way to get them short of doing the full search of the entire tree.
The situation with e.g. NFS client is even worse, obviously.

mount: in a sense, mount to super_block is what dentry is to inode.  It
provides a view of (sub)tree hosted in given filesystem instance.  The
same filesystem may have any number of mounts, refering to its subtrees
(possibly the same subtree for each, possibly all different - up to
the callers of mount(2)).  They form mount tree(s) - that's where the
notions related to "this mounted on top of that" belong.  Note that
they can be moved around - with no telling the filesystem about that
happening.  Again, there's no such thing as "the mountpoint of given
filesystem instance" - it might be mounted in any number of places
at the same time.  Specific mount - sure, no problem, except that it
can move around.

namespace: mount tree.  Unlike everything prior, this one is a part of
process state - same as descriptor table, mappings, etc.

file: opened IO channel.  It does refer to specific mount and specific
dentry (and thus filesystem instance and an inode on it).  Current
IO position lives here, so does any per-open(2) state.

descriptor table: mapping from numbers to IO channels (opened files).
Again, a part of process state.  dup() creates a new entry, with
reference to the same file as the old one; multiple open() of the
same pathname will each yield a separate opened file.  _Some_ state
belongs here (close-on-exec, mostly).  Note that there's no such
thing as "the descriptor of this file" - not even "the user-supplied
number that had been used to get the file we are currently reading
from", since that number might be refering to something entirely
different right after we'd resolved it to opened file and that
happens *without* disrupting the operation.
Steven Rostedt Jan. 4, 2024, 3:05 p.m. UTC | #6
On Thu, 4 Jan 2024 04:39:45 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:

> On Wed, Jan 03, 2024 at 09:25:06PM -0500, Steven Rostedt wrote:
> > On Thu, 4 Jan 2024 01:48:37 +0000
> > Al Viro <viro@zeniv.linux.org.uk> wrote:
> >   
> > > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:
> > >   
> > > > +	/* Get the tracefs root from the parent */
> > > > +	inode = d_inode(dentry->d_parent);
> > > > +	inode = d_inode(inode->i_sb->s_root);    
> > > 
> > > That makes no sense.  First of all, for any positive dentry we have
> > > dentry->d_sb == dentry->d_inode->i_sb.  And it's the same for all
> > > dentries on given superblock.  So what's the point of that dance?
> > > If you want the root inode, just go for d_inode(dentry->d_sb->s_root)
> > > and be done with that...  
> > 
> > That was more of thinking that the dentry and dentry->d_parent are
> > different. As dentry is part of eventfs and dentry->d_parent is part of
> > tracefs.  
> 
> ???
> 
> >Currently they both have the same superblock so yeah, I could just
> > write it that way too and it would work. But in my head, I was thinking
> > that they behave differently and maybe one day eventfs would get its own
> > superblock which would not work.  
> 
> ->d_parent *always* points to the same filesystem; if you get an (automounted?)  
> mountpoint there, ->d_parent simply won't work - it will point to dentry itself.
> 

This is the "tribal knowledge" I'm talking about. I really didn't know how
the root dentry parent worked. I guess that makes sense, as it matches the
'..' of a directory, and the '/' directory '..' points to itself. Although
mounted file systems do not behave that way. My /proc/.. is '/'. I just
figured that the dentry->d_parent would be similar. Learn something everyday.

> > To explain this better:
> > 
> >   /sys/kernel/tracing/ is the parent of /sys/kernel/tracing/events
> > 
> > But everything but "events" in /sys/kernel/tracing/* is part of tracefs.
> > Everything in /sys/kernel/tracing/events is part of eventfs.
> > 
> > That was my thought process. But as both tracefs and eventfs still use
> > tracefs_get_inode(), it would work as you state.
> > 
> > I'll update that, as I don't foresee that eventfs will become its own file
> > system.  
> 
> There is no way to get to underlying mountpoint by dentry - simply because
> the same fs instance can be mounted in any number of places.

OK, so the dentry is still separate from the path and tied closer to the
inode.

> 
> A crude overview of taxonomy:
> 
> file_system_type: what filesystem instances belong to.  Not quite the same
> thing as fs driver (one driver can provide several of those).  Usually
> it's 1-to-1, but that's not required (e.g. NFS vs NFSv4, or ext[234], or...).

I don't know the difference between NFS and NFSv4 as I just used whatever
was the latest. But I understand the ext[234] part.

> 
> super_block: individual filesystem instance.  Hosts dentry tree (connected or
> several disconnected parts - think NFSv4 or the state while trying to get
> a dentry by fhandle, etc.).

I don't know how NFSv4 works, I'm only a user of it, I never actually
looked at the code. So that's not the best example, at least for me.

> 
> dentry: object in a filesystem's directory tree(s).  Always belongs to
> specific filesystem instance - that relationship never changes.  Tree
> structure (and names) _within_ _filesystem_ belong on that level.
> ->d_parent is part of that tree structure; never NULL, root of a (sub)tree  
> has it pointing to itself.  Might be negative, might refer to a filesystem object
> (file, directory, symlink, etc.).

This is useful.

> 
> inode: filesystem object (file, directory, etc.).  Always belongs to
> specific filesystem instance.  Non-directory inodes might have any
> number of dentry instances (aliases) refering to it; a directory one - no 
> more than one.

This above is very useful knowledge that I did not know. That directory
inodes can only have a single dentry.

>  Filesystem object contents belongs here; multiple hardlinks
> have different dentries and the same inode.

So, can I assume that an inode could only have as many dentries as hard
links? I know directories are only allowed to have a single hard link. Is
that why they can only have a single dentry?

>  Of course, filesystem type in
> question might have no such thing as multiple hardlinks - that's up to
> filesystem.  In general there is no way to find (or enumerate) such links;
> e.g. a local filesystem might have an extra hardlink somewhere we had
> never looked at and there won't be any dentries for such hardlinks and
> no way to get them short of doing the full search of the entire tree.
> The situation with e.g. NFS client is even worse, obviously.
> 
> mount: in a sense, mount to super_block is what dentry is to inode.  It
> provides a view of (sub)tree hosted in given filesystem instance.  The
> same filesystem may have any number of mounts, refering to its subtrees
> (possibly the same subtree for each, possibly all different - up to
> the callers of mount(2)).  They form mount tree(s) - that's where the
> notions related to "this mounted on top of that" belong.  Note that
> they can be moved around - with no telling the filesystem about that
> happening.  Again, there's no such thing as "the mountpoint of given
> filesystem instance" - it might be mounted in any number of places
> at the same time.  Specific mount - sure, no problem, except that it
> can move around.
> 
> namespace: mount tree.  Unlike everything prior, this one is a part of
> process state - same as descriptor table, mappings, etc.

And I'm guessing namespace is for containers. At least that's what I've
been assuming they are for.

> 
> file: opened IO channel.  It does refer to specific mount and specific
> dentry (and thus filesystem instance and an inode on it).  Current
> IO position lives here, so does any per-open(2) state.

And IIUC, this is what maps to a processes fd table. That is, the process's
file descriptor number it passes to the kernel will be mapped to this
"file".

> 
> descriptor table: mapping from numbers to IO channels (opened files).

This is that "process fd table" I mentioned above (I wrote that before
reading this).

> Again, a part of process state.  dup() creates a new entry, with
> reference to the same file as the old one; multiple open() of the

Hmm, wouldn't "dup()" create another "file" that just points to the same
dentry? It wouldn't be the "same file", or did you mean "file" from the
user space point of view?

> same pathname will each yield a separate opened file.  _Some_ state
> belongs here (close-on-exec, mostly).  Note that there's no such
> thing as "the descriptor of this file" - not even "the user-supplied
> number that had been used to get the file we are currently reading
> from", since that number might be refering to something entirely
> different right after we'd resolved it to opened file and that
> happens *without* disrupting the operation.

This last paragraph confused me. What do you mean by ""referring to
something entirely different"?

Thanks for this overview. It was very useful, and something I think we
should add to kernel doc. I did read Documentation/filesystems/vfs.rst but
honestly, I think your writeup here is a better overview.

-- Steve
Al Viro Jan. 4, 2024, 6:25 p.m. UTC | #7
On Thu, Jan 04, 2024 at 10:05:44AM -0500, Steven Rostedt wrote:

> This is the "tribal knowledge" I'm talking about. I really didn't know how
> the root dentry parent worked. I guess that makes sense, as it matches the
> '..' of a directory, and the '/' directory '..' points to itself. Although
> mounted file systems do not behave that way. My /proc/.. is '/'. I just
> figured that the dentry->d_parent would be similar. Learn something everyday.

What would you expect to happen if you have the same filesystem mounted in
several places?  Having separate dentry trees would be a nightmare - you'd
get cache coherency problems from hell.  It's survivable for procfs, but
for something like a normal local filesystem it'd become very painful.
And if we want them to share dentry tree, how do you choose where the ..
would lead from the root dentry?

The way it's done is that linkage between the trees is done separately -
there's a tree of struct mount (well, forest, really - different processes
can easily have separate trees, which is how namespaces are done) and
each node in the mount tree refers to a dentry (sub)tree in some filesystem
instance.  Location is represented by (mount, dentry) pair and handling of
.. is basically (modulo refcounting, locking, error handling, etc.)
	while dentry == subtree_root(mount) && mount != mountpoint_mount(mount)
		// cross into the mountpoint under it
		dentry = mountpoint_dentry(mount)
		mount = mountpoint_mount(mount)
	go_into(mount, dentry->d_parent)

Note that you can have e.g. /usr/lib/gcc/x86_64-linux-gnu/12 mounted on /mnt/blah:
; mount --bind /usr/lib/gcc/x86_64-linux-gnu/12 /mnt/blah
will do it.  Then e.g. /mnt/blah/include will resolve to the same dentry as
/usr/lib/gcc/x86_64-linux-gnu/12/include, etc.
; chdir /mnt/blah
; ls
32                 crtprec80.o        libgomp.so         libsanitizer.spec
cc1                g++-mapper-server  libgomp.spec       libssp_nonshared.a
cc1plus            include            libitm.a           libstdc++.a
collect2           libasan.a          libitm.so          libstdc++fs.a
crtbegin.o         libasan_preinit.o  libitm.spec        libstdc++.so
crtbeginS.o        libasan.so         liblsan.a          libsupc++.a
crtbeginT.o        libatomic.a        liblsan_preinit.o  libtsan.a
crtend.o           libatomic.so       liblsan.so         libtsan_preinit.o
crtendS.o          libbacktrace.a     liblto_plugin.so   libtsan.so
crtfastmath.o      libcc1.so          libobjc.a          libubsan.a
crtoffloadbegin.o  libgcc.a           libobjc_gc.a       libubsan.so
crtoffloadend.o    libgcc_eh.a        libobjc_gc.so      lto1
crtoffloadtable.o  libgcc_s.so        libobjc.so         lto-wrapper
crtprec32.o        libgcov.a          libquadmath.a      plugin
crtprec64.o        libgomp.a          libquadmath.so     x32

We obviously want .. to resolve to /mnt, though.
; ls ..
; ls /usr/lib/gcc/x86_64-linux-gnu/
12

So the trigger for "cross into underlying mountpoint" has to be "dentry is
the root of subtree mount refers to" - it depends upon the mount we are
in.

> >  Filesystem object contents belongs here; multiple hardlinks
> > have different dentries and the same inode.
> 
> So, can I assume that an inode could only have as many dentries as hard
> links? I know directories are only allowed to have a single hard link. Is
> that why they can only have a single dentry?

Not quite.  Single alias for directories is more about cache coherency
fun; we really can't afford multiple aliases for those.  For non-directories
it's possible to have an entirely disconnected dentry refering to that
sucker; if somebody hands you an fhandle with no indication of the parent
directory, you might end up having to do one of those, no matter how many
times you find the same inode later.  Not an issue for tracefs, though.

> > namespace: mount tree.  Unlike everything prior, this one is a part of
> > process state - same as descriptor table, mappings, etc.
> 
> And I'm guessing namespace is for containers. At least that's what I've
> been assuming they are for.

It predates containers by quite a few years, but yes, that's one of the
users.  It is related to virtual machines, in the same sense the set
of memory mappings is - each thread can be thought of as a VM, with
a bunch of components.  Just as mmap() manipulates the virtual address
translation for the threads that share memory space with the caller,
mount() manipulates the pathname resolution for the threads that share
the namespace with the caller.

> > descriptor table: mapping from numbers to IO channels (opened files).
> 
> This is that "process fd table" I mentioned above (I wrote that before
> reading this).
> 
> > Again, a part of process state.  dup() creates a new entry, with
> > reference to the same file as the old one; multiple open() of the
> 
> Hmm, wouldn't "dup()" create another "file" that just points to the same
> dentry? It wouldn't be the "same file", or did you mean "file" from the
> user space point of view?

No.  The difference between open() and dup() is that the latter will
result in a descriptor that really refers to the same file.  Current
IO position belongs to IO channel; it doesn't matter for e.g. terminals,
but for regular file it immediately becomes an issue.
	fd1 = open("foo", 0);
	fd2 = open("foo", 0);
	read(fd1, &c1, 1);
	read(fd2, &c2, 1);
will result in the first byte of foo read into c1 and c2, but
	fd1 = open("foo", 0);
	fd2 = dup(fd1);
	read(fd1, &c1, 1);
	read(fd2, &c2, 1);
will have the first byte of foo in c1 and the second one - in c2.
open() yields a new IO channel attached to new descriptor; dup()
(and dup2()) attaches the existing IO channel to new descriptor.
fork() acts like dup() in that respect - child gets its descriptor
table populated with references to the same IO channels as the
parent does.

Any Unix since about '71 has it done that way and the same goes
for NT, DOS, etc. - you can't implement redirects to/from regular
files without that distinction.

Unfortunately, the terms are clumsy as hell - POSIX ends up with
"file descriptor" (for numbers) vs. "file description" (for IO
channels), which is hard to distinguish when reading and just
as hard to distinguish when listening.  "Opened file" (as IO
channel) vs. "file on disc" (as collection of data that might
be accessed via said channels) distinction on top of that also
doesn't help, to put it mildly.  It's many decades too late to
do anything about, unfortunately.  Pity the UNIX 101 students... ;-/

The bottom line:
	* struct file represents an IO channel; it might be operating
on various objects, including regular files, pipes, sockets, etc.
	* current IO position is a property of IO channel.
	* struct files_struct represents a descriptor table; each of
those maps numbers to IO channels.
	* each thread uses a descriptor table to turn numbers ("file
descriptors") into struct file references.  Different threads might
share the same descriptor table or have separate descriptor tables.
current->files points to the descriptor table of the current thread.
	* open() creates a new IO channel and attaches it to an
unused position in descriptor table.
	* dup(n) takes the IO channel from position 'n' in descriptor
table and attaches it to an unused position.
	* dup2(old, new) takes the IO channel from position 'old' and
attaches it to position 'new'; if there used to be something in position
'new', it gets detached.
	* close(n) takes the IO channel from position 'n', flushes and
detaches it.  Note that it IO channel itself is *NOT* closed until
all references to it are gone.  E.g. open() + fork() + (in parent) close()
will end up with the child's descriptor table keeping a reference to
IO channel established by open(); close() in parent will not shut the
channel down.  The same goes for implicit close() done by dup2() or
by exit(), etc.
	* things like mmap() retain struct file references;
open() + mmap() + close() ends up with struct file left (in vma->vm_file)
alive and well for as long as the mapping exists, nevermind the reference
that used to be in descriptor table.  In other words, IO channels can
exist with no references in any descriptor tables.  There are other
ways for such situation to occur (e.g. SCM_RIGHTS stuff); it's entirely
normal.

> > same pathname will each yield a separate opened file.  _Some_ state
> > belongs here (close-on-exec, mostly).  Note that there's no such
> > thing as "the descriptor of this file" - not even "the user-supplied
> > number that had been used to get the file we are currently reading
> > from", since that number might be refering to something entirely
> > different right after we'd resolved it to opened file and that
> > happens *without* disrupting the operation.
> 
> This last paragraph confused me. What do you mean by ""referring to
> something entirely different"?

	Two threads share descriptor table; one of them is in
read(fd, ...), another does dup2(fd2, fd).  If read() gets past the
point where it gets struct file reference, it will keep accessing that
IO channel.  dup2() will replace the reference in descriptor table,
but that won't disrupt the read()...

> 
> Thanks for this overview. It was very useful, and something I think we
> should add to kernel doc. I did read Documentation/filesystems/vfs.rst but
> honestly, I think your writeup here is a better overview.

At the very least it would need serious reordering ;-/
Matthew Wilcox Jan. 4, 2024, 7:03 p.m. UTC | #8
On Thu, Jan 04, 2024 at 10:05:44AM -0500, Steven Rostedt wrote:
> > file_system_type: what filesystem instances belong to.  Not quite the same
> > thing as fs driver (one driver can provide several of those).  Usually
> > it's 1-to-1, but that's not required (e.g. NFS vs NFSv4, or ext[234], or...).
> 
> I don't know the difference between NFS and NFSv4 as I just used whatever
> was the latest. But I understand the ext[234] part.

What Al's sying is that nfs.ko provides both nfs_fs_type and
nfs4_fs_type.  ext4.ko provides ext2_fs_type, ext3_fs_type and
ext4_fs_type.  This is allowed but anomalous.  Most filesystems provide
only one, eg ocfs2_fs_type.

> > 
> > super_block: individual filesystem instance.  Hosts dentry tree (connected or
> > several disconnected parts - think NFSv4 or the state while trying to get
> > a dentry by fhandle, etc.).
> 
> I don't know how NFSv4 works, I'm only a user of it, I never actually
> looked at the code. So that's not the best example, at least for me.

Right, so NFS (v4 or otherwise) is Special.  In the protocol, files
are identified by a thing called an fhandle.  This is (iirc) a 32-byte
identifier which must persist across server reboot.  Originally it was
probably supposed to encode dev_t plus ino_t plus generation number.
But you can do all kinds of things in the NFS protocol with an fhandle
that you need a dentry for in Linux (like path walks).  Unfortunately,
clients can't be told "Hey, we've lost context, please rewalk" (which
would have other problems anyway), so we need a way to find the dentry
for an fhandle.  I understand this very badly, but essentially we end
up looking for canonical ones, and then creating isolated trees of
dentries if we can't find them.  Sometimes we then graft these isolated
trees into the canonical spots if we end up connecting them through
various filesystem activity.

At least that's my understanding which probably contains several
misunderstandings.

> >  Filesystem object contents belongs here; multiple hardlinks
> > have different dentries and the same inode.
> 
> So, can I assume that an inode could only have as many dentries as hard
> links? I know directories are only allowed to have a single hard link. Is
> that why they can only have a single dentry?

There could be more.  For example, I could open("A"); ln("A", "B");
open("B"); rm("A"); ln("B", "C"); open("C"); rm("B").

Now there are three dentries for this inode, its link count is currently
one and never exceeded two.

> Thanks for this overview. It was very useful, and something I think we
> should add to kernel doc. I did read Documentation/filesystems/vfs.rst but
> honestly, I think your writeup here is a better overview.

Documentation/filesystems/locking.rst is often a better source, although
the two should really be merged.  Not for the faint-hearted.
Steven Rostedt Jan. 4, 2024, 7:10 p.m. UTC | #9
On Thu, 4 Jan 2024 18:25:02 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:

> On Thu, Jan 04, 2024 at 10:05:44AM -0500, Steven Rostedt wrote:
> 
> > This is the "tribal knowledge" I'm talking about. I really didn't know how
> > the root dentry parent worked. I guess that makes sense, as it matches the
> > '..' of a directory, and the '/' directory '..' points to itself. Although
> > mounted file systems do not behave that way. My /proc/.. is '/'. I just
> > figured that the dentry->d_parent would be similar. Learn something everyday.  
> 
> What would you expect to happen if you have the same filesystem mounted in
> several places?  Having separate dentry trees would be a nightmare - you'd
> get cache coherency problems from hell.  It's survivable for procfs, but
> for something like a normal local filesystem it'd become very painful.
> And if we want them to share dentry tree, how do you choose where the ..
> would lead from the root dentry?

My mistake was thinking that the dentry was attached more to the path than
the inode. But that doesn't seem to be the case. I wasn't sure if there was
a way to get to a dentry from the inode. I see the i_dentry list, which is
a list, where I got some of my idea that dentry was closer to path than inode.

> 
> The way it's done is that linkage between the trees is done separately -
> there's a tree of struct mount (well, forest, really - different processes
> can easily have separate trees, which is how namespaces are done) and
> each node in the mount tree refers to a dentry (sub)tree in some filesystem
> instance.  Location is represented by (mount, dentry) pair and handling of
> .. is basically (modulo refcounting, locking, error handling, etc.)
> 	while dentry == subtree_root(mount) && mount != mountpoint_mount(mount)
> 		// cross into the mountpoint under it
> 		dentry = mountpoint_dentry(mount)
> 		mount = mountpoint_mount(mount)
> 	go_into(mount, dentry->d_parent)
> 
> Note that you can have e.g. /usr/lib/gcc/x86_64-linux-gnu/12 mounted on /mnt/blah:
> ; mount --bind /usr/lib/gcc/x86_64-linux-gnu/12 /mnt/blah
> will do it.  Then e.g. /mnt/blah/include will resolve to the same dentry as
> /usr/lib/gcc/x86_64-linux-gnu/12/include, etc.
> ; chdir /mnt/blah
> ; ls
> 32                 crtprec80.o        libgomp.so         libsanitizer.spec
> cc1                g++-mapper-server  libgomp.spec       libssp_nonshared.a
> cc1plus            include            libitm.a           libstdc++.a
> collect2           libasan.a          libitm.so          libstdc++fs.a
> crtbegin.o         libasan_preinit.o  libitm.spec        libstdc++.so
> crtbeginS.o        libasan.so         liblsan.a          libsupc++.a
> crtbeginT.o        libatomic.a        liblsan_preinit.o  libtsan.a
> crtend.o           libatomic.so       liblsan.so         libtsan_preinit.o
> crtendS.o          libbacktrace.a     liblto_plugin.so   libtsan.so
> crtfastmath.o      libcc1.so          libobjc.a          libubsan.a
> crtoffloadbegin.o  libgcc.a           libobjc_gc.a       libubsan.so
> crtoffloadend.o    libgcc_eh.a        libobjc_gc.so      lto1
> crtoffloadtable.o  libgcc_s.so        libobjc.so         lto-wrapper
> crtprec32.o        libgcov.a          libquadmath.a      plugin
> crtprec64.o        libgomp.a          libquadmath.so     x32
> 
> We obviously want .. to resolve to /mnt, though.
> ; ls ..
> ; ls /usr/lib/gcc/x86_64-linux-gnu/
> 12
> 
> So the trigger for "cross into underlying mountpoint" has to be "dentry is
> the root of subtree mount refers to" - it depends upon the mount we are
> in.
> 
> > >  Filesystem object contents belongs here; multiple hardlinks
> > > have different dentries and the same inode.  
> > 
> > So, can I assume that an inode could only have as many dentries as hard
> > links? I know directories are only allowed to have a single hard link. Is
> > that why they can only have a single dentry?  
> 
> Not quite.  Single alias for directories is more about cache coherency
> fun; we really can't afford multiple aliases for those.  For non-directories
> it's possible to have an entirely disconnected dentry refering to that
> sucker; if somebody hands you an fhandle with no indication of the parent
> directory, you might end up having to do one of those, no matter how many
> times you find the same inode later.  Not an issue for tracefs, though.
> 
> > > namespace: mount tree.  Unlike everything prior, this one is a part of
> > > process state - same as descriptor table, mappings, etc.  
> > 
> > And I'm guessing namespace is for containers. At least that's what I've
> > been assuming they are for.  
> 
> It predates containers by quite a few years, but yes, that's one of the
> users.  It is related to virtual machines, in the same sense the set
> of memory mappings is - each thread can be thought of as a VM, with
> a bunch of components.  Just as mmap() manipulates the virtual address
> translation for the threads that share memory space with the caller,
> mount() manipulates the pathname resolution for the threads that share
> the namespace with the caller.
> 
> > > descriptor table: mapping from numbers to IO channels (opened files).  
> > 
> > This is that "process fd table" I mentioned above (I wrote that before
> > reading this).
> >   
> > > Again, a part of process state.  dup() creates a new entry, with
> > > reference to the same file as the old one; multiple open() of the  
> > 
> > Hmm, wouldn't "dup()" create another "file" that just points to the same
> > dentry? It wouldn't be the "same file", or did you mean "file" from the
> > user space point of view?  
> 
> No.  The difference between open() and dup() is that the latter will
> result in a descriptor that really refers to the same file.  Current
> IO position belongs to IO channel; it doesn't matter for e.g. terminals,
> but for regular file it immediately becomes an issue.
> 	fd1 = open("foo", 0);
> 	fd2 = open("foo", 0);
> 	read(fd1, &c1, 1);
> 	read(fd2, &c2, 1);
> will result in the first byte of foo read into c1 and c2, but
> 	fd1 = open("foo", 0);
> 	fd2 = dup(fd1);
> 	read(fd1, &c1, 1);
> 	read(fd2, &c2, 1);
> will have the first byte of foo in c1 and the second one - in c2.
> open() yields a new IO channel attached to new descriptor; dup()
> (and dup2()) attaches the existing IO channel to new descriptor.
> fork() acts like dup() in that respect - child gets its descriptor
> table populated with references to the same IO channels as the
> parent does.

Ah, looking at the code I use dup() in, it's mostly for pipes in
and for redirecting stdout,stdin, etc. So yeah, that makes sense.

> 
> Any Unix since about '71 has it done that way and the same goes
> for NT, DOS, etc. - you can't implement redirects to/from regular
> files without that distinction.

Yep, which is what I used it for. Just forgot the details.

> 
> > > same pathname will each yield a separate opened file.  _Some_ state
> > > belongs here (close-on-exec, mostly).  Note that there's no such
> > > thing as "the descriptor of this file" - not even "the user-supplied
> > > number that had been used to get the file we are currently reading
> > > from", since that number might be refering to something entirely
> > > different right after we'd resolved it to opened file and that
> > > happens *without* disrupting the operation.  
> > 
> > This last paragraph confused me. What do you mean by ""referring to
> > something entirely different"?  
> 
> 	Two threads share descriptor table; one of them is in
> read(fd, ...), another does dup2(fd2, fd).  If read() gets past the
> point where it gets struct file reference, it will keep accessing that
> IO channel.  dup2() will replace the reference in descriptor table,
> but that won't disrupt the read()...

Oh, OK. So basically if fd 4 is a reference to /tmp/foo and you open
/tmp/bar which gets fd2, and one thread is reading fd 4 (/tmp/foo), the
other thread doing dup2(fd2, fd) will make fd 4 a reference to /tmp/bar but
the read will finish reading /tmp/foo.

But if the first thread were to do another read(fd, ...) it would then read
/tmp/bar. In other words, it allows read() to stay atomic with respect to
what it is reading until it returns.

> 
> > 
> > Thanks for this overview. It was very useful, and something I think we
> > should add to kernel doc. I did read Documentation/filesystems/vfs.rst but
> > honestly, I think your writeup here is a better overview.  
> 
> At the very least it would need serious reordering ;-/

Yeah, but this is all great information. Thanks for explaining it.

-- Steve
Steven Rostedt Jan. 4, 2024, 7:15 p.m. UTC | #10
On Thu, 4 Jan 2024 18:25:02 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:

> Unfortunately, the terms are clumsy as hell - POSIX ends up with
> "file descriptor" (for numbers) vs. "file description" (for IO
> channels), which is hard to distinguish when reading and just
> as hard to distinguish when listening.  "Opened file" (as IO
> channel) vs. "file on disc" (as collection of data that might
> be accessed via said channels) distinction on top of that also
> doesn't help, to put it mildly.  It's many decades too late to
> do anything about, unfortunately.  Pity the UNIX 101 students... ;-/

Just so I understand this correctly.

"file descriptor" - is just what maps to a specific inode.

"file description" - is how the file is accessed (position in the file and
			flags associated to how it was opened)

Did I get that correct?

-- Steve
Linus Torvalds Jan. 4, 2024, 7:21 p.m. UTC | #11
On Thu, 4 Jan 2024 at 11:09, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> My mistake was thinking that the dentry was attached more to the path than
> the inode. But that doesn't seem to be the case. I wasn't sure if there was
> a way to get to a dentry from the inode.

Yeah, so dentry->inode and path->dentry are one-way translations,
because the other way can have multiple different cases.

IOW, a path will specify *one* dentry, and a dentry will specily *one*
inode, but one inode can be associated with multiple dentries, and
there may be other undiscovered dentries that *would* point to it but
aren't even cached right now.

And a single dentry can be part of multiple paths, thanks to bind mounts.

The "inode->i_dentry" list is *not* a way to look up all dentries,
because - as mentioned - there may be potential other paths (and thus
other dentries) that lead to the same inode that just haven't been
looked up yet (or that have already been aged out of the cache).

Of course any *particular* filesystem may not have hard links (so one
inode has only one possible dentry), and you may not have bind mounts,
and it might be one of the virtual filesystems where everything is
always in memory, so none of the above problems are guaranteed to be
the case in any *particular* situation.

But it's all part of why the dcache is actually really subtle. It's
not just the RCU lookup rules and the specialized locking (both
reflock and the rather complicated rules about d_lock ordering), it's
also that whole "yeah, the filesystem only sees a 'dentry', but
because of bind mounts the vfs layer actually does things internally
in terms of 'struct path' in order to be able to then show that single
fiolesystem in multiple places".

Etc etc.

There's a reason Al Viro ends up owning the dcache. Nobody else can
wrap their tiny little minds around it all.

               Linus
Matthew Wilcox Jan. 4, 2024, 7:26 p.m. UTC | #12
On Thu, Jan 04, 2024 at 02:15:17PM -0500, Steven Rostedt wrote:
> On Thu, 4 Jan 2024 18:25:02 +0000
> Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> > Unfortunately, the terms are clumsy as hell - POSIX ends up with
> > "file descriptor" (for numbers) vs. "file description" (for IO
> > channels), which is hard to distinguish when reading and just
> > as hard to distinguish when listening.  "Opened file" (as IO
> > channel) vs. "file on disc" (as collection of data that might
> > be accessed via said channels) distinction on top of that also
> > doesn't help, to put it mildly.  It's many decades too late to
> > do anything about, unfortunately.  Pity the UNIX 101 students... ;-/
> 
> Just so I understand this correctly.
> 
> "file descriptor" - is just what maps to a specific inode.

No -- file descriptor is a number in fdtable that maps to a struct file.

> "file description" - is how the file is accessed (position in the file and
> 			flags associated to how it was opened)

file description is posix's awful name for struct file.
Linus Torvalds Jan. 4, 2024, 7:35 p.m. UTC | #13
On Thu, 4 Jan 2024 at 11:14, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> "file descriptor" - is just what maps to a specific inode.

Nope. Technically and traditionally, file descriptor is just the
integer index that is used to look up a 'struct file *'.

Except in the kernel, we really just tend to use that term (well, I
do) for the 'struct file *' itself, since the integer 'fd' is usually
not really relevant except at the system call interface.

Which is *NOT* the inode, because the 'struct file' has other things
in it (the file position, the permissions that were used at open time
etc, close-on-exec state etc etc).

> "file description" - is how the file is accessed (position in the file and
>                         flags associated to how it was opened)

That's a horrible term that shouldn't be used at all. Apparently some
people use it for what is our 'struct file *", also known as a "file
table entry".  Avoid it.

If anything, just use "fd" for the integer representation, and "file"
for the pointer to a 'struct file".

But most of the time the two are conceptually interchangeable, in that
an 'fd' just translates directly to a 'struct file *'.

Note that while there's that conceptual direct translation, there's
also very much a "time of use" issue, in that a "fd -> file"
translation happens at one particular time and in one particular user
context, and then it's *done* (so closing and possibly re-using the fd
after it's been looked up does not actually affect an existing 'struct
file *').

And while 'fd -> file' lookup is quick and common, the other way
doesn't exist, because multiple 'fd's can map to one 'struct file *'
thanks to dup() (and 'fork()', since a 'fd -> file' translation always
happens within the context of a particular user space, an 'fd' in one
process is obviously not the same as an 'fd' in another one).

               Linus
Linus Torvalds Jan. 4, 2024, 8:02 p.m. UTC | #14
On Thu, 4 Jan 2024 at 11:35, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>>
> Which is *NOT* the inode, because the 'struct file' has other things
> in it (the file position, the permissions that were used at open time
> etc, close-on-exec state etc etc).

That close-on-exec thing was a particularly bad example of things that
are in the 'struct file', because it's in fact the only thing that
*isn't* in 'struct file' and is associated directly with the 'int fd'.

But hopefully the intent was clear despite me picking a particularly
bad example.

            Linus
Al Viro Jan. 4, 2024, 9:28 p.m. UTC | #15
On Thu, Jan 04, 2024 at 11:35:37AM -0800, Linus Torvalds wrote:

> > "file description" - is how the file is accessed (position in the file and
> >                         flags associated to how it was opened)
> 
> That's a horrible term that shouldn't be used at all. Apparently some
> people use it for what is our 'struct file *", also known as a "file
> table entry".  Avoid it.

Worse, really.  As far as I can reconstruct what happened it was something
along the lines of "colloquial expression is 'opened file', but that is
confusing - sounds like a property+noun, so it might be misparsed as
a member of subset of files satisfying the property of 'being opened';
can't have that in a standard, let's come up with something else".
Except that what they did come up with had been much worse, for obvious
linguistic reasons.

The *ONLY* uses for that expression I can think of are
	1.  When reading POSIX texts, watch out for that one - if you
see them talking about a file descriptor in context where it really
should be about an opened file, check the wording.  If it really says
"file descriptOR", it's probably a bug in standard or a codified
bullshit practice.  If it says "file descriptION" instead, replace with
"opened file" and move on.
	2.  An outstanding example of the taste of that bunch.

IO channel would be a saner variant, but it's far too late for that.

The 3-way distinction between descriptor/opened file/file as collection of data
needs to be explained in UNIX 101; it is userland-visible and it has to be
understood.  Unfortunately, it's often done in a way that leaves students
seriously confused ;-/
Christian Brauner Jan. 5, 2024, 2:26 p.m. UTC | #16
On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
> 
> Instead of walking the dentries on mount/remount to update the gid values of
> all the dentries if a gid option is specified on mount, just update the root
> inode. Add .getattr, .setattr, and .permissions on the tracefs inode
> operations to update the permissions of the files and directories.
> 
> For all files and directories in the top level instance:
> 
>  /sys/kernel/tracing/*
> 
> It will use the root inode as the default permissions. The inode that
> represents: /sys/kernel/tracing (or wherever it is mounted).
> 
> When an instance is created:
> 
>  mkdir /sys/kernel/tracing/instance/foo
> 
> The directory "foo" and all its files and directories underneath will use
> the default of what foo is when it was created. A remount of tracefs will
> not affect it.

That kinda sounds like eventfs should actually be a separate filesystem.
But I don't know enough about the relationship between the two concepts.

> 
> If a user were to modify the permissions of any file or directory in
> tracefs, it will also no longer be modified by a change in ownership of a
> remount.

Very odd semantics and I would recommend to avoid that. It's just plain
weird imo.

> 
> The events directory, if it is in the top level instance, will use the
> tracefs root inode as the default ownership for itself and all the files and
> directories below it.
> 
> For the events directory in an instance ("foo"), it will keep the ownership
> of what it was when it was created, and that will be used as the default
> ownership for the files and directories beneath it.
> 
> Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wjVdGkjDXBbvLn2wbZnqP4UsH46E3gqJ9m7UG6DpX2+WA@mail.gmail.com/
> 
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---

So tracefs supports remounting with different uid/gid mount options and
then actually wades through _all_ of the inodes and changes their
ownership internally? What's the use-case for this? Containers?

Aside from optimizing this and the special semantics for this eventfs
stuff that you really should think twice of doing, here's one idea for
an extension that might alleviate some of the pain:

If you need flexible dynamic ownership change to e.g., be able to
delegate (all, a directory, a single file of) tracefs to
unprivileged/containers/whatever then you might want to consider
supporting idmapped mounts for tracefs. Because then you can do stuff
like:

user1@localhost:~/data/scripts$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1234:1' /run/ /mnt
user1@localhost:~/data/scripts$ ls -ln /run/
total 12
drwxr-xr-x  2 0  0   40 Jan  5 12:12 credentials
drwx------  2 0  0   40 Jan  5 11:57 cryptsetup
drwxr-xr-x  2 0  0   60 Jan  5 11:57 dbus
drwx------  6 0  0  280 Jan  5 11:57 incus_agent
prw-------  1 0  0    0 Jan  5 11:57 initctl
drwxrwxrwt  4 0  0   80 Jan  5 11:57 lock
drwxr-xr-x  3 0  0   60 Jan  5 11:57 log
drwx------  2 0  0   40 Jan  5 11:57 lvm
-r--r--r--  1 0  0   33 Jan  5 11:57 machine-id
-rw-r--r--  1 0  0  101 Jan  5 11:58 motd.dynamic
drwxr-xr-x  2 0  0   40 Jan  5 11:57 mount
drwx------  2 0  0   40 Jan  5 11:57 multipath
drwxr-xr-x  2 0  0   40 Jan  5 11:57 sendsigs.omit.d
lrwxrwxrwx  1 0  0    8 Jan  5 11:57 shm -> /dev/shm
drwx--x--x  2 0  0   40 Jan  5 11:57 sudo
drwxr-xr-x 24 0  0  660 Jan  5 14:30 systemd
drwxr-xr-x  6 0  0  140 Jan  5 14:30 udev
drwxr-xr-x  4 0  0   80 Jan  5 11:58 user
-rw-rw-r--  1 0 43 2304 Jan  5 15:15 utmp

user1@localhost:~/data/scripts$ ls -ln /mnt/
total 12
drwxr-xr-x  2 1234  1000   40 Jan  5 12:12 credentials
drwx------  2 1234  1000   40 Jan  5 11:57 cryptsetup
drwxr-xr-x  2 1234  1000   60 Jan  5 11:57 dbus
drwxr-xr-x  2 1234  1000   40 Jan  5 11:57 incus_agent
prw-------  1 1234  1000    0 Jan  5 11:57 initctl
drwxr-xr-x  2 1234  1000   40 Jan  5 11:57 lock
drwxr-xr-x  3 1234  1000   60 Jan  5 11:57 log
drwx------  2 1234  1000   40 Jan  5 11:57 lvm
-r--r--r--  1 1234  1000   33 Jan  5 11:57 machine-id
-rw-r--r--  1 1234  1000  101 Jan  5 11:58 motd.dynamic
drwxr-xr-x  2 1234  1000   40 Jan  5 11:57 mount
drwx------  2 1234  1000   40 Jan  5 11:57 multipath
drwxr-xr-x  2 1234  1000   40 Jan  5 11:57 sendsigs.omit.d
lrwxrwxrwx  1 1234  1000    8 Jan  5 11:57 shm -> /dev/shm
drwx--x--x  2 1234  1000   40 Jan  5 11:57 sudo
drwxr-xr-x 24 1234  1000  660 Jan  5 14:30 systemd
drwxr-xr-x  6 1234  1000  140 Jan  5 14:30 udev
drwxr-xr-x  4 1234  1000   80 Jan  5 11:58 user
-rw-rw-r--  1 1234 65534 2304 Jan  5 15:15 utmp

Where you can see that ownership of this tmpfs instance in this example
is changed. I'm not trying to advocate here but this will probably
ultimately be nicer for your users because it means that a container
manager or whatever can be handed a part of tracefs (or all of it) and
the ownership and access rights for that thing is correct. And you can
get rid of that gid based access completely.

You can change uids, gids, or both. You can specify up to 340 individual
mappings it's quite flexible.

Because then you can have a single tracefs superblock and have multiple
mounts with different ownership for the relevant parts of tracefs that
you want to delegate to whoever. If you need an ownership change you can
then just create another idmapped mount with the new ownership and then
use MOVE_MOUNT_BENEATH + umount to replace that mount.

Probably even know someone that would implement this for you (not me) if
that sounds like something that would cover some of the use-case for the
proposed change here. But maybe I just misunderstood things completely.
Steven Rostedt Jan. 5, 2024, 2:59 p.m. UTC | #17
On Fri, 5 Jan 2024 15:26:28 +0100
Christian Brauner <brauner@kernel.org> wrote:

> On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:
> > From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
> > 
> > Instead of walking the dentries on mount/remount to update the gid values of
> > all the dentries if a gid option is specified on mount, just update the root
> > inode. Add .getattr, .setattr, and .permissions on the tracefs inode
> > operations to update the permissions of the files and directories.
> > 
> > For all files and directories in the top level instance:
> > 
> >  /sys/kernel/tracing/*
> > 
> > It will use the root inode as the default permissions. The inode that
> > represents: /sys/kernel/tracing (or wherever it is mounted).
> > 
> > When an instance is created:
> > 
> >  mkdir /sys/kernel/tracing/instance/foo
> > 
> > The directory "foo" and all its files and directories underneath will use
> > the default of what foo is when it was created. A remount of tracefs will
> > not affect it.  
> 
> That kinda sounds like eventfs should actually be a separate filesystem.
> But I don't know enough about the relationship between the two concepts.

It may someday become one, as eventfs is used by perf where the rest of the
tracefs system is not.

> 
> > 
> > If a user were to modify the permissions of any file or directory in
> > tracefs, it will also no longer be modified by a change in ownership of a
> > remount.  
> 
> Very odd semantics and I would recommend to avoid that. It's just plain
> weird imo.
> 
> > 
> > The events directory, if it is in the top level instance, will use the
> > tracefs root inode as the default ownership for itself and all the files and
> > directories below it.
> > 
> > For the events directory in an instance ("foo"), it will keep the ownership
> > of what it was when it was created, and that will be used as the default
> > ownership for the files and directories beneath it.
> > 
> > Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wjVdGkjDXBbvLn2wbZnqP4UsH46E3gqJ9m7UG6DpX2+WA@mail.gmail.com/
> > 
> > Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> > ---  
> 
> So tracefs supports remounting with different uid/gid mount options and
> then actually wades through _all_ of the inodes and changes their
> ownership internally? What's the use-case for this? Containers?

No, in fact tracing doesn't work well with containers as tracing is global
to the entire machine. It can work with privileged containers though.

The reason for this is because tracefs was based off of debugfs where the
files and directores are created at boot up and mounted later. The reason
to do this was to allow users to mount with gid=GID to allow a given group
to have access to tracing. Without this update, tracefs would ignore it
like debugfs and proc does today.

I think its time I explain the purpose of tracefs and how it came to be.

The tracing system required a way to control tracing and read the traces.
It could have just used a new system like perf (although
/sys/kernel/debug/tracing predates perf), where it created a single ioctl()
like system call do do everything.

As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev
tracer, which both have an embedded background, I chose an interface that
could work with just an unmodified version of busybox. That is, I wanted it
to work with just cat and echo.

The main difference with tracefs compared to other file systems is that it
is a control interface, where writes happen as much as reads. The data read
is controlled. The closest thing I can think of is how cgroups work.

As tracing is a privileged operation, but something that could be changed
to allow a group to have access to, I wanted to make it easy for an admin
to decide who gets to do what at boot up via the /etc/fstab file.

> 
> Aside from optimizing this and the special semantics for this eventfs
> stuff that you really should think twice of doing, here's one idea for
> an extension that might alleviate some of the pain:
> 
> If you need flexible dynamic ownership change to e.g., be able to
> delegate (all, a directory, a single file of) tracefs to
> unprivileged/containers/whatever then you might want to consider
> supporting idmapped mounts for tracefs. Because then you can do stuff
> like:

I'm a novice here and have no idea on how id maps work ;-)

> 
> user1@localhost:~/data/scripts$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1234:1' /run/ /mnt
> user1@localhost:~/data/scripts$ ls -ln /run/
> total 12
> drwxr-xr-x  2 0  0   40 Jan  5 12:12 credentials
> drwx------  2 0  0   40 Jan  5 11:57 cryptsetup
> drwxr-xr-x  2 0  0   60 Jan  5 11:57 dbus
> drwx------  6 0  0  280 Jan  5 11:57 incus_agent
> prw-------  1 0  0    0 Jan  5 11:57 initctl
> drwxrwxrwt  4 0  0   80 Jan  5 11:57 lock
> drwxr-xr-x  3 0  0   60 Jan  5 11:57 log
> drwx------  2 0  0   40 Jan  5 11:57 lvm
> -r--r--r--  1 0  0   33 Jan  5 11:57 machine-id
> -rw-r--r--  1 0  0  101 Jan  5 11:58 motd.dynamic
> drwxr-xr-x  2 0  0   40 Jan  5 11:57 mount
> drwx------  2 0  0   40 Jan  5 11:57 multipath
> drwxr-xr-x  2 0  0   40 Jan  5 11:57 sendsigs.omit.d
> lrwxrwxrwx  1 0  0    8 Jan  5 11:57 shm -> /dev/shm
> drwx--x--x  2 0  0   40 Jan  5 11:57 sudo
> drwxr-xr-x 24 0  0  660 Jan  5 14:30 systemd
> drwxr-xr-x  6 0  0  140 Jan  5 14:30 udev
> drwxr-xr-x  4 0  0   80 Jan  5 11:58 user
> -rw-rw-r--  1 0 43 2304 Jan  5 15:15 utmp
> 
> user1@localhost:~/data/scripts$ ls -ln /mnt/
> total 12
> drwxr-xr-x  2 1234  1000   40 Jan  5 12:12 credentials
> drwx------  2 1234  1000   40 Jan  5 11:57 cryptsetup
> drwxr-xr-x  2 1234  1000   60 Jan  5 11:57 dbus
> drwxr-xr-x  2 1234  1000   40 Jan  5 11:57 incus_agent
> prw-------  1 1234  1000    0 Jan  5 11:57 initctl
> drwxr-xr-x  2 1234  1000   40 Jan  5 11:57 lock
> drwxr-xr-x  3 1234  1000   60 Jan  5 11:57 log
> drwx------  2 1234  1000   40 Jan  5 11:57 lvm
> -r--r--r--  1 1234  1000   33 Jan  5 11:57 machine-id
> -rw-r--r--  1 1234  1000  101 Jan  5 11:58 motd.dynamic
> drwxr-xr-x  2 1234  1000   40 Jan  5 11:57 mount
> drwx------  2 1234  1000   40 Jan  5 11:57 multipath
> drwxr-xr-x  2 1234  1000   40 Jan  5 11:57 sendsigs.omit.d
> lrwxrwxrwx  1 1234  1000    8 Jan  5 11:57 shm -> /dev/shm
> drwx--x--x  2 1234  1000   40 Jan  5 11:57 sudo
> drwxr-xr-x 24 1234  1000  660 Jan  5 14:30 systemd
> drwxr-xr-x  6 1234  1000  140 Jan  5 14:30 udev
> drwxr-xr-x  4 1234  1000   80 Jan  5 11:58 user
> -rw-rw-r--  1 1234 65534 2304 Jan  5 15:15 utmp
> 
> Where you can see that ownership of this tmpfs instance in this example
> is changed. I'm not trying to advocate here but this will probably
> ultimately be nicer for your users because it means that a container
> manager or whatever can be handed a part of tracefs (or all of it) and
> the ownership and access rights for that thing is correct. And you can
> get rid of that gid based access completely.
> 
> You can change uids, gids, or both. You can specify up to 340 individual
> mappings it's quite flexible.
> 
> Because then you can have a single tracefs superblock and have multiple
> mounts with different ownership for the relevant parts of tracefs that
> you want to delegate to whoever. If you need an ownership change you can
> then just create another idmapped mount with the new ownership and then
> use MOVE_MOUNT_BENEATH + umount to replace that mount.
> 
> Probably even know someone that would implement this for you (not me) if
> that sounds like something that would cover some of the use-case for the
> proposed change here. But maybe I just misunderstood things completely.

That actually sounds nice. I have no idea how to implement it, but having a
way to bind mount with different permissions looks like a nifty feature to
have. Now, can we do that and still keep the dynamic creation of inodes and
dentries? Does this require having more than one dentry per inode?

-- Steve
Christian Brauner Jan. 7, 2024, 12:42 p.m. UTC | #18
> > So tracefs supports remounting with different uid/gid mount options and
> > then actually wades through _all_ of the inodes and changes their
> > ownership internally? What's the use-case for this? Containers?
> 
> No, in fact tracing doesn't work well with containers as tracing is global
> to the entire machine. It can work with privileged containers though.

At least the tracefs interface is easily supportable within a delegation
model. IOW, you have a privileged process that delegates relevant
portions to a container via idmapped mounts _without_ doing the insane thing
and making it mountable by a container aka the fs-to-CVE pipeline.

> 
> The reason for this is because tracefs was based off of debugfs where the
> files and directores are created at boot up and mounted later. The reason
> to do this was to allow users to mount with gid=GID to allow a given group
> to have access to tracing. Without this update, tracefs would ignore it
> like debugfs and proc does today.
> 
> I think its time I explain the purpose of tracefs and how it came to be.
> 
> The tracing system required a way to control tracing and read the traces.
> It could have just used a new system like perf (although
> /sys/kernel/debug/tracing predates perf), where it created a single ioctl()
> like system call do do everything.
> 
> As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev
> tracer, which both have an embedded background, I chose an interface that
> could work with just an unmodified version of busybox. That is, I wanted it
> to work with just cat and echo.
> 
> The main difference with tracefs compared to other file systems is that it
> is a control interface, where writes happen as much as reads. The data read
> is controlled. The closest thing I can think of is how cgroups work.
> 
> As tracing is a privileged operation, but something that could be changed
> to allow a group to have access to, I wanted to make it easy for an admin
> to decide who gets to do what at boot up via the /etc/fstab file.

Yeah, ok. I think you could achieve the same thing via idmapped mounts. You
just need to swap out the mnt on /sys/kernel/tracing with an idmapped mount.

mount(8) should just give you the ability to specify "map the ids I explicitly
want to remap to something else and for the rest use the identity mapping". I
wanted that for other reasons anyway.

So in one of the next versions of mount(8) you can then do (where --beneath
means place the mount beneath the current one and --replace is
self-explanatory):

sudo mount --beneath -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing
sudo umount /sys/kernel/tracing

or as a shortcut provided by mount(8):

sudo mount --replace -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing 

In both cases you replace the mount without unmounting tracefs.

I can illustrate this right now though:

user1@localhost:~$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1000:1' /sys/kernel/tracing/ /mnt/

# This is a tool I wrote for testing the patchset I wrote back then.
user1@localhost:~/data/move-mount-beneath$ sudo ./move-mount --beneath --detached /mnt /sys/kernel/tracing
Mounting beneath top mount
Creating anonymous mount
Attaching mount /mnt -> /sys/kernel/tracing
Creating single detached mount

user1@localhost:~/data/move-mount-beneath$

# Now there's two mounts stacked on top of each other.
user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing
| `-/sys/kernel/tracing        tracefs        tracefs     rw,nosuid,nodev,noexec,relatime,idmapped
|   `-/sys/kernel/tracing      tracefs        tracefs     rw,nosuid,nodev,noexec,relatime

user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| head
total 0
drwx------  6 root root 0 Jan  7 13:33 .
drwxr-xr-x 16 root root 0 Jan  7 13:33 ..
-r--r-----  1 root root 0 Jan  7 13:33 README
-r--r-----  1 root root 0 Jan  7 13:33 available_events
-r--r-----  1 root root 0 Jan  7 13:33 available_filter_functions
-r--r-----  1 root root 0 Jan  7 13:33 available_filter_functions_addrs
-r--r-----  1 root root 0 Jan  7 13:33 available_tracers
-rw-r-----  1 root root 0 Jan  7 13:33 buffer_percent
-rw-r-----  1 root root 0 Jan  7 13:33 buffer_size_kb

# Reveal updated mount
user1@localhost:~/data/move-mount-beneath$ sudo umount /sys/kernel/tracing

user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing
| `-/sys/kernel/tracing        tracefs        tracefs     rw,nosuid,nodev,noexec,relatime,idmapped

user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| head
total 0
drwx------  6 user1 user1 0 Jan  7 13:33 .
drwxr-xr-x 16 root  root  0 Jan  7 13:33 ..
-r--r-----  1 user1 user1 0 Jan  7 13:33 README
-r--r-----  1 user1 user1 0 Jan  7 13:33 available_events
-r--r-----  1 user1 user1 0 Jan  7 13:33 available_filter_functions
-r--r-----  1 user1 user1 0 Jan  7 13:33 available_filter_functions_addrs
-r--r-----  1 user1 user1 0 Jan  7 13:33 available_tracers
-rw-r-----  1 user1 user1 0 Jan  7 13:33 buffer_percent
-rw-r-----  1 user1 user1 0 Jan  7 13:33 buffer_size_kb

sudo umount -l /sys/kernel/tracing

and reveal the new mount with updated permissions and at no point in time will
you have had to unmount tracefs itself. No chown needed, no remount needed that
has to touch all inodes.

I did perf numbers for this when I implemented this and there's no meaningful
perf impact for anything below 5 mappings. which covers 80% of the use-cases.

I mean, there are crazy people out there that do have 30 mappings, maybe. And
maybe there's 3 users that go into the hundreds (340 is maximum so struct idmap
still fits into cacheline) because of some LDAP crap or something.

See an ext4 filesystem to open/create 1,000,000 files. Then I looped through
all of the files calling fstat() on each of them 1000 times and calculated the
mean fstat() time for a single file.

|   # MAPPINGS | PATCH-NEW |
|--------------|-----------|
|   0 mappings |   158 ns  |
|   1 mappings |   157 ns  |
|   2 mappings |   158 ns  |
|   3 mappings |   161 ns  |
|   5 mappings |   165 ns  |
|  10 mappings |   199 ns  |
|  50 mappings |   218 ns  |
| 100 mappings |   229 ns  |
| 200 mappings |   239 ns  |
| 300 mappings |   240 ns  |
| 340 mappings |   248 ns  |

> 
> > 
> > Aside from optimizing this and the special semantics for this eventfs
> > stuff that you really should think twice of doing, here's one idea for
> > an extension that might alleviate some of the pain:
> > 
> > If you need flexible dynamic ownership change to e.g., be able to
> > delegate (all, a directory, a single file of) tracefs to
> > unprivileged/containers/whatever then you might want to consider
> > supporting idmapped mounts for tracefs. Because then you can do stuff
> > like:
> 
> I'm a novice here and have no idea on how id maps work ;-)

It's no magic and you don't even need to care about it. If you're on
util-linux 2.39 and any kernel post 5.12 the -o X-mount.idmap option
does all the details for you.

Though I still want an api where you can just pass the idmappings directly to
mount_setattr(). That's another topic though.

> have. Now, can we do that and still keep the dynamic creation of inodes and
> dentries?

Yes.

> Does this require having more than one dentry per inode?

No. It's just a topological change. The same subtree exposed at different
locations with the ability of exposing it with different permissions (without
any persistent filesystem-level changes).

So, I tried to do an exploratory patch even though I promised myself not
to do it. But hey...

Some notes:

* Permission handling for idmapped mounts is done completely in the
  VFS. That's the case for all filesytems that don't have a custom
  ->permission() handler. So there's nothing to do for us here.

* Idmapped mount handling for ->getattr() is done completely by the VFS
  if the filesystem doesn't have a custom ->getattr() handler. So we're
  done here.

* Tracefs doesn't support attribute changes via ->setattr() (chown(),
  chmod etc.). So there's nothing to here.

* Eventfs does support attribute changes via ->setattr(). But it relies
  on simple_setattr() which is already idmapped mount aware. So there's
  nothing for us to do.

* Ownership is inherited from the parent inode (tracefs) or optionally
  from stashed ownership information (eventfs). That means the idmapping
  is irrelevant here. It's similar to the "inherit gid from parent
  directory" logic we have in some circumstances. TL;DR nothing to do
  here as well.

* Tracefs supports the creation of instances from userspace via mkdir.
  For example,

	mkdir /sys/kernel/tracing/instances/foo

  And here the idmapping is relevant so we need to make the helpers
  aware of the idmapping.

  I just went and plumbed this through to most helpers.

There's some subtlety in eventfs. Afaict, the directories and files for
the individual events are created on-demand during lookup or readdir.

The ownership of these events is again inherited from the parent inode
or recovered from stored state. In both cases the actual idmapping is
irrelevant.

The callchain here is:

eventfs_root_lookup("xfs", "events")
-> create_{dir,file}_dentry("xfs", "events")
   -> create_{dir,file}("xfs", "events")
      -> eventfs_start_creating("xfs", "events")
         -> lookup_one_len("xfs", "events")

And the subtlety is that lookup_one_len() does permission checking on
the parent inode (IOW, if you want a dentry for "blech" under "events"
it'll do a permission check on events->d_inode) for exec permissions
and then goes on to give you a new dentry.

Usually this call would have to be changed to lookup_one() and the
idmapping be handed down to it. But I think that's irrelevant here.

Lookup generally doesn't need to be aware of idmappings at all. The
permission checking is done purely in the vfs via may_lookup() and the
idmapping is irrelevant because we always initialize inodes with the
filesystem level ownership (see the idmappings.rst) documentation if
you're interested in excessive details (otherwise you get inode aliases
which you really don't want).

For tracefs it would not matter for lookup per se but only because
tracefs seemingly creates inodes/dentries during lookup (and readdir()).

But imho the permission checking done in current eventfs_root_lookup()
via lookup_one_len() is meaningless in any way; possibly even
(conceptually) wrong.

Because, the actual permission checking for the creation of the eventfs
entries isn't really done during lookup or readdir, it's done when mkdir
is called:

        mkdir /sys/kernel/tracing/instances/foo

Here, all possible entries beneath foo including "events" and further
below are recorded and stored. So once mkdir returns it basically means
that it succeeded with the creation of all the necessary directories and
files. For all purposes the foo/events/ directory and below have all the
entries that matter. They have been created. It's comparable to them not
being in the {d,i}cache, I guess.

When one goes and looksup stuff under foo/events/ or readdir the entries
in that directory:

fd = open("foo/events")
readdir(fd, ...)

then they are licensed to list an entry in that directory. So all that
needs to be done is to actually list those files in that directory. And
since they already exist (they were created during mkdir) we just need
to splice in inodes and dentries for them. But for that we shouldn't
check permissions on the directory again. Because we've done that
already correctly when the VFS called may_lookup().

IOW, the inode_permission() in lookup_one_len() that eventfs does is
redundant and just wrong.

Luckily, I don't think we need to even change anything because all
directories that eventfs creates always grant exec permissions to the
other group so lookup_one_len() will trivially succeed. IIUC.

Drafted-by-with-no-guarantees-whatsoever-that-this-wont-burn-the-house-down: Christian Brauner <brauner@kernel.org>
---
 fs/tracefs/event_inode.c             |   8 +-
 fs/tracefs/inode.c                   |  38 +++--
 fs/tracefs/internal.h                |   3 +-
 include/linux/tracefs.h              |  20 +--
 kernel/trace/ftrace.c                |  43 +++---
 kernel/trace/trace.c                 | 201 ++++++++++++++-------------
 kernel/trace/trace.h                 |  22 +--
 kernel/trace/trace_dynevent.c        |   2 +-
 kernel/trace/trace_events.c          |  27 ++--
 kernel/trace/trace_events_synth.c    |   5 +-
 kernel/trace/trace_functions.c       |   6 +-
 kernel/trace/trace_functions_graph.c |   4 +-
 kernel/trace/trace_hwlat.c           |   8 +-
 kernel/trace/trace_kprobe.c          |   4 +-
 kernel/trace/trace_osnoise.c         |  48 ++++---
 kernel/trace/trace_printk.c          |   4 +-
 kernel/trace/trace_stack.c           |  11 +-
 kernel/trace/trace_stat.c            |   6 +-
 kernel/trace/trace_uprobe.c          |   4 +-
 19 files changed, 251 insertions(+), 213 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 2ccc849a5bda..e2f352bd8779 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -852,11 +852,11 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode
  *
  * See eventfs_create_dir() for use of @entries.
  */
-struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry *parent,
-						const struct eventfs_entry *entries,
-						int size, void *data)
+struct eventfs_inode *eventfs_create_events_dir(
+	struct mnt_idmap *idmap, const char *name, struct dentry *parent,
+	const struct eventfs_entry *entries, int size, void *data)
 {
-	struct dentry *dentry = tracefs_start_creating(name, parent);
+	struct dentry *dentry = tracefs_start_creating(idmap, name, parent);
 	struct eventfs_inode *ei;
 	struct tracefs_inode *ti;
 	struct inode *inode;
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index ae648deed019..f4f4904eb3a0 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -68,7 +68,7 @@ static const struct file_operations tracefs_file_operations = {
 };
 
 static struct tracefs_dir_ops {
-	int (*mkdir)(const char *name);
+	int (*mkdir)(struct mnt_idmap *idmap, const char *name);
 	int (*rmdir)(const char *name);
 } tracefs_ops __ro_after_init;
 
@@ -104,7 +104,7 @@ static int tracefs_syscall_mkdir(struct mnt_idmap *idmap,
 	 * mkdir routine to handle races.
 	 */
 	inode_unlock(inode);
-	ret = tracefs_ops.mkdir(name);
+	ret = tracefs_ops.mkdir(idmap, name);
 	inode_lock(inode);
 
 	kfree(name);
@@ -439,10 +439,12 @@ static struct file_system_type trace_fs_type = {
 	.name =		"tracefs",
 	.mount =	trace_mount,
 	.kill_sb =	kill_litter_super,
+	.fs_flags =	FS_ALLOW_IDMAP,
 };
 MODULE_ALIAS_FS("tracefs");
 
-struct dentry *tracefs_start_creating(const char *name, struct dentry *parent)
+struct dentry *tracefs_start_creating(struct mnt_idmap *idmap, const char *name,
+				      struct dentry *parent)
 {
 	struct dentry *dentry;
 	int error;
@@ -466,7 +468,7 @@ struct dentry *tracefs_start_creating(const char *name, struct dentry *parent)
 	if (unlikely(IS_DEADDIR(d_inode(parent))))
 		dentry = ERR_PTR(-ENOENT);
 	else
-		dentry = lookup_one_len(name, parent, strlen(name));
+		dentry = lookup_one(idmap, name, parent, strlen(name));
 	if (!IS_ERR(dentry) && d_inode(dentry)) {
 		dput(dentry);
 		dentry = ERR_PTR(-EEXIST);
@@ -589,8 +591,9 @@ struct dentry *eventfs_end_creating(struct dentry *dentry)
  * If tracefs is not enabled in the kernel, the value -%ENODEV will be
  * returned.
  */
-struct dentry *tracefs_create_file(const char *name, umode_t mode,
-				   struct dentry *parent, void *data,
+struct dentry *tracefs_create_file(struct mnt_idmap *idmap, const char *name,
+				   umode_t mode, struct dentry *parent,
+				   void *data,
 				   const struct file_operations *fops)
 {
 	struct dentry *dentry;
@@ -602,7 +605,7 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
 	if (!(mode & S_IFMT))
 		mode |= S_IFREG;
 	BUG_ON(!S_ISREG(mode));
-	dentry = tracefs_start_creating(name, parent);
+	dentry = tracefs_start_creating(idmap, name, parent);
 
 	if (IS_ERR(dentry))
 		return NULL;
@@ -621,10 +624,11 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
 	return tracefs_end_creating(dentry);
 }
 
-static struct dentry *__create_dir(const char *name, struct dentry *parent,
+static struct dentry *__create_dir(struct mnt_idmap *idmap, const char *name,
+				   struct dentry *parent,
 				   const struct inode_operations *ops)
 {
-	struct dentry *dentry = tracefs_start_creating(name, parent);
+	struct dentry *dentry = tracefs_start_creating(idmap, name, parent);
 	struct inode *inode;
 
 	if (IS_ERR(dentry))
@@ -649,6 +653,10 @@ static struct dentry *__create_dir(const char *name, struct dentry *parent,
 	return tracefs_end_creating(dentry);
 }
 
+const struct inode_operations tracefs_default_dir_inode_operations = {
+	.lookup		= simple_lookup,
+};
+
 /**
  * tracefs_create_dir - create a directory in the tracefs filesystem
  * @name: a pointer to a string containing the name of the directory to
@@ -666,12 +674,14 @@ static struct dentry *__create_dir(const char *name, struct dentry *parent,
  * If tracing is not enabled in the kernel, the value -%ENODEV will be
  * returned.
  */
-struct dentry *tracefs_create_dir(const char *name, struct dentry *parent)
+struct dentry *tracefs_create_dir(struct mnt_idmap *idmap, const char *name,
+				  struct dentry *parent)
 {
 	if (security_locked_down(LOCKDOWN_TRACEFS))
 		return NULL;
 
-	return __create_dir(name, parent, &simple_dir_inode_operations);
+	return __create_dir(idmap, name, parent,
+			    &tracefs_default_dir_inode_operations);
 }
 
 /**
@@ -693,7 +703,8 @@ struct dentry *tracefs_create_dir(const char *name, struct dentry *parent)
  */
 __init struct dentry *tracefs_create_instance_dir(const char *name,
 					  struct dentry *parent,
-					  int (*mkdir)(const char *name),
+					  int (*mkdir)(struct mnt_idmap *idmap,
+						       const char *name),
 					  int (*rmdir)(const char *name))
 {
 	struct dentry *dentry;
@@ -702,7 +713,8 @@ __init struct dentry *tracefs_create_instance_dir(const char *name,
 	if (WARN_ON(tracefs_ops.mkdir || tracefs_ops.rmdir))
 		return NULL;
 
-	dentry = __create_dir(name, parent, &tracefs_dir_inode_operations);
+	dentry = __create_dir(&nop_mnt_idmap, name, parent,
+			      &tracefs_dir_inode_operations);
 	if (!dentry)
 		return NULL;
 
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index ccee18ca66c7..64abc5fcc62f 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -70,8 +70,9 @@ static inline struct tracefs_inode *get_tracefs(const struct inode *inode)
 	return container_of(inode, struct tracefs_inode, vfs_inode);
 }
 
-struct dentry *tracefs_start_creating(const char *name, struct dentry *parent);
 struct dentry *tracefs_end_creating(struct dentry *dentry);
+struct dentry *tracefs_start_creating(struct mnt_idmap *idmap, const char *name,
+				      struct dentry *parent);
 struct dentry *tracefs_failed_creating(struct dentry *dentry);
 struct inode *tracefs_get_inode(struct super_block *sb);
 struct dentry *eventfs_start_creating(const char *name, struct dentry *parent);
diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h
index 7a5fe17b6bf9..0457afbafc94 100644
--- a/include/linux/tracefs.h
+++ b/include/linux/tracefs.h
@@ -76,9 +76,9 @@ struct eventfs_entry {
 
 struct eventfs_inode;
 
-struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry *parent,
-						const struct eventfs_entry *entries,
-						int size, void *data);
+struct eventfs_inode *eventfs_create_events_dir(
+	struct mnt_idmap *idmap, const char *name, struct dentry *parent,
+	const struct eventfs_entry *entries, int size, void *data);
 
 struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode *parent,
 					 const struct eventfs_entry *entries,
@@ -87,16 +87,20 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode
 void eventfs_remove_events_dir(struct eventfs_inode *ei);
 void eventfs_remove_dir(struct eventfs_inode *ei);
 
-struct dentry *tracefs_create_file(const char *name, umode_t mode,
-				   struct dentry *parent, void *data,
+struct dentry *tracefs_create_file(struct mnt_idmap *idmap, const char *name,
+				   umode_t mode, struct dentry *parent,
+				   void *data,
 				   const struct file_operations *fops);
 
-struct dentry *tracefs_create_dir(const char *name, struct dentry *parent);
+struct dentry *tracefs_create_dir(struct mnt_idmap *idmap, const char *name,
+				  struct dentry *parent);
 
 void tracefs_remove(struct dentry *dentry);
 
-struct dentry *tracefs_create_instance_dir(const char *name, struct dentry *parent,
-					   int (*mkdir)(const char *name),
+struct dentry *tracefs_create_instance_dir(const char *name,
+					   struct dentry *parent,
+					   int (*mkdir)(struct mnt_idmap *idmap,
+							const char *name),
 					   int (*rmdir)(const char *name));
 
 bool tracefs_initialized(void);
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 8de8bec5f366..6bf9dc7c5c29 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -1015,7 +1015,7 @@ static __init void ftrace_profile_tracefs(struct dentry *d_tracer)
 		}
 	}
 
-	trace_create_file("function_profile_enabled",
+	trace_create_file(&nop_mnt_idmap, "function_profile_enabled",
 			  TRACE_MODE_WRITE, d_tracer, NULL,
 			  &ftrace_profile_fops);
 }
@@ -6380,14 +6380,14 @@ static const struct file_operations ftrace_graph_notrace_fops = {
 };
 #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
 
-void ftrace_create_filter_files(struct ftrace_ops *ops,
+void ftrace_create_filter_files(struct mnt_idmap *idmap, struct ftrace_ops *ops,
 				struct dentry *parent)
 {
 
-	trace_create_file("set_ftrace_filter", TRACE_MODE_WRITE, parent,
+	trace_create_file(idmap, "set_ftrace_filter", TRACE_MODE_WRITE, parent,
 			  ops, &ftrace_filter_fops);
 
-	trace_create_file("set_ftrace_notrace", TRACE_MODE_WRITE, parent,
+	trace_create_file(idmap, "set_ftrace_notrace", TRACE_MODE_WRITE, parent,
 			  ops, &ftrace_notrace_fops);
 }
 
@@ -6413,28 +6413,26 @@ void ftrace_destroy_filter_files(struct ftrace_ops *ops)
 
 static __init int ftrace_init_dyn_tracefs(struct dentry *d_tracer)
 {
+	trace_create_file(&nop_mnt_idmap, "available_filter_functions",
+			  TRACE_MODE_READ, d_tracer, NULL, &ftrace_avail_fops);
 
-	trace_create_file("available_filter_functions", TRACE_MODE_READ,
-			d_tracer, NULL, &ftrace_avail_fops);
+	trace_create_file(&nop_mnt_idmap, "available_filter_functions_addrs",
+			  TRACE_MODE_READ, d_tracer, NULL,
+			  &ftrace_avail_addrs_fops);
 
-	trace_create_file("available_filter_functions_addrs", TRACE_MODE_READ,
-			d_tracer, NULL, &ftrace_avail_addrs_fops);
-
-	trace_create_file("enabled_functions", TRACE_MODE_READ,
+	trace_create_file(&nop_mnt_idmap, "enabled_functions", TRACE_MODE_READ,
 			d_tracer, NULL, &ftrace_enabled_fops);
 
-	trace_create_file("touched_functions", TRACE_MODE_READ,
-			d_tracer, NULL, &ftrace_touched_fops);
+	trace_create_file(&nop_mnt_idmap, "touched_functions", TRACE_MODE_READ,
+			  d_tracer, NULL, &ftrace_touched_fops);
 
-	ftrace_create_filter_files(&global_ops, d_tracer);
+	ftrace_create_filter_files(&nop_mnt_idmap, &global_ops, d_tracer);
 
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
-	trace_create_file("set_graph_function", TRACE_MODE_WRITE, d_tracer,
-				    NULL,
-				    &ftrace_graph_fops);
-	trace_create_file("set_graph_notrace", TRACE_MODE_WRITE, d_tracer,
-				    NULL,
-				    &ftrace_graph_notrace_fops);
+	trace_create_file(&nop_mnt_idmap, "set_graph_function",
+			  TRACE_MODE_WRITE, d_tracer, NULL, &ftrace_graph_fops);
+	trace_create_file(&nop_mnt_idmap, "set_graph_notrace", TRACE_MODE_WRITE,
+			  d_tracer, NULL, &ftrace_graph_notrace_fops);
 #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
 
 	return 0;
@@ -7858,11 +7856,12 @@ static const struct file_operations ftrace_no_pid_fops = {
 	.release	= ftrace_pid_release,
 };
 
-void ftrace_init_tracefs(struct trace_array *tr, struct dentry *d_tracer)
+void ftrace_init_tracefs(struct mnt_idmap *idmap, struct trace_array *tr,
+			 struct dentry *d_tracer)
 {
-	trace_create_file("set_ftrace_pid", TRACE_MODE_WRITE, d_tracer,
+	trace_create_file(idmap, "set_ftrace_pid", TRACE_MODE_WRITE, d_tracer,
 			    tr, &ftrace_pid_fops);
-	trace_create_file("set_ftrace_notrace_pid", TRACE_MODE_WRITE,
+	trace_create_file(idmap, "set_ftrace_notrace_pid", TRACE_MODE_WRITE,
 			  d_tracer, tr, &ftrace_no_pid_fops);
 }
 
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 199df497db07..c0bd4b0dfe85 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1766,12 +1766,13 @@ static void latency_fsnotify_workfn_irq(struct irq_work *iwork)
 	queue_work(fsnotify_wq, &tr->fsnotify_work);
 }
 
-static void trace_create_maxlat_file(struct trace_array *tr,
+static void trace_create_maxlat_file(struct mnt_idmap *idmap,
+				     struct trace_array *tr,
 				     struct dentry *d_tracer)
 {
 	INIT_WORK(&tr->fsnotify_work, latency_fsnotify_workfn);
 	init_irq_work(&tr->fsnotify_irqwork, latency_fsnotify_workfn_irq);
-	tr->d_max_latency = trace_create_file("tracing_max_latency",
+	tr->d_max_latency = trace_create_file(idmap, "tracing_max_latency",
 					      TRACE_MODE_WRITE,
 					      d_tracer, tr,
 					      &tracing_max_lat_fops);
@@ -1804,8 +1805,8 @@ void latency_fsnotify(struct trace_array *tr)
 
 #else /* !LATENCY_FS_NOTIFY */
 
-#define trace_create_maxlat_file(tr, d_tracer)				\
-	trace_create_file("tracing_max_latency", TRACE_MODE_WRITE,	\
+#define trace_create_maxlat_file(idmap, tr, d_tracer)				\
+	trace_create_file(idmap, "tracing_max_latency", TRACE_MODE_WRITE,	\
 			  d_tracer, tr, &tracing_max_lat_fops)
 
 #endif
@@ -2124,7 +2125,8 @@ static inline int do_run_tracer_selftest(struct tracer *type)
 }
 #endif /* CONFIG_FTRACE_STARTUP_TEST */
 
-static void add_tracer_options(struct trace_array *tr, struct tracer *t);
+static void add_tracer_options(struct mnt_idmap *idmap, struct trace_array *tr,
+			       struct tracer *t);
 
 static void __init apply_trace_boot_options(void);
 
@@ -2191,7 +2193,7 @@ int __init register_tracer(struct tracer *type)
 
 	type->next = trace_types;
 	trace_types = type;
-	add_tracer_options(&global_trace, type);
+	add_tracer_options(&nop_mnt_idmap, &global_trace, type);
 
  out:
 	mutex_unlock(&trace_types_lock);
@@ -6245,14 +6247,14 @@ trace_insert_eval_map_file(struct module *mod, struct trace_eval_map **start,
 	mutex_unlock(&trace_eval_mutex);
 }
 
-static void trace_create_eval_file(struct dentry *d_tracer)
+static void trace_create_eval_file(struct mnt_idmap *idmap, struct dentry *d_tracer)
 {
-	trace_create_file("eval_map", TRACE_MODE_READ, d_tracer,
+	trace_create_file(idmap, "eval_map", TRACE_MODE_READ, d_tracer,
 			  NULL, &tracing_eval_map_fops);
 }
 
 #else /* CONFIG_TRACE_EVAL_MAP_FILE */
-static inline void trace_create_eval_file(struct dentry *d_tracer) { }
+static inline void trace_create_eval_file(struct mnt_idmap *idmap, struct dentry *d_tracer) { }
 static inline void trace_insert_eval_map_file(struct module *mod,
 			      struct trace_eval_map **start, int len) { }
 #endif /* !CONFIG_TRACE_EVAL_MAP_FILE */
@@ -6454,7 +6456,7 @@ int tracing_update_buffers(struct trace_array *tr)
 struct trace_option_dentry;
 
 static void
-create_trace_option_files(struct trace_array *tr, struct tracer *tracer);
+create_trace_option_files(struct mnt_idmap *idmap, struct trace_array *tr, struct tracer *tracer);
 
 /*
  * Used to clear out the tracer before deletion of an instance.
@@ -6475,7 +6477,8 @@ static void tracing_set_nop(struct trace_array *tr)
 
 static bool tracer_options_updated;
 
-static void add_tracer_options(struct trace_array *tr, struct tracer *t)
+static void add_tracer_options(struct mnt_idmap *idmap, struct trace_array *tr,
+			       struct tracer *t)
 {
 	/* Only enable if the directory has been created already. */
 	if (!tr->dir)
@@ -6485,7 +6488,7 @@ static void add_tracer_options(struct trace_array *tr, struct tracer *t)
 	if (!tracer_options_updated)
 		return;
 
-	create_trace_option_files(tr, t);
+	create_trace_option_files(idmap, tr, t);
 }
 
 int tracing_set_tracer(struct trace_array *tr, const char *buf)
@@ -8845,7 +8848,8 @@ static struct dentry *tracing_get_dentry(struct trace_array *tr)
 	return tr->dir;
 }
 
-static struct dentry *tracing_dentry_percpu(struct trace_array *tr, int cpu)
+static struct dentry *tracing_dentry_percpu(struct mnt_idmap *idmap,
+					    struct trace_array *tr, int cpu)
 {
 	struct dentry *d_tracer;
 
@@ -8856,7 +8860,7 @@ static struct dentry *tracing_dentry_percpu(struct trace_array *tr, int cpu)
 	if (IS_ERR(d_tracer))
 		return NULL;
 
-	tr->percpu_dir = tracefs_create_dir("per_cpu", d_tracer);
+	tr->percpu_dir = tracefs_create_dir(idmap, "per_cpu", d_tracer);
 
 	MEM_FAIL(!tr->percpu_dir,
 		  "Could not create tracefs directory 'per_cpu/%d'\n", cpu);
@@ -8864,11 +8868,13 @@ static struct dentry *tracing_dentry_percpu(struct trace_array *tr, int cpu)
 	return tr->percpu_dir;
 }
 
-static struct dentry *
-trace_create_cpu_file(const char *name, umode_t mode, struct dentry *parent,
-		      void *data, long cpu, const struct file_operations *fops)
+static struct dentry *trace_create_cpu_file(struct mnt_idmap *idmap,
+					    const char *name, umode_t mode,
+					    struct dentry *parent, void *data,
+					    long cpu,
+					    const struct file_operations *fops)
 {
-	struct dentry *ret = trace_create_file(name, mode, parent, data, fops);
+	struct dentry *ret = trace_create_file(idmap, name, mode, parent, data, fops);
 
 	if (ret) /* See tracing_get_cpu() */
 		d_inode(ret)->i_cdev = (void *)(cpu + 1);
@@ -8876,9 +8882,9 @@ trace_create_cpu_file(const char *name, umode_t mode, struct dentry *parent,
 }
 
 static void
-tracing_init_tracefs_percpu(struct trace_array *tr, long cpu)
+tracing_init_tracefs_percpu(struct mnt_idmap *idmap, struct trace_array *tr, long cpu)
 {
-	struct dentry *d_percpu = tracing_dentry_percpu(tr, cpu);
+	struct dentry *d_percpu = tracing_dentry_percpu(idmap, tr, cpu);
 	struct dentry *d_cpu;
 	char cpu_dir[30]; /* 30 characters should be more than enough */
 
@@ -8886,34 +8892,34 @@ tracing_init_tracefs_percpu(struct trace_array *tr, long cpu)
 		return;
 
 	snprintf(cpu_dir, 30, "cpu%ld", cpu);
-	d_cpu = tracefs_create_dir(cpu_dir, d_percpu);
+	d_cpu = tracefs_create_dir(idmap, cpu_dir, d_percpu);
 	if (!d_cpu) {
 		pr_warn("Could not create tracefs '%s' entry\n", cpu_dir);
 		return;
 	}
 
 	/* per cpu trace_pipe */
-	trace_create_cpu_file("trace_pipe", TRACE_MODE_READ, d_cpu,
+	trace_create_cpu_file(idmap, "trace_pipe", TRACE_MODE_READ, d_cpu,
 				tr, cpu, &tracing_pipe_fops);
 
 	/* per cpu trace */
-	trace_create_cpu_file("trace", TRACE_MODE_WRITE, d_cpu,
+	trace_create_cpu_file(idmap, "trace", TRACE_MODE_WRITE, d_cpu,
 				tr, cpu, &tracing_fops);
 
-	trace_create_cpu_file("trace_pipe_raw", TRACE_MODE_READ, d_cpu,
+	trace_create_cpu_file(idmap, "trace_pipe_raw", TRACE_MODE_READ, d_cpu,
 				tr, cpu, &tracing_buffers_fops);
 
-	trace_create_cpu_file("stats", TRACE_MODE_READ, d_cpu,
+	trace_create_cpu_file(idmap, "stats", TRACE_MODE_READ, d_cpu,
 				tr, cpu, &tracing_stats_fops);
 
-	trace_create_cpu_file("buffer_size_kb", TRACE_MODE_READ, d_cpu,
+	trace_create_cpu_file(idmap, "buffer_size_kb", TRACE_MODE_READ, d_cpu,
 				tr, cpu, &tracing_entries_fops);
 
 #ifdef CONFIG_TRACER_SNAPSHOT
-	trace_create_cpu_file("snapshot", TRACE_MODE_WRITE, d_cpu,
+	trace_create_cpu_file(idmap, "snapshot", TRACE_MODE_WRITE, d_cpu,
 				tr, cpu, &snapshot_fops);
 
-	trace_create_cpu_file("snapshot_raw", TRACE_MODE_READ, d_cpu,
+	trace_create_cpu_file(idmap, "snapshot_raw", TRACE_MODE_READ, d_cpu,
 				tr, cpu, &snapshot_raw_fops);
 #endif
 }
@@ -9088,23 +9094,21 @@ static const struct file_operations trace_options_core_fops = {
 	.llseek = generic_file_llseek,
 };
 
-struct dentry *trace_create_file(const char *name,
-				 umode_t mode,
-				 struct dentry *parent,
-				 void *data,
-				 const struct file_operations *fops)
+struct dentry *trace_create_file(struct mnt_idmap *idmap, const char *name,
+				 umode_t mode, struct dentry *parent,
+				 void *data, const struct file_operations *fops)
 {
 	struct dentry *ret;
 
-	ret = tracefs_create_file(name, mode, parent, data, fops);
+	ret = tracefs_create_file(idmap, name, mode, parent, data, fops);
 	if (!ret)
 		pr_warn("Could not create tracefs '%s' entry\n", name);
 
 	return ret;
 }
 
-
-static struct dentry *trace_options_init_dentry(struct trace_array *tr)
+static struct dentry *trace_options_init_dentry(struct mnt_idmap *idmap,
+						struct trace_array *tr)
 {
 	struct dentry *d_tracer;
 
@@ -9115,7 +9119,7 @@ static struct dentry *trace_options_init_dentry(struct trace_array *tr)
 	if (IS_ERR(d_tracer))
 		return NULL;
 
-	tr->options = tracefs_create_dir("options", d_tracer);
+	tr->options = tracefs_create_dir(idmap, "options", d_tracer);
 	if (!tr->options) {
 		pr_warn("Could not create tracefs directory 'options'\n");
 		return NULL;
@@ -9125,14 +9129,14 @@ static struct dentry *trace_options_init_dentry(struct trace_array *tr)
 }
 
 static void
-create_trace_option_file(struct trace_array *tr,
+create_trace_option_file(struct mnt_idmap *idmap, struct trace_array *tr,
 			 struct trace_option_dentry *topt,
 			 struct tracer_flags *flags,
 			 struct tracer_opt *opt)
 {
 	struct dentry *t_options;
 
-	t_options = trace_options_init_dentry(tr);
+	t_options = trace_options_init_dentry(idmap, tr);
 	if (!t_options)
 		return;
 
@@ -9140,13 +9144,14 @@ create_trace_option_file(struct trace_array *tr,
 	topt->opt = opt;
 	topt->tr = tr;
 
-	topt->entry = trace_create_file(opt->name, TRACE_MODE_WRITE,
+	topt->entry = trace_create_file(idmap, opt->name, TRACE_MODE_WRITE,
 					t_options, topt, &trace_options_fops);
 
 }
 
-static void
-create_trace_option_files(struct trace_array *tr, struct tracer *tracer)
+static void create_trace_option_files(struct mnt_idmap *idmap,
+				      struct trace_array *tr,
+				      struct tracer *tracer)
 {
 	struct trace_option_dentry *topts;
 	struct trace_options *tr_topts;
@@ -9198,7 +9203,7 @@ create_trace_option_files(struct trace_array *tr, struct tracer *tracer)
 	tr->nr_topts++;
 
 	for (cnt = 0; opts[cnt].name; cnt++) {
-		create_trace_option_file(tr, &topts[cnt], flags,
+		create_trace_option_file(idmap, tr, &topts[cnt], flags,
 					 &opts[cnt]);
 		MEM_FAIL(topts[cnt].entry == NULL,
 			  "Failed to create trace option: %s",
@@ -9207,34 +9212,34 @@ create_trace_option_files(struct trace_array *tr, struct tracer *tracer)
 }
 
 static struct dentry *
-create_trace_option_core_file(struct trace_array *tr,
+create_trace_option_core_file(struct mnt_idmap *idmap, struct trace_array *tr,
 			      const char *option, long index)
 {
 	struct dentry *t_options;
 
-	t_options = trace_options_init_dentry(tr);
+	t_options = trace_options_init_dentry(idmap, tr);
 	if (!t_options)
 		return NULL;
 
-	return trace_create_file(option, TRACE_MODE_WRITE, t_options,
+	return trace_create_file(idmap, option, TRACE_MODE_WRITE, t_options,
 				 (void *)&tr->trace_flags_index[index],
 				 &trace_options_core_fops);
 }
 
-static void create_trace_options_dir(struct trace_array *tr)
+static void create_trace_options_dir(struct mnt_idmap *idmap, struct trace_array *tr)
 {
 	struct dentry *t_options;
 	bool top_level = tr == &global_trace;
 	int i;
 
-	t_options = trace_options_init_dentry(tr);
+	t_options = trace_options_init_dentry(idmap, tr);
 	if (!t_options)
 		return;
 
 	for (i = 0; trace_options[i]; i++) {
 		if (top_level ||
 		    !((1 << i) & TOP_LEVEL_TRACE_FLAGS))
-			create_trace_option_core_file(tr, trace_options[i], i);
+			create_trace_option_core_file(idmap, tr, trace_options[i], i);
 	}
 }
 
@@ -9342,8 +9347,8 @@ static const struct file_operations buffer_percent_fops = {
 
 static struct dentry *trace_instance_dir;
 
-static void
-init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer);
+static void init_tracer_tracefs(struct mnt_idmap *idmap, struct trace_array *tr,
+				struct dentry *d_tracer);
 
 static int
 allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, int size)
@@ -9426,19 +9431,20 @@ static void init_trace_flags_index(struct trace_array *tr)
 		tr->trace_flags_index[i] = i;
 }
 
-static void __update_tracer_options(struct trace_array *tr)
+static void __update_tracer_options(struct mnt_idmap *idmap,
+				    struct trace_array *tr)
 {
 	struct tracer *t;
 
 	for (t = trace_types; t; t = t->next)
-		add_tracer_options(tr, t);
+		add_tracer_options(idmap, tr, t);
 }
 
 static void update_tracer_options(struct trace_array *tr)
 {
 	mutex_lock(&trace_types_lock);
 	tracer_options_updated = true;
-	__update_tracer_options(tr);
+	__update_tracer_options(&nop_mnt_idmap, tr);
 	mutex_unlock(&trace_types_lock);
 }
 
@@ -9470,27 +9476,28 @@ struct trace_array *trace_array_find_get(const char *instance)
 	return tr;
 }
 
-static int trace_array_create_dir(struct trace_array *tr)
+static int trace_array_create_dir(struct mnt_idmap *idmap, struct trace_array *tr)
 {
 	int ret;
 
-	tr->dir = tracefs_create_dir(tr->name, trace_instance_dir);
+	tr->dir = tracefs_create_dir(idmap, tr->name, trace_instance_dir);
 	if (!tr->dir)
 		return -EINVAL;
 
-	ret = event_trace_add_tracer(tr->dir, tr);
+	ret = event_trace_add_tracer(idmap, tr->dir, tr);
 	if (ret) {
 		tracefs_remove(tr->dir);
 		return ret;
 	}
 
-	init_tracer_tracefs(tr, tr->dir);
-	__update_tracer_options(tr);
+	init_tracer_tracefs(idmap, tr, tr->dir);
+	__update_tracer_options(idmap, tr);
 
 	return ret;
 }
 
-static struct trace_array *trace_array_create(const char *name)
+static struct trace_array *trace_array_create(struct mnt_idmap *idmap,
+					      const char *name)
 {
 	struct trace_array *tr;
 	int ret;
@@ -9539,7 +9546,7 @@ static struct trace_array *trace_array_create(const char *name)
 	init_trace_flags_index(tr);
 
 	if (trace_instance_dir) {
-		ret = trace_array_create_dir(tr);
+		ret = trace_array_create_dir(idmap, tr);
 		if (ret)
 			goto out_free_tr;
 	} else
@@ -9562,7 +9569,7 @@ static struct trace_array *trace_array_create(const char *name)
 	return ERR_PTR(ret);
 }
 
-static int instance_mkdir(const char *name)
+static int instance_mkdir(struct mnt_idmap *idmap, const char *name)
 {
 	struct trace_array *tr;
 	int ret;
@@ -9574,7 +9581,7 @@ static int instance_mkdir(const char *name)
 	if (trace_array_find(name))
 		goto out_unlock;
 
-	tr = trace_array_create(name);
+	tr = trace_array_create(idmap, name);
 
 	ret = PTR_ERR_OR_ZERO(tr);
 
@@ -9612,7 +9619,7 @@ struct trace_array *trace_array_get_by_name(const char *name)
 			goto out_unlock;
 	}
 
-	tr = trace_array_create(name);
+	tr = trace_array_create(&nop_mnt_idmap, name);
 
 	if (IS_ERR(tr))
 		tr = NULL;
@@ -9728,7 +9735,7 @@ static __init void create_trace_instances(struct dentry *d_tracer)
 	list_for_each_entry(tr, &ftrace_trace_arrays, list) {
 		if (!tr->name)
 			continue;
-		if (MEM_FAIL(trace_array_create_dir(tr) < 0,
+		if (MEM_FAIL(trace_array_create_dir(&nop_mnt_idmap, tr) < 0,
 			     "Failed to create instance directory\n"))
 			break;
 	}
@@ -9737,81 +9744,81 @@ static __init void create_trace_instances(struct dentry *d_tracer)
 	mutex_unlock(&event_mutex);
 }
 
-static void
-init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
+static void init_tracer_tracefs(struct mnt_idmap *idmap, struct trace_array *tr,
+				struct dentry *d_tracer)
 {
 	int cpu;
 
-	trace_create_file("available_tracers", TRACE_MODE_READ, d_tracer,
+	trace_create_file(idmap, "available_tracers", TRACE_MODE_READ, d_tracer,
 			tr, &show_traces_fops);
 
-	trace_create_file("current_tracer", TRACE_MODE_WRITE, d_tracer,
+	trace_create_file(idmap, "current_tracer", TRACE_MODE_WRITE, d_tracer,
 			tr, &set_tracer_fops);
 
-	trace_create_file("tracing_cpumask", TRACE_MODE_WRITE, d_tracer,
+	trace_create_file(idmap, "tracing_cpumask", TRACE_MODE_WRITE, d_tracer,
 			  tr, &tracing_cpumask_fops);
 
-	trace_create_file("trace_options", TRACE_MODE_WRITE, d_tracer,
+	trace_create_file(idmap, "trace_options", TRACE_MODE_WRITE, d_tracer,
 			  tr, &tracing_iter_fops);
 
-	trace_create_file("trace", TRACE_MODE_WRITE, d_tracer,
+	trace_create_file(idmap, "trace", TRACE_MODE_WRITE, d_tracer,
 			  tr, &tracing_fops);
 
-	trace_create_file("trace_pipe", TRACE_MODE_READ, d_tracer,
+	trace_create_file(idmap, "trace_pipe", TRACE_MODE_READ, d_tracer,
 			  tr, &tracing_pipe_fops);
 
-	trace_create_file("buffer_size_kb", TRACE_MODE_WRITE, d_tracer,
+	trace_create_file(idmap, "buffer_size_kb", TRACE_MODE_WRITE, d_tracer,
 			  tr, &tracing_entries_fops);
 
-	trace_create_file("buffer_total_size_kb", TRACE_MODE_READ, d_tracer,
+	trace_create_file(idmap, "buffer_total_size_kb", TRACE_MODE_READ, d_tracer,
 			  tr, &tracing_total_entries_fops);
 
-	trace_create_file("free_buffer", 0200, d_tracer,
+	trace_create_file(idmap, "free_buffer", 0200, d_tracer,
 			  tr, &tracing_free_buffer_fops);
 
-	trace_create_file("trace_marker", 0220, d_tracer,
+	trace_create_file(idmap, "trace_marker", 0220, d_tracer,
 			  tr, &tracing_mark_fops);
 
 	tr->trace_marker_file = __find_event_file(tr, "ftrace", "print");
 
-	trace_create_file("trace_marker_raw", 0220, d_tracer,
+	trace_create_file(idmap, "trace_marker_raw", 0220, d_tracer,
 			  tr, &tracing_mark_raw_fops);
 
-	trace_create_file("trace_clock", TRACE_MODE_WRITE, d_tracer, tr,
+	trace_create_file(idmap, "trace_clock", TRACE_MODE_WRITE, d_tracer, tr,
 			  &trace_clock_fops);
 
-	trace_create_file("tracing_on", TRACE_MODE_WRITE, d_tracer,
+	trace_create_file(idmap, "tracing_on", TRACE_MODE_WRITE, d_tracer,
 			  tr, &rb_simple_fops);
 
-	trace_create_file("timestamp_mode", TRACE_MODE_READ, d_tracer, tr,
+	trace_create_file(idmap, "timestamp_mode", TRACE_MODE_READ, d_tracer, tr,
 			  &trace_time_stamp_mode_fops);
 
 	tr->buffer_percent = 50;
 
-	trace_create_file("buffer_percent", TRACE_MODE_WRITE, d_tracer,
+	trace_create_file(idmap, "buffer_percent", TRACE_MODE_WRITE, d_tracer,
 			tr, &buffer_percent_fops);
 
-	create_trace_options_dir(tr);
+	create_trace_options_dir(idmap, tr);
 
 #ifdef CONFIG_TRACER_MAX_TRACE
-	trace_create_maxlat_file(tr, d_tracer);
+	trace_create_maxlat_file(idmap, tr, d_tracer);
 #endif
 
-	if (ftrace_create_function_files(tr, d_tracer))
+	if (ftrace_create_function_files(idmap, tr, d_tracer))
 		MEM_FAIL(1, "Could not allocate function filter files");
 
 #ifdef CONFIG_TRACER_SNAPSHOT
-	trace_create_file("snapshot", TRACE_MODE_WRITE, d_tracer,
+	trace_create_file(idmap, "snapshot", TRACE_MODE_WRITE, d_tracer,
 			  tr, &snapshot_fops);
 #endif
 
-	trace_create_file("error_log", TRACE_MODE_WRITE, d_tracer,
+	trace_create_file(idmap, "error_log", TRACE_MODE_WRITE, d_tracer,
 			  tr, &tracing_err_log_fops);
 
 	for_each_tracing_cpu(cpu)
-		tracing_init_tracefs_percpu(tr, cpu);
+		tracing_init_tracefs_percpu(idmap, tr, cpu);
 
-	ftrace_init_tracefs(tr, d_tracer);
+	ftrace_init_tracefs(idmap, tr, d_tracer);
 }
 
 static struct vfsmount *trace_automount(struct dentry *mntpt, void *ingore)
@@ -9991,32 +9998,32 @@ static __init void tracer_init_tracefs_work_func(struct work_struct *work)
 
 	event_trace_init();
 
-	init_tracer_tracefs(&global_trace, NULL);
+	init_tracer_tracefs(&nop_mnt_idmap, &global_trace, NULL);
 	ftrace_init_tracefs_toplevel(&global_trace, NULL);
 
-	trace_create_file("tracing_thresh", TRACE_MODE_WRITE, NULL,
+	trace_create_file(&nop_mnt_idmap, "tracing_thresh", TRACE_MODE_WRITE, NULL,
 			&global_trace, &tracing_thresh_fops);
 
-	trace_create_file("README", TRACE_MODE_READ, NULL,
+	trace_create_file(&nop_mnt_idmap, "README", TRACE_MODE_READ, NULL,
 			NULL, &tracing_readme_fops);
 
-	trace_create_file("saved_cmdlines", TRACE_MODE_READ, NULL,
+	trace_create_file(&nop_mnt_idmap, "saved_cmdlines", TRACE_MODE_READ, NULL,
 			NULL, &tracing_saved_cmdlines_fops);
 
-	trace_create_file("saved_cmdlines_size", TRACE_MODE_WRITE, NULL,
+	trace_create_file(&nop_mnt_idmap, "saved_cmdlines_size", TRACE_MODE_WRITE, NULL,
 			  NULL, &tracing_saved_cmdlines_size_fops);
 
-	trace_create_file("saved_tgids", TRACE_MODE_READ, NULL,
+	trace_create_file(&nop_mnt_idmap, "saved_tgids", TRACE_MODE_READ, NULL,
 			NULL, &tracing_saved_tgids_fops);
 
-	trace_create_eval_file(NULL);
+	trace_create_eval_file(&nop_mnt_idmap, NULL);
 
 #ifdef CONFIG_MODULES
 	register_module_notifier(&trace_module_nb);
 #endif
 
 #ifdef CONFIG_DYNAMIC_FTRACE
-	trace_create_file("dyn_ftrace_total_info", TRACE_MODE_READ, NULL,
+	trace_create_file(&nop_mnt_idmap, "dyn_ftrace_total_info", TRACE_MODE_READ, NULL,
 			NULL, &tracing_dyn_info_fops);
 #endif
 
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 0489e72c8169..b8de94e4387d 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -622,9 +622,8 @@ bool tracing_is_disabled(void);
 bool tracer_tracing_is_on(struct trace_array *tr);
 void tracer_tracing_on(struct trace_array *tr);
 void tracer_tracing_off(struct trace_array *tr);
-struct dentry *trace_create_file(const char *name,
-				 umode_t mode,
-				 struct dentry *parent,
+struct dentry *trace_create_file(struct mnt_idmap *idmap, const char *name,
+				 umode_t mode, struct dentry *parent,
 				 void *data,
 				 const struct file_operations *fops);
 
@@ -1025,7 +1024,7 @@ static inline int ftrace_trace_task(struct trace_array *tr)
 		FTRACE_PID_IGNORE;
 }
 extern int ftrace_is_dead(void);
-int ftrace_create_function_files(struct trace_array *tr,
+int ftrace_create_function_files(struct mnt_idmap *idmap, struct trace_array *tr,
 				 struct dentry *parent);
 void ftrace_destroy_function_files(struct trace_array *tr);
 int ftrace_allocate_ftrace_ops(struct trace_array *tr);
@@ -1033,7 +1032,8 @@ void ftrace_free_ftrace_ops(struct trace_array *tr);
 void ftrace_init_global_array_ops(struct trace_array *tr);
 void ftrace_init_array_ops(struct trace_array *tr, ftrace_func_t func);
 void ftrace_reset_array_ops(struct trace_array *tr);
-void ftrace_init_tracefs(struct trace_array *tr, struct dentry *d_tracer);
+void ftrace_init_tracefs(struct mnt_idmap *idmap, struct trace_array *tr,
+			 struct dentry *d_tracer);
 void ftrace_init_tracefs_toplevel(struct trace_array *tr,
 				  struct dentry *d_tracer);
 void ftrace_clear_pids(struct trace_array *tr);
@@ -1046,7 +1046,7 @@ static inline int ftrace_trace_task(struct trace_array *tr)
 }
 static inline int ftrace_is_dead(void) { return 0; }
 static inline int
-ftrace_create_function_files(struct trace_array *tr,
+ftrace_create_function_files(struct mnt_idmap *idmap, struct trace_array *tr,
 			     struct dentry *parent)
 {
 	return 0;
@@ -1060,7 +1060,7 @@ static inline void ftrace_destroy_function_files(struct trace_array *tr) { }
 static inline __init void
 ftrace_init_global_array_ops(struct trace_array *tr) { }
 static inline void ftrace_reset_array_ops(struct trace_array *tr) { }
-static inline void ftrace_init_tracefs(struct trace_array *tr, struct dentry *d) { }
+static inline void ftrace_init_tracefs(struct mnt_idmap *idmap, struct trace_array *tr, struct dentry *d) { }
 static inline void ftrace_init_tracefs_toplevel(struct trace_array *tr, struct dentry *d) { }
 static inline void ftrace_clear_pids(struct trace_array *tr) { }
 static inline int init_function_trace(void) { return 0; }
@@ -1114,7 +1114,7 @@ extern void clear_ftrace_function_probes(struct trace_array *tr);
 int register_ftrace_command(struct ftrace_func_command *cmd);
 int unregister_ftrace_command(struct ftrace_func_command *cmd);
 
-void ftrace_create_filter_files(struct ftrace_ops *ops,
+void ftrace_create_filter_files(struct mnt_idmap *idmap, struct ftrace_ops *ops,
 				struct dentry *parent);
 void ftrace_destroy_filter_files(struct ftrace_ops *ops);
 
@@ -1141,7 +1141,7 @@ static inline void clear_ftrace_function_probes(struct trace_array *tr)
  * The ops parameter passed in is usually undefined.
  * This must be a macro.
  */
-#define ftrace_create_filter_files(ops, parent) do { } while (0)
+#define ftrace_create_filter_files(idmap, ops, parent) do { } while (0)
 #define ftrace_destroy_filter_files(ops) do { } while (0)
 #endif /* CONFIG_FUNCTION_TRACER && CONFIG_DYNAMIC_FTRACE */
 
@@ -1541,7 +1541,9 @@ extern void trace_event_enable_tgid_record(bool enable);
 
 extern int event_trace_init(void);
 extern int init_events(void);
-extern int event_trace_add_tracer(struct dentry *parent, struct trace_array *tr);
+extern int event_trace_add_tracer(struct mnt_idmap *idmap,
+				  struct dentry *parent,
+				  struct trace_array *tr);
 extern int event_trace_del_tracer(struct trace_array *tr);
 extern void __trace_early_add_events(struct trace_array *tr);
 
diff --git a/kernel/trace/trace_dynevent.c b/kernel/trace/trace_dynevent.c
index 4376887e0d8a..c0caf4631f82 100644
--- a/kernel/trace/trace_dynevent.c
+++ b/kernel/trace/trace_dynevent.c
@@ -263,7 +263,7 @@ static __init int init_dynamic_event(void)
 	if (ret)
 		return 0;
 
-	trace_create_file("dynamic_events", TRACE_MODE_WRITE, NULL,
+	trace_create_file(&nop_mnt_idmap, "dynamic_events", TRACE_MODE_WRITE, NULL,
 			  NULL, &dynamic_events_ops);
 
 	return 0;
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index f29e815ca5b2..b41e0f26cee4 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -3768,7 +3768,8 @@ static int events_callback(const char *name, umode_t *mode, void **data,
 
 /* Expects to have event_mutex held when called */
 static int
-create_event_toplevel_files(struct dentry *parent, struct trace_array *tr)
+create_event_toplevel_files(struct mnt_idmap *idmap, struct dentry *parent,
+			    struct trace_array *tr)
 {
 	struct eventfs_inode *e_events;
 	struct dentry *entry;
@@ -3788,15 +3789,15 @@ create_event_toplevel_files(struct dentry *parent, struct trace_array *tr)
 		},
 	};
 
-	entry = trace_create_file("set_event", TRACE_MODE_WRITE, parent,
+	entry = trace_create_file(idmap, "set_event", TRACE_MODE_WRITE, parent,
 				  tr, &ftrace_set_event_fops);
 	if (!entry)
 		return -ENOMEM;
 
 	nr_entries = ARRAY_SIZE(events_entries);
 
-	e_events = eventfs_create_events_dir("events", parent, events_entries,
-					     nr_entries, tr);
+	e_events = eventfs_create_events_dir(idmap, "events", parent,
+					     events_entries, nr_entries, tr);
 	if (IS_ERR(e_events)) {
 		pr_warn("Could not create tracefs 'events' directory\n");
 		return -ENOMEM;
@@ -3804,12 +3805,11 @@ create_event_toplevel_files(struct dentry *parent, struct trace_array *tr)
 
 	/* There are not as crucial, just warn if they are not created */
 
-	trace_create_file("set_event_pid", TRACE_MODE_WRITE, parent,
-			  tr, &ftrace_set_event_pid_fops);
+	trace_create_file(idmap, "set_event_pid", TRACE_MODE_WRITE, parent, tr,
+			  &ftrace_set_event_pid_fops);
 
-	trace_create_file("set_event_notrace_pid",
-			  TRACE_MODE_WRITE, parent, tr,
-			  &ftrace_set_event_notrace_pid_fops);
+	trace_create_file(idmap, "set_event_notrace_pid", TRACE_MODE_WRITE,
+			  parent, tr, &ftrace_set_event_notrace_pid_fops);
 
 	tr->event_dir = e_events;
 
@@ -3829,13 +3829,14 @@ create_event_toplevel_files(struct dentry *parent, struct trace_array *tr)
  *
  * Must be called with event_mutex held.
  */
-int event_trace_add_tracer(struct dentry *parent, struct trace_array *tr)
+int event_trace_add_tracer(struct mnt_idmap *idmap, struct dentry *parent,
+			   struct trace_array *tr)
 {
 	int ret;
 
 	lockdep_assert_held(&event_mutex);
 
-	ret = create_event_toplevel_files(parent, tr);
+	ret = create_event_toplevel_files(idmap, parent, tr);
 	if (ret)
 		goto out;
 
@@ -3862,7 +3863,7 @@ early_event_add_tracer(struct dentry *parent, struct trace_array *tr)
 
 	mutex_lock(&event_mutex);
 
-	ret = create_event_toplevel_files(parent, tr);
+	ret = create_event_toplevel_files(&nop_mnt_idmap, parent, tr);
 	if (ret)
 		goto out_unlock;
 
@@ -4021,7 +4022,7 @@ __init int event_trace_init(void)
 	if (!tr)
 		return -ENODEV;
 
-	trace_create_file("available_events", TRACE_MODE_READ,
+	trace_create_file(&nop_mnt_idmap, "available_events", TRACE_MODE_READ,
 			  NULL, tr, &ftrace_avail_fops);
 
 	ret = early_event_add_tracer(NULL, tr);
diff --git a/kernel/trace/trace_events_synth.c b/kernel/trace/trace_events_synth.c
index e7af286af4f1..f8fbb0f4ef87 100644
--- a/kernel/trace/trace_events_synth.c
+++ b/kernel/trace/trace_events_synth.c
@@ -2318,8 +2318,9 @@ static __init int trace_events_synth_init(void)
 	if (err)
 		goto err;
 
-	entry = tracefs_create_file("synthetic_events", TRACE_MODE_WRITE,
-				    NULL, NULL, &synth_events_fops);
+	entry = tracefs_create_file(&nop_mnt_idmap, "synthetic_events",
+				    TRACE_MODE_WRITE, NULL, NULL,
+				    &synth_events_fops);
 	if (!entry) {
 		err = -ENODEV;
 		goto err;
diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index 9f1bfbe105e8..266428151b60 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -77,8 +77,8 @@ void ftrace_free_ftrace_ops(struct trace_array *tr)
 	tr->ops = NULL;
 }
 
-int ftrace_create_function_files(struct trace_array *tr,
-				 struct dentry *parent)
+int ftrace_create_function_files(struct mnt_idmap *idmap,
+				 struct trace_array *tr, struct dentry *parent)
 {
 	/*
 	 * The top level array uses the "global_ops", and the files are
@@ -90,7 +90,7 @@ int ftrace_create_function_files(struct trace_array *tr,
 	if (!tr->ops)
 		return -EINVAL;
 
-	ftrace_create_filter_files(tr->ops, parent);
+	ftrace_create_filter_files(idmap, tr->ops, parent);
 
 	return 0;
 }
diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
index c35fbaab2a47..e7b4fa238eaf 100644
--- a/kernel/trace/trace_functions_graph.c
+++ b/kernel/trace/trace_functions_graph.c
@@ -1413,8 +1413,8 @@ static __init int init_graph_tracefs(void)
 	if (ret)
 		return 0;
 
-	trace_create_file("max_graph_depth", TRACE_MODE_WRITE, NULL,
-			  NULL, &graph_depth_fops);
+	trace_create_file(&nop_mnt_idmap, "max_graph_depth", TRACE_MODE_WRITE,
+			  NULL, NULL, &graph_depth_fops);
 
 	return 0;
 }
diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
index b791524a6536..7020ee831929 100644
--- a/kernel/trace/trace_hwlat.c
+++ b/kernel/trace/trace_hwlat.c
@@ -775,25 +775,25 @@ static int init_tracefs(void)
 	if (ret)
 		return -ENOMEM;
 
-	top_dir = tracefs_create_dir("hwlat_detector", NULL);
+	top_dir = tracefs_create_dir(&nop_mnt_idmap, "hwlat_detector", NULL);
 	if (!top_dir)
 		return -ENOMEM;
 
-	hwlat_sample_window = tracefs_create_file("window", TRACE_MODE_WRITE,
+	hwlat_sample_window = tracefs_create_file(&nop_mnt_idmap, "window", TRACE_MODE_WRITE,
 						  top_dir,
 						  &hwlat_window,
 						  &trace_min_max_fops);
 	if (!hwlat_sample_window)
 		goto err;
 
-	hwlat_sample_width = tracefs_create_file("width", TRACE_MODE_WRITE,
+	hwlat_sample_width = tracefs_create_file(&nop_mnt_idmap, "width", TRACE_MODE_WRITE,
 						 top_dir,
 						 &hwlat_width,
 						 &trace_min_max_fops);
 	if (!hwlat_sample_width)
 		goto err;
 
-	hwlat_thread_mode = trace_create_file("mode", TRACE_MODE_WRITE,
+	hwlat_thread_mode = trace_create_file(&nop_mnt_idmap, "mode", TRACE_MODE_WRITE,
 					      top_dir,
 					      NULL,
 					      &thread_mode_fops);
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 52f8b537dd0a..c74f0524b5d4 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1914,11 +1914,11 @@ static __init int init_kprobe_trace(void)
 		return 0;
 
 	/* Event list interface */
-	trace_create_file("kprobe_events", TRACE_MODE_WRITE,
+	trace_create_file(&nop_mnt_idmap, "kprobe_events", TRACE_MODE_WRITE,
 			  NULL, NULL, &kprobe_events_ops);
 
 	/* Profile interface */
-	trace_create_file("kprobe_profile", TRACE_MODE_READ,
+	trace_create_file(&nop_mnt_idmap, "kprobe_profile", TRACE_MODE_READ,
 			  NULL, NULL, &kprobe_profile_ops);
 
 	setup_boot_kprobe_events();
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index bd0d01d00fb9..a7d073c2d97c 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -2692,7 +2692,8 @@ static int init_timerlat_stack_tracefs(struct dentry *top_dir)
 {
 	struct dentry *tmp;
 
-	tmp = tracefs_create_file("print_stack", TRACE_MODE_WRITE, top_dir,
+	tmp = tracefs_create_file(&nop_mnt_idmap, "print_stack",
+				  TRACE_MODE_WRITE, top_dir,
 				  &osnoise_print_stack, &trace_min_max_fops);
 	if (!tmp)
 		return -ENOMEM;
@@ -2720,18 +2721,19 @@ static int osnoise_create_cpu_timerlat_fd(struct dentry *top_dir)
 	 * Because osnoise/timerlat have a single workload, having
 	 * multiple files like these are wast of memory.
 	 */
-	per_cpu = tracefs_create_dir("per_cpu", top_dir);
+	per_cpu = tracefs_create_dir(idmap, "per_cpu", top_dir);
 	if (!per_cpu)
 		return -ENOMEM;
 
 	for_each_possible_cpu(cpu) {
 		snprintf(cpu_str, 30, "cpu%ld", cpu);
-		cpu_dir = tracefs_create_dir(cpu_str, per_cpu);
+		cpu_dir = tracefs_create_dir(idmap, cpu_str, per_cpu);
 		if (!cpu_dir)
 			goto out_clean;
 
-		timerlat_fd = trace_create_file("timerlat_fd", TRACE_MODE_READ,
-						cpu_dir, NULL, &timerlat_fd_fops);
+		timerlat_fd = trace_create_file(&nop_mnt_idmap, "timerlat_fd",
+						TRACE_MODE_READ, cpu_dir, NULL,
+						&timerlat_fd_fops);
 		if (!timerlat_fd)
 			goto out_clean;
 
@@ -2754,8 +2756,9 @@ static int init_timerlat_tracefs(struct dentry *top_dir)
 	struct dentry *tmp;
 	int retval;
 
-	tmp = tracefs_create_file("timerlat_period_us", TRACE_MODE_WRITE, top_dir,
-				  &timerlat_period, &trace_min_max_fops);
+	tmp = tracefs_create_file(&nop_mnt_idmap, "timerlat_period_us",
+				  TRACE_MODE_WRITE, top_dir, &timerlat_period,
+				  &trace_min_max_fops);
 	if (!tmp)
 		return -ENOMEM;
 
@@ -2789,36 +2792,43 @@ static int init_tracefs(void)
 	if (ret)
 		return -ENOMEM;
 
-	top_dir = tracefs_create_dir("osnoise", NULL);
+	top_dir = tracefs_create_dir(&nop_mnt_idmap, "osnoise", NULL);
 	if (!top_dir)
 		return 0;
 
-	tmp = tracefs_create_file("period_us", TRACE_MODE_WRITE, top_dir,
-				  &osnoise_period, &trace_min_max_fops);
+	tmp = tracefs_create_file(&nop_mnt_idmap, "period_us", TRACE_MODE_WRITE,
+				  top_dir, &osnoise_period,
+				  &trace_min_max_fops);
 	if (!tmp)
 		goto err;
 
-	tmp = tracefs_create_file("runtime_us", TRACE_MODE_WRITE, top_dir,
-				  &osnoise_runtime, &trace_min_max_fops);
+	tmp = tracefs_create_file(&nop_mnt_idmap, "runtime_us",
+				  TRACE_MODE_WRITE, top_dir, &osnoise_runtime,
+				  &trace_min_max_fops);
 	if (!tmp)
 		goto err;
 
-	tmp = tracefs_create_file("stop_tracing_us", TRACE_MODE_WRITE, top_dir,
-				  &osnoise_stop_tracing_in, &trace_min_max_fops);
+	tmp = tracefs_create_file(&nop_mnt_idmap, "stop_tracing_us",
+				  TRACE_MODE_WRITE, top_dir,
+				  &osnoise_stop_tracing_in,
+				  &trace_min_max_fops);
 	if (!tmp)
 		goto err;
 
-	tmp = tracefs_create_file("stop_tracing_total_us", TRACE_MODE_WRITE, top_dir,
-				  &osnoise_stop_tracing_total, &trace_min_max_fops);
+	tmp = tracefs_create_file(&nop_mnt_idmap, "stop_tracing_total_us",
+				  TRACE_MODE_WRITE, top_dir,
+				  &osnoise_stop_tracing_total,
+				  &trace_min_max_fops);
 	if (!tmp)
 		goto err;
 
-	tmp = trace_create_file("cpus", TRACE_MODE_WRITE, top_dir, NULL, &cpus_fops);
+	tmp = trace_create_file(&nop_mnt_idmap, "cpus", TRACE_MODE_WRITE,
+				top_dir, NULL, &cpus_fops);
 	if (!tmp)
 		goto err;
 
-	tmp = trace_create_file("options", TRACE_MODE_WRITE, top_dir, NULL,
-				&osnoise_options_fops);
+	tmp = trace_create_file(&nop_mnt_idmap, "options", TRACE_MODE_WRITE,
+				top_dir, NULL, &osnoise_options_fops);
 	if (!tmp)
 		goto err;
 
diff --git a/kernel/trace/trace_printk.c b/kernel/trace/trace_printk.c
index 29f6e95439b6..1a5cb6aa495e 100644
--- a/kernel/trace/trace_printk.c
+++ b/kernel/trace/trace_printk.c
@@ -384,8 +384,8 @@ static __init int init_trace_printk_function_export(void)
 	if (ret)
 		return 0;
 
-	trace_create_file("printk_formats", TRACE_MODE_READ, NULL,
-				    NULL, &ftrace_formats_fops);
+	trace_create_file(&nop_mnt_idmap, "printk_formats", TRACE_MODE_READ,
+			  NULL, NULL, &ftrace_formats_fops);
 
 	return 0;
 }
diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
index 5a48dba912ea..06e63086a081 100644
--- a/kernel/trace/trace_stack.c
+++ b/kernel/trace/trace_stack.c
@@ -559,15 +559,16 @@ static __init int stack_trace_init(void)
 	if (ret)
 		return 0;
 
-	trace_create_file("stack_max_size", TRACE_MODE_WRITE, NULL,
-			&stack_trace_max_size, &stack_max_size_fops);
+	trace_create_file(&nop_mnt_idmap, "stack_max_size", TRACE_MODE_WRITE,
+			  NULL, &stack_trace_max_size, &stack_max_size_fops);
 
-	trace_create_file("stack_trace", TRACE_MODE_READ, NULL,
+	trace_create_file(&nop_mnt_idmap, "stack_trace", TRACE_MODE_READ, NULL,
 			NULL, &stack_trace_fops);
 
 #ifdef CONFIG_DYNAMIC_FTRACE
-	trace_create_file("stack_trace_filter", TRACE_MODE_WRITE, NULL,
-			  &trace_ops, &stack_trace_filter_fops);
+	trace_create_file(&nop_mnt_idmap, "stack_trace_filter",
+			  TRACE_MODE_WRITE, NULL, &trace_ops,
+			  &stack_trace_filter_fops);
 #endif
 
 	if (stack_trace_filter_buf[0])
diff --git a/kernel/trace/trace_stat.c b/kernel/trace/trace_stat.c
index bb247beec447..ca6e322468aa 100644
--- a/kernel/trace/trace_stat.c
+++ b/kernel/trace/trace_stat.c
@@ -282,7 +282,7 @@ static int tracing_stat_init(void)
 	if (ret)
 		return -ENODEV;
 
-	stat_dir = tracefs_create_dir("trace_stat", NULL);
+	stat_dir = tracefs_create_dir(&nop_mnt_idmap, "trace_stat", NULL);
 	if (!stat_dir) {
 		pr_warn("Could not create tracefs 'trace_stat' entry\n");
 		return -ENOMEM;
@@ -297,8 +297,8 @@ static int init_stat_file(struct stat_session *session)
 	if (!stat_dir && (ret = tracing_stat_init()))
 		return ret;
 
-	session->file = tracefs_create_file(session->ts->name, TRACE_MODE_WRITE,
-					    stat_dir, session,
+	session->file = tracefs_create_file(&nop_mnt_idmap, session->ts->name,
+					    TRACE_MODE_WRITE, stat_dir, session,
 					    &tracing_stat_fops);
 	if (!session->file)
 		return -ENOMEM;
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 99c051de412a..060160c63f43 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1654,10 +1654,10 @@ static __init int init_uprobe_trace(void)
 	if (ret)
 		return 0;
 
-	trace_create_file("uprobe_events", TRACE_MODE_WRITE, NULL,
+	trace_create_file(&nop_mnt_idmap, "uprobe_events", TRACE_MODE_WRITE, NULL,
 				    NULL, &uprobe_events_ops);
 	/* Profile interface */
-	trace_create_file("uprobe_profile", TRACE_MODE_READ, NULL,
+	trace_create_file(&nop_mnt_idmap, "uprobe_profile", TRACE_MODE_READ, NULL,
 				    NULL, &uprobe_profile_ops);
 	return 0;
 }
Christian Brauner Jan. 7, 2024, 5:42 p.m. UTC | #19
On Sun, Jan 07, 2024 at 01:42:39PM +0100, Christian Brauner wrote:
> > > So tracefs supports remounting with different uid/gid mount options and
> > > then actually wades through _all_ of the inodes and changes their
> > > ownership internally? What's the use-case for this? Containers?
> > 
> > No, in fact tracing doesn't work well with containers as tracing is global
> > to the entire machine. It can work with privileged containers though.
> 
> At least the tracefs interface is easily supportable within a delegation
> model. IOW, you have a privileged process that delegates relevant
> portions to a container via idmapped mounts _without_ doing the insane thing
> and making it mountable by a container aka the fs-to-CVE pipeline.
> 
> > 
> > The reason for this is because tracefs was based off of debugfs where the
> > files and directores are created at boot up and mounted later. The reason
> > to do this was to allow users to mount with gid=GID to allow a given group
> > to have access to tracing. Without this update, tracefs would ignore it
> > like debugfs and proc does today.
> > 
> > I think its time I explain the purpose of tracefs and how it came to be.
> > 
> > The tracing system required a way to control tracing and read the traces.
> > It could have just used a new system like perf (although
> > /sys/kernel/debug/tracing predates perf), where it created a single ioctl()
> > like system call do do everything.
> > 
> > As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev
> > tracer, which both have an embedded background, I chose an interface that
> > could work with just an unmodified version of busybox. That is, I wanted it
> > to work with just cat and echo.
> > 
> > The main difference with tracefs compared to other file systems is that it
> > is a control interface, where writes happen as much as reads. The data read
> > is controlled. The closest thing I can think of is how cgroups work.
> > 
> > As tracing is a privileged operation, but something that could be changed
> > to allow a group to have access to, I wanted to make it easy for an admin
> > to decide who gets to do what at boot up via the /etc/fstab file.
> 
> Yeah, ok. I think you could achieve the same thing via idmapped mounts. You
> just need to swap out the mnt on /sys/kernel/tracing with an idmapped mount.
> 
> mount(8) should just give you the ability to specify "map the ids I explicitly
> want to remap to something else and for the rest use the identity mapping". I
> wanted that for other reasons anyway.
> 
> So in one of the next versions of mount(8) you can then do (where --beneath
> means place the mount beneath the current one and --replace is
> self-explanatory):
> 
> sudo mount --beneath -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing
> sudo umount /sys/kernel/tracing
> 
> or as a shortcut provided by mount(8):
> 
> sudo mount --replace -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing 
> 
> In both cases you replace the mount without unmounting tracefs.
> 
> I can illustrate this right now though:
> 
> user1@localhost:~$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1000:1' /sys/kernel/tracing/ /mnt/
> 
> # This is a tool I wrote for testing the patchset I wrote back then.
> user1@localhost:~/data/move-mount-beneath$ sudo ./move-mount --beneath --detached /mnt /sys/kernel/tracing
> Mounting beneath top mount
> Creating anonymous mount
> Attaching mount /mnt -> /sys/kernel/tracing
> Creating single detached mount
> 
> user1@localhost:~/data/move-mount-beneath$
> 
> # Now there's two mounts stacked on top of each other.
> user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing
> | `-/sys/kernel/tracing        tracefs        tracefs     rw,nosuid,nodev,noexec,relatime,idmapped
> |   `-/sys/kernel/tracing      tracefs        tracefs     rw,nosuid,nodev,noexec,relatime
> 
> user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| head
> total 0
> drwx------  6 root root 0 Jan  7 13:33 .
> drwxr-xr-x 16 root root 0 Jan  7 13:33 ..
> -r--r-----  1 root root 0 Jan  7 13:33 README
> -r--r-----  1 root root 0 Jan  7 13:33 available_events
> -r--r-----  1 root root 0 Jan  7 13:33 available_filter_functions
> -r--r-----  1 root root 0 Jan  7 13:33 available_filter_functions_addrs
> -r--r-----  1 root root 0 Jan  7 13:33 available_tracers
> -rw-r-----  1 root root 0 Jan  7 13:33 buffer_percent
> -rw-r-----  1 root root 0 Jan  7 13:33 buffer_size_kb
> 
> # Reveal updated mount
> user1@localhost:~/data/move-mount-beneath$ sudo umount /sys/kernel/tracing
> 
> user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing
> | `-/sys/kernel/tracing        tracefs        tracefs     rw,nosuid,nodev,noexec,relatime,idmapped
> 
> user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| head
> total 0
> drwx------  6 user1 user1 0 Jan  7 13:33 .
> drwxr-xr-x 16 root  root  0 Jan  7 13:33 ..
> -r--r-----  1 user1 user1 0 Jan  7 13:33 README
> -r--r-----  1 user1 user1 0 Jan  7 13:33 available_events
> -r--r-----  1 user1 user1 0 Jan  7 13:33 available_filter_functions
> -r--r-----  1 user1 user1 0 Jan  7 13:33 available_filter_functions_addrs
> -r--r-----  1 user1 user1 0 Jan  7 13:33 available_tracers
> -rw-r-----  1 user1 user1 0 Jan  7 13:33 buffer_percent
> -rw-r-----  1 user1 user1 0 Jan  7 13:33 buffer_size_kb
> 
> sudo umount -l /sys/kernel/tracing
> 
> and reveal the new mount with updated permissions and at no point in time will
> you have had to unmount tracefs itself. No chown needed, no remount needed that
> has to touch all inodes.
> 
> I did perf numbers for this when I implemented this and there's no meaningful
> perf impact for anything below 5 mappings. which covers 80% of the use-cases.
> 
> I mean, there are crazy people out there that do have 30 mappings, maybe. And
> maybe there's 3 users that go into the hundreds (340 is maximum so struct idmap
> still fits into cacheline) because of some LDAP crap or something.
> 
> See an ext4 filesystem to open/create 1,000,000 files. Then I looped through
> all of the files calling fstat() on each of them 1000 times and calculated the
> mean fstat() time for a single file.
> 
> |   # MAPPINGS | PATCH-NEW |
> |--------------|-----------|
> |   0 mappings |   158 ns  |
> |   1 mappings |   157 ns  |
> |   2 mappings |   158 ns  |
> |   3 mappings |   161 ns  |
> |   5 mappings |   165 ns  |
> |  10 mappings |   199 ns  |
> |  50 mappings |   218 ns  |
> | 100 mappings |   229 ns  |
> | 200 mappings |   239 ns  |
> | 300 mappings |   240 ns  |
> | 340 mappings |   248 ns  |
> 
> > 
> > > 
> > > Aside from optimizing this and the special semantics for this eventfs
> > > stuff that you really should think twice of doing, here's one idea for
> > > an extension that might alleviate some of the pain:
> > > 
> > > If you need flexible dynamic ownership change to e.g., be able to
> > > delegate (all, a directory, a single file of) tracefs to
> > > unprivileged/containers/whatever then you might want to consider
> > > supporting idmapped mounts for tracefs. Because then you can do stuff
> > > like:
> > 
> > I'm a novice here and have no idea on how id maps work ;-)
> 
> It's no magic and you don't even need to care about it. If you're on
> util-linux 2.39 and any kernel post 5.12 the -o X-mount.idmap option
> does all the details for you.
> 
> Though I still want an api where you can just pass the idmappings directly to
> mount_setattr(). That's another topic though.
> 
> > have. Now, can we do that and still keep the dynamic creation of inodes and
> > dentries?
> 
> Yes.
> 
> > Does this require having more than one dentry per inode?
> 
> No. It's just a topological change. The same subtree exposed at different
> locations with the ability of exposing it with different permissions (without
> any persistent filesystem-level changes).
> 
> So, I tried to do an exploratory patch even though I promised myself not
> to do it. But hey...
> 
> Some notes:
> 
> * Permission handling for idmapped mounts is done completely in the
>   VFS. That's the case for all filesytems that don't have a custom
>   ->permission() handler. So there's nothing to do for us here.
> 
> * Idmapped mount handling for ->getattr() is done completely by the VFS
>   if the filesystem doesn't have a custom ->getattr() handler. So we're
>   done here.
> 
> * Tracefs doesn't support attribute changes via ->setattr() (chown(),
>   chmod etc.). So there's nothing to here.
> 
> * Eventfs does support attribute changes via ->setattr(). But it relies
>   on simple_setattr() which is already idmapped mount aware. So there's
>   nothing for us to do.
> 
> * Ownership is inherited from the parent inode (tracefs) or optionally
>   from stashed ownership information (eventfs). That means the idmapping
>   is irrelevant here. It's similar to the "inherit gid from parent
>   directory" logic we have in some circumstances. TL;DR nothing to do
>   here as well.
> 
> * Tracefs supports the creation of instances from userspace via mkdir.
>   For example,
> 
> 	mkdir /sys/kernel/tracing/instances/foo
> 
>   And here the idmapping is relevant so we need to make the helpers
>   aware of the idmapping.
> 
>   I just went and plumbed this through to most helpers.
> 
> There's some subtlety in eventfs. Afaict, the directories and files for
> the individual events are created on-demand during lookup or readdir.
> 
> The ownership of these events is again inherited from the parent inode
> or recovered from stored state. In both cases the actual idmapping is
> irrelevant.
> 
> The callchain here is:
> 
> eventfs_root_lookup("xfs", "events")
> -> create_{dir,file}_dentry("xfs", "events")
>    -> create_{dir,file}("xfs", "events")
>       -> eventfs_start_creating("xfs", "events")
>          -> lookup_one_len("xfs", "events")
> 
> And the subtlety is that lookup_one_len() does permission checking on
> the parent inode (IOW, if you want a dentry for "blech" under "events"
> it'll do a permission check on events->d_inode) for exec permissions
> and then goes on to give you a new dentry.
> 
> Usually this call would have to be changed to lookup_one() and the
> idmapping be handed down to it. But I think that's irrelevant here.
> 
> Lookup generally doesn't need to be aware of idmappings at all. The
> permission checking is done purely in the vfs via may_lookup() and the
> idmapping is irrelevant because we always initialize inodes with the
> filesystem level ownership (see the idmappings.rst) documentation if
> you're interested in excessive details (otherwise you get inode aliases
> which you really don't want).
> 
> For tracefs it would not matter for lookup per se but only because
> tracefs seemingly creates inodes/dentries during lookup (and readdir()).
> 
> But imho the permission checking done in current eventfs_root_lookup()
> via lookup_one_len() is meaningless in any way; possibly even
> (conceptually) wrong.
> 
> Because, the actual permission checking for the creation of the eventfs
> entries isn't really done during lookup or readdir, it's done when mkdir
> is called:
> 
>         mkdir /sys/kernel/tracing/instances/foo
> 
> Here, all possible entries beneath foo including "events" and further
> below are recorded and stored. So once mkdir returns it basically means
> that it succeeded with the creation of all the necessary directories and
> files. For all purposes the foo/events/ directory and below have all the
> entries that matter. They have been created. It's comparable to them not
> being in the {d,i}cache, I guess.
> 
> When one goes and looksup stuff under foo/events/ or readdir the entries
> in that directory:
> 
> fd = open("foo/events")
> readdir(fd, ...)
> 
> then they are licensed to list an entry in that directory. So all that
> needs to be done is to actually list those files in that directory. And
> since they already exist (they were created during mkdir) we just need
> to splice in inodes and dentries for them. But for that we shouldn't
> check permissions on the directory again. Because we've done that
> already correctly when the VFS called may_lookup().
> 
> IOW, the inode_permission() in lookup_one_len() that eventfs does is
> redundant and just wrong.
> 
> Luckily, I don't think we need to even change anything because all
> directories that eventfs creates always grant exec permissions to the
> other group so lookup_one_len() will trivially succeed. IIUC.
> 
> Drafted-by-with-no-guarantees-whatsoever-that-this-wont-burn-the-house-down: Christian Brauner <brauner@kernel.org>

Actually, I think that patch can be way simpler for a similar reason
than I mentioned before. In the regular system call for tracefs we have:

mkdir /sys/kernel/tracing/instances/foo

-> do_mkdirat("foo")
   -> i_op->mkdir::tracefs_syscall_mkdir(child_dentry::foo, parent_inode)
      -> char *name = get_dname(child_dentry::foo) // duplicate the dentries name
      -> tracefs_ops.mkdir::instance_mkdir("foo")
        -> trace_array_create("foo")
	   -> trace_array_create_dir("foo")
	      -> tracefs_create_dir("foo", trace_instance_dir)
	         -> __create_dir("foo", trace_instance_dir)
        	    -> tracefs_start_creating("foo", trace_instance_dir)
		       -> lookup_one_len("foo", trace_instance_dir) // perform the lookup again and find @child_dentry again that the vfs already created

So afaict, the vfs has created you the correct target dentry and
performed the required permission checks already.

You then go on to retrieve and duplicate the name of the dentry and pass
that name down. And then you perform a lookup via lookup_one_len() on
the instances directory. And here lookup_one_len() performs the
permission checking again that the vfs already did in may_lookup() when
it walked up to the "instances" directory.

The reason I'm saying this is that for tracefs _the only reason_ we need
that idmapping in any of these helpers at all is because of the
superfluous permission check in lookup_one_len(). Because for actual
ownership of the inode it's completely irrelevant because you inherit
the ownership anyway. IOW, the callers fs{g,u}id etc is entirely
irrelevant.

So basically if you massage the system call path to simply reuse the
dentry that the vfs already given you instead of looking it up again
later the whole patch is literally (baring anything minor I forgot):


diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index ae648deed019..2de3ec114793 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -439,6 +439,7 @@ static struct file_system_type trace_fs_type = {
        .name =         "tracefs",
        .mount =        trace_mount,
        .kill_sb =      kill_litter_super,
+       .fs_flags =     FS_ALLOW_IDMAP,
 };
 MODULE_ALIAS_FS("tracefs");

Because tracefs simply inherits ownership from parent directories or
from stashed info the idmapping is entirely immaterial on the creation
side of things. As soon as the permission checking in the VFS has
succeeded we know that things are alright.

So really, the complexity is getting in your way here. The
tracefs_ops.mkdir thing is only used in a single location and that is

kernel/trace/trace.c:   trace_instance_dir = tracefs_create_instance_dir("instances", d_tracer,

and there's only one callback instance_mkdir().

otherwise that callback ops stuff is completely unused iiuc. If that's
the case it would make a lot more sense to just reuse everything that
the VFS has given you in the regular system call path than to do that
"copy the name and do the lookup again including permission checks
later" dance.

Idk, maybe I'm seeing things and there's an obvious reason why my
thinking is all wrong and that permission check is required. I just
don't see it though. But I'm sure people will tell me what my error is.

At least for the system call path it shouldn't be.
Christian Brauner Jan. 7, 2024, 6:01 p.m. UTC | #20
On Sun, Jan 07, 2024 at 06:42:33PM +0100, Christian Brauner wrote:
> On Sun, Jan 07, 2024 at 01:42:39PM +0100, Christian Brauner wrote:
> > > > So tracefs supports remounting with different uid/gid mount options and
> > > > then actually wades through _all_ of the inodes and changes their
> > > > ownership internally? What's the use-case for this? Containers?
> > > 
> > > No, in fact tracing doesn't work well with containers as tracing is global
> > > to the entire machine. It can work with privileged containers though.
> > 
> > At least the tracefs interface is easily supportable within a delegation
> > model. IOW, you have a privileged process that delegates relevant
> > portions to a container via idmapped mounts _without_ doing the insane thing
> > and making it mountable by a container aka the fs-to-CVE pipeline.
> > 
> > > 
> > > The reason for this is because tracefs was based off of debugfs where the
> > > files and directores are created at boot up and mounted later. The reason
> > > to do this was to allow users to mount with gid=GID to allow a given group
> > > to have access to tracing. Without this update, tracefs would ignore it
> > > like debugfs and proc does today.
> > > 
> > > I think its time I explain the purpose of tracefs and how it came to be.
> > > 
> > > The tracing system required a way to control tracing and read the traces.
> > > It could have just used a new system like perf (although
> > > /sys/kernel/debug/tracing predates perf), where it created a single ioctl()
> > > like system call do do everything.
> > > 
> > > As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev
> > > tracer, which both have an embedded background, I chose an interface that
> > > could work with just an unmodified version of busybox. That is, I wanted it
> > > to work with just cat and echo.
> > > 
> > > The main difference with tracefs compared to other file systems is that it
> > > is a control interface, where writes happen as much as reads. The data read
> > > is controlled. The closest thing I can think of is how cgroups work.
> > > 
> > > As tracing is a privileged operation, but something that could be changed
> > > to allow a group to have access to, I wanted to make it easy for an admin
> > > to decide who gets to do what at boot up via the /etc/fstab file.
> > 
> > Yeah, ok. I think you could achieve the same thing via idmapped mounts. You
> > just need to swap out the mnt on /sys/kernel/tracing with an idmapped mount.
> > 
> > mount(8) should just give you the ability to specify "map the ids I explicitly
> > want to remap to something else and for the rest use the identity mapping". I
> > wanted that for other reasons anyway.
> > 
> > So in one of the next versions of mount(8) you can then do (where --beneath
> > means place the mount beneath the current one and --replace is
> > self-explanatory):
> > 
> > sudo mount --beneath -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing
> > sudo umount /sys/kernel/tracing
> > 
> > or as a shortcut provided by mount(8):
> > 
> > sudo mount --replace -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing 
> > 
> > In both cases you replace the mount without unmounting tracefs.
> > 
> > I can illustrate this right now though:
> > 
> > user1@localhost:~$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1000:1' /sys/kernel/tracing/ /mnt/
> > 
> > # This is a tool I wrote for testing the patchset I wrote back then.
> > user1@localhost:~/data/move-mount-beneath$ sudo ./move-mount --beneath --detached /mnt /sys/kernel/tracing
> > Mounting beneath top mount
> > Creating anonymous mount
> > Attaching mount /mnt -> /sys/kernel/tracing
> > Creating single detached mount
> > 
> > user1@localhost:~/data/move-mount-beneath$
> > 
> > # Now there's two mounts stacked on top of each other.
> > user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing
> > | `-/sys/kernel/tracing        tracefs        tracefs     rw,nosuid,nodev,noexec,relatime,idmapped
> > |   `-/sys/kernel/tracing      tracefs        tracefs     rw,nosuid,nodev,noexec,relatime
> > 
> > user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| head
> > total 0
> > drwx------  6 root root 0 Jan  7 13:33 .
> > drwxr-xr-x 16 root root 0 Jan  7 13:33 ..
> > -r--r-----  1 root root 0 Jan  7 13:33 README
> > -r--r-----  1 root root 0 Jan  7 13:33 available_events
> > -r--r-----  1 root root 0 Jan  7 13:33 available_filter_functions
> > -r--r-----  1 root root 0 Jan  7 13:33 available_filter_functions_addrs
> > -r--r-----  1 root root 0 Jan  7 13:33 available_tracers
> > -rw-r-----  1 root root 0 Jan  7 13:33 buffer_percent
> > -rw-r-----  1 root root 0 Jan  7 13:33 buffer_size_kb
> > 
> > # Reveal updated mount
> > user1@localhost:~/data/move-mount-beneath$ sudo umount /sys/kernel/tracing
> > 
> > user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing
> > | `-/sys/kernel/tracing        tracefs        tracefs     rw,nosuid,nodev,noexec,relatime,idmapped
> > 
> > user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| head
> > total 0
> > drwx------  6 user1 user1 0 Jan  7 13:33 .
> > drwxr-xr-x 16 root  root  0 Jan  7 13:33 ..
> > -r--r-----  1 user1 user1 0 Jan  7 13:33 README
> > -r--r-----  1 user1 user1 0 Jan  7 13:33 available_events
> > -r--r-----  1 user1 user1 0 Jan  7 13:33 available_filter_functions
> > -r--r-----  1 user1 user1 0 Jan  7 13:33 available_filter_functions_addrs
> > -r--r-----  1 user1 user1 0 Jan  7 13:33 available_tracers
> > -rw-r-----  1 user1 user1 0 Jan  7 13:33 buffer_percent
> > -rw-r-----  1 user1 user1 0 Jan  7 13:33 buffer_size_kb
> > 
> > sudo umount -l /sys/kernel/tracing
> > 
> > and reveal the new mount with updated permissions and at no point in time will
> > you have had to unmount tracefs itself. No chown needed, no remount needed that
> > has to touch all inodes.
> > 
> > I did perf numbers for this when I implemented this and there's no meaningful
> > perf impact for anything below 5 mappings. which covers 80% of the use-cases.
> > 
> > I mean, there are crazy people out there that do have 30 mappings, maybe. And
> > maybe there's 3 users that go into the hundreds (340 is maximum so struct idmap
> > still fits into cacheline) because of some LDAP crap or something.
> > 
> > See an ext4 filesystem to open/create 1,000,000 files. Then I looped through
> > all of the files calling fstat() on each of them 1000 times and calculated the
> > mean fstat() time for a single file.
> > 
> > |   # MAPPINGS | PATCH-NEW |
> > |--------------|-----------|
> > |   0 mappings |   158 ns  |
> > |   1 mappings |   157 ns  |
> > |   2 mappings |   158 ns  |
> > |   3 mappings |   161 ns  |
> > |   5 mappings |   165 ns  |
> > |  10 mappings |   199 ns  |
> > |  50 mappings |   218 ns  |
> > | 100 mappings |   229 ns  |
> > | 200 mappings |   239 ns  |
> > | 300 mappings |   240 ns  |
> > | 340 mappings |   248 ns  |
> > 
> > > 
> > > > 
> > > > Aside from optimizing this and the special semantics for this eventfs
> > > > stuff that you really should think twice of doing, here's one idea for
> > > > an extension that might alleviate some of the pain:
> > > > 
> > > > If you need flexible dynamic ownership change to e.g., be able to
> > > > delegate (all, a directory, a single file of) tracefs to
> > > > unprivileged/containers/whatever then you might want to consider
> > > > supporting idmapped mounts for tracefs. Because then you can do stuff
> > > > like:
> > > 
> > > I'm a novice here and have no idea on how id maps work ;-)
> > 
> > It's no magic and you don't even need to care about it. If you're on
> > util-linux 2.39 and any kernel post 5.12 the -o X-mount.idmap option
> > does all the details for you.
> > 
> > Though I still want an api where you can just pass the idmappings directly to
> > mount_setattr(). That's another topic though.
> > 
> > > have. Now, can we do that and still keep the dynamic creation of inodes and
> > > dentries?
> > 
> > Yes.
> > 
> > > Does this require having more than one dentry per inode?
> > 
> > No. It's just a topological change. The same subtree exposed at different
> > locations with the ability of exposing it with different permissions (without
> > any persistent filesystem-level changes).
> > 
> > So, I tried to do an exploratory patch even though I promised myself not
> > to do it. But hey...
> > 
> > Some notes:
> > 
> > * Permission handling for idmapped mounts is done completely in the
> >   VFS. That's the case for all filesytems that don't have a custom
> >   ->permission() handler. So there's nothing to do for us here.
> > 
> > * Idmapped mount handling for ->getattr() is done completely by the VFS
> >   if the filesystem doesn't have a custom ->getattr() handler. So we're
> >   done here.
> > 
> > * Tracefs doesn't support attribute changes via ->setattr() (chown(),
> >   chmod etc.). So there's nothing to here.
> > 
> > * Eventfs does support attribute changes via ->setattr(). But it relies
> >   on simple_setattr() which is already idmapped mount aware. So there's
> >   nothing for us to do.
> > 
> > * Ownership is inherited from the parent inode (tracefs) or optionally
> >   from stashed ownership information (eventfs). That means the idmapping
> >   is irrelevant here. It's similar to the "inherit gid from parent
> >   directory" logic we have in some circumstances. TL;DR nothing to do
> >   here as well.
> > 
> > * Tracefs supports the creation of instances from userspace via mkdir.
> >   For example,
> > 
> > 	mkdir /sys/kernel/tracing/instances/foo
> > 
> >   And here the idmapping is relevant so we need to make the helpers
> >   aware of the idmapping.
> > 
> >   I just went and plumbed this through to most helpers.
> > 
> > There's some subtlety in eventfs. Afaict, the directories and files for
> > the individual events are created on-demand during lookup or readdir.
> > 
> > The ownership of these events is again inherited from the parent inode
> > or recovered from stored state. In both cases the actual idmapping is
> > irrelevant.
> > 
> > The callchain here is:
> > 
> > eventfs_root_lookup("xfs", "events")
> > -> create_{dir,file}_dentry("xfs", "events")
> >    -> create_{dir,file}("xfs", "events")
> >       -> eventfs_start_creating("xfs", "events")
> >          -> lookup_one_len("xfs", "events")
> > 
> > And the subtlety is that lookup_one_len() does permission checking on
> > the parent inode (IOW, if you want a dentry for "blech" under "events"
> > it'll do a permission check on events->d_inode) for exec permissions
> > and then goes on to give you a new dentry.
> > 
> > Usually this call would have to be changed to lookup_one() and the
> > idmapping be handed down to it. But I think that's irrelevant here.
> > 
> > Lookup generally doesn't need to be aware of idmappings at all. The
> > permission checking is done purely in the vfs via may_lookup() and the
> > idmapping is irrelevant because we always initialize inodes with the
> > filesystem level ownership (see the idmappings.rst) documentation if
> > you're interested in excessive details (otherwise you get inode aliases
> > which you really don't want).
> > 
> > For tracefs it would not matter for lookup per se but only because
> > tracefs seemingly creates inodes/dentries during lookup (and readdir()).
> > 
> > But imho the permission checking done in current eventfs_root_lookup()
> > via lookup_one_len() is meaningless in any way; possibly even
> > (conceptually) wrong.
> > 
> > Because, the actual permission checking for the creation of the eventfs
> > entries isn't really done during lookup or readdir, it's done when mkdir
> > is called:
> > 
> >         mkdir /sys/kernel/tracing/instances/foo
> > 
> > Here, all possible entries beneath foo including "events" and further
> > below are recorded and stored. So once mkdir returns it basically means
> > that it succeeded with the creation of all the necessary directories and
> > files. For all purposes the foo/events/ directory and below have all the
> > entries that matter. They have been created. It's comparable to them not
> > being in the {d,i}cache, I guess.
> > 
> > When one goes and looksup stuff under foo/events/ or readdir the entries
> > in that directory:
> > 
> > fd = open("foo/events")
> > readdir(fd, ...)
> > 
> > then they are licensed to list an entry in that directory. So all that
> > needs to be done is to actually list those files in that directory. And
> > since they already exist (they were created during mkdir) we just need
> > to splice in inodes and dentries for them. But for that we shouldn't
> > check permissions on the directory again. Because we've done that
> > already correctly when the VFS called may_lookup().
> > 
> > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > redundant and just wrong.
> > 
> > Luckily, I don't think we need to even change anything because all
> > directories that eventfs creates always grant exec permissions to the
> > other group so lookup_one_len() will trivially succeed. IIUC.
> > 
> > Drafted-by-with-no-guarantees-whatsoever-that-this-wont-burn-the-house-down: Christian Brauner <brauner@kernel.org>
> 
> Actually, I think that patch can be way simpler for a similar reason
> than I mentioned before. In the regular system call for tracefs we have:
> 
> mkdir /sys/kernel/tracing/instances/foo
> 
> -> do_mkdirat("foo")
>    -> i_op->mkdir::tracefs_syscall_mkdir(child_dentry::foo, parent_inode)
>       -> char *name = get_dname(child_dentry::foo) // duplicate the dentries name
>       -> tracefs_ops.mkdir::instance_mkdir("foo")
>         -> trace_array_create("foo")
> 	   -> trace_array_create_dir("foo")
> 	      -> tracefs_create_dir("foo", trace_instance_dir)
> 	         -> __create_dir("foo", trace_instance_dir)
>         	    -> tracefs_start_creating("foo", trace_instance_dir)
> 		       -> lookup_one_len("foo", trace_instance_dir) // perform the lookup again and find @child_dentry again that the vfs already created
> 
> So afaict, the vfs has created you the correct target dentry and
> performed the required permission checks already.
> 
> You then go on to retrieve and duplicate the name of the dentry and pass
> that name down. And then you perform a lookup via lookup_one_len() on
> the instances directory. And here lookup_one_len() performs the
> permission checking again that the vfs already did in may_lookup() when
> it walked up to the "instances" directory.
> 
> The reason I'm saying this is that for tracefs _the only reason_ we need
> that idmapping in any of these helpers at all is because of the
> superfluous permission check in lookup_one_len(). Because for actual
> ownership of the inode it's completely irrelevant because you inherit
> the ownership anyway. IOW, the callers fs{g,u}id etc is entirely
> irrelevant.
> 
> So basically if you massage the system call path to simply reuse the
> dentry that the vfs already given you instead of looking it up again
> later the whole patch is literally (baring anything minor I forgot):
> 
> 
> diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
> index ae648deed019..2de3ec114793 100644
> --- a/fs/tracefs/inode.c
> +++ b/fs/tracefs/inode.c
> @@ -439,6 +439,7 @@ static struct file_system_type trace_fs_type = {
>         .name =         "tracefs",
>         .mount =        trace_mount,
>         .kill_sb =      kill_litter_super,
> +       .fs_flags =     FS_ALLOW_IDMAP,
>  };
>  MODULE_ALIAS_FS("tracefs");
> 
> Because tracefs simply inherits ownership from parent directories or
> from stashed info the idmapping is entirely immaterial on the creation
> side of things. As soon as the permission checking in the VFS has
> succeeded we know that things are alright.
> 
> So really, the complexity is getting in your way here. The
> tracefs_ops.mkdir thing is only used in a single location and that is
> 
> kernel/trace/trace.c:   trace_instance_dir = tracefs_create_instance_dir("instances", d_tracer,
> 
> and there's only one callback instance_mkdir().
> 
> otherwise that callback ops stuff is completely unused iiuc. If that's
> the case it would make a lot more sense to just reuse everything that
> the VFS has given you in the regular system call path than to do that
> "copy the name and do the lookup again including permission checks
> later" dance.
> 
> Idk, maybe I'm seeing things and there's an obvious reason why my
> thinking is all wrong and that permission check is required. I just
> don't see it though. But I'm sure people will tell me what my error is.
> 
> At least for the system call path it shouldn't be.

I guess an argument would be that the check is somewhat required because
you create a whole hierarchy of directories and files beneath the top
level directory that the system call path creates.

But then would you ever want to fail creating any directories or files
beneath /sys/kernel/tracing/instances/foo/* if the VFS told you that you
have the required permission to create the top level directory
/sys/kernel/tracing/instances/foo?

I think that isn't the case and it isn't even possible currently. So
even for the directories and files beneath the top level directory the
permission check is irrelevant iiuc.

Anyway, I don't think you necessarily need to change any of what you do
right now. The first version of the patch will work with what you
currently have and is correct. It's just more complex because of that
additional permission checking that I think isn't needed. Because it's
not that the user has any control over the creation of the subtree
beneath /sys/kernel/tracing/instances/foo/*. But again, I might be
completely off.
Steven Rostedt Jan. 7, 2024, 6:29 p.m. UTC | #21
On Sun, 7 Jan 2024 13:42:39 +0100
Christian Brauner <brauner@kernel.org> wrote:
> 
> So, I tried to do an exploratory patch even though I promised myself not
> to do it. But hey...
> 
> Some notes:
> 
> * Permission handling for idmapped mounts is done completely in the
>   VFS. That's the case for all filesytems that don't have a custom
>   ->permission() handler. So there's nothing to do for us here.  
> 
> * Idmapped mount handling for ->getattr() is done completely by the VFS
>   if the filesystem doesn't have a custom ->getattr() handler. So we're
>   done here.
> 
> * Tracefs doesn't support attribute changes via ->setattr() (chown(),
>   chmod etc.). So there's nothing to here.
> 
> * Eventfs does support attribute changes via ->setattr(). But it relies
>   on simple_setattr() which is already idmapped mount aware. So there's
>   nothing for us to do.
> 
> * Ownership is inherited from the parent inode (tracefs) or optionally
>   from stashed ownership information (eventfs). That means the idmapping
>   is irrelevant here. It's similar to the "inherit gid from parent
>   directory" logic we have in some circumstances. TL;DR nothing to do
>   here as well.

The reason ownership is inherited from the parent is because the inodes
are created at boot up before user space starts.

eventfs inodes are created on demand after user space so it needs to
either check the "default" ownership and permissions, or if the user
changed a specific file/directory, it must save it and use that again
if the inode/dentry are reclaimed and then referenced again and
recreated.

> 
> * Tracefs supports the creation of instances from userspace via mkdir.
>   For example,
> 
> 	mkdir /sys/kernel/tracing/instances/foo
> 
>   And here the idmapping is relevant so we need to make the helpers
>   aware of the idmapping.
> 
>   I just went and plumbed this through to most helpers.
> 
> There's some subtlety in eventfs. Afaict, the directories and files for
> the individual events are created on-demand during lookup or readdir.
> 
> The ownership of these events is again inherited from the parent inode
> or recovered from stored state. In both cases the actual idmapping is
> irrelevant.
> 
> The callchain here is:
> 
> eventfs_root_lookup("xfs", "events")
> -> create_{dir,file}_dentry("xfs", "events")
>    -> create_{dir,file}("xfs", "events")
>       -> eventfs_start_creating("xfs", "events")
>          -> lookup_one_len("xfs", "events")  
> 
> And the subtlety is that lookup_one_len() does permission checking on
> the parent inode (IOW, if you want a dentry for "blech" under "events"
> it'll do a permission check on events->d_inode) for exec permissions
> and then goes on to give you a new dentry.
> 
> Usually this call would have to be changed to lookup_one() and the
> idmapping be handed down to it. But I think that's irrelevant here.
> 
> Lookup generally doesn't need to be aware of idmappings at all. The
> permission checking is done purely in the vfs via may_lookup() and the
> idmapping is irrelevant because we always initialize inodes with the
> filesystem level ownership (see the idmappings.rst) documentation if
> you're interested in excessive details (otherwise you get inode aliases
> which you really don't want).
> 
> For tracefs it would not matter for lookup per se but only because
> tracefs seemingly creates inodes/dentries during lookup (and readdir()).

tracefs creates the inodes/dentries at boot up, it's eventfs that does
it on demand during lookup.

For inodes/dentries:

 /sys/kernel/tracing/* is all created at boot up, except for "events".
 /sys/kernel/tracing/events/* is created on demand.

> 
> But imho the permission checking done in current eventfs_root_lookup()
> via lookup_one_len() is meaningless in any way; possibly even
> (conceptually) wrong.
> 
> Because, the actual permission checking for the creation of the eventfs
> entries isn't really done during lookup or readdir, it's done when mkdir
> is called:
> 
>         mkdir /sys/kernel/tracing/instances/foo

No. that creates a entire new tracefs instance, which happens to
include another eventfs directory.

The eventsfs directory is in "/sys/kernel/tracing/events" and
 "/sys/kernel/tracing/instances/*/events"

eventfs has 10s of thousands of files and directories which is why I
changed it to be on-demand inode/dentry creation and also reclaiming
when no longer accessed.

 # ls /sys/kernel/tracing/events/

will create the inodes and dentries, and a memory stress program will
free those created inodes and dentries.

> 
> Here, all possible entries beneath foo including "events" and further
> below are recorded and stored. So once mkdir returns it basically means
> that it succeeded with the creation of all the necessary directories and
> files. For all purposes the foo/events/ directory and below have all the
> entries that matter. They have been created. It's comparable to them not
> being in the {d,i}cache, I guess.

No. Only the meta data is created for the eventfs directory with a
mkdir instances/foo. The inodes and dentries are not there.

> 
> When one goes and looksup stuff under foo/events/ or readdir the entries
> in that directory:
> 
> fd = open("foo/events")
> readdir(fd, ...)
> 
> then they are licensed to list an entry in that directory. So all that
> needs to be done is to actually list those files in that directory. And
> since they already exist (they were created during mkdir) we just need
> to splice in inodes and dentries for them. But for that we shouldn't
> check permissions on the directory again. Because we've done that
> already correctly when the VFS called may_lookup().

No they do not exist.

> 
> IOW, the inode_permission() in lookup_one_len() that eventfs does is
> redundant and just wrong.

I don't think so.

> 
> Luckily, I don't think we need to even change anything because all
> directories that eventfs creates always grant exec permissions to the
> other group so lookup_one_len() will trivially succeed. IIUC.
> 
> Drafted-by-with-no-guarantees-whatsoever-that-this-wont-burn-the-house-down: Christian Brauner <brauner@kernel.org>

Checkout this branch:

  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git eventfs-show-files

It creates a file "/sys/kernel/tracing/show_events_dentries" That shows
when inodes and dentries are created and freed.

I use this to make sure that reclaim actually does reclaim them. Linus
did not like this file, but it has become very useful to make sure
things are working properly as there is no other way to know if the
inodes and dentries are reclaimed or not. I may see if he'll let me add
it with a debug config option.

-- Steve
Steven Rostedt Jan. 7, 2024, 6:32 p.m. UTC | #22
On Sun, 7 Jan 2024 13:29:12 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> > 
> > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > redundant and just wrong.  
> 
> I don't think so.

Just to make it clear. eventfs has nothing to do with mkdir instance/foo.
It exists without that. Although one rationale to do eventfs was so
that the instance directories wouldn't recreate the same 10thousands
event inodes and dentries for every mkdir done.

-- Steve
Christian Brauner Jan. 8, 2024, 11:04 a.m. UTC | #23
> > * Tracefs supports the creation of instances from userspace via mkdir.
> >   For example,
> > 
> > 	mkdir /sys/kernel/tracing/instances/foo
> > 
> >   And here the idmapping is relevant so we need to make the helpers
> >   aware of the idmapping.
> > 
> >   I just went and plumbed this through to most helpers.
> > 
> > There's some subtlety in eventfs. Afaict, the directories and files for
> > the individual events are created on-demand during lookup or readdir.
> > 
> > The ownership of these events is again inherited from the parent inode
> > or recovered from stored state. In both cases the actual idmapping is
> > irrelevant.
> > 
> > The callchain here is:
> > 
> > eventfs_root_lookup("xfs", "events")
> > -> create_{dir,file}_dentry("xfs", "events")
> >    -> create_{dir,file}("xfs", "events")
> >       -> eventfs_start_creating("xfs", "events")
> >          -> lookup_one_len("xfs", "events")  
> > 
> > And the subtlety is that lookup_one_len() does permission checking on
> > the parent inode (IOW, if you want a dentry for "blech" under "events"
> > it'll do a permission check on events->d_inode) for exec permissions
> > and then goes on to give you a new dentry.
> > 
> > Usually this call would have to be changed to lookup_one() and the
> > idmapping be handed down to it. But I think that's irrelevant here.
> > 
> > Lookup generally doesn't need to be aware of idmappings at all. The
> > permission checking is done purely in the vfs via may_lookup() and the
> > idmapping is irrelevant because we always initialize inodes with the
> > filesystem level ownership (see the idmappings.rst) documentation if
> > you're interested in excessive details (otherwise you get inode aliases
> > which you really don't want).
> > 
> > For tracefs it would not matter for lookup per se but only because
> > tracefs seemingly creates inodes/dentries during lookup (and readdir()).
> 
> tracefs creates the inodes/dentries at boot up, it's eventfs that does
> it on demand during lookup.
> 
> For inodes/dentries:
> 
>  /sys/kernel/tracing/* is all created at boot up, except for "events".

Yes.

>  /sys/kernel/tracing/events/* is created on demand.

Yes.

> 
> > 
> > But imho the permission checking done in current eventfs_root_lookup()
> > via lookup_one_len() is meaningless in any way; possibly even
> > (conceptually) wrong.
> > 
> > Because, the actual permission checking for the creation of the eventfs
> > entries isn't really done during lookup or readdir, it's done when mkdir
> > is called:
> > 
> >         mkdir /sys/kernel/tracing/instances/foo
> 
> No. that creates a entire new tracefs instance, which happens to
> include another eventfs directory.

Yes, I'm aware of all that.

> No. Only the meta data is created for the eventfs directory with a
> mkdir instances/foo. The inodes and dentries are not there.

I know, that is what I'm saying.

> 
> > 
> > When one goes and looksup stuff under foo/events/ or readdir the entries
> > in that directory:
> > 
> > fd = open("foo/events")
> > readdir(fd, ...)
> > 
> > then they are licensed to list an entry in that directory. So all that
> > needs to be done is to actually list those files in that directory. And
> > since they already exist (they were created during mkdir) we just need
> > to splice in inodes and dentries for them. But for that we shouldn't
> > check permissions on the directory again. Because we've done that
> > already correctly when the VFS called may_lookup().
> 
> No they do not exist.

I am aware.

> 
> > 
> > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > redundant and just wrong.
> 
> I don't think so.

I'm very well aware that the dentries and inode aren't created during
mkdir but the completely directory layout is determined. You're just
splicing in dentries and inodes during lookup and readdir.

If mkdir /sys/kernel/tracing/instances/foo has succeeded and you later
do a lookup/readdir on

ls -al /sys/kernel/tracing/instances/foo/events

Why should the creation of the dentries and inodes ever fail due to a
permission failure? The vfs did already verify that you had the required
permissions to list entries in that directory. Why should filling up
/sys/kernel/tracing/instances/foo/events ever fail then? It shouldn't
That tracefs instance would be half-functional. And again, right now
that inode_permission() check cannot even fail.
Christian Brauner Jan. 8, 2024, 11:32 a.m. UTC | #24
On Sun, Jan 07, 2024 at 01:32:28PM -0500, Steven Rostedt wrote:
> On Sun, 7 Jan 2024 13:29:12 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > > 
> > > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > > redundant and just wrong.  
> > 
> > I don't think so.
> 
> Just to make it clear. eventfs has nothing to do with mkdir instance/foo.
> It exists without that. Although one rationale to do eventfs was so

Every instance/foo/ tracefs instances also contains an events directory
and thus a eventfs portion. Eventfs is just a subtree of tracefs. It's
not a separate filesystem. Both eventfs and tracefs are on the same
single, system wide superblock.

> that the instance directories wouldn't recreate the same 10thousands
> event inodes and dentries for every mkdir done.

I know but that's irrelevant to what I'm trying to tell you.

A mkdir /sys/kernel/tracing/instances/foo creates a new tracefs
instance. With or without the on-demand dentry and inode creation for
the eventfs portion that tracefs "instance" has now been created in its
entirety including all the required information for someone to later
come along and perform a lookup on /sys/kernel/tracing/instances/foo/events.

All you've done is to defer the addition of the dentries and inodes when
someone does actually look at the events directory of the tracefs
instance.

Whether you choose to splice in the dentries and inodes for the eventfs
portion during lookup and readdir or if you had chosen to not do the
on-demand thing at all and the entries were created at the same time as
the mkdir call are equivalent from the perspective of permission
checking.

If you have the required permissions to look at the events directory
then there's no reason why listing the directory entries in there should
fail. This can't even happen right now.
Steven Rostedt Jan. 8, 2024, 3:23 p.m. UTC | #25
On Mon, 8 Jan 2024 12:04:54 +0100
Christian Brauner <brauner@kernel.org> wrote:

> > > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > > redundant and just wrong.  
> > 
> > I don't think so.  
> 
> I'm very well aware that the dentries and inode aren't created during
> mkdir but the completely directory layout is determined. You're just
> splicing in dentries and inodes during lookup and readdir.
> 
> If mkdir /sys/kernel/tracing/instances/foo has succeeded and you later
> do a lookup/readdir on
> 
> ls -al /sys/kernel/tracing/instances/foo/events
> 
> Why should the creation of the dentries and inodes ever fail due to a
> permission failure?

They shouldn't.

> The vfs did already verify that you had the required
> permissions to list entries in that directory. Why should filling up
> /sys/kernel/tracing/instances/foo/events ever fail then? It shouldn't
> That tracefs instance would be half-functional. And again, right now
> that inode_permission() check cannot even fail.

And it shouldn't. But without dentries and inodes, how does VFS know what
is allowed to open the files?

-- Steve
Steven Rostedt Jan. 8, 2024, 3:41 p.m. UTC | #26
On Mon, 8 Jan 2024 12:32:46 +0100
Christian Brauner <brauner@kernel.org> wrote:

> On Sun, Jan 07, 2024 at 01:32:28PM -0500, Steven Rostedt wrote:
> > On Sun, 7 Jan 2024 13:29:12 -0500
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> >   
> > > > 
> > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > > > redundant and just wrong.    
> > > 
> > > I don't think so.  
> > 
> > Just to make it clear. eventfs has nothing to do with mkdir instance/foo.
> > It exists without that. Although one rationale to do eventfs was so  
> 
> Every instance/foo/ tracefs instances also contains an events directory
> and thus a eventfs portion. Eventfs is just a subtree of tracefs. It's
> not a separate filesystem. Both eventfs and tracefs are on the same
> single, system wide superblock.
> 
> > that the instance directories wouldn't recreate the same 10thousands
> > event inodes and dentries for every mkdir done.  
> 
> I know but that's irrelevant to what I'm trying to tell you.
> 
> A mkdir /sys/kernel/tracing/instances/foo creates a new tracefs
> instance. With or without the on-demand dentry and inode creation for
> the eventfs portion that tracefs "instance" has now been created in its
> entirety including all the required information for someone to later
> come along and perform a lookup on /sys/kernel/tracing/instances/foo/events.
> 
> All you've done is to defer the addition of the dentries and inodes when
> someone does actually look at the events directory of the tracefs
> instance.
> 
> Whether you choose to splice in the dentries and inodes for the eventfs
> portion during lookup and readdir or if you had chosen to not do the
> on-demand thing at all and the entries were created at the same time as
> the mkdir call are equivalent from the perspective of permission
> checking.
> 
> If you have the required permissions to look at the events directory
> then there's no reason why listing the directory entries in there should
> fail. This can't even happen right now.

Ah, I think I know where the confusion lies. The tracing information in
kernel/trace/*.c doesn't keep track of permission. It relies totally on
fs/tracefs/* to do so. If someone does 'chmod' or 'chown' or mounts with
'gid=xxx' then it's up to tracefs to maintain that information and not the
tracing subsystem. The tracing subsystem only gives the "default"
permissions (before boot finishes).

The difference between normal file systems and pseudo file systems like
debugfs and tracefs, is that normal file systems keep the permission
information stored on the external device. That is, when the inodes and
dentries are created, the information is retrieved from the stored file
system.

I think this may actually be a failure of debugfs (and tracefs as it was
based on debugfs), in that the inodes and dentries are created at the same
time the "files" backing them are. Which is normally at boot up and before
the file system is mounted.

That is, inodes and dentries are actually coupled with the data they
represent. It's not a cache for a back store like a hard drive partition.

To create a file in debugfs you do:

struct dentry *debugfs_create_file(const char *name, umode_t mode,
				   struct dentry *parent, void *data,
				   const struct file_operations *fops)

That is, you pass a the name, the mode, the parent dentry, data, and the
fops and that will create an inode and dentry (which is returned).

This happens at boot up before user space is running and before debugfs is
even mounted.

Because debugfs is mostly for debugging, people don't care about how it's
mounted. It is usually restricted to root only access. Especially since
there's a lot of sensitive information that shouldn't be exposed to
non-privileged users.

The reason tracefs came about is that people asked me to be able to have
access to tracing without needing to even enable debugfs. They also want to
easily make it accessible to non root users and talking with Kees Cook, he
recommended using ACL. But because it inherited a lot from debugfs, I
started doing these tricks like walking the dentry tree to make it work a
bit better. Because the dentries and inodes were created before mount, I
had to play these tricks.

But as Linus pointed out, that was the wrong way to do that. The right way
was to use .getattr and .permission callbacks to figure out what the
permissions to the files are.

This has nothing to do with the creation of the files, it's about who has
access to the files that the inodes point to.

This sounds like another topic to bring up at LSFMM ;-)  "Can we
standardize pseudo file systems like debugfs and tracefs to act more like
real file systems, and have inodes and dentries act as cache and not be so
coupled to the data?"

-- Steve
Christian Brauner Jan. 10, 2024, 11:45 a.m. UTC | #27
On Mon, Jan 08, 2024 at 10:23:31AM -0500, Steven Rostedt wrote:
> On Mon, 8 Jan 2024 12:04:54 +0100
> Christian Brauner <brauner@kernel.org> wrote:
> 
> > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > > > redundant and just wrong.  
> > > 
> > > I don't think so.  
> > 
> > I'm very well aware that the dentries and inode aren't created during
> > mkdir but the completely directory layout is determined. You're just
> > splicing in dentries and inodes during lookup and readdir.
> > 
> > If mkdir /sys/kernel/tracing/instances/foo has succeeded and you later
> > do a lookup/readdir on
> > 
> > ls -al /sys/kernel/tracing/instances/foo/events
> > 
> > Why should the creation of the dentries and inodes ever fail due to a
> > permission failure?
> 
> They shouldn't.
> 
> > The vfs did already verify that you had the required
> > permissions to list entries in that directory. Why should filling up
> > /sys/kernel/tracing/instances/foo/events ever fail then? It shouldn't
> > That tracefs instance would be half-functional. And again, right now
> > that inode_permission() check cannot even fail.
> 
> And it shouldn't. But without dentries and inodes, how does VFS know what
> is allowed to open the files?

So say you do:

mkdir /sys/kernel/tracing/instances/foo

After this has returned we know everything we need to know about the new
tracefs instance including the ownership and the mode of all inodes in
/sys/kernel/tracing/instances/foo/events/* and below precisely because
ownership is always inherited from the parent dentry and recorded in the
metadata struct eventfs_inode.

So say someone does:

open("/sys/kernel/tracing/instances/foo/events/xfs");

and say this is the first time that someone accesses that events/
directory.

When the open pathwalk is done, the vfs will determine via

[1] may_lookup(inode_of(events))

whether you are able to list entries such as "xfs" in that directory.
The vfs checks inode_permission(MAY_EXEC) on "events" and if that holds
it ends up calling i_op->eventfs_root_lookup(events).

At this point tracefs/eventfs adds the inodes for all entries in that
"events" directory including "xfs" based on the metadata it recorded
during the mkdir. Since now someone is actually interested in them. And
it initializes the inodes with ownership and everything and adds the
dentries that belong into that directory.

Nothing here depends on the permissions of the caller. The only
permission that mattered was done in the VFS in [1]. If the caller has
permissions to enter a directory they can lookup and list its contents.
And its contents where determined/fixed etc when mkdir was called.

So we just need to add the required objects into the caches (inode,
dentry) whose addition we intentionally defered until someone actually
needed them.

So, eventfs_root_lookup() now initializes the inodes with the ownership
from the stored metadata or from the parent dentry and splices in inodes
and dentries. No permission checking is needed for this because it is
always a recheck of what the vfs did in [1].

We now return to the vfs and path walk continues to the final component
that you actually want to open which is that "xfs" directory in this
example. We check the permissions on that inode via may_open("xfs") and
we open that directory returning an fd to userspace ultimately.

(I'm going by memory since I need to step out the door.)
Steven Rostedt Jan. 10, 2024, 1:07 p.m. UTC | #28
On Wed, 10 Jan 2024 12:45:36 +0100
Christian Brauner <brauner@kernel.org> wrote:

> So say you do:
> 
> mkdir /sys/kernel/tracing/instances/foo
> 
> After this has returned we know everything we need to know about the new
> tracefs instance including the ownership and the mode of all inodes in
> /sys/kernel/tracing/instances/foo/events/* and below precisely because
> ownership is always inherited from the parent dentry and recorded in the
> metadata struct eventfs_inode.
> 
> So say someone does:
> 
> open("/sys/kernel/tracing/instances/foo/events/xfs");
> 
> and say this is the first time that someone accesses that events/
> directory.
> 
> When the open pathwalk is done, the vfs will determine via
> 
> [1] may_lookup(inode_of(events))
> 
> whether you are able to list entries such as "xfs" in that directory.
> The vfs checks inode_permission(MAY_EXEC) on "events" and if that holds
> it ends up calling i_op->eventfs_root_lookup(events).
> 
> At this point tracefs/eventfs adds the inodes for all entries in that
> "events" directory including "xfs" based on the metadata it recorded
> during the mkdir. Since now someone is actually interested in them. And
> it initializes the inodes with ownership and everything and adds the
> dentries that belong into that directory.
> 
> Nothing here depends on the permissions of the caller. The only
> permission that mattered was done in the VFS in [1]. If the caller has
> permissions to enter a directory they can lookup and list its contents.
> And its contents where determined/fixed etc when mkdir was called.
> 
> So we just need to add the required objects into the caches (inode,
> dentry) whose addition we intentionally defered until someone actually
> needed them.
> 
> So, eventfs_root_lookup() now initializes the inodes with the ownership
> from the stored metadata or from the parent dentry and splices in inodes
> and dentries. No permission checking is needed for this because it is
> always a recheck of what the vfs did in [1].
> 
> We now return to the vfs and path walk continues to the final component
> that you actually want to open which is that "xfs" directory in this
> example. We check the permissions on that inode via may_open("xfs") and
> we open that directory returning an fd to userspace ultimately.
> 
> (I'm going by memory since I need to step out the door.)

So, let's say we do:

 chgrp -R rostedt /sys/kernel/tracing/

But I don't want rostedt to have access to xfs

 chgrp -R root /sys/kernel/tracing/events/xfs

Both actions will create the inodes and dentries of all files and
directories (because of "-R"). But once that is done, the ref counts go to
zero. They stay around until reclaim. But then I open Chrome ;-) and it
reclaims all the dentries and inodes, so we are back to here we were on
boot.

Now as rostedt I do:

 ls /sys/kernel/tracing/events/xfs

The VFS layer doesn't know if I have permission to that or not, because all
the inodes and dentries have been freed. It has to call back to eventfs to
find out. Which the eventfs_root_lookup() and eventfs_iterate_shared() will
recreated the inodes with the proper permission.

Or are you saying that I don't need the ".permission" callback, because
eventfs does it when it creates the inodes? But for eventfs to know what
the permissions changes are, it uses .getattr and .setattr.

-- Steve
Steven Rostedt Jan. 10, 2024, 3:52 p.m. UTC | #29
On Wed, 10 Jan 2024 08:07:46 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> Or are you saying that I don't need the ".permission" callback, because
> eventfs does it when it creates the inodes? But for eventfs to know what
> the permissions changes are, it uses .getattr and .setattr.

OK, if your main argument is that we do not need .permission, I agree with
you. But that's a trivial change and doesn't affect the complexity that
eventfs is doing. In fact, removing the "permission" check is simply this
patch:

--
diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index fdff53d5a1f8..f2af07a857e2 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -192,18 +192,10 @@ static int eventfs_get_attr(struct mnt_idmap *idmap,
 	return 0;
 }
 
-static int eventfs_permission(struct mnt_idmap *idmap,
-			      struct inode *inode, int mask)
-{
-	set_top_events_ownership(inode);
-	return generic_permission(idmap, inode, mask);
-}
-
 static const struct inode_operations eventfs_root_dir_inode_operations = {
 	.lookup		= eventfs_root_lookup,
 	.setattr	= eventfs_set_attr,
 	.getattr	= eventfs_get_attr,
-	.permission	= eventfs_permission,
 };
 
 static const struct inode_operations eventfs_file_inode_operations = {
--

I only did that because Linus mentioned it, and I thought it was needed.
I'll apply this patch too, as it appears to work with this code.

Thanks!

-- Steve
Steven Rostedt Jan. 10, 2024, 4:04 p.m. UTC | #30
On Wed, 10 Jan 2024 10:52:51 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> I'll apply this patch too, as it appears to work with this code.

I meant "appears to work without this code".

-- Steve
Steven Rostedt Jan. 10, 2024, 6:31 p.m. UTC | #31
On Wed, 10 Jan 2024 10:52:51 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 10 Jan 2024 08:07:46 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > Or are you saying that I don't need the ".permission" callback, because
> > eventfs does it when it creates the inodes? But for eventfs to know what
> > the permissions changes are, it uses .getattr and .setattr.  
> 
> OK, if your main argument is that we do not need .permission, I agree with
> you. But that's a trivial change and doesn't affect the complexity that
> eventfs is doing. In fact, removing the "permission" check is simply this
> patch:
> 
> --
> diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
> index fdff53d5a1f8..f2af07a857e2 100644
> --- a/fs/tracefs/event_inode.c
> +++ b/fs/tracefs/event_inode.c
> @@ -192,18 +192,10 @@ static int eventfs_get_attr(struct mnt_idmap *idmap,
>  	return 0;
>  }
>  
> -static int eventfs_permission(struct mnt_idmap *idmap,
> -			      struct inode *inode, int mask)
> -{
> -	set_top_events_ownership(inode);
> -	return generic_permission(idmap, inode, mask);
> -}
> -
>  static const struct inode_operations eventfs_root_dir_inode_operations = {
>  	.lookup		= eventfs_root_lookup,
>  	.setattr	= eventfs_set_attr,
>  	.getattr	= eventfs_get_attr,
> -	.permission	= eventfs_permission,
>  };
>  
>  static const struct inode_operations eventfs_file_inode_operations = {
> --
> 
> I only did that because Linus mentioned it, and I thought it was needed.
> I'll apply this patch too, as it appears to work with this code.

Oh, eventfs files and directories don't need the .permissions because its
inodes and dentries are not created until accessed. But the "events"
directory itself has its dentry and inode created at boot up, but still
uses the eventfs_root_dir_inode_operations. So the .permissions is still
needed!

If you look at the "set_top_events_ownership()" function, it has:

	/* The top events directory doesn't get automatically updated */
	if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL))
		return;

That is, it does nothing if the entry is not the "events" directory. It
falls back to he default "->permissions()" function for everything but the
top level "events" directory.

But this and .getattr are still needed for the events directory, because it
suffers the same issue as the other tracefs entries. That is, it's inodes
and dentries are created at boot up before it is mounted. So if the mount
has gid=1000, it will be ignored.

The .getattr is called by "stat" which ls does. So after boot up if you
just do:

 # chmod 0750 /sys/kernel/events
 # chmod 0770 /sys/kernel/tracing
 # mount -o remount,gid=1000 /sys/kernel/tracing
 # su - rostedt
 $ id
uid=1000(rostedt) gid=1000(rostedt) groups=1000(rostedt)
 $ ls /sys/kernel/tracing/events/
9p            ext4            iomap        module      raw_syscalls  thermal
alarmtimer    fib             iommu        msr         rcu           thp
avc           fib6            io_uring     napi        regmap        timer
block         filelock        ipi          neigh       regulator     tlb
bpf_test_run  filemap         irq          net         resctrl       udp
bpf_trace     ftrace          irq_matrix   netfs       rpm           virtio_gpu[
...]

The above works because "ls" does a stat() on the directory first, which
does a .getattr() call that updates the permissions of the existing "events"
directory inode.

  BUT!

If I had used my own getents() program that has:

        fd = openat(AT_FDCWD, argv[1], O_RDONLY);
        if (fd < 0)
                perror("openat");

        n = getdents64(fd, buf, BUF_SIZE);
        if (n < 0)
                perror("getdents64");

Where it calls the openat() without doing a stat fist, and after boot, had done:

 # chmod 0750 /sys/kernel/events
 # chmod 0770 /sys/kernel/tracing
 # mount -o remount,gid=1000 /sys/kernel/tracing
 # su - rostedt
 $ id
uid=1000(rostedt) gid=1000(rostedt) groups=1000(rostedt)
 $ ./getdents /sys/kernel/tracing/events
openat: Permission denied
getdents64: Bad file descriptor

It errors because he "events" inode permission hasn't been updated yet.
Now after getting the above error, if I do the "ls" and then run it again:

 $ ls /sys/kernel/tracing/events > /dev/null
 $ ./getdents /sys/kernel/tracing/events
enable
header_page
header_event
initcall
vsyscall
syscalls

it works!

so no, I can't remove that .permissions callback from eventfs.

-- Steve
Christian Brauner Jan. 11, 2024, 9:01 p.m. UTC | #32
On Wed, Jan 10, 2024 at 08:07:46AM -0500, Steven Rostedt wrote:
> On Wed, 10 Jan 2024 12:45:36 +0100
> Christian Brauner <brauner@kernel.org> wrote:
> 
> > So say you do:
> > 
> > mkdir /sys/kernel/tracing/instances/foo
> > 
> > After this has returned we know everything we need to know about the new
> > tracefs instance including the ownership and the mode of all inodes in
> > /sys/kernel/tracing/instances/foo/events/* and below precisely because
> > ownership is always inherited from the parent dentry and recorded in the
> > metadata struct eventfs_inode.
> > 
> > So say someone does:
> > 
> > open("/sys/kernel/tracing/instances/foo/events/xfs");
> > 
> > and say this is the first time that someone accesses that events/
> > directory.
> > 
> > When the open pathwalk is done, the vfs will determine via
> > 
> > [1] may_lookup(inode_of(events))
> > 
> > whether you are able to list entries such as "xfs" in that directory.
> > The vfs checks inode_permission(MAY_EXEC) on "events" and if that holds
> > it ends up calling i_op->eventfs_root_lookup(events).
> > 
> > At this point tracefs/eventfs adds the inodes for all entries in that
> > "events" directory including "xfs" based on the metadata it recorded
> > during the mkdir. Since now someone is actually interested in them. And
> > it initializes the inodes with ownership and everything and adds the
> > dentries that belong into that directory.
> > 
> > Nothing here depends on the permissions of the caller. The only
> > permission that mattered was done in the VFS in [1]. If the caller has
> > permissions to enter a directory they can lookup and list its contents.
> > And its contents where determined/fixed etc when mkdir was called.
> > 
> > So we just need to add the required objects into the caches (inode,
> > dentry) whose addition we intentionally defered until someone actually
> > needed them.
> > 
> > So, eventfs_root_lookup() now initializes the inodes with the ownership
> > from the stored metadata or from the parent dentry and splices in inodes
> > and dentries. No permission checking is needed for this because it is
> > always a recheck of what the vfs did in [1].
> > 
> > We now return to the vfs and path walk continues to the final component
> > that you actually want to open which is that "xfs" directory in this
> > example. We check the permissions on that inode via may_open("xfs") and
> > we open that directory returning an fd to userspace ultimately.
> > 
> > (I'm going by memory since I need to step out the door.)
> 
> So, let's say we do:
> 
>  chgrp -R rostedt /sys/kernel/tracing/

The rostedt group needs exec permissions and "other" cannot have exec
permissions otherwise you can trivially list the entries even if it's
owned by root:

chmod 750 /sys/kernel/tracing

user1@localhost:~$ ls -aln /sys/kernel/ | grep tracing
drwxr-x---   6 0 1000    0 Jan 11 18:23 tracing

> 
> But I don't want rostedt to have access to xfs
> 
>  chgrp -R root /sys/kernel/tracing/events/xfs

chmod 750 /sys/kernel/tracing/events/xfs

user1@localhost:~$ ls -aln /sys/kernel/tracing/events/ | grep xfs
drwxr-x--- 601 0 0 0 Jan 11 18:24 xfs

This ensure that if a user is in the group and the group has exec perms
lookup is possible (For root this will usually work because
CAP_DAC_READ_SEARCH overrides the exec requirement.).

> 
> Both actions will create the inodes and dentries of all files and
> directories (because of "-R"). But once that is done, the ref counts go to
> zero. They stay around until reclaim. But then I open Chrome ;-) and it
> reclaims all the dentries and inodes, so we are back to here we were on
> boot.
> 
> Now as rostedt I do:
> 
>  ls /sys/kernel/tracing/events/xfs
> 
> The VFS layer doesn't know if I have permission to that or not, because all
> the inodes and dentries have been freed. It has to call back to eventfs to
> find out. Which the eventfs_root_lookup() and eventfs_iterate_shared() will
> recreated the inodes with the proper permission.

Very roughly, ignoring most of the complexity of lookup and focussing on
the permission checking:

When a caller looks up an entry in a directory then the VFS will call
inode_permission(MAY_EXEC) on the directory the caller is trying to
perform that lookup in.

If the caller wants to lookup the "events" entry in the "tracing"
directory then the VFS will call inode_permission(MAY_EXEC, "tracing")
and then - assuming it's not in the cache - call into the lookup method
of the filesystem.

After the VFS has determined that the caller has the permissions to
lookup the "events" entry in the "tracing" directory no further
permission checks are needed.

Path lookup now moves on to the next part which is looking up the "xfs"
entry in the "events" directory. So again, VFS calls
inode_permission(MAY_EXEC, "events") finds that caller has the permissions
and then goes on to call into eventfs_root_lookup().

The VFS has already given the caller a dentry for "xfs" and is passing
that down to tracefs/eventfs. So eventfs gets
eventfs_rootfs_lookup(@dir[events], @dentry[xfs], ...)

eventfs now wades through the recorded metadata entries for the
@dir[events] directory and finds a matching metadata entry for "xfs". It
allocates an inode for it and initializes the owner, mode etc based on
the recorded information. The lookup_one_len() call finds the dentry
that was just added by the VFS.

The "xfs" dentry/inode is now in the cache. The VFS moves on. It calls
inode_permission(MAY_EXEC, "xfs"). It sees that based on the mode you
don't have access to that directory. Lookup fails.

What I'm pointing out in the current logic is that the caller is
taxed twice:

(1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs")
(2) And again when you call lookup_one_len() in eventfs_start_creating()
    _because_ the permission check in lookup_one_len() is the exact
    same permission check again that the vfs has done
    inode_permission(MAY_EXEC, "xfs").

The readdir case is similar only that instead of generating an inode for
a single entry in the directory like in lookup, you're generating inodes
and dentries for multiple entries. And you might end up calling
eventfs_start_creating() like 600 times for xfs/ for example. So you end
up calling lookup_one_len() and thus inode_permission() 600 times.  And
this inode_permission() check is the same check that the VFS already
performed: inode_permission(MAY_EXEC, "xfs").

You can even break readdir semantics via tracefs right now:

// We managed to open the directory so we have permission to list
// directory entries in "xfs".
fd = open("/sys/kernel/tracing/events/xfs");

// Remove ownership so we can't open the directory anymore
chown("/sys/kernel/tracing/events/xfs", 0, 0);

// Or just remove exec bit for the group and restrict to owner
chmod("/sys/kernel/tracing/events/xfs", 700);

// Drop caches to force an eventfs_root_lookup() on everything
write("/proc/sys/vm/drop_caches", "3", 1);

// Returns 0 even though directory has a lot of entries and we should be
// able to list them
getdents64(fd, ...);

And the failure is in the inode_permission(MAY_EXEC, "xfs") call in
lookup_one_len() in eventfs_start_creating() which now fails.

But that's not how this is supposed to work. Once you hold an fd to a
directory you can read it's entries even if the permissions change.
There's no DAC read-time checks on the directory after that. But this is
exactly what happens on eventfs. So it's even broken right now imho.

I've tested this:

user1@localhost:~/data/scripts$ ./readdir_fail /sys/kernel/tracing/events/xfs/
inode#    file type  d_reclen  d_off   d_name
    4861  directory    24          1  .
    1027  directory    24          2  ..

So what I meant is something completely untested, possibly ugly, drafted
like the below.

But tbh, I suspect that eventfs_root_lookup() could even simply be
rewritten to reuse the dentry it was provided from the VFS directly
instead of even having to go through lookup_one_len() via
eventfs_start_creating() at all. Then only the ->iterate_shared
implementation needs to call the noperm lookup variant I hastily drafted
below. But idk, I haven't given that part much thought.

 fs/namei.c            | 46 ++++++++++++++++++++++++++++++++++++++++---
 fs/tracefs/inode.c    |  2 +-
 include/linux/namei.h |  1 +
 3 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 963576e67f62..0bfd87e79c5d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2656,9 +2656,8 @@ int vfs_path_lookup(struct dentry *dentry, struct vfsmount *mnt,
 }
 EXPORT_SYMBOL(vfs_path_lookup);
 
-static int lookup_one_common(struct mnt_idmap *idmap,
-			     const char *name, struct dentry *base, int len,
-			     struct qstr *this)
+static int lookup_one_common_noperm(const char *name, struct dentry *base,
+				    int len, struct qstr *this)
 {
 	this->name = name;
 	this->len = len;
@@ -2686,6 +2685,19 @@ static int lookup_one_common(struct mnt_idmap *idmap,
 			return err;
 	}
 
+	return 0;
+}
+
+static inline int lookup_one_common(struct mnt_idmap *idmap,
+			     const char *name, struct dentry *base, int len,
+			     struct qstr *this)
+{
+	int ret;
+
+	ret = lookup_one_common_noperm(name, base, len, this);
+	if (ret)
+		return ret;
+
 	return inode_permission(idmap, base->d_inode, MAY_EXEC);
 }
 
@@ -2776,6 +2788,34 @@ struct dentry *lookup_one(struct mnt_idmap *idmap, const char *name,
 }
 EXPORT_SYMBOL(lookup_one);
 
+/**
+ * lookup_one_noperm - filesystem helper to lookup single pathname component
+ * @name:	pathname component to lookup
+ * @base:	base directory to lookup from
+ * @len:	maximum length @len should be interpreted to
+ *
+ * Note that this routine is purely a helper for filesystem usage and should
+ * not be called by generic code. This performs no permission checking on @base.
+ *
+ * The caller must hold base->i_mutex.
+ */
+struct dentry *lookup_one_noperm(const char *name, struct dentry *base, int len)
+{
+	struct dentry *dentry;
+	struct qstr this;
+	int err;
+
+	WARN_ON_ONCE(!inode_is_locked(base->d_inode));
+
+	err = lookup_one_common_noperm(name, base, len, &this);
+	if (err)
+		return ERR_PTR(err);
+
+	dentry = lookup_dcache(&this, base, 0);
+	return dentry ? dentry : __lookup_slow(&this, base, 0);
+}
+EXPORT_SYMBOL(lookup_one_noperm);
+
 /**
  * lookup_one_unlocked - filesystem helper to lookup single pathname component
  * @idmap:	idmap of the mount the lookup is performed from
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index e1b172c0e091..fa04aa6a5ce7 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -524,7 +524,7 @@ struct dentry *eventfs_start_creating(const char *name, struct dentry *parent)
 	if (unlikely(IS_DEADDIR(parent->d_inode)))
 		dentry = ERR_PTR(-ENOENT);
 	else
-		dentry = lookup_one_len(name, parent, strlen(name));
+		dentry = lookup_one_noperm(name, parent, strlen(name));
 
 	if (!IS_ERR(dentry) && dentry->d_inode) {
 		dput(dentry);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 3100371b5e32..a1491da37cbe 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -77,6 +77,7 @@ extern struct dentry *lookup_one_len(const char *, struct dentry *, int);
 extern struct dentry *lookup_one_len_unlocked(const char *, struct dentry *, int);
 extern struct dentry *lookup_positive_unlocked(const char *, struct dentry *, int);
 struct dentry *lookup_one(struct mnt_idmap *, const char *, struct dentry *, int);
+struct dentry *lookup_one_noperm(const char *, struct dentry *, int);
 struct dentry *lookup_one_unlocked(struct mnt_idmap *idmap,
 				   const char *name, struct dentry *base,
 				   int len);
Steven Rostedt Jan. 11, 2024, 9:53 p.m. UTC | #33
On Thu, 11 Jan 2024 22:01:32 +0100
Christian Brauner <brauner@kernel.org> wrote:

> What I'm pointing out in the current logic is that the caller is
> taxed twice:
> 
> (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs")
> (2) And again when you call lookup_one_len() in eventfs_start_creating()
>     _because_ the permission check in lookup_one_len() is the exact
>     same permission check again that the vfs has done
>     inode_permission(MAY_EXEC, "xfs").

As I described in: https://lore.kernel.org/all/20240110133154.6e18feb9@gandalf.local.home/

The eventfs files below "events" doesn't need the .permissions callback at
all. It's only there because the "events" inode uses it.

The .permissions call for eventfs has:

static int eventfs_permission(struct mnt_idmap *idmap,
			      struct inode *inode, int mask)
{
	set_top_events_ownership(inode);
	return generic_permission(idmap, inode, mask);
}

Where the "set_top_events_ownership() is a nop for everything but the
"events" directory.

I guess I could have two ops:

static const struct inode_operations eventfs_root_dir_inode_operations = {
	.lookup		= eventfs_root_lookup,
	.setattr	= eventfs_set_attr,
	.getattr	= eventfs_get_attr,
	.permission	= eventfs_permission,
};

static const struct inode_operations eventfs_dir_inode_operations = {
	.lookup		= eventfs_root_lookup,
	.setattr	= eventfs_set_attr,
	.getattr	= eventfs_get_attr,
};

And use the second one for all dentries below the root, but I figured it's
not that big of a deal if I called the permissions on all. Perhaps I should
do it with two?

Anyway, the issue is with "events" directory and remounting, because like
the tracefs system, the inode and dentry for "evnets" is created at boot
up, before the mount happens. The VFS layer is going to check the
permissions of its inode and dentry, which will be incorrect if the mount
was mounted with a "gid" option.


-- Steve
Christian Brauner Jan. 12, 2024, 8:27 a.m. UTC | #34
On Thu, Jan 11, 2024 at 04:53:19PM -0500, Steven Rostedt wrote:
> On Thu, 11 Jan 2024 22:01:32 +0100
> Christian Brauner <brauner@kernel.org> wrote:
> 
> > What I'm pointing out in the current logic is that the caller is
> > taxed twice:
> > 
> > (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs")
> > (2) And again when you call lookup_one_len() in eventfs_start_creating()
> >     _because_ the permission check in lookup_one_len() is the exact
> >     same permission check again that the vfs has done
> >     inode_permission(MAY_EXEC, "xfs").
> 
> As I described in: https://lore.kernel.org/all/20240110133154.6e18feb9@gandalf.local.home/
> 
> The eventfs files below "events" doesn't need the .permissions callback at
> all. It's only there because the "events" inode uses it.
> 
> The .permissions call for eventfs has:

It doesn't matter whether there's a ->permission handler. If you don't
add one explicitly the VFS will simply call generic_permission():

inode_permission()
-> do_inode_permission()
   {
        if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
                if (likely(inode->i_op->permission))
                        return inode->i_op->permission(idmap, inode, mask);
               
                /* This gets set once for the inode lifetime */
                spin_lock(&inode->i_lock);
                inode->i_opflags |= IOP_FASTPERM;
                spin_unlock(&inode->i_lock);
        }
        return generic_permission(idmap, inode, mask);
   }

> Anyway, the issue is with "events" directory and remounting, because like
> the tracefs system, the inode and dentry for "evnets" is created at boot
> up, before the mount happens. The VFS layer is going to check the
> permissions of its inode and dentry, which will be incorrect if the mount
> was mounted with a "gid" option.

The gid option has nothing to do with this and it is just handled fine
if you remove the second permission checking in (2).

You need to remove the inode_permission() code from
eventfs_start_creating(). It is just an internal lookup and the fact
that you have it in there allows userspace to break readdir on the
eventfs portions of tracefs as I've shown in the parts of the mail that
you cut off.
Steven Rostedt Jan. 12, 2024, 1:53 p.m. UTC | #35
On Fri, 12 Jan 2024 09:27:24 +0100
Christian Brauner <brauner@kernel.org> wrote:

> On Thu, Jan 11, 2024 at 04:53:19PM -0500, Steven Rostedt wrote:
> > On Thu, 11 Jan 2024 22:01:32 +0100
> > Christian Brauner <brauner@kernel.org> wrote:
> >   
> > > What I'm pointing out in the current logic is that the caller is
> > > taxed twice:
> > > 
> > > (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs")
> > > (2) And again when you call lookup_one_len() in eventfs_start_creating()
> > >     _because_ the permission check in lookup_one_len() is the exact
> > >     same permission check again that the vfs has done
> > >     inode_permission(MAY_EXEC, "xfs").  
> > 
> > As I described in: https://lore.kernel.org/all/20240110133154.6e18feb9@gandalf.local.home/
> > 
> > The eventfs files below "events" doesn't need the .permissions callback at
> > all. It's only there because the "events" inode uses it.
> > 
> > The .permissions call for eventfs has:  
> 
> It doesn't matter whether there's a ->permission handler. If you don't
> add one explicitly the VFS will simply call generic_permission():
> 
> inode_permission()
> -> do_inode_permission()  
>    {
>         if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
>                 if (likely(inode->i_op->permission))
>                         return inode->i_op->permission(idmap, inode, mask);
>                
>                 /* This gets set once for the inode lifetime */
>                 spin_lock(&inode->i_lock);
>                 inode->i_opflags |= IOP_FASTPERM;
>                 spin_unlock(&inode->i_lock);
>         }
>         return generic_permission(idmap, inode, mask);
>    }

Yes I know that, because that's where I knew what to call in the non
"events" dir case.

> 
> > Anyway, the issue is with "events" directory and remounting, because like
> > the tracefs system, the inode and dentry for "evnets" is created at boot
> > up, before the mount happens. The VFS layer is going to check the
> > permissions of its inode and dentry, which will be incorrect if the mount
> > was mounted with a "gid" option.  
> 
> The gid option has nothing to do with this and it is just handled fine
> if you remove the second permission checking in (2).

I guess I'm confused to what you are having an issue with. Is it just
that the permission check gets called twice?

> 
> You need to remove the inode_permission() code from
> eventfs_start_creating(). It is just an internal lookup and the fact
> that you have it in there allows userspace to break readdir on the
> eventfs portions of tracefs as I've shown in the parts of the mail that
> you cut off.

That's because I didn't see how it was related to the way I fixed the
mount=gid issue. Are you only concerned because of the check in
eventfs_start_creating()?

Yes, you posted code that would make things act funny for some code
that I see no real world use case for. Yeah, it may not act "properly"
but I'm not sure that's bad.

Here, I'll paste it back:

> // We managed to open the directory so we have permission to list
> // directory entries in "xfs".
> fd = open("/sys/kernel/tracing/events/xfs");
> 
> // Remove ownership so we can't open the directory anymore
> chown("/sys/kernel/tracing/events/xfs", 0, 0);
> 
> // Or just remove exec bit for the group and restrict to owner
> chmod("/sys/kernel/tracing/events/xfs", 700);
> 
> // Drop caches to force an eventfs_root_lookup() on everything
> write("/proc/sys/vm/drop_caches", "3", 1);

This requires opening the directory, then having it's permissions
change, and then immediately dropping the caches.

> 
> // Returns 0 even though directory has a lot of entries and we should be
> // able to list them
> getdents64(fd, ...);

And do we care?

Since tracing exposes internal kernel information, perhaps this is a
feature and not a bug. If someone who had access to the tracing system
and you wanted to stop them, if they had a directory open that they no
longer have access to, you don't want them to see what's left in the
directory.

In other words, I like the idea that the getends64(fd, ...) will fail!

If there's a file underneath that wasn't change, and the admin thought
that just keeping the top directory permissions off is good enough,
then that attacker having that directory open before the directory had
it's file permissions change is a way to still have access to the files
below it.

> 
> And the failure is in the inode_permission(MAY_EXEC, "xfs") call in
> lookup_one_len() in eventfs_start_creating() which now fails.

And I think is a good thing!

Again, tracefs is special. It gives you access and possibly control to
the kernel behavior. I like the fact that as soon as someone loses
permission to a directory, they immediately lose it.

-- Steve
Steven Rostedt Jan. 12, 2024, 2:22 p.m. UTC | #36
On Fri, 12 Jan 2024 08:53:44 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> > // We managed to open the directory so we have permission to list
> > // directory entries in "xfs".
> > fd = open("/sys/kernel/tracing/events/xfs");
> > 
> > // Remove ownership so we can't open the directory anymore
> > chown("/sys/kernel/tracing/events/xfs", 0, 0);
> > 
> > // Or just remove exec bit for the group and restrict to owner
> > chmod("/sys/kernel/tracing/events/xfs", 700);
> > 
> > // Drop caches to force an eventfs_root_lookup() on everything
> > write("/proc/sys/vm/drop_caches", "3", 1);  
> 
> This requires opening the directory, then having it's permissions
> change, and then immediately dropping the caches.
> 
> > 
> > // Returns 0 even though directory has a lot of entries and we should be
> > // able to list them
> > getdents64(fd, ...);  
> 
> And do we care?

Hmm, maybe the issue you have is with the inconsistency of the action?

For this to fail, it would require the admin to do both change the
permission and to flush caches. If you don't flush the caches then the
task with the dir open can still read it regardless. If the dentries
were already created.

In that case I'm fine if we change the creation of the dentries to not
check the permission.

But for now, it's just a weird side effect that I don't really see how
it would affect any user's workflow.

-- Steve
diff mbox series

Patch

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 53d34a4b5a2b..641bffa0f139 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -45,6 +45,7 @@  enum {
 	EVENTFS_SAVE_MODE	= BIT(16),
 	EVENTFS_SAVE_UID	= BIT(17),
 	EVENTFS_SAVE_GID	= BIT(18),
+	EVENTFS_TOPLEVEL	= BIT(19),
 };
 
 #define EVENTFS_MODE_MASK	(EVENTFS_SAVE_MODE - 1)
@@ -115,10 +116,17 @@  static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry,
 		 * The events directory dentry is never freed, unless its
 		 * part of an instance that is deleted. It's attr is the
 		 * default for its child files and directories.
-		 * Do not update it. It's not used for its own mode or ownership
+		 * Do not update it. It's not used for its own mode or ownership.
 		 */
-		if (!ei->is_events)
+		if (ei->is_events) {
+			/* But it still needs to know if it was modified */
+			if (iattr->ia_valid & ATTR_UID)
+				ei->attr.mode |= EVENTFS_SAVE_UID;
+			if (iattr->ia_valid & ATTR_GID)
+				ei->attr.mode |= EVENTFS_SAVE_GID;
+		} else {
 			update_attr(&ei->attr, iattr);
+		}
 
 	} else {
 		name = dentry->d_name.name;
@@ -136,9 +144,67 @@  static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry,
 	return ret;
 }
 
+static void update_top_events_attr(struct eventfs_inode *ei, struct dentry *dentry)
+{
+	struct inode *inode;
+
+	/* Only update if the "events" was on the top level */
+	if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL))
+		return;
+
+	/* Get the tracefs root from the parent */
+	inode = d_inode(dentry->d_parent);
+	inode = d_inode(inode->i_sb->s_root);
+	ei->attr.uid = inode->i_uid;
+	ei->attr.gid = inode->i_gid;
+}
+
+static void set_top_events_ownership(struct inode *inode)
+{
+	struct tracefs_inode *ti = get_tracefs(inode);
+	struct eventfs_inode *ei = ti->private;
+	struct dentry *dentry;
+
+	/* The top events directory doesn't get automatically updated */
+	if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL))
+		return;
+
+	dentry = ei->dentry;
+
+	update_top_events_attr(ei, dentry);
+
+	if (!(ei->attr.mode & EVENTFS_SAVE_UID))
+		inode->i_uid = ei->attr.uid;
+
+	if (!(ei->attr.mode & EVENTFS_SAVE_GID))
+		inode->i_gid = ei->attr.gid;
+}
+
+static int eventfs_get_attr(struct mnt_idmap *idmap,
+			    const struct path *path, struct kstat *stat,
+			    u32 request_mask, unsigned int flags)
+{
+	struct dentry *dentry = path->dentry;
+	struct inode *inode = d_backing_inode(dentry);
+
+	set_top_events_ownership(inode);
+
+	generic_fillattr(idmap, request_mask, inode, stat);
+	return 0;
+}
+
+static int eventfs_permission(struct mnt_idmap *idmap,
+			      struct inode *inode, int mask)
+{
+	set_top_events_ownership(inode);
+	return generic_permission(idmap, inode, mask);
+}
+
 static const struct inode_operations eventfs_root_dir_inode_operations = {
 	.lookup		= eventfs_root_lookup,
 	.setattr	= eventfs_set_attr,
+	.getattr	= eventfs_get_attr,
+	.permission	= eventfs_permission,
 };
 
 static const struct inode_operations eventfs_file_inode_operations = {
@@ -174,6 +240,8 @@  static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 	} while (!ei->is_events);
 	mutex_unlock(&eventfs_mutex);
 
+	update_top_events_attr(ei, dentry);
+
 	return ei;
 }
 
@@ -887,6 +955,14 @@  struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 	uid = d_inode(dentry->d_parent)->i_uid;
 	gid = d_inode(dentry->d_parent)->i_gid;
 
+	/*
+	 * If the events directory is of the top instance, then parent
+	 * is NULL. Set the attr.mode to reflect this and its permissions will
+	 * default to the tracefs root dentry.
+	 */
+	if (!parent)
+		ei->attr.mode = EVENTFS_TOPLEVEL;
+
 	/* This is used as the default ownership of the files and directories */
 	ei->attr.uid = uid;
 	ei->attr.gid = gid;
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index bc86ffdb103b..63284f18741f 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -91,6 +91,7 @@  static int tracefs_syscall_mkdir(struct mnt_idmap *idmap,
 				 struct inode *inode, struct dentry *dentry,
 				 umode_t mode)
 {
+	struct tracefs_inode *ti;
 	char *name;
 	int ret;
 
@@ -98,6 +99,15 @@  static int tracefs_syscall_mkdir(struct mnt_idmap *idmap,
 	if (!name)
 		return -ENOMEM;
 
+	/*
+	 * This is a new directory that does not take the default of
+	 * the rootfs. It becomes the default permissions for all the
+	 * files and directories underneath it.
+	 */
+	ti = get_tracefs(inode);
+	ti->flags |= TRACEFS_INSTANCE_INODE;
+	ti->private = inode;
+
 	/*
 	 * The mkdir call can call the generic functions that create
 	 * the files within the tracefs system. It is up to the individual
@@ -141,10 +151,76 @@  static int tracefs_syscall_rmdir(struct inode *inode, struct dentry *dentry)
 	return ret;
 }
 
-static const struct inode_operations tracefs_dir_inode_operations = {
+static void set_tracefs_inode_owner(struct inode *inode)
+{
+	struct tracefs_inode *ti = get_tracefs(inode);
+	struct inode *root_inode = ti->private;
+
+	/*
+	 * If this inode has never been referenced, then update
+	 * the permissions to the superblock.
+	 */
+	if (!(ti->flags & TRACEFS_UID_PERM_SET))
+		inode->i_uid = root_inode->i_uid;
+
+	if (!(ti->flags & TRACEFS_GID_PERM_SET))
+		inode->i_gid = root_inode->i_gid;
+}
+
+static int tracefs_permission(struct mnt_idmap *idmap,
+			      struct inode *inode, int mask)
+{
+	set_tracefs_inode_owner(inode);
+	return generic_permission(idmap, inode, mask);
+}
+
+static int tracefs_getattr(struct mnt_idmap *idmap,
+			   const struct path *path, struct kstat *stat,
+			   u32 request_mask, unsigned int flags)
+{
+	struct inode *inode = d_backing_inode(path->dentry);
+
+	set_tracefs_inode_owner(inode);
+	generic_fillattr(idmap, request_mask, inode, stat);
+	return 0;
+}
+
+static int tracefs_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
+			   struct iattr *attr)
+{
+	unsigned int ia_valid = attr->ia_valid;
+	struct inode *inode = d_inode(dentry);
+	struct tracefs_inode *ti = get_tracefs(inode);
+
+	if (ia_valid & ATTR_UID)
+		ti->flags |= TRACEFS_UID_PERM_SET;
+
+	if (ia_valid & ATTR_GID)
+		ti->flags |= TRACEFS_GID_PERM_SET;
+
+	return simple_setattr(idmap, dentry, attr);
+}
+
+static const struct inode_operations tracefs_instance_dir_inode_operations = {
 	.lookup		= simple_lookup,
 	.mkdir		= tracefs_syscall_mkdir,
 	.rmdir		= tracefs_syscall_rmdir,
+	.permission	= tracefs_permission,
+	.getattr	= tracefs_getattr,
+	.setattr	= tracefs_setattr,
+};
+
+static const struct inode_operations tracefs_dir_inode_operations = {
+	.lookup		= simple_lookup,
+	.permission	= tracefs_permission,
+	.getattr	= tracefs_getattr,
+	.setattr	= tracefs_setattr,
+};
+
+static const struct inode_operations tracefs_file_inode_operations = {
+	.permission	= tracefs_permission,
+	.getattr	= tracefs_getattr,
+	.setattr	= tracefs_setattr,
 };
 
 struct inode *tracefs_get_inode(struct super_block *sb)
@@ -183,87 +259,6 @@  struct tracefs_fs_info {
 	struct tracefs_mount_opts mount_opts;
 };
 
-static void change_gid(struct dentry *dentry, kgid_t gid)
-{
-	if (!dentry->d_inode)
-		return;
-	dentry->d_inode->i_gid = gid;
-}
-
-/*
- * Taken from d_walk, but without he need for handling renames.
- * Nothing can be renamed while walking the list, as tracefs
- * does not support renames. This is only called when mounting
- * or remounting the file system, to set all the files to
- * the given gid.
- */
-static void set_gid(struct dentry *parent, kgid_t gid)
-{
-	struct dentry *this_parent;
-	struct list_head *next;
-
-	this_parent = parent;
-	spin_lock(&this_parent->d_lock);
-
-	change_gid(this_parent, gid);
-repeat:
-	next = this_parent->d_subdirs.next;
-resume:
-	while (next != &this_parent->d_subdirs) {
-		struct tracefs_inode *ti;
-		struct list_head *tmp = next;
-		struct dentry *dentry = list_entry(tmp, struct dentry, d_child);
-		next = tmp->next;
-
-		/* Note, getdents() can add a cursor dentry with no inode */
-		if (!dentry->d_inode)
-			continue;
-
-		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
-
-		change_gid(dentry, gid);
-
-		/* If this is the events directory, update that too */
-		ti = get_tracefs(dentry->d_inode);
-		if (ti && (ti->flags & TRACEFS_EVENT_INODE))
-			eventfs_update_gid(dentry, gid);
-
-		if (!list_empty(&dentry->d_subdirs)) {
-			spin_unlock(&this_parent->d_lock);
-			spin_release(&dentry->d_lock.dep_map, _RET_IP_);
-			this_parent = dentry;
-			spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
-			goto repeat;
-		}
-		spin_unlock(&dentry->d_lock);
-	}
-	/*
-	 * All done at this level ... ascend and resume the search.
-	 */
-	rcu_read_lock();
-ascend:
-	if (this_parent != parent) {
-		struct dentry *child = this_parent;
-		this_parent = child->d_parent;
-
-		spin_unlock(&child->d_lock);
-		spin_lock(&this_parent->d_lock);
-
-		/* go into the first sibling still alive */
-		do {
-			next = child->d_child.next;
-			if (next == &this_parent->d_subdirs)
-				goto ascend;
-			child = list_entry(next, struct dentry, d_child);
-		} while (unlikely(child->d_flags & DCACHE_DENTRY_KILLED));
-		rcu_read_unlock();
-		goto resume;
-	}
-	rcu_read_unlock();
-	spin_unlock(&this_parent->d_lock);
-	return;
-}
-
 static int tracefs_parse_options(char *data, struct tracefs_mount_opts *opts)
 {
 	substring_t args[MAX_OPT_ARGS];
@@ -336,10 +331,8 @@  static int tracefs_apply_options(struct super_block *sb, bool remount)
 	if (!remount || opts->opts & BIT(Opt_uid))
 		inode->i_uid = opts->uid;
 
-	if (!remount || opts->opts & BIT(Opt_gid)) {
-		/* Set all the group ids to the mount option */
-		set_gid(sb->s_root, opts->gid);
-	}
+	if (!remount || opts->opts & BIT(Opt_gid))
+		inode->i_gid = opts->gid;
 
 	return 0;
 }
@@ -573,6 +566,33 @@  struct dentry *eventfs_end_creating(struct dentry *dentry)
 	return dentry;
 }
 
+/* Find the inode that this will use for default */
+static struct inode *instance_inode(struct dentry *parent, struct inode *inode)
+{
+	struct tracefs_inode *ti;
+	struct inode *root_inode;
+
+	root_inode = d_inode(inode->i_sb->s_root);
+
+	/* If parent is NULL then use root inode */
+	if (!parent)
+		return root_inode;
+
+	/* Find the inode that is flagged as an instance or the root inode */
+	do {
+		inode = d_inode(parent);
+		if (inode == root_inode)
+			return root_inode;
+
+		ti = get_tracefs(inode);
+
+		if (ti->flags & TRACEFS_INSTANCE_INODE)
+			return inode;
+	} while ((parent = parent->d_parent));
+
+	return NULL;
+}
+
 /**
  * tracefs_create_file - create a file in the tracefs filesystem
  * @name: a pointer to a string containing the name of the file to create.
@@ -603,6 +623,7 @@  struct dentry *tracefs_create_file(const char *name, umode_t mode,
 				   struct dentry *parent, void *data,
 				   const struct file_operations *fops)
 {
+	struct tracefs_inode *ti;
 	struct dentry *dentry;
 	struct inode *inode;
 
@@ -621,7 +642,11 @@  struct dentry *tracefs_create_file(const char *name, umode_t mode,
 	if (unlikely(!inode))
 		return tracefs_failed_creating(dentry);
 
+	ti = get_tracefs(inode);
+	ti->private = instance_inode(parent, inode);
+
 	inode->i_mode = mode;
+	inode->i_op = &tracefs_file_inode_operations;
 	inode->i_fop = fops ? fops : &tracefs_file_operations;
 	inode->i_private = data;
 	inode->i_uid = d_inode(dentry->d_parent)->i_uid;
@@ -634,6 +659,7 @@  struct dentry *tracefs_create_file(const char *name, umode_t mode,
 static struct dentry *__create_dir(const char *name, struct dentry *parent,
 				   const struct inode_operations *ops)
 {
+	struct tracefs_inode *ti;
 	struct dentry *dentry = tracefs_start_creating(name, parent);
 	struct inode *inode;
 
@@ -651,6 +677,9 @@  static struct dentry *__create_dir(const char *name, struct dentry *parent,
 	inode->i_uid = d_inode(dentry->d_parent)->i_uid;
 	inode->i_gid = d_inode(dentry->d_parent)->i_gid;
 
+	ti = get_tracefs(inode);
+	ti->private = instance_inode(parent, inode);
+
 	/* directory inodes start off with i_nlink == 2 (for "." entry) */
 	inc_nlink(inode);
 	d_instantiate(dentry, inode);
@@ -681,7 +710,7 @@  struct dentry *tracefs_create_dir(const char *name, struct dentry *parent)
 	if (security_locked_down(LOCKDOWN_TRACEFS))
 		return NULL;
 
-	return __create_dir(name, parent, &simple_dir_inode_operations);
+	return __create_dir(name, parent, &tracefs_dir_inode_operations);
 }
 
 /**
@@ -712,7 +741,7 @@  __init struct dentry *tracefs_create_instance_dir(const char *name,
 	if (WARN_ON(tracefs_ops.mkdir || tracefs_ops.rmdir))
 		return NULL;
 
-	dentry = __create_dir(name, parent, &tracefs_dir_inode_operations);
+	dentry = __create_dir(name, parent, &tracefs_instance_dir_inode_operations);
 	if (!dentry)
 		return NULL;
 
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 42bdeb471a07..12b7d0150ae9 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -5,6 +5,9 @@ 
 enum {
 	TRACEFS_EVENT_INODE		= BIT(1),
 	TRACEFS_EVENT_TOP_INODE		= BIT(2),
+	TRACEFS_GID_PERM_SET		= BIT(3),
+	TRACEFS_UID_PERM_SET		= BIT(4),
+	TRACEFS_INSTANCE_INODE		= BIT(5),
 };
 
 struct tracefs_inode {