Message ID | 20240103203246.115732ec@gandalf.local.home (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | tracefs/eventfs: Use root and instance inodes as default ownership | expand |
On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > + /* Get the tracefs root from the parent */ > + inode = d_inode(dentry->d_parent); > + inode = d_inode(inode->i_sb->s_root); That makes no sense. First of all, for any positive dentry we have dentry->d_sb == dentry->d_inode->i_sb. And it's the same for all dentries on given superblock. So what's the point of that dance? If you want the root inode, just go for d_inode(dentry->d_sb->s_root) and be done with that...
On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > +static struct inode *instance_inode(struct dentry *parent, struct inode *inode) > +{ > + struct tracefs_inode *ti; > + struct inode *root_inode; > + > + root_inode = d_inode(inode->i_sb->s_root); > + > + /* If parent is NULL then use root inode */ > + if (!parent) > + return root_inode; > + > + /* Find the inode that is flagged as an instance or the root inode */ > + do { > + inode = d_inode(parent); > + if (inode == root_inode) > + return root_inode; > + > + ti = get_tracefs(inode); > + > + if (ti->flags & TRACEFS_INSTANCE_INODE) > + return inode; > + } while ((parent = parent->d_parent)); *blink* This is equivalent to ... parent = parent->d_parent; } while (true); ->d_parent is *never* NULL. And what the hell is that loop supposed to do, anyway? Find the nearest ancestor tagged with TRACEFS_INSTANCE_INODE? If root is not marked that way, I would suggest if (!parent) parent = inode->i_sb->s_root; while (!IS_ROOT(parent)) { struct tracefs_inode *ti = get_tracefs(parent->d_inode); if (ti->flags & TRACEFS_INSTANCE_INODE) break; parent = parent->d_parent; } return parent->d_inode;
On Thu, 4 Jan 2024 01:59:10 +0000 Al Viro <viro@zeniv.linux.org.uk> wrote: > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > > > +static struct inode *instance_inode(struct dentry *parent, struct inode *inode) > > +{ > > + struct tracefs_inode *ti; > > + struct inode *root_inode; > > + > > + root_inode = d_inode(inode->i_sb->s_root); > > + > > + /* If parent is NULL then use root inode */ > > + if (!parent) > > + return root_inode; > > + > > + /* Find the inode that is flagged as an instance or the root inode */ > > + do { > > + inode = d_inode(parent); > > + if (inode == root_inode) > > + return root_inode; > > + > > + ti = get_tracefs(inode); > > + > > + if (ti->flags & TRACEFS_INSTANCE_INODE) > > + return inode; > > + } while ((parent = parent->d_parent)); > > *blink* > > This is equivalent to > ... > parent = parent->d_parent; > } while (true); Yeah, that loop went through a few iterations as I first thought that root was a tracefs_inode and the get_tracefs() would work on it. No, it was not, and it caused a cash. But I didn't rewrite the loop well after fixing that. I was also not sure if parent could be NULL, and wanted to protect against it. > > ->d_parent is *never* NULL. And what the hell is that loop supposed to do, > anyway? Find the nearest ancestor tagged with TRACEFS_INSTANCE_INODE? > > If root is not marked that way, I would suggest > if (!parent) > parent = inode->i_sb->s_root; > while (!IS_ROOT(parent)) { > struct tracefs_inode *ti = get_tracefs(parent->d_inode); > if (ti->flags & TRACEFS_INSTANCE_INODE) > break; > parent = parent->d_parent; > } > return parent->d_inode; Sure, I could rewrite it that way too. Thanks, -- Steve
On Thu, 4 Jan 2024 01:48:37 +0000 Al Viro <viro@zeniv.linux.org.uk> wrote: > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > > > + /* Get the tracefs root from the parent */ > > + inode = d_inode(dentry->d_parent); > > + inode = d_inode(inode->i_sb->s_root); > > That makes no sense. First of all, for any positive dentry we have > dentry->d_sb == dentry->d_inode->i_sb. And it's the same for all > dentries on given superblock. So what's the point of that dance? > If you want the root inode, just go for d_inode(dentry->d_sb->s_root) > and be done with that... That was more of thinking that the dentry and dentry->d_parent are different. As dentry is part of eventfs and dentry->d_parent is part of tracefs. Currently they both have the same superblock so yeah, I could just write it that way too and it would work. But in my head, I was thinking that they behave differently and maybe one day eventfs would get its own superblock which would not work. To explain this better: /sys/kernel/tracing/ is the parent of /sys/kernel/tracing/events But everything but "events" in /sys/kernel/tracing/* is part of tracefs. Everything in /sys/kernel/tracing/events is part of eventfs. That was my thought process. But as both tracefs and eventfs still use tracefs_get_inode(), it would work as you state. I'll update that, as I don't foresee that eventfs will become its own file system. Thanks, -- Steve
On Wed, Jan 03, 2024 at 09:25:06PM -0500, Steven Rostedt wrote: > On Thu, 4 Jan 2024 01:48:37 +0000 > Al Viro <viro@zeniv.linux.org.uk> wrote: > > > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > > > > > + /* Get the tracefs root from the parent */ > > > + inode = d_inode(dentry->d_parent); > > > + inode = d_inode(inode->i_sb->s_root); > > > > That makes no sense. First of all, for any positive dentry we have > > dentry->d_sb == dentry->d_inode->i_sb. And it's the same for all > > dentries on given superblock. So what's the point of that dance? > > If you want the root inode, just go for d_inode(dentry->d_sb->s_root) > > and be done with that... > > That was more of thinking that the dentry and dentry->d_parent are > different. As dentry is part of eventfs and dentry->d_parent is part of > tracefs. ??? >Currently they both have the same superblock so yeah, I could just > write it that way too and it would work. But in my head, I was thinking > that they behave differently and maybe one day eventfs would get its own > superblock which would not work. ->d_parent *always* points to the same filesystem; if you get an (automounted?) mountpoint there, ->d_parent simply won't work - it will point to dentry itself. > To explain this better: > > /sys/kernel/tracing/ is the parent of /sys/kernel/tracing/events > > But everything but "events" in /sys/kernel/tracing/* is part of tracefs. > Everything in /sys/kernel/tracing/events is part of eventfs. > > That was my thought process. But as both tracefs and eventfs still use > tracefs_get_inode(), it would work as you state. > > I'll update that, as I don't foresee that eventfs will become its own file > system. There is no way to get to underlying mountpoint by dentry - simply because the same fs instance can be mounted in any number of places. A crude overview of taxonomy: file_system_type: what filesystem instances belong to. Not quite the same thing as fs driver (one driver can provide several of those). Usually it's 1-to-1, but that's not required (e.g. NFS vs NFSv4, or ext[234], or...). super_block: individual filesystem instance. Hosts dentry tree (connected or several disconnected parts - think NFSv4 or the state while trying to get a dentry by fhandle, etc.). dentry: object in a filesystem's directory tree(s). Always belongs to specific filesystem instance - that relationship never changes. Tree structure (and names) _within_ _filesystem_ belong on that level. ->d_parent is part of that tree structure; never NULL, root of a (sub)tree has it pointing to itself. Might be negative, might refer to a filesystem object (file, directory, symlink, etc.). inode: filesystem object (file, directory, etc.). Always belongs to specific filesystem instance. Non-directory inodes might have any number of dentry instances (aliases) refering to it; a directory one - no more than one. Filesystem object contents belongs here; multiple hardlinks have different dentries and the same inode. Of course, filesystem type in question might have no such thing as multiple hardlinks - that's up to filesystem. In general there is no way to find (or enumerate) such links; e.g. a local filesystem might have an extra hardlink somewhere we had never looked at and there won't be any dentries for such hardlinks and no way to get them short of doing the full search of the entire tree. The situation with e.g. NFS client is even worse, obviously. mount: in a sense, mount to super_block is what dentry is to inode. It provides a view of (sub)tree hosted in given filesystem instance. The same filesystem may have any number of mounts, refering to its subtrees (possibly the same subtree for each, possibly all different - up to the callers of mount(2)). They form mount tree(s) - that's where the notions related to "this mounted on top of that" belong. Note that they can be moved around - with no telling the filesystem about that happening. Again, there's no such thing as "the mountpoint of given filesystem instance" - it might be mounted in any number of places at the same time. Specific mount - sure, no problem, except that it can move around. namespace: mount tree. Unlike everything prior, this one is a part of process state - same as descriptor table, mappings, etc. file: opened IO channel. It does refer to specific mount and specific dentry (and thus filesystem instance and an inode on it). Current IO position lives here, so does any per-open(2) state. descriptor table: mapping from numbers to IO channels (opened files). Again, a part of process state. dup() creates a new entry, with reference to the same file as the old one; multiple open() of the same pathname will each yield a separate opened file. _Some_ state belongs here (close-on-exec, mostly). Note that there's no such thing as "the descriptor of this file" - not even "the user-supplied number that had been used to get the file we are currently reading from", since that number might be refering to something entirely different right after we'd resolved it to opened file and that happens *without* disrupting the operation.
On Thu, 4 Jan 2024 04:39:45 +0000 Al Viro <viro@zeniv.linux.org.uk> wrote: > On Wed, Jan 03, 2024 at 09:25:06PM -0500, Steven Rostedt wrote: > > On Thu, 4 Jan 2024 01:48:37 +0000 > > Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > > > > > > > + /* Get the tracefs root from the parent */ > > > > + inode = d_inode(dentry->d_parent); > > > > + inode = d_inode(inode->i_sb->s_root); > > > > > > That makes no sense. First of all, for any positive dentry we have > > > dentry->d_sb == dentry->d_inode->i_sb. And it's the same for all > > > dentries on given superblock. So what's the point of that dance? > > > If you want the root inode, just go for d_inode(dentry->d_sb->s_root) > > > and be done with that... > > > > That was more of thinking that the dentry and dentry->d_parent are > > different. As dentry is part of eventfs and dentry->d_parent is part of > > tracefs. > > ??? > > >Currently they both have the same superblock so yeah, I could just > > write it that way too and it would work. But in my head, I was thinking > > that they behave differently and maybe one day eventfs would get its own > > superblock which would not work. > > ->d_parent *always* points to the same filesystem; if you get an (automounted?) > mountpoint there, ->d_parent simply won't work - it will point to dentry itself. > This is the "tribal knowledge" I'm talking about. I really didn't know how the root dentry parent worked. I guess that makes sense, as it matches the '..' of a directory, and the '/' directory '..' points to itself. Although mounted file systems do not behave that way. My /proc/.. is '/'. I just figured that the dentry->d_parent would be similar. Learn something everyday. > > To explain this better: > > > > /sys/kernel/tracing/ is the parent of /sys/kernel/tracing/events > > > > But everything but "events" in /sys/kernel/tracing/* is part of tracefs. > > Everything in /sys/kernel/tracing/events is part of eventfs. > > > > That was my thought process. But as both tracefs and eventfs still use > > tracefs_get_inode(), it would work as you state. > > > > I'll update that, as I don't foresee that eventfs will become its own file > > system. > > There is no way to get to underlying mountpoint by dentry - simply because > the same fs instance can be mounted in any number of places. OK, so the dentry is still separate from the path and tied closer to the inode. > > A crude overview of taxonomy: > > file_system_type: what filesystem instances belong to. Not quite the same > thing as fs driver (one driver can provide several of those). Usually > it's 1-to-1, but that's not required (e.g. NFS vs NFSv4, or ext[234], or...). I don't know the difference between NFS and NFSv4 as I just used whatever was the latest. But I understand the ext[234] part. > > super_block: individual filesystem instance. Hosts dentry tree (connected or > several disconnected parts - think NFSv4 or the state while trying to get > a dentry by fhandle, etc.). I don't know how NFSv4 works, I'm only a user of it, I never actually looked at the code. So that's not the best example, at least for me. > > dentry: object in a filesystem's directory tree(s). Always belongs to > specific filesystem instance - that relationship never changes. Tree > structure (and names) _within_ _filesystem_ belong on that level. > ->d_parent is part of that tree structure; never NULL, root of a (sub)tree > has it pointing to itself. Might be negative, might refer to a filesystem object > (file, directory, symlink, etc.). This is useful. > > inode: filesystem object (file, directory, etc.). Always belongs to > specific filesystem instance. Non-directory inodes might have any > number of dentry instances (aliases) refering to it; a directory one - no > more than one. This above is very useful knowledge that I did not know. That directory inodes can only have a single dentry. > Filesystem object contents belongs here; multiple hardlinks > have different dentries and the same inode. So, can I assume that an inode could only have as many dentries as hard links? I know directories are only allowed to have a single hard link. Is that why they can only have a single dentry? > Of course, filesystem type in > question might have no such thing as multiple hardlinks - that's up to > filesystem. In general there is no way to find (or enumerate) such links; > e.g. a local filesystem might have an extra hardlink somewhere we had > never looked at and there won't be any dentries for such hardlinks and > no way to get them short of doing the full search of the entire tree. > The situation with e.g. NFS client is even worse, obviously. > > mount: in a sense, mount to super_block is what dentry is to inode. It > provides a view of (sub)tree hosted in given filesystem instance. The > same filesystem may have any number of mounts, refering to its subtrees > (possibly the same subtree for each, possibly all different - up to > the callers of mount(2)). They form mount tree(s) - that's where the > notions related to "this mounted on top of that" belong. Note that > they can be moved around - with no telling the filesystem about that > happening. Again, there's no such thing as "the mountpoint of given > filesystem instance" - it might be mounted in any number of places > at the same time. Specific mount - sure, no problem, except that it > can move around. > > namespace: mount tree. Unlike everything prior, this one is a part of > process state - same as descriptor table, mappings, etc. And I'm guessing namespace is for containers. At least that's what I've been assuming they are for. > > file: opened IO channel. It does refer to specific mount and specific > dentry (and thus filesystem instance and an inode on it). Current > IO position lives here, so does any per-open(2) state. And IIUC, this is what maps to a processes fd table. That is, the process's file descriptor number it passes to the kernel will be mapped to this "file". > > descriptor table: mapping from numbers to IO channels (opened files). This is that "process fd table" I mentioned above (I wrote that before reading this). > Again, a part of process state. dup() creates a new entry, with > reference to the same file as the old one; multiple open() of the Hmm, wouldn't "dup()" create another "file" that just points to the same dentry? It wouldn't be the "same file", or did you mean "file" from the user space point of view? > same pathname will each yield a separate opened file. _Some_ state > belongs here (close-on-exec, mostly). Note that there's no such > thing as "the descriptor of this file" - not even "the user-supplied > number that had been used to get the file we are currently reading > from", since that number might be refering to something entirely > different right after we'd resolved it to opened file and that > happens *without* disrupting the operation. This last paragraph confused me. What do you mean by ""referring to something entirely different"? Thanks for this overview. It was very useful, and something I think we should add to kernel doc. I did read Documentation/filesystems/vfs.rst but honestly, I think your writeup here is a better overview. -- Steve
On Thu, Jan 04, 2024 at 10:05:44AM -0500, Steven Rostedt wrote: > This is the "tribal knowledge" I'm talking about. I really didn't know how > the root dentry parent worked. I guess that makes sense, as it matches the > '..' of a directory, and the '/' directory '..' points to itself. Although > mounted file systems do not behave that way. My /proc/.. is '/'. I just > figured that the dentry->d_parent would be similar. Learn something everyday. What would you expect to happen if you have the same filesystem mounted in several places? Having separate dentry trees would be a nightmare - you'd get cache coherency problems from hell. It's survivable for procfs, but for something like a normal local filesystem it'd become very painful. And if we want them to share dentry tree, how do you choose where the .. would lead from the root dentry? The way it's done is that linkage between the trees is done separately - there's a tree of struct mount (well, forest, really - different processes can easily have separate trees, which is how namespaces are done) and each node in the mount tree refers to a dentry (sub)tree in some filesystem instance. Location is represented by (mount, dentry) pair and handling of .. is basically (modulo refcounting, locking, error handling, etc.) while dentry == subtree_root(mount) && mount != mountpoint_mount(mount) // cross into the mountpoint under it dentry = mountpoint_dentry(mount) mount = mountpoint_mount(mount) go_into(mount, dentry->d_parent) Note that you can have e.g. /usr/lib/gcc/x86_64-linux-gnu/12 mounted on /mnt/blah: ; mount --bind /usr/lib/gcc/x86_64-linux-gnu/12 /mnt/blah will do it. Then e.g. /mnt/blah/include will resolve to the same dentry as /usr/lib/gcc/x86_64-linux-gnu/12/include, etc. ; chdir /mnt/blah ; ls 32 crtprec80.o libgomp.so libsanitizer.spec cc1 g++-mapper-server libgomp.spec libssp_nonshared.a cc1plus include libitm.a libstdc++.a collect2 libasan.a libitm.so libstdc++fs.a crtbegin.o libasan_preinit.o libitm.spec libstdc++.so crtbeginS.o libasan.so liblsan.a libsupc++.a crtbeginT.o libatomic.a liblsan_preinit.o libtsan.a crtend.o libatomic.so liblsan.so libtsan_preinit.o crtendS.o libbacktrace.a liblto_plugin.so libtsan.so crtfastmath.o libcc1.so libobjc.a libubsan.a crtoffloadbegin.o libgcc.a libobjc_gc.a libubsan.so crtoffloadend.o libgcc_eh.a libobjc_gc.so lto1 crtoffloadtable.o libgcc_s.so libobjc.so lto-wrapper crtprec32.o libgcov.a libquadmath.a plugin crtprec64.o libgomp.a libquadmath.so x32 We obviously want .. to resolve to /mnt, though. ; ls .. ; ls /usr/lib/gcc/x86_64-linux-gnu/ 12 So the trigger for "cross into underlying mountpoint" has to be "dentry is the root of subtree mount refers to" - it depends upon the mount we are in. > > Filesystem object contents belongs here; multiple hardlinks > > have different dentries and the same inode. > > So, can I assume that an inode could only have as many dentries as hard > links? I know directories are only allowed to have a single hard link. Is > that why they can only have a single dentry? Not quite. Single alias for directories is more about cache coherency fun; we really can't afford multiple aliases for those. For non-directories it's possible to have an entirely disconnected dentry refering to that sucker; if somebody hands you an fhandle with no indication of the parent directory, you might end up having to do one of those, no matter how many times you find the same inode later. Not an issue for tracefs, though. > > namespace: mount tree. Unlike everything prior, this one is a part of > > process state - same as descriptor table, mappings, etc. > > And I'm guessing namespace is for containers. At least that's what I've > been assuming they are for. It predates containers by quite a few years, but yes, that's one of the users. It is related to virtual machines, in the same sense the set of memory mappings is - each thread can be thought of as a VM, with a bunch of components. Just as mmap() manipulates the virtual address translation for the threads that share memory space with the caller, mount() manipulates the pathname resolution for the threads that share the namespace with the caller. > > descriptor table: mapping from numbers to IO channels (opened files). > > This is that "process fd table" I mentioned above (I wrote that before > reading this). > > > Again, a part of process state. dup() creates a new entry, with > > reference to the same file as the old one; multiple open() of the > > Hmm, wouldn't "dup()" create another "file" that just points to the same > dentry? It wouldn't be the "same file", or did you mean "file" from the > user space point of view? No. The difference between open() and dup() is that the latter will result in a descriptor that really refers to the same file. Current IO position belongs to IO channel; it doesn't matter for e.g. terminals, but for regular file it immediately becomes an issue. fd1 = open("foo", 0); fd2 = open("foo", 0); read(fd1, &c1, 1); read(fd2, &c2, 1); will result in the first byte of foo read into c1 and c2, but fd1 = open("foo", 0); fd2 = dup(fd1); read(fd1, &c1, 1); read(fd2, &c2, 1); will have the first byte of foo in c1 and the second one - in c2. open() yields a new IO channel attached to new descriptor; dup() (and dup2()) attaches the existing IO channel to new descriptor. fork() acts like dup() in that respect - child gets its descriptor table populated with references to the same IO channels as the parent does. Any Unix since about '71 has it done that way and the same goes for NT, DOS, etc. - you can't implement redirects to/from regular files without that distinction. Unfortunately, the terms are clumsy as hell - POSIX ends up with "file descriptor" (for numbers) vs. "file description" (for IO channels), which is hard to distinguish when reading and just as hard to distinguish when listening. "Opened file" (as IO channel) vs. "file on disc" (as collection of data that might be accessed via said channels) distinction on top of that also doesn't help, to put it mildly. It's many decades too late to do anything about, unfortunately. Pity the UNIX 101 students... ;-/ The bottom line: * struct file represents an IO channel; it might be operating on various objects, including regular files, pipes, sockets, etc. * current IO position is a property of IO channel. * struct files_struct represents a descriptor table; each of those maps numbers to IO channels. * each thread uses a descriptor table to turn numbers ("file descriptors") into struct file references. Different threads might share the same descriptor table or have separate descriptor tables. current->files points to the descriptor table of the current thread. * open() creates a new IO channel and attaches it to an unused position in descriptor table. * dup(n) takes the IO channel from position 'n' in descriptor table and attaches it to an unused position. * dup2(old, new) takes the IO channel from position 'old' and attaches it to position 'new'; if there used to be something in position 'new', it gets detached. * close(n) takes the IO channel from position 'n', flushes and detaches it. Note that it IO channel itself is *NOT* closed until all references to it are gone. E.g. open() + fork() + (in parent) close() will end up with the child's descriptor table keeping a reference to IO channel established by open(); close() in parent will not shut the channel down. The same goes for implicit close() done by dup2() or by exit(), etc. * things like mmap() retain struct file references; open() + mmap() + close() ends up with struct file left (in vma->vm_file) alive and well for as long as the mapping exists, nevermind the reference that used to be in descriptor table. In other words, IO channels can exist with no references in any descriptor tables. There are other ways for such situation to occur (e.g. SCM_RIGHTS stuff); it's entirely normal. > > same pathname will each yield a separate opened file. _Some_ state > > belongs here (close-on-exec, mostly). Note that there's no such > > thing as "the descriptor of this file" - not even "the user-supplied > > number that had been used to get the file we are currently reading > > from", since that number might be refering to something entirely > > different right after we'd resolved it to opened file and that > > happens *without* disrupting the operation. > > This last paragraph confused me. What do you mean by ""referring to > something entirely different"? Two threads share descriptor table; one of them is in read(fd, ...), another does dup2(fd2, fd). If read() gets past the point where it gets struct file reference, it will keep accessing that IO channel. dup2() will replace the reference in descriptor table, but that won't disrupt the read()... > > Thanks for this overview. It was very useful, and something I think we > should add to kernel doc. I did read Documentation/filesystems/vfs.rst but > honestly, I think your writeup here is a better overview. At the very least it would need serious reordering ;-/
On Thu, Jan 04, 2024 at 10:05:44AM -0500, Steven Rostedt wrote: > > file_system_type: what filesystem instances belong to. Not quite the same > > thing as fs driver (one driver can provide several of those). Usually > > it's 1-to-1, but that's not required (e.g. NFS vs NFSv4, or ext[234], or...). > > I don't know the difference between NFS and NFSv4 as I just used whatever > was the latest. But I understand the ext[234] part. What Al's sying is that nfs.ko provides both nfs_fs_type and nfs4_fs_type. ext4.ko provides ext2_fs_type, ext3_fs_type and ext4_fs_type. This is allowed but anomalous. Most filesystems provide only one, eg ocfs2_fs_type. > > > > super_block: individual filesystem instance. Hosts dentry tree (connected or > > several disconnected parts - think NFSv4 or the state while trying to get > > a dentry by fhandle, etc.). > > I don't know how NFSv4 works, I'm only a user of it, I never actually > looked at the code. So that's not the best example, at least for me. Right, so NFS (v4 or otherwise) is Special. In the protocol, files are identified by a thing called an fhandle. This is (iirc) a 32-byte identifier which must persist across server reboot. Originally it was probably supposed to encode dev_t plus ino_t plus generation number. But you can do all kinds of things in the NFS protocol with an fhandle that you need a dentry for in Linux (like path walks). Unfortunately, clients can't be told "Hey, we've lost context, please rewalk" (which would have other problems anyway), so we need a way to find the dentry for an fhandle. I understand this very badly, but essentially we end up looking for canonical ones, and then creating isolated trees of dentries if we can't find them. Sometimes we then graft these isolated trees into the canonical spots if we end up connecting them through various filesystem activity. At least that's my understanding which probably contains several misunderstandings. > > Filesystem object contents belongs here; multiple hardlinks > > have different dentries and the same inode. > > So, can I assume that an inode could only have as many dentries as hard > links? I know directories are only allowed to have a single hard link. Is > that why they can only have a single dentry? There could be more. For example, I could open("A"); ln("A", "B"); open("B"); rm("A"); ln("B", "C"); open("C"); rm("B"). Now there are three dentries for this inode, its link count is currently one and never exceeded two. > Thanks for this overview. It was very useful, and something I think we > should add to kernel doc. I did read Documentation/filesystems/vfs.rst but > honestly, I think your writeup here is a better overview. Documentation/filesystems/locking.rst is often a better source, although the two should really be merged. Not for the faint-hearted.
On Thu, 4 Jan 2024 18:25:02 +0000 Al Viro <viro@zeniv.linux.org.uk> wrote: > On Thu, Jan 04, 2024 at 10:05:44AM -0500, Steven Rostedt wrote: > > > This is the "tribal knowledge" I'm talking about. I really didn't know how > > the root dentry parent worked. I guess that makes sense, as it matches the > > '..' of a directory, and the '/' directory '..' points to itself. Although > > mounted file systems do not behave that way. My /proc/.. is '/'. I just > > figured that the dentry->d_parent would be similar. Learn something everyday. > > What would you expect to happen if you have the same filesystem mounted in > several places? Having separate dentry trees would be a nightmare - you'd > get cache coherency problems from hell. It's survivable for procfs, but > for something like a normal local filesystem it'd become very painful. > And if we want them to share dentry tree, how do you choose where the .. > would lead from the root dentry? My mistake was thinking that the dentry was attached more to the path than the inode. But that doesn't seem to be the case. I wasn't sure if there was a way to get to a dentry from the inode. I see the i_dentry list, which is a list, where I got some of my idea that dentry was closer to path than inode. > > The way it's done is that linkage between the trees is done separately - > there's a tree of struct mount (well, forest, really - different processes > can easily have separate trees, which is how namespaces are done) and > each node in the mount tree refers to a dentry (sub)tree in some filesystem > instance. Location is represented by (mount, dentry) pair and handling of > .. is basically (modulo refcounting, locking, error handling, etc.) > while dentry == subtree_root(mount) && mount != mountpoint_mount(mount) > // cross into the mountpoint under it > dentry = mountpoint_dentry(mount) > mount = mountpoint_mount(mount) > go_into(mount, dentry->d_parent) > > Note that you can have e.g. /usr/lib/gcc/x86_64-linux-gnu/12 mounted on /mnt/blah: > ; mount --bind /usr/lib/gcc/x86_64-linux-gnu/12 /mnt/blah > will do it. Then e.g. /mnt/blah/include will resolve to the same dentry as > /usr/lib/gcc/x86_64-linux-gnu/12/include, etc. > ; chdir /mnt/blah > ; ls > 32 crtprec80.o libgomp.so libsanitizer.spec > cc1 g++-mapper-server libgomp.spec libssp_nonshared.a > cc1plus include libitm.a libstdc++.a > collect2 libasan.a libitm.so libstdc++fs.a > crtbegin.o libasan_preinit.o libitm.spec libstdc++.so > crtbeginS.o libasan.so liblsan.a libsupc++.a > crtbeginT.o libatomic.a liblsan_preinit.o libtsan.a > crtend.o libatomic.so liblsan.so libtsan_preinit.o > crtendS.o libbacktrace.a liblto_plugin.so libtsan.so > crtfastmath.o libcc1.so libobjc.a libubsan.a > crtoffloadbegin.o libgcc.a libobjc_gc.a libubsan.so > crtoffloadend.o libgcc_eh.a libobjc_gc.so lto1 > crtoffloadtable.o libgcc_s.so libobjc.so lto-wrapper > crtprec32.o libgcov.a libquadmath.a plugin > crtprec64.o libgomp.a libquadmath.so x32 > > We obviously want .. to resolve to /mnt, though. > ; ls .. > ; ls /usr/lib/gcc/x86_64-linux-gnu/ > 12 > > So the trigger for "cross into underlying mountpoint" has to be "dentry is > the root of subtree mount refers to" - it depends upon the mount we are > in. > > > > Filesystem object contents belongs here; multiple hardlinks > > > have different dentries and the same inode. > > > > So, can I assume that an inode could only have as many dentries as hard > > links? I know directories are only allowed to have a single hard link. Is > > that why they can only have a single dentry? > > Not quite. Single alias for directories is more about cache coherency > fun; we really can't afford multiple aliases for those. For non-directories > it's possible to have an entirely disconnected dentry refering to that > sucker; if somebody hands you an fhandle with no indication of the parent > directory, you might end up having to do one of those, no matter how many > times you find the same inode later. Not an issue for tracefs, though. > > > > namespace: mount tree. Unlike everything prior, this one is a part of > > > process state - same as descriptor table, mappings, etc. > > > > And I'm guessing namespace is for containers. At least that's what I've > > been assuming they are for. > > It predates containers by quite a few years, but yes, that's one of the > users. It is related to virtual machines, in the same sense the set > of memory mappings is - each thread can be thought of as a VM, with > a bunch of components. Just as mmap() manipulates the virtual address > translation for the threads that share memory space with the caller, > mount() manipulates the pathname resolution for the threads that share > the namespace with the caller. > > > > descriptor table: mapping from numbers to IO channels (opened files). > > > > This is that "process fd table" I mentioned above (I wrote that before > > reading this). > > > > > Again, a part of process state. dup() creates a new entry, with > > > reference to the same file as the old one; multiple open() of the > > > > Hmm, wouldn't "dup()" create another "file" that just points to the same > > dentry? It wouldn't be the "same file", or did you mean "file" from the > > user space point of view? > > No. The difference between open() and dup() is that the latter will > result in a descriptor that really refers to the same file. Current > IO position belongs to IO channel; it doesn't matter for e.g. terminals, > but for regular file it immediately becomes an issue. > fd1 = open("foo", 0); > fd2 = open("foo", 0); > read(fd1, &c1, 1); > read(fd2, &c2, 1); > will result in the first byte of foo read into c1 and c2, but > fd1 = open("foo", 0); > fd2 = dup(fd1); > read(fd1, &c1, 1); > read(fd2, &c2, 1); > will have the first byte of foo in c1 and the second one - in c2. > open() yields a new IO channel attached to new descriptor; dup() > (and dup2()) attaches the existing IO channel to new descriptor. > fork() acts like dup() in that respect - child gets its descriptor > table populated with references to the same IO channels as the > parent does. Ah, looking at the code I use dup() in, it's mostly for pipes in and for redirecting stdout,stdin, etc. So yeah, that makes sense. > > Any Unix since about '71 has it done that way and the same goes > for NT, DOS, etc. - you can't implement redirects to/from regular > files without that distinction. Yep, which is what I used it for. Just forgot the details. > > > > same pathname will each yield a separate opened file. _Some_ state > > > belongs here (close-on-exec, mostly). Note that there's no such > > > thing as "the descriptor of this file" - not even "the user-supplied > > > number that had been used to get the file we are currently reading > > > from", since that number might be refering to something entirely > > > different right after we'd resolved it to opened file and that > > > happens *without* disrupting the operation. > > > > This last paragraph confused me. What do you mean by ""referring to > > something entirely different"? > > Two threads share descriptor table; one of them is in > read(fd, ...), another does dup2(fd2, fd). If read() gets past the > point where it gets struct file reference, it will keep accessing that > IO channel. dup2() will replace the reference in descriptor table, > but that won't disrupt the read()... Oh, OK. So basically if fd 4 is a reference to /tmp/foo and you open /tmp/bar which gets fd2, and one thread is reading fd 4 (/tmp/foo), the other thread doing dup2(fd2, fd) will make fd 4 a reference to /tmp/bar but the read will finish reading /tmp/foo. But if the first thread were to do another read(fd, ...) it would then read /tmp/bar. In other words, it allows read() to stay atomic with respect to what it is reading until it returns. > > > > > Thanks for this overview. It was very useful, and something I think we > > should add to kernel doc. I did read Documentation/filesystems/vfs.rst but > > honestly, I think your writeup here is a better overview. > > At the very least it would need serious reordering ;-/ Yeah, but this is all great information. Thanks for explaining it. -- Steve
On Thu, 4 Jan 2024 18:25:02 +0000 Al Viro <viro@zeniv.linux.org.uk> wrote: > Unfortunately, the terms are clumsy as hell - POSIX ends up with > "file descriptor" (for numbers) vs. "file description" (for IO > channels), which is hard to distinguish when reading and just > as hard to distinguish when listening. "Opened file" (as IO > channel) vs. "file on disc" (as collection of data that might > be accessed via said channels) distinction on top of that also > doesn't help, to put it mildly. It's many decades too late to > do anything about, unfortunately. Pity the UNIX 101 students... ;-/ Just so I understand this correctly. "file descriptor" - is just what maps to a specific inode. "file description" - is how the file is accessed (position in the file and flags associated to how it was opened) Did I get that correct? -- Steve
On Thu, 4 Jan 2024 at 11:09, Steven Rostedt <rostedt@goodmis.org> wrote: > > My mistake was thinking that the dentry was attached more to the path than > the inode. But that doesn't seem to be the case. I wasn't sure if there was > a way to get to a dentry from the inode. Yeah, so dentry->inode and path->dentry are one-way translations, because the other way can have multiple different cases. IOW, a path will specify *one* dentry, and a dentry will specily *one* inode, but one inode can be associated with multiple dentries, and there may be other undiscovered dentries that *would* point to it but aren't even cached right now. And a single dentry can be part of multiple paths, thanks to bind mounts. The "inode->i_dentry" list is *not* a way to look up all dentries, because - as mentioned - there may be potential other paths (and thus other dentries) that lead to the same inode that just haven't been looked up yet (or that have already been aged out of the cache). Of course any *particular* filesystem may not have hard links (so one inode has only one possible dentry), and you may not have bind mounts, and it might be one of the virtual filesystems where everything is always in memory, so none of the above problems are guaranteed to be the case in any *particular* situation. But it's all part of why the dcache is actually really subtle. It's not just the RCU lookup rules and the specialized locking (both reflock and the rather complicated rules about d_lock ordering), it's also that whole "yeah, the filesystem only sees a 'dentry', but because of bind mounts the vfs layer actually does things internally in terms of 'struct path' in order to be able to then show that single fiolesystem in multiple places". Etc etc. There's a reason Al Viro ends up owning the dcache. Nobody else can wrap their tiny little minds around it all. Linus
On Thu, Jan 04, 2024 at 02:15:17PM -0500, Steven Rostedt wrote: > On Thu, 4 Jan 2024 18:25:02 +0000 > Al Viro <viro@zeniv.linux.org.uk> wrote: > > > Unfortunately, the terms are clumsy as hell - POSIX ends up with > > "file descriptor" (for numbers) vs. "file description" (for IO > > channels), which is hard to distinguish when reading and just > > as hard to distinguish when listening. "Opened file" (as IO > > channel) vs. "file on disc" (as collection of data that might > > be accessed via said channels) distinction on top of that also > > doesn't help, to put it mildly. It's many decades too late to > > do anything about, unfortunately. Pity the UNIX 101 students... ;-/ > > Just so I understand this correctly. > > "file descriptor" - is just what maps to a specific inode. No -- file descriptor is a number in fdtable that maps to a struct file. > "file description" - is how the file is accessed (position in the file and > flags associated to how it was opened) file description is posix's awful name for struct file.
On Thu, 4 Jan 2024 at 11:14, Steven Rostedt <rostedt@goodmis.org> wrote: > > "file descriptor" - is just what maps to a specific inode. Nope. Technically and traditionally, file descriptor is just the integer index that is used to look up a 'struct file *'. Except in the kernel, we really just tend to use that term (well, I do) for the 'struct file *' itself, since the integer 'fd' is usually not really relevant except at the system call interface. Which is *NOT* the inode, because the 'struct file' has other things in it (the file position, the permissions that were used at open time etc, close-on-exec state etc etc). > "file description" - is how the file is accessed (position in the file and > flags associated to how it was opened) That's a horrible term that shouldn't be used at all. Apparently some people use it for what is our 'struct file *", also known as a "file table entry". Avoid it. If anything, just use "fd" for the integer representation, and "file" for the pointer to a 'struct file". But most of the time the two are conceptually interchangeable, in that an 'fd' just translates directly to a 'struct file *'. Note that while there's that conceptual direct translation, there's also very much a "time of use" issue, in that a "fd -> file" translation happens at one particular time and in one particular user context, and then it's *done* (so closing and possibly re-using the fd after it's been looked up does not actually affect an existing 'struct file *'). And while 'fd -> file' lookup is quick and common, the other way doesn't exist, because multiple 'fd's can map to one 'struct file *' thanks to dup() (and 'fork()', since a 'fd -> file' translation always happens within the context of a particular user space, an 'fd' in one process is obviously not the same as an 'fd' in another one). Linus
On Thu, 4 Jan 2024 at 11:35, Linus Torvalds <torvalds@linux-foundation.org> wrote: >> > Which is *NOT* the inode, because the 'struct file' has other things > in it (the file position, the permissions that were used at open time > etc, close-on-exec state etc etc). That close-on-exec thing was a particularly bad example of things that are in the 'struct file', because it's in fact the only thing that *isn't* in 'struct file' and is associated directly with the 'int fd'. But hopefully the intent was clear despite me picking a particularly bad example. Linus
On Thu, Jan 04, 2024 at 11:35:37AM -0800, Linus Torvalds wrote: > > "file description" - is how the file is accessed (position in the file and > > flags associated to how it was opened) > > That's a horrible term that shouldn't be used at all. Apparently some > people use it for what is our 'struct file *", also known as a "file > table entry". Avoid it. Worse, really. As far as I can reconstruct what happened it was something along the lines of "colloquial expression is 'opened file', but that is confusing - sounds like a property+noun, so it might be misparsed as a member of subset of files satisfying the property of 'being opened'; can't have that in a standard, let's come up with something else". Except that what they did come up with had been much worse, for obvious linguistic reasons. The *ONLY* uses for that expression I can think of are 1. When reading POSIX texts, watch out for that one - if you see them talking about a file descriptor in context where it really should be about an opened file, check the wording. If it really says "file descriptOR", it's probably a bug in standard or a codified bullshit practice. If it says "file descriptION" instead, replace with "opened file" and move on. 2. An outstanding example of the taste of that bunch. IO channel would be a saner variant, but it's far too late for that. The 3-way distinction between descriptor/opened file/file as collection of data needs to be explained in UNIX 101; it is userland-visible and it has to be understood. Unfortunately, it's often done in a way that leaves students seriously confused ;-/
On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > From: "Steven Rostedt (Google)" <rostedt@goodmis.org> > > Instead of walking the dentries on mount/remount to update the gid values of > all the dentries if a gid option is specified on mount, just update the root > inode. Add .getattr, .setattr, and .permissions on the tracefs inode > operations to update the permissions of the files and directories. > > For all files and directories in the top level instance: > > /sys/kernel/tracing/* > > It will use the root inode as the default permissions. The inode that > represents: /sys/kernel/tracing (or wherever it is mounted). > > When an instance is created: > > mkdir /sys/kernel/tracing/instance/foo > > The directory "foo" and all its files and directories underneath will use > the default of what foo is when it was created. A remount of tracefs will > not affect it. That kinda sounds like eventfs should actually be a separate filesystem. But I don't know enough about the relationship between the two concepts. > > If a user were to modify the permissions of any file or directory in > tracefs, it will also no longer be modified by a change in ownership of a > remount. Very odd semantics and I would recommend to avoid that. It's just plain weird imo. > > The events directory, if it is in the top level instance, will use the > tracefs root inode as the default ownership for itself and all the files and > directories below it. > > For the events directory in an instance ("foo"), it will keep the ownership > of what it was when it was created, and that will be used as the default > ownership for the files and directories beneath it. > > Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wjVdGkjDXBbvLn2wbZnqP4UsH46E3gqJ9m7UG6DpX2+WA@mail.gmail.com/ > > Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> > --- So tracefs supports remounting with different uid/gid mount options and then actually wades through _all_ of the inodes and changes their ownership internally? What's the use-case for this? Containers? Aside from optimizing this and the special semantics for this eventfs stuff that you really should think twice of doing, here's one idea for an extension that might alleviate some of the pain: If you need flexible dynamic ownership change to e.g., be able to delegate (all, a directory, a single file of) tracefs to unprivileged/containers/whatever then you might want to consider supporting idmapped mounts for tracefs. Because then you can do stuff like: user1@localhost:~/data/scripts$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1234:1' /run/ /mnt user1@localhost:~/data/scripts$ ls -ln /run/ total 12 drwxr-xr-x 2 0 0 40 Jan 5 12:12 credentials drwx------ 2 0 0 40 Jan 5 11:57 cryptsetup drwxr-xr-x 2 0 0 60 Jan 5 11:57 dbus drwx------ 6 0 0 280 Jan 5 11:57 incus_agent prw------- 1 0 0 0 Jan 5 11:57 initctl drwxrwxrwt 4 0 0 80 Jan 5 11:57 lock drwxr-xr-x 3 0 0 60 Jan 5 11:57 log drwx------ 2 0 0 40 Jan 5 11:57 lvm -r--r--r-- 1 0 0 33 Jan 5 11:57 machine-id -rw-r--r-- 1 0 0 101 Jan 5 11:58 motd.dynamic drwxr-xr-x 2 0 0 40 Jan 5 11:57 mount drwx------ 2 0 0 40 Jan 5 11:57 multipath drwxr-xr-x 2 0 0 40 Jan 5 11:57 sendsigs.omit.d lrwxrwxrwx 1 0 0 8 Jan 5 11:57 shm -> /dev/shm drwx--x--x 2 0 0 40 Jan 5 11:57 sudo drwxr-xr-x 24 0 0 660 Jan 5 14:30 systemd drwxr-xr-x 6 0 0 140 Jan 5 14:30 udev drwxr-xr-x 4 0 0 80 Jan 5 11:58 user -rw-rw-r-- 1 0 43 2304 Jan 5 15:15 utmp user1@localhost:~/data/scripts$ ls -ln /mnt/ total 12 drwxr-xr-x 2 1234 1000 40 Jan 5 12:12 credentials drwx------ 2 1234 1000 40 Jan 5 11:57 cryptsetup drwxr-xr-x 2 1234 1000 60 Jan 5 11:57 dbus drwxr-xr-x 2 1234 1000 40 Jan 5 11:57 incus_agent prw------- 1 1234 1000 0 Jan 5 11:57 initctl drwxr-xr-x 2 1234 1000 40 Jan 5 11:57 lock drwxr-xr-x 3 1234 1000 60 Jan 5 11:57 log drwx------ 2 1234 1000 40 Jan 5 11:57 lvm -r--r--r-- 1 1234 1000 33 Jan 5 11:57 machine-id -rw-r--r-- 1 1234 1000 101 Jan 5 11:58 motd.dynamic drwxr-xr-x 2 1234 1000 40 Jan 5 11:57 mount drwx------ 2 1234 1000 40 Jan 5 11:57 multipath drwxr-xr-x 2 1234 1000 40 Jan 5 11:57 sendsigs.omit.d lrwxrwxrwx 1 1234 1000 8 Jan 5 11:57 shm -> /dev/shm drwx--x--x 2 1234 1000 40 Jan 5 11:57 sudo drwxr-xr-x 24 1234 1000 660 Jan 5 14:30 systemd drwxr-xr-x 6 1234 1000 140 Jan 5 14:30 udev drwxr-xr-x 4 1234 1000 80 Jan 5 11:58 user -rw-rw-r-- 1 1234 65534 2304 Jan 5 15:15 utmp Where you can see that ownership of this tmpfs instance in this example is changed. I'm not trying to advocate here but this will probably ultimately be nicer for your users because it means that a container manager or whatever can be handed a part of tracefs (or all of it) and the ownership and access rights for that thing is correct. And you can get rid of that gid based access completely. You can change uids, gids, or both. You can specify up to 340 individual mappings it's quite flexible. Because then you can have a single tracefs superblock and have multiple mounts with different ownership for the relevant parts of tracefs that you want to delegate to whoever. If you need an ownership change you can then just create another idmapped mount with the new ownership and then use MOVE_MOUNT_BENEATH + umount to replace that mount. Probably even know someone that would implement this for you (not me) if that sounds like something that would cover some of the use-case for the proposed change here. But maybe I just misunderstood things completely.
On Fri, 5 Jan 2024 15:26:28 +0100 Christian Brauner <brauner@kernel.org> wrote: > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > > From: "Steven Rostedt (Google)" <rostedt@goodmis.org> > > > > Instead of walking the dentries on mount/remount to update the gid values of > > all the dentries if a gid option is specified on mount, just update the root > > inode. Add .getattr, .setattr, and .permissions on the tracefs inode > > operations to update the permissions of the files and directories. > > > > For all files and directories in the top level instance: > > > > /sys/kernel/tracing/* > > > > It will use the root inode as the default permissions. The inode that > > represents: /sys/kernel/tracing (or wherever it is mounted). > > > > When an instance is created: > > > > mkdir /sys/kernel/tracing/instance/foo > > > > The directory "foo" and all its files and directories underneath will use > > the default of what foo is when it was created. A remount of tracefs will > > not affect it. > > That kinda sounds like eventfs should actually be a separate filesystem. > But I don't know enough about the relationship between the two concepts. It may someday become one, as eventfs is used by perf where the rest of the tracefs system is not. > > > > > If a user were to modify the permissions of any file or directory in > > tracefs, it will also no longer be modified by a change in ownership of a > > remount. > > Very odd semantics and I would recommend to avoid that. It's just plain > weird imo. > > > > > The events directory, if it is in the top level instance, will use the > > tracefs root inode as the default ownership for itself and all the files and > > directories below it. > > > > For the events directory in an instance ("foo"), it will keep the ownership > > of what it was when it was created, and that will be used as the default > > ownership for the files and directories beneath it. > > > > Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wjVdGkjDXBbvLn2wbZnqP4UsH46E3gqJ9m7UG6DpX2+WA@mail.gmail.com/ > > > > Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> > > --- > > So tracefs supports remounting with different uid/gid mount options and > then actually wades through _all_ of the inodes and changes their > ownership internally? What's the use-case for this? Containers? No, in fact tracing doesn't work well with containers as tracing is global to the entire machine. It can work with privileged containers though. The reason for this is because tracefs was based off of debugfs where the files and directores are created at boot up and mounted later. The reason to do this was to allow users to mount with gid=GID to allow a given group to have access to tracing. Without this update, tracefs would ignore it like debugfs and proc does today. I think its time I explain the purpose of tracefs and how it came to be. The tracing system required a way to control tracing and read the traces. It could have just used a new system like perf (although /sys/kernel/debug/tracing predates perf), where it created a single ioctl() like system call do do everything. As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev tracer, which both have an embedded background, I chose an interface that could work with just an unmodified version of busybox. That is, I wanted it to work with just cat and echo. The main difference with tracefs compared to other file systems is that it is a control interface, where writes happen as much as reads. The data read is controlled. The closest thing I can think of is how cgroups work. As tracing is a privileged operation, but something that could be changed to allow a group to have access to, I wanted to make it easy for an admin to decide who gets to do what at boot up via the /etc/fstab file. > > Aside from optimizing this and the special semantics for this eventfs > stuff that you really should think twice of doing, here's one idea for > an extension that might alleviate some of the pain: > > If you need flexible dynamic ownership change to e.g., be able to > delegate (all, a directory, a single file of) tracefs to > unprivileged/containers/whatever then you might want to consider > supporting idmapped mounts for tracefs. Because then you can do stuff > like: I'm a novice here and have no idea on how id maps work ;-) > > user1@localhost:~/data/scripts$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1234:1' /run/ /mnt > user1@localhost:~/data/scripts$ ls -ln /run/ > total 12 > drwxr-xr-x 2 0 0 40 Jan 5 12:12 credentials > drwx------ 2 0 0 40 Jan 5 11:57 cryptsetup > drwxr-xr-x 2 0 0 60 Jan 5 11:57 dbus > drwx------ 6 0 0 280 Jan 5 11:57 incus_agent > prw------- 1 0 0 0 Jan 5 11:57 initctl > drwxrwxrwt 4 0 0 80 Jan 5 11:57 lock > drwxr-xr-x 3 0 0 60 Jan 5 11:57 log > drwx------ 2 0 0 40 Jan 5 11:57 lvm > -r--r--r-- 1 0 0 33 Jan 5 11:57 machine-id > -rw-r--r-- 1 0 0 101 Jan 5 11:58 motd.dynamic > drwxr-xr-x 2 0 0 40 Jan 5 11:57 mount > drwx------ 2 0 0 40 Jan 5 11:57 multipath > drwxr-xr-x 2 0 0 40 Jan 5 11:57 sendsigs.omit.d > lrwxrwxrwx 1 0 0 8 Jan 5 11:57 shm -> /dev/shm > drwx--x--x 2 0 0 40 Jan 5 11:57 sudo > drwxr-xr-x 24 0 0 660 Jan 5 14:30 systemd > drwxr-xr-x 6 0 0 140 Jan 5 14:30 udev > drwxr-xr-x 4 0 0 80 Jan 5 11:58 user > -rw-rw-r-- 1 0 43 2304 Jan 5 15:15 utmp > > user1@localhost:~/data/scripts$ ls -ln /mnt/ > total 12 > drwxr-xr-x 2 1234 1000 40 Jan 5 12:12 credentials > drwx------ 2 1234 1000 40 Jan 5 11:57 cryptsetup > drwxr-xr-x 2 1234 1000 60 Jan 5 11:57 dbus > drwxr-xr-x 2 1234 1000 40 Jan 5 11:57 incus_agent > prw------- 1 1234 1000 0 Jan 5 11:57 initctl > drwxr-xr-x 2 1234 1000 40 Jan 5 11:57 lock > drwxr-xr-x 3 1234 1000 60 Jan 5 11:57 log > drwx------ 2 1234 1000 40 Jan 5 11:57 lvm > -r--r--r-- 1 1234 1000 33 Jan 5 11:57 machine-id > -rw-r--r-- 1 1234 1000 101 Jan 5 11:58 motd.dynamic > drwxr-xr-x 2 1234 1000 40 Jan 5 11:57 mount > drwx------ 2 1234 1000 40 Jan 5 11:57 multipath > drwxr-xr-x 2 1234 1000 40 Jan 5 11:57 sendsigs.omit.d > lrwxrwxrwx 1 1234 1000 8 Jan 5 11:57 shm -> /dev/shm > drwx--x--x 2 1234 1000 40 Jan 5 11:57 sudo > drwxr-xr-x 24 1234 1000 660 Jan 5 14:30 systemd > drwxr-xr-x 6 1234 1000 140 Jan 5 14:30 udev > drwxr-xr-x 4 1234 1000 80 Jan 5 11:58 user > -rw-rw-r-- 1 1234 65534 2304 Jan 5 15:15 utmp > > Where you can see that ownership of this tmpfs instance in this example > is changed. I'm not trying to advocate here but this will probably > ultimately be nicer for your users because it means that a container > manager or whatever can be handed a part of tracefs (or all of it) and > the ownership and access rights for that thing is correct. And you can > get rid of that gid based access completely. > > You can change uids, gids, or both. You can specify up to 340 individual > mappings it's quite flexible. > > Because then you can have a single tracefs superblock and have multiple > mounts with different ownership for the relevant parts of tracefs that > you want to delegate to whoever. If you need an ownership change you can > then just create another idmapped mount with the new ownership and then > use MOVE_MOUNT_BENEATH + umount to replace that mount. > > Probably even know someone that would implement this for you (not me) if > that sounds like something that would cover some of the use-case for the > proposed change here. But maybe I just misunderstood things completely. That actually sounds nice. I have no idea how to implement it, but having a way to bind mount with different permissions looks like a nifty feature to have. Now, can we do that and still keep the dynamic creation of inodes and dentries? Does this require having more than one dentry per inode? -- Steve
> > So tracefs supports remounting with different uid/gid mount options and > > then actually wades through _all_ of the inodes and changes their > > ownership internally? What's the use-case for this? Containers? > > No, in fact tracing doesn't work well with containers as tracing is global > to the entire machine. It can work with privileged containers though. At least the tracefs interface is easily supportable within a delegation model. IOW, you have a privileged process that delegates relevant portions to a container via idmapped mounts _without_ doing the insane thing and making it mountable by a container aka the fs-to-CVE pipeline. > > The reason for this is because tracefs was based off of debugfs where the > files and directores are created at boot up and mounted later. The reason > to do this was to allow users to mount with gid=GID to allow a given group > to have access to tracing. Without this update, tracefs would ignore it > like debugfs and proc does today. > > I think its time I explain the purpose of tracefs and how it came to be. > > The tracing system required a way to control tracing and read the traces. > It could have just used a new system like perf (although > /sys/kernel/debug/tracing predates perf), where it created a single ioctl() > like system call do do everything. > > As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev > tracer, which both have an embedded background, I chose an interface that > could work with just an unmodified version of busybox. That is, I wanted it > to work with just cat and echo. > > The main difference with tracefs compared to other file systems is that it > is a control interface, where writes happen as much as reads. The data read > is controlled. The closest thing I can think of is how cgroups work. > > As tracing is a privileged operation, but something that could be changed > to allow a group to have access to, I wanted to make it easy for an admin > to decide who gets to do what at boot up via the /etc/fstab file. Yeah, ok. I think you could achieve the same thing via idmapped mounts. You just need to swap out the mnt on /sys/kernel/tracing with an idmapped mount. mount(8) should just give you the ability to specify "map the ids I explicitly want to remap to something else and for the rest use the identity mapping". I wanted that for other reasons anyway. So in one of the next versions of mount(8) you can then do (where --beneath means place the mount beneath the current one and --replace is self-explanatory): sudo mount --beneath -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing sudo umount /sys/kernel/tracing or as a shortcut provided by mount(8): sudo mount --replace -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing In both cases you replace the mount without unmounting tracefs. I can illustrate this right now though: user1@localhost:~$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1000:1' /sys/kernel/tracing/ /mnt/ # This is a tool I wrote for testing the patchset I wrote back then. user1@localhost:~/data/move-mount-beneath$ sudo ./move-mount --beneath --detached /mnt /sys/kernel/tracing Mounting beneath top mount Creating anonymous mount Attaching mount /mnt -> /sys/kernel/tracing Creating single detached mount user1@localhost:~/data/move-mount-beneath$ # Now there's two mounts stacked on top of each other. user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing | `-/sys/kernel/tracing tracefs tracefs rw,nosuid,nodev,noexec,relatime,idmapped | `-/sys/kernel/tracing tracefs tracefs rw,nosuid,nodev,noexec,relatime user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| head total 0 drwx------ 6 root root 0 Jan 7 13:33 . drwxr-xr-x 16 root root 0 Jan 7 13:33 .. -r--r----- 1 root root 0 Jan 7 13:33 README -r--r----- 1 root root 0 Jan 7 13:33 available_events -r--r----- 1 root root 0 Jan 7 13:33 available_filter_functions -r--r----- 1 root root 0 Jan 7 13:33 available_filter_functions_addrs -r--r----- 1 root root 0 Jan 7 13:33 available_tracers -rw-r----- 1 root root 0 Jan 7 13:33 buffer_percent -rw-r----- 1 root root 0 Jan 7 13:33 buffer_size_kb # Reveal updated mount user1@localhost:~/data/move-mount-beneath$ sudo umount /sys/kernel/tracing user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing | `-/sys/kernel/tracing tracefs tracefs rw,nosuid,nodev,noexec,relatime,idmapped user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| head total 0 drwx------ 6 user1 user1 0 Jan 7 13:33 . drwxr-xr-x 16 root root 0 Jan 7 13:33 .. -r--r----- 1 user1 user1 0 Jan 7 13:33 README -r--r----- 1 user1 user1 0 Jan 7 13:33 available_events -r--r----- 1 user1 user1 0 Jan 7 13:33 available_filter_functions -r--r----- 1 user1 user1 0 Jan 7 13:33 available_filter_functions_addrs -r--r----- 1 user1 user1 0 Jan 7 13:33 available_tracers -rw-r----- 1 user1 user1 0 Jan 7 13:33 buffer_percent -rw-r----- 1 user1 user1 0 Jan 7 13:33 buffer_size_kb sudo umount -l /sys/kernel/tracing and reveal the new mount with updated permissions and at no point in time will you have had to unmount tracefs itself. No chown needed, no remount needed that has to touch all inodes. I did perf numbers for this when I implemented this and there's no meaningful perf impact for anything below 5 mappings. which covers 80% of the use-cases. I mean, there are crazy people out there that do have 30 mappings, maybe. And maybe there's 3 users that go into the hundreds (340 is maximum so struct idmap still fits into cacheline) because of some LDAP crap or something. See an ext4 filesystem to open/create 1,000,000 files. Then I looped through all of the files calling fstat() on each of them 1000 times and calculated the mean fstat() time for a single file. | # MAPPINGS | PATCH-NEW | |--------------|-----------| | 0 mappings | 158 ns | | 1 mappings | 157 ns | | 2 mappings | 158 ns | | 3 mappings | 161 ns | | 5 mappings | 165 ns | | 10 mappings | 199 ns | | 50 mappings | 218 ns | | 100 mappings | 229 ns | | 200 mappings | 239 ns | | 300 mappings | 240 ns | | 340 mappings | 248 ns | > > > > > Aside from optimizing this and the special semantics for this eventfs > > stuff that you really should think twice of doing, here's one idea for > > an extension that might alleviate some of the pain: > > > > If you need flexible dynamic ownership change to e.g., be able to > > delegate (all, a directory, a single file of) tracefs to > > unprivileged/containers/whatever then you might want to consider > > supporting idmapped mounts for tracefs. Because then you can do stuff > > like: > > I'm a novice here and have no idea on how id maps work ;-) It's no magic and you don't even need to care about it. If you're on util-linux 2.39 and any kernel post 5.12 the -o X-mount.idmap option does all the details for you. Though I still want an api where you can just pass the idmappings directly to mount_setattr(). That's another topic though. > have. Now, can we do that and still keep the dynamic creation of inodes and > dentries? Yes. > Does this require having more than one dentry per inode? No. It's just a topological change. The same subtree exposed at different locations with the ability of exposing it with different permissions (without any persistent filesystem-level changes). So, I tried to do an exploratory patch even though I promised myself not to do it. But hey... Some notes: * Permission handling for idmapped mounts is done completely in the VFS. That's the case for all filesytems that don't have a custom ->permission() handler. So there's nothing to do for us here. * Idmapped mount handling for ->getattr() is done completely by the VFS if the filesystem doesn't have a custom ->getattr() handler. So we're done here. * Tracefs doesn't support attribute changes via ->setattr() (chown(), chmod etc.). So there's nothing to here. * Eventfs does support attribute changes via ->setattr(). But it relies on simple_setattr() which is already idmapped mount aware. So there's nothing for us to do. * Ownership is inherited from the parent inode (tracefs) or optionally from stashed ownership information (eventfs). That means the idmapping is irrelevant here. It's similar to the "inherit gid from parent directory" logic we have in some circumstances. TL;DR nothing to do here as well. * Tracefs supports the creation of instances from userspace via mkdir. For example, mkdir /sys/kernel/tracing/instances/foo And here the idmapping is relevant so we need to make the helpers aware of the idmapping. I just went and plumbed this through to most helpers. There's some subtlety in eventfs. Afaict, the directories and files for the individual events are created on-demand during lookup or readdir. The ownership of these events is again inherited from the parent inode or recovered from stored state. In both cases the actual idmapping is irrelevant. The callchain here is: eventfs_root_lookup("xfs", "events") -> create_{dir,file}_dentry("xfs", "events") -> create_{dir,file}("xfs", "events") -> eventfs_start_creating("xfs", "events") -> lookup_one_len("xfs", "events") And the subtlety is that lookup_one_len() does permission checking on the parent inode (IOW, if you want a dentry for "blech" under "events" it'll do a permission check on events->d_inode) for exec permissions and then goes on to give you a new dentry. Usually this call would have to be changed to lookup_one() and the idmapping be handed down to it. But I think that's irrelevant here. Lookup generally doesn't need to be aware of idmappings at all. The permission checking is done purely in the vfs via may_lookup() and the idmapping is irrelevant because we always initialize inodes with the filesystem level ownership (see the idmappings.rst) documentation if you're interested in excessive details (otherwise you get inode aliases which you really don't want). For tracefs it would not matter for lookup per se but only because tracefs seemingly creates inodes/dentries during lookup (and readdir()). But imho the permission checking done in current eventfs_root_lookup() via lookup_one_len() is meaningless in any way; possibly even (conceptually) wrong. Because, the actual permission checking for the creation of the eventfs entries isn't really done during lookup or readdir, it's done when mkdir is called: mkdir /sys/kernel/tracing/instances/foo Here, all possible entries beneath foo including "events" and further below are recorded and stored. So once mkdir returns it basically means that it succeeded with the creation of all the necessary directories and files. For all purposes the foo/events/ directory and below have all the entries that matter. They have been created. It's comparable to them not being in the {d,i}cache, I guess. When one goes and looksup stuff under foo/events/ or readdir the entries in that directory: fd = open("foo/events") readdir(fd, ...) then they are licensed to list an entry in that directory. So all that needs to be done is to actually list those files in that directory. And since they already exist (they were created during mkdir) we just need to splice in inodes and dentries for them. But for that we shouldn't check permissions on the directory again. Because we've done that already correctly when the VFS called may_lookup(). IOW, the inode_permission() in lookup_one_len() that eventfs does is redundant and just wrong. Luckily, I don't think we need to even change anything because all directories that eventfs creates always grant exec permissions to the other group so lookup_one_len() will trivially succeed. IIUC. Drafted-by-with-no-guarantees-whatsoever-that-this-wont-burn-the-house-down: Christian Brauner <brauner@kernel.org> --- fs/tracefs/event_inode.c | 8 +- fs/tracefs/inode.c | 38 +++-- fs/tracefs/internal.h | 3 +- include/linux/tracefs.h | 20 +-- kernel/trace/ftrace.c | 43 +++--- kernel/trace/trace.c | 201 ++++++++++++++------------- kernel/trace/trace.h | 22 +-- kernel/trace/trace_dynevent.c | 2 +- kernel/trace/trace_events.c | 27 ++-- kernel/trace/trace_events_synth.c | 5 +- kernel/trace/trace_functions.c | 6 +- kernel/trace/trace_functions_graph.c | 4 +- kernel/trace/trace_hwlat.c | 8 +- kernel/trace/trace_kprobe.c | 4 +- kernel/trace/trace_osnoise.c | 48 ++++--- kernel/trace/trace_printk.c | 4 +- kernel/trace/trace_stack.c | 11 +- kernel/trace/trace_stat.c | 6 +- kernel/trace/trace_uprobe.c | 4 +- 19 files changed, 251 insertions(+), 213 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 2ccc849a5bda..e2f352bd8779 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -852,11 +852,11 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode * * See eventfs_create_dir() for use of @entries. */ -struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry *parent, - const struct eventfs_entry *entries, - int size, void *data) +struct eventfs_inode *eventfs_create_events_dir( + struct mnt_idmap *idmap, const char *name, struct dentry *parent, + const struct eventfs_entry *entries, int size, void *data) { - struct dentry *dentry = tracefs_start_creating(name, parent); + struct dentry *dentry = tracefs_start_creating(idmap, name, parent); struct eventfs_inode *ei; struct tracefs_inode *ti; struct inode *inode; diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index ae648deed019..f4f4904eb3a0 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -68,7 +68,7 @@ static const struct file_operations tracefs_file_operations = { }; static struct tracefs_dir_ops { - int (*mkdir)(const char *name); + int (*mkdir)(struct mnt_idmap *idmap, const char *name); int (*rmdir)(const char *name); } tracefs_ops __ro_after_init; @@ -104,7 +104,7 @@ static int tracefs_syscall_mkdir(struct mnt_idmap *idmap, * mkdir routine to handle races. */ inode_unlock(inode); - ret = tracefs_ops.mkdir(name); + ret = tracefs_ops.mkdir(idmap, name); inode_lock(inode); kfree(name); @@ -439,10 +439,12 @@ static struct file_system_type trace_fs_type = { .name = "tracefs", .mount = trace_mount, .kill_sb = kill_litter_super, + .fs_flags = FS_ALLOW_IDMAP, }; MODULE_ALIAS_FS("tracefs"); -struct dentry *tracefs_start_creating(const char *name, struct dentry *parent) +struct dentry *tracefs_start_creating(struct mnt_idmap *idmap, const char *name, + struct dentry *parent) { struct dentry *dentry; int error; @@ -466,7 +468,7 @@ struct dentry *tracefs_start_creating(const char *name, struct dentry *parent) if (unlikely(IS_DEADDIR(d_inode(parent)))) dentry = ERR_PTR(-ENOENT); else - dentry = lookup_one_len(name, parent, strlen(name)); + dentry = lookup_one(idmap, name, parent, strlen(name)); if (!IS_ERR(dentry) && d_inode(dentry)) { dput(dentry); dentry = ERR_PTR(-EEXIST); @@ -589,8 +591,9 @@ struct dentry *eventfs_end_creating(struct dentry *dentry) * If tracefs is not enabled in the kernel, the value -%ENODEV will be * returned. */ -struct dentry *tracefs_create_file(const char *name, umode_t mode, - struct dentry *parent, void *data, +struct dentry *tracefs_create_file(struct mnt_idmap *idmap, const char *name, + umode_t mode, struct dentry *parent, + void *data, const struct file_operations *fops) { struct dentry *dentry; @@ -602,7 +605,7 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode, if (!(mode & S_IFMT)) mode |= S_IFREG; BUG_ON(!S_ISREG(mode)); - dentry = tracefs_start_creating(name, parent); + dentry = tracefs_start_creating(idmap, name, parent); if (IS_ERR(dentry)) return NULL; @@ -621,10 +624,11 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode, return tracefs_end_creating(dentry); } -static struct dentry *__create_dir(const char *name, struct dentry *parent, +static struct dentry *__create_dir(struct mnt_idmap *idmap, const char *name, + struct dentry *parent, const struct inode_operations *ops) { - struct dentry *dentry = tracefs_start_creating(name, parent); + struct dentry *dentry = tracefs_start_creating(idmap, name, parent); struct inode *inode; if (IS_ERR(dentry)) @@ -649,6 +653,10 @@ static struct dentry *__create_dir(const char *name, struct dentry *parent, return tracefs_end_creating(dentry); } +const struct inode_operations tracefs_default_dir_inode_operations = { + .lookup = simple_lookup, +}; + /** * tracefs_create_dir - create a directory in the tracefs filesystem * @name: a pointer to a string containing the name of the directory to @@ -666,12 +674,14 @@ static struct dentry *__create_dir(const char *name, struct dentry *parent, * If tracing is not enabled in the kernel, the value -%ENODEV will be * returned. */ -struct dentry *tracefs_create_dir(const char *name, struct dentry *parent) +struct dentry *tracefs_create_dir(struct mnt_idmap *idmap, const char *name, + struct dentry *parent) { if (security_locked_down(LOCKDOWN_TRACEFS)) return NULL; - return __create_dir(name, parent, &simple_dir_inode_operations); + return __create_dir(idmap, name, parent, + &tracefs_default_dir_inode_operations); } /** @@ -693,7 +703,8 @@ struct dentry *tracefs_create_dir(const char *name, struct dentry *parent) */ __init struct dentry *tracefs_create_instance_dir(const char *name, struct dentry *parent, - int (*mkdir)(const char *name), + int (*mkdir)(struct mnt_idmap *idmap, + const char *name), int (*rmdir)(const char *name)) { struct dentry *dentry; @@ -702,7 +713,8 @@ __init struct dentry *tracefs_create_instance_dir(const char *name, if (WARN_ON(tracefs_ops.mkdir || tracefs_ops.rmdir)) return NULL; - dentry = __create_dir(name, parent, &tracefs_dir_inode_operations); + dentry = __create_dir(&nop_mnt_idmap, name, parent, + &tracefs_dir_inode_operations); if (!dentry) return NULL; diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h index ccee18ca66c7..64abc5fcc62f 100644 --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -70,8 +70,9 @@ static inline struct tracefs_inode *get_tracefs(const struct inode *inode) return container_of(inode, struct tracefs_inode, vfs_inode); } -struct dentry *tracefs_start_creating(const char *name, struct dentry *parent); struct dentry *tracefs_end_creating(struct dentry *dentry); +struct dentry *tracefs_start_creating(struct mnt_idmap *idmap, const char *name, + struct dentry *parent); struct dentry *tracefs_failed_creating(struct dentry *dentry); struct inode *tracefs_get_inode(struct super_block *sb); struct dentry *eventfs_start_creating(const char *name, struct dentry *parent); diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h index 7a5fe17b6bf9..0457afbafc94 100644 --- a/include/linux/tracefs.h +++ b/include/linux/tracefs.h @@ -76,9 +76,9 @@ struct eventfs_entry { struct eventfs_inode; -struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry *parent, - const struct eventfs_entry *entries, - int size, void *data); +struct eventfs_inode *eventfs_create_events_dir( + struct mnt_idmap *idmap, const char *name, struct dentry *parent, + const struct eventfs_entry *entries, int size, void *data); struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode *parent, const struct eventfs_entry *entries, @@ -87,16 +87,20 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode void eventfs_remove_events_dir(struct eventfs_inode *ei); void eventfs_remove_dir(struct eventfs_inode *ei); -struct dentry *tracefs_create_file(const char *name, umode_t mode, - struct dentry *parent, void *data, +struct dentry *tracefs_create_file(struct mnt_idmap *idmap, const char *name, + umode_t mode, struct dentry *parent, + void *data, const struct file_operations *fops); -struct dentry *tracefs_create_dir(const char *name, struct dentry *parent); +struct dentry *tracefs_create_dir(struct mnt_idmap *idmap, const char *name, + struct dentry *parent); void tracefs_remove(struct dentry *dentry); -struct dentry *tracefs_create_instance_dir(const char *name, struct dentry *parent, - int (*mkdir)(const char *name), +struct dentry *tracefs_create_instance_dir(const char *name, + struct dentry *parent, + int (*mkdir)(struct mnt_idmap *idmap, + const char *name), int (*rmdir)(const char *name)); bool tracefs_initialized(void); diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 8de8bec5f366..6bf9dc7c5c29 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -1015,7 +1015,7 @@ static __init void ftrace_profile_tracefs(struct dentry *d_tracer) } } - trace_create_file("function_profile_enabled", + trace_create_file(&nop_mnt_idmap, "function_profile_enabled", TRACE_MODE_WRITE, d_tracer, NULL, &ftrace_profile_fops); } @@ -6380,14 +6380,14 @@ static const struct file_operations ftrace_graph_notrace_fops = { }; #endif /* CONFIG_FUNCTION_GRAPH_TRACER */ -void ftrace_create_filter_files(struct ftrace_ops *ops, +void ftrace_create_filter_files(struct mnt_idmap *idmap, struct ftrace_ops *ops, struct dentry *parent) { - trace_create_file("set_ftrace_filter", TRACE_MODE_WRITE, parent, + trace_create_file(idmap, "set_ftrace_filter", TRACE_MODE_WRITE, parent, ops, &ftrace_filter_fops); - trace_create_file("set_ftrace_notrace", TRACE_MODE_WRITE, parent, + trace_create_file(idmap, "set_ftrace_notrace", TRACE_MODE_WRITE, parent, ops, &ftrace_notrace_fops); } @@ -6413,28 +6413,26 @@ void ftrace_destroy_filter_files(struct ftrace_ops *ops) static __init int ftrace_init_dyn_tracefs(struct dentry *d_tracer) { + trace_create_file(&nop_mnt_idmap, "available_filter_functions", + TRACE_MODE_READ, d_tracer, NULL, &ftrace_avail_fops); - trace_create_file("available_filter_functions", TRACE_MODE_READ, - d_tracer, NULL, &ftrace_avail_fops); + trace_create_file(&nop_mnt_idmap, "available_filter_functions_addrs", + TRACE_MODE_READ, d_tracer, NULL, + &ftrace_avail_addrs_fops); - trace_create_file("available_filter_functions_addrs", TRACE_MODE_READ, - d_tracer, NULL, &ftrace_avail_addrs_fops); - - trace_create_file("enabled_functions", TRACE_MODE_READ, + trace_create_file(&nop_mnt_idmap, "enabled_functions", TRACE_MODE_READ, d_tracer, NULL, &ftrace_enabled_fops); - trace_create_file("touched_functions", TRACE_MODE_READ, - d_tracer, NULL, &ftrace_touched_fops); + trace_create_file(&nop_mnt_idmap, "touched_functions", TRACE_MODE_READ, + d_tracer, NULL, &ftrace_touched_fops); - ftrace_create_filter_files(&global_ops, d_tracer); + ftrace_create_filter_files(&nop_mnt_idmap, &global_ops, d_tracer); #ifdef CONFIG_FUNCTION_GRAPH_TRACER - trace_create_file("set_graph_function", TRACE_MODE_WRITE, d_tracer, - NULL, - &ftrace_graph_fops); - trace_create_file("set_graph_notrace", TRACE_MODE_WRITE, d_tracer, - NULL, - &ftrace_graph_notrace_fops); + trace_create_file(&nop_mnt_idmap, "set_graph_function", + TRACE_MODE_WRITE, d_tracer, NULL, &ftrace_graph_fops); + trace_create_file(&nop_mnt_idmap, "set_graph_notrace", TRACE_MODE_WRITE, + d_tracer, NULL, &ftrace_graph_notrace_fops); #endif /* CONFIG_FUNCTION_GRAPH_TRACER */ return 0; @@ -7858,11 +7856,12 @@ static const struct file_operations ftrace_no_pid_fops = { .release = ftrace_pid_release, }; -void ftrace_init_tracefs(struct trace_array *tr, struct dentry *d_tracer) +void ftrace_init_tracefs(struct mnt_idmap *idmap, struct trace_array *tr, + struct dentry *d_tracer) { - trace_create_file("set_ftrace_pid", TRACE_MODE_WRITE, d_tracer, + trace_create_file(idmap, "set_ftrace_pid", TRACE_MODE_WRITE, d_tracer, tr, &ftrace_pid_fops); - trace_create_file("set_ftrace_notrace_pid", TRACE_MODE_WRITE, + trace_create_file(idmap, "set_ftrace_notrace_pid", TRACE_MODE_WRITE, d_tracer, tr, &ftrace_no_pid_fops); } diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 199df497db07..c0bd4b0dfe85 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -1766,12 +1766,13 @@ static void latency_fsnotify_workfn_irq(struct irq_work *iwork) queue_work(fsnotify_wq, &tr->fsnotify_work); } -static void trace_create_maxlat_file(struct trace_array *tr, +static void trace_create_maxlat_file(struct mnt_idmap *idmap, + struct trace_array *tr, struct dentry *d_tracer) { INIT_WORK(&tr->fsnotify_work, latency_fsnotify_workfn); init_irq_work(&tr->fsnotify_irqwork, latency_fsnotify_workfn_irq); - tr->d_max_latency = trace_create_file("tracing_max_latency", + tr->d_max_latency = trace_create_file(idmap, "tracing_max_latency", TRACE_MODE_WRITE, d_tracer, tr, &tracing_max_lat_fops); @@ -1804,8 +1805,8 @@ void latency_fsnotify(struct trace_array *tr) #else /* !LATENCY_FS_NOTIFY */ -#define trace_create_maxlat_file(tr, d_tracer) \ - trace_create_file("tracing_max_latency", TRACE_MODE_WRITE, \ +#define trace_create_maxlat_file(idmap, tr, d_tracer) \ + trace_create_file(idmap, "tracing_max_latency", TRACE_MODE_WRITE, \ d_tracer, tr, &tracing_max_lat_fops) #endif @@ -2124,7 +2125,8 @@ static inline int do_run_tracer_selftest(struct tracer *type) } #endif /* CONFIG_FTRACE_STARTUP_TEST */ -static void add_tracer_options(struct trace_array *tr, struct tracer *t); +static void add_tracer_options(struct mnt_idmap *idmap, struct trace_array *tr, + struct tracer *t); static void __init apply_trace_boot_options(void); @@ -2191,7 +2193,7 @@ int __init register_tracer(struct tracer *type) type->next = trace_types; trace_types = type; - add_tracer_options(&global_trace, type); + add_tracer_options(&nop_mnt_idmap, &global_trace, type); out: mutex_unlock(&trace_types_lock); @@ -6245,14 +6247,14 @@ trace_insert_eval_map_file(struct module *mod, struct trace_eval_map **start, mutex_unlock(&trace_eval_mutex); } -static void trace_create_eval_file(struct dentry *d_tracer) +static void trace_create_eval_file(struct mnt_idmap *idmap, struct dentry *d_tracer) { - trace_create_file("eval_map", TRACE_MODE_READ, d_tracer, + trace_create_file(idmap, "eval_map", TRACE_MODE_READ, d_tracer, NULL, &tracing_eval_map_fops); } #else /* CONFIG_TRACE_EVAL_MAP_FILE */ -static inline void trace_create_eval_file(struct dentry *d_tracer) { } +static inline void trace_create_eval_file(struct mnt_idmap *idmap, struct dentry *d_tracer) { } static inline void trace_insert_eval_map_file(struct module *mod, struct trace_eval_map **start, int len) { } #endif /* !CONFIG_TRACE_EVAL_MAP_FILE */ @@ -6454,7 +6456,7 @@ int tracing_update_buffers(struct trace_array *tr) struct trace_option_dentry; static void -create_trace_option_files(struct trace_array *tr, struct tracer *tracer); +create_trace_option_files(struct mnt_idmap *idmap, struct trace_array *tr, struct tracer *tracer); /* * Used to clear out the tracer before deletion of an instance. @@ -6475,7 +6477,8 @@ static void tracing_set_nop(struct trace_array *tr) static bool tracer_options_updated; -static void add_tracer_options(struct trace_array *tr, struct tracer *t) +static void add_tracer_options(struct mnt_idmap *idmap, struct trace_array *tr, + struct tracer *t) { /* Only enable if the directory has been created already. */ if (!tr->dir) @@ -6485,7 +6488,7 @@ static void add_tracer_options(struct trace_array *tr, struct tracer *t) if (!tracer_options_updated) return; - create_trace_option_files(tr, t); + create_trace_option_files(idmap, tr, t); } int tracing_set_tracer(struct trace_array *tr, const char *buf) @@ -8845,7 +8848,8 @@ static struct dentry *tracing_get_dentry(struct trace_array *tr) return tr->dir; } -static struct dentry *tracing_dentry_percpu(struct trace_array *tr, int cpu) +static struct dentry *tracing_dentry_percpu(struct mnt_idmap *idmap, + struct trace_array *tr, int cpu) { struct dentry *d_tracer; @@ -8856,7 +8860,7 @@ static struct dentry *tracing_dentry_percpu(struct trace_array *tr, int cpu) if (IS_ERR(d_tracer)) return NULL; - tr->percpu_dir = tracefs_create_dir("per_cpu", d_tracer); + tr->percpu_dir = tracefs_create_dir(idmap, "per_cpu", d_tracer); MEM_FAIL(!tr->percpu_dir, "Could not create tracefs directory 'per_cpu/%d'\n", cpu); @@ -8864,11 +8868,13 @@ static struct dentry *tracing_dentry_percpu(struct trace_array *tr, int cpu) return tr->percpu_dir; } -static struct dentry * -trace_create_cpu_file(const char *name, umode_t mode, struct dentry *parent, - void *data, long cpu, const struct file_operations *fops) +static struct dentry *trace_create_cpu_file(struct mnt_idmap *idmap, + const char *name, umode_t mode, + struct dentry *parent, void *data, + long cpu, + const struct file_operations *fops) { - struct dentry *ret = trace_create_file(name, mode, parent, data, fops); + struct dentry *ret = trace_create_file(idmap, name, mode, parent, data, fops); if (ret) /* See tracing_get_cpu() */ d_inode(ret)->i_cdev = (void *)(cpu + 1); @@ -8876,9 +8882,9 @@ trace_create_cpu_file(const char *name, umode_t mode, struct dentry *parent, } static void -tracing_init_tracefs_percpu(struct trace_array *tr, long cpu) +tracing_init_tracefs_percpu(struct mnt_idmap *idmap, struct trace_array *tr, long cpu) { - struct dentry *d_percpu = tracing_dentry_percpu(tr, cpu); + struct dentry *d_percpu = tracing_dentry_percpu(idmap, tr, cpu); struct dentry *d_cpu; char cpu_dir[30]; /* 30 characters should be more than enough */ @@ -8886,34 +8892,34 @@ tracing_init_tracefs_percpu(struct trace_array *tr, long cpu) return; snprintf(cpu_dir, 30, "cpu%ld", cpu); - d_cpu = tracefs_create_dir(cpu_dir, d_percpu); + d_cpu = tracefs_create_dir(idmap, cpu_dir, d_percpu); if (!d_cpu) { pr_warn("Could not create tracefs '%s' entry\n", cpu_dir); return; } /* per cpu trace_pipe */ - trace_create_cpu_file("trace_pipe", TRACE_MODE_READ, d_cpu, + trace_create_cpu_file(idmap, "trace_pipe", TRACE_MODE_READ, d_cpu, tr, cpu, &tracing_pipe_fops); /* per cpu trace */ - trace_create_cpu_file("trace", TRACE_MODE_WRITE, d_cpu, + trace_create_cpu_file(idmap, "trace", TRACE_MODE_WRITE, d_cpu, tr, cpu, &tracing_fops); - trace_create_cpu_file("trace_pipe_raw", TRACE_MODE_READ, d_cpu, + trace_create_cpu_file(idmap, "trace_pipe_raw", TRACE_MODE_READ, d_cpu, tr, cpu, &tracing_buffers_fops); - trace_create_cpu_file("stats", TRACE_MODE_READ, d_cpu, + trace_create_cpu_file(idmap, "stats", TRACE_MODE_READ, d_cpu, tr, cpu, &tracing_stats_fops); - trace_create_cpu_file("buffer_size_kb", TRACE_MODE_READ, d_cpu, + trace_create_cpu_file(idmap, "buffer_size_kb", TRACE_MODE_READ, d_cpu, tr, cpu, &tracing_entries_fops); #ifdef CONFIG_TRACER_SNAPSHOT - trace_create_cpu_file("snapshot", TRACE_MODE_WRITE, d_cpu, + trace_create_cpu_file(idmap, "snapshot", TRACE_MODE_WRITE, d_cpu, tr, cpu, &snapshot_fops); - trace_create_cpu_file("snapshot_raw", TRACE_MODE_READ, d_cpu, + trace_create_cpu_file(idmap, "snapshot_raw", TRACE_MODE_READ, d_cpu, tr, cpu, &snapshot_raw_fops); #endif } @@ -9088,23 +9094,21 @@ static const struct file_operations trace_options_core_fops = { .llseek = generic_file_llseek, }; -struct dentry *trace_create_file(const char *name, - umode_t mode, - struct dentry *parent, - void *data, - const struct file_operations *fops) +struct dentry *trace_create_file(struct mnt_idmap *idmap, const char *name, + umode_t mode, struct dentry *parent, + void *data, const struct file_operations *fops) { struct dentry *ret; - ret = tracefs_create_file(name, mode, parent, data, fops); + ret = tracefs_create_file(idmap, name, mode, parent, data, fops); if (!ret) pr_warn("Could not create tracefs '%s' entry\n", name); return ret; } - -static struct dentry *trace_options_init_dentry(struct trace_array *tr) +static struct dentry *trace_options_init_dentry(struct mnt_idmap *idmap, + struct trace_array *tr) { struct dentry *d_tracer; @@ -9115,7 +9119,7 @@ static struct dentry *trace_options_init_dentry(struct trace_array *tr) if (IS_ERR(d_tracer)) return NULL; - tr->options = tracefs_create_dir("options", d_tracer); + tr->options = tracefs_create_dir(idmap, "options", d_tracer); if (!tr->options) { pr_warn("Could not create tracefs directory 'options'\n"); return NULL; @@ -9125,14 +9129,14 @@ static struct dentry *trace_options_init_dentry(struct trace_array *tr) } static void -create_trace_option_file(struct trace_array *tr, +create_trace_option_file(struct mnt_idmap *idmap, struct trace_array *tr, struct trace_option_dentry *topt, struct tracer_flags *flags, struct tracer_opt *opt) { struct dentry *t_options; - t_options = trace_options_init_dentry(tr); + t_options = trace_options_init_dentry(idmap, tr); if (!t_options) return; @@ -9140,13 +9144,14 @@ create_trace_option_file(struct trace_array *tr, topt->opt = opt; topt->tr = tr; - topt->entry = trace_create_file(opt->name, TRACE_MODE_WRITE, + topt->entry = trace_create_file(idmap, opt->name, TRACE_MODE_WRITE, t_options, topt, &trace_options_fops); } -static void -create_trace_option_files(struct trace_array *tr, struct tracer *tracer) +static void create_trace_option_files(struct mnt_idmap *idmap, + struct trace_array *tr, + struct tracer *tracer) { struct trace_option_dentry *topts; struct trace_options *tr_topts; @@ -9198,7 +9203,7 @@ create_trace_option_files(struct trace_array *tr, struct tracer *tracer) tr->nr_topts++; for (cnt = 0; opts[cnt].name; cnt++) { - create_trace_option_file(tr, &topts[cnt], flags, + create_trace_option_file(idmap, tr, &topts[cnt], flags, &opts[cnt]); MEM_FAIL(topts[cnt].entry == NULL, "Failed to create trace option: %s", @@ -9207,34 +9212,34 @@ create_trace_option_files(struct trace_array *tr, struct tracer *tracer) } static struct dentry * -create_trace_option_core_file(struct trace_array *tr, +create_trace_option_core_file(struct mnt_idmap *idmap, struct trace_array *tr, const char *option, long index) { struct dentry *t_options; - t_options = trace_options_init_dentry(tr); + t_options = trace_options_init_dentry(idmap, tr); if (!t_options) return NULL; - return trace_create_file(option, TRACE_MODE_WRITE, t_options, + return trace_create_file(idmap, option, TRACE_MODE_WRITE, t_options, (void *)&tr->trace_flags_index[index], &trace_options_core_fops); } -static void create_trace_options_dir(struct trace_array *tr) +static void create_trace_options_dir(struct mnt_idmap *idmap, struct trace_array *tr) { struct dentry *t_options; bool top_level = tr == &global_trace; int i; - t_options = trace_options_init_dentry(tr); + t_options = trace_options_init_dentry(idmap, tr); if (!t_options) return; for (i = 0; trace_options[i]; i++) { if (top_level || !((1 << i) & TOP_LEVEL_TRACE_FLAGS)) - create_trace_option_core_file(tr, trace_options[i], i); + create_trace_option_core_file(idmap, tr, trace_options[i], i); } } @@ -9342,8 +9347,8 @@ static const struct file_operations buffer_percent_fops = { static struct dentry *trace_instance_dir; -static void -init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer); +static void init_tracer_tracefs(struct mnt_idmap *idmap, struct trace_array *tr, + struct dentry *d_tracer); static int allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, int size) @@ -9426,19 +9431,20 @@ static void init_trace_flags_index(struct trace_array *tr) tr->trace_flags_index[i] = i; } -static void __update_tracer_options(struct trace_array *tr) +static void __update_tracer_options(struct mnt_idmap *idmap, + struct trace_array *tr) { struct tracer *t; for (t = trace_types; t; t = t->next) - add_tracer_options(tr, t); + add_tracer_options(idmap, tr, t); } static void update_tracer_options(struct trace_array *tr) { mutex_lock(&trace_types_lock); tracer_options_updated = true; - __update_tracer_options(tr); + __update_tracer_options(&nop_mnt_idmap, tr); mutex_unlock(&trace_types_lock); } @@ -9470,27 +9476,28 @@ struct trace_array *trace_array_find_get(const char *instance) return tr; } -static int trace_array_create_dir(struct trace_array *tr) +static int trace_array_create_dir(struct mnt_idmap *idmap, struct trace_array *tr) { int ret; - tr->dir = tracefs_create_dir(tr->name, trace_instance_dir); + tr->dir = tracefs_create_dir(idmap, tr->name, trace_instance_dir); if (!tr->dir) return -EINVAL; - ret = event_trace_add_tracer(tr->dir, tr); + ret = event_trace_add_tracer(idmap, tr->dir, tr); if (ret) { tracefs_remove(tr->dir); return ret; } - init_tracer_tracefs(tr, tr->dir); - __update_tracer_options(tr); + init_tracer_tracefs(idmap, tr, tr->dir); + __update_tracer_options(idmap, tr); return ret; } -static struct trace_array *trace_array_create(const char *name) +static struct trace_array *trace_array_create(struct mnt_idmap *idmap, + const char *name) { struct trace_array *tr; int ret; @@ -9539,7 +9546,7 @@ static struct trace_array *trace_array_create(const char *name) init_trace_flags_index(tr); if (trace_instance_dir) { - ret = trace_array_create_dir(tr); + ret = trace_array_create_dir(idmap, tr); if (ret) goto out_free_tr; } else @@ -9562,7 +9569,7 @@ static struct trace_array *trace_array_create(const char *name) return ERR_PTR(ret); } -static int instance_mkdir(const char *name) +static int instance_mkdir(struct mnt_idmap *idmap, const char *name) { struct trace_array *tr; int ret; @@ -9574,7 +9581,7 @@ static int instance_mkdir(const char *name) if (trace_array_find(name)) goto out_unlock; - tr = trace_array_create(name); + tr = trace_array_create(idmap, name); ret = PTR_ERR_OR_ZERO(tr); @@ -9612,7 +9619,7 @@ struct trace_array *trace_array_get_by_name(const char *name) goto out_unlock; } - tr = trace_array_create(name); + tr = trace_array_create(&nop_mnt_idmap, name); if (IS_ERR(tr)) tr = NULL; @@ -9728,7 +9735,7 @@ static __init void create_trace_instances(struct dentry *d_tracer) list_for_each_entry(tr, &ftrace_trace_arrays, list) { if (!tr->name) continue; - if (MEM_FAIL(trace_array_create_dir(tr) < 0, + if (MEM_FAIL(trace_array_create_dir(&nop_mnt_idmap, tr) < 0, "Failed to create instance directory\n")) break; } @@ -9737,81 +9744,81 @@ static __init void create_trace_instances(struct dentry *d_tracer) mutex_unlock(&event_mutex); } -static void -init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer) +static void init_tracer_tracefs(struct mnt_idmap *idmap, struct trace_array *tr, + struct dentry *d_tracer) { int cpu; - trace_create_file("available_tracers", TRACE_MODE_READ, d_tracer, + trace_create_file(idmap, "available_tracers", TRACE_MODE_READ, d_tracer, tr, &show_traces_fops); - trace_create_file("current_tracer", TRACE_MODE_WRITE, d_tracer, + trace_create_file(idmap, "current_tracer", TRACE_MODE_WRITE, d_tracer, tr, &set_tracer_fops); - trace_create_file("tracing_cpumask", TRACE_MODE_WRITE, d_tracer, + trace_create_file(idmap, "tracing_cpumask", TRACE_MODE_WRITE, d_tracer, tr, &tracing_cpumask_fops); - trace_create_file("trace_options", TRACE_MODE_WRITE, d_tracer, + trace_create_file(idmap, "trace_options", TRACE_MODE_WRITE, d_tracer, tr, &tracing_iter_fops); - trace_create_file("trace", TRACE_MODE_WRITE, d_tracer, + trace_create_file(idmap, "trace", TRACE_MODE_WRITE, d_tracer, tr, &tracing_fops); - trace_create_file("trace_pipe", TRACE_MODE_READ, d_tracer, + trace_create_file(idmap, "trace_pipe", TRACE_MODE_READ, d_tracer, tr, &tracing_pipe_fops); - trace_create_file("buffer_size_kb", TRACE_MODE_WRITE, d_tracer, + trace_create_file(idmap, "buffer_size_kb", TRACE_MODE_WRITE, d_tracer, tr, &tracing_entries_fops); - trace_create_file("buffer_total_size_kb", TRACE_MODE_READ, d_tracer, + trace_create_file(idmap, "buffer_total_size_kb", TRACE_MODE_READ, d_tracer, tr, &tracing_total_entries_fops); - trace_create_file("free_buffer", 0200, d_tracer, + trace_create_file(idmap, "free_buffer", 0200, d_tracer, tr, &tracing_free_buffer_fops); - trace_create_file("trace_marker", 0220, d_tracer, + trace_create_file(idmap, "trace_marker", 0220, d_tracer, tr, &tracing_mark_fops); tr->trace_marker_file = __find_event_file(tr, "ftrace", "print"); - trace_create_file("trace_marker_raw", 0220, d_tracer, + trace_create_file(idmap, "trace_marker_raw", 0220, d_tracer, tr, &tracing_mark_raw_fops); - trace_create_file("trace_clock", TRACE_MODE_WRITE, d_tracer, tr, + trace_create_file(idmap, "trace_clock", TRACE_MODE_WRITE, d_tracer, tr, &trace_clock_fops); - trace_create_file("tracing_on", TRACE_MODE_WRITE, d_tracer, + trace_create_file(idmap, "tracing_on", TRACE_MODE_WRITE, d_tracer, tr, &rb_simple_fops); - trace_create_file("timestamp_mode", TRACE_MODE_READ, d_tracer, tr, + trace_create_file(idmap, "timestamp_mode", TRACE_MODE_READ, d_tracer, tr, &trace_time_stamp_mode_fops); tr->buffer_percent = 50; - trace_create_file("buffer_percent", TRACE_MODE_WRITE, d_tracer, + trace_create_file(idmap, "buffer_percent", TRACE_MODE_WRITE, d_tracer, tr, &buffer_percent_fops); - create_trace_options_dir(tr); + create_trace_options_dir(idmap, tr); #ifdef CONFIG_TRACER_MAX_TRACE - trace_create_maxlat_file(tr, d_tracer); + trace_create_maxlat_file(idmap, tr, d_tracer); #endif - if (ftrace_create_function_files(tr, d_tracer)) + if (ftrace_create_function_files(idmap, tr, d_tracer)) MEM_FAIL(1, "Could not allocate function filter files"); #ifdef CONFIG_TRACER_SNAPSHOT - trace_create_file("snapshot", TRACE_MODE_WRITE, d_tracer, + trace_create_file(idmap, "snapshot", TRACE_MODE_WRITE, d_tracer, tr, &snapshot_fops); #endif - trace_create_file("error_log", TRACE_MODE_WRITE, d_tracer, + trace_create_file(idmap, "error_log", TRACE_MODE_WRITE, d_tracer, tr, &tracing_err_log_fops); for_each_tracing_cpu(cpu) - tracing_init_tracefs_percpu(tr, cpu); + tracing_init_tracefs_percpu(idmap, tr, cpu); - ftrace_init_tracefs(tr, d_tracer); + ftrace_init_tracefs(idmap, tr, d_tracer); } static struct vfsmount *trace_automount(struct dentry *mntpt, void *ingore) @@ -9991,32 +9998,32 @@ static __init void tracer_init_tracefs_work_func(struct work_struct *work) event_trace_init(); - init_tracer_tracefs(&global_trace, NULL); + init_tracer_tracefs(&nop_mnt_idmap, &global_trace, NULL); ftrace_init_tracefs_toplevel(&global_trace, NULL); - trace_create_file("tracing_thresh", TRACE_MODE_WRITE, NULL, + trace_create_file(&nop_mnt_idmap, "tracing_thresh", TRACE_MODE_WRITE, NULL, &global_trace, &tracing_thresh_fops); - trace_create_file("README", TRACE_MODE_READ, NULL, + trace_create_file(&nop_mnt_idmap, "README", TRACE_MODE_READ, NULL, NULL, &tracing_readme_fops); - trace_create_file("saved_cmdlines", TRACE_MODE_READ, NULL, + trace_create_file(&nop_mnt_idmap, "saved_cmdlines", TRACE_MODE_READ, NULL, NULL, &tracing_saved_cmdlines_fops); - trace_create_file("saved_cmdlines_size", TRACE_MODE_WRITE, NULL, + trace_create_file(&nop_mnt_idmap, "saved_cmdlines_size", TRACE_MODE_WRITE, NULL, NULL, &tracing_saved_cmdlines_size_fops); - trace_create_file("saved_tgids", TRACE_MODE_READ, NULL, + trace_create_file(&nop_mnt_idmap, "saved_tgids", TRACE_MODE_READ, NULL, NULL, &tracing_saved_tgids_fops); - trace_create_eval_file(NULL); + trace_create_eval_file(&nop_mnt_idmap, NULL); #ifdef CONFIG_MODULES register_module_notifier(&trace_module_nb); #endif #ifdef CONFIG_DYNAMIC_FTRACE - trace_create_file("dyn_ftrace_total_info", TRACE_MODE_READ, NULL, + trace_create_file(&nop_mnt_idmap, "dyn_ftrace_total_info", TRACE_MODE_READ, NULL, NULL, &tracing_dyn_info_fops); #endif diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 0489e72c8169..b8de94e4387d 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -622,9 +622,8 @@ bool tracing_is_disabled(void); bool tracer_tracing_is_on(struct trace_array *tr); void tracer_tracing_on(struct trace_array *tr); void tracer_tracing_off(struct trace_array *tr); -struct dentry *trace_create_file(const char *name, - umode_t mode, - struct dentry *parent, +struct dentry *trace_create_file(struct mnt_idmap *idmap, const char *name, + umode_t mode, struct dentry *parent, void *data, const struct file_operations *fops); @@ -1025,7 +1024,7 @@ static inline int ftrace_trace_task(struct trace_array *tr) FTRACE_PID_IGNORE; } extern int ftrace_is_dead(void); -int ftrace_create_function_files(struct trace_array *tr, +int ftrace_create_function_files(struct mnt_idmap *idmap, struct trace_array *tr, struct dentry *parent); void ftrace_destroy_function_files(struct trace_array *tr); int ftrace_allocate_ftrace_ops(struct trace_array *tr); @@ -1033,7 +1032,8 @@ void ftrace_free_ftrace_ops(struct trace_array *tr); void ftrace_init_global_array_ops(struct trace_array *tr); void ftrace_init_array_ops(struct trace_array *tr, ftrace_func_t func); void ftrace_reset_array_ops(struct trace_array *tr); -void ftrace_init_tracefs(struct trace_array *tr, struct dentry *d_tracer); +void ftrace_init_tracefs(struct mnt_idmap *idmap, struct trace_array *tr, + struct dentry *d_tracer); void ftrace_init_tracefs_toplevel(struct trace_array *tr, struct dentry *d_tracer); void ftrace_clear_pids(struct trace_array *tr); @@ -1046,7 +1046,7 @@ static inline int ftrace_trace_task(struct trace_array *tr) } static inline int ftrace_is_dead(void) { return 0; } static inline int -ftrace_create_function_files(struct trace_array *tr, +ftrace_create_function_files(struct mnt_idmap *idmap, struct trace_array *tr, struct dentry *parent) { return 0; @@ -1060,7 +1060,7 @@ static inline void ftrace_destroy_function_files(struct trace_array *tr) { } static inline __init void ftrace_init_global_array_ops(struct trace_array *tr) { } static inline void ftrace_reset_array_ops(struct trace_array *tr) { } -static inline void ftrace_init_tracefs(struct trace_array *tr, struct dentry *d) { } +static inline void ftrace_init_tracefs(struct mnt_idmap *idmap, struct trace_array *tr, struct dentry *d) { } static inline void ftrace_init_tracefs_toplevel(struct trace_array *tr, struct dentry *d) { } static inline void ftrace_clear_pids(struct trace_array *tr) { } static inline int init_function_trace(void) { return 0; } @@ -1114,7 +1114,7 @@ extern void clear_ftrace_function_probes(struct trace_array *tr); int register_ftrace_command(struct ftrace_func_command *cmd); int unregister_ftrace_command(struct ftrace_func_command *cmd); -void ftrace_create_filter_files(struct ftrace_ops *ops, +void ftrace_create_filter_files(struct mnt_idmap *idmap, struct ftrace_ops *ops, struct dentry *parent); void ftrace_destroy_filter_files(struct ftrace_ops *ops); @@ -1141,7 +1141,7 @@ static inline void clear_ftrace_function_probes(struct trace_array *tr) * The ops parameter passed in is usually undefined. * This must be a macro. */ -#define ftrace_create_filter_files(ops, parent) do { } while (0) +#define ftrace_create_filter_files(idmap, ops, parent) do { } while (0) #define ftrace_destroy_filter_files(ops) do { } while (0) #endif /* CONFIG_FUNCTION_TRACER && CONFIG_DYNAMIC_FTRACE */ @@ -1541,7 +1541,9 @@ extern void trace_event_enable_tgid_record(bool enable); extern int event_trace_init(void); extern int init_events(void); -extern int event_trace_add_tracer(struct dentry *parent, struct trace_array *tr); +extern int event_trace_add_tracer(struct mnt_idmap *idmap, + struct dentry *parent, + struct trace_array *tr); extern int event_trace_del_tracer(struct trace_array *tr); extern void __trace_early_add_events(struct trace_array *tr); diff --git a/kernel/trace/trace_dynevent.c b/kernel/trace/trace_dynevent.c index 4376887e0d8a..c0caf4631f82 100644 --- a/kernel/trace/trace_dynevent.c +++ b/kernel/trace/trace_dynevent.c @@ -263,7 +263,7 @@ static __init int init_dynamic_event(void) if (ret) return 0; - trace_create_file("dynamic_events", TRACE_MODE_WRITE, NULL, + trace_create_file(&nop_mnt_idmap, "dynamic_events", TRACE_MODE_WRITE, NULL, NULL, &dynamic_events_ops); return 0; diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c index f29e815ca5b2..b41e0f26cee4 100644 --- a/kernel/trace/trace_events.c +++ b/kernel/trace/trace_events.c @@ -3768,7 +3768,8 @@ static int events_callback(const char *name, umode_t *mode, void **data, /* Expects to have event_mutex held when called */ static int -create_event_toplevel_files(struct dentry *parent, struct trace_array *tr) +create_event_toplevel_files(struct mnt_idmap *idmap, struct dentry *parent, + struct trace_array *tr) { struct eventfs_inode *e_events; struct dentry *entry; @@ -3788,15 +3789,15 @@ create_event_toplevel_files(struct dentry *parent, struct trace_array *tr) }, }; - entry = trace_create_file("set_event", TRACE_MODE_WRITE, parent, + entry = trace_create_file(idmap, "set_event", TRACE_MODE_WRITE, parent, tr, &ftrace_set_event_fops); if (!entry) return -ENOMEM; nr_entries = ARRAY_SIZE(events_entries); - e_events = eventfs_create_events_dir("events", parent, events_entries, - nr_entries, tr); + e_events = eventfs_create_events_dir(idmap, "events", parent, + events_entries, nr_entries, tr); if (IS_ERR(e_events)) { pr_warn("Could not create tracefs 'events' directory\n"); return -ENOMEM; @@ -3804,12 +3805,11 @@ create_event_toplevel_files(struct dentry *parent, struct trace_array *tr) /* There are not as crucial, just warn if they are not created */ - trace_create_file("set_event_pid", TRACE_MODE_WRITE, parent, - tr, &ftrace_set_event_pid_fops); + trace_create_file(idmap, "set_event_pid", TRACE_MODE_WRITE, parent, tr, + &ftrace_set_event_pid_fops); - trace_create_file("set_event_notrace_pid", - TRACE_MODE_WRITE, parent, tr, - &ftrace_set_event_notrace_pid_fops); + trace_create_file(idmap, "set_event_notrace_pid", TRACE_MODE_WRITE, + parent, tr, &ftrace_set_event_notrace_pid_fops); tr->event_dir = e_events; @@ -3829,13 +3829,14 @@ create_event_toplevel_files(struct dentry *parent, struct trace_array *tr) * * Must be called with event_mutex held. */ -int event_trace_add_tracer(struct dentry *parent, struct trace_array *tr) +int event_trace_add_tracer(struct mnt_idmap *idmap, struct dentry *parent, + struct trace_array *tr) { int ret; lockdep_assert_held(&event_mutex); - ret = create_event_toplevel_files(parent, tr); + ret = create_event_toplevel_files(idmap, parent, tr); if (ret) goto out; @@ -3862,7 +3863,7 @@ early_event_add_tracer(struct dentry *parent, struct trace_array *tr) mutex_lock(&event_mutex); - ret = create_event_toplevel_files(parent, tr); + ret = create_event_toplevel_files(&nop_mnt_idmap, parent, tr); if (ret) goto out_unlock; @@ -4021,7 +4022,7 @@ __init int event_trace_init(void) if (!tr) return -ENODEV; - trace_create_file("available_events", TRACE_MODE_READ, + trace_create_file(&nop_mnt_idmap, "available_events", TRACE_MODE_READ, NULL, tr, &ftrace_avail_fops); ret = early_event_add_tracer(NULL, tr); diff --git a/kernel/trace/trace_events_synth.c b/kernel/trace/trace_events_synth.c index e7af286af4f1..f8fbb0f4ef87 100644 --- a/kernel/trace/trace_events_synth.c +++ b/kernel/trace/trace_events_synth.c @@ -2318,8 +2318,9 @@ static __init int trace_events_synth_init(void) if (err) goto err; - entry = tracefs_create_file("synthetic_events", TRACE_MODE_WRITE, - NULL, NULL, &synth_events_fops); + entry = tracefs_create_file(&nop_mnt_idmap, "synthetic_events", + TRACE_MODE_WRITE, NULL, NULL, + &synth_events_fops); if (!entry) { err = -ENODEV; goto err; diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c index 9f1bfbe105e8..266428151b60 100644 --- a/kernel/trace/trace_functions.c +++ b/kernel/trace/trace_functions.c @@ -77,8 +77,8 @@ void ftrace_free_ftrace_ops(struct trace_array *tr) tr->ops = NULL; } -int ftrace_create_function_files(struct trace_array *tr, - struct dentry *parent) +int ftrace_create_function_files(struct mnt_idmap *idmap, + struct trace_array *tr, struct dentry *parent) { /* * The top level array uses the "global_ops", and the files are @@ -90,7 +90,7 @@ int ftrace_create_function_files(struct trace_array *tr, if (!tr->ops) return -EINVAL; - ftrace_create_filter_files(tr->ops, parent); + ftrace_create_filter_files(idmap, tr->ops, parent); return 0; } diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c index c35fbaab2a47..e7b4fa238eaf 100644 --- a/kernel/trace/trace_functions_graph.c +++ b/kernel/trace/trace_functions_graph.c @@ -1413,8 +1413,8 @@ static __init int init_graph_tracefs(void) if (ret) return 0; - trace_create_file("max_graph_depth", TRACE_MODE_WRITE, NULL, - NULL, &graph_depth_fops); + trace_create_file(&nop_mnt_idmap, "max_graph_depth", TRACE_MODE_WRITE, + NULL, NULL, &graph_depth_fops); return 0; } diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c index b791524a6536..7020ee831929 100644 --- a/kernel/trace/trace_hwlat.c +++ b/kernel/trace/trace_hwlat.c @@ -775,25 +775,25 @@ static int init_tracefs(void) if (ret) return -ENOMEM; - top_dir = tracefs_create_dir("hwlat_detector", NULL); + top_dir = tracefs_create_dir(&nop_mnt_idmap, "hwlat_detector", NULL); if (!top_dir) return -ENOMEM; - hwlat_sample_window = tracefs_create_file("window", TRACE_MODE_WRITE, + hwlat_sample_window = tracefs_create_file(&nop_mnt_idmap, "window", TRACE_MODE_WRITE, top_dir, &hwlat_window, &trace_min_max_fops); if (!hwlat_sample_window) goto err; - hwlat_sample_width = tracefs_create_file("width", TRACE_MODE_WRITE, + hwlat_sample_width = tracefs_create_file(&nop_mnt_idmap, "width", TRACE_MODE_WRITE, top_dir, &hwlat_width, &trace_min_max_fops); if (!hwlat_sample_width) goto err; - hwlat_thread_mode = trace_create_file("mode", TRACE_MODE_WRITE, + hwlat_thread_mode = trace_create_file(&nop_mnt_idmap, "mode", TRACE_MODE_WRITE, top_dir, NULL, &thread_mode_fops); diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 52f8b537dd0a..c74f0524b5d4 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -1914,11 +1914,11 @@ static __init int init_kprobe_trace(void) return 0; /* Event list interface */ - trace_create_file("kprobe_events", TRACE_MODE_WRITE, + trace_create_file(&nop_mnt_idmap, "kprobe_events", TRACE_MODE_WRITE, NULL, NULL, &kprobe_events_ops); /* Profile interface */ - trace_create_file("kprobe_profile", TRACE_MODE_READ, + trace_create_file(&nop_mnt_idmap, "kprobe_profile", TRACE_MODE_READ, NULL, NULL, &kprobe_profile_ops); setup_boot_kprobe_events(); diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c index bd0d01d00fb9..a7d073c2d97c 100644 --- a/kernel/trace/trace_osnoise.c +++ b/kernel/trace/trace_osnoise.c @@ -2692,7 +2692,8 @@ static int init_timerlat_stack_tracefs(struct dentry *top_dir) { struct dentry *tmp; - tmp = tracefs_create_file("print_stack", TRACE_MODE_WRITE, top_dir, + tmp = tracefs_create_file(&nop_mnt_idmap, "print_stack", + TRACE_MODE_WRITE, top_dir, &osnoise_print_stack, &trace_min_max_fops); if (!tmp) return -ENOMEM; @@ -2720,18 +2721,19 @@ static int osnoise_create_cpu_timerlat_fd(struct dentry *top_dir) * Because osnoise/timerlat have a single workload, having * multiple files like these are wast of memory. */ - per_cpu = tracefs_create_dir("per_cpu", top_dir); + per_cpu = tracefs_create_dir(idmap, "per_cpu", top_dir); if (!per_cpu) return -ENOMEM; for_each_possible_cpu(cpu) { snprintf(cpu_str, 30, "cpu%ld", cpu); - cpu_dir = tracefs_create_dir(cpu_str, per_cpu); + cpu_dir = tracefs_create_dir(idmap, cpu_str, per_cpu); if (!cpu_dir) goto out_clean; - timerlat_fd = trace_create_file("timerlat_fd", TRACE_MODE_READ, - cpu_dir, NULL, &timerlat_fd_fops); + timerlat_fd = trace_create_file(&nop_mnt_idmap, "timerlat_fd", + TRACE_MODE_READ, cpu_dir, NULL, + &timerlat_fd_fops); if (!timerlat_fd) goto out_clean; @@ -2754,8 +2756,9 @@ static int init_timerlat_tracefs(struct dentry *top_dir) struct dentry *tmp; int retval; - tmp = tracefs_create_file("timerlat_period_us", TRACE_MODE_WRITE, top_dir, - &timerlat_period, &trace_min_max_fops); + tmp = tracefs_create_file(&nop_mnt_idmap, "timerlat_period_us", + TRACE_MODE_WRITE, top_dir, &timerlat_period, + &trace_min_max_fops); if (!tmp) return -ENOMEM; @@ -2789,36 +2792,43 @@ static int init_tracefs(void) if (ret) return -ENOMEM; - top_dir = tracefs_create_dir("osnoise", NULL); + top_dir = tracefs_create_dir(&nop_mnt_idmap, "osnoise", NULL); if (!top_dir) return 0; - tmp = tracefs_create_file("period_us", TRACE_MODE_WRITE, top_dir, - &osnoise_period, &trace_min_max_fops); + tmp = tracefs_create_file(&nop_mnt_idmap, "period_us", TRACE_MODE_WRITE, + top_dir, &osnoise_period, + &trace_min_max_fops); if (!tmp) goto err; - tmp = tracefs_create_file("runtime_us", TRACE_MODE_WRITE, top_dir, - &osnoise_runtime, &trace_min_max_fops); + tmp = tracefs_create_file(&nop_mnt_idmap, "runtime_us", + TRACE_MODE_WRITE, top_dir, &osnoise_runtime, + &trace_min_max_fops); if (!tmp) goto err; - tmp = tracefs_create_file("stop_tracing_us", TRACE_MODE_WRITE, top_dir, - &osnoise_stop_tracing_in, &trace_min_max_fops); + tmp = tracefs_create_file(&nop_mnt_idmap, "stop_tracing_us", + TRACE_MODE_WRITE, top_dir, + &osnoise_stop_tracing_in, + &trace_min_max_fops); if (!tmp) goto err; - tmp = tracefs_create_file("stop_tracing_total_us", TRACE_MODE_WRITE, top_dir, - &osnoise_stop_tracing_total, &trace_min_max_fops); + tmp = tracefs_create_file(&nop_mnt_idmap, "stop_tracing_total_us", + TRACE_MODE_WRITE, top_dir, + &osnoise_stop_tracing_total, + &trace_min_max_fops); if (!tmp) goto err; - tmp = trace_create_file("cpus", TRACE_MODE_WRITE, top_dir, NULL, &cpus_fops); + tmp = trace_create_file(&nop_mnt_idmap, "cpus", TRACE_MODE_WRITE, + top_dir, NULL, &cpus_fops); if (!tmp) goto err; - tmp = trace_create_file("options", TRACE_MODE_WRITE, top_dir, NULL, - &osnoise_options_fops); + tmp = trace_create_file(&nop_mnt_idmap, "options", TRACE_MODE_WRITE, + top_dir, NULL, &osnoise_options_fops); if (!tmp) goto err; diff --git a/kernel/trace/trace_printk.c b/kernel/trace/trace_printk.c index 29f6e95439b6..1a5cb6aa495e 100644 --- a/kernel/trace/trace_printk.c +++ b/kernel/trace/trace_printk.c @@ -384,8 +384,8 @@ static __init int init_trace_printk_function_export(void) if (ret) return 0; - trace_create_file("printk_formats", TRACE_MODE_READ, NULL, - NULL, &ftrace_formats_fops); + trace_create_file(&nop_mnt_idmap, "printk_formats", TRACE_MODE_READ, + NULL, NULL, &ftrace_formats_fops); return 0; } diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c index 5a48dba912ea..06e63086a081 100644 --- a/kernel/trace/trace_stack.c +++ b/kernel/trace/trace_stack.c @@ -559,15 +559,16 @@ static __init int stack_trace_init(void) if (ret) return 0; - trace_create_file("stack_max_size", TRACE_MODE_WRITE, NULL, - &stack_trace_max_size, &stack_max_size_fops); + trace_create_file(&nop_mnt_idmap, "stack_max_size", TRACE_MODE_WRITE, + NULL, &stack_trace_max_size, &stack_max_size_fops); - trace_create_file("stack_trace", TRACE_MODE_READ, NULL, + trace_create_file(&nop_mnt_idmap, "stack_trace", TRACE_MODE_READ, NULL, NULL, &stack_trace_fops); #ifdef CONFIG_DYNAMIC_FTRACE - trace_create_file("stack_trace_filter", TRACE_MODE_WRITE, NULL, - &trace_ops, &stack_trace_filter_fops); + trace_create_file(&nop_mnt_idmap, "stack_trace_filter", + TRACE_MODE_WRITE, NULL, &trace_ops, + &stack_trace_filter_fops); #endif if (stack_trace_filter_buf[0]) diff --git a/kernel/trace/trace_stat.c b/kernel/trace/trace_stat.c index bb247beec447..ca6e322468aa 100644 --- a/kernel/trace/trace_stat.c +++ b/kernel/trace/trace_stat.c @@ -282,7 +282,7 @@ static int tracing_stat_init(void) if (ret) return -ENODEV; - stat_dir = tracefs_create_dir("trace_stat", NULL); + stat_dir = tracefs_create_dir(&nop_mnt_idmap, "trace_stat", NULL); if (!stat_dir) { pr_warn("Could not create tracefs 'trace_stat' entry\n"); return -ENOMEM; @@ -297,8 +297,8 @@ static int init_stat_file(struct stat_session *session) if (!stat_dir && (ret = tracing_stat_init())) return ret; - session->file = tracefs_create_file(session->ts->name, TRACE_MODE_WRITE, - stat_dir, session, + session->file = tracefs_create_file(&nop_mnt_idmap, session->ts->name, + TRACE_MODE_WRITE, stat_dir, session, &tracing_stat_fops); if (!session->file) return -ENOMEM; diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index 99c051de412a..060160c63f43 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -1654,10 +1654,10 @@ static __init int init_uprobe_trace(void) if (ret) return 0; - trace_create_file("uprobe_events", TRACE_MODE_WRITE, NULL, + trace_create_file(&nop_mnt_idmap, "uprobe_events", TRACE_MODE_WRITE, NULL, NULL, &uprobe_events_ops); /* Profile interface */ - trace_create_file("uprobe_profile", TRACE_MODE_READ, NULL, + trace_create_file(&nop_mnt_idmap, "uprobe_profile", TRACE_MODE_READ, NULL, NULL, &uprobe_profile_ops); return 0; }
On Sun, Jan 07, 2024 at 01:42:39PM +0100, Christian Brauner wrote: > > > So tracefs supports remounting with different uid/gid mount options and > > > then actually wades through _all_ of the inodes and changes their > > > ownership internally? What's the use-case for this? Containers? > > > > No, in fact tracing doesn't work well with containers as tracing is global > > to the entire machine. It can work with privileged containers though. > > At least the tracefs interface is easily supportable within a delegation > model. IOW, you have a privileged process that delegates relevant > portions to a container via idmapped mounts _without_ doing the insane thing > and making it mountable by a container aka the fs-to-CVE pipeline. > > > > > The reason for this is because tracefs was based off of debugfs where the > > files and directores are created at boot up and mounted later. The reason > > to do this was to allow users to mount with gid=GID to allow a given group > > to have access to tracing. Without this update, tracefs would ignore it > > like debugfs and proc does today. > > > > I think its time I explain the purpose of tracefs and how it came to be. > > > > The tracing system required a way to control tracing and read the traces. > > It could have just used a new system like perf (although > > /sys/kernel/debug/tracing predates perf), where it created a single ioctl() > > like system call do do everything. > > > > As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev > > tracer, which both have an embedded background, I chose an interface that > > could work with just an unmodified version of busybox. That is, I wanted it > > to work with just cat and echo. > > > > The main difference with tracefs compared to other file systems is that it > > is a control interface, where writes happen as much as reads. The data read > > is controlled. The closest thing I can think of is how cgroups work. > > > > As tracing is a privileged operation, but something that could be changed > > to allow a group to have access to, I wanted to make it easy for an admin > > to decide who gets to do what at boot up via the /etc/fstab file. > > Yeah, ok. I think you could achieve the same thing via idmapped mounts. You > just need to swap out the mnt on /sys/kernel/tracing with an idmapped mount. > > mount(8) should just give you the ability to specify "map the ids I explicitly > want to remap to something else and for the rest use the identity mapping". I > wanted that for other reasons anyway. > > So in one of the next versions of mount(8) you can then do (where --beneath > means place the mount beneath the current one and --replace is > self-explanatory): > > sudo mount --beneath -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing > sudo umount /sys/kernel/tracing > > or as a shortcut provided by mount(8): > > sudo mount --replace -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing > > In both cases you replace the mount without unmounting tracefs. > > I can illustrate this right now though: > > user1@localhost:~$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1000:1' /sys/kernel/tracing/ /mnt/ > > # This is a tool I wrote for testing the patchset I wrote back then. > user1@localhost:~/data/move-mount-beneath$ sudo ./move-mount --beneath --detached /mnt /sys/kernel/tracing > Mounting beneath top mount > Creating anonymous mount > Attaching mount /mnt -> /sys/kernel/tracing > Creating single detached mount > > user1@localhost:~/data/move-mount-beneath$ > > # Now there's two mounts stacked on top of each other. > user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing > | `-/sys/kernel/tracing tracefs tracefs rw,nosuid,nodev,noexec,relatime,idmapped > | `-/sys/kernel/tracing tracefs tracefs rw,nosuid,nodev,noexec,relatime > > user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| head > total 0 > drwx------ 6 root root 0 Jan 7 13:33 . > drwxr-xr-x 16 root root 0 Jan 7 13:33 .. > -r--r----- 1 root root 0 Jan 7 13:33 README > -r--r----- 1 root root 0 Jan 7 13:33 available_events > -r--r----- 1 root root 0 Jan 7 13:33 available_filter_functions > -r--r----- 1 root root 0 Jan 7 13:33 available_filter_functions_addrs > -r--r----- 1 root root 0 Jan 7 13:33 available_tracers > -rw-r----- 1 root root 0 Jan 7 13:33 buffer_percent > -rw-r----- 1 root root 0 Jan 7 13:33 buffer_size_kb > > # Reveal updated mount > user1@localhost:~/data/move-mount-beneath$ sudo umount /sys/kernel/tracing > > user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing > | `-/sys/kernel/tracing tracefs tracefs rw,nosuid,nodev,noexec,relatime,idmapped > > user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| head > total 0 > drwx------ 6 user1 user1 0 Jan 7 13:33 . > drwxr-xr-x 16 root root 0 Jan 7 13:33 .. > -r--r----- 1 user1 user1 0 Jan 7 13:33 README > -r--r----- 1 user1 user1 0 Jan 7 13:33 available_events > -r--r----- 1 user1 user1 0 Jan 7 13:33 available_filter_functions > -r--r----- 1 user1 user1 0 Jan 7 13:33 available_filter_functions_addrs > -r--r----- 1 user1 user1 0 Jan 7 13:33 available_tracers > -rw-r----- 1 user1 user1 0 Jan 7 13:33 buffer_percent > -rw-r----- 1 user1 user1 0 Jan 7 13:33 buffer_size_kb > > sudo umount -l /sys/kernel/tracing > > and reveal the new mount with updated permissions and at no point in time will > you have had to unmount tracefs itself. No chown needed, no remount needed that > has to touch all inodes. > > I did perf numbers for this when I implemented this and there's no meaningful > perf impact for anything below 5 mappings. which covers 80% of the use-cases. > > I mean, there are crazy people out there that do have 30 mappings, maybe. And > maybe there's 3 users that go into the hundreds (340 is maximum so struct idmap > still fits into cacheline) because of some LDAP crap or something. > > See an ext4 filesystem to open/create 1,000,000 files. Then I looped through > all of the files calling fstat() on each of them 1000 times and calculated the > mean fstat() time for a single file. > > | # MAPPINGS | PATCH-NEW | > |--------------|-----------| > | 0 mappings | 158 ns | > | 1 mappings | 157 ns | > | 2 mappings | 158 ns | > | 3 mappings | 161 ns | > | 5 mappings | 165 ns | > | 10 mappings | 199 ns | > | 50 mappings | 218 ns | > | 100 mappings | 229 ns | > | 200 mappings | 239 ns | > | 300 mappings | 240 ns | > | 340 mappings | 248 ns | > > > > > > > > > Aside from optimizing this and the special semantics for this eventfs > > > stuff that you really should think twice of doing, here's one idea for > > > an extension that might alleviate some of the pain: > > > > > > If you need flexible dynamic ownership change to e.g., be able to > > > delegate (all, a directory, a single file of) tracefs to > > > unprivileged/containers/whatever then you might want to consider > > > supporting idmapped mounts for tracefs. Because then you can do stuff > > > like: > > > > I'm a novice here and have no idea on how id maps work ;-) > > It's no magic and you don't even need to care about it. If you're on > util-linux 2.39 and any kernel post 5.12 the -o X-mount.idmap option > does all the details for you. > > Though I still want an api where you can just pass the idmappings directly to > mount_setattr(). That's another topic though. > > > have. Now, can we do that and still keep the dynamic creation of inodes and > > dentries? > > Yes. > > > Does this require having more than one dentry per inode? > > No. It's just a topological change. The same subtree exposed at different > locations with the ability of exposing it with different permissions (without > any persistent filesystem-level changes). > > So, I tried to do an exploratory patch even though I promised myself not > to do it. But hey... > > Some notes: > > * Permission handling for idmapped mounts is done completely in the > VFS. That's the case for all filesytems that don't have a custom > ->permission() handler. So there's nothing to do for us here. > > * Idmapped mount handling for ->getattr() is done completely by the VFS > if the filesystem doesn't have a custom ->getattr() handler. So we're > done here. > > * Tracefs doesn't support attribute changes via ->setattr() (chown(), > chmod etc.). So there's nothing to here. > > * Eventfs does support attribute changes via ->setattr(). But it relies > on simple_setattr() which is already idmapped mount aware. So there's > nothing for us to do. > > * Ownership is inherited from the parent inode (tracefs) or optionally > from stashed ownership information (eventfs). That means the idmapping > is irrelevant here. It's similar to the "inherit gid from parent > directory" logic we have in some circumstances. TL;DR nothing to do > here as well. > > * Tracefs supports the creation of instances from userspace via mkdir. > For example, > > mkdir /sys/kernel/tracing/instances/foo > > And here the idmapping is relevant so we need to make the helpers > aware of the idmapping. > > I just went and plumbed this through to most helpers. > > There's some subtlety in eventfs. Afaict, the directories and files for > the individual events are created on-demand during lookup or readdir. > > The ownership of these events is again inherited from the parent inode > or recovered from stored state. In both cases the actual idmapping is > irrelevant. > > The callchain here is: > > eventfs_root_lookup("xfs", "events") > -> create_{dir,file}_dentry("xfs", "events") > -> create_{dir,file}("xfs", "events") > -> eventfs_start_creating("xfs", "events") > -> lookup_one_len("xfs", "events") > > And the subtlety is that lookup_one_len() does permission checking on > the parent inode (IOW, if you want a dentry for "blech" under "events" > it'll do a permission check on events->d_inode) for exec permissions > and then goes on to give you a new dentry. > > Usually this call would have to be changed to lookup_one() and the > idmapping be handed down to it. But I think that's irrelevant here. > > Lookup generally doesn't need to be aware of idmappings at all. The > permission checking is done purely in the vfs via may_lookup() and the > idmapping is irrelevant because we always initialize inodes with the > filesystem level ownership (see the idmappings.rst) documentation if > you're interested in excessive details (otherwise you get inode aliases > which you really don't want). > > For tracefs it would not matter for lookup per se but only because > tracefs seemingly creates inodes/dentries during lookup (and readdir()). > > But imho the permission checking done in current eventfs_root_lookup() > via lookup_one_len() is meaningless in any way; possibly even > (conceptually) wrong. > > Because, the actual permission checking for the creation of the eventfs > entries isn't really done during lookup or readdir, it's done when mkdir > is called: > > mkdir /sys/kernel/tracing/instances/foo > > Here, all possible entries beneath foo including "events" and further > below are recorded and stored. So once mkdir returns it basically means > that it succeeded with the creation of all the necessary directories and > files. For all purposes the foo/events/ directory and below have all the > entries that matter. They have been created. It's comparable to them not > being in the {d,i}cache, I guess. > > When one goes and looksup stuff under foo/events/ or readdir the entries > in that directory: > > fd = open("foo/events") > readdir(fd, ...) > > then they are licensed to list an entry in that directory. So all that > needs to be done is to actually list those files in that directory. And > since they already exist (they were created during mkdir) we just need > to splice in inodes and dentries for them. But for that we shouldn't > check permissions on the directory again. Because we've done that > already correctly when the VFS called may_lookup(). > > IOW, the inode_permission() in lookup_one_len() that eventfs does is > redundant and just wrong. > > Luckily, I don't think we need to even change anything because all > directories that eventfs creates always grant exec permissions to the > other group so lookup_one_len() will trivially succeed. IIUC. > > Drafted-by-with-no-guarantees-whatsoever-that-this-wont-burn-the-house-down: Christian Brauner <brauner@kernel.org> Actually, I think that patch can be way simpler for a similar reason than I mentioned before. In the regular system call for tracefs we have: mkdir /sys/kernel/tracing/instances/foo -> do_mkdirat("foo") -> i_op->mkdir::tracefs_syscall_mkdir(child_dentry::foo, parent_inode) -> char *name = get_dname(child_dentry::foo) // duplicate the dentries name -> tracefs_ops.mkdir::instance_mkdir("foo") -> trace_array_create("foo") -> trace_array_create_dir("foo") -> tracefs_create_dir("foo", trace_instance_dir) -> __create_dir("foo", trace_instance_dir) -> tracefs_start_creating("foo", trace_instance_dir) -> lookup_one_len("foo", trace_instance_dir) // perform the lookup again and find @child_dentry again that the vfs already created So afaict, the vfs has created you the correct target dentry and performed the required permission checks already. You then go on to retrieve and duplicate the name of the dentry and pass that name down. And then you perform a lookup via lookup_one_len() on the instances directory. And here lookup_one_len() performs the permission checking again that the vfs already did in may_lookup() when it walked up to the "instances" directory. The reason I'm saying this is that for tracefs _the only reason_ we need that idmapping in any of these helpers at all is because of the superfluous permission check in lookup_one_len(). Because for actual ownership of the inode it's completely irrelevant because you inherit the ownership anyway. IOW, the callers fs{g,u}id etc is entirely irrelevant. So basically if you massage the system call path to simply reuse the dentry that the vfs already given you instead of looking it up again later the whole patch is literally (baring anything minor I forgot): diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index ae648deed019..2de3ec114793 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -439,6 +439,7 @@ static struct file_system_type trace_fs_type = { .name = "tracefs", .mount = trace_mount, .kill_sb = kill_litter_super, + .fs_flags = FS_ALLOW_IDMAP, }; MODULE_ALIAS_FS("tracefs"); Because tracefs simply inherits ownership from parent directories or from stashed info the idmapping is entirely immaterial on the creation side of things. As soon as the permission checking in the VFS has succeeded we know that things are alright. So really, the complexity is getting in your way here. The tracefs_ops.mkdir thing is only used in a single location and that is kernel/trace/trace.c: trace_instance_dir = tracefs_create_instance_dir("instances", d_tracer, and there's only one callback instance_mkdir(). otherwise that callback ops stuff is completely unused iiuc. If that's the case it would make a lot more sense to just reuse everything that the VFS has given you in the regular system call path than to do that "copy the name and do the lookup again including permission checks later" dance. Idk, maybe I'm seeing things and there's an obvious reason why my thinking is all wrong and that permission check is required. I just don't see it though. But I'm sure people will tell me what my error is. At least for the system call path it shouldn't be.
On Sun, Jan 07, 2024 at 06:42:33PM +0100, Christian Brauner wrote: > On Sun, Jan 07, 2024 at 01:42:39PM +0100, Christian Brauner wrote: > > > > So tracefs supports remounting with different uid/gid mount options and > > > > then actually wades through _all_ of the inodes and changes their > > > > ownership internally? What's the use-case for this? Containers? > > > > > > No, in fact tracing doesn't work well with containers as tracing is global > > > to the entire machine. It can work with privileged containers though. > > > > At least the tracefs interface is easily supportable within a delegation > > model. IOW, you have a privileged process that delegates relevant > > portions to a container via idmapped mounts _without_ doing the insane thing > > and making it mountable by a container aka the fs-to-CVE pipeline. > > > > > > > > The reason for this is because tracefs was based off of debugfs where the > > > files and directores are created at boot up and mounted later. The reason > > > to do this was to allow users to mount with gid=GID to allow a given group > > > to have access to tracing. Without this update, tracefs would ignore it > > > like debugfs and proc does today. > > > > > > I think its time I explain the purpose of tracefs and how it came to be. > > > > > > The tracing system required a way to control tracing and read the traces. > > > It could have just used a new system like perf (although > > > /sys/kernel/debug/tracing predates perf), where it created a single ioctl() > > > like system call do do everything. > > > > > > As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev > > > tracer, which both have an embedded background, I chose an interface that > > > could work with just an unmodified version of busybox. That is, I wanted it > > > to work with just cat and echo. > > > > > > The main difference with tracefs compared to other file systems is that it > > > is a control interface, where writes happen as much as reads. The data read > > > is controlled. The closest thing I can think of is how cgroups work. > > > > > > As tracing is a privileged operation, but something that could be changed > > > to allow a group to have access to, I wanted to make it easy for an admin > > > to decide who gets to do what at boot up via the /etc/fstab file. > > > > Yeah, ok. I think you could achieve the same thing via idmapped mounts. You > > just need to swap out the mnt on /sys/kernel/tracing with an idmapped mount. > > > > mount(8) should just give you the ability to specify "map the ids I explicitly > > want to remap to something else and for the rest use the identity mapping". I > > wanted that for other reasons anyway. > > > > So in one of the next versions of mount(8) you can then do (where --beneath > > means place the mount beneath the current one and --replace is > > self-explanatory): > > > > sudo mount --beneath -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing > > sudo umount /sys/kernel/tracing > > > > or as a shortcut provided by mount(8): > > > > sudo mount --replace -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing > > > > In both cases you replace the mount without unmounting tracefs. > > > > I can illustrate this right now though: > > > > user1@localhost:~$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1000:1' /sys/kernel/tracing/ /mnt/ > > > > # This is a tool I wrote for testing the patchset I wrote back then. > > user1@localhost:~/data/move-mount-beneath$ sudo ./move-mount --beneath --detached /mnt /sys/kernel/tracing > > Mounting beneath top mount > > Creating anonymous mount > > Attaching mount /mnt -> /sys/kernel/tracing > > Creating single detached mount > > > > user1@localhost:~/data/move-mount-beneath$ > > > > # Now there's two mounts stacked on top of each other. > > user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing > > | `-/sys/kernel/tracing tracefs tracefs rw,nosuid,nodev,noexec,relatime,idmapped > > | `-/sys/kernel/tracing tracefs tracefs rw,nosuid,nodev,noexec,relatime > > > > user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| head > > total 0 > > drwx------ 6 root root 0 Jan 7 13:33 . > > drwxr-xr-x 16 root root 0 Jan 7 13:33 .. > > -r--r----- 1 root root 0 Jan 7 13:33 README > > -r--r----- 1 root root 0 Jan 7 13:33 available_events > > -r--r----- 1 root root 0 Jan 7 13:33 available_filter_functions > > -r--r----- 1 root root 0 Jan 7 13:33 available_filter_functions_addrs > > -r--r----- 1 root root 0 Jan 7 13:33 available_tracers > > -rw-r----- 1 root root 0 Jan 7 13:33 buffer_percent > > -rw-r----- 1 root root 0 Jan 7 13:33 buffer_size_kb > > > > # Reveal updated mount > > user1@localhost:~/data/move-mount-beneath$ sudo umount /sys/kernel/tracing > > > > user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing > > | `-/sys/kernel/tracing tracefs tracefs rw,nosuid,nodev,noexec,relatime,idmapped > > > > user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| head > > total 0 > > drwx------ 6 user1 user1 0 Jan 7 13:33 . > > drwxr-xr-x 16 root root 0 Jan 7 13:33 .. > > -r--r----- 1 user1 user1 0 Jan 7 13:33 README > > -r--r----- 1 user1 user1 0 Jan 7 13:33 available_events > > -r--r----- 1 user1 user1 0 Jan 7 13:33 available_filter_functions > > -r--r----- 1 user1 user1 0 Jan 7 13:33 available_filter_functions_addrs > > -r--r----- 1 user1 user1 0 Jan 7 13:33 available_tracers > > -rw-r----- 1 user1 user1 0 Jan 7 13:33 buffer_percent > > -rw-r----- 1 user1 user1 0 Jan 7 13:33 buffer_size_kb > > > > sudo umount -l /sys/kernel/tracing > > > > and reveal the new mount with updated permissions and at no point in time will > > you have had to unmount tracefs itself. No chown needed, no remount needed that > > has to touch all inodes. > > > > I did perf numbers for this when I implemented this and there's no meaningful > > perf impact for anything below 5 mappings. which covers 80% of the use-cases. > > > > I mean, there are crazy people out there that do have 30 mappings, maybe. And > > maybe there's 3 users that go into the hundreds (340 is maximum so struct idmap > > still fits into cacheline) because of some LDAP crap or something. > > > > See an ext4 filesystem to open/create 1,000,000 files. Then I looped through > > all of the files calling fstat() on each of them 1000 times and calculated the > > mean fstat() time for a single file. > > > > | # MAPPINGS | PATCH-NEW | > > |--------------|-----------| > > | 0 mappings | 158 ns | > > | 1 mappings | 157 ns | > > | 2 mappings | 158 ns | > > | 3 mappings | 161 ns | > > | 5 mappings | 165 ns | > > | 10 mappings | 199 ns | > > | 50 mappings | 218 ns | > > | 100 mappings | 229 ns | > > | 200 mappings | 239 ns | > > | 300 mappings | 240 ns | > > | 340 mappings | 248 ns | > > > > > > > > > > > > > Aside from optimizing this and the special semantics for this eventfs > > > > stuff that you really should think twice of doing, here's one idea for > > > > an extension that might alleviate some of the pain: > > > > > > > > If you need flexible dynamic ownership change to e.g., be able to > > > > delegate (all, a directory, a single file of) tracefs to > > > > unprivileged/containers/whatever then you might want to consider > > > > supporting idmapped mounts for tracefs. Because then you can do stuff > > > > like: > > > > > > I'm a novice here and have no idea on how id maps work ;-) > > > > It's no magic and you don't even need to care about it. If you're on > > util-linux 2.39 and any kernel post 5.12 the -o X-mount.idmap option > > does all the details for you. > > > > Though I still want an api where you can just pass the idmappings directly to > > mount_setattr(). That's another topic though. > > > > > have. Now, can we do that and still keep the dynamic creation of inodes and > > > dentries? > > > > Yes. > > > > > Does this require having more than one dentry per inode? > > > > No. It's just a topological change. The same subtree exposed at different > > locations with the ability of exposing it with different permissions (without > > any persistent filesystem-level changes). > > > > So, I tried to do an exploratory patch even though I promised myself not > > to do it. But hey... > > > > Some notes: > > > > * Permission handling for idmapped mounts is done completely in the > > VFS. That's the case for all filesytems that don't have a custom > > ->permission() handler. So there's nothing to do for us here. > > > > * Idmapped mount handling for ->getattr() is done completely by the VFS > > if the filesystem doesn't have a custom ->getattr() handler. So we're > > done here. > > > > * Tracefs doesn't support attribute changes via ->setattr() (chown(), > > chmod etc.). So there's nothing to here. > > > > * Eventfs does support attribute changes via ->setattr(). But it relies > > on simple_setattr() which is already idmapped mount aware. So there's > > nothing for us to do. > > > > * Ownership is inherited from the parent inode (tracefs) or optionally > > from stashed ownership information (eventfs). That means the idmapping > > is irrelevant here. It's similar to the "inherit gid from parent > > directory" logic we have in some circumstances. TL;DR nothing to do > > here as well. > > > > * Tracefs supports the creation of instances from userspace via mkdir. > > For example, > > > > mkdir /sys/kernel/tracing/instances/foo > > > > And here the idmapping is relevant so we need to make the helpers > > aware of the idmapping. > > > > I just went and plumbed this through to most helpers. > > > > There's some subtlety in eventfs. Afaict, the directories and files for > > the individual events are created on-demand during lookup or readdir. > > > > The ownership of these events is again inherited from the parent inode > > or recovered from stored state. In both cases the actual idmapping is > > irrelevant. > > > > The callchain here is: > > > > eventfs_root_lookup("xfs", "events") > > -> create_{dir,file}_dentry("xfs", "events") > > -> create_{dir,file}("xfs", "events") > > -> eventfs_start_creating("xfs", "events") > > -> lookup_one_len("xfs", "events") > > > > And the subtlety is that lookup_one_len() does permission checking on > > the parent inode (IOW, if you want a dentry for "blech" under "events" > > it'll do a permission check on events->d_inode) for exec permissions > > and then goes on to give you a new dentry. > > > > Usually this call would have to be changed to lookup_one() and the > > idmapping be handed down to it. But I think that's irrelevant here. > > > > Lookup generally doesn't need to be aware of idmappings at all. The > > permission checking is done purely in the vfs via may_lookup() and the > > idmapping is irrelevant because we always initialize inodes with the > > filesystem level ownership (see the idmappings.rst) documentation if > > you're interested in excessive details (otherwise you get inode aliases > > which you really don't want). > > > > For tracefs it would not matter for lookup per se but only because > > tracefs seemingly creates inodes/dentries during lookup (and readdir()). > > > > But imho the permission checking done in current eventfs_root_lookup() > > via lookup_one_len() is meaningless in any way; possibly even > > (conceptually) wrong. > > > > Because, the actual permission checking for the creation of the eventfs > > entries isn't really done during lookup or readdir, it's done when mkdir > > is called: > > > > mkdir /sys/kernel/tracing/instances/foo > > > > Here, all possible entries beneath foo including "events" and further > > below are recorded and stored. So once mkdir returns it basically means > > that it succeeded with the creation of all the necessary directories and > > files. For all purposes the foo/events/ directory and below have all the > > entries that matter. They have been created. It's comparable to them not > > being in the {d,i}cache, I guess. > > > > When one goes and looksup stuff under foo/events/ or readdir the entries > > in that directory: > > > > fd = open("foo/events") > > readdir(fd, ...) > > > > then they are licensed to list an entry in that directory. So all that > > needs to be done is to actually list those files in that directory. And > > since they already exist (they were created during mkdir) we just need > > to splice in inodes and dentries for them. But for that we shouldn't > > check permissions on the directory again. Because we've done that > > already correctly when the VFS called may_lookup(). > > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is > > redundant and just wrong. > > > > Luckily, I don't think we need to even change anything because all > > directories that eventfs creates always grant exec permissions to the > > other group so lookup_one_len() will trivially succeed. IIUC. > > > > Drafted-by-with-no-guarantees-whatsoever-that-this-wont-burn-the-house-down: Christian Brauner <brauner@kernel.org> > > Actually, I think that patch can be way simpler for a similar reason > than I mentioned before. In the regular system call for tracefs we have: > > mkdir /sys/kernel/tracing/instances/foo > > -> do_mkdirat("foo") > -> i_op->mkdir::tracefs_syscall_mkdir(child_dentry::foo, parent_inode) > -> char *name = get_dname(child_dentry::foo) // duplicate the dentries name > -> tracefs_ops.mkdir::instance_mkdir("foo") > -> trace_array_create("foo") > -> trace_array_create_dir("foo") > -> tracefs_create_dir("foo", trace_instance_dir) > -> __create_dir("foo", trace_instance_dir) > -> tracefs_start_creating("foo", trace_instance_dir) > -> lookup_one_len("foo", trace_instance_dir) // perform the lookup again and find @child_dentry again that the vfs already created > > So afaict, the vfs has created you the correct target dentry and > performed the required permission checks already. > > You then go on to retrieve and duplicate the name of the dentry and pass > that name down. And then you perform a lookup via lookup_one_len() on > the instances directory. And here lookup_one_len() performs the > permission checking again that the vfs already did in may_lookup() when > it walked up to the "instances" directory. > > The reason I'm saying this is that for tracefs _the only reason_ we need > that idmapping in any of these helpers at all is because of the > superfluous permission check in lookup_one_len(). Because for actual > ownership of the inode it's completely irrelevant because you inherit > the ownership anyway. IOW, the callers fs{g,u}id etc is entirely > irrelevant. > > So basically if you massage the system call path to simply reuse the > dentry that the vfs already given you instead of looking it up again > later the whole patch is literally (baring anything minor I forgot): > > > diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c > index ae648deed019..2de3ec114793 100644 > --- a/fs/tracefs/inode.c > +++ b/fs/tracefs/inode.c > @@ -439,6 +439,7 @@ static struct file_system_type trace_fs_type = { > .name = "tracefs", > .mount = trace_mount, > .kill_sb = kill_litter_super, > + .fs_flags = FS_ALLOW_IDMAP, > }; > MODULE_ALIAS_FS("tracefs"); > > Because tracefs simply inherits ownership from parent directories or > from stashed info the idmapping is entirely immaterial on the creation > side of things. As soon as the permission checking in the VFS has > succeeded we know that things are alright. > > So really, the complexity is getting in your way here. The > tracefs_ops.mkdir thing is only used in a single location and that is > > kernel/trace/trace.c: trace_instance_dir = tracefs_create_instance_dir("instances", d_tracer, > > and there's only one callback instance_mkdir(). > > otherwise that callback ops stuff is completely unused iiuc. If that's > the case it would make a lot more sense to just reuse everything that > the VFS has given you in the regular system call path than to do that > "copy the name and do the lookup again including permission checks > later" dance. > > Idk, maybe I'm seeing things and there's an obvious reason why my > thinking is all wrong and that permission check is required. I just > don't see it though. But I'm sure people will tell me what my error is. > > At least for the system call path it shouldn't be. I guess an argument would be that the check is somewhat required because you create a whole hierarchy of directories and files beneath the top level directory that the system call path creates. But then would you ever want to fail creating any directories or files beneath /sys/kernel/tracing/instances/foo/* if the VFS told you that you have the required permission to create the top level directory /sys/kernel/tracing/instances/foo? I think that isn't the case and it isn't even possible currently. So even for the directories and files beneath the top level directory the permission check is irrelevant iiuc. Anyway, I don't think you necessarily need to change any of what you do right now. The first version of the patch will work with what you currently have and is correct. It's just more complex because of that additional permission checking that I think isn't needed. Because it's not that the user has any control over the creation of the subtree beneath /sys/kernel/tracing/instances/foo/*. But again, I might be completely off.
On Sun, 7 Jan 2024 13:42:39 +0100 Christian Brauner <brauner@kernel.org> wrote: > > So, I tried to do an exploratory patch even though I promised myself not > to do it. But hey... > > Some notes: > > * Permission handling for idmapped mounts is done completely in the > VFS. That's the case for all filesytems that don't have a custom > ->permission() handler. So there's nothing to do for us here. > > * Idmapped mount handling for ->getattr() is done completely by the VFS > if the filesystem doesn't have a custom ->getattr() handler. So we're > done here. > > * Tracefs doesn't support attribute changes via ->setattr() (chown(), > chmod etc.). So there's nothing to here. > > * Eventfs does support attribute changes via ->setattr(). But it relies > on simple_setattr() which is already idmapped mount aware. So there's > nothing for us to do. > > * Ownership is inherited from the parent inode (tracefs) or optionally > from stashed ownership information (eventfs). That means the idmapping > is irrelevant here. It's similar to the "inherit gid from parent > directory" logic we have in some circumstances. TL;DR nothing to do > here as well. The reason ownership is inherited from the parent is because the inodes are created at boot up before user space starts. eventfs inodes are created on demand after user space so it needs to either check the "default" ownership and permissions, or if the user changed a specific file/directory, it must save it and use that again if the inode/dentry are reclaimed and then referenced again and recreated. > > * Tracefs supports the creation of instances from userspace via mkdir. > For example, > > mkdir /sys/kernel/tracing/instances/foo > > And here the idmapping is relevant so we need to make the helpers > aware of the idmapping. > > I just went and plumbed this through to most helpers. > > There's some subtlety in eventfs. Afaict, the directories and files for > the individual events are created on-demand during lookup or readdir. > > The ownership of these events is again inherited from the parent inode > or recovered from stored state. In both cases the actual idmapping is > irrelevant. > > The callchain here is: > > eventfs_root_lookup("xfs", "events") > -> create_{dir,file}_dentry("xfs", "events") > -> create_{dir,file}("xfs", "events") > -> eventfs_start_creating("xfs", "events") > -> lookup_one_len("xfs", "events") > > And the subtlety is that lookup_one_len() does permission checking on > the parent inode (IOW, if you want a dentry for "blech" under "events" > it'll do a permission check on events->d_inode) for exec permissions > and then goes on to give you a new dentry. > > Usually this call would have to be changed to lookup_one() and the > idmapping be handed down to it. But I think that's irrelevant here. > > Lookup generally doesn't need to be aware of idmappings at all. The > permission checking is done purely in the vfs via may_lookup() and the > idmapping is irrelevant because we always initialize inodes with the > filesystem level ownership (see the idmappings.rst) documentation if > you're interested in excessive details (otherwise you get inode aliases > which you really don't want). > > For tracefs it would not matter for lookup per se but only because > tracefs seemingly creates inodes/dentries during lookup (and readdir()). tracefs creates the inodes/dentries at boot up, it's eventfs that does it on demand during lookup. For inodes/dentries: /sys/kernel/tracing/* is all created at boot up, except for "events". /sys/kernel/tracing/events/* is created on demand. > > But imho the permission checking done in current eventfs_root_lookup() > via lookup_one_len() is meaningless in any way; possibly even > (conceptually) wrong. > > Because, the actual permission checking for the creation of the eventfs > entries isn't really done during lookup or readdir, it's done when mkdir > is called: > > mkdir /sys/kernel/tracing/instances/foo No. that creates a entire new tracefs instance, which happens to include another eventfs directory. The eventsfs directory is in "/sys/kernel/tracing/events" and "/sys/kernel/tracing/instances/*/events" eventfs has 10s of thousands of files and directories which is why I changed it to be on-demand inode/dentry creation and also reclaiming when no longer accessed. # ls /sys/kernel/tracing/events/ will create the inodes and dentries, and a memory stress program will free those created inodes and dentries. > > Here, all possible entries beneath foo including "events" and further > below are recorded and stored. So once mkdir returns it basically means > that it succeeded with the creation of all the necessary directories and > files. For all purposes the foo/events/ directory and below have all the > entries that matter. They have been created. It's comparable to them not > being in the {d,i}cache, I guess. No. Only the meta data is created for the eventfs directory with a mkdir instances/foo. The inodes and dentries are not there. > > When one goes and looksup stuff under foo/events/ or readdir the entries > in that directory: > > fd = open("foo/events") > readdir(fd, ...) > > then they are licensed to list an entry in that directory. So all that > needs to be done is to actually list those files in that directory. And > since they already exist (they were created during mkdir) we just need > to splice in inodes and dentries for them. But for that we shouldn't > check permissions on the directory again. Because we've done that > already correctly when the VFS called may_lookup(). No they do not exist. > > IOW, the inode_permission() in lookup_one_len() that eventfs does is > redundant and just wrong. I don't think so. > > Luckily, I don't think we need to even change anything because all > directories that eventfs creates always grant exec permissions to the > other group so lookup_one_len() will trivially succeed. IIUC. > > Drafted-by-with-no-guarantees-whatsoever-that-this-wont-burn-the-house-down: Christian Brauner <brauner@kernel.org> Checkout this branch: git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git eventfs-show-files It creates a file "/sys/kernel/tracing/show_events_dentries" That shows when inodes and dentries are created and freed. I use this to make sure that reclaim actually does reclaim them. Linus did not like this file, but it has become very useful to make sure things are working properly as there is no other way to know if the inodes and dentries are reclaimed or not. I may see if he'll let me add it with a debug config option. -- Steve
On Sun, 7 Jan 2024 13:29:12 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is > > redundant and just wrong. > > I don't think so. Just to make it clear. eventfs has nothing to do with mkdir instance/foo. It exists without that. Although one rationale to do eventfs was so that the instance directories wouldn't recreate the same 10thousands event inodes and dentries for every mkdir done. -- Steve
> > * Tracefs supports the creation of instances from userspace via mkdir. > > For example, > > > > mkdir /sys/kernel/tracing/instances/foo > > > > And here the idmapping is relevant so we need to make the helpers > > aware of the idmapping. > > > > I just went and plumbed this through to most helpers. > > > > There's some subtlety in eventfs. Afaict, the directories and files for > > the individual events are created on-demand during lookup or readdir. > > > > The ownership of these events is again inherited from the parent inode > > or recovered from stored state. In both cases the actual idmapping is > > irrelevant. > > > > The callchain here is: > > > > eventfs_root_lookup("xfs", "events") > > -> create_{dir,file}_dentry("xfs", "events") > > -> create_{dir,file}("xfs", "events") > > -> eventfs_start_creating("xfs", "events") > > -> lookup_one_len("xfs", "events") > > > > And the subtlety is that lookup_one_len() does permission checking on > > the parent inode (IOW, if you want a dentry for "blech" under "events" > > it'll do a permission check on events->d_inode) for exec permissions > > and then goes on to give you a new dentry. > > > > Usually this call would have to be changed to lookup_one() and the > > idmapping be handed down to it. But I think that's irrelevant here. > > > > Lookup generally doesn't need to be aware of idmappings at all. The > > permission checking is done purely in the vfs via may_lookup() and the > > idmapping is irrelevant because we always initialize inodes with the > > filesystem level ownership (see the idmappings.rst) documentation if > > you're interested in excessive details (otherwise you get inode aliases > > which you really don't want). > > > > For tracefs it would not matter for lookup per se but only because > > tracefs seemingly creates inodes/dentries during lookup (and readdir()). > > tracefs creates the inodes/dentries at boot up, it's eventfs that does > it on demand during lookup. > > For inodes/dentries: > > /sys/kernel/tracing/* is all created at boot up, except for "events". Yes. > /sys/kernel/tracing/events/* is created on demand. Yes. > > > > > But imho the permission checking done in current eventfs_root_lookup() > > via lookup_one_len() is meaningless in any way; possibly even > > (conceptually) wrong. > > > > Because, the actual permission checking for the creation of the eventfs > > entries isn't really done during lookup or readdir, it's done when mkdir > > is called: > > > > mkdir /sys/kernel/tracing/instances/foo > > No. that creates a entire new tracefs instance, which happens to > include another eventfs directory. Yes, I'm aware of all that. > No. Only the meta data is created for the eventfs directory with a > mkdir instances/foo. The inodes and dentries are not there. I know, that is what I'm saying. > > > > > When one goes and looksup stuff under foo/events/ or readdir the entries > > in that directory: > > > > fd = open("foo/events") > > readdir(fd, ...) > > > > then they are licensed to list an entry in that directory. So all that > > needs to be done is to actually list those files in that directory. And > > since they already exist (they were created during mkdir) we just need > > to splice in inodes and dentries for them. But for that we shouldn't > > check permissions on the directory again. Because we've done that > > already correctly when the VFS called may_lookup(). > > No they do not exist. I am aware. > > > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is > > redundant and just wrong. > > I don't think so. I'm very well aware that the dentries and inode aren't created during mkdir but the completely directory layout is determined. You're just splicing in dentries and inodes during lookup and readdir. If mkdir /sys/kernel/tracing/instances/foo has succeeded and you later do a lookup/readdir on ls -al /sys/kernel/tracing/instances/foo/events Why should the creation of the dentries and inodes ever fail due to a permission failure? The vfs did already verify that you had the required permissions to list entries in that directory. Why should filling up /sys/kernel/tracing/instances/foo/events ever fail then? It shouldn't That tracefs instance would be half-functional. And again, right now that inode_permission() check cannot even fail.
On Sun, Jan 07, 2024 at 01:32:28PM -0500, Steven Rostedt wrote: > On Sun, 7 Jan 2024 13:29:12 -0500 > Steven Rostedt <rostedt@goodmis.org> wrote: > > > > > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is > > > redundant and just wrong. > > > > I don't think so. > > Just to make it clear. eventfs has nothing to do with mkdir instance/foo. > It exists without that. Although one rationale to do eventfs was so Every instance/foo/ tracefs instances also contains an events directory and thus a eventfs portion. Eventfs is just a subtree of tracefs. It's not a separate filesystem. Both eventfs and tracefs are on the same single, system wide superblock. > that the instance directories wouldn't recreate the same 10thousands > event inodes and dentries for every mkdir done. I know but that's irrelevant to what I'm trying to tell you. A mkdir /sys/kernel/tracing/instances/foo creates a new tracefs instance. With or without the on-demand dentry and inode creation for the eventfs portion that tracefs "instance" has now been created in its entirety including all the required information for someone to later come along and perform a lookup on /sys/kernel/tracing/instances/foo/events. All you've done is to defer the addition of the dentries and inodes when someone does actually look at the events directory of the tracefs instance. Whether you choose to splice in the dentries and inodes for the eventfs portion during lookup and readdir or if you had chosen to not do the on-demand thing at all and the entries were created at the same time as the mkdir call are equivalent from the perspective of permission checking. If you have the required permissions to look at the events directory then there's no reason why listing the directory entries in there should fail. This can't even happen right now.
On Mon, 8 Jan 2024 12:04:54 +0100 Christian Brauner <brauner@kernel.org> wrote: > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is > > > redundant and just wrong. > > > > I don't think so. > > I'm very well aware that the dentries and inode aren't created during > mkdir but the completely directory layout is determined. You're just > splicing in dentries and inodes during lookup and readdir. > > If mkdir /sys/kernel/tracing/instances/foo has succeeded and you later > do a lookup/readdir on > > ls -al /sys/kernel/tracing/instances/foo/events > > Why should the creation of the dentries and inodes ever fail due to a > permission failure? They shouldn't. > The vfs did already verify that you had the required > permissions to list entries in that directory. Why should filling up > /sys/kernel/tracing/instances/foo/events ever fail then? It shouldn't > That tracefs instance would be half-functional. And again, right now > that inode_permission() check cannot even fail. And it shouldn't. But without dentries and inodes, how does VFS know what is allowed to open the files? -- Steve
On Mon, 8 Jan 2024 12:32:46 +0100 Christian Brauner <brauner@kernel.org> wrote: > On Sun, Jan 07, 2024 at 01:32:28PM -0500, Steven Rostedt wrote: > > On Sun, 7 Jan 2024 13:29:12 -0500 > > Steven Rostedt <rostedt@goodmis.org> wrote: > > > > > > > > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is > > > > redundant and just wrong. > > > > > > I don't think so. > > > > Just to make it clear. eventfs has nothing to do with mkdir instance/foo. > > It exists without that. Although one rationale to do eventfs was so > > Every instance/foo/ tracefs instances also contains an events directory > and thus a eventfs portion. Eventfs is just a subtree of tracefs. It's > not a separate filesystem. Both eventfs and tracefs are on the same > single, system wide superblock. > > > that the instance directories wouldn't recreate the same 10thousands > > event inodes and dentries for every mkdir done. > > I know but that's irrelevant to what I'm trying to tell you. > > A mkdir /sys/kernel/tracing/instances/foo creates a new tracefs > instance. With or without the on-demand dentry and inode creation for > the eventfs portion that tracefs "instance" has now been created in its > entirety including all the required information for someone to later > come along and perform a lookup on /sys/kernel/tracing/instances/foo/events. > > All you've done is to defer the addition of the dentries and inodes when > someone does actually look at the events directory of the tracefs > instance. > > Whether you choose to splice in the dentries and inodes for the eventfs > portion during lookup and readdir or if you had chosen to not do the > on-demand thing at all and the entries were created at the same time as > the mkdir call are equivalent from the perspective of permission > checking. > > If you have the required permissions to look at the events directory > then there's no reason why listing the directory entries in there should > fail. This can't even happen right now. Ah, I think I know where the confusion lies. The tracing information in kernel/trace/*.c doesn't keep track of permission. It relies totally on fs/tracefs/* to do so. If someone does 'chmod' or 'chown' or mounts with 'gid=xxx' then it's up to tracefs to maintain that information and not the tracing subsystem. The tracing subsystem only gives the "default" permissions (before boot finishes). The difference between normal file systems and pseudo file systems like debugfs and tracefs, is that normal file systems keep the permission information stored on the external device. That is, when the inodes and dentries are created, the information is retrieved from the stored file system. I think this may actually be a failure of debugfs (and tracefs as it was based on debugfs), in that the inodes and dentries are created at the same time the "files" backing them are. Which is normally at boot up and before the file system is mounted. That is, inodes and dentries are actually coupled with the data they represent. It's not a cache for a back store like a hard drive partition. To create a file in debugfs you do: struct dentry *debugfs_create_file(const char *name, umode_t mode, struct dentry *parent, void *data, const struct file_operations *fops) That is, you pass a the name, the mode, the parent dentry, data, and the fops and that will create an inode and dentry (which is returned). This happens at boot up before user space is running and before debugfs is even mounted. Because debugfs is mostly for debugging, people don't care about how it's mounted. It is usually restricted to root only access. Especially since there's a lot of sensitive information that shouldn't be exposed to non-privileged users. The reason tracefs came about is that people asked me to be able to have access to tracing without needing to even enable debugfs. They also want to easily make it accessible to non root users and talking with Kees Cook, he recommended using ACL. But because it inherited a lot from debugfs, I started doing these tricks like walking the dentry tree to make it work a bit better. Because the dentries and inodes were created before mount, I had to play these tricks. But as Linus pointed out, that was the wrong way to do that. The right way was to use .getattr and .permission callbacks to figure out what the permissions to the files are. This has nothing to do with the creation of the files, it's about who has access to the files that the inodes point to. This sounds like another topic to bring up at LSFMM ;-) "Can we standardize pseudo file systems like debugfs and tracefs to act more like real file systems, and have inodes and dentries act as cache and not be so coupled to the data?" -- Steve
On Mon, Jan 08, 2024 at 10:23:31AM -0500, Steven Rostedt wrote: > On Mon, 8 Jan 2024 12:04:54 +0100 > Christian Brauner <brauner@kernel.org> wrote: > > > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is > > > > redundant and just wrong. > > > > > > I don't think so. > > > > I'm very well aware that the dentries and inode aren't created during > > mkdir but the completely directory layout is determined. You're just > > splicing in dentries and inodes during lookup and readdir. > > > > If mkdir /sys/kernel/tracing/instances/foo has succeeded and you later > > do a lookup/readdir on > > > > ls -al /sys/kernel/tracing/instances/foo/events > > > > Why should the creation of the dentries and inodes ever fail due to a > > permission failure? > > They shouldn't. > > > The vfs did already verify that you had the required > > permissions to list entries in that directory. Why should filling up > > /sys/kernel/tracing/instances/foo/events ever fail then? It shouldn't > > That tracefs instance would be half-functional. And again, right now > > that inode_permission() check cannot even fail. > > And it shouldn't. But without dentries and inodes, how does VFS know what > is allowed to open the files? So say you do: mkdir /sys/kernel/tracing/instances/foo After this has returned we know everything we need to know about the new tracefs instance including the ownership and the mode of all inodes in /sys/kernel/tracing/instances/foo/events/* and below precisely because ownership is always inherited from the parent dentry and recorded in the metadata struct eventfs_inode. So say someone does: open("/sys/kernel/tracing/instances/foo/events/xfs"); and say this is the first time that someone accesses that events/ directory. When the open pathwalk is done, the vfs will determine via [1] may_lookup(inode_of(events)) whether you are able to list entries such as "xfs" in that directory. The vfs checks inode_permission(MAY_EXEC) on "events" and if that holds it ends up calling i_op->eventfs_root_lookup(events). At this point tracefs/eventfs adds the inodes for all entries in that "events" directory including "xfs" based on the metadata it recorded during the mkdir. Since now someone is actually interested in them. And it initializes the inodes with ownership and everything and adds the dentries that belong into that directory. Nothing here depends on the permissions of the caller. The only permission that mattered was done in the VFS in [1]. If the caller has permissions to enter a directory they can lookup and list its contents. And its contents where determined/fixed etc when mkdir was called. So we just need to add the required objects into the caches (inode, dentry) whose addition we intentionally defered until someone actually needed them. So, eventfs_root_lookup() now initializes the inodes with the ownership from the stored metadata or from the parent dentry and splices in inodes and dentries. No permission checking is needed for this because it is always a recheck of what the vfs did in [1]. We now return to the vfs and path walk continues to the final component that you actually want to open which is that "xfs" directory in this example. We check the permissions on that inode via may_open("xfs") and we open that directory returning an fd to userspace ultimately. (I'm going by memory since I need to step out the door.)
On Wed, 10 Jan 2024 12:45:36 +0100 Christian Brauner <brauner@kernel.org> wrote: > So say you do: > > mkdir /sys/kernel/tracing/instances/foo > > After this has returned we know everything we need to know about the new > tracefs instance including the ownership and the mode of all inodes in > /sys/kernel/tracing/instances/foo/events/* and below precisely because > ownership is always inherited from the parent dentry and recorded in the > metadata struct eventfs_inode. > > So say someone does: > > open("/sys/kernel/tracing/instances/foo/events/xfs"); > > and say this is the first time that someone accesses that events/ > directory. > > When the open pathwalk is done, the vfs will determine via > > [1] may_lookup(inode_of(events)) > > whether you are able to list entries such as "xfs" in that directory. > The vfs checks inode_permission(MAY_EXEC) on "events" and if that holds > it ends up calling i_op->eventfs_root_lookup(events). > > At this point tracefs/eventfs adds the inodes for all entries in that > "events" directory including "xfs" based on the metadata it recorded > during the mkdir. Since now someone is actually interested in them. And > it initializes the inodes with ownership and everything and adds the > dentries that belong into that directory. > > Nothing here depends on the permissions of the caller. The only > permission that mattered was done in the VFS in [1]. If the caller has > permissions to enter a directory they can lookup and list its contents. > And its contents where determined/fixed etc when mkdir was called. > > So we just need to add the required objects into the caches (inode, > dentry) whose addition we intentionally defered until someone actually > needed them. > > So, eventfs_root_lookup() now initializes the inodes with the ownership > from the stored metadata or from the parent dentry and splices in inodes > and dentries. No permission checking is needed for this because it is > always a recheck of what the vfs did in [1]. > > We now return to the vfs and path walk continues to the final component > that you actually want to open which is that "xfs" directory in this > example. We check the permissions on that inode via may_open("xfs") and > we open that directory returning an fd to userspace ultimately. > > (I'm going by memory since I need to step out the door.) So, let's say we do: chgrp -R rostedt /sys/kernel/tracing/ But I don't want rostedt to have access to xfs chgrp -R root /sys/kernel/tracing/events/xfs Both actions will create the inodes and dentries of all files and directories (because of "-R"). But once that is done, the ref counts go to zero. They stay around until reclaim. But then I open Chrome ;-) and it reclaims all the dentries and inodes, so we are back to here we were on boot. Now as rostedt I do: ls /sys/kernel/tracing/events/xfs The VFS layer doesn't know if I have permission to that or not, because all the inodes and dentries have been freed. It has to call back to eventfs to find out. Which the eventfs_root_lookup() and eventfs_iterate_shared() will recreated the inodes with the proper permission. Or are you saying that I don't need the ".permission" callback, because eventfs does it when it creates the inodes? But for eventfs to know what the permissions changes are, it uses .getattr and .setattr. -- Steve
On Wed, 10 Jan 2024 08:07:46 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > Or are you saying that I don't need the ".permission" callback, because > eventfs does it when it creates the inodes? But for eventfs to know what > the permissions changes are, it uses .getattr and .setattr. OK, if your main argument is that we do not need .permission, I agree with you. But that's a trivial change and doesn't affect the complexity that eventfs is doing. In fact, removing the "permission" check is simply this patch: -- diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index fdff53d5a1f8..f2af07a857e2 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -192,18 +192,10 @@ static int eventfs_get_attr(struct mnt_idmap *idmap, return 0; } -static int eventfs_permission(struct mnt_idmap *idmap, - struct inode *inode, int mask) -{ - set_top_events_ownership(inode); - return generic_permission(idmap, inode, mask); -} - static const struct inode_operations eventfs_root_dir_inode_operations = { .lookup = eventfs_root_lookup, .setattr = eventfs_set_attr, .getattr = eventfs_get_attr, - .permission = eventfs_permission, }; static const struct inode_operations eventfs_file_inode_operations = { -- I only did that because Linus mentioned it, and I thought it was needed. I'll apply this patch too, as it appears to work with this code. Thanks! -- Steve
On Wed, 10 Jan 2024 10:52:51 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> I'll apply this patch too, as it appears to work with this code.
I meant "appears to work without this code".
-- Steve
On Wed, 10 Jan 2024 10:52:51 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > On Wed, 10 Jan 2024 08:07:46 -0500 > Steven Rostedt <rostedt@goodmis.org> wrote: > > > Or are you saying that I don't need the ".permission" callback, because > > eventfs does it when it creates the inodes? But for eventfs to know what > > the permissions changes are, it uses .getattr and .setattr. > > OK, if your main argument is that we do not need .permission, I agree with > you. But that's a trivial change and doesn't affect the complexity that > eventfs is doing. In fact, removing the "permission" check is simply this > patch: > > -- > diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c > index fdff53d5a1f8..f2af07a857e2 100644 > --- a/fs/tracefs/event_inode.c > +++ b/fs/tracefs/event_inode.c > @@ -192,18 +192,10 @@ static int eventfs_get_attr(struct mnt_idmap *idmap, > return 0; > } > > -static int eventfs_permission(struct mnt_idmap *idmap, > - struct inode *inode, int mask) > -{ > - set_top_events_ownership(inode); > - return generic_permission(idmap, inode, mask); > -} > - > static const struct inode_operations eventfs_root_dir_inode_operations = { > .lookup = eventfs_root_lookup, > .setattr = eventfs_set_attr, > .getattr = eventfs_get_attr, > - .permission = eventfs_permission, > }; > > static const struct inode_operations eventfs_file_inode_operations = { > -- > > I only did that because Linus mentioned it, and I thought it was needed. > I'll apply this patch too, as it appears to work with this code. Oh, eventfs files and directories don't need the .permissions because its inodes and dentries are not created until accessed. But the "events" directory itself has its dentry and inode created at boot up, but still uses the eventfs_root_dir_inode_operations. So the .permissions is still needed! If you look at the "set_top_events_ownership()" function, it has: /* The top events directory doesn't get automatically updated */ if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL)) return; That is, it does nothing if the entry is not the "events" directory. It falls back to he default "->permissions()" function for everything but the top level "events" directory. But this and .getattr are still needed for the events directory, because it suffers the same issue as the other tracefs entries. That is, it's inodes and dentries are created at boot up before it is mounted. So if the mount has gid=1000, it will be ignored. The .getattr is called by "stat" which ls does. So after boot up if you just do: # chmod 0750 /sys/kernel/events # chmod 0770 /sys/kernel/tracing # mount -o remount,gid=1000 /sys/kernel/tracing # su - rostedt $ id uid=1000(rostedt) gid=1000(rostedt) groups=1000(rostedt) $ ls /sys/kernel/tracing/events/ 9p ext4 iomap module raw_syscalls thermal alarmtimer fib iommu msr rcu thp avc fib6 io_uring napi regmap timer block filelock ipi neigh regulator tlb bpf_test_run filemap irq net resctrl udp bpf_trace ftrace irq_matrix netfs rpm virtio_gpu[ ...] The above works because "ls" does a stat() on the directory first, which does a .getattr() call that updates the permissions of the existing "events" directory inode. BUT! If I had used my own getents() program that has: fd = openat(AT_FDCWD, argv[1], O_RDONLY); if (fd < 0) perror("openat"); n = getdents64(fd, buf, BUF_SIZE); if (n < 0) perror("getdents64"); Where it calls the openat() without doing a stat fist, and after boot, had done: # chmod 0750 /sys/kernel/events # chmod 0770 /sys/kernel/tracing # mount -o remount,gid=1000 /sys/kernel/tracing # su - rostedt $ id uid=1000(rostedt) gid=1000(rostedt) groups=1000(rostedt) $ ./getdents /sys/kernel/tracing/events openat: Permission denied getdents64: Bad file descriptor It errors because he "events" inode permission hasn't been updated yet. Now after getting the above error, if I do the "ls" and then run it again: $ ls /sys/kernel/tracing/events > /dev/null $ ./getdents /sys/kernel/tracing/events enable header_page header_event initcall vsyscall syscalls it works! so no, I can't remove that .permissions callback from eventfs. -- Steve
On Wed, Jan 10, 2024 at 08:07:46AM -0500, Steven Rostedt wrote: > On Wed, 10 Jan 2024 12:45:36 +0100 > Christian Brauner <brauner@kernel.org> wrote: > > > So say you do: > > > > mkdir /sys/kernel/tracing/instances/foo > > > > After this has returned we know everything we need to know about the new > > tracefs instance including the ownership and the mode of all inodes in > > /sys/kernel/tracing/instances/foo/events/* and below precisely because > > ownership is always inherited from the parent dentry and recorded in the > > metadata struct eventfs_inode. > > > > So say someone does: > > > > open("/sys/kernel/tracing/instances/foo/events/xfs"); > > > > and say this is the first time that someone accesses that events/ > > directory. > > > > When the open pathwalk is done, the vfs will determine via > > > > [1] may_lookup(inode_of(events)) > > > > whether you are able to list entries such as "xfs" in that directory. > > The vfs checks inode_permission(MAY_EXEC) on "events" and if that holds > > it ends up calling i_op->eventfs_root_lookup(events). > > > > At this point tracefs/eventfs adds the inodes for all entries in that > > "events" directory including "xfs" based on the metadata it recorded > > during the mkdir. Since now someone is actually interested in them. And > > it initializes the inodes with ownership and everything and adds the > > dentries that belong into that directory. > > > > Nothing here depends on the permissions of the caller. The only > > permission that mattered was done in the VFS in [1]. If the caller has > > permissions to enter a directory they can lookup and list its contents. > > And its contents where determined/fixed etc when mkdir was called. > > > > So we just need to add the required objects into the caches (inode, > > dentry) whose addition we intentionally defered until someone actually > > needed them. > > > > So, eventfs_root_lookup() now initializes the inodes with the ownership > > from the stored metadata or from the parent dentry and splices in inodes > > and dentries. No permission checking is needed for this because it is > > always a recheck of what the vfs did in [1]. > > > > We now return to the vfs and path walk continues to the final component > > that you actually want to open which is that "xfs" directory in this > > example. We check the permissions on that inode via may_open("xfs") and > > we open that directory returning an fd to userspace ultimately. > > > > (I'm going by memory since I need to step out the door.) > > So, let's say we do: > > chgrp -R rostedt /sys/kernel/tracing/ The rostedt group needs exec permissions and "other" cannot have exec permissions otherwise you can trivially list the entries even if it's owned by root: chmod 750 /sys/kernel/tracing user1@localhost:~$ ls -aln /sys/kernel/ | grep tracing drwxr-x--- 6 0 1000 0 Jan 11 18:23 tracing > > But I don't want rostedt to have access to xfs > > chgrp -R root /sys/kernel/tracing/events/xfs chmod 750 /sys/kernel/tracing/events/xfs user1@localhost:~$ ls -aln /sys/kernel/tracing/events/ | grep xfs drwxr-x--- 601 0 0 0 Jan 11 18:24 xfs This ensure that if a user is in the group and the group has exec perms lookup is possible (For root this will usually work because CAP_DAC_READ_SEARCH overrides the exec requirement.). > > Both actions will create the inodes and dentries of all files and > directories (because of "-R"). But once that is done, the ref counts go to > zero. They stay around until reclaim. But then I open Chrome ;-) and it > reclaims all the dentries and inodes, so we are back to here we were on > boot. > > Now as rostedt I do: > > ls /sys/kernel/tracing/events/xfs > > The VFS layer doesn't know if I have permission to that or not, because all > the inodes and dentries have been freed. It has to call back to eventfs to > find out. Which the eventfs_root_lookup() and eventfs_iterate_shared() will > recreated the inodes with the proper permission. Very roughly, ignoring most of the complexity of lookup and focussing on the permission checking: When a caller looks up an entry in a directory then the VFS will call inode_permission(MAY_EXEC) on the directory the caller is trying to perform that lookup in. If the caller wants to lookup the "events" entry in the "tracing" directory then the VFS will call inode_permission(MAY_EXEC, "tracing") and then - assuming it's not in the cache - call into the lookup method of the filesystem. After the VFS has determined that the caller has the permissions to lookup the "events" entry in the "tracing" directory no further permission checks are needed. Path lookup now moves on to the next part which is looking up the "xfs" entry in the "events" directory. So again, VFS calls inode_permission(MAY_EXEC, "events") finds that caller has the permissions and then goes on to call into eventfs_root_lookup(). The VFS has already given the caller a dentry for "xfs" and is passing that down to tracefs/eventfs. So eventfs gets eventfs_rootfs_lookup(@dir[events], @dentry[xfs], ...) eventfs now wades through the recorded metadata entries for the @dir[events] directory and finds a matching metadata entry for "xfs". It allocates an inode for it and initializes the owner, mode etc based on the recorded information. The lookup_one_len() call finds the dentry that was just added by the VFS. The "xfs" dentry/inode is now in the cache. The VFS moves on. It calls inode_permission(MAY_EXEC, "xfs"). It sees that based on the mode you don't have access to that directory. Lookup fails. What I'm pointing out in the current logic is that the caller is taxed twice: (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs") (2) And again when you call lookup_one_len() in eventfs_start_creating() _because_ the permission check in lookup_one_len() is the exact same permission check again that the vfs has done inode_permission(MAY_EXEC, "xfs"). The readdir case is similar only that instead of generating an inode for a single entry in the directory like in lookup, you're generating inodes and dentries for multiple entries. And you might end up calling eventfs_start_creating() like 600 times for xfs/ for example. So you end up calling lookup_one_len() and thus inode_permission() 600 times. And this inode_permission() check is the same check that the VFS already performed: inode_permission(MAY_EXEC, "xfs"). You can even break readdir semantics via tracefs right now: // We managed to open the directory so we have permission to list // directory entries in "xfs". fd = open("/sys/kernel/tracing/events/xfs"); // Remove ownership so we can't open the directory anymore chown("/sys/kernel/tracing/events/xfs", 0, 0); // Or just remove exec bit for the group and restrict to owner chmod("/sys/kernel/tracing/events/xfs", 700); // Drop caches to force an eventfs_root_lookup() on everything write("/proc/sys/vm/drop_caches", "3", 1); // Returns 0 even though directory has a lot of entries and we should be // able to list them getdents64(fd, ...); And the failure is in the inode_permission(MAY_EXEC, "xfs") call in lookup_one_len() in eventfs_start_creating() which now fails. But that's not how this is supposed to work. Once you hold an fd to a directory you can read it's entries even if the permissions change. There's no DAC read-time checks on the directory after that. But this is exactly what happens on eventfs. So it's even broken right now imho. I've tested this: user1@localhost:~/data/scripts$ ./readdir_fail /sys/kernel/tracing/events/xfs/ inode# file type d_reclen d_off d_name 4861 directory 24 1 . 1027 directory 24 2 .. So what I meant is something completely untested, possibly ugly, drafted like the below. But tbh, I suspect that eventfs_root_lookup() could even simply be rewritten to reuse the dentry it was provided from the VFS directly instead of even having to go through lookup_one_len() via eventfs_start_creating() at all. Then only the ->iterate_shared implementation needs to call the noperm lookup variant I hastily drafted below. But idk, I haven't given that part much thought. fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++--- fs/tracefs/inode.c | 2 +- include/linux/namei.h | 1 + 3 files changed, 45 insertions(+), 4 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 963576e67f62..0bfd87e79c5d 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2656,9 +2656,8 @@ int vfs_path_lookup(struct dentry *dentry, struct vfsmount *mnt, } EXPORT_SYMBOL(vfs_path_lookup); -static int lookup_one_common(struct mnt_idmap *idmap, - const char *name, struct dentry *base, int len, - struct qstr *this) +static int lookup_one_common_noperm(const char *name, struct dentry *base, + int len, struct qstr *this) { this->name = name; this->len = len; @@ -2686,6 +2685,19 @@ static int lookup_one_common(struct mnt_idmap *idmap, return err; } + return 0; +} + +static inline int lookup_one_common(struct mnt_idmap *idmap, + const char *name, struct dentry *base, int len, + struct qstr *this) +{ + int ret; + + ret = lookup_one_common_noperm(name, base, len, this); + if (ret) + return ret; + return inode_permission(idmap, base->d_inode, MAY_EXEC); } @@ -2776,6 +2788,34 @@ struct dentry *lookup_one(struct mnt_idmap *idmap, const char *name, } EXPORT_SYMBOL(lookup_one); +/** + * lookup_one_noperm - filesystem helper to lookup single pathname component + * @name: pathname component to lookup + * @base: base directory to lookup from + * @len: maximum length @len should be interpreted to + * + * Note that this routine is purely a helper for filesystem usage and should + * not be called by generic code. This performs no permission checking on @base. + * + * The caller must hold base->i_mutex. + */ +struct dentry *lookup_one_noperm(const char *name, struct dentry *base, int len) +{ + struct dentry *dentry; + struct qstr this; + int err; + + WARN_ON_ONCE(!inode_is_locked(base->d_inode)); + + err = lookup_one_common_noperm(name, base, len, &this); + if (err) + return ERR_PTR(err); + + dentry = lookup_dcache(&this, base, 0); + return dentry ? dentry : __lookup_slow(&this, base, 0); +} +EXPORT_SYMBOL(lookup_one_noperm); + /** * lookup_one_unlocked - filesystem helper to lookup single pathname component * @idmap: idmap of the mount the lookup is performed from diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index e1b172c0e091..fa04aa6a5ce7 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -524,7 +524,7 @@ struct dentry *eventfs_start_creating(const char *name, struct dentry *parent) if (unlikely(IS_DEADDIR(parent->d_inode))) dentry = ERR_PTR(-ENOENT); else - dentry = lookup_one_len(name, parent, strlen(name)); + dentry = lookup_one_noperm(name, parent, strlen(name)); if (!IS_ERR(dentry) && dentry->d_inode) { dput(dentry); diff --git a/include/linux/namei.h b/include/linux/namei.h index 3100371b5e32..a1491da37cbe 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -77,6 +77,7 @@ extern struct dentry *lookup_one_len(const char *, struct dentry *, int); extern struct dentry *lookup_one_len_unlocked(const char *, struct dentry *, int); extern struct dentry *lookup_positive_unlocked(const char *, struct dentry *, int); struct dentry *lookup_one(struct mnt_idmap *, const char *, struct dentry *, int); +struct dentry *lookup_one_noperm(const char *, struct dentry *, int); struct dentry *lookup_one_unlocked(struct mnt_idmap *idmap, const char *name, struct dentry *base, int len);
On Thu, 11 Jan 2024 22:01:32 +0100 Christian Brauner <brauner@kernel.org> wrote: > What I'm pointing out in the current logic is that the caller is > taxed twice: > > (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs") > (2) And again when you call lookup_one_len() in eventfs_start_creating() > _because_ the permission check in lookup_one_len() is the exact > same permission check again that the vfs has done > inode_permission(MAY_EXEC, "xfs"). As I described in: https://lore.kernel.org/all/20240110133154.6e18feb9@gandalf.local.home/ The eventfs files below "events" doesn't need the .permissions callback at all. It's only there because the "events" inode uses it. The .permissions call for eventfs has: static int eventfs_permission(struct mnt_idmap *idmap, struct inode *inode, int mask) { set_top_events_ownership(inode); return generic_permission(idmap, inode, mask); } Where the "set_top_events_ownership() is a nop for everything but the "events" directory. I guess I could have two ops: static const struct inode_operations eventfs_root_dir_inode_operations = { .lookup = eventfs_root_lookup, .setattr = eventfs_set_attr, .getattr = eventfs_get_attr, .permission = eventfs_permission, }; static const struct inode_operations eventfs_dir_inode_operations = { .lookup = eventfs_root_lookup, .setattr = eventfs_set_attr, .getattr = eventfs_get_attr, }; And use the second one for all dentries below the root, but I figured it's not that big of a deal if I called the permissions on all. Perhaps I should do it with two? Anyway, the issue is with "events" directory and remounting, because like the tracefs system, the inode and dentry for "evnets" is created at boot up, before the mount happens. The VFS layer is going to check the permissions of its inode and dentry, which will be incorrect if the mount was mounted with a "gid" option. -- Steve
On Thu, Jan 11, 2024 at 04:53:19PM -0500, Steven Rostedt wrote: > On Thu, 11 Jan 2024 22:01:32 +0100 > Christian Brauner <brauner@kernel.org> wrote: > > > What I'm pointing out in the current logic is that the caller is > > taxed twice: > > > > (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs") > > (2) And again when you call lookup_one_len() in eventfs_start_creating() > > _because_ the permission check in lookup_one_len() is the exact > > same permission check again that the vfs has done > > inode_permission(MAY_EXEC, "xfs"). > > As I described in: https://lore.kernel.org/all/20240110133154.6e18feb9@gandalf.local.home/ > > The eventfs files below "events" doesn't need the .permissions callback at > all. It's only there because the "events" inode uses it. > > The .permissions call for eventfs has: It doesn't matter whether there's a ->permission handler. If you don't add one explicitly the VFS will simply call generic_permission(): inode_permission() -> do_inode_permission() { if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) { if (likely(inode->i_op->permission)) return inode->i_op->permission(idmap, inode, mask); /* This gets set once for the inode lifetime */ spin_lock(&inode->i_lock); inode->i_opflags |= IOP_FASTPERM; spin_unlock(&inode->i_lock); } return generic_permission(idmap, inode, mask); } > Anyway, the issue is with "events" directory and remounting, because like > the tracefs system, the inode and dentry for "evnets" is created at boot > up, before the mount happens. The VFS layer is going to check the > permissions of its inode and dentry, which will be incorrect if the mount > was mounted with a "gid" option. The gid option has nothing to do with this and it is just handled fine if you remove the second permission checking in (2). You need to remove the inode_permission() code from eventfs_start_creating(). It is just an internal lookup and the fact that you have it in there allows userspace to break readdir on the eventfs portions of tracefs as I've shown in the parts of the mail that you cut off.
On Fri, 12 Jan 2024 09:27:24 +0100 Christian Brauner <brauner@kernel.org> wrote: > On Thu, Jan 11, 2024 at 04:53:19PM -0500, Steven Rostedt wrote: > > On Thu, 11 Jan 2024 22:01:32 +0100 > > Christian Brauner <brauner@kernel.org> wrote: > > > > > What I'm pointing out in the current logic is that the caller is > > > taxed twice: > > > > > > (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs") > > > (2) And again when you call lookup_one_len() in eventfs_start_creating() > > > _because_ the permission check in lookup_one_len() is the exact > > > same permission check again that the vfs has done > > > inode_permission(MAY_EXEC, "xfs"). > > > > As I described in: https://lore.kernel.org/all/20240110133154.6e18feb9@gandalf.local.home/ > > > > The eventfs files below "events" doesn't need the .permissions callback at > > all. It's only there because the "events" inode uses it. > > > > The .permissions call for eventfs has: > > It doesn't matter whether there's a ->permission handler. If you don't > add one explicitly the VFS will simply call generic_permission(): > > inode_permission() > -> do_inode_permission() > { > if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) { > if (likely(inode->i_op->permission)) > return inode->i_op->permission(idmap, inode, mask); > > /* This gets set once for the inode lifetime */ > spin_lock(&inode->i_lock); > inode->i_opflags |= IOP_FASTPERM; > spin_unlock(&inode->i_lock); > } > return generic_permission(idmap, inode, mask); > } Yes I know that, because that's where I knew what to call in the non "events" dir case. > > > Anyway, the issue is with "events" directory and remounting, because like > > the tracefs system, the inode and dentry for "evnets" is created at boot > > up, before the mount happens. The VFS layer is going to check the > > permissions of its inode and dentry, which will be incorrect if the mount > > was mounted with a "gid" option. > > The gid option has nothing to do with this and it is just handled fine > if you remove the second permission checking in (2). I guess I'm confused to what you are having an issue with. Is it just that the permission check gets called twice? > > You need to remove the inode_permission() code from > eventfs_start_creating(). It is just an internal lookup and the fact > that you have it in there allows userspace to break readdir on the > eventfs portions of tracefs as I've shown in the parts of the mail that > you cut off. That's because I didn't see how it was related to the way I fixed the mount=gid issue. Are you only concerned because of the check in eventfs_start_creating()? Yes, you posted code that would make things act funny for some code that I see no real world use case for. Yeah, it may not act "properly" but I'm not sure that's bad. Here, I'll paste it back: > // We managed to open the directory so we have permission to list > // directory entries in "xfs". > fd = open("/sys/kernel/tracing/events/xfs"); > > // Remove ownership so we can't open the directory anymore > chown("/sys/kernel/tracing/events/xfs", 0, 0); > > // Or just remove exec bit for the group and restrict to owner > chmod("/sys/kernel/tracing/events/xfs", 700); > > // Drop caches to force an eventfs_root_lookup() on everything > write("/proc/sys/vm/drop_caches", "3", 1); This requires opening the directory, then having it's permissions change, and then immediately dropping the caches. > > // Returns 0 even though directory has a lot of entries and we should be > // able to list them > getdents64(fd, ...); And do we care? Since tracing exposes internal kernel information, perhaps this is a feature and not a bug. If someone who had access to the tracing system and you wanted to stop them, if they had a directory open that they no longer have access to, you don't want them to see what's left in the directory. In other words, I like the idea that the getends64(fd, ...) will fail! If there's a file underneath that wasn't change, and the admin thought that just keeping the top directory permissions off is good enough, then that attacker having that directory open before the directory had it's file permissions change is a way to still have access to the files below it. > > And the failure is in the inode_permission(MAY_EXEC, "xfs") call in > lookup_one_len() in eventfs_start_creating() which now fails. And I think is a good thing! Again, tracefs is special. It gives you access and possibly control to the kernel behavior. I like the fact that as soon as someone loses permission to a directory, they immediately lose it. -- Steve
On Fri, 12 Jan 2024 08:53:44 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > > // We managed to open the directory so we have permission to list > > // directory entries in "xfs". > > fd = open("/sys/kernel/tracing/events/xfs"); > > > > // Remove ownership so we can't open the directory anymore > > chown("/sys/kernel/tracing/events/xfs", 0, 0); > > > > // Or just remove exec bit for the group and restrict to owner > > chmod("/sys/kernel/tracing/events/xfs", 700); > > > > // Drop caches to force an eventfs_root_lookup() on everything > > write("/proc/sys/vm/drop_caches", "3", 1); > > This requires opening the directory, then having it's permissions > change, and then immediately dropping the caches. > > > > > // Returns 0 even though directory has a lot of entries and we should be > > // able to list them > > getdents64(fd, ...); > > And do we care? Hmm, maybe the issue you have is with the inconsistency of the action? For this to fail, it would require the admin to do both change the permission and to flush caches. If you don't flush the caches then the task with the dir open can still read it regardless. If the dentries were already created. In that case I'm fine if we change the creation of the dentries to not check the permission. But for now, it's just a weird side effect that I don't really see how it would affect any user's workflow. -- Steve
diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 53d34a4b5a2b..641bffa0f139 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -45,6 +45,7 @@ enum { EVENTFS_SAVE_MODE = BIT(16), EVENTFS_SAVE_UID = BIT(17), EVENTFS_SAVE_GID = BIT(18), + EVENTFS_TOPLEVEL = BIT(19), }; #define EVENTFS_MODE_MASK (EVENTFS_SAVE_MODE - 1) @@ -115,10 +116,17 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, * The events directory dentry is never freed, unless its * part of an instance that is deleted. It's attr is the * default for its child files and directories. - * Do not update it. It's not used for its own mode or ownership + * Do not update it. It's not used for its own mode or ownership. */ - if (!ei->is_events) + if (ei->is_events) { + /* But it still needs to know if it was modified */ + if (iattr->ia_valid & ATTR_UID) + ei->attr.mode |= EVENTFS_SAVE_UID; + if (iattr->ia_valid & ATTR_GID) + ei->attr.mode |= EVENTFS_SAVE_GID; + } else { update_attr(&ei->attr, iattr); + } } else { name = dentry->d_name.name; @@ -136,9 +144,67 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, return ret; } +static void update_top_events_attr(struct eventfs_inode *ei, struct dentry *dentry) +{ + struct inode *inode; + + /* Only update if the "events" was on the top level */ + if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL)) + return; + + /* Get the tracefs root from the parent */ + inode = d_inode(dentry->d_parent); + inode = d_inode(inode->i_sb->s_root); + ei->attr.uid = inode->i_uid; + ei->attr.gid = inode->i_gid; +} + +static void set_top_events_ownership(struct inode *inode) +{ + struct tracefs_inode *ti = get_tracefs(inode); + struct eventfs_inode *ei = ti->private; + struct dentry *dentry; + + /* The top events directory doesn't get automatically updated */ + if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL)) + return; + + dentry = ei->dentry; + + update_top_events_attr(ei, dentry); + + if (!(ei->attr.mode & EVENTFS_SAVE_UID)) + inode->i_uid = ei->attr.uid; + + if (!(ei->attr.mode & EVENTFS_SAVE_GID)) + inode->i_gid = ei->attr.gid; +} + +static int eventfs_get_attr(struct mnt_idmap *idmap, + const struct path *path, struct kstat *stat, + u32 request_mask, unsigned int flags) +{ + struct dentry *dentry = path->dentry; + struct inode *inode = d_backing_inode(dentry); + + set_top_events_ownership(inode); + + generic_fillattr(idmap, request_mask, inode, stat); + return 0; +} + +static int eventfs_permission(struct mnt_idmap *idmap, + struct inode *inode, int mask) +{ + set_top_events_ownership(inode); + return generic_permission(idmap, inode, mask); +} + static const struct inode_operations eventfs_root_dir_inode_operations = { .lookup = eventfs_root_lookup, .setattr = eventfs_set_attr, + .getattr = eventfs_get_attr, + .permission = eventfs_permission, }; static const struct inode_operations eventfs_file_inode_operations = { @@ -174,6 +240,8 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry) } while (!ei->is_events); mutex_unlock(&eventfs_mutex); + update_top_events_attr(ei, dentry); + return ei; } @@ -887,6 +955,14 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry uid = d_inode(dentry->d_parent)->i_uid; gid = d_inode(dentry->d_parent)->i_gid; + /* + * If the events directory is of the top instance, then parent + * is NULL. Set the attr.mode to reflect this and its permissions will + * default to the tracefs root dentry. + */ + if (!parent) + ei->attr.mode = EVENTFS_TOPLEVEL; + /* This is used as the default ownership of the files and directories */ ei->attr.uid = uid; ei->attr.gid = gid; diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index bc86ffdb103b..63284f18741f 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -91,6 +91,7 @@ static int tracefs_syscall_mkdir(struct mnt_idmap *idmap, struct inode *inode, struct dentry *dentry, umode_t mode) { + struct tracefs_inode *ti; char *name; int ret; @@ -98,6 +99,15 @@ static int tracefs_syscall_mkdir(struct mnt_idmap *idmap, if (!name) return -ENOMEM; + /* + * This is a new directory that does not take the default of + * the rootfs. It becomes the default permissions for all the + * files and directories underneath it. + */ + ti = get_tracefs(inode); + ti->flags |= TRACEFS_INSTANCE_INODE; + ti->private = inode; + /* * The mkdir call can call the generic functions that create * the files within the tracefs system. It is up to the individual @@ -141,10 +151,76 @@ static int tracefs_syscall_rmdir(struct inode *inode, struct dentry *dentry) return ret; } -static const struct inode_operations tracefs_dir_inode_operations = { +static void set_tracefs_inode_owner(struct inode *inode) +{ + struct tracefs_inode *ti = get_tracefs(inode); + struct inode *root_inode = ti->private; + + /* + * If this inode has never been referenced, then update + * the permissions to the superblock. + */ + if (!(ti->flags & TRACEFS_UID_PERM_SET)) + inode->i_uid = root_inode->i_uid; + + if (!(ti->flags & TRACEFS_GID_PERM_SET)) + inode->i_gid = root_inode->i_gid; +} + +static int tracefs_permission(struct mnt_idmap *idmap, + struct inode *inode, int mask) +{ + set_tracefs_inode_owner(inode); + return generic_permission(idmap, inode, mask); +} + +static int tracefs_getattr(struct mnt_idmap *idmap, + const struct path *path, struct kstat *stat, + u32 request_mask, unsigned int flags) +{ + struct inode *inode = d_backing_inode(path->dentry); + + set_tracefs_inode_owner(inode); + generic_fillattr(idmap, request_mask, inode, stat); + return 0; +} + +static int tracefs_setattr(struct mnt_idmap *idmap, struct dentry *dentry, + struct iattr *attr) +{ + unsigned int ia_valid = attr->ia_valid; + struct inode *inode = d_inode(dentry); + struct tracefs_inode *ti = get_tracefs(inode); + + if (ia_valid & ATTR_UID) + ti->flags |= TRACEFS_UID_PERM_SET; + + if (ia_valid & ATTR_GID) + ti->flags |= TRACEFS_GID_PERM_SET; + + return simple_setattr(idmap, dentry, attr); +} + +static const struct inode_operations tracefs_instance_dir_inode_operations = { .lookup = simple_lookup, .mkdir = tracefs_syscall_mkdir, .rmdir = tracefs_syscall_rmdir, + .permission = tracefs_permission, + .getattr = tracefs_getattr, + .setattr = tracefs_setattr, +}; + +static const struct inode_operations tracefs_dir_inode_operations = { + .lookup = simple_lookup, + .permission = tracefs_permission, + .getattr = tracefs_getattr, + .setattr = tracefs_setattr, +}; + +static const struct inode_operations tracefs_file_inode_operations = { + .permission = tracefs_permission, + .getattr = tracefs_getattr, + .setattr = tracefs_setattr, }; struct inode *tracefs_get_inode(struct super_block *sb) @@ -183,87 +259,6 @@ struct tracefs_fs_info { struct tracefs_mount_opts mount_opts; }; -static void change_gid(struct dentry *dentry, kgid_t gid) -{ - if (!dentry->d_inode) - return; - dentry->d_inode->i_gid = gid; -} - -/* - * Taken from d_walk, but without he need for handling renames. - * Nothing can be renamed while walking the list, as tracefs - * does not support renames. This is only called when mounting - * or remounting the file system, to set all the files to - * the given gid. - */ -static void set_gid(struct dentry *parent, kgid_t gid) -{ - struct dentry *this_parent; - struct list_head *next; - - this_parent = parent; - spin_lock(&this_parent->d_lock); - - change_gid(this_parent, gid); -repeat: - next = this_parent->d_subdirs.next; -resume: - while (next != &this_parent->d_subdirs) { - struct tracefs_inode *ti; - struct list_head *tmp = next; - struct dentry *dentry = list_entry(tmp, struct dentry, d_child); - next = tmp->next; - - /* Note, getdents() can add a cursor dentry with no inode */ - if (!dentry->d_inode) - continue; - - spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED); - - change_gid(dentry, gid); - - /* If this is the events directory, update that too */ - ti = get_tracefs(dentry->d_inode); - if (ti && (ti->flags & TRACEFS_EVENT_INODE)) - eventfs_update_gid(dentry, gid); - - if (!list_empty(&dentry->d_subdirs)) { - spin_unlock(&this_parent->d_lock); - spin_release(&dentry->d_lock.dep_map, _RET_IP_); - this_parent = dentry; - spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_); - goto repeat; - } - spin_unlock(&dentry->d_lock); - } - /* - * All done at this level ... ascend and resume the search. - */ - rcu_read_lock(); -ascend: - if (this_parent != parent) { - struct dentry *child = this_parent; - this_parent = child->d_parent; - - spin_unlock(&child->d_lock); - spin_lock(&this_parent->d_lock); - - /* go into the first sibling still alive */ - do { - next = child->d_child.next; - if (next == &this_parent->d_subdirs) - goto ascend; - child = list_entry(next, struct dentry, d_child); - } while (unlikely(child->d_flags & DCACHE_DENTRY_KILLED)); - rcu_read_unlock(); - goto resume; - } - rcu_read_unlock(); - spin_unlock(&this_parent->d_lock); - return; -} - static int tracefs_parse_options(char *data, struct tracefs_mount_opts *opts) { substring_t args[MAX_OPT_ARGS]; @@ -336,10 +331,8 @@ static int tracefs_apply_options(struct super_block *sb, bool remount) if (!remount || opts->opts & BIT(Opt_uid)) inode->i_uid = opts->uid; - if (!remount || opts->opts & BIT(Opt_gid)) { - /* Set all the group ids to the mount option */ - set_gid(sb->s_root, opts->gid); - } + if (!remount || opts->opts & BIT(Opt_gid)) + inode->i_gid = opts->gid; return 0; } @@ -573,6 +566,33 @@ struct dentry *eventfs_end_creating(struct dentry *dentry) return dentry; } +/* Find the inode that this will use for default */ +static struct inode *instance_inode(struct dentry *parent, struct inode *inode) +{ + struct tracefs_inode *ti; + struct inode *root_inode; + + root_inode = d_inode(inode->i_sb->s_root); + + /* If parent is NULL then use root inode */ + if (!parent) + return root_inode; + + /* Find the inode that is flagged as an instance or the root inode */ + do { + inode = d_inode(parent); + if (inode == root_inode) + return root_inode; + + ti = get_tracefs(inode); + + if (ti->flags & TRACEFS_INSTANCE_INODE) + return inode; + } while ((parent = parent->d_parent)); + + return NULL; +} + /** * tracefs_create_file - create a file in the tracefs filesystem * @name: a pointer to a string containing the name of the file to create. @@ -603,6 +623,7 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode, struct dentry *parent, void *data, const struct file_operations *fops) { + struct tracefs_inode *ti; struct dentry *dentry; struct inode *inode; @@ -621,7 +642,11 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode, if (unlikely(!inode)) return tracefs_failed_creating(dentry); + ti = get_tracefs(inode); + ti->private = instance_inode(parent, inode); + inode->i_mode = mode; + inode->i_op = &tracefs_file_inode_operations; inode->i_fop = fops ? fops : &tracefs_file_operations; inode->i_private = data; inode->i_uid = d_inode(dentry->d_parent)->i_uid; @@ -634,6 +659,7 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode, static struct dentry *__create_dir(const char *name, struct dentry *parent, const struct inode_operations *ops) { + struct tracefs_inode *ti; struct dentry *dentry = tracefs_start_creating(name, parent); struct inode *inode; @@ -651,6 +677,9 @@ static struct dentry *__create_dir(const char *name, struct dentry *parent, inode->i_uid = d_inode(dentry->d_parent)->i_uid; inode->i_gid = d_inode(dentry->d_parent)->i_gid; + ti = get_tracefs(inode); + ti->private = instance_inode(parent, inode); + /* directory inodes start off with i_nlink == 2 (for "." entry) */ inc_nlink(inode); d_instantiate(dentry, inode); @@ -681,7 +710,7 @@ struct dentry *tracefs_create_dir(const char *name, struct dentry *parent) if (security_locked_down(LOCKDOWN_TRACEFS)) return NULL; - return __create_dir(name, parent, &simple_dir_inode_operations); + return __create_dir(name, parent, &tracefs_dir_inode_operations); } /** @@ -712,7 +741,7 @@ __init struct dentry *tracefs_create_instance_dir(const char *name, if (WARN_ON(tracefs_ops.mkdir || tracefs_ops.rmdir)) return NULL; - dentry = __create_dir(name, parent, &tracefs_dir_inode_operations); + dentry = __create_dir(name, parent, &tracefs_instance_dir_inode_operations); if (!dentry) return NULL; diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h index 42bdeb471a07..12b7d0150ae9 100644 --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -5,6 +5,9 @@ enum { TRACEFS_EVENT_INODE = BIT(1), TRACEFS_EVENT_TOP_INODE = BIT(2), + TRACEFS_GID_PERM_SET = BIT(3), + TRACEFS_UID_PERM_SET = BIT(4), + TRACEFS_INSTANCE_INODE = BIT(5), }; struct tracefs_inode {