Message ID | 20240126150209.367ff402@gandalf.local.home (mailing list archive) |
---|---|
State | Handled Elsewhere |
Headers | show |
Series | eventfs: Have inodes have unique inode numbers | expand |
Steven, stop making things more complicated than they need to be. And dammit, STOP COPYING VFS LAYER FUNCTIONS. It was a bad idea last time, it's a horribly bad idea this time too. I'm not taking this kind of crap. The whole "get_next_ino()" should be "atomic64_add_return()". End of story. You arent' special. If the VFS functions don't work for you, you don't use them, but dammit, you also don't then steal them without understanding what they do, and why they were necessary. The reason get_next_ino() is critical is because it's used by things like pipes and sockets etc that get created at high rates, the the inode numbers most definitely do not get cached. You copied that function without understanding why it does what it does, and as a result your code IS GARBAGE. AGAIN. Honestly, kill this thing with fire. It was a bad idea. I'm putting my foot down, and you are *NOT* doing unique regular file inode numbers uintil somebody points to a real problem. Because this whole "I make up problems, and then I write overly complicated crap code to solve them" has to stop,. No more. This stops here. I don't want to see a single eventfs patch that doesn't have a real bug report associated with it. And the next time I see you copying VFS functions (or any other core functions) without udnerstanding what the f*ck they do, and why they do it, I'm going to put you in my spam-filter for a week. I'm done. I'm really *really* tired of having to look at eventfs garbage. Linus
On Fri, 26 Jan 2024 12:25:05 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > Steven, > stop making things more complicated than they need to be. > > And dammit, STOP COPYING VFS LAYER FUNCTIONS. > > It was a bad idea last time, it's a horribly bad idea this time too. > > I'm not taking this kind of crap. > > The whole "get_next_ino()" should be "atomic64_add_return()". End of story. I originally wrote it that way, and thought to myself that the VFS version is "faster" and switched to that. My fault for being too much into micro-optimizations. > > You arent' special. If the VFS functions don't work for you, you don't > use them, but dammit, you also don't then steal them without > understanding what they do, and why they were necessary. > > The reason get_next_ino() is critical is because it's used by things > like pipes and sockets etc that get created at high rates, the the > inode numbers most definitely do not get cached. Yes, I understood why it's optimized, and took it because it's been there since 2010 and figured it's pretty solid. > > You copied that function without understanding why it does what it > does, and as a result your code IS GARBAGE. > > AGAIN. > > Honestly, kill this thing with fire. It was a bad idea. I'm putting my > foot down, and you are *NOT* doing unique regular file inode numbers > uintil somebody points to a real problem. > > Because this whole "I make up problems, and then I write overly > complicated crap code to solve them" has to stop,. If I had just used the atomic_add_return() is it really that overly complicated? Yes, I copied from VFS because I figured if they put in the effort to make it faster then why not use that, even though it was overkill. > > No more. This stops here. > > I don't want to see a single eventfs patch that doesn't have a real > bug report associated with it. And the next time I see you copying VFS > functions (or any other core functions) without udnerstanding what the > f*ck they do, and why they do it, I'm going to put you in my > spam-filter for a week. > > I'm done. I'm really *really* tired of having to look at eventfs garbage. So we keep the same inode number until something breaks with it, even though, using unique ones is not that complicated? I'd be happy to change that patch to what I originally did before deciding to copy get_next_ino(): unsigned int tracefs_get_next_ino(int files) { static atomic_t next_inode; unsigned int res; do { res = atomic_add_return(files + 1, &next_inode); /* Check for overflow */ } while (unlikely(res < files + 1)); return res - files; } If I knew going back and copying over get_next_ino() was going to piss you off so much, I wouldn't have done that. Not to mention that the current code calls into get_next_ino() and then throws it away. That is, eventfs gets its inode structure from tracefs that adds the inode number to it using the VFS get_next_ino(). That gets thrown away by the single inode assigned. This just makes it more likely that the global get_inode_ino() is going to overflow due to eventfs, even though eventfs isn't even using them. I only did the one inode number because that's what you wanted. Is it that you want to move away from having inode numbers completely? At least for pseudo file systems? If that's the case, then we can look to get people to start doing that. First it would be fixing tools like 'tar' to ignore the inode numbers. Otherwise, I really rather keep it the way it has always been. That is, each file has its own unique inode number, and not have to deal with some strange bug report because it's not. Is there another file system that has just one inode number? -- Steve
On Fri, 26 Jan 2024 at 13:26, Steven Rostedt <rostedt@goodmis.org> wrote: > > So we keep the same inode number until something breaks with it, even > though, using unique ones is not that complicated? Using unique ones for directories was a trivial cleanup. The file case is clearly different. I thought it would be the same trivial one-liner, but nope. When you have to add 30 lines of code just to get unique inode numbers that nobody has shown any interest in, it's 30 lines too much. And when it happens in a filesystem that has a history of copying code from the VFS layer and having nasty bugs, it's *definitely* too much. Simplify. If you can clean things up and we have a few release of not-horrendous-bugs every other day, I may change my mind. As it is, I feel like I have to waste my time checking all your patches, and I'm saying "it's not worth it". Linus
On Fri, 26 Jan 2024 at 13:26, Steven Rostedt <rostedt@goodmis.org> wrote: > > I'd be happy to change that patch to what I originally did before deciding > to copy get_next_ino(): > > unsigned int tracefs_get_next_ino(int files) > { > static atomic_t next_inode; > unsigned int res; > > do { > res = atomic_add_return(files + 1, &next_inode); > > /* Check for overflow */ > } while (unlikely(res < files + 1)); > > return res - files; Still entirely pointless. If you have more than 4 billion inodes, something is really really wrong. So why is it ten lines instead of one? Dammit, this is a number that NOBODY HAS SHOWN IS EVEN WORTH EXISTING IN THE FIRST PLACE. So no. I'm not taking this. End of discussion. My point stands: I want this filesystem *stabilized*, and in s sane format. Look to *simplify* things. Send me patches that *remove* complexity, not add new complexity that you have zero evidence is worth it. Face it, eventfs isn't some kind of "real filesystem". It shouldn't even attempt to look like one. If somebody goes "I want to tar this thiing up", you should laugh in their face and call them names, not say "sure, let me whip up a 50-line patch to make this fragile thing even more complex". Linus
On Fri, 26 Jan 2024 13:36:20 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, 26 Jan 2024 at 13:26, Steven Rostedt <rostedt@goodmis.org> wrote: > > > > I'd be happy to change that patch to what I originally did before deciding > > to copy get_next_ino(): > > > > unsigned int tracefs_get_next_ino(int files) > > { > > static atomic_t next_inode; > > unsigned int res; > > > > do { > > res = atomic_add_return(files + 1, &next_inode); > > > > /* Check for overflow */ > > } while (unlikely(res < files + 1)); > > > > return res - files; > > Still entirely pointless. > > If you have more than 4 billion inodes, something is really really wrong. No, but you can trivially make a loop that creates and destroys directories that will eventually overflow the count. -- Steve
On Fri, 26 Jan 2024 13:31:01 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > As it is, I feel like I have to waste my time checking all your > patches, and I'm saying "it's not worth it". I'm basically done with this. I never said I was a VFS guy and I learned a lot doing this. I had really nobody to look at my code even though most of it went to the fsdevel list. Nobody said I was doing it wrong. Sorry to have wasted your time -- Steve
On Fri, 26 Jan 2024 at 13:36, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > If you have more than 4 billion inodes, something is really really wrong. Btw, once again, the vfs layer function you took this from *does* have some reason to worry. Somebody might be doing 'pipe()' in a loop. Also, if your worry is "what if somebody mounts that thing a million times", the solution to *that* would have been to make it a per-sb counter, which I think would be cleaner anyway. But my real issue is that I think you would be *much* better off just deleting code, instead of adding new code. For example, what purpose does 'e->dentry' and 'ei->d_childen[]' have? Isn't that entirely a left-over from the bad old days? So please try to look at things to *fix* and simplify, not at things to mess around with and make more complicated. Linus
On January 26, 2024 4:49:13 PM EST, Linus Torvalds <torvalds@linux-foundation.org> wrote: >On Fri, 26 Jan 2024 at 13:36, Linus Torvalds ><torvalds@linux-foundation.org> wrote: >> >> If you have more than 4 billion inodes, something is really really wrong. > >Btw, once again, the vfs layer function you took this from *does* have >some reason to worry. Somebody might be doing 'pipe()' in a loop. > >Also, if your worry is "what if somebody mounts that thing a million >times", the solution to *that* would have been to make it a per-sb >counter, which I think would be cleaner anyway. > I'm more worried about a loop of: cd /sys/kernel/tracing/instances while:; do mkdir foo ; rmdir foo: done Which is what my tests do. And I have run that for over a weekend. >But my real issue is that I think you would be *much* better off just >deleting code, instead of adding new code. > >For example, what purpose does 'e->dentry' and 'ei->d_childen[]' have? >Isn't that entirely a left-over from the bad old days? > I'm not at my computer, but when I tried deleting that, it caused issues with the lookup code. -- Steve >So please try to look at things to *fix* and simplify, not at things >to mess around with and make more complicated. > > Linus
On 2024-01-26 16:49, Linus Torvalds wrote: > On Fri, 26 Jan 2024 at 13:36, Linus Torvalds > <torvalds@linux-foundation.org> wrote: [...] > So please try to look at things to *fix* and simplify, not at things > to mess around with and make more complicated. Hi Linus, I'm all aboard with making things as simple as possible and making sure no complexity is added for the sake of micro-optimization of slow-paths. I do however have a concern with the approach of using the same inode number for various files on the same filesystem: AFAIU it breaks userspace ABI expectations. See inode(7) for instance: Inode number stat.st_ino; statx.stx_ino Each file in a filesystem has a unique inode number. Inode numbers are guaranteed to be unique only within a filesystem (i.e., the same inode numbers may be used by different filesystems, which is the reason that hard links may not cross filesystem boundaries). This field contains the file's inode number. So user-space expecting inode numbers to be unique within a filesystem is not "legacy" in any way. Userspace is allowed to expect this from the ABI. I think that a safe approach to prevent ABI regressions, and just to prevent adding more ABI-corner cases that userspace will have to work-around, would be to issue unique numbers to files within eventfs, but in the simplest/obviously correct implementation possible. It is, after all, a slow path. The issue with the atomic_add_return without any kinds of checks is the scenarios of a userspace loop that would create/delete directories endlessly, thus causing inode re-use. This approach is simple, but it's unfortunately not obviously correct. Because eventfs allows userspace to do mkdir/rmdir, this is unfortunately possible. It would be OK if only the kernel had control over directory creation/removal, but it's not the case here. I would suggest this straightforward solution to this: a) define a EVENTFS_MAX_INODES (e.g. 4096 * 8), b) keep track of inode allocation in a bitmap (within a single page), c) disallow allocating more than "EVENTFS_MAX_INODES" in eventfs. This way even the mkdir/rmdir loop will work fine, but it will prevent keeping too many inodes alive at any given time. The cost is a single page (4K) per eventfs instance. Thanks, Mathieu
On Fri, 26 Jan 2024 at 14:09, Steven Rostedt <rostedt@goodmis.org> wrote: > > I'm not at my computer, but when I tried deleting that, it caused issues with the lookup code. The VSF layer should be serializing all lookups of the same name. If it didn't, we'd have serious issues on other filesystems. So you should never get more than one concurrent lookup of one particular entry, and as long as the dentry exists, you should then not get a new one. It's one of the things that the VFS layer does to make things simple for the filesystem. But it's worth noting that that is about *one* entry. You can get concurrent lookups in the same directory for different names. Another thing that worries me is that odd locking that releases the lock in the middle. I don't understand why you release the tracefs_mutex() over create_file(), for example. There's a lot of "take, drop, re-take, re-drop" of that mutex that seems strange. Linus
On Fri, 26 Jan 2024 at 14:14, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > I do however have a concern with the approach of using the same > inode number for various files on the same filesystem: AFAIU it > breaks userspace ABI expectations. Virtual filesystems have always done that in various ways. Look at the whole discussion about the size of the file. Then look at /proc. And honestly, eventfs needs to be simplified. It's a mess. It's less of a mess than it used to be, but people should *NOT* think that it's a real filesystem. Don't use some POSIX standard as an expectation for things like /proc, /sys or tracefs. Linus
On Fri, Jan 26, 2024 at 05:14:12PM -0500, Mathieu Desnoyers wrote: > I would suggest this straightforward solution to this: > > a) define a EVENTFS_MAX_INODES (e.g. 4096 * 8), > > b) keep track of inode allocation in a bitmap (within a single page), > > c) disallow allocating more than "EVENTFS_MAX_INODES" in eventfs. ... reinventing the IDA?
On 2024-01-26 17:34, Matthew Wilcox wrote: > On Fri, Jan 26, 2024 at 05:14:12PM -0500, Mathieu Desnoyers wrote: >> I would suggest this straightforward solution to this: >> >> a) define a EVENTFS_MAX_INODES (e.g. 4096 * 8), >> >> b) keep track of inode allocation in a bitmap (within a single page), >> >> c) disallow allocating more than "EVENTFS_MAX_INODES" in eventfs. > > ... reinventing the IDA? Looking at include/linux/idr.h now that you mention it, yes, you are absolutely right! Thanks, Mathieu
On 2024-01-26 17:29, Linus Torvalds wrote: > On Fri, 26 Jan 2024 at 14:14, Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: >> >> I do however have a concern with the approach of using the same >> inode number for various files on the same filesystem: AFAIU it >> breaks userspace ABI expectations. > > Virtual filesystems have always done that in various ways. > > Look at the whole discussion about the size of the file. Then look at /proc. Yes, there is even a note about stat.st_size in inode(7) explaining this: NOTES For pseudofiles that are autogenerated by the kernel, the file size (stat.st_size; statx.stx_size) reported by the kernel is not accurate. For example, the value 0 is returned for many files under the /proc di‐ rectory, while various files under /sys report a size of 4096 bytes, even though the file content is smaller. For such files, one should simply try to read as many bytes as possible (and append '\0' to the returned buffer if it is to be interpreted as a string). But having a pseudo-filesystem entirely consisting of duplicated inodes which are not hard links to the same file is something new/unexpected. > And honestly, eventfs needs to be simplified. It's a mess. It's less > of a mess than it used to be, but people should *NOT* think that it's > a real filesystem. I agree with simplifying it, but would rather not introduce userspace ABI regressions in the process, which will cause yet another kind of mess. > Don't use some POSIX standard as an expectation for things like /proc, > /sys or tracefs. If those filesystems choose to do things differently from POSIX, then it should be documented with the relevant ABIs, because userspace should be able to know (rather than guess) what it can expect. Thanks, Mathieu
On Fri, 26 Jan 2024 at 14:34, Matthew Wilcox <willy@infradead.org> wrote: > > On Fri, Jan 26, 2024 at 05:14:12PM -0500, Mathieu Desnoyers wrote: > > I would suggest this straightforward solution to this: > > > > a) define a EVENTFS_MAX_INODES (e.g. 4096 * 8), > > > > b) keep track of inode allocation in a bitmap (within a single page), > > > > c) disallow allocating more than "EVENTFS_MAX_INODES" in eventfs. > > ... reinventing the IDA? Guysm, this is a random number that is *so* interesting that I seriously think we shouldn't have it at all. End result: nobody should care. Even the general VFS layer doesn't care. It literally avoids inode number zero, not because it would be a bad inode number, but simply because of some random historical oddity. In fact, I don't think we even have a reason for it. We have a commit 2adc376c5519 ("vfs: avoid creation of inode number 0 in get_next_ino") and that one calls out glibc for not deleting them. That makes no sense to me, but whatever. But note how the generic function does *not* try to make them unique, for example. They are just "unique enough". The generic function *does* care about being scalable in an SMP environment. To a disturbing degree. Oh well. Linus
On Fri, 26 Jan 2024 at 14:41, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > Yes, there is even a note about stat.st_size in inode(7) explaining > this: Good. Send a patch to do the same for st_ino. Linus
On Fri, Jan 26, 2024 at 02:48:45PM -0800, Linus Torvalds wrote: > On Fri, 26 Jan 2024 at 14:34, Matthew Wilcox <willy@infradead.org> wrote: > > > > On Fri, Jan 26, 2024 at 05:14:12PM -0500, Mathieu Desnoyers wrote: > > > I would suggest this straightforward solution to this: > > > > > > a) define a EVENTFS_MAX_INODES (e.g. 4096 * 8), > > > > > > b) keep track of inode allocation in a bitmap (within a single page), > > > > > > c) disallow allocating more than "EVENTFS_MAX_INODES" in eventfs. > > > > ... reinventing the IDA? > > Guysm, this is a random number that is *so* interesting that I > seriously think we shouldn't have it at all. > > End result: nobody should care. Even the general VFS layer doesn't care. > > It literally avoids inode number zero, not because it would be a bad > inode number, but simply because of some random historical oddity. > > In fact, I don't think we even have a reason for it. We have a commit > 2adc376c5519 ("vfs: avoid creation of inode number 0 in get_next_ino") > and that one calls out glibc for not deleting them. That makes no > sense to me, but whatever. Maybe we should take advantage of that historical oddity. All files in eventfs have inode number 0, problem solved.
On Fri, 26 Jan 2024 at 15:04, Matthew Wilcox <willy@infradead.org> wrote: > > Maybe we should take advantage of that historical oddity. All files > in eventfs have inode number 0, problem solved. That might not be a horrible idea. The same way "a directory with st_nlink 1 clearly cannot have a subdirectory count" (which tools like 'find' know about), maybe it would be a good idea to use a zero inode number for 'this file doesn't have an inode number". Now, presumably no tool knows that, but we could try to aim for that being some future standard thing. (And by "standard", I mean a practical one, not some POSIX thing. I think POSIX just mentions "numberr of hardlinks", and then the whole "a value of one means that we can't count subdirectories" is just a practical reality for things like FAT). Linus
On Fri, 26 Jan 2024 at 15:11, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Fri, 26 Jan 2024 at 15:04, Matthew Wilcox <willy@infradead.org> wrote: > > > > Maybe we should take advantage of that historical oddity. All files > > in eventfs have inode number 0, problem solved. > > That might not be a horrible idea. Note the "might". I don't know why glibc would have special-cased st_ino of 0, but I suspect it's some internal oddity in the readdir() implementation. So considering that we do have that commit 2adc376c5519, I suspect it would just be more pain than its worth to try to teach user space about the whole "no inode number" thing. It might be safer to pick something like -1 instead. Linus
On Jan 26 2024, Linus Torvalds wrote: > Note the "might". I don't know why glibc would have special-cased > st_ino of 0, but I suspect it's some internal oddity in the readdir() > implementation. It was traditionally used for deleted entries in a directory.
On Fri, 26 Jan 2024 14:26:08 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > nother thing that worries me is that odd locking that releases the > lock in the middle. I don't understand why you release the > tracefs_mutex() over create_file(), for example. There's a lot of > "take, drop, re-take, re-drop" of that mutex that seems strange. This was because the create_file/dir() would call into the VFS which would grab locks, and on a final dput() on a ei dentry that is to be freed, calls back into eventfs_set_ei_status_free() which also grabs the eventfs_mutex. But it gets called with the same VFS locks that are taken by create_file/dir() VFS calls. This was caught by lockdep. Hence the dropping of those locks. The eventfs_mutex is just protecting the ei list and also assigning and clearing the ei->dentry. Now that dentry is used to synchronize the last close, and also to know if the ei was ever referenced. If ei->dentry is NULL it can be freed immediately (after SRCU) when the directory is deleted. But if ei->dentry is set, it means that something may still have a reference to it and must be freed after the last dput() and SRCU. Now some of this was needed due to the way the dir wrapper worked so I may be able to revisit this and possibly just use an ei->ref counter. But I wasted enough time on this and I'm way behind in my other responsibilities, so this is not something I can work on now. -- Steve
From: Linus Torvalds > Sent: 26 January 2024 21:36 > > On Fri, 26 Jan 2024 at 13:26, Steven Rostedt <rostedt@goodmis.org> wrote: > > > > I'd be happy to change that patch to what I originally did before deciding > > to copy get_next_ino(): > > > > unsigned int tracefs_get_next_ino(int files) > > { > > static atomic_t next_inode; > > unsigned int res; > > > > do { > > res = atomic_add_return(files + 1, &next_inode); > > > > /* Check for overflow */ > > } while (unlikely(res < files + 1)); > > > > return res - files; > > Still entirely pointless. > > If you have more than 4 billion inodes, something is really really wrong. > > So why is it ten lines instead of one? Doesn't Linux support 64bit inode numbers? They solve the wrap problem... I also don't know what filesystems like NTFS use - they don't have the concept of inode. IIRC NFS used to use the inode number for its 'file handle'. Rather a pain when trying to write code to export a layered FS (especially if a layer might be an NFS mount!). David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Sat, 27 Jan 2024 at 07:27, David Laight <David.Laight@aculab.com> wrote: > > Doesn't Linux support 64bit inode numbers? > They solve the wrap problem... Yes, but we've historically had issues with actually exposing them. The legacy stat() code has things like this: tmp.st_ino = stat->ino; if (sizeof(tmp.st_ino) < sizeof(stat->ino) && tmp.st_ino != stat->ino) return -EOVERFLOW; so if you have really old 32-bit user space, you generally do not actually want to have 64-bit inode numbers. This is why "get_next_ino()" returns a strictly 32-bit only inode number. You don't want to error out on a 'fstat()' just because you're on a big system that has been running for a long time. Now, 'stat64' was introduced for this reason back in 2.3.34, so back in 1999. So any half-way modern 32-bit environment doesn't have that issue, and maybe it's time to just sunset all the old stat() calls. Of course, the *really* old stat has a 16-bit inode number. Search for __old_kernel_stat to still see that. That's more of a curiosity than anything else. Linus
On Fri, 26 Jan 2024 at 13:49, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > For example, what purpose does 'e->dentry' and 'ei->d_childen[]' have? > Isn't that entirely a left-over from the bad old days? > > So please try to look at things to *fix* and simplify, not at things > to mess around with and make more complicated. So here's an attempt at some fairly trivial but entirely untested cleanup. I have *not* tested this at all, and I assume you have some extensive test-suite that you run. So these are "signed-off' in the sense that the patch looks fine, it builds in one configuration for me, but maybe there's something really odd going on. The first patch is trivial dead code removal. The second patch is because I really do not understand the reason for the 'ei->dentry' pointer, and it just looks messy. It looks _particularly_ messy when it is mixed up in operations that really do not need it and really shouldn't use it. The eventfs_find_events() code was using the dentry parent pointer to find the parent (fine, and simple), then looking up the tracefs inode using that (fine so far), but then it looked up the dentry using *that*. But it already *had* the dentry - it's that parent dentry it just used to find the tracefs inode. The code looked nonsensical. Similarly, it then (in the set_top_events_ownership() helper) used 'ei->dentry' to update the events attr, but all that really wants is the superblock root. So instead of passing a dentry, just pass the superblock pointer, which you can find in either the dentry or in the VFS inode, depending on which level you're working at. There are tons of other 'ei->dentry' uses, and I didn't look at those. Baby steps. But this *seems* like an obvious cleanup, and many small obvious cleanups later and perhaps the 'ei->dentry' pointer (and the '->d_children[]' array) can eventually go away. They should all be entirely useless - there's really no reason for a filesystem to hold on to back-pointers of dentries. Anybody willing to run the test-suite on this? Linus
On Sat, 27 Jan 2024 09:47:17 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > Now some of this was needed due to the way the dir wrapper worked so I > may be able to revisit this and possibly just use an ei->ref counter. > But I wasted enough time on this and I'm way behind in my other > responsibilities, so this is not something I can work on now. Ironically, one of the responsibilities that I've been putting off to fix up eventfs was writing that document on a support group for maintainer burnout. :-p -- Steve
On Sat, 27 Jan 2024 13:47:32 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > > So here's an attempt at some fairly trivial but entirely untested cleanup. > > I have *not* tested this at all, and I assume you have some extensive > test-suite that you run. So these are "signed-off' in the sense that > the patch looks fine, it builds in one configuration for me, but maybe > there's something really odd going on. > > The first patch is trivial dead code removal. Ah yes. That was leftover code from the mount dentry walk that I removed. I missed that code removal. > > The second patch is because I really do not understand the reason for > the 'ei->dentry' pointer, and it just looks messy. I have to understand how the dentry lookup works. Basically, when the ei gets deleted, it can't be freed until all dentries it references (including its children) are no longer being accessed. Does that lookup get called only when a dentry with the name doesn't already exist? That is, can I assume that eventfs_root_lookup() is only called when the VFS file system could not find an existing dentry and it has to create one? If that's the case, then I can remove the ei->dentry and just add a ref counter that it was accessed. Then the final dput() should call eventfs_set_ei_status_free() (I hate that name and need to change it), and if the ei->is_freed is set, it can free the ei. The eventfs_remove_dir() can free the ei (after SRCU) if it has no references, otherwise it needs to wait for the final dput() to do the free. Again, if the ->lookup() only gets called if no dentry exists (where ei->dentry would already be NULL), then I can do this. I think the ei->dentry was required for the dir wrapper logic that we removed. > > It looks _particularly_ messy when it is mixed up in operations that > really do not need it and really shouldn't use it. > > The eventfs_find_events() code was using the dentry parent pointer to > find the parent (fine, and simple), then looking up the tracefs inode > using that (fine so far), but then it looked up the dentry using > *that*. But it already *had* the dentry - it's that parent dentry it > just used to find the tracefs inode. The code looked nonsensical. That was probably from rewriting that function a few different ways. > > Similarly, it then (in the set_top_events_ownership() helper) used > 'ei->dentry' to update the events attr, but all that really wants is > the superblock root. So instead of passing a dentry, just pass the > superblock pointer, which you can find in either the dentry or in the > VFS inode, depending on which level you're working at. > > There are tons of other 'ei->dentry' uses, and I didn't look at those. > Baby steps. But this *seems* like an obvious cleanup, and many small > obvious cleanups later and perhaps the 'ei->dentry' pointer (and the > '->d_children[]' array) can eventually go away. They should all be > entirely useless - there's really no reason for a filesystem to hold > on to back-pointers of dentries. > > Anybody willing to run the test-suite on this? It passed the ftrace selftests that are in the kernel, although the ownership test fails if you run it a second time. That fails before this patch too. It's because the test assumes that there's been no chgrp done on any of the files/directories, which then permanently assigns owership and ignores the default. The test needs to be fixed to handle this case, as it calls chgrp and chown so itself is permanently assigning ownership. I'll have to run this on my entire test suit. -- Steve
On Sun, 28 Jan 2024 at 12:15, Steven Rostedt <rostedt@goodmis.org> wrote: > > I have to understand how the dentry lookup works. Basically, when the > ei gets deleted, it can't be freed until all dentries it references > (including its children) are no longer being accessed. Does that lookup > get called only when a dentry with the name doesn't already exist? Dentry lookup gets called with the parent inode locked for reading, so a lookup can happen in parallel with readdir and other dentry lookup. BUT. Each dentry is also "self-serializing", so you will never see a lookup on the same name (in the same directory) concurrently. The implementation is very much internal to the VFS layer, and it's all kinds of nasty, with a new potential lookup waiting for the old one, verifying that the old one is still usable, and maybe repeating it all until we find a successful previous lookup or we're the only dentry remaining. It's nasty code that is very much in the "Al Viro" camp, but the point is that any normal filesystem should treat lookups as being concurrent with non-creation events, but not concurrently multiples. There *is* some common code with "atomic_open()", where filesystems that implement that then want to know if it's the *first* lookup, or a use of a previously looked up dentry, and they'll use the "d_in_lookup()" thing to determine that. So this whole "keep track of which dentries are *currently* being looked up is actually exposed, but any normal filesystem should never care. But if you think you have that issue (tracefs does not), you really want to talk to Al extensively. > That is, can I assume that eventfs_root_lookup() is only called when > the VFS file system could not find an existing dentry and it has to > create one? Correct. For any _particular_ name, you should think of lookup as serialized. > If that's the case, then I can remove the ei->dentry and just add a ref > counter that it was accessed. Then the final dput() should call > eventfs_set_ei_status_free() (I hate that name and need to change it), > and if the ei->is_freed is set, it can free the ei. Note that the final 'dput()' will happen *after* the dentry has been removed, so what can happen is lookup("name", d1); ... lookup successful, dentry is used .. ... dentry at some point has no more users .. ... memory pressure prunes unused dentries .. ... dentry gets unhashed and is no longer visible .. lookup("name", d2); ... new dentry is created .. final dput(d1); .. old dentry - that wasn't accessible any more is freed .. and this is actually one of the *reasons* that virtual filesystems must not try to cache dentry pointers in their internal data structures. Because note how the fuilesystem saw the new lookup(d2) of the same name *before* it saw the >d_release(d1) of the old dentry. And the above is fundamental: we obviously cannot call '->d_release()' until the old dentry is all dead and buried (lockref_mark_dead() etc), so pretty much by definition you'll have that ordering being possible. It's extremely unlikely, of course. I'll be you'll never hit it in testing. So if if you associate some internal data structure with a dentry, just *what* happens when you haven't been told abotu the old dentry being dead when the new one happens? See why I say that it's fundamentally wrong for a filesystem to try to track dentries? All the operations that can use a dentry will get one passed down to them by the VFS layer. The filesystem has no business trying to remember some dentry from a previous operation, and the filesystem *will* get it wrong. But also note how refcounting works fine. In fact, refcounting is pretty much the *only* thing that works fine. So what you *should* do is - at lookup(), when you save your filesystem data in "->d_fsdata", you increment a refcount - at ->d_release(), you decrement a refcount and now you're fine. Yes, when the above (very very unlikely) situation happens, you'll temporarily have a refcount incremented twice, but that's kind of the *point* of refcounts. Side note: this is pretty much true of any kernel data structure. If you have a kernel data structure that isn't just used within one thread, it must be refcounted. But it'as *doubly* true when you save references to something that the VFS maintains, because you DO NOT CONTROL the lifetime of that entity. > The eventfs_remove_dir() can free the ei (after SRCU) if it has no > references, otherwise it needs to wait for the final dput() to do the > free. Honestly, you should just *always* do refcounting. No "free after RCU delay" as an alternative. Just refcount it. Now, the RCU delay may be needed if the lookup of said structure happens under RCU, but no, saying "I use SRCU to make sure the lifetime is at least X" is just broken. The refcount is what gives the lifetime. Any form of RCU-delaying should then be purely about non-refcounting RCU lookups that may happen as the thing is dying (and said lookup should *look* at the refcount and say "oh, this is dead, I'm not returning this". > I think the ei->dentry was required for the dir wrapper logic that we > removed. I think all of this was due to the bogus readdir that created dentries willy-nilly and without the required serialization. And that was all horribly broken. It wasn't even the above kind of "really subtle race that you'll never hit in practice" broken. It was just absolutely broken with readdir and lookup racing on the same name and creating an unholy dentry mess. Linus
On Sun, 28 Jan 2024 at 12:53, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Now, the RCU delay may be needed if the lookup of said structure > happens under RCU, but no, saying "I use SRCU to make sure the > lifetime is at least X" is just broken. Put another way, the only reason for any RCU should be that you don't use locking at lookup, and the normal lookup routine should follow a pattern something like this: rcu_read_lock(); entry = find_entry(...); if (entry && !atomic_inc_not_zero(&entry->refcount)) entry = NULL; rcu_read_unlock(); and the freeing should basically follow a pattern like if (atomic_dec_and_test(&entry->refcount)) rcu_free(entry); IOW, the *lifetime* is entirely about the refcount. No "I have killed this entry" stuff. The RCU is purely about "look, we have to look up the entry while it's being torn down, so I can fundamentally race with the teardown, and so I need to be able to see that zero refcount". Of course, the "remove it from whatever hash lists or other data structures that can reach it" happens before the freeing, *One* such thing would be the "->d_release()" of a dentry that has a ref to it in d_fsdata, but presumably there are then other subsystem-specific hash tables etc that have their own refcounts. And a side note - I personally happen to believe that if you think you need SRCU rather than regular RCU, you've already done something wrong. And the reason for that is possibly because you've mixed up the refcount logic with some other subsystem locking logic, so you're using sleeping locks to protect a refcount. That's a mistake of its own. The refcounts are generally better just done using atomics (maybe krefs). Linus
On Sun, 28 Jan 2024 12:53:31 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sun, 28 Jan 2024 at 12:15, Steven Rostedt <rostedt@goodmis.org> wrote: > > > > I have to understand how the dentry lookup works. Basically, when the > > ei gets deleted, it can't be freed until all dentries it references > > (including its children) are no longer being accessed. Does that lookup > > get called only when a dentry with the name doesn't already exist? > > Dentry lookup gets called with the parent inode locked for reading, so > a lookup can happen in parallel with readdir and other dentry lookup. > > BUT. > > Each dentry is also "self-serializing", so you will never see a lookup > on the same name (in the same directory) concurrently. The above is what I wanted to know. > > The implementation is very much internal to the VFS layer, and it's > all kinds of nasty, with a new potential lookup waiting for the old > one, verifying that the old one is still usable, and maybe repeating > it all until we find a successful previous lookup or we're the only > dentry remaining. > > It's nasty code that is very much in the "Al Viro" camp, but the point > is that any normal filesystem should treat lookups as being concurrent > with non-creation events, but not concurrently multiples. > > There *is* some common code with "atomic_open()", where filesystems > that implement that then want to know if it's the *first* lookup, or a > use of a previously looked up dentry, and they'll use the > "d_in_lookup()" thing to determine that. So this whole "keep track of > which dentries are *currently* being looked up is actually exposed, > but any normal filesystem should never care. > > But if you think you have that issue (tracefs does not), you really > want to talk to Al extensively. > > > That is, can I assume that eventfs_root_lookup() is only called when > > the VFS file system could not find an existing dentry and it has to > > create one? > > Correct. For any _particular_ name, you should think of lookup as serialized. > > > If that's the case, then I can remove the ei->dentry and just add a ref > > counter that it was accessed. Then the final dput() should call > > eventfs_set_ei_status_free() (I hate that name and need to change it), > > and if the ei->is_freed is set, it can free the ei. > > Note that the final 'dput()' will happen *after* the dentry has been > removed, so what can happen is > > lookup("name", d1); > ... lookup successful, dentry is used .. > ... dentry at some point has no more users .. > ... memory pressure prunes unused dentries .. > ... dentry gets unhashed and is no longer visible .. > lookup("name", d2); > ... new dentry is created .. > final dput(d1); > .. old dentry - that wasn't accessible any more is freed .. Actually I was mistaken. I'm looking at the final iput() not dput(). > > and this is actually one of the *reasons* that virtual filesystems > must not try to cache dentry pointers in their internal data > structures. Because note how the fuilesystem saw the new lookup(d2) of > the same name *before* it saw the >d_release(d1) of the old dentry. > > And the above is fundamental: we obviously cannot call '->d_release()' > until the old dentry is all dead and buried (lockref_mark_dead() etc), > so pretty much by definition you'll have that ordering being possible. > > It's extremely unlikely, of course. I'll be you'll never hit it in testing. > > So if if you associate some internal data structure with a dentry, > just *what* happens when you haven't been told abotu the old dentry > being dead when the new one happens? > > See why I say that it's fundamentally wrong for a filesystem to try to > track dentries? All the operations that can use a dentry will get one > passed down to them by the VFS layer. The filesystem has no business > trying to remember some dentry from a previous operation, and the > filesystem *will* get it wrong. > > But also note how refcounting works fine. In fact, refcounting is > pretty much the *only* thing that works fine. So what you *should* do > is > > - at lookup(), when you save your filesystem data in "->d_fsdata", > you increment a refcount > > - at ->d_release(), you decrement a refcount > > and now you're fine. Yes, when the above (very very unlikely) > situation happens, you'll temporarily have a refcount incremented > twice, but that's kind of the *point* of refcounts. > > Side note: this is pretty much true of any kernel data structure. If > you have a kernel data structure that isn't just used within one > thread, it must be refcounted. But it'as *doubly* true when you save > references to something that the VFS maintains, because you DO NOT > CONTROL the lifetime of that entity. > > > The eventfs_remove_dir() can free the ei (after SRCU) if it has no > > references, otherwise it needs to wait for the final dput() to do the > > free. > > Honestly, you should just *always* do refcounting. No "free after RCU > delay" as an alternative. Just refcount it. > > Now, the RCU delay may be needed if the lookup of said structure > happens under RCU, but no, saying "I use SRCU to make sure the > lifetime is at least X" is just broken. > > The refcount is what gives the lifetime. Any form of RCU-delaying > should then be purely about non-refcounting RCU lookups that may > happen as the thing is dying (and said lookup should *look* at the > refcount and say "oh, this is dead, I'm not returning this". > > > I think the ei->dentry was required for the dir wrapper logic that we > > removed. > > I think all of this was due to the bogus readdir that created dentries > willy-nilly and without the required serialization. > > And that was all horribly broken. It wasn't even the above kind of > "really subtle race that you'll never hit in practice" broken. It was > just absolutely broken with readdir and lookup racing on the same name > and creating an unholy dentry mess. Hmm, if I understand the above, I could get rid of keeping around dentry and even remove the eventfs_set_ei_status_free(). I can try something to see if it works. -- Steve
On Sun, 28 Jan 2024 12:53:31 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > Honestly, you should just *always* do refcounting. No "free after RCU > delay" as an alternative. Just refcount it. > > Now, the RCU delay may be needed if the lookup of said structure > happens under RCU, but no, saying "I use SRCU to make sure the > lifetime is at least X" is just broken. > > The refcount is what gives the lifetime. Any form of RCU-delaying > should then be purely about non-refcounting RCU lookups that may > happen as the thing is dying (and said lookup should *look* at the > refcount and say "oh, this is dead, I'm not returning this". The deleting of the ei is done outside the VFS logic. I use SRCU to synchronize looking at the ei children in the lookup. On deletion, I grab the eventfs_mutex, set ei->is_freed and then wait for SRCU to finish before freeing. The lookup checks ei->is_freed and doesn't do anything if set, but most that logic is under the SRCU, which is what I want to make sure is finished before the ei is deleted. Hmm, I still need the logic for iput(), as dentry->d_fsdata can still access the ei. That's where I need to have the ref counters. For a lookup, I need to up the ref count when I create a new inode for the ei or its children. Then in the iput() I decrement the ei ref count. I can only free the ei if the ref count is zero. The ref count is for knowing if an ei is referenced by a dentry->d_fsdata, and the SRCU is to make sure there's no lookups accessing an ei. -- Steve
On Sun, 28 Jan 2024 at 13:19, Steven Rostedt <rostedt@goodmis.org> wrote: > > The deleting of the ei is done outside the VFS logic. No. You're fundamentally doing it wrong. What you call "deletion" is just "remove from my hashes" or whatever. The lifetime of the object remains entirely unrelated to that. It is not free'd - removing it from the hashes should just be a reference counter decrement. > I use SRCU to synchronize looking at the ei children in the lookup. That's just wrong. Either you look things up under your own locks, in which case the SRCU dance is unnecessary and pointless. Or you use refcounts. In which case SRCU is also unnecessary and pointless. > On deletion, I > grab the eventfs_mutex, set ei->is_freed and then wait for SRCU to > finish before freeing. Again, bogus. Sure, you could do is "set ei->is_freed" to let any other users know (if they even care - why would they?). You'd use *locking* to serialize that. btu that has *NOTHING* to do with actually freing the data structure, and it has nothing to do with S$RCU - even if the locking might be blocking. Because *after* you have changed your data structures, and prefereably after you have already dropped your locks (to not hold them unnecessarily over any memory management) then you just do the normal "free the reference count", because you've removed the ref from your own data structures. You don't use "use SRCU before freeing". You use the pattern I showed: if (atomic_dec_and_test(&entry->refcount)) rcu_free(entry); in a "put_entry()" function, and EVERYBODY uses that function when they are done with it. In fact, the "rcu_free()" is likely entirely unnecessary, since I don't see that you ever look anything up under RCU. If all your lookups are done under the eventfs_mutex lock you have, just do if (atomic_dec_and_test(&entry->refcount)) kfree(entry); and you're done. By definition, once the refcount goes down to zero, there are no users, and if all your own data structures are maintained with a lock, there is never ever any reason to use a RCU delay. Sure, you'll have things like "dentry->d_fsdata" accesses that happen before you even take the lock, but that's fine - the d_fsdata pointer has a ref to it, so there's no locking needed for that lookup. It's just a direct pointer dereference, and it's protected by the refcount. No special cases. The user that sets "is_freed" is not special. Never will be. It's just one reference among many others, and YOU DO NOT CONTROL THE OTHER REFERENCES. If you've given a ref to dentry->d_fsdata, it's no longer yours to mess around with. All you can do is wait for the dentry to go away, at which point you do the same "put_dentry()" because exactly like your own data structures, it's JUST ANOTHER REFERENCE. See what I'm saying? Linus
On Sun, 28 Jan 2024 13:08:55 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sun, 28 Jan 2024 at 12:53, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Now, the RCU delay may be needed if the lookup of said structure > > happens under RCU, but no, saying "I use SRCU to make sure the > > lifetime is at least X" is just broken. > > Put another way, the only reason for any RCU should be that you don't > use locking at lookup, and the normal lookup routine should follow a > pattern something like this: > > rcu_read_lock(); > entry = find_entry(...); > if (entry && !atomic_inc_not_zero(&entry->refcount)) > entry = NULL; > rcu_read_unlock(); > > and the freeing should basically follow a pattern like > > if (atomic_dec_and_test(&entry->refcount)) > rcu_free(entry); Basically you are saying that when the ei is created, it should have a ref count of 1. If the lookup happens and does the atomic_inc_not_zero() it will only increment if the ref count is not 0 (which is basically equivalent to is_freed). And then we can have deletion of the object happen in both where the caller (kprobes) deletes the directory and in the final iput() reference (can I use iput and not the d_release()?), that it does the same as well. Where whatever sees the refcount of zero calls rcu_free? -- Steve
On Sun, 28 Jan 2024 at 13:43, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > That's just wrong. > > Either you look things up under your own locks, in which case the SRCU > dance is unnecessary and pointless. > > Or you use refcounts. > > In which case SRCU is also unnecessary and pointless. So from what I can see, you actually protect almost everything with the eventfs_mutex, but the problem is that you then occasionally drop that mutex in the middle. The one valid reason for dropping it is the readdir callback, which does need to write to user space memory. But no, that's not a valid reason to use SRCU. It's a very *bad* reason to use SRCU. The thing is, you can fix it two ways: - either refcount things properly, ie when you do that lookup under your lock: mutex_lock(&eventfs_mutex); ei = READ_ONCE(ti->private); if (ei && ei->is_freed) ei = NULL; mutex_unlock(&eventfs_mutex); you just go "I now have a ref" to the ei, and you increment the refcount like you should, and then you dcrement it at the end when you're done. Btw, what's with the READ_ONCE()? You have locking. The other option is to simply re-lookup the ei when you re-get the eventfs_mutext anyway. Either of those cases, and the SRCU is entirely pointless. It really looks wrong, because you seem to take that eventfs_mutex everywhere anyway. Linus
On Sun, 28 Jan 2024 14:07:49 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sun, 28 Jan 2024 at 13:43, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > That's just wrong. > > > > Either you look things up under your own locks, in which case the SRCU > > dance is unnecessary and pointless. > > > > Or you use refcounts. > > > > In which case SRCU is also unnecessary and pointless. > > So from what I can see, you actually protect almost everything with > the eventfs_mutex, but the problem is that you then occasionally drop > that mutex in the middle. > > The one valid reason for dropping it is the readdir callback, which > does need to write to user space memory. > > But no, that's not a valid reason to use SRCU. It's a very *bad* > reason to use SRCU. > > The thing is, you can fix it two ways: > > - either refcount things properly, ie when you do that lookup under your lock: > > mutex_lock(&eventfs_mutex); > ei = READ_ONCE(ti->private); > if (ei && ei->is_freed) > ei = NULL; > mutex_unlock(&eventfs_mutex); > > you just go "I now have a ref" to the ei, and you increment the > refcount like you should, and then you dcrement it at the end when > you're done. > > Btw, what's with the READ_ONCE()? You have locking. > > The other option is to simply re-lookup the ei when you re-get the > eventfs_mutext anyway. > > Either of those cases, and the SRCU is entirely pointless. It really > looks wrong, because you seem to take that eventfs_mutex everywhere > anyway. The original code just used the mutex, but then we were hitting deadlocks because we used the mutex in the iput() logic. But this could have been due to the readdir logic causing the deadlocks. A lot of the design decisions were based on doing the dentry creation in the readdir code. Now that it's no longer there, I could go back and try taking the eventfs_mutex for the entirety of the lookup and see if lockdep complains again about also using it in the iput logic. Then yes, we can get rid of the SRCU as that was added as a way to get out of that deadlock. -- Steve
On Sun, 28 Jan 2024 at 14:01, Steven Rostedt <rostedt@goodmis.org> wrote: > > Basically you are saying that when the ei is created, it should have a > ref count of 1. If the lookup happens and does the > atomic_inc_not_zero() it will only increment if the ref count is not 0 > (which is basically equivalent to is_freed). Exactly. That's what we do with almost all our data structures. You can use the kref() infrastructure to give this a bit more structure, and avoid doing the atomics by hand. So then "get a ref" is literally kref_get(&ei->kref); and releasing it is kref_put(&ei->kref, ei_release); so you don't have the write out that "if (atomic_dec_and_test(..) kfree();". And "ei_release()" looks just something like this: static void ei_release(struct kref *ref) { kfree(container_of(ref, struct eventfs_inode, kref); } and that's literally all you need (ok, you need to add the 'kref' to the eventfs_inode, and initialize it at allocation time, of course). > And then we can have deletion of the object happen in both where the > caller (kprobes) deletes the directory and in the final iput() > reference (can I use iput and not the d_release()?), that it does the > same as well. > > Where whatever sees the refcount of zero calls rcu_free? Having looked more at your code, I actually don't see you even needing rcu_free(). It's not that you don't need SRCU (which is *much* heavier than RCU) - you don't even need the regular quick RCU at all. As far as I can see, you do all the lookups under eventfs_mutex, so there are actually no RCU users. And the VFS code (that obviously uses RCU internally) has a ref to it and just a direct pointer, so again, there's no real RCU involved. You seem to have used SRCU as a "I don't want to do refcounts" thing. I bet you'll notice that it clarifies things *enormously* to just use refcounts. Linus
On Sun, 28 Jan 2024 at 14:17, Steven Rostedt <rostedt@goodmis.org> wrote: > > The original code just used the mutex, but then we were hitting > deadlocks because we used the mutex in the iput() logic. But this could > have been due to the readdir logic causing the deadlocks. I agree that it's likely that the readdir logic caused the deadlocks on the eventfs_mutex, but at the same time it really does tend to be a good idea to drop any locks when dealing with readdir(). The issue with the readdir iterator is that it needs to access user space memory for every dirent it fills in, and any time you hold a lock across a user space access, you are now opening yourself up to having the lock have really long hold times. It's basically a great way to cauise a mini-DoS. And even if you now can do without the mutex in the release paths etc by just using refcounts, and even if you thus get rid of the deadlock itself, it's typically a very good idea to have the 'iterate_shared()' function drop all locks before writing to user space. The same is obviously true of 'read()' etc that also writes to user space, but you tend to see the issue much more with the readdir code, because it iterates over all these small things, and readdir() typically wants the lock more too (because there's all that directory metadata). So dropping the lock might not be something you *have* to do in iterate_shared, but it's often a good idea. But dropping the lock also doesn't tend to be a big problem, if you just have refcounted data structures. Sure, the data might change (because you dropped the lock), but at least the data structure itself is still there when you get the lock, so at most it might be a "I will need to re-check that the entry hasn't been removed" kind of thing. Linus
On Sun, 28 Jan 2024 14:17:43 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > You seem to have used SRCU as a "I don't want to do refcounts" thing. > I bet you'll notice that it clarifies things *enormously* to just use > refcounts. Well, removing creating dentries in the readdir() logic is what opened up the door to a lot of simplification. Thanks for helping me with that. As I believe that may have been the source of most of the deadlocks we were struggling with. But yeah, kref probably could have fixed that too. -- Steve
On Sat, 27 Jan 2024 13:47:32 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > There are tons of other 'ei->dentry' uses, and I didn't look at those. > Baby steps. But this *seems* like an obvious cleanup, and many small > obvious cleanups later and perhaps the 'ei->dentry' pointer (and the > '->d_children[]' array) can eventually go away. They should all be > entirely useless - there's really no reason for a filesystem to hold > on to back-pointers of dentries. I was working on getting rid of ei->dentry, but then I hit: void eventfs_remove_dir(struct eventfs_inode *ei) { struct dentry *dentry; if (!ei) return; mutex_lock(&eventfs_mutex); dentry = ei->dentry; eventfs_remove_rec(ei, 0); mutex_unlock(&eventfs_mutex); /* * If any of the ei children has a dentry, then the ei itself * must have a dentry. */ if (dentry) simple_recursive_removal(dentry, NULL); } Where it deletes the all the existing dentries in a tree. Is this a valid place to keep ei->dentry? I believe this is what makes the directory disappear from the user's view. But the ei->dentry is there to know that it is in the user's view to begin with. -- Steve
On Sun, 28 Jan 2024 at 14:51, Steven Rostedt <rostedt@goodmis.org> wrote: > > I was working on getting rid of ei->dentry, but then I hit: > > void eventfs_remove_dir(struct eventfs_inode *ei) > { [..] > > Where it deletes the all the existing dentries in a tree. Is this a > valid place to keep ei->dentry? No, when you remove the directory, just leave the child dentries alone. Remember: they are purely caches, and you either have - somebody is still using it (you can 'rmdir()' a directory that some other process has as its cwd, for example), which keeps it alive and active anyway - when the last user is done, the dcache code will just free the dentries anyway so there's no reason to remove any of the dentries by hand - and in fact simple_recursive_removal() never did that anyway for anything that was still busy. For a pure cached set of dentries (that have no users), doing the last "dput()" on a directory will free that directory dentry, but it will also automatically free all the unreachable children recursively. Sure, simple_recursive_removal() does other things (sets inode flags to S_DEAD, does fsnotify etc), but none of those should actually matter. I think that whole logic is simply left-over from when the dentries weren't a filesystem cache, but were the *actual* filesystem. So it actually became actively wrong when you started doing your own backing store, but it just didn't hurt (except for code legibility). Of course, eventfs is slightly odd and special in that this isn't a normal "rmdir()", so it can happen with files still populated. And those children will stick around and be useless baggage until they are shrunk under memory pressure. But I don't think it should *semantically* matter, exactly because they always could stay around anyway due to having users. There are some cacheability knobs like .d_delete = always_delete_dentry, which probably makes sense for any virtual filesystem (it limits caching of dentries with no more users - normally the dcache will happily keep caching dentries for the *next* user, but that may or may not make sense for virtual filesystems) So there is tuning that can be done, but in general, I think you should actively think of the dcache as just "it's just a cache, leave stale stuff alone" Linus
On Sun, 28 Jan 2024 15:24:09 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sun, 28 Jan 2024 at 14:51, Steven Rostedt <rostedt@goodmis.org> wrote: > > > > I was working on getting rid of ei->dentry, but then I hit: > > > > void eventfs_remove_dir(struct eventfs_inode *ei) > > { [..] > > > > Where it deletes the all the existing dentries in a tree. Is this a > > valid place to keep ei->dentry? > > No, when you remove the directory, just leave the child dentries > alone. Remember: they are purely caches, and you either have My fear is that they can be accessed again when there's no file around. Thus, the dentry and the inode are created when accessed with the fops loaded that calls into the tracing code. For example, when some creates a kprobe: # cd /sys/kernel/tracing # echo 'p:sched schedule' > kprobe_events # ls events/kprobes/sched/enable enable Now I do; # echo > kprobe_events Currently if I do: # ls events/kprobes/sched/enable ls: cannot access 'events/kprobes/sched/enable' But because the dentry doesn't go away immediately, it can cause issues. I rather not force them to be freed every time the dentry goes to zero, as that could cause more overhead in tracing where the tooling already does multiple scans of the eventfs directory for various reasons. If that dentry is still there, and someone does: echo 1 > events/kprobes/sched/enable after the ei was removed, wouldn't it call back into the open callback of the inode represented by that dentry? If that happens, it will call into the kprobe code and reference structures that have already been freed. Note, before adding that simple_recursive_removal() or my open coded version I had earlier, the kprobe files would stick around after the kprobe was freed, and I was able to crash the kernel. > > - somebody is still using it (you can 'rmdir()' a directory that some > other process has as its cwd, for example), which keeps it alive and > active anyway Wouldn't it be bad if the dentry hung around after the rmdir. You don't want to be able to access files after rmdir has finished. > > - when the last user is done, the dcache code will just free the > dentries anyway But it doesn't. That debug code I had before that showed the ei->dentry in show_event_dentries file would show that the dentries exist for some time when their ref count was zero. They only got freed when on drop_caches or memory reclaim. I like the fact that they hang around that we are not constantly allocating them for every reference. > > so there's no reason to remove any of the dentries by hand - and in > fact simple_recursive_removal() never did that anyway for anything > that was still busy. > > For a pure cached set of dentries (that have no users), doing the last > "dput()" on a directory will free that directory dentry, but it will > also automatically free all the unreachable children recursively. But it doesn't free it. It just makes it available to be freed. > > Sure, simple_recursive_removal() does other things (sets inode flags > to S_DEAD, does fsnotify etc), but none of those should actually > matter. > > I think that whole logic is simply left-over from when the dentries > weren't a filesystem cache, but were the *actual* filesystem. So it > actually became actively wrong when you started doing your own backing > store, but it just didn't hurt (except for code legibility). > > Of course, eventfs is slightly odd and special in that this isn't a > normal "rmdir()", so it can happen with files still populated. And > those children will stick around and be useless baggage until they > are shrunk under memory pressure. > > But I don't think it should *semantically* matter, exactly because > they always could stay around anyway due to having users. It does, because things like kprobes will call into tracefs to free its files so that it can free its own resources as the open callback will reference those resources. If those dentries sick around and allow the user to open the file again, it will cause a use after free bug. It does keep track of opens so the kprobe code will not be freed if a task has a reference already. You get an -EBUSY on the removal of kprobes if something has a reference to it. But on deleting, the return from the eventfs code that deleted the directory, is expected that there will be no more opens on the kprobe files. With stale dentries hanging around after the directory is deleted, that is not the case. > > There are some cacheability knobs like > > .d_delete = always_delete_dentry, > > which probably makes sense for any virtual filesystem (it limits > caching of dentries with no more users - normally the dcache will > happily keep caching dentries for the *next* user, but that may or may > not make sense for virtual filesystems) But if it were freed every time there's no more users, then when you do a stat() it will get referenced then freed (as the dentry ref count goes to zero after the stat() is completed). then the open/write/close will create and delete it, and this could be done several times for the same file. If the dentry is freed every time its ref count goes to zero, then it will need to be created for every operation. > > So there is tuning that can be done, but in general, I think you > should actively think of the dcache as just "it's just a cache, leave > stale stuff alone" I would like to keep it as a cache, but deleting every time the ref count goes to zero is not much of a cache. -- Steve
On Sun, 28 Jan 2024 18:59:43 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > > - somebody is still using it (you can 'rmdir()' a directory that some > > other process has as its cwd, for example), which keeps it alive and > > active anyway > > Wouldn't it be bad if the dentry hung around after the rmdir. You don't > want to be able to access files after rmdir has finished. And thinking about this more, this is one thing that is different with eventfs than a normal file system. The rmdir in most cases where directories are deleted in eventfs will fail if there's any open files within it. eventfs doesn't itself enforce this, but the users of eventfs do. -- Steve
On Sun, 28 Jan 2024 at 16:21, Steven Rostedt <rostedt@goodmis.org> wrote: > > > > > Wouldn't it be bad if the dentry hung around after the rmdir. You don't > > want to be able to access files after rmdir has finished. Steven, I already told you that that is NORMAL. This is how UNIX filesystems work. Try this: mkdir dummy cd dummy echo "Hello" > hello ( sleep 10; cat ) < hello & rm hello cd .. rmdir dummy and guess what? It will print "hello" after that file has been removed, and the whole directory is gone. YOU NEED TO DEAL WITH THIS. > And thinking about this more, this is one thing that is different with > eventfs than a normal file system. The rmdir in most cases where > directories are deleted in eventfs will fail if there's any open files > within it. No. Stop thinking that eventfs is special. It's not. You need to deal with the realities of having made a filesystem. And one of those realities is that you don't control the dentries, and you can't randomly cache dentry state and then do things behind the VFS layer's back. So remove that broken function. Really. You did a filesystem, and that means that you had better play by the VFS rules. End of discussion. Now, you can then make your own "read" and "lookup" etc functions say "if the backing store data has been marked dead, I'll not do this". That's *YOUR* data structures, and that's your choice. But you need to STOP thinking you can play games with dentries. And you need to stop making up BS arguments for why you should be able to. So if you are thinking of a "Here's how to do a virtual filesystem" talk, I would suggest you start with one single word: "Don't". I'm convinced that we have made it much too easy to do a half-arsed virtual filesystem. And eventfs is way beyond half-arsed. It's now gone from half-arsed to "I told you how to do this right, and you are still arguing". That makes it full-arsed. So just stop the arsing around, and just get rid of those _broken_ dentry games. Linus
On Sun, 28 Jan 2024 at 17:00, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > mkdir dummy > cd dummy > echo "Hello" > hello > ( sleep 10; cat ) < hello & > rm hello > cd .. > rmdir dummy Note that it's worth repeating that simple_recursive_removal() wouldn't change any of the above. It only unhashes things and makes them *look* gone, doing things like clearing i_nlink etc. But those VFS data structures would still exist, and the files that had them open would still continue to be open. So if you thought that simple_recursive_removal() would make the above kind of thing not able to happen, and that eventfs wouldn't have to deal with dentries that point to event_inodes that are dead, you were always wrong. simple_recursive_removal() is mostly just lipstick on a pig. It does cause the cached dentries that have no active use be removed earlier, so it has that "memory pressure" kind of effect, but it has no real fundamental semantic effect. Of course, for a filesystem where the dentry tree *is* the underlying data (ie the 'tmpfs' kind, but also things like debugfs or ipathfs, for example), then things are different. There the dentries are the primary thing, and not just a cache in front of the backing store. But you didn't want that, and those days are long gone as far as tracefs is concerned. Linus
On Sun, 28 Jan 2024 17:00:08 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sun, 28 Jan 2024 at 16:21, Steven Rostedt <rostedt@goodmis.org> wrote: > > > > > > > > Wouldn't it be bad if the dentry hung around after the rmdir. You don't > > > want to be able to access files after rmdir has finished. > > Steven, I already told you that that is NORMAL. > > This is how UNIX filesystems work. Try this: > > mkdir dummy > cd dummy > echo "Hello" > hello > ( sleep 10; cat ) < hello & Running strace on the above we have: openat(AT_FDCWD, "hello", O_RDONLY) = 3 dup2(3, 0) = 0 close(3) = 0 newfstatat(AT_FDCWD, "/usr/local/sbin/sleep", 0x7ffee0e44a60, 0) = -1 ENOENT (No such file or directory) newfstatat(AT_FDCWD, "/usr/local/bin/sleep", 0x7ffee0e44a60, 0) = -1 ENOENT (No such file or directory) newfstatat(AT_FDCWD, "/usr/sbin/sleep", 0x7ffee0e44a60, 0) = -1 ENOENT (No such file or directory) newfstatat(AT_FDCWD, "/usr/bin/sleep", {st_mode=S_IFREG|0755, st_size=43888, ...}, 0) = 0 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], NULL, 8) = 0 So the file is ***opened*** and gets a referenced. YES I DEAL WITH THIS!!! This works fine! I have no problems with this. > rm hello > cd .. > rmdir dummy > > and guess what? It will print "hello" after that file has been > removed, and the whole directory is gone. > > YOU NEED TO DEAL WITH THIS. And I do very well THANK YOU! But if this does not call that simple_recursive_removal() the dentry *STAYS AROUND* and things CAN OPEN IT AFTER IT HAS BEEN REMOVED! That's equivalent to doing an ls on a directory you just deleted with rmdir and you still see the files. Note, eventfs has no call to rmdir here. It's a virtual file system. The directories disappear without user accessing the directory. Same for /proc. The directories for pids come and go when processes fork and exit. You don't want someone to be able to access /proc/1234 *after* the task 1234 exited and was cleaned up by its parent. Do you? And I'm not talking about if something opened the files in /proc/1234 before the task disappeared. That is dealt with. Just like if a file in eventfs is opened and the backing data is to be deleted. It either prevents the deletion, or in some cases it uses ref counts to know that something still has one of its files open. And it won't delete the data until everything has closed it. But after a file or directory has been deleted, NO file system allows it to be opened again. This isn't about something opening a file in eventfs and getting a reference to it, and then the file or directory is being deleted. That's covered. I'm talking about the directory being deleted and then allowing something to open a file within it AFTER the deletion has occurred. If a dentry is still around, THAT CAN HAPPEN! With a dentry still around with nothing accessing it, and you remove the data it represents, if you don't make that dentry invisible to user space, it can be opened AFTER it has been deleted. Without calling d_invalidate (which calls shrink_dcache_parent) on the dentry, it is still visible. Even with a ref count of zero and nothing has it opened. That means you can open that file again AFTER it has been deleted. The vfs_rmdir() calls shrink_dcache_parent() that looks to prune the dcache to make it not visible any more. But vfs_rmdir isn't ever called for eventfs. procfs calls d_invalidate which removes the dentry from being visible to the file system. I *use* to do that too until Al Viro suggested that I use the simple_recursive_removal() call that does all that for me. > > > And thinking about this more, this is one thing that is different with > > eventfs than a normal file system. The rmdir in most cases where > > directories are deleted in eventfs will fail if there's any open files > > within it. > > No. > > Stop thinking that eventfs is special. It's not. It's not special with respect to other virtual file systems, but virtual file systems *are* special compared to regular file systems. Why? Because regular file systems go through the VFS layer for pretty much *all* interactions with them. Virtual file systems interact with the kernel without going through VFS layer. In normal file systems, to remove a directory you have to go through rmdir which does all the nice things your are telling me about. But virtual file systems directories (except for tmpfs) have their directories removed by other means. The VFS layer *has no idea* that a directory is removed. With eventfs calling that simple_recursive_removal() tells the VFS layer this directory is being deleted, just as if someone called rmdir(). If I don't call that function VFS will think the directory is still around and be happy to allow users to open files in it AFTER the directory has been deleted. Your example above does not do what I'm talking about here. It shows something OPENING a file and then deleting the directory. Yes, if you have an opened reference to something and it gets deleted, you still have access to that reference. But you should not be able to get a new reference to something after it has been deleted. > > You need to deal with the realities of having made a filesystem. And > one of those realities is that you don't control the dentries, and you > can't randomly cache dentry state and then do things behind the VFS > layer's back. I'm not. I'm trying to let VFS know a directory is deleted. Because when you delete a kprobe, the directory that has the control files for that kprobe (like enabling it) go away too. I have to let VFS know that the directory is deleted, just like procfs has to tell it when a directory for a process id is no more. You don't kill tasks with: rmdir /proc/1234 And you don't delete kprobes with: rmdir events/kprobe/sched > > So remove that broken function. Really. You did a filesystem, and > that means that you had better play by the VFS rules. > > End of discussion. And I do it just like debugfs when it deletes files outside of VFS or procfs, and pretty much most virtual file systems. > > Now, you can then make your own "read" and "lookup" etc functions say > "if the backing store data has been marked dead, I'll not do this". > That's *YOUR* data structures, and that's your choice. > > But you need to STOP thinking you can play games with dentries. And > you need to stop making up BS arguments for why you should be able > to. > > So if you are thinking of a "Here's how to do a virtual filesystem" > talk, I would suggest you start with one single word: "Don't". > > I'm convinced that we have made it much too easy to do a half-arsed > virtual filesystem. And eventfs is way beyond half-arsed. > > It's now gone from half-arsed to "I told you how to do this right, and > you are still arguing". That makes it full-arsed. > > So just stop the arsing around, and just get rid of those _broken_ dentry games. Sorry, but you didn't prove your point. The example you gave me is already covered. Please tell me when a kprobe goes away, how do I let VFS know? Currently the tracing code (like kprobes and synthetic events) calls eventfs_remove_dir() with just a reference to that ei eventfs_inode structure. I currently use the ei->dentry to tell VFS "this directory is being deleted". What other means do I have to accomplish the same thing? -- Steve
On Sun, 28 Jan 2024 17:42:30 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sun, 28 Jan 2024 at 17:00, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > mkdir dummy > > cd dummy > > echo "Hello" > hello > > ( sleep 10; cat ) < hello & > > rm hello > > cd .. > > rmdir dummy > > Note that it's worth repeating that simple_recursive_removal() > wouldn't change any of the above. It only unhashes things and makes > them *look* gone, doing things like clearing i_nlink etc. I know, but I already cover the above case. And that case is not what simple_recursive_removal() is covering. I'm worried about what can be opened after a deletion. Not what has already been opened. The simple_recrusive_removal() is the way to clear the dcache on those files and directories that are being removed so that no new references can happen on them. So, I removed the simple_recursive_removal() from the code to see what happened. Interesting, the opposite occurred. # cd /sys/kernel/tracing # echo 'p:sched schedule' > kprobe_events # ls events/kprobes enable filter sched # ls events/kprobes/sched enable filter format hist hist_debug id inject trigger # cat events/kprobes/sched/enable 0 # echo 'p:timer read_current_timer' >> kprobe_events # ls events/kprobes enable filter sched timer Now delete just one kprobe (keeping the kprobes directory around) # echo '-:sched schedule' >> kprobe_events # ls events/kprobes/ enable filter timer Now recreate that kprobe # echo 'p:sched schedule' >> kprobe_events # ls events/kprobes enable filter sched timer # ls events/kprobes/sched/ ls: reading directory 'events/kprobes/sched/': Invalid argument I have no access to the directory that was deleted and recreated. > > But those VFS data structures would still exist, and the files that > had them open would still continue to be open. > > So if you thought that simple_recursive_removal() would make the above > kind of thing not able to happen, and that eventfs wouldn't have to > deal with dentries that point to event_inodes that are dead, you were > always wrong. No but I want to shrink the dentries after the directory is removed. Perhaps something else is the error here. > > simple_recursive_removal() is mostly just lipstick on a pig. It does > cause the cached dentries that have no active use be removed earlier, > so it has that "memory pressure" kind of effect, but it has no real > fundamental semantic effect. I was using it to "flush" the cache on that directory. Nothing more. > > Of course, for a filesystem where the dentry tree *is* the underlying > data (ie the 'tmpfs' kind, but also things like debugfs or ipathfs, > for example), then things are different. Note, tracefs was built on debugfs. Only the "events" directory is "different". The rest of /sys/kernel/tracing behaves exactly like debugfs. > > There the dentries are the primary thing, and not just a cache in > front of the backing store. > > But you didn't want that, and those days are long gone as far as > tracefs is concerned. Well, as long as eventfs is ;-) -- Steve
On Sun, 28 Jan 2024 21:32:49 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > # echo 'p:sched schedule' >> kprobe_events > # ls events/kprobes > enable filter sched timer > > # ls events/kprobes/sched/ > ls: reading directory 'events/kprobes/sched/': Invalid argument > > I have no access to the directory that was deleted and recreated. Ah, this was because the final iput() does dentry->d_fsdata = NULL, and in the lookup code I have: mutex_lock(&eventfs_mutex); ei = READ_ONCE(ti->private); if (ei && ei->is_freed) ei = NULL; mutex_unlock(&eventfs_mutex); if (!ei) { printk("HELLO no ei\n"); goto out; } Where that printk() was triggering. So at least it's not calling back into the tracing code ;-) Interesting that it still did the lookup, even though it was already referenced. I'm still learning the internals of VFS. Anyway, after keeping the d_fsdata untouched (not going to NULL), just to see what would happen, I ran it again with KASAN and did trigger: [ 106.255468] ================================================================== [ 106.258400] BUG: KASAN: slab-use-after-free in tracing_open_file_tr+0x3a/0x120 [ 106.261228] Read of size 8 at addr ffff8881136f27b8 by task cat/868 [ 106.264506] CPU: 2 PID: 868 Comm: cat Not tainted 6.8.0-rc1-test-00008-gbee668990ac4-dirty #454 [ 106.267810] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 106.271337] Call Trace: [ 106.272406] <TASK> [ 106.273317] dump_stack_lvl+0x5c/0xc0 [ 106.274750] print_report+0xcf/0x670 [ 106.276173] ? __virt_addr_valid+0x15a/0x330 [ 106.278807] kasan_report+0xd8/0x110 [ 106.280172] ? tracing_open_file_tr+0x3a/0x120 [ 106.281745] ? tracing_open_file_tr+0x3a/0x120 [ 106.283343] tracing_open_file_tr+0x3a/0x120 [ 106.284887] do_dentry_open+0x3b7/0x950 [ 106.286284] ? __pfx_tracing_open_file_tr+0x10/0x10 [ 106.287992] path_openat+0xea8/0x11d0 That was with just these commands: cd /sys/kernel/tracing/ echo 'p:sched schedule' >> /sys/kernel/tracing/kprobe_events echo 'p:timer read_current_timer' >> kprobe_events ls events/kprobes/ cat events/kprobes/sched/enable ls events/kprobes/sched echo '-:sched schedule' >> /sys/kernel/tracing/kprobe_events ls events/kprobes/sched/enable cat events/kprobes/sched/enable BTW, the ls after the deletion returned: # ls events/kprobes/sched/enable events/kprobes/sched/enable In a normal file system that would be equivalent to: # mkdir events/kprobes/sched # touch events/kprobes/sched/enable # rm -rf events/kprobes/sched # ls events/kprobes/sched/enable events/kprobes/sched/enable -- Steve
On Sun, 28 Jan 2024 at 19:40, Steven Rostedt <rostedt@goodmis.org> wrote: > > [ 106.258400] BUG: KASAN: slab-use-after-free in tracing_open_file_tr+0x3a/0x120 > [ 106.261228] Read of size 8 at addr ffff8881136f27b8 by task cat/868 Are you refcounting the pointers that you have in the dentries (and inodes)? Like we talked about you needing to do? Every time you assign a pointer to d_fsdata, you need to kref_get() it. You try to work around the tracefs weaknesses by trying to clean up the dentry data, but it's WRONG. You should refcount the data properly, so that you don't NEED to clean it out. Linus
> > > > You need to deal with the realities of having made a filesystem. And > > one of those realities is that you don't control the dentries, and you > > can't randomly cache dentry state and then do things behind the VFS > > layer's back. > > I'm not. I'm trying to let VFS know a directory is deleted. Because > when you delete a kprobe, the directory that has the control files for > that kprobe (like enabling it) go away too. I have to let VFS know that > the directory is deleted, just like procfs has to tell it when a > directory for a process id is no more. > > You don't kill tasks with: rmdir /proc/1234 > > And you don't delete kprobes with: rmdir events/kprobe/sched > > > > > So remove that broken function. Really. You did a filesystem, and > > that means that you had better play by the VFS rules. > > > > End of discussion. > > And I do it just like debugfs when it deletes files outside of VFS or > procfs, and pretty much most virtual file systems. > I think it is better if we used the term "pseudo" file systems, because to me VFS already stands for "virtual file system". > > > > Now, you can then make your own "read" and "lookup" etc functions say > > "if the backing store data has been marked dead, I'll not do this". > > That's *YOUR* data structures, and that's your choice. > > > > But you need to STOP thinking you can play games with dentries. And > > you need to stop making up BS arguments for why you should be able > > to. > > > > So if you are thinking of a "Here's how to do a virtual filesystem" > > talk, I would suggest you start with one single word: "Don't". > > > > I'm convinced that we have made it much too easy to do a half-arsed > > virtual filesystem. And eventfs is way beyond half-arsed. > > > > It's now gone from half-arsed to "I told you how to do this right, and > > you are still arguing". That makes it full-arsed. > > > > So just stop the arsing around, and just get rid of those _broken_ dentry games. > > Sorry, but you didn't prove your point. The example you gave me is > already covered. Please tell me when a kprobe goes away, how do I let > VFS know? Currently the tracing code (like kprobes and synthetic > events) calls eventfs_remove_dir() with just a reference to that ei > eventfs_inode structure. I currently use the ei->dentry to tell VFS > "this directory is being deleted". What other means do I have to > accomplish the same thing? > Would kernfs_node_dentry() have been useful in that case? Just looking at kernfs_node and eventfs_inode gives a very strong smell of reinventing. Note that the fact that eventfs has no rename makes the whole dentry business a lot less complicated than it is in the general VFS case - IIUC, an eventfs path either exists or it does not exist, but if it exists, it conceptually always refers to the same underlying object (kprobe). I am not claiming that kernfs can do everything that eventfs needs - I don't know, because I did not look into it. But the concept of detaching the pseudo filesystem backing objects from the vfs objects was already invented once and Greg has also said the same thing. My *feeling* at this point is that the best course of action is to use kernfs and to improve kernfs to meet eventfs needs, if anything needs improving at all. IMO, the length and content of this correspondence in itself is proof that virtual file systems should use an abstraction that is much less flexible than the VFS. Think of it this way - kernefs and VFS are both under-documented, but the chances of getting kernfs well documented are far better than ever being able to document all the subtleties of VFS for mortals. IOW, I would be happy if instead of the LSFMM topic "Making pseudo file systems inodes/dentries more like normal file systems" We would be talking about "Best practices for writing a pseudo filesystem" and/or "Missing kernfs functionality for writing pseudo filesystems" Thanks, Amir.
On Mon, 29 Jan 2024 08:44:29 +0200 Amir Goldstein <amir73il@gmail.com> wrote: Hi Amir, [ Suffering a bit of insomnia, I made the mistake of reading email from bed, and now I have to reply :-p ] > > > > And I do it just like debugfs when it deletes files outside of VFS or > > procfs, and pretty much most virtual file systems. > > > > I think it is better if we used the term "pseudo" file systems, because > to me VFS already stands for "virtual file system". Oops, you are absolutely correct. I meant "pseudo" but somehow switch to saying "virtual". :-p > > > > Sorry, but you didn't prove your point. The example you gave me is > > already covered. Please tell me when a kprobe goes away, how do I let > > VFS know? Currently the tracing code (like kprobes and synthetic > > events) calls eventfs_remove_dir() with just a reference to that ei > > eventfs_inode structure. I currently use the ei->dentry to tell VFS > > "this directory is being deleted". What other means do I have to > > accomplish the same thing? > > > > Would kernfs_node_dentry() have been useful in that case? > Just looking at kernfs_node and eventfs_inode gives a very strong > smell of reinventing. Close, but looking at kernfs real quick, I think I see the difference and why eventfs relies on the dentry, and why it doesn't need to. > > Note that the fact that eventfs has no rename makes the whole dentry > business a lot less complicated than it is in the general VFS case - > IIUC, an eventfs path either exists or it does not exist, but if it exists, > it conceptually always refers to the same underlying object (kprobe). > > I am not claiming that kernfs can do everything that eventfs needs - > I don't know, because I did not look into it. eventfs has one other major simplicity that kernfs does not. Not only does it not have renames, but files are not created or deleted individually. That is, when a directory is created, it will have a fixed number of files. It will have those same files until the directory is deleted. That means I have no meta data saving anything about the files, except a fixed array that that holds only a name and a callback for each file in that directory. > > But the concept of detaching the pseudo filesystem backing objects > from the vfs objects was already invented once and Greg has also > said the same thing. > > My *feeling* at this point is that the best course of action is to use > kernfs and to improve kernfs to meet eventfs needs, if anything needs > improving at all. > > IMO, the length and content of this correspondence in itself > is proof that virtual file systems should use an abstraction that > is much less flexible than the VFS. > > Think of it this way - kernefs and VFS are both under-documented, > but the chances of getting kernfs well documented are far better > than ever being able to document all the subtleties of VFS for mortals. > > IOW, I would be happy if instead of the LSFMM topic > "Making pseudo file systems inodes/dentries more like normal file systems" > We would be talking about > "Best practices for writing a pseudo filesystem" and/or > "Missing kernfs functionality for writing pseudo filesystems" I'm fine with that, and I think I also understand how to solve the issue with Linus here. Linus hates the fact that eventfs_inode has a reference to the dentry. The reason I do that is because when the dentry is created, it happens in lookup with a given file name. The registered callback for the file name will return back a "data" pointer that gets assigned to the dentry's inode->i_private data. This gets returned directly to the open function calls. That is, the kprobe will register a handle that gets assigned to the inode->i_private data just like it did when it was in debugfs. Then the open() call would get the kprode data via the inode->i_private. This is why reference counters do not work here. If I don't make the dentry and inode go away after the last reference, it is possible to call that open function with the i_private data directly. I noticed that kernfs assigns the kernfs_inode to the i_private data. Thus it has some control over what's in the i_private. I use that simple_recursive_removal() to clear out any inodes that have a dentry ref count of zero so that there will be no inode->i_private references in the open. I'll have to look more at kernfs to see how it works. Perhaps it can give me some ideas on not having to use dentry. Hmm, I may have been looking at this all wrong. I don't need the dentry. I need the inode! Instead of keeping a pointer to the dentry in the eventfs_inode, I should just be keeping a pointer to the inode. As then I could remove the inode->i_private to NULL on the last reference. The open() calls will have to check for NULL in that case. Are inodes part of the VFS system like dentry is? I would think not as tracefs has a "container_of" around the inode to get to the tracefs_inode, just like many other file systems do. That would remove my need to have to keep around any dentry. I first need to know why I'm seeing "ghost" files. That is without that simple_recursive_removal() call, I get: # echo 'p:sched schedule' >> /sys/kernel/tracing/kprobe_events # echo 'p:timer read_current_timer' >> kprobe_events # ls events/kprobes/ enable filter sched timer # cat events/kprobes/sched/enable 0 # ls events/kprobes/sched enable filter format hist hist_debug id inject trigger # echo '-:sched schedule' >> /sys/kernel/tracing/kprobe_events # ls events/kprobes enable filter timer # ls events/kprobes/sched/enable events/kprobes/sched/enable ?? The sched directory has been deleted but I can still "ls" the files within it. -- Steve
On 2024-01-26 17:49, Linus Torvalds wrote: > On Fri, 26 Jan 2024 at 14:41, Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: >> >> Yes, there is even a note about stat.st_size in inode(7) explaining >> this: > > Good. Send a patch to do the same for st_ino. This breaks "cp -aH" and "cp -aL". Even setting st_nlink={0,1} does not help there, from coreutils 9.1: copy_internal(): [...] else if (x->preserve_links && !x->hard_link && (1 < src_sb.st_nlink || (command_line_arg && x->dereference == DEREF_COMMAND_LINE_ARGUMENTS) || x->dereference == DEREF_ALWAYS)) { earlier_file = remember_copied (dst_relname, src_sb.st_ino, src_sb.st_dev); } Thanks, Mathieu
On Mon, 29 Jan 2024 at 08:00, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > This breaks "cp -aH" and "cp -aL". Do we care? Do we have a user that cares? Has anybody ever hit it? Why would you ever do anything like that to tracefs filesystem? In other words: my point is that tracefs just isn't a regular filesystem. Never was, never will be. And people should be *aware* of that. We should not say "hey, if it doesn't work like a normal filesystem, it's a bug". Try "cp -aL" on /proc, and guess what? It won't work all that well either. For entirely *different* reasons. You'll get some variation of "Input/output error"s, and insanely big files and quite possibly you'll end up with recursive copying as you try to copy the file that is /proc/self/fd/<output>. It's just a nonsensical operation to do, and if somebody says "I can't copy /proc on my system" it's a PEBKAC, not a kernel problem. The "no regressions" rule is not about made-up "if I do this, behavior changes". The "no regressions" rule is about *users*. If you have an actual user that has been doing insane things, and we change something, and now the insane thing no longer works, at that point it's a regression, and we'll sigh, and go "Users are insane" and have to fix it. But if you have some random test that now behaves differently, it's not a regression. It's a *warning* sign, sure: tests are useful. So tests can show when something user-visible changed, and as such they are a "there be monsters here" sign that maybe some user experience will hit it too. So I agree that "just use the same inode number" changes behavior. I agree that it can be a bit of a danger. But these "look, I can see a difference" isn't an argument. And honestly, I have now spent *days* looking at tracefs, and I'm finding core fundamental bugs that would cause actual oopses and/or wild pointer accesses. All of which makes me go "this code needs to be simpler and *cleaner* and stop making problems". In other words: tracefs is such a complete mess that I do not care one *whit* about "cp -aL". I care about "this is actual kernel instability". Linus
diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 6b211522a13e..7be7a694b106 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -32,14 +32,11 @@ */ static DEFINE_MUTEX(eventfs_mutex); -/* Choose something "unique" ;-) */ -#define EVENTFS_FILE_INODE_INO 0x12c4e37 - /* Just try to make something consistent and unique */ -static int eventfs_dir_ino(struct eventfs_inode *ei) +static int eventfs_dir_ino(struct eventfs_inode *ei, int nr_files) { if (!ei->ino) - ei->ino = get_next_ino(); + ei->ino = tracefs_get_next_ino(nr_files); return ei->ino; } @@ -327,6 +324,7 @@ void eventfs_update_gid(struct dentry *dentry, kgid_t gid) * @parent: parent dentry for this file. * @data: something that the caller will want to get to later on. * @fop: struct file_operations that should be used for this file. + * @ino: inode number for this file * * This function creates a dentry that represents a file in the eventsfs_inode * directory. The inode.i_private pointer will point to @data in the open() @@ -335,7 +333,8 @@ void eventfs_update_gid(struct dentry *dentry, kgid_t gid) static struct dentry *create_file(const char *name, umode_t mode, struct eventfs_attr *attr, struct dentry *parent, void *data, - const struct file_operations *fop) + const struct file_operations *fop, + unsigned int ino) { struct tracefs_inode *ti; struct dentry *dentry; @@ -363,9 +362,7 @@ static struct dentry *create_file(const char *name, umode_t mode, inode->i_op = &eventfs_file_inode_operations; inode->i_fop = fop; inode->i_private = data; - - /* All files will have the same inode number */ - inode->i_ino = EVENTFS_FILE_INODE_INO; + inode->i_ino = ino; ti = get_tracefs(inode); ti->flags |= TRACEFS_EVENT_INODE; @@ -377,12 +374,14 @@ static struct dentry *create_file(const char *name, umode_t mode, /** * create_dir - create a dir in the tracefs filesystem * @ei: the eventfs_inode that represents the directory to create - * @parent: parent dentry for this file. + * @parent: parent dentry for this directory. + * @nr_files: The number of files (not directories) this directory has * * This function will create a dentry for a directory represented by * a eventfs_inode. */ -static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent) +static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent, + int nr_files) { struct tracefs_inode *ti; struct dentry *dentry; @@ -404,7 +403,7 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent inode->i_fop = &eventfs_file_operations; /* All directories will have the same inode number */ - inode->i_ino = eventfs_dir_ino(ei); + inode->i_ino = eventfs_dir_ino(ei, nr_files); ti = get_tracefs(inode); ti->flags |= TRACEFS_EVENT_INODE; @@ -504,7 +503,7 @@ create_file_dentry(struct eventfs_inode *ei, int idx, mutex_unlock(&eventfs_mutex); - dentry = create_file(name, mode, attr, parent, data, fops); + dentry = create_file(name, mode, attr, parent, data, fops, ei->ino + idx + 1); mutex_lock(&eventfs_mutex); @@ -598,7 +597,7 @@ create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei, } mutex_unlock(&eventfs_mutex); - dentry = create_dir(ei, parent); + dentry = create_dir(ei, parent, ei->nr_entries); mutex_lock(&eventfs_mutex); @@ -786,7 +785,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) if (r <= 0) continue; - ino = EVENTFS_FILE_INODE_INO; + ino = ei->ino + i + 1; if (!dir_emit(ctx, name, strlen(name), ino, DT_REG)) goto out; @@ -810,7 +809,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) name = ei_child->name; - ino = eventfs_dir_ino(ei_child); + ino = eventfs_dir_ino(ei_child, ei_child->nr_entries); if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR)) goto out_dec; diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index e1b172c0e091..2187be6d7b23 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -223,13 +223,41 @@ static const struct inode_operations tracefs_file_inode_operations = { .setattr = tracefs_setattr, }; +/* Copied from get_next_ino() but adds allocation for multiple inodes */ +#define LAST_INO_BATCH 1024 +#define LAST_INO_MASK (~(LAST_INO_BATCH - 1)) +static DEFINE_PER_CPU(unsigned int, last_ino); + +unsigned int tracefs_get_next_ino(int files) +{ + unsigned int *p = &get_cpu_var(last_ino); + unsigned int res = *p; + +#ifdef CONFIG_SMP + /* Check if adding files+1 overflows */ + if (unlikely(!res || (res & LAST_INO_MASK) != ((res + files + 1) & LAST_INO_MASK))) { + static atomic_t shared_last_ino; + int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino); + + res = next - LAST_INO_BATCH; + } +#endif + + res++; + /* get_next_ino should not provide a 0 inode number */ + if (unlikely(!res)) + res++; + *p = res + files; + put_cpu_var(last_ino); + return res; +} + struct inode *tracefs_get_inode(struct super_block *sb) { struct inode *inode = new_inode(sb); - if (inode) { - inode->i_ino = get_next_ino(); + if (inode) simple_inode_init_ts(inode); - } + return inode; } @@ -644,6 +672,8 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode, inode->i_private = data; inode->i_uid = d_inode(dentry->d_parent)->i_uid; inode->i_gid = d_inode(dentry->d_parent)->i_gid; + inode->i_ino = tracefs_get_next_ino(0); + d_instantiate(dentry, inode); fsnotify_create(d_inode(dentry->d_parent), dentry); return tracefs_end_creating(dentry); @@ -669,6 +699,7 @@ static struct dentry *__create_dir(const char *name, struct dentry *parent, inode->i_fop = &simple_dir_operations; inode->i_uid = d_inode(dentry->d_parent)->i_uid; inode->i_gid = d_inode(dentry->d_parent)->i_gid; + inode->i_ino = tracefs_get_next_ino(0); ti = get_tracefs(inode); ti->private = instance_inode(parent, inode); diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h index 45397df9bb65..7dd6678229d0 100644 --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -75,6 +75,7 @@ static inline struct tracefs_inode *get_tracefs(const struct inode *inode) return container_of(inode, struct tracefs_inode, vfs_inode); } +unsigned int tracefs_get_next_ino(int files); struct dentry *tracefs_start_creating(const char *name, struct dentry *parent); struct dentry *tracefs_end_creating(struct dentry *dentry); struct dentry *tracefs_failed_creating(struct dentry *dentry);