Message ID | 20240130190355.11486-5-torvalds@linux-foundation.org (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [1/6] tracefs: avoid using the ei->dentry pointer unnecessarily | expand |
On Tue, 30 Jan 2024 11:03:54 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > -static void free_ei(struct eventfs_inode *ei) > +static inline struct eventfs_inode *alloc_ei(const char *name) > { > - kfree_const(ei->name); > - kfree(ei->d_children); > - kfree(ei->entry_attrs); > - kfree(ei); > + int namesize = strlen(name) + 1; > + struct eventfs_inode *ei = kzalloc(sizeof(*ei) + namesize, GFP_KERNEL); > + > + if (ei) { > + memcpy((char *)ei->name, name, namesize); > + kref_init(&ei->kref); > + } I'm going to be putting back the ei->name pointer as the above actually adds more memory usage. The reason being is that all static trace events (not kprobes or other dynamic events) are created with the TRACE_EVENT() macro that points to a constant name. Passing in a const to kstrdup_const() checks if the name passed in is a constant or not. If it is, it doesn't allocate the name and just passes back the pointer. Removing the pointer to "name" removed 8 bytes on 64 bit machines from the eventfs_inode structure, but added strlen(name)+1 bytes to it, where the majority of trace event names are greater than 8 bytes, and are constant strings. Another reason for converting back to the way it was is I have moving the ei allocation to a slab on my TODO list. -- Steve > + return ei; > } > > diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h > index 8f38740bfb5b..72db3bdc4dfb 100644 > --- a/fs/tracefs/internal.h > +++ b/fs/tracefs/internal.h > @@ -34,8 +34,7 @@ struct eventfs_attr { > * @entries: the array of entries representing the files in the directory > * @name: the name of the directory to create > * @children: link list into the child eventfs_inode > - * @dentry: the dentry of the directory > - * @d_children: The array of dentries to represent the files when created > + * @events_dir: the dentry of the events directory > * @entry_attrs: Saved mode and ownership of the @d_children > * @attr: Saved mode and ownership of eventfs_inode itself > * @data: The private data to pass to the callbacks > @@ -44,12 +43,11 @@ struct eventfs_attr { > * @nr_entries: The number of items in @entries > */ > struct eventfs_inode { > + struct kref kref; > struct list_head list; > const struct eventfs_entry *entries; > - const char *name; > struct list_head children; > - struct dentry *dentry; /* Check is_freed to access */ > - struct dentry **d_children; > + struct dentry *events_dir; > struct eventfs_attr *entry_attrs; > struct eventfs_attr attr; > void *data; > @@ -66,6 +64,7 @@ struct eventfs_inode { > struct llist_node llist; > struct rcu_head rcu; > }; > + const char name[]; > }; > > static inline struct tracefs_inode *get_tracefs(const struct inode *inode)
On Tue, 30 Jan 2024 at 12:55, Steven Rostedt <rostedt@goodmis.org> wrote: > > I'm going to be putting back the ei->name pointer as the above actually > adds more memory usage. I did it mainly because I hate having multiple different allocation sites that then have to do that kref_init() etc individually, and once there was a single site the "name" thing really looked lik ean obvious simplification. That said, I think you're confused about the memory usage. Sure, 'kstrdup_const()' optimizes away the allocation for static constant strings, but what it does *not* do is to optimize away the pointer. In contrast, allocating them together gets rid of the pointer itself, because now the name is just an offset in the structure. And the pointer is 8 bytes right there. So allocating the string _with_ the ei will always save at least 8 bytes. So whenever the string is less than that in length it's *always* a win. And even when it's not an absolute win, it will take advantage of the kmalloc() quantized sizes too, and generally not be a loss even with longer names. So I'm pretty sure you are simply entirely wrong on the memory usage. Not counting the size of the pointer is overlooking a big piece of the puzzle. Btw, you can look at name lengths in tracefs with something stupid like this: find . | sed 's:^.*/::' | tr -c '\n' . | sort | uniq -c and you will see that (at least in my case) names of lengths 1..7 are dominating it all: 1 . 2189 .. 34 ... 2229 .... 207 ..... 6833 ...... 2211 ....... with the rest being insignificant in comparison. The reason? There's a *lot* of those 'filter' and 'enable' and 'id' files. All of which are better off without a 'const char *name' taking 8 bytes. Linus
On Tue, 30 Jan 2024 13:52:05 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, 30 Jan 2024 at 12:55, Steven Rostedt <rostedt@goodmis.org> wrote: > > > > I'm going to be putting back the ei->name pointer as the above actually > > adds more memory usage. > > I did it mainly because I hate having multiple different allocation > sites that then have to do that kref_init() etc individually, and once > there was a single site the "name" thing really looked lik ean obvious > simplification. > > That said, I think you're confused about the memory usage. > > Sure, 'kstrdup_const()' optimizes away the allocation for static > constant strings, but what it does *not* do is to optimize away the > pointer. > > In contrast, allocating them together gets rid of the pointer itself, > because now the name is just an offset in the structure. > > And the pointer is 8 bytes right there. > > So allocating the string _with_ the ei will always save at least 8 bytes. Not if the string is not allocated, and ei->name is just pointing to an existing string. Which is what kstrdup_const() does if the string is a constant. If the length of the string was one byte, so you allocate sizeof(*ei) + 2 (for the '\0' as well), is kzalloc() going to allocate anything less than 8 byte alignment? According to /proc/slabinfo, any allocation will be 8 bytes. Thus it is the same as having the pointer if the string is constant but less than 8 bytes in length. But a waste if it is more. > > So whenever the string is less than that in length it's *always* a win. > > And even when it's not an absolute win, it will take advantage of the > kmalloc() quantized sizes too, and generally not be a loss even with > longer names. > > So I'm pretty sure you are simply entirely wrong on the memory usage. > Not counting the size of the pointer is overlooking a big piece of the > puzzle. > > Btw, you can look at name lengths in tracefs with something stupid like this: > > find . | sed 's:^.*/::' | tr -c '\n' . | sort | uniq -c > > and you will see that (at least in my case) names of lengths 1..7 are > dominating it all: > > 1 . > 2189 .. > 34 ... > 2229 .... > 207 ..... > 6833 ...... > 2211 ....... > > with the rest being insignificant in comparison. > > The reason? There's a *lot* of those 'filter' and 'enable' and 'id' > files. All of which are better off without a 'const char *name' taking > 8 bytes. Remember, the files don't have an ei allocated. Only directories. # find . -type d | sed 's:^.*/::' | tr -c '\n' . | sort | uniq -c 1 . 2 .. 30 ... 14 .... 21 ..... 15 ...... 18 ....... Above is 8 bytes (length of 7 + '\0') 27 ........ 36 ......... 35 .......... 49 ........... 63 ............ 85 ............. 140 .............. 180 ............... 180 ................ 155 ................. 171 .................. 158 ................... 134 .................... 116 ..................... 106 ...................... 81 ....................... 61 ........................ 58 ......................... 43 .......................... 41 ........................... 23 ............................ 14 ............................. 13 .............................. 9 ............................... 7 ................................ 6 ................................. 5 .................................. 1 ................................... 2 .................................... 1 ..................................... 1 ........................................ The rest is all greater than the pointer size. Also, it does prevent me from switching to a slab. I added the below patch (after changing this back to kstrdup_const), that determines if the string is allocated by checking if ei->name is the same as the pointer passed in. This showed me: # cat /sys/kernel/tracing/dyn_ftrace_total_info 53698 pages:220 groups: 8 const cnt:2099 len:22446 dyn cnt: 3 len:56 That is, 2099 ei->name's point to a constant string without allocation. Granted, I built my system without many modules. If you have more code built as modules, that number may be less. I then counted the length of the string - 7 bytes (which is the same as the length of the string plus the '\0' character - 8 bytes) to accommodate the pointer size, and it's a savings of 22KB per instance. And in actuality, I should have rounded to the nearest kmalloc slab size as kzalloc() isn't going to return 35 bytes back but 64 bytes will be allocated. I think that's a win. -- Steve diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index e9819d719d2a..ed9ffaac9a32 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -785,6 +785,25 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) goto out; } +int sdr_ei_const; +int sdr_ei_len; +int sdr_ei_dyn; +int sdr_ei_dyn_len; + +static void sdr_count(struct eventfs_inode *ei, const char *name) +{ + int len = strlen(name); + + if (ei->name == name) { + sdr_ei_const++; + if (len > 7) + sdr_ei_len += len - 7; + } else { + sdr_ei_dyn++; + sdr_ei_dyn_len += len; + } +} + /** * eventfs_create_dir - Create the eventfs_inode for this directory * @name: The name of the directory to create. @@ -838,6 +857,7 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode kfree(ei); return ERR_PTR(-ENOMEM); } + sdr_count(ei, name); if (size) { ei->d_children = kcalloc(size, sizeof(*ei->d_children), GFP_KERNEL); @@ -920,6 +940,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry ei->name = kstrdup_const(name, GFP_KERNEL); if (!ei->name) goto fail; + sdr_count(ei, name); /* Save the ownership of this directory */ uid = d_inode(dentry->d_parent)->i_uid; diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 2a7c6fd934e9..3b254ec4d831 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -8701,6 +8701,11 @@ static const struct file_operations tracing_stats_fops = { #ifdef CONFIG_DYNAMIC_FTRACE +extern int sdr_ei_const; +extern int sdr_ei_len; +extern int sdr_ei_dyn; +extern int sdr_ei_dyn_len; + static ssize_t tracing_read_dyn_info(struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos) @@ -8714,10 +8719,11 @@ tracing_read_dyn_info(struct file *filp, char __user *ubuf, if (!buf) return -ENOMEM; - r = scnprintf(buf, 256, "%ld pages:%ld groups: %ld\n", + r = scnprintf(buf, 256, "%ld pages:%ld groups: %ld const cnt:%d len:%d dyn cnt: %d len:%d\n", ftrace_update_tot_cnt, ftrace_number_of_pages, - ftrace_number_of_groups); + ftrace_number_of_groups, + sdr_ei_const, sdr_ei_len, sdr_ei_dyn, sdr_ei_dyn_len); ret = simple_read_from_buffer(ubuf, cnt, ppos, buf, r); kfree(buf);
On Tue, 30 Jan 2024 at 13:52, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Btw, you can look at name lengths in tracefs with something stupid like this: > > find . | sed 's:^.*/::' | tr -c '\n' . | sort | uniq -c Actually, since only directories have these 'ei' fields, that find should have "-type d", and then the stats change. Most directory filenames there are indeed longer than 8 bytes (16 bytes seems to be the most common length). That means that whether it wins or loses in allocation size tends to be mostly about the kmalloc sizes and the size of that structure. And it turns out that the real memory savings there would be this patch --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -55,15 +55,6 @@ struct eventfs_inode { unsigned int is_events:1; unsigned int nr_entries:30; unsigned int ino; - /* - * Union - used for deletion - * @llist: for calling dput() if needed after RCU - * @rcu: eventfs_inode to delete in RCU - */ - union { - struct llist_node llist; - struct rcu_head rcu; - }; const char name[]; }; since with all the other cleanups, neither of those fields are actually used. With that, the base size of 'struct eventfs_inode' actually becomes 96 bytes for me. And at least on x86-64, the kmalloc sizes are 96 and then 128 bytes. But that means that - with the added name pointer, the eventfs_inode would grow to 104 bytes, and grow to that next allocation size (128) - with the embedded name, the size is 96+strlen+1, and will also need at least a 128 byte allocation, but any name that is shorter than 32 bytes is free so it ends up being a wash. You're going to allocate 128 bytes regardless. But the deallocation is simpler. Linus
On Tue, 30 Jan 2024 at 14:55, Steven Rostedt <rostedt@goodmis.org> wrote: > > Remember, the files don't have an ei allocated. Only directories. Crossed emails. > I then counted the length of the string - 7 bytes (which is the same as the > length of the string plus the '\0' character - 8 bytes) to accommodate the > pointer size, and it's a savings of 22KB per instance. And in actuality, I > should have rounded to the nearest kmalloc slab size as kzalloc() isn't going to > return 35 bytes back but 64 bytes will be allocated. No. See my other email. The kmalloc sizes actually means that it comes out exactly even. Linus
On Tue, 30 Jan 2024 14:58:26 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, 30 Jan 2024 at 14:55, Steven Rostedt <rostedt@goodmis.org> wrote: > > > > Remember, the files don't have an ei allocated. Only directories. > > Crossed emails. > > > I then counted the length of the string - 7 bytes (which is the same as the > > length of the string plus the '\0' character - 8 bytes) to accommodate the > > pointer size, and it's a savings of 22KB per instance. And in actuality, I > > should have rounded to the nearest kmalloc slab size as kzalloc() isn't going to > > return 35 bytes back but 64 bytes will be allocated. > > No. See my other email. The kmalloc sizes actually means that it comes > out exactly even. > But that wouldn't be the case if I switched over to allocating as its own slab, as it would make it better condensed. But, I can keep it your way until I do that, as the biggest savings I needed was getting rid of all the file meta-data as that was what took up 10s of megabytes. -- Steve
On Tue, 30 Jan 2024 at 14:56, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > With that, the base size of 'struct eventfs_inode' actually becomes 96 > bytes for me. It can be shrunk some more. The field ordering is suboptimal. Pointers are 8 bytes and 8-byte aligned, but 'struct kref' is just 4 bytes, and 'struct eventfs_attr' is 12 bytes and 4-byte aligned. So if you pack all the 8-byte-aligned fields at the beginning, and the 4-byte-aligned ones at the end, you get 88 bytes. At which point a name pointer would *just* fit in 96 bytes. ... and then some debug option is enabled, and it all goes to hell again. Linus
On Tue, 30 Jan 2024 15:06:13 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, 30 Jan 2024 at 14:56, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > With that, the base size of 'struct eventfs_inode' actually becomes 96 > > bytes for me. > > It can be shrunk some more. > > The field ordering is suboptimal. Pointers are 8 bytes and 8-byte > aligned, but 'struct kref' is just 4 bytes, and 'struct eventfs_attr' > is 12 bytes and 4-byte aligned. > > So if you pack all the 8-byte-aligned fields at the beginning, and the > 4-byte-aligned ones at the end, you get 88 bytes. > > At which point a name pointer would *just* fit in 96 bytes. Does that mean I should keep the kstrdup_const()? > > ... and then some debug option is enabled, and it all goes to hell again. Heh, I'm not too concerned about debug options. As anyone concerned about memory size should have already have audited their .config to turn off anything they don't need. -- Steve
On Tue, 30 Jan 2024 at 15:10, Steven Rostedt <rostedt@goodmis.org> wrote: > > > > At which point a name pointer would *just* fit in 96 bytes. > > Does that mean I should keep the kstrdup_const()? You should check that my math matches reality and relevant configurations, but yes, at 96 bytes that should fit exactly in one slab entry on both x86-64 and arm64 (now that arm finally does sane kmalloc sizes - for the longest time they were all 128-byte aligned due to historical horrible DMA coherency issues). But you might also add a comment to the eventfs_inode definition, because if that ever changes, the math changes again. For example, adding just one more attribute data would make it all fall apart. If you *really* want to optimize that data structure, I think you can make the child list be a 'hlist'. That makes the list head ('children' becomes a 'hlist_head') smaller, even if the list entry ('list' becomes a 'hlist_node') stays the same size. Linus
On Tue, Jan 30, 2024 at 11:03:54AM -0800, Linus Torvalds wrote: > inode = tracefs_get_inode(dentry->d_sb); > if (unlikely(!inode)) > - return eventfs_failed_creating(dentry); > + return ERR_PTR(-ENOMEM); That belongs in the lookup crapectomy patch - bisect hazard from stray dput(). Same for similar chunks below.
On Tue, 30 Jan 2024 11:03:54 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > +/* > + * eventfs_inode reference count management. > + * > + * NOTE! We count only references from dentries, in the > + * form 'dentry->d_fsdata'. There are also references from > + * directory inodes ('ti->private'), but the dentry reference > + * count is always a superset of the inode reference count. > + */ > +static void release_ei(struct kref *ref) > +{ > + struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref); > + kfree(ei->entry_attrs); I did modify this to add back the kfree_const(ei->name); I would also like to add a: WARN_ON_ONCE(!ei->is_freed); The kref_put() logic should add the protection that this is guaranteed to be true as the logic in the removal does: ei->is_freed = 1; [..] kref_put(ei); I would think that the last "put" would always have the "is_freed" set. The WARN_ON is to catch things doing too many put_ei(). > + kfree(ei); > +} > + > @@ -951,5 +857,6 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei) > * sticks around while the other ei->dentry are created > * and destroyed dynamically. > */ > + simple_recursive_removal(dentry, NULL); Actually, this doesn't work. It expects all the dentries in the list to have positive ref counts, as it does dput() on them. If there's any cached, it will bug with a dput() on a dentry of zero. The only dentries that should have non zero ref counts are ones that are currently being accessed and wont go away until finished. I think all we need to do is invalidate the top dentry. Is that true? Does this work: d_invalidate(dentry); to make the directory no longer accessible. And all dentries within it should be zero, and hopefully go away on memory reclaim. The last patch removes it, but if you want to test the deletion, you can do this: # cd /sys/kernel/tracing # mkdir instances/foo # ls instances/foo/events # rmdir instances/foo But also note that this patch fails with the "ghost" files with the kprobe test again. When I apply patch 6, that goes away. I'm guessing that's because this needs the d_revalidate() call. To not break bisection, I think we need to merge this and patch 6 together. Also patch 6 removes the simple_recursive_removal() which without at least the d_invalidate() causes a dput splat. That's because the rmdir calls tracefs_remove() which calls simple_recursive_removal() that will walk to the events directory and I'm assuming if it hasn't been invalidated, it walks into it causing the same issue as a simple_recursive_removal() would have here. Try the above mkdir and rmdir to see. -- Steve > dput(dentry); > }
On Tue, 30 Jan 2024 at 21:09, Steven Rostedt <rostedt@goodmis.org> wrote: > > I would think that the last "put" would always have the "is_freed" set. The > WARN_ON is to catch things doing too many put_ei(). Yes, that looks sane to me. > > + simple_recursive_removal(dentry, NULL); > > Actually, this doesn't work. Yes, note how the next patch just removed it entirely. > Does this work: > > d_invalidate(dentry); It does, but it's basically irrelevant with the d_revalidate approach. Basically, once you have d_revalidate(), the unhashing happens there, and it's just extra work and pointless to do it elsewhere. So if you look at the "clean up dentry ops and add revalidate function" patch, you'll see that it just does - simple_recursive_removal(dentry, NULL); and the thing is just history. So really, that final patch is the one that fixes the whole eventfs mess for good (knock wood). But you can't do it first, because it basically depends on all the refcount fixes. It might be possible to re-organize the patches so that the refcount changes go first, then the d_revalidate(), and then the rest. But I suspect they all really end up depending on each other some way, because the basic issue was that the whole "keep unrefcounted dentry pointers around" was just wrong. So it doesn't end up right until it's _all_ fixed, because every step of the way exposes some problem. At least that was my experience. Fix one thing, and it exposes the hack that another thing depended on. This is actually something that Al is a master at. You sometimes see him send one big complicated patch where he talks about all the problems in some area and it's one huge "fix up everything patch" that looks very scary. And then a week later he sends a series of 19 patches that all make sense and all look "obvious" and all make small progress. And magically they end up matching that big cleanup patch in the end. And you just *know* that it didn't start out as that beautiful logical series, because you saw the big messy patch first... Linus
On Tue, 30 Jan 2024 21:25:30 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Does this work: > > > > d_invalidate(dentry); > > It does, but it's basically irrelevant with the d_revalidate approach. > > Basically, once you have d_revalidate(), the unhashing happens there, > and it's just extra work and pointless to do it elsewhere. > > So if you look at the "clean up dentry ops and add revalidate > function" patch, you'll see that it just does > > - simple_recursive_removal(dentry, NULL); > > and the thing is just history. With even the last patch included, without the d_invalidate() I get errors with simply doing: # cd /sys/kernel/tracing # mkdir instances/foo # ls instances/foo/events # rmdir instances/foo As the rmdir calls tracefs_remove() that calls simple_recursive_removal() that then walks into the "events" directory. Without that d_invalidate, it walks beyond just the top directory and then splats on the dentries that are cached. > > So really, that final patch is the one that fixes the whole eventfs > mess for good (knock wood). But you can't do it first, because it > basically depends on all the refcount fixes. I'm running my full suite with the final patch included, plus some of the updates I mentioned in replies to other patches, as well as including this "d_invalidate()" as it doesn't pass without it. > > It might be possible to re-organize the patches so that the refcount > changes go first, then the d_revalidate(), and then the rest. But I > suspect they all really end up depending on each other some way, > because the basic issue was that the whole "keep unrefcounted dentry > pointers around" was just wrong. So it doesn't end up right until > it's _all_ fixed, because every step of the way exposes some problem. > > At least that was my experience. Fix one thing, and it exposes the > hack that another thing depended on. > > This is actually something that Al is a master at. You sometimes see > him send one big complicated patch where he talks about all the > problems in some area and it's one huge "fix up everything patch" that > looks very scary. > > And then a week later he sends a series of 19 patches that all make > sense and all look "obvious" and all make small progress. > > And magically they end up matching that big cleanup patch in the end. > And you just *know* that it didn't start out as that beautiful logical > series, because you saw the big messy patch first... I'll take a look at breaking the patches up further, as I now have a much better understanding of dentries then I did before this discussion. -- Steve
On Tue, 30 Jan 2024 at 21:33, Steven Rostedt <rostedt@goodmis.org> wrote: > > With even the last patch included, without the d_invalidate() I get errors > with simply doing: > > # cd /sys/kernel/tracing > # mkdir instances/foo > # ls instances/foo/events > # rmdir instances/foo > > As the rmdir calls tracefs_remove() that calls simple_recursive_removal() > that then walks into the "events" directory. Without that d_invalidate, it > walks beyond just the top directory and then splats on the dentries that > are cached. Ugh. This is only an issue for that "events" dir, right? The one that has a proper refcount on the dentry in 'ei->events_dir'? Because yes, then doing d_invalidate() looks like the right thing. Linus
On Tue, 30 Jan 2024 at 21:57, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Ugh. Oh, and double-ugh on that tracefs_syscall_mkdir/rmdir(). I hate how it does that "unlock and re-lock inode" thing. It's a disease, and no, it's not an acceptable response to "lockdep shows there's a locking problem". The comment says "It is up to the individual mkdir routine to handle races" but that's *completely* bogus and shows a fundamental misunderstanding of locking. Dropping the lock in the middle of a locked section does NOT affect just the section that you dropped the lock around. It affects the *caller*, who took the lock in the first place, and who may well assume that the lock protects things for the whole duration of the lock. And so dropping the lock in the middle of an operation is a bad BAD *BAD* thing to do. Honestly, I had only been looking at the eventfs_inode.c lookup code. But now that I started looking at mkdir/rmdir, I'm seeing the same signs of horrible mistakes there too. And yes, it might be a real problem. For example, for the rmdir case, the actual lock was taken by 'vfs_rmdir()', and it does *not* protect only the ->rmdir() call itself. It also, for example, is supposed to make the ->rmdir be atomic wrt things like dentry->d_inode->i_flags |= S_DEAD; and dont_mount(dentry); detach_mounts(dentry); but now because the inode lock was dropped in the middle, for all we know a "mkdir()" could come in on the same name, and make a mockery of the whole inode locking. The inode lock on that directory that is getting removed is also what is supposed to make it impossible to add new files to the directory while the rmdir is going on. Again, dropping the lock violates those kinds of guarantees. "git blame" actually fingers Al for that "unlock and relock", but that's because Al did a search-and-replace on "mutex_[un]lock(&inode->i_mutex)" and replaced it with "inode_[un]lock(inode)" back in 2016. The real culprit is you, and that sh*t-show goes back to commit 277ba04461c2 ("tracing: Add interface to allow multiple trace buffers"). Christ. Now I guess I need to start looking if there is anything else horrid lurking in that mkdir/rmdir path. I doubt the inode locking problem ends up mattering in this situation. Mostly since this is only tracefs, and normal users shouldn't be able to access these things anyway. And I think (hope?) that you only accept mkdir/rmdir in specific places. But this really is another case of "This code smells *fundamentally* wrong". Adding Al, in the hopes that he will take a look at this too. Linus
On Tue, 30 Jan 2024 22:20:24 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, 30 Jan 2024 at 21:57, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Ugh. > > Oh, and double-ugh on that tracefs_syscall_mkdir/rmdir(). I hate how > it does that "unlock and re-lock inode" thing. I'd figured you'd like that one. > > It's a disease, and no, it's not an acceptable response to "lockdep > shows there's a locking problem". > > The comment says "It is up to the individual mkdir routine to handle > races" but that's *completely* bogus and shows a fundamental > misunderstanding of locking. > > Dropping the lock in the middle of a locked section does NOT affect > just the section that you dropped the lock around. > > It affects the *caller*, who took the lock in the first place, and who > may well assume that the lock protects things for the whole duration > of the lock. > > And so dropping the lock in the middle of an operation is a bad BAD > *BAD* thing to do. > > Honestly, I had only been looking at the eventfs_inode.c lookup code. > But now that I started looking at mkdir/rmdir, I'm seeing the same > signs of horrible mistakes there too. > > And yes, it might be a real problem. For example, for the rmdir case, > the actual lock was taken by 'vfs_rmdir()', and it does *not* protect > only the ->rmdir() call itself. > > It also, for example, is supposed to make the ->rmdir be atomic wrt things like > > dentry->d_inode->i_flags |= S_DEAD; > > and > > dont_mount(dentry); > detach_mounts(dentry); > > but now because the inode lock was dropped in the middle, for all we > know a "mkdir()" could come in on the same name, and make a mockery of > the whole inode locking. > > The inode lock on that directory that is getting removed is also what > is supposed to make it impossible to add new files to the directory > while the rmdir is going on. > > Again, dropping the lock violates those kinds of guarantees. > > "git blame" actually fingers Al for that "unlock and relock", but > that's because Al did a search-and-replace on > "mutex_[un]lock(&inode->i_mutex)" and replaced it with > "inode_[un]lock(inode)" back in 2016. > > The real culprit is you, and that sh*t-show goes back to commit > 277ba04461c2 ("tracing: Add interface to allow multiple trace > buffers"). > > Christ. Now I guess I need to start looking if there is anything else > horrid lurking in that mkdir/rmdir path. > > I doubt the inode locking problem ends up mattering in this situation. > Mostly since this is only tracefs, and normal users shouldn't be able > to access these things anyway. And I think (hope?) that you only > accept mkdir/rmdir in specific places. Yes, this was very deliberate. It was for a very "special" feature, and thus very limited. > > But this really is another case of "This code smells *fundamentally* wrong". > > Adding Al, in the hopes that he will take a look at this too. This is something I asked Al about when I wrote it. This isn't a normal rmdir. I remember telling Al what this was doing. Basically, it's just a way to tell the tracing infrastructure to create a new instance. It doesn't actually create a normal directory. It's similar to the kprobe_events interface, where writing to a file would create directories and files. Ideally I wanted a mkdir interface as it felt more realistic, and I was ready to have another "echo foo > make_instance" if this didn't work. But after talking with Al, I felt that it could work. The main issue is that mkdir doesn't just make a directory, it creates the entire tree (like what is done in /sys/fs/cgroup). If this was more like kernfs instead of debugfs, I may not have had this problem. That was another time I tried to understand how krenfs worked. This is the reason all the opens in the tracing code has: struct trace_array *tr = inode->i_private; int ret; ret = tracing_check_open_get_tr(tr); if (ret) return ret; In kernel/trace/trace.c: LIST_HEAD(ftrace_trace_arrays); int trace_array_get(struct trace_array *this_tr) { struct trace_array *tr; int ret = -ENODEV; mutex_lock(&trace_types_lock); list_for_each_entry(tr, &ftrace_trace_arrays, list) { if (tr == this_tr) { tr->ref++; ret = 0; break; } } mutex_unlock(&trace_types_lock); return ret; } static void __trace_array_put(struct trace_array *this_tr) { WARN_ON(!this_tr->ref); this_tr->ref--; } void trace_array_put(struct trace_array *this_tr) { if (!this_tr) return; mutex_lock(&trace_types_lock); __trace_array_put(this_tr); mutex_unlock(&trace_types_lock); } int tracing_check_open_get_tr(struct trace_array *tr) { int ret; ret = security_locked_down(LOCKDOWN_TRACEFS); if (ret) return ret; if (tracing_disabled) return -ENODEV; if (tr && trace_array_get(tr) < 0) return -ENODEV; return 0; } The rmdir code that tracefs calls is also in kernel/trace/trace.c: static int instance_rmdir(const char *name) { struct trace_array *tr; int ret; mutex_lock(&event_mutex); mutex_lock(&trace_types_lock); ret = -ENODEV; tr = trace_array_find(name); if (tr) ret = __remove_instance(tr); mutex_unlock(&trace_types_lock); mutex_unlock(&event_mutex); return ret; } The mkdir creates a new trace_array (tr) and rmdir destroys it. If there's any reference on a trace array the rmdir fails. On every open, a check is made to see if the trace array that was added to the inode still exists, and if it does, it ups its ref count to prevent a rmdir from happening. It's a very limited mkdir and rmdir. I remember Al asking me questions about what it can and can't do, and me telling him to make sure the lock release wasn't going to cause problems. I wrote several fuzzers on this code that perform mkdir and rmdir on the instances while other tasks are trying to access them (reading and writing). In the beginning, it found a few bugs. But I was finally able to get my fuzzers to work non-stop for over a month and that's when I felt that this is fine. I was always weary of this code because of the locking situation. The kernel selftests has a stress test on this code too. tools/testing/selftests/ftrace/test.d/instances/instance-event.tc tools/testing/selftests/ftrace/test.d/instances/instance.tc Which is run as part of the selftest suite, which is run by most people testing the kernel, including KernelCI. I haven't had a problem with this code in years, and unless we find a real bug that needs to be fixed, I'm afraid to touch it. -- Steve
On Wed, 31 Jan 2024 07:57:40 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > static int instance_rmdir(const char *name) > { > struct trace_array *tr; > int ret; > > mutex_lock(&event_mutex); Note, event_mutex prevents dynamic events from being created. No kprobe can be added while the event_mutex is held (not to be confused with eventfs_mutex). > mutex_lock(&trace_types_lock); > > ret = -ENODEV; > tr = trace_array_find(name); > if (tr) > ret = __remove_instance(tr); > > mutex_unlock(&trace_types_lock); > mutex_unlock(&event_mutex); > > return ret; > } And static int instance_mkdir(const char *name) { struct trace_array *tr; int ret; mutex_lock(&event_mutex); mutex_lock(&trace_types_lock); ret = -EEXIST; if (trace_array_find(name)) goto out_unlock; tr = trace_array_create(name); ret = PTR_ERR_OR_ZERO(tr); out_unlock: mutex_unlock(&trace_types_lock); mutex_unlock(&event_mutex); return ret; } The above functions would have been basically what would have been called if I had created: echo foo >> make_instance echo '!foo' >> remove_instance In fact, IIRC, I did do the above first, and then moved that code into mkdir/rmdir. Such that the tracefs mkdir and rmdir were just helpers and not real "mkdir" and "rmdir" routines. This isn't in git history because it was done only on my local repository. If you also notice, tracefs only allows mkdir/rmdir to be assigned to one directory. Once it is assigned, no other directories can have mkdir rmdir functionality. /sys/kernel/tracing/instances is the *only* directory that can have mkdir rmdir on it. __init struct dentry *tracefs_create_instance_dir(const char *name, struct dentry *parent, int (*mkdir)(const char *name), int (*rmdir)(const char *name)) { struct dentry *dentry; /* Only allow one instance of the instances directory. */ if (WARN_ON(tracefs_ops.mkdir || tracefs_ops.rmdir)) return NULL; And IIRC, Al told me that if I released the locks, it's up to the above functions to prevent the races from occurring. That is, no file can be created or remove when the mkdir and rmdir are running. Which the event_mutex prevents. Basically, I freeze external file creation and deletion within tracefs/eventfs when the above mkdir/rmdir is being executed, and prevent rmdir if any file within it is open. -- Steve
On Wed, 31 Jan 2024 at 05:14, Steven Rostedt <rostedt@goodmis.org> wrote: > > If you also notice, tracefs only allows mkdir/rmdir to be assigned to > one directory. Once it is assigned, no other directories can have mkdir > rmdir functionality. I think that does limit the damage, but it's not clear that it is actually safe. Because you don't hold the inode lock, somebody could come in and do a mkdir inside the other one that is being removed, ie something like - thread 1 does took the inode lock, called ->rmdir - it then drops the inode lock (both parent and the directory that is getting removed) and gets the event lock - but thread 2 can come in between that inode lock drop and event lock Notice: thread 2 comes in where there is *no* locking. Nada. Zilch. This is what worries me. But it does help that it's all only in *one* directory. At least another mkdir cannot happen *inside* the one that is going away while the locks are not held. So the good news is that it does mean that there's only one point that is being protected. But I do worry about things like this (in vfs_rmdir()): inode_lock(dentry->d_inode); error = -EBUSY; if (is_local_mountpoint(dentry) || (dentry->d_inode->i_flags & S_KERNEL_FILE)) goto out; error = security_inode_rmdir(dir, dentry); if (error) goto out; error = dir->i_op->rmdir(dir, dentry); if (error) goto out; notice how it does that "is this a mount point" test atomically wrt the rmdir before it is allowed to proceed. And I do think that the inode lock is what also protects it from concurrent mounts. So now what happens when that "thread 2" above comes in while there is *no* locking, and mounts something there? Now, I'm not saying this is a huge problem. But it's very much an example of a thing that *could* be a problem. Dropping locks in the middle is just very scary. Linus
On Wed, 31 Jan 2024 10:08:37 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, 31 Jan 2024 at 05:14, Steven Rostedt <rostedt@goodmis.org> wrote: > > > > If you also notice, tracefs only allows mkdir/rmdir to be assigned to > > one directory. Once it is assigned, no other directories can have mkdir > > rmdir functionality. > > I think that does limit the damage, but it's not clear that it is actually safe. > > Because you don't hold the inode lock, somebody could come in and do a > mkdir inside the other one that is being removed, ie something like > > - thread 1 does took the inode lock, called ->rmdir > > - it then drops the inode lock (both parent and the directory that is > getting removed) and gets the event lock > > - but thread 2 can come in between that inode lock drop and event lock > > Notice: thread 2 comes in where there is *no* locking. Nada. Zilch. > > This is what worries me. Yep, and that was my biggest concern too, which is why I have stress tests that try to hit that above scenario. Well, rmdir and other accesses including other mkdir's of the same name. As my knowledge on the inode life time is still limited, my main concern was just corruption of the dentry/inodes themselves. But the first one to get the event_mutex determines the state of the file system. If thread 1 is doing rmdir, what would thread 2 do that can harm it? The rmdir calls trace_remove() which is basically retrying to remove the directory again, and hopefully has the proper locking just like removing the kprobe trace event that deletes directories. It can have references on it. Now if something were to get a reference, and a valid dentry, as soon as the open function is called, the tracing logic will see that the trace_array no longer exists and returns an error. All the open functions for files that are created in an instance (and that includes eventfs files) have a check to see if the inode->i_private data is still valid. The trace_array structure represents the directory, and there's a link list of all the trace_array structures that is protected by the trace_types_lock. It grabs that lock, iterates the array to see if the passed in trace_array is on it, if it is, it ups the ref count (preventing a rmdir from succeeding) and returns it. If it is not, it returns NULL and the open call fails as if it opened a nonexistent file. > > But it does help that it's all only in *one* directory. At least > another mkdir cannot happen *inside* the one that is going away while > the locks are not held. So the good news is that it does mean that > there's only one point that is being protected. > > But I do worry about things like this (in vfs_rmdir()): > > inode_lock(dentry->d_inode); > > error = -EBUSY; > if (is_local_mountpoint(dentry) || > (dentry->d_inode->i_flags & S_KERNEL_FILE)) > goto out; > > error = security_inode_rmdir(dir, dentry); > if (error) > goto out; > > error = dir->i_op->rmdir(dir, dentry); > if (error) > goto out; > > notice how it does that "is this a mount point" test atomically wrt > the rmdir before it is allowed to proceed. You mean if someone did: # mkdir instances/foo # rmdir instances/foo and at the same time, someone else did # mount -t ext4 /dev/sda instances/foo ? OK, I never thought of that use case. Although, I think if someone is trying to mount anything in tracefs, they can keep the pieces. ;-) > > And I do think that the inode lock is what also protects it from > concurrent mounts. So now what happens when that "thread 2" above > comes in while there is *no* locking, and mounts something there? > > Now, I'm not saying this is a huge problem. But it's very much an > example of a thing that *could* be a problem. Dropping locks in the > middle is just very scary. No arguments from me. I really didn't like the dropping of the locks, and tried hard to avoid it. If switching over to kernfs can solve that, I'd let that conversion happen. I'm all for someone switching tracefs over to kernfs if it solves all these unknown bugs, as long as it doesn't hurt the memory savings of eventfs. But again, I'm also willing to make eventfs its own file system (although I don't have the time yet to do that) where tracefs isn't burdened by it. -- Steve
diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 1d0102bfd7da..a37db0dac302 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -62,6 +62,34 @@ enum { #define EVENTFS_MODE_MASK (EVENTFS_SAVE_MODE - 1) +/* + * eventfs_inode reference count management. + * + * NOTE! We count only references from dentries, in the + * form 'dentry->d_fsdata'. There are also references from + * directory inodes ('ti->private'), but the dentry reference + * count is always a superset of the inode reference count. + */ +static void release_ei(struct kref *ref) +{ + struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref); + kfree(ei->entry_attrs); + kfree(ei); +} + +static inline void put_ei(struct eventfs_inode *ei) +{ + if (ei) + kref_put(&ei->kref, release_ei); +} + +static inline struct eventfs_inode *get_ei(struct eventfs_inode *ei) +{ + if (ei) + kref_get(&ei->kref); + return ei; +} + static struct dentry *eventfs_root_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags); @@ -289,7 +317,8 @@ static void update_inode_attr(struct dentry *dentry, struct inode *inode, * directory. The inode.i_private pointer will point to @data in the open() * call. */ -static struct dentry *lookup_file(struct dentry *dentry, +static struct dentry *lookup_file(struct eventfs_inode *parent_ei, + struct dentry *dentry, umode_t mode, struct eventfs_attr *attr, void *data, @@ -302,11 +331,11 @@ static struct dentry *lookup_file(struct dentry *dentry, mode |= S_IFREG; if (WARN_ON_ONCE(!S_ISREG(mode))) - return NULL; + return ERR_PTR(-EIO); inode = tracefs_get_inode(dentry->d_sb); if (unlikely(!inode)) - return eventfs_failed_creating(dentry); + return ERR_PTR(-ENOMEM); /* If the user updated the directory's attributes, use them */ update_inode_attr(dentry, inode, attr, mode); @@ -322,9 +351,12 @@ static struct dentry *lookup_file(struct dentry *dentry, ti->flags = TRACEFS_EVENT_INODE; ti->private = NULL; // Directories have 'ei', files not + // Files have their parent's ei as their fsdata + dentry->d_fsdata = get_ei(parent_ei); + d_add(dentry, inode); fsnotify_create(dentry->d_parent->d_inode, dentry); - return eventfs_end_creating(dentry); + return NULL; }; /** @@ -343,7 +375,7 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry, inode = tracefs_get_inode(dentry->d_sb); if (unlikely(!inode)) - return eventfs_failed_creating(dentry); + return ERR_PTR(-ENOMEM); /* If the user updated the directory's attributes, use them */ update_inode_attr(dentry, inode, &ei->attr, @@ -359,24 +391,28 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry, ti->flags = TRACEFS_EVENT_INODE; ti->private = ei; - dentry->d_fsdata = ei; - ei->dentry = dentry; // Remove me! + dentry->d_fsdata = get_ei(ei); inc_nlink(inode); d_add(dentry, inode); inc_nlink(dentry->d_parent->d_inode); fsnotify_mkdir(dentry->d_parent->d_inode, dentry); - return eventfs_end_creating(dentry); + return NULL; } -static void free_ei(struct eventfs_inode *ei) +static inline struct eventfs_inode *alloc_ei(const char *name) { - kfree_const(ei->name); - kfree(ei->d_children); - kfree(ei->entry_attrs); - kfree(ei); + int namesize = strlen(name) + 1; + struct eventfs_inode *ei = kzalloc(sizeof(*ei) + namesize, GFP_KERNEL); + + if (ei) { + memcpy((char *)ei->name, name, namesize); + kref_init(&ei->kref); + } + return ei; } + /** * eventfs_set_ei_status_free - remove the dentry reference from an eventfs_inode * @ti: the tracefs_inode of the dentry @@ -387,38 +423,20 @@ static void free_ei(struct eventfs_inode *ei) void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry) { struct eventfs_inode *ei; - int i; mutex_lock(&eventfs_mutex); - ei = dentry->d_fsdata; - if (!ei) - goto out; - - /* This could belong to one of the files of the ei */ - if (ei->dentry != dentry) { - for (i = 0; i < ei->nr_entries; i++) { - if (ei->d_children[i] == dentry) - break; - } - if (WARN_ON_ONCE(i == ei->nr_entries)) - goto out; - ei->d_children[i] = NULL; - } else if (ei->is_freed) { - free_ei(ei); - } else { - ei->dentry = NULL; + if (ei) { + dentry->d_fsdata = NULL; + put_ei(ei); } - - dentry->d_fsdata = NULL; - out: mutex_unlock(&eventfs_mutex); } /** * lookup_file_dentry - create a dentry for a file of an eventfs_inode * @ei: the eventfs_inode that the file will be created under - * @idx: the index into the d_children[] of the @ei + * @idx: the index into the entry_attrs[] of the @ei * @parent: The parent dentry of the created file. * @name: The name of the file to create * @mode: The mode of the file. @@ -435,17 +453,11 @@ lookup_file_dentry(struct dentry *dentry, const struct file_operations *fops) { struct eventfs_attr *attr = NULL; - struct dentry **e_dentry = &ei->d_children[idx]; if (ei->entry_attrs) attr = &ei->entry_attrs[idx]; - dentry->d_fsdata = ei; // NOTE: ei of _parent_ - lookup_file(dentry, mode, attr, data, fops); - - *e_dentry = dentry; // Remove me - - return dentry; + return lookup_file(ei, dentry, mode, attr, data, fops); } /** @@ -466,6 +478,7 @@ static struct dentry *eventfs_root_lookup(struct inode *dir, struct tracefs_inode *ti; struct eventfs_inode *ei; const char *name = dentry->d_name.name; + struct dentry *result; ti = get_tracefs(dir); if (!(ti->flags & TRACEFS_EVENT_INODE)) @@ -482,7 +495,7 @@ static struct dentry *eventfs_root_lookup(struct inode *dir, continue; if (ei_child->is_freed) goto enoent; - lookup_dir_entry(dentry, ei, ei_child); + result = lookup_dir_entry(dentry, ei, ei_child); goto out; } @@ -499,17 +512,18 @@ static struct dentry *eventfs_root_lookup(struct inode *dir, if (entry->callback(name, &mode, &data, &fops) <= 0) goto enoent; - lookup_file_dentry(dentry, ei, i, mode, data, fops); + result = lookup_file_dentry(dentry, ei, i, mode, data, fops); goto out; } enoent: /* Nothing found? */ d_add(dentry, NULL); + result = NULL; out: mutex_unlock(&eventfs_mutex); - return NULL; + return result; } /* @@ -659,25 +673,10 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode if (!parent) return ERR_PTR(-EINVAL); - ei = kzalloc(sizeof(*ei), GFP_KERNEL); + ei = alloc_ei(name); if (!ei) return ERR_PTR(-ENOMEM); - ei->name = kstrdup_const(name, GFP_KERNEL); - if (!ei->name) { - kfree(ei); - return ERR_PTR(-ENOMEM); - } - - if (size) { - ei->d_children = kcalloc(size, sizeof(*ei->d_children), GFP_KERNEL); - if (!ei->d_children) { - kfree_const(ei->name); - kfree(ei); - return ERR_PTR(-ENOMEM); - } - } - ei->entries = entries; ei->nr_entries = size; ei->data = data; @@ -691,7 +690,7 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode /* Was the parent freed? */ if (list_empty(&ei->list)) { - free_ei(ei); + put_ei(ei); ei = NULL; } return ei; @@ -726,28 +725,20 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry if (IS_ERR(dentry)) return ERR_CAST(dentry); - ei = kzalloc(sizeof(*ei), GFP_KERNEL); + ei = alloc_ei(name); if (!ei) - goto fail_ei; + goto fail; inode = tracefs_get_inode(dentry->d_sb); if (unlikely(!inode)) goto fail; - if (size) { - ei->d_children = kcalloc(size, sizeof(*ei->d_children), GFP_KERNEL); - if (!ei->d_children) - goto fail; - } - - ei->dentry = dentry; + // Note: we have a ref to the dentry from tracefs_start_creating() + ei->events_dir = dentry; ei->entries = entries; ei->nr_entries = size; ei->is_events = 1; ei->data = data; - ei->name = kstrdup_const(name, GFP_KERNEL); - if (!ei->name) - goto fail; /* Save the ownership of this directory */ uid = d_inode(dentry->d_parent)->i_uid; @@ -778,7 +769,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry inode->i_op = &eventfs_root_dir_inode_operations; inode->i_fop = &eventfs_file_operations; - dentry->d_fsdata = ei; + dentry->d_fsdata = get_ei(ei); /* directory inodes start off with i_nlink == 2 (for "." entry) */ inc_nlink(inode); @@ -790,72 +781,11 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry return ei; fail: - kfree(ei->d_children); - kfree(ei); - fail_ei: + put_ei(ei); tracefs_failed_creating(dentry); return ERR_PTR(-ENOMEM); } -static LLIST_HEAD(free_list); - -static void eventfs_workfn(struct work_struct *work) -{ - struct eventfs_inode *ei, *tmp; - struct llist_node *llnode; - - llnode = llist_del_all(&free_list); - llist_for_each_entry_safe(ei, tmp, llnode, llist) { - /* This dput() matches the dget() from unhook_dentry() */ - for (int i = 0; i < ei->nr_entries; i++) { - if (ei->d_children[i]) - dput(ei->d_children[i]); - } - /* This should only get here if it had a dentry */ - if (!WARN_ON_ONCE(!ei->dentry)) - dput(ei->dentry); - } -} - -static DECLARE_WORK(eventfs_work, eventfs_workfn); - -static void free_rcu_ei(struct rcu_head *head) -{ - struct eventfs_inode *ei = container_of(head, struct eventfs_inode, rcu); - - if (ei->dentry) { - /* Do not free the ei until all references of dentry are gone */ - if (llist_add(&ei->llist, &free_list)) - queue_work(system_unbound_wq, &eventfs_work); - return; - } - - /* If the ei doesn't have a dentry, neither should its children */ - for (int i = 0; i < ei->nr_entries; i++) { - WARN_ON_ONCE(ei->d_children[i]); - } - - free_ei(ei); -} - -static void unhook_dentry(struct dentry *dentry) -{ - if (!dentry) - return; - /* - * Need to add a reference to the dentry that is expected by - * simple_recursive_removal(), which will include a dput(). - */ - dget(dentry); - - /* - * Also add a reference for the dput() in eventfs_workfn(). - * That is required as that dput() will free the ei after - * the SRCU grace period is over. - */ - dget(dentry); -} - /** * eventfs_remove_rec - remove eventfs dir or file from list * @ei: eventfs_inode to be removed. @@ -868,8 +798,6 @@ static void eventfs_remove_rec(struct eventfs_inode *ei, int level) { struct eventfs_inode *ei_child; - if (!ei) - return; /* * Check recursion depth. It should never be greater than 3: * 0 - events/ @@ -881,28 +809,12 @@ static void eventfs_remove_rec(struct eventfs_inode *ei, int level) return; /* search for nested folders or files */ - list_for_each_entry_srcu(ei_child, &ei->children, list, - lockdep_is_held(&eventfs_mutex)) { - /* Children only have dentry if parent does */ - WARN_ON_ONCE(ei_child->dentry && !ei->dentry); + list_for_each_entry(ei_child, &ei->children, list) eventfs_remove_rec(ei_child, level + 1); - } - ei->is_freed = 1; - - for (int i = 0; i < ei->nr_entries; i++) { - if (ei->d_children[i]) { - /* Children only have dentry if parent does */ - WARN_ON_ONCE(!ei->dentry); - unhook_dentry(ei->d_children[i]); - } - } - - unhook_dentry(ei->dentry); - - list_del_rcu(&ei->list); - call_srcu(&eventfs_srcu, &ei->rcu, free_rcu_ei); + list_del(&ei->list); + put_ei(ei); } /** @@ -913,22 +825,12 @@ static void eventfs_remove_rec(struct eventfs_inode *ei, int level) */ void eventfs_remove_dir(struct eventfs_inode *ei) { - struct dentry *dentry; - if (!ei) return; mutex_lock(&eventfs_mutex); - dentry = ei->dentry; eventfs_remove_rec(ei, 0); mutex_unlock(&eventfs_mutex); - - /* - * If any of the ei children has a dentry, then the ei itself - * must have a dentry. - */ - if (dentry) - simple_recursive_removal(dentry, NULL); } /** @@ -941,7 +843,11 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei) { struct dentry *dentry; - dentry = ei->dentry; + dentry = ei->events_dir; + if (!dentry) + return; + + ei->events_dir = NULL; eventfs_remove_dir(ei); /* @@ -951,5 +857,6 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei) * sticks around while the other ei->dentry are created * and destroyed dynamically. */ + simple_recursive_removal(dentry, NULL); dput(dentry); } diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h index 8f38740bfb5b..72db3bdc4dfb 100644 --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -34,8 +34,7 @@ struct eventfs_attr { * @entries: the array of entries representing the files in the directory * @name: the name of the directory to create * @children: link list into the child eventfs_inode - * @dentry: the dentry of the directory - * @d_children: The array of dentries to represent the files when created + * @events_dir: the dentry of the events directory * @entry_attrs: Saved mode and ownership of the @d_children * @attr: Saved mode and ownership of eventfs_inode itself * @data: The private data to pass to the callbacks @@ -44,12 +43,11 @@ struct eventfs_attr { * @nr_entries: The number of items in @entries */ struct eventfs_inode { + struct kref kref; struct list_head list; const struct eventfs_entry *entries; - const char *name; struct list_head children; - struct dentry *dentry; /* Check is_freed to access */ - struct dentry **d_children; + struct dentry *events_dir; struct eventfs_attr *entry_attrs; struct eventfs_attr attr; void *data; @@ -66,6 +64,7 @@ struct eventfs_inode { struct llist_node llist; struct rcu_head rcu; }; + const char name[]; }; static inline struct tracefs_inode *get_tracefs(const struct inode *inode)
The eventfs inode had pointers to dentries (and child dentries) without actually holding a refcount on said pointer. That is fundamentally broken, and while eventfs tried to then maintain coherence with dentries going away by hooking into the '.d_iput' callback, that doesn't actually work since it's not ordered wrt lookups. There were two reasonms why eventfs tried to keep a pointer to a dentry: - the creation of a 'events' directory would actually have a stable dentry pointer that it created with tracefs_start_creating(). And it needed that dentry when tearing it all down again in eventfs_remove_events_dir(). This use is actually ok, because the special top-level events directory dentries are actually stable, not just a temporary cache of the eventfs data structures. - the 'eventfs_inode' (aka ei) needs to stay around as long as there are dentries that refer to it. It then used these dentry pointers as a replacement for doing reference counting: it would try to make sure that there was only ever one dentry associated with an event_inode, and keep a child dentry array around to see which dentries might still refer to the parent ei. This gets rid of the invalid dentry pointer use, and renames the one valid case to a different name to make it clear that it's not just any random dentry. The magic child dentry array that is kind of a "reverse reference list" is simply replaced by having child dentries take a ref to the ei. As does the directory dentries. That makes the broken use case go away. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- fs/tracefs/event_inode.c | 245 ++++++++++++--------------------------- fs/tracefs/internal.h | 9 +- 2 files changed, 80 insertions(+), 174 deletions(-)