Message ID | 20240116114711.7e8637be@gandalf.local.home (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | eventfs: Create dentries and inodes at dir open | expand |
On Tue, 16 Jan 2024 09:55:15 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > [ html crud because I still don't have power or real Internet, just trying > to keep an eye on things on my phone. Mailing lists removed to avoid > bounces, please put them back in replies that don't have my horrible > formatting ] > > No. > > Christ, you're making things worse again > > The reason for the bug is that you're still messing with the dentries at > readdir() time. I may have deleted the comment, but the only reason I created the inodes/destries is to keep the consistent inode number as it's created at the time the inodes and dentries are. Ah I did delete the comment, but it is still applicable :-/ /* - * Need to create the dentries and inodes to have a consistent - * inode number. + * Need to make a struct eventfs_dent array, start by + * allocating enough for just the files, which is a fixed + * array. Then use realloc via add_entry() for the directories + * which is stored in a linked list. */ So if for some reason, user space did a getdents() and used the inode numbers to match what it found, they would likely be the same. > > Just stop it. Readdir should not be creating dentries. Readdir should not > be even *looking* at dentries. You're not a virtual filesystem, the > dentries are just caches for filename lookup, and have *nothing* to do with > readdir. Actually, it's not looking at them. I did it as a way to just have the inode numbers be consistent. dents[cnt].ino = d->d_inode->i_ino; Yes, that's the only reason I create them. The dentry/inode is not used for anything outside of that. > > So just iterate over your own internal data structures in readdir. DO NOT > CREATE DENTRIES OR INODES FOR READDIR. > > I've told you before, and I'll tell you again: either you are a real and > proper virtual filesystem and you let the vfs layer manage *everything*, > and the dentries and inodes are all you have. Or you are a *real* > filesystem and you maintain your own data structures and the dentries and > inodes are just the in-memory caches. > > This "do both" is UNACCEPTABLE. > The dentries were created for the inode numbers so that I did not need to add them to meta data. They are generated at creation time. I don't know how important inode numbers are. If they can all just have random inode numbers and it doesn't break user space, where an inode number will be one value at one read and another shortly after, is that going to cause a problem? Maybe I can just use a hash to generate he inode numbers from the name? Hopefully there will be no collisions. Then I don't need the dentry creation at all. I do realize that if the dentries get freed due to reclaim and recreated, their inode numbers will be different. But for a virtual file system, I don't know how important having consistent inode numbers is. -- Steve
On Tue, 16 Jan 2024 13:12:28 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > Maybe I can just use a hash to generate he inode numbers from the name? > Hopefully there will be no collisions. Then I don't need the dentry > creation at all. Maybe I could use a hash of the address of the meta data to create the inode number? That may be unique enough. -- Steve
On Tue, 16 Jan 2024 10:21:49 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > Here's a clue: just fix your inode numbers. > > I can think of many ways to do it. Here's a couple: > > - use a fixed inode number for all inodes. It's fine. Really. You might > confuse some programs that still do getpwd() the legacy way, but hey, > nobody cares > > - just put the inode number in the same data structure everything else is > > > - make the inode number be a hash of the address of your data structure. > That's actually the closest to a traditional "real" inode number, which is > just an index to some on-disk thing > > I'm sure there are other *trivial* solutions. > > None of this is an excuse to misuse sentries. > > Try the first one - one single inode number - first. You shouldn't be doing > iget() anyway, so why do you care so deeply about a number that makes no > sense and nobody should care about? It was me being paranoid that using the same inode number would break user space. If that is not a concern, then I'm happy to just make it either the same, or maybe just hash the ei and name that it is associated with. If I do not fully understand how something is used, I try hard to make it act the same as it does for other use cases. That is, I did all this to keep inodes unique and consistent because I did not know if it would break something if I didn't. Removing that requirement does make it much easier to implement readdir. I think I'll do the hashing, just because I'm still paranoid that something might still break if they are all the same. -- Steve
On Tue, 16 Jan 2024 10:53:36 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > Let's at least *try* to move towards a better and simpler world, in other > words. OK, but I did just finish the below. I'll save it for another time if the single inode becomes an issue. I cropped it to just 31 bits, so I'm not sure how much that would still expose kernel addresses. But as you prefer a single inode number, I'll submit that instead, and store this as one of my many git branches that I hoard. This patch would give me: ~# ls -i /sys/kernel/tracing/events 2024739183 9p 801480813 iocost 281269778 qdisc 401505785 alarmtimer 641794539 iomap 1206040657 ras 1930290274 avc 1239754069 iommu 275816503 raw_syscalls 1108006611 block 138722947 io_uring 1758073266 rcu 1528091656 bpf_test_run 694421747 ipi 1384703124 regmap 557364529 bpf_trace 2765888 irq 30507018 regulator 1351362737 cfg80211 369905250 irq_matrix 2078862821 resctrl 886445085 cgroup 1115008967 irq_vectors 324090967 rpm 796209754 clk 1448778784 jbd2 1141318882 rseq 478739129 compaction 99410253 kmem 1274783780 rtc 1714397712 cpuhp 2001779594 ksm 1409072807 sched 720701943 csd 51728677 kyber 347239934 scsi 1824588347 dev 1507974437 libata 1768671172 sd 1640041299 devfreq 1421146927 lock 1167562598 signal 121399632 dma_fence 956825721 mac80211 4116590 skb 975908383 drm 738868061 maple_tree 1435501164 smbus 1227060804 e1000e_trace 969175760 mce 1664441095 sock 1770307058 enable 1225375766 mdio 1634697993 spi 1372107864 error_report 744198394 migrate 556841757 swiotlb 1356665351 exceptions 602669807 mmap 400337480 syscalls [..] ~# ls -i /sys/kernel/tracing/events/sched 1906992193 enable 1210369853 sched_process_wait 592505447 filter 1443020725 sched_stat_blocked 1742488280 sched_kthread_stop 1213185672 sched_stat_iowait 1674576234 sched_kthread_stop_ret 1908510325 sched_stat_runtime 1999376743 sched_kthread_work_execute_end 1570203600 sched_stat_sleep 1041533096 sched_kthread_work_execute_start 608113040 sched_stat_wait 166824308 sched_kthread_work_queue_work 2033295833 sched_stick_numa 492462616 sched_migrate_task 1617500395 sched_swap_numa 1786868196 sched_move_numa 1865216679 sched_switch 691687388 sched_pi_setprio 1465729401 sched_wait_task 1610977146 sched_process_exec 969941227 sched_wake_idle_without_ipi 610128037 sched_process_exit 84398030 sched_wakeup 2139284699 sched_process_fork 750220489 sched_wakeup_new 169172804 sched_process_free 1024064201 sched_waking -- Steve diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index a0598eb3e26e..57448bf4acb4 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -19,6 +19,7 @@ #include <linux/namei.h> #include <linux/workqueue.h> #include <linux/security.h> +#include <linux/siphash.h> #include <linux/tracefs.h> #include <linux/kref.h> #include <linux/delay.h> @@ -32,6 +33,9 @@ */ static DEFINE_MUTEX(eventfs_mutex); +/* Used for making inode numbers */ +static siphash_key_t inode_key; + /* * The eventfs_inode (ei) itself is protected by SRCU. It is released from * its parent's list and will have is_freed set (under eventfs_mutex). @@ -57,6 +61,28 @@ static int eventfs_dir_open(struct inode *inode, struct file *file); static int eventfs_iterate(struct file *file, struct dir_context *ctx); static int eventfs_dir_release(struct inode *inode, struct file *file); +/* Copied from scripts/kconfig/symbol.c */ +static unsigned strhash(const char *s) +{ + /* fnv32 hash */ + unsigned hash = 2166136261U; + for (; *s; s++) + hash = (hash ^ *s) * 0x01000193; + return hash; +} + +/* Just try to make something consistent and unique */ +static int eventfs_inode_ino(struct eventfs_inode *ei, const char *name) +{ + unsigned long sip = (unsigned long)ei; + + sip += strhash(name); + sip = siphash_1u32((int)sip, &inode_key); + + /* keep it positive */ + return sip & ((1U << 31) - 1); +} + static void update_attr(struct eventfs_attr *attr, struct iattr *iattr) { unsigned int ia_valid = iattr->ia_valid; @@ -491,6 +517,8 @@ create_file_dentry(struct eventfs_inode *ei, int idx, mutex_unlock(&eventfs_mutex); dentry = create_file(name, mode, attr, parent, data, fops); + if (!IS_ERR_OR_NULL(dentry)) + dentry->d_inode->i_ino = eventfs_inode_ino(ei, name); mutex_lock(&eventfs_mutex); @@ -585,6 +613,8 @@ create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei, mutex_unlock(&eventfs_mutex); dentry = create_dir(ei, parent); + if (!IS_ERR_OR_NULL(dentry)) + dentry->d_inode->i_ino = eventfs_inode_ino(ei, ei->name); mutex_lock(&eventfs_mutex); @@ -721,19 +751,19 @@ struct eventfs_list { int count; }; -static int update_entry(struct eventfs_dents *dents, struct dentry *d, - int type, int cnt) +static int update_entry(struct eventfs_dents *dents, struct eventfs_inode *ei, + const char *name, int type, int cnt) { - dents[cnt].name = kstrdup_const(d->d_name.name, GFP_KERNEL); + dents[cnt].name = kstrdup_const(name, GFP_KERNEL); if (!dents[cnt].name) return -ENOMEM; - dents[cnt].ino = d->d_inode->i_ino; + dents[cnt].ino = eventfs_inode_ino(ei, name); dents[cnt].type = type; return 0; } -static int add_entry(struct eventfs_dents **edents, struct dentry *d, - int type, int cnt) +static int add_entry(struct eventfs_dents **edents, struct eventfs_inode *ei, + const char *name, int type, int cnt) { struct eventfs_dents *tmp; @@ -742,7 +772,7 @@ static int add_entry(struct eventfs_dents **edents, struct dentry *d, return -ENOMEM; *edents = tmp; - return update_entry(tmp, d, type, cnt); + return update_entry(tmp, ei, name, type, cnt); } /* @@ -758,8 +788,6 @@ static int eventfs_dir_open(struct inode *inode, struct file *file) struct tracefs_inode *ti; struct eventfs_inode *ei; struct eventfs_dents *dents; - struct dentry *ei_dentry = NULL; - struct dentry *dentry; const char *name; umode_t mode; void *data; @@ -780,11 +808,11 @@ static int eventfs_dir_open(struct inode *inode, struct file *file) mutex_lock(&eventfs_mutex); ei = READ_ONCE(ti->private); - if (ei && !ei->is_freed) - ei_dentry = READ_ONCE(ei->dentry); + if (ei && ei->is_freed) + ei = NULL; mutex_unlock(&eventfs_mutex); - if (!ei_dentry) { + if (!ei) { srcu_read_unlock(&eventfs_srcu, idx); return -ENOENT; } @@ -810,7 +838,6 @@ static int eventfs_dir_open(struct inode *inode, struct file *file) return -ENOMEM; } - inode_lock(ei_dentry->d_inode); for (i = 0; i < ei->nr_entries; i++) { void *cdata = data; entry = &ei->entries[i]; @@ -830,12 +857,7 @@ static int eventfs_dir_open(struct inode *inode, struct file *file) /* callbacks returning zero just means skip this file */ if (r == 0) continue; - dentry = create_file_dentry(ei, i, ei_dentry, name, mode, cdata, fops); - if (!dentry) - continue; - - ret = update_entry(dents, dentry, DT_REG, cnt); - dput(dentry); + ret = update_entry(dents, ei, name, DT_REG, cnt); if (ret < 0) goto fail; @@ -845,12 +867,7 @@ static int eventfs_dir_open(struct inode *inode, struct file *file) list_for_each_entry_srcu(ei_child, &ei->children, list, srcu_read_lock_held(&eventfs_srcu)) { - dentry = create_dir_dentry(ei, ei_child, ei_dentry); - if (!dentry) - continue; - - ret = add_entry(&dents, dentry, DT_DIR, cnt); - dput(dentry); + ret = add_entry(&dents, ei_child, ei_child->name, DT_DIR, cnt); if (ret < 0) goto fail; cnt++; @@ -859,12 +876,10 @@ static int eventfs_dir_open(struct inode *inode, struct file *file) edents->count = cnt; edents->dents = dents; - inode_unlock(ei_dentry->d_inode); srcu_read_unlock(&eventfs_srcu, idx); file->private_data = edents; return 0; fail: - inode_unlock(ei_dentry->d_inode); srcu_read_unlock(&eventfs_srcu, idx); for (; cnt >= 0; cnt--) kfree_const(dents[cnt].name); @@ -1029,6 +1044,9 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry if (IS_ERR(dentry)) return ERR_CAST(dentry); + if (siphash_key_is_zero(&inode_key)) + get_random_bytes(&inode_key, sizeof(inode_key)); + ei = kzalloc(sizeof(*ei), GFP_KERNEL); if (!ei) goto fail_ei;
diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index fdff53d5a1f8..50b37f31d835 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -53,7 +53,9 @@ enum { static struct dentry *eventfs_root_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags); +static int eventfs_dir_open(struct inode *inode, struct file *file); static int eventfs_iterate(struct file *file, struct dir_context *ctx); +static int eventfs_dir_release(struct inode *inode, struct file *file); static void update_attr(struct eventfs_attr *attr, struct iattr *iattr) { @@ -212,7 +214,9 @@ static const struct inode_operations eventfs_file_inode_operations = { static const struct file_operations eventfs_file_operations = { .read = generic_read_dir, + .open = eventfs_dir_open, .iterate_shared = eventfs_iterate, + .release = eventfs_dir_release, .llseek = generic_file_llseek, }; @@ -706,34 +710,71 @@ static struct dentry *eventfs_root_lookup(struct inode *dir, return ret; } +struct eventfs_dents { + const char *name; + int ino; + int type; +}; + +struct eventfs_list { + struct eventfs_dents *dents; + int count; +}; + +static int update_entry(struct eventfs_dents *dents, struct dentry *d, + int type, int cnt) +{ + dents[cnt].name = kstrdup_const(d->d_name.name, GFP_KERNEL); + if (!dents[cnt].name) + return -ENOMEM; + dents[cnt].ino = d->d_inode->i_ino; + dents[cnt].type = type; + return 0; +} + +static int add_entry(struct eventfs_dents **edents, struct dentry *d, + int type, int cnt) +{ + struct eventfs_dents *tmp; + + tmp = krealloc(*edents, sizeof(**edents) * (cnt + 1), GFP_NOFS); + if (!tmp) + return -ENOMEM; + *edents = tmp; + + return update_entry(tmp, d, type, cnt); +} + /* * Walk the children of a eventfs_inode to fill in getdents(). */ -static int eventfs_iterate(struct file *file, struct dir_context *ctx) +static int eventfs_dir_open(struct inode *inode, struct file *file) { const struct file_operations *fops; struct inode *f_inode = file_inode(file); const struct eventfs_entry *entry; + struct eventfs_list *edents; struct eventfs_inode *ei_child; struct tracefs_inode *ti; struct eventfs_inode *ei; + struct eventfs_dents *dents; struct dentry *ei_dentry = NULL; struct dentry *dentry; const char *name; umode_t mode; + void *data; + int cnt = 0; int idx; - int ret = -EINVAL; - int ino; - int i, r, c; - - if (!dir_emit_dots(file, ctx)) - return 0; + int ret; + int i; + int r; ti = get_tracefs(f_inode); if (!(ti->flags & TRACEFS_EVENT_INODE)) return -EINVAL; - c = ctx->pos - 2; + if (WARN_ON_ONCE(file->private_data)) + return -EINVAL; idx = srcu_read_lock(&eventfs_srcu); @@ -743,80 +784,136 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) ei_dentry = READ_ONCE(ei->dentry); mutex_unlock(&eventfs_mutex); - if (!ei || !ei_dentry) - goto out; + if (!ei_dentry) { + srcu_read_unlock(&eventfs_srcu, idx); + return -ENOENT; + } + + data = ei->data; + + edents = kmalloc(sizeof(*edents), GFP_KERNEL); + if (!edents) { + srcu_read_unlock(&eventfs_srcu, idx); + return -ENOMEM; + } /* - * Need to create the dentries and inodes to have a consistent - * inode number. + * Need to make a struct eventfs_dent array, start by + * allocating enough for just the files, which is a fixed + * array. Then use realloc via add_entry() for the directories + * which is stored in a linked list. */ - ret = 0; - - /* Start at 'c' to jump over already read entries */ - for (i = c; i < ei->nr_entries; i++, ctx->pos++) { - void *cdata = ei->data; + dents = kcalloc(ei->nr_entries, sizeof(*dents), GFP_KERNEL); + if (!dents) { + srcu_read_unlock(&eventfs_srcu, idx); + kfree(edents); + return -ENOMEM; + } + inode_lock(ei_dentry->d_inode); + for (i = 0; i < ei->nr_entries; i++) { + void *cdata = data; entry = &ei->entries[i]; name = entry->name; - mutex_lock(&eventfs_mutex); - /* If ei->is_freed then just bail here, nothing more to do */ - if (ei->is_freed) { - mutex_unlock(&eventfs_mutex); - goto out; - } - r = entry->callback(name, &mode, &cdata, &fops); + /* If ei->is_freed, then the event itself may be too */ + if (!ei->is_freed) + r = entry->callback(name, &mode, &cdata, &fops); + else + r = -1; mutex_unlock(&eventfs_mutex); - if (r <= 0) + /* If the ei is being freed, no need to continue */ + if (r < 0) { + ret = -ENOENT; + goto fail; + } + /* callbacks returning zero just means skip this file */ + if (r == 0) continue; - dentry = create_file_dentry(ei, i, ei_dentry, name, mode, cdata, fops); if (!dentry) - goto out; - ino = dentry->d_inode->i_ino; + continue; + + ret = update_entry(dents, dentry, DT_REG, cnt); dput(dentry); - if (!dir_emit(ctx, name, strlen(name), ino, DT_REG)) - goto out; - } + if (ret < 0) + goto fail; - /* Subtract the skipped entries above */ - c -= min((unsigned int)c, (unsigned int)ei->nr_entries); + cnt++; + } list_for_each_entry_srcu(ei_child, &ei->children, list, srcu_read_lock_held(&eventfs_srcu)) { - - if (c > 0) { - c--; + dentry = create_dir_dentry(ei, ei_child, ei_dentry); + if (!dentry) continue; - } - ctx->pos++; + ret = add_entry(&dents, dentry, DT_DIR, cnt); + dput(dentry); + if (ret < 0) + goto fail; + cnt++; + } - if (ei_child->is_freed) - continue; + edents->count = cnt; + edents->dents = dents; - name = ei_child->name; + inode_unlock(ei_dentry->d_inode); + srcu_read_unlock(&eventfs_srcu, idx); + file->private_data = edents; + return 0; + fail: + inode_unlock(ei_dentry->d_inode); + srcu_read_unlock(&eventfs_srcu, idx); + for (; cnt >= 0; cnt--) + kfree_const(dents[cnt].name); + kfree(dents); + kfree(edents); + return ret; +} - dentry = create_dir_dentry(ei, ei_child, ei_dentry); - if (!dentry) - goto out_dec; - ino = dentry->d_inode->i_ino; - dput(dentry); +static int eventfs_dir_release(struct inode *inode, struct file *file) +{ + struct eventfs_list *edents = file->private_data; + struct tracefs_inode *ti; + int i; + + ti = get_tracefs(inode); + if (!(ti->flags & TRACEFS_EVENT_INODE)) + return -EINVAL; - if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR)) - goto out_dec; + if (WARN_ON_ONCE(!edents)) + return -EINVAL; + + for (i = 0; i < edents->count; i++) { + kfree_const(edents->dents[i].name); } - ret = 1; - out: - srcu_read_unlock(&eventfs_srcu, idx); - return ret; + kfree(edents->dents); + kfree(edents); + return 0; +} - out_dec: - /* Incremented ctx->pos without adding something, reset it */ - ctx->pos--; - goto out; +static int eventfs_iterate(struct file *file, struct dir_context *ctx) +{ + struct eventfs_list *edents = file->private_data; + int i, c; + + if (!dir_emit_dots(file, ctx)) + return 0; + + c = ctx->pos - 2; + + /* Start at 'c' to jump over already read entries */ + for (i = c; i < edents->count; i++, ctx->pos++) { + + if (!dir_emit(ctx, edents->dents[i].name, + strlen(edents->dents[i].name), + edents->dents[i].ino, edents->dents[i].type)) + break; + } + return 0; } /**