diff mbox series

eventfs: Create dentries and inodes at dir open

Message ID 20240116114711.7e8637be@gandalf.local.home (mailing list archive)
State Superseded
Headers show
Series eventfs: Create dentries and inodes at dir open | expand

Commit Message

Steven Rostedt Jan. 16, 2024, 4:47 p.m. UTC
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

The original eventfs code added a wrapper around the dcache_readdir open
callback and created all the dentries and inodes at open, and increment
their ref count. A wrapper was added around the dcache_readdir release
function to decrement all the ref counts of those created inodes and
dentries. But this proved to be buggy[1] for when a kprobe was created
during a dir read, it would create a dentry between the open and the
release, and because the release would decrement all ref counts of all
files and directories, that would include the kprobe directory that was
not there to have its ref count incremented in open. This would cause the
ref count to go to negative and later crash the kernel.

To solve this, the dentries and inodes that were created and had their ref
count upped in open needed to be saved. That list needed to be passed from
the open to the release, so that the release would only decrement the ref
counts of the entries that were incremented in the open.

Unfortunately, the dcache_readdir logic was already using the
file->private_data, which is the only field that can be used to pass
information from the open to the release. What was done was the eventfs
created another descriptor that had a void pointer to save the
dcache_readdir pointer, and it wrapped all the callbacks, so that it could
save the list of entries that had their ref counts incremented in the
open, and pass it to the release. The wrapped callbacks would just put
back the dcache_readdir pointer and call the functions it used so it could
still use its data[2].

But Linus had an issue with the "hijacking" of the file->private_data
(unfortunately this discussion was on a security list, so no public link).
Which we finally agreed on doing everything within the iterate_shared
callback and leave the dcache_readdir out of it[3]. All the information
needed for the getents() could be created then.

But this ended up being buggy too[4]. The iterate_shared callback was not
the right place to create the dentries and inodes. Even Christian Brauner
had issues with that[5].

The real fix should be to go back to creating the inodes and dentries at
the open, create an array to store the information in the
file->private_data, and pass that information to the other callbacks.

The difference between this and the original method, is that it does not
use dcache_readdir. It also does not up the ref counts of the dentries and
pass them. Instead, it creates an array of a structure that saves the
dentry's name and inode number. That information is used in the
iterate_shared callback, and the array is freed in the dir release. The
dentries and inodes created in the open are not used for the iterate_share
or release callbacks. Just their names and inode numbers.

This means that the state of the eventfs at the dir open remains the same
from the point of view of the getdents() function, until the dir is closed.
This also means that the getdents() will not fail. If there's an issue, it
fails at the dir open.

[1] https://lore.kernel.org/linux-trace-kernel/20230919211804.230edf1e@gandalf.local.home/
[2] https://lore.kernel.org/linux-trace-kernel/20230922163446.1431d4fa@gandalf.local.home/
[3] https://lore.kernel.org/linux-trace-kernel/20240104015435.682218477@goodmis.org/
[4] https://lore.kernel.org/all/202401152142.bfc28861-oliver.sang@intel.com/
[5] https://lore.kernel.org/all/20240111-unzahl-gefegt-433acb8a841d@brauner/

Fixes: 493ec81a8fb8 ("eventfs: Stop using dcache_readdir() for getdents()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202401152142.bfc28861-oliver.sang@intel.com
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 fs/tracefs/event_inode.c | 209 ++++++++++++++++++++++++++++-----------
 1 file changed, 153 insertions(+), 56 deletions(-)

Comments

Steven Rostedt Jan. 16, 2024, 6:12 p.m. UTC | #1
On Tue, 16 Jan 2024 09:55:15 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> [ html crud because I still don't have power or real Internet, just trying
> to keep an eye on things on my phone. Mailing lists removed to avoid
> bounces, please put them back in replies that don't have my horrible
> formatting ]
> 
> No.
> 
> Christ, you're making things worse again
> 
> The reason for the bug is that you're still messing with the dentries at
> readdir() time.

I may have deleted the comment, but the only reason I created the
inodes/destries is to keep the consistent inode number as it's created at
the time the inodes and dentries are.

Ah I did delete the comment, but it is still applicable :-/

       /*
-        * Need to create the dentries and inodes to have a consistent
-        * inode number.
+        * Need to make a struct eventfs_dent array, start by
+        * allocating enough for just the files, which is a fixed
+        * array. Then use realloc via add_entry() for the directories
+        * which is stored in a linked list.
         */

So if for some reason, user space did a getdents() and used the inode
numbers to match what it found, they would likely be the same.

> 
> Just stop it. Readdir should not be creating dentries. Readdir should not
> be even *looking* at dentries. You're not a virtual filesystem, the
> dentries are just caches for filename lookup, and have *nothing* to do with
> readdir.

Actually, it's not looking at them. I did it as a way to just have the
inode numbers be consistent.

	dents[cnt].ino = d->d_inode->i_ino;

Yes, that's the only reason I create them. The dentry/inode is not used for
anything outside of that.

> 
> So just iterate over your own internal data structures in readdir. DO NOT
> CREATE DENTRIES OR INODES FOR READDIR.
> 
> I've told you before, and I'll tell you again: either you are a real and
> proper virtual filesystem and you let the vfs layer manage *everything*,
> and the dentries and inodes are all you have. Or you are a *real*
> filesystem and you maintain your own data structures and the dentries and
> inodes are just the in-memory caches.
> 
> This "do both" is UNACCEPTABLE.
> 

The dentries were created for the inode numbers so that I did not need to
add them to meta data. They are generated at creation time.

I don't know how important inode numbers are. If they can all just have
random inode numbers and it doesn't break user space, where an inode number
will be one value at one read and another shortly after, is that going to
cause a problem?

Maybe I can just use a hash to generate he inode numbers from the name?
Hopefully there will be no collisions. Then I don't need the dentry
creation at all.

I do realize that if the dentries get freed due to reclaim and recreated,
their inode numbers will be different. But for a virtual file system, I
don't know how important having consistent inode numbers is.

-- Steve
Steven Rostedt Jan. 16, 2024, 6:16 p.m. UTC | #2
On Tue, 16 Jan 2024 13:12:28 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> Maybe I can just use a hash to generate he inode numbers from the name?
> Hopefully there will be no collisions. Then I don't need the dentry
> creation at all.

Maybe I could use a hash of the address of the meta data to create the
inode number? That may be unique enough.

-- Steve
Steven Rostedt Jan. 16, 2024, 6:37 p.m. UTC | #3
On Tue, 16 Jan 2024 10:21:49 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Here's a clue: just fix your inode numbers.
> 
> I can think of many ways to do it. Here's a couple:
> 
>  - use a fixed inode number for all inodes. It's fine. Really. You might
> confuse some programs that still do getpwd() the legacy way, but hey,
> nobody cares
> 
>  - just put the inode number in the same data structure everything else is
>
> 
>  - make the inode number be a hash of the address of your data structure.
> That's actually the closest to a traditional "real" inode number, which is
> just an index to some on-disk thing
> 
> I'm sure there are other *trivial* solutions.
> 
> None of this is an excuse to misuse sentries.
> 
> Try the first one - one single inode number - first. You shouldn't be doing
> iget() anyway, so why do you care so deeply about a number that makes no
> sense and nobody should care about?

It was me being paranoid that using the same inode number would break user
space. If that is not a concern, then I'm happy to just make it either the
same, or maybe just hash the ei and name that it is associated with.

If I do not fully understand how something is used, I try hard to make it
act the same as it does for other use cases. That is, I did all this to
keep inodes unique and consistent because I did not know if it would break
something if I didn't.

Removing that requirement does make it much easier to implement readdir.

I think I'll do the hashing, just because I'm still paranoid that something
might still break if they are all the same.

-- Steve
Steven Rostedt Jan. 16, 2024, 7:42 p.m. UTC | #4
On Tue, 16 Jan 2024 10:53:36 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Let's at least *try* to move towards a better and simpler world, in other
> words.

OK, but I did just finish the below. I'll save it for another time if the
single inode becomes an issue.

I cropped it to just 31 bits, so I'm not sure how much that would still
expose kernel addresses.

But as you prefer a single inode number, I'll submit that instead, and
store this as one of my many git branches that I hoard.

This patch would give me:

~# ls -i /sys/kernel/tracing/events
2024739183 9p               801480813 iocost        281269778 qdisc
 401505785 alarmtimer       641794539 iomap        1206040657 ras
1930290274 avc             1239754069 iommu         275816503 raw_syscalls
1108006611 block            138722947 io_uring     1758073266 rcu
1528091656 bpf_test_run     694421747 ipi          1384703124 regmap
 557364529 bpf_trace          2765888 irq            30507018 regulator
1351362737 cfg80211         369905250 irq_matrix   2078862821 resctrl
 886445085 cgroup          1115008967 irq_vectors   324090967 rpm
 796209754 clk             1448778784 jbd2         1141318882 rseq
 478739129 compaction        99410253 kmem         1274783780 rtc
1714397712 cpuhp           2001779594 ksm          1409072807 sched
 720701943 csd               51728677 kyber         347239934 scsi
1824588347 dev             1507974437 libata       1768671172 sd
1640041299 devfreq         1421146927 lock         1167562598 signal
 121399632 dma_fence        956825721 mac80211        4116590 skb
 975908383 drm              738868061 maple_tree   1435501164 smbus
1227060804 e1000e_trace     969175760 mce          1664441095 sock
1770307058 enable          1225375766 mdio         1634697993 spi
1372107864 error_report     744198394 migrate       556841757 swiotlb
1356665351 exceptions       602669807 mmap          400337480 syscalls
[..]

~# ls -i /sys/kernel/tracing/events/sched
1906992193 enable                            1210369853 sched_process_wait
 592505447 filter                            1443020725 sched_stat_blocked
1742488280 sched_kthread_stop                1213185672 sched_stat_iowait
1674576234 sched_kthread_stop_ret            1908510325 sched_stat_runtime
1999376743 sched_kthread_work_execute_end    1570203600 sched_stat_sleep
1041533096 sched_kthread_work_execute_start   608113040 sched_stat_wait
 166824308 sched_kthread_work_queue_work     2033295833 sched_stick_numa
 492462616 sched_migrate_task                1617500395 sched_swap_numa
1786868196 sched_move_numa                   1865216679 sched_switch
 691687388 sched_pi_setprio                  1465729401 sched_wait_task
1610977146 sched_process_exec                 969941227 sched_wake_idle_without_ipi
 610128037 sched_process_exit                  84398030 sched_wakeup
2139284699 sched_process_fork                 750220489 sched_wakeup_new
 169172804 sched_process_free                1024064201 sched_waking

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index a0598eb3e26e..57448bf4acb4 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -19,6 +19,7 @@
 #include <linux/namei.h>
 #include <linux/workqueue.h>
 #include <linux/security.h>
+#include <linux/siphash.h>
 #include <linux/tracefs.h>
 #include <linux/kref.h>
 #include <linux/delay.h>
@@ -32,6 +33,9 @@
  */
 static DEFINE_MUTEX(eventfs_mutex);
 
+/* Used for making inode numbers */
+static siphash_key_t inode_key;
+
 /*
  * The eventfs_inode (ei) itself is protected by SRCU. It is released from
  * its parent's list and will have is_freed set (under eventfs_mutex).
@@ -57,6 +61,28 @@ static int eventfs_dir_open(struct inode *inode, struct file *file);
 static int eventfs_iterate(struct file *file, struct dir_context *ctx);
 static int eventfs_dir_release(struct inode *inode, struct file *file);
 
+/* Copied from scripts/kconfig/symbol.c */
+static unsigned strhash(const char *s)
+{
+	/* fnv32 hash */
+	unsigned hash = 2166136261U;
+	for (; *s; s++)
+		hash = (hash ^ *s) * 0x01000193;
+	return hash;
+}
+
+/* Just try to make something consistent and unique */
+static int eventfs_inode_ino(struct eventfs_inode *ei, const char *name)
+{
+	unsigned long sip = (unsigned long)ei;
+
+	sip += strhash(name);
+	sip = siphash_1u32((int)sip, &inode_key);
+
+	/* keep it positive */
+	return sip & ((1U << 31) - 1);
+}
+
 static void update_attr(struct eventfs_attr *attr, struct iattr *iattr)
 {
 	unsigned int ia_valid = iattr->ia_valid;
@@ -491,6 +517,8 @@ create_file_dentry(struct eventfs_inode *ei, int idx,
 	mutex_unlock(&eventfs_mutex);
 
 	dentry = create_file(name, mode, attr, parent, data, fops);
+	if (!IS_ERR_OR_NULL(dentry))
+		dentry->d_inode->i_ino = eventfs_inode_ino(ei, name);
 
 	mutex_lock(&eventfs_mutex);
 
@@ -585,6 +613,8 @@ create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei,
 	mutex_unlock(&eventfs_mutex);
 
 	dentry = create_dir(ei, parent);
+	if (!IS_ERR_OR_NULL(dentry))
+		dentry->d_inode->i_ino = eventfs_inode_ino(ei, ei->name);
 
 	mutex_lock(&eventfs_mutex);
 
@@ -721,19 +751,19 @@ struct eventfs_list {
 	int			count;
 };
 
-static int update_entry(struct eventfs_dents *dents, struct dentry *d,
-		     int type, int cnt)
+static int update_entry(struct eventfs_dents *dents, struct eventfs_inode *ei,
+			const char *name, int type, int cnt)
 {
-	dents[cnt].name = kstrdup_const(d->d_name.name, GFP_KERNEL);
+	dents[cnt].name = kstrdup_const(name, GFP_KERNEL);
 	if (!dents[cnt].name)
 		return -ENOMEM;
-	dents[cnt].ino = d->d_inode->i_ino;
+	dents[cnt].ino = eventfs_inode_ino(ei, name);
 	dents[cnt].type = type;
 	return 0;
 }
 
-static int add_entry(struct eventfs_dents **edents, struct dentry *d,
-		     int type, int cnt)
+static int add_entry(struct eventfs_dents **edents, struct eventfs_inode *ei,
+		     const char *name, int type, int cnt)
 {
 	struct eventfs_dents *tmp;
 
@@ -742,7 +772,7 @@ static int add_entry(struct eventfs_dents **edents, struct dentry *d,
 		return -ENOMEM;
 	*edents = tmp;
 
-	return update_entry(tmp, d, type, cnt);
+	return update_entry(tmp, ei, name, type, cnt);
 }
 
 /*
@@ -758,8 +788,6 @@ static int eventfs_dir_open(struct inode *inode, struct file *file)
 	struct tracefs_inode *ti;
 	struct eventfs_inode *ei;
 	struct eventfs_dents *dents;
-	struct dentry *ei_dentry = NULL;
-	struct dentry *dentry;
 	const char *name;
 	umode_t mode;
 	void *data;
@@ -780,11 +808,11 @@ static int eventfs_dir_open(struct inode *inode, struct file *file)
 
 	mutex_lock(&eventfs_mutex);
 	ei = READ_ONCE(ti->private);
-	if (ei && !ei->is_freed)
-		ei_dentry = READ_ONCE(ei->dentry);
+	if (ei && ei->is_freed)
+		ei = NULL;
 	mutex_unlock(&eventfs_mutex);
 
-	if (!ei_dentry) {
+	if (!ei) {
 		srcu_read_unlock(&eventfs_srcu, idx);
 		return -ENOENT;
 	}
@@ -810,7 +838,6 @@ static int eventfs_dir_open(struct inode *inode, struct file *file)
 		return -ENOMEM;
 	}
 
-	inode_lock(ei_dentry->d_inode);
 	for (i = 0; i < ei->nr_entries; i++) {
 		void *cdata = data;
 		entry = &ei->entries[i];
@@ -830,12 +857,7 @@ static int eventfs_dir_open(struct inode *inode, struct file *file)
 		/* callbacks returning zero just means skip this file */
 		if (r == 0)
 			continue;
-		dentry = create_file_dentry(ei, i, ei_dentry, name, mode, cdata, fops);
-		if (!dentry)
-			continue;
-
-		ret = update_entry(dents, dentry, DT_REG, cnt);
-		dput(dentry);
+		ret = update_entry(dents, ei, name, DT_REG, cnt);
 
 		if (ret < 0)
 		    goto fail;
@@ -845,12 +867,7 @@ static int eventfs_dir_open(struct inode *inode, struct file *file)
 
 	list_for_each_entry_srcu(ei_child, &ei->children, list,
 				 srcu_read_lock_held(&eventfs_srcu)) {
-		dentry = create_dir_dentry(ei, ei_child, ei_dentry);
-		if (!dentry)
-			continue;
-
-		ret = add_entry(&dents, dentry, DT_DIR, cnt);
-		dput(dentry);
+		ret = add_entry(&dents, ei_child, ei_child->name, DT_DIR, cnt);
 		if (ret < 0)
 			goto fail;
 		cnt++;
@@ -859,12 +876,10 @@ static int eventfs_dir_open(struct inode *inode, struct file *file)
 	edents->count = cnt;
 	edents->dents = dents;
 
-	inode_unlock(ei_dentry->d_inode);
 	srcu_read_unlock(&eventfs_srcu, idx);
 	file->private_data = edents;
 	return 0;
  fail:
-	inode_unlock(ei_dentry->d_inode);
 	srcu_read_unlock(&eventfs_srcu, idx);
 	for (; cnt >= 0; cnt--)
 		kfree_const(dents[cnt].name);
@@ -1029,6 +1044,9 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 	if (IS_ERR(dentry))
 		return ERR_CAST(dentry);
 
+	if (siphash_key_is_zero(&inode_key))
+		get_random_bytes(&inode_key, sizeof(inode_key));
+
 	ei = kzalloc(sizeof(*ei), GFP_KERNEL);
 	if (!ei)
 		goto fail_ei;
diff mbox series

Patch

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index fdff53d5a1f8..50b37f31d835 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -53,7 +53,9 @@  enum {
 static struct dentry *eventfs_root_lookup(struct inode *dir,
 					  struct dentry *dentry,
 					  unsigned int flags);
+static int eventfs_dir_open(struct inode *inode, struct file *file);
 static int eventfs_iterate(struct file *file, struct dir_context *ctx);
+static int eventfs_dir_release(struct inode *inode, struct file *file);
 
 static void update_attr(struct eventfs_attr *attr, struct iattr *iattr)
 {
@@ -212,7 +214,9 @@  static const struct inode_operations eventfs_file_inode_operations = {
 
 static const struct file_operations eventfs_file_operations = {
 	.read		= generic_read_dir,
+	.open		= eventfs_dir_open,
 	.iterate_shared	= eventfs_iterate,
+	.release	= eventfs_dir_release,
 	.llseek		= generic_file_llseek,
 };
 
@@ -706,34 +710,71 @@  static struct dentry *eventfs_root_lookup(struct inode *dir,
 	return ret;
 }
 
+struct eventfs_dents {
+	const char		*name;
+	int			ino;
+	int			type;
+};
+
+struct eventfs_list {
+	struct eventfs_dents	*dents;
+	int			count;
+};
+
+static int update_entry(struct eventfs_dents *dents, struct dentry *d,
+		     int type, int cnt)
+{
+	dents[cnt].name = kstrdup_const(d->d_name.name, GFP_KERNEL);
+	if (!dents[cnt].name)
+		return -ENOMEM;
+	dents[cnt].ino = d->d_inode->i_ino;
+	dents[cnt].type = type;
+	return 0;
+}
+
+static int add_entry(struct eventfs_dents **edents, struct dentry *d,
+		     int type, int cnt)
+{
+	struct eventfs_dents *tmp;
+
+	tmp = krealloc(*edents, sizeof(**edents) * (cnt + 1), GFP_NOFS);
+	if (!tmp)
+		return -ENOMEM;
+	*edents = tmp;
+
+	return update_entry(tmp, d, type, cnt);
+}
+
 /*
  * Walk the children of a eventfs_inode to fill in getdents().
  */
-static int eventfs_iterate(struct file *file, struct dir_context *ctx)
+static int eventfs_dir_open(struct inode *inode, struct file *file)
 {
 	const struct file_operations *fops;
 	struct inode *f_inode = file_inode(file);
 	const struct eventfs_entry *entry;
+	struct eventfs_list *edents;
 	struct eventfs_inode *ei_child;
 	struct tracefs_inode *ti;
 	struct eventfs_inode *ei;
+	struct eventfs_dents *dents;
 	struct dentry *ei_dentry = NULL;
 	struct dentry *dentry;
 	const char *name;
 	umode_t mode;
+	void *data;
+	int cnt = 0;
 	int idx;
-	int ret = -EINVAL;
-	int ino;
-	int i, r, c;
-
-	if (!dir_emit_dots(file, ctx))
-		return 0;
+	int ret;
+	int i;
+	int r;
 
 	ti = get_tracefs(f_inode);
 	if (!(ti->flags & TRACEFS_EVENT_INODE))
 		return -EINVAL;
 
-	c = ctx->pos - 2;
+	if (WARN_ON_ONCE(file->private_data))
+		return -EINVAL;
 
 	idx = srcu_read_lock(&eventfs_srcu);
 
@@ -743,80 +784,136 @@  static int eventfs_iterate(struct file *file, struct dir_context *ctx)
 		ei_dentry = READ_ONCE(ei->dentry);
 	mutex_unlock(&eventfs_mutex);
 
-	if (!ei || !ei_dentry)
-		goto out;
+	if (!ei_dentry) {
+		srcu_read_unlock(&eventfs_srcu, idx);
+		return -ENOENT;
+	}
+
+	data = ei->data;
+
+	edents = kmalloc(sizeof(*edents), GFP_KERNEL);
+	if (!edents) {
+		srcu_read_unlock(&eventfs_srcu, idx);
+		return -ENOMEM;
+	}
 
 	/*
-	 * Need to create the dentries and inodes to have a consistent
-	 * inode number.
+	 * Need to make a struct eventfs_dent array, start by
+	 * allocating enough for just the files, which is a fixed
+	 * array. Then use realloc via add_entry() for the directories
+	 * which is stored in a linked list.
 	 */
-	ret = 0;
-
-	/* Start at 'c' to jump over already read entries */
-	for (i = c; i < ei->nr_entries; i++, ctx->pos++) {
-		void *cdata = ei->data;
+	dents = kcalloc(ei->nr_entries, sizeof(*dents), GFP_KERNEL);
+	if (!dents) {
+		srcu_read_unlock(&eventfs_srcu, idx);
+		kfree(edents);
+		return -ENOMEM;
+	}
 
+	inode_lock(ei_dentry->d_inode);
+	for (i = 0; i < ei->nr_entries; i++) {
+		void *cdata = data;
 		entry = &ei->entries[i];
 		name = entry->name;
-
 		mutex_lock(&eventfs_mutex);
-		/* If ei->is_freed then just bail here, nothing more to do */
-		if (ei->is_freed) {
-			mutex_unlock(&eventfs_mutex);
-			goto out;
-		}
-		r = entry->callback(name, &mode, &cdata, &fops);
+		/* If ei->is_freed, then the event itself may be too */
+		if (!ei->is_freed)
+			r = entry->callback(name, &mode, &cdata, &fops);
+		else
+			r = -1;
 		mutex_unlock(&eventfs_mutex);
-		if (r <= 0)
+		/* If the ei is being freed, no need to continue */
+		if (r < 0) {
+			ret = -ENOENT;
+			goto fail;
+		}
+		/* callbacks returning zero just means skip this file */
+		if (r == 0)
 			continue;
-
 		dentry = create_file_dentry(ei, i, ei_dentry, name, mode, cdata, fops);
 		if (!dentry)
-			goto out;
-		ino = dentry->d_inode->i_ino;
+			continue;
+
+		ret = update_entry(dents, dentry, DT_REG, cnt);
 		dput(dentry);
 
-		if (!dir_emit(ctx, name, strlen(name), ino, DT_REG))
-			goto out;
-	}
+		if (ret < 0)
+		    goto fail;
 
-	/* Subtract the skipped entries above */
-	c -= min((unsigned int)c, (unsigned int)ei->nr_entries);
+		cnt++;
+	}
 
 	list_for_each_entry_srcu(ei_child, &ei->children, list,
 				 srcu_read_lock_held(&eventfs_srcu)) {
-
-		if (c > 0) {
-			c--;
+		dentry = create_dir_dentry(ei, ei_child, ei_dentry);
+		if (!dentry)
 			continue;
-		}
 
-		ctx->pos++;
+		ret = add_entry(&dents, dentry, DT_DIR, cnt);
+		dput(dentry);
+		if (ret < 0)
+			goto fail;
+		cnt++;
+	}
 
-		if (ei_child->is_freed)
-			continue;
+	edents->count = cnt;
+	edents->dents = dents;
 
-		name = ei_child->name;
+	inode_unlock(ei_dentry->d_inode);
+	srcu_read_unlock(&eventfs_srcu, idx);
+	file->private_data = edents;
+	return 0;
+ fail:
+	inode_unlock(ei_dentry->d_inode);
+	srcu_read_unlock(&eventfs_srcu, idx);
+	for (; cnt >= 0; cnt--)
+		kfree_const(dents[cnt].name);
+	kfree(dents);
+	kfree(edents);
+	return ret;
+}
 
-		dentry = create_dir_dentry(ei, ei_child, ei_dentry);
-		if (!dentry)
-			goto out_dec;
-		ino = dentry->d_inode->i_ino;
-		dput(dentry);
+static int eventfs_dir_release(struct inode *inode, struct file *file)
+{
+	struct eventfs_list *edents = file->private_data;
+	struct tracefs_inode *ti;
+	int i;
+
+	ti = get_tracefs(inode);
+	if (!(ti->flags & TRACEFS_EVENT_INODE))
+		return -EINVAL;
 
-		if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR))
-			goto out_dec;
+	if (WARN_ON_ONCE(!edents))
+		return -EINVAL;
+
+	for (i = 0; i < edents->count; i++) {
+		kfree_const(edents->dents[i].name);
 	}
-	ret = 1;
- out:
-	srcu_read_unlock(&eventfs_srcu, idx);
 
-	return ret;
+	kfree(edents->dents);
+	kfree(edents);
+	return 0;
+}
 
- out_dec:
-	/* Incremented ctx->pos without adding something, reset it */
-	ctx->pos--;
-	goto out;
+static int eventfs_iterate(struct file *file, struct dir_context *ctx)
+{
+	struct eventfs_list *edents = file->private_data;
+	int i, c;
+
+	if (!dir_emit_dots(file, ctx))
+		return 0;
+
+	c = ctx->pos - 2;
+
+	/* Start at 'c' to jump over already read entries */
+	for (i = c; i < edents->count; i++, ctx->pos++) {
+
+		if (!dir_emit(ctx, edents->dents[i].name,
+			      strlen(edents->dents[i].name),
+			      edents->dents[i].ino, edents->dents[i].type))
+			break;
+	}
+	return 0;
 }
 
 /**