mbox series

[v4,00/10] tracing: introducing eventfs

Message ID 1689248004-8158-1-git-send-email-akaher@vmware.com (mailing list archive)
Headers show
Series tracing: introducing eventfs | expand

Message

Ajay Kaher July 13, 2023, 11:33 a.m. UTC
Events Tracing infrastructure contains lot of files, directories
(internally in terms of inodes, dentries). And ends up by consuming
memory in MBs. We can have multiple events of Events Tracing, which
further requires more memory.

Instead of creating inodes/dentries, eventfs could keep meta-data and
skip the creation of inodes/dentries. As and when require, eventfs will
create the inodes/dentries only for required files/directories.
Also eventfs would delete the inodes/dentries once no more requires
but preserve the meta data.

Tracing events took ~9MB, with this approach it took ~4.5MB
for ~10K files/dir.

v3:
Patch 3,4,5,7,9:
         removed all the eventfs_rwsem code and replaced it with an srcu
         lock for the readers, and a mutex to synchronize the writers of
         the list.

Patch 2: moved 'tracefs_inode' and 'get_tracefs()' to v4 03/10

Patch 3: moved the struct eventfs_file and eventfs_inode into event_inode.c
         as it really should not be exposed to all users.

Patch 5: added a recursion check to eventfs_remove_rec() as it is really
         dangerous to have unchecked recursion in the kernel (we do have
         a fixed size stack).

         have the free use srcu callbacks. After the srcu grace periods
         are done, it adds the eventfs_file onto a llist (lockless link
         list) and wakes up a work queue. Then the work queue does the
         freeing (this needs to be done in task/workqueue context, as
         srcu callbacks are done in softirq context).

Patch 6: renamed:
         eventfs_create_file() -> create_file()
         eventfs_create_dir() -> create_dir()

v2:
Patch 01: new patch:'Require all trace events to have a TRACE_SYSTEM'
Patch 02: moved from v1 1/9
Patch 03: moved from v1 2/9
          As suggested by Zheng Yejian, introduced eventfs_prepare_ef()
          helper function to add files or directories to eventfs
          fix WARNING reported by kernel test robot in v1 8/9
Patch 04: moved from v1 3/9
          used eventfs_prepare_ef() to add files
          fix WARNING reported by kernel test robot in v1 8/9
Patch 05: moved from v1 4/9
          fix compiling warning reported by kernel test robot in v1 4/9
Patch 06: moved from v1 5/9
Patch 07: moved from v1 6/9
Patch 08: moved from v1 7/9
Patch 09: moved from v1 8/9
          rebased because of v3 01/10
Patch 10: moved from v1 9/9

v1:
Patch 1: add header file
Patch 2: resolved kernel test robot issues
         protecting eventfs lists using nested eventfs_rwsem
Patch 3: protecting eventfs lists using nested eventfs_rwsem
Patch 4: improve events cleanup code to fix crashes
Patch 5: resolved kernel test robot issues
         removed d_instantiate_anon() calls
Patch 6: resolved kernel test robot issues
         fix kprobe test in eventfs_root_lookup()
         protecting eventfs lists using nested eventfs_rwsem
Patch 7: remove header file
Patch 8: pass eventfs_rwsem as argument to eventfs functions
         called eventfs_remove_events_dir() instead of tracefs_remove()
         from event_trace_del_tracer()
Patch 9: new patch to fix kprobe test case

Ajay Kaher (9):
  tracefs: Rename some tracefs function
  eventfs: Implement eventfs dir creation functions
  eventfs: Implement eventfs file add functions
  eventfs: Implement eventfs file, directory remove function
  eventfs: Implement functions to create eventfs files and directories
  eventfs: Implement eventfs lookup, read, open functions
  eventfs: Implement tracefs_inode_cache
  eventfs: Move tracing/events to eventfs
  test: ftrace: Fix kprobe test for eventfs

Steven Rostedt (Google) (1):
  tracing: Require all trace events to have a TRACE_SYSTEM

 fs/tracefs/Makefile                           |   1 +
 fs/tracefs/event_inode.c                      | 711 ++++++++++++++++++
 fs/tracefs/inode.c                            | 124 ++-
 fs/tracefs/internal.h                         |  25 +
 include/linux/trace_events.h                  |   1 +
 include/linux/tracefs.h                       |  32 +
 kernel/trace/trace.h                          |   2 +-
 kernel/trace/trace_events.c                   |  78 +-
 .../ftrace/test.d/kprobe/kprobe_args_char.tc  |   4 +-
 .../test.d/kprobe/kprobe_args_string.tc       |   4 +-
 10 files changed, 930 insertions(+), 52 deletions(-)
 create mode 100644 fs/tracefs/event_inode.c
 create mode 100644 fs/tracefs/internal.h

Comments

Steven Rostedt July 14, 2023, 10:58 p.m. UTC | #1
On Thu, 13 Jul 2023 17:03:14 +0530
Ajay Kaher <akaher@vmware.com> wrote:

> Events Tracing infrastructure contains lot of files, directories
> (internally in terms of inodes, dentries). And ends up by consuming
> memory in MBs. We can have multiple events of Events Tracing, which
> further requires more memory.
> 
> Instead of creating inodes/dentries, eventfs could keep meta-data and
> skip the creation of inodes/dentries. As and when require, eventfs will
> create the inodes/dentries only for required files/directories.
> Also eventfs would delete the inodes/dentries once no more requires
> but preserve the meta data.
> 
> Tracing events took ~9MB, with this approach it took ~4.5MB
> for ~10K files/dir.

I think we are very close to getting this in for the next merge window. I
ran several tests and so far it's holding up!

I made a bunch of nits for this series, but nothing major. Mostly fixing up
change logs and comments, as well as some naming conventions and
reorganizing the series a little bit.

Anyway, I'm hoping that v5 will be ready to go into linux-next.

Thanks a lot Ajay for working on this!

-- Steve


> 
> v3:
> Patch 3,4,5,7,9:
>          removed all the eventfs_rwsem code and replaced it with an srcu
>          lock for the readers, and a mutex to synchronize the writers of
>          the list.
> 
> Patch 2: moved 'tracefs_inode' and 'get_tracefs()' to v4 03/10
> 
> Patch 3: moved the struct eventfs_file and eventfs_inode into event_inode.c
>          as it really should not be exposed to all users.
> 
> Patch 5: added a recursion check to eventfs_remove_rec() as it is really
>          dangerous to have unchecked recursion in the kernel (we do have
>          a fixed size stack).
> 
>          have the free use srcu callbacks. After the srcu grace periods
>          are done, it adds the eventfs_file onto a llist (lockless link
>          list) and wakes up a work queue. Then the work queue does the
>          freeing (this needs to be done in task/workqueue context, as
>          srcu callbacks are done in softirq context).
> 
> Patch 6: renamed:
>          eventfs_create_file() -> create_file()
>          eventfs_create_dir() -> create_dir()
> 
> v2:
> Patch 01: new patch:'Require all trace events to have a TRACE_SYSTEM'
> Patch 02: moved from v1 1/9
> Patch 03: moved from v1 2/9
>           As suggested by Zheng Yejian, introduced eventfs_prepare_ef()
>           helper function to add files or directories to eventfs
>           fix WARNING reported by kernel test robot in v1 8/9
> Patch 04: moved from v1 3/9
>           used eventfs_prepare_ef() to add files
>           fix WARNING reported by kernel test robot in v1 8/9
> Patch 05: moved from v1 4/9
>           fix compiling warning reported by kernel test robot in v1 4/9
> Patch 06: moved from v1 5/9
> Patch 07: moved from v1 6/9
> Patch 08: moved from v1 7/9
> Patch 09: moved from v1 8/9
>           rebased because of v3 01/10
> Patch 10: moved from v1 9/9
> 
> v1:
> Patch 1: add header file
> Patch 2: resolved kernel test robot issues
>          protecting eventfs lists using nested eventfs_rwsem
> Patch 3: protecting eventfs lists using nested eventfs_rwsem
> Patch 4: improve events cleanup code to fix crashes
> Patch 5: resolved kernel test robot issues
>          removed d_instantiate_anon() calls
> Patch 6: resolved kernel test robot issues
>          fix kprobe test in eventfs_root_lookup()
>          protecting eventfs lists using nested eventfs_rwsem
> Patch 7: remove header file
> Patch 8: pass eventfs_rwsem as argument to eventfs functions
>          called eventfs_remove_events_dir() instead of tracefs_remove()
>          from event_trace_del_tracer()
> Patch 9: new patch to fix kprobe test case
> 
> Ajay Kaher (9):
>   tracefs: Rename some tracefs function
>   eventfs: Implement eventfs dir creation functions
>   eventfs: Implement eventfs file add functions
>   eventfs: Implement eventfs file, directory remove function
>   eventfs: Implement functions to create eventfs files and directories
>   eventfs: Implement eventfs lookup, read, open functions
>   eventfs: Implement tracefs_inode_cache
>   eventfs: Move tracing/events to eventfs
>   test: ftrace: Fix kprobe test for eventfs
> 
> Steven Rostedt (Google) (1):
>   tracing: Require all trace events to have a TRACE_SYSTEM
> 
>  fs/tracefs/Makefile                           |   1 +
>  fs/tracefs/event_inode.c                      | 711 ++++++++++++++++++
>  fs/tracefs/inode.c                            | 124 ++-
>  fs/tracefs/internal.h                         |  25 +
>  include/linux/trace_events.h                  |   1 +
>  include/linux/tracefs.h                       |  32 +
>  kernel/trace/trace.h                          |   2 +-
>  kernel/trace/trace_events.c                   |  78 +-
>  .../ftrace/test.d/kprobe/kprobe_args_char.tc  |   4 +-
>  .../test.d/kprobe/kprobe_args_string.tc       |   4 +-
>  10 files changed, 930 insertions(+), 52 deletions(-)
>  create mode 100644 fs/tracefs/event_inode.c
>  create mode 100644 fs/tracefs/internal.h
>
Ajay Kaher July 16, 2023, 5:32 p.m. UTC | #2
> On 15-Jul-2023, at 4:28 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> !! External Email
> 
> On Thu, 13 Jul 2023 17:03:14 +0530
> Ajay Kaher <akaher@vmware.com> wrote:
> 
>> Events Tracing infrastructure contains lot of files, directories
>> (internally in terms of inodes, dentries). And ends up by consuming
>> memory in MBs. We can have multiple events of Events Tracing, which
>> further requires more memory.
>> 
>> Instead of creating inodes/dentries, eventfs could keep meta-data and
>> skip the creation of inodes/dentries. As and when require, eventfs will
>> create the inodes/dentries only for required files/directories.
>> Also eventfs would delete the inodes/dentries once no more requires
>> but preserve the meta data.
>> 
>> Tracing events took ~9MB, with this approach it took ~4.5MB
>> for ~10K files/dir.
> 
> I think we are very close to getting this in for the next merge window. I
> ran several tests and so far it's holding up!
> 
> I made a bunch of nits for this series, but nothing major. Mostly fixing up
> change logs and comments, as well as some naming conventions and
> reorganizing the series a little bit.
> 
> Anyway, I'm hoping that v5 will be ready to go into linux-next.
> 
> Thanks a lot Ajay for working on this!


Thanks Steve, hopefully I will fix all the pending nits in v5.
Here is the checkpatch.pl report:


./scripts/checkpatch.pl v4/*
--------------------------
v4/0000-cover-letter.patch
--------------------------
total: 0 errors, 0 warnings, 0 lines checked

v4/0000-cover-letter.patch has no obvious style problems and is ready for submission.
------------------------------------------------------------------
v4/0001-tracing-Require-all-trace-events-to-have-a-TRACE_SYS.patch
------------------------------------------------------------------
total: 0 errors, 0 warnings, 22 lines checked

v4/0001-tracing-Require-all-trace-events-to-have-a-TRACE_SYS.patch has no obvious style problems and is ready for submission.
--------------------------------------------------
v4/0002-tracefs-Rename-some-tracefs-function.patch
--------------------------------------------------
total: 0 errors, 0 warnings, 71 lines checked

v4/0002-tracefs-Rename-some-tracefs-function.patch has no obvious style problems and is ready for submission.
--------------------------------------------------------------
v4/0003-eventfs-Implement-eventfs-dir-creation-functions.patch
--------------------------------------------------------------
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#52:
new file mode 100644

WARNING: Symbolic permissions 'S_IRWXU | S_IRUGO | S_IXUGO' are not preferred. Consider using octal permissions '0755'.
#194: FILE: fs/tracefs/event_inode.c:138:
+	inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;

WARNING: Symbolic permissions 'S_IRWXU | S_IRUGO | S_IXUGO' are not preferred. Consider using octal permissions '0755'.
#229: FILE: fs/tracefs/event_inode.c:173:
+		S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO,

WARNING: Symbolic permissions 'S_IRWXU | S_IRUGO | S_IXUGO' are not preferred. Consider using octal permissions '0755'.
#261: FILE: fs/tracefs/event_inode.c:205:
+		S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO,

total: 0 errors, 4 warnings, 297 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

v4/0003-eventfs-Implement-eventfs-dir-creation-functions.patch has style problems, please review.
----------------------------------------------------------
v4/0004-eventfs-Implement-eventfs-file-add-functions.patch
----------------------------------------------------------
total: 0 errors, 0 warnings, 101 lines checked

v4/0004-eventfs-Implement-eventfs-file-add-functions.patch has no obvious style problems and is ready for submission.
------------------------------------------------------------------
v4/0005-eventfs-Implement-eventfs-file-directory-remove-func.patch
------------------------------------------------------------------
total: 0 errors, 0 warnings, 129 lines checked

v4/0005-eventfs-Implement-eventfs-file-directory-remove-func.patch has no obvious style problems and is ready for submission.
------------------------------------------------------------------
v4/0006-eventfs-Implement-functions-to-create-eventfs-files-.patch
------------------------------------------------------------------
total: 0 errors, 0 warnings, 182 lines checked

v4/0006-eventfs-Implement-functions-to-create-eventfs-files-.patch has no obvious style problems and is ready for submission.
------------------------------------------------------------------
v4/0007-eventfs-Implement-eventfs-lookup-read-open-functions.patch
------------------------------------------------------------------
total: 0 errors, 0 warnings, 224 lines checked

v4/0007-eventfs-Implement-eventfs-lookup-read-open-functions.patch has no obvious style problems and is ready for submission.
---------------------------------------------------
v4/0008-eventfs-Implement-tracefs_inode_cache.patch
---------------------------------------------------
total: 0 errors, 0 warnings, 68 lines checked

v4/0008-eventfs-Implement-tracefs_inode_cache.patch has no obvious style problems and is ready for submission.
----------------------------------------------------
v4/0009-eventfs-Move-tracing-events-to-eventfs.patch
----------------------------------------------------
total: 0 errors, 0 warnings, 241 lines checked

v4/0009-eventfs-Move-tracing-events-to-eventfs.patch has no obvious style problems and is ready for submission.
-----------------------------------------------------
v4/0010-test-ftrace-Fix-kprobe-test-for-eventfs.patch
-----------------------------------------------------
total: 0 errors, 0 warnings, 32 lines checked

v4/0010-test-ftrace-Fix-kprobe-test-for-eventfs.patch has no obvious style problems and is ready for submission.

-Ajay
Steven Rostedt July 18, 2023, 1:40 p.m. UTC | #3
On Sun, 16 Jul 2023 17:32:35 +0000
Ajay Kaher <akaher@vmware.com> wrote:

> Thanks Steve, hopefully I will fix all the pending nits in v5.
> Here is the checkpatch.pl report:

Hold off on v5. I hit the following on v4:

[  220.170527] BUG: unable to handle page fault for address: fffffffffffffff0
[  220.172792] #PF: supervisor read access in kernel mode
[  220.174618] #PF: error_code(0x0000) - not-present page
[  220.176516] PGD 13104d067 P4D 13104d067 PUD 13104f067 PMD 0
[  220.178559] Oops: 0000 [#1] PREEMPT SMP PTI
[  220.180087] CPU: 3 PID: 35 Comm: kworker/u8:1 Not tainted 6.5.0-rc1-test-00021-gdd6e7af33766-dirty #15
[  220.183441] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[  220.186629] Workqueue: events_unbound eventfs_workfn
[  220.188286] RIP: 0010:eventfs_set_ef_status_free+0x17/0x40
[  220.190091] Code: 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 48 8b 47 18 48 8b 40 30 48 83 f8 10 74 1b <f6> 40 f0 02 74 15 48 8b 47 78 48 85 c0 74 0c c6 40 5a 00 48 c7 40
[  220.195360] RSP: 0018:ffffa731c0147e20 EFLAGS: 00010287
[  220.196802] RAX: 0000000000000000 RBX: ffff97ca512ca000 RCX: 0000000000000000
[  220.198703] RDX: 0000000000000001 RSI: ffff97ca52d18010 RDI: ffff97ca512ca000
[  220.200540] RBP: ffff97ca52cb3780 R08: 0000000000000064 R09: 00000000802a0022
[  220.202324] R10: 0000000000039e80 R11: ffff97cabffd5000 R12: ffff97ca512ca058
[  220.204012] R13: ffff97ca52cb3780 R14: ffff97ca40153705 R15: ffffffffad5c1848
[  220.205685] FS:  0000000000000000(0000) GS:ffff97cabbd80000(0000) knlGS:0000000000000000
[  220.207476] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  220.208764] CR2: fffffffffffffff0 CR3: 000000010a01a001 CR4: 0000000000170ee0
[  220.210342] Call Trace:
[  220.210879]  <TASK>
[  220.211359]  ? __die+0x23/0x70
[  220.212036]  ? page_fault_oops+0xa4/0x180
[  220.212904]  ? exc_page_fault+0xf6/0x190
[  220.213738]  ? asm_exc_page_fault+0x26/0x30
[  220.214586]  ? eventfs_set_ef_status_free+0x17/0x40
[  220.216081]  tracefs_dentry_iput+0x39/0x50
[  220.217370]  __dentry_kill+0xdc/0x170
[  220.218581]  dput+0x142/0x310
[  220.219647]  eventfs_workfn+0x42/0x70
[  220.220805]  process_one_work+0x1e2/0x3e0
[  220.222031]  worker_thread+0x1da/0x390
[  220.223204]  ? __pfx_worker_thread+0x10/0x10
[  220.224476]  kthread+0xf7/0x130
[  220.225543]  ? __pfx_kthread+0x10/0x10
[  220.226735]  ret_from_fork+0x2c/0x50
[  220.227898]  </TASK>
[  220.228792] Modules linked in:
[  220.229860] CR2: fffffffffffffff0
[  220.230960] ---[ end trace 0000000000000000 ]---


I think I know the issue, and looking to see if I can fix it.

-- Steve
Ajay Kaher July 19, 2023, 10:25 a.m. UTC | #4
> On 18-Jul-2023, at 7:10 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> !! External Email
> 
> On Sun, 16 Jul 2023 17:32:35 +0000
> Ajay Kaher <akaher@vmware.com> wrote:
> 
>> Thanks Steve, hopefully I will fix all the pending nits in v5.
>> Here is the checkpatch.pl report:
> 
> Hold off on v5. I hit the following on v4:

OK.

> 
> [  220.170527] BUG: unable to handle page fault for address: fffffffffffffff0
> [  220.172792] #PF: supervisor read access in kernel mode
> [  220.174618] #PF: error_code(0x0000) - not-present page
> [  220.176516] PGD 13104d067 P4D 13104d067 PUD 13104f067 PMD 0
> [  220.178559] Oops: 0000 [#1] PREEMPT SMP PTI
> [  220.180087] CPU: 3 PID: 35 Comm: kworker/u8:1 Not tainted 6.5.0-rc1-test-00021-gdd6e7af33766-dirty #15
> [  220.183441] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [  220.186629] Workqueue: events_unbound eventfs_workfn
> [  220.188286] RIP: 0010:eventfs_set_ef_status_free+0x17/0x40
> [  220.190091] Code: 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 48 8b 47 18 48 8b 40 30 48 83 f8 10 74 1b <f6> 40 f0 02 74 15 48 8b 47 78 48 85 c0 74 0c c6 40 5a 00 48 c7 40
> [  220.195360] RSP: 0018:ffffa731c0147e20 EFLAGS: 00010287
> [  220.196802] RAX: 0000000000000000 RBX: ffff97ca512ca000 RCX: 0000000000000000
> [  220.198703] RDX: 0000000000000001 RSI: ffff97ca52d18010 RDI: ffff97ca512ca000
> [  220.200540] RBP: ffff97ca52cb3780 R08: 0000000000000064 R09: 00000000802a0022
> [  220.202324] R10: 0000000000039e80 R11: ffff97cabffd5000 R12: ffff97ca512ca058
> [  220.204012] R13: ffff97ca52cb3780 R14: ffff97ca40153705 R15: ffffffffad5c1848
> [  220.205685] FS:  0000000000000000(0000) GS:ffff97cabbd80000(0000) knlGS:0000000000000000
> [  220.207476] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  220.208764] CR2: fffffffffffffff0 CR3: 000000010a01a001 CR4: 0000000000170ee0
> [  220.210342] Call Trace:
> [  220.210879]  <TASK>
> [  220.211359]  ? __die+0x23/0x70
> [  220.212036]  ? page_fault_oops+0xa4/0x180
> [  220.212904]  ? exc_page_fault+0xf6/0x190
> [  220.213738]  ? asm_exc_page_fault+0x26/0x30
> [  220.214586]  ? eventfs_set_ef_status_free+0x17/0x40
> [  220.216081]  tracefs_dentry_iput+0x39/0x50
> [  220.217370]  __dentry_kill+0xdc/0x170
> [  220.218581]  dput+0x142/0x310
> [  220.219647]  eventfs_workfn+0x42/0x70
> [  220.220805]  process_one_work+0x1e2/0x3e0
> [  220.222031]  worker_thread+0x1da/0x390
> [  220.223204]  ? __pfx_worker_thread+0x10/0x10
> [  220.224476]  kthread+0xf7/0x130
> [  220.225543]  ? __pfx_kthread+0x10/0x10
> [  220.226735]  ret_from_fork+0x2c/0x50
> [  220.227898]  </TASK>
> [  220.228792] Modules linked in:
> [  220.229860] CR2: fffffffffffffff0
> [  220.230960] ---[ end trace 0000000000000000 ]---
> 
> 
> I think I know the issue, and looking to see if I can fix it.


- Is it also reproducible on v3?
- Is it manually reproducible or reproducible using any specific script?

Let me know if I can help.

-Ajay
Steven Rostedt July 19, 2023, 2:23 p.m. UTC | #5
On Wed, 19 Jul 2023 10:25:28 +0000
Ajay Kaher <akaher@vmware.com> wrote:

> - Is it also reproducible on v3?
> - Is it manually reproducible or reproducible using any specific script?
> 
> Let me know if I can help.

Just tried it against v3, and it gave me the splat that I originally had
and starting to fix, which now gives me another splat. I'll spend a couple
more days on it and start sharing code and seeing if we can work together
on this.

Here's the reproducer (of both v3 splat and the bug I'm hitting now).

 ~# echo 'p:sock_getattr 0xffffffff9b55cef0 sk=%di' > /sys/kernel/tracing/kprobe_events
 ~# ls /sys/kernel/debug/tracing/events/kprobes/sock_getattr/
 ~# echo '-:sock_getattr 0xffffffff9b55cef0 sk=%di' > /sys/kernel/tracing/kprobe_events

v3 gives me (and my updates too)

======================================================
 WARNING: possible circular locking dependency detected
 6.5.0-rc1-test+ #576 Not tainted
 ------------------------------------------------------
 trace-cmd/840 is trying to acquire lock:
 ffff8881007e5de0 (&sb->s_type->i_mutex_key#5){++++}-{3:3}, at: dcache_dir_open_wrapper+0xc1/0x1b0
 
 but task is already holding lock:
 ffff888103ad7e70 (eventfs_rwsem/1){.+.+}-{3:3}, at: dcache_dir_open_wrapper+0x6f/0x1b0
 
 which lock already depends on the new lock.

 
 the existing dependency chain (in reverse order) is:
 
 -> #1 (eventfs_rwsem/1){.+.+}-{3:3}:
        down_read_nested+0x41/0x180
        eventfs_root_lookup+0x42/0x120
        __lookup_slow+0xff/0x1b0
        walk_component+0xdb/0x150
        path_lookupat+0x67/0x1a0
        filename_lookup+0xe4/0x1f0
        vfs_statx+0x9e/0x180
        vfs_fstatat+0x51/0x70
        __do_sys_newfstatat+0x3f/0x80
        do_syscall_64+0x3a/0xc0
        entry_SYSCALL_64_after_hwframe+0x6e/0xd8
 
 -> #0 (&sb->s_type->i_mutex_key#5){++++}-{3:3}:
        __lock_acquire+0x165d/0x2390
        lock_acquire+0xd4/0x2d0
        down_write+0x3b/0xd0
        dcache_dir_open_wrapper+0xc1/0x1b0
        do_dentry_open+0x20c/0x510
        path_openat+0x7ad/0xc60
        do_filp_open+0xaf/0x160
        do_sys_openat2+0xab/0xe0
        __x64_sys_openat+0x6a/0xa0
        do_syscall_64+0x3a/0xc0
        entry_SYSCALL_64_after_hwframe+0x6e/0xd8
 
 other info that might help us debug this:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   rlock(eventfs_rwsem/1);
                                lock(&sb->s_type->i_mutex_key#5);
                                lock(eventfs_rwsem/1);
   lock(&sb->s_type->i_mutex_key#5);
 
  *** DEADLOCK ***

 1 lock held by trace-cmd/840:
  #0: ffff888103ad7e70 (eventfs_rwsem/1){.+.+}-{3:3}, at: dcache_dir_open_wrapper+0x6f/0x1b0
 
 stack backtrace:
 CPU: 7 PID: 840 Comm: trace-cmd Not tainted 6.5.0-rc1-test+ #576
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x57/0x90
  check_noncircular+0x14b/0x160
  __lock_acquire+0x165d/0x2390
  lock_acquire+0xd4/0x2d0
  ? dcache_dir_open_wrapper+0xc1/0x1b0
  down_write+0x3b/0xd0
  ? dcache_dir_open_wrapper+0xc1/0x1b0
  dcache_dir_open_wrapper+0xc1/0x1b0
  ? __pfx_dcache_dir_open_wrapper+0x10/0x10
  do_dentry_open+0x20c/0x510
  path_openat+0x7ad/0xc60
  do_filp_open+0xaf/0x160
  do_sys_openat2+0xab/0xe0
  __x64_sys_openat+0x6a/0xa0
  do_syscall_64+0x3a/0xc0
  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
 RIP: 0033:0x7f1743267e41
 Code: 44 24 18 31 c0 41 83 e2 40 75 3e 89 f0 25 00 00 41 00 3d 00 00 41 00 74 30 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 3f 48 8b 54 24 18 64 48 2b 14 25 28 00 00 00
 RSP: 002b:00007ffec10ff5d0 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1743267e41
 RDX: 0000000000090800 RSI: 00007ffec10ffdb0 RDI: 00000000ffffff9c
 RBP: 00007ffec10ffda0 R08: 00007ffec11003e0 R09: 0000000000000040
 R10: 0000000000000000 R11: 0000000000000287 R12: 00007ffec11003e0
 R13: 0000000000000040 R14: 0000000000000000 R15: 00007ffec110034b
  </TASK>


I moved the code around a bit, and it appears that kprobes is getting
dput() more than once.

I moved the d_invalidate() and dput() into the workqueue function, and on
kprobes going away, d_invalidate() frees it, and dput() is now corrupted.

Still investigating. The VFS layer is a magic box that needs the right
wizard hat to deal with, but I unfortunately am waiting on back order to
retrieve that specific hat :-p

-- Steve
Ajay Kaher July 19, 2023, 6:37 p.m. UTC | #6
> On 19-Jul-2023, at 7:53 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> !! External Email
> 
> On Wed, 19 Jul 2023 10:25:28 +0000
> Ajay Kaher <akaher@vmware.com> wrote:
> 
>> - Is it also reproducible on v3?
>> - Is it manually reproducible or reproducible using any specific script?
>> 
>> Let me know if I can help.
> 
> Just tried it against v3, and it gave me the splat that I originally had
> and starting to fix, which now gives me another splat. I'll spend a couple
> more days on it and start sharing code and seeing if we can work together
> on this.
> 
> Here's the reproducer (of both v3 splat and the bug I'm hitting now).
> 
> ~# echo 'p:sock_getattr 0xffffffff9b55cef0 sk=%di' > /sys/kernel/tracing/kprobe_events
> ~# ls /sys/kernel/debug/tracing/events/kprobes/sock_getattr/
> ~# echo '-:sock_getattr 0xffffffff9b55cef0 sk=%di' > /sys/kernel/tracing/kprobe_events

I tried above steps on v4 but couldn’t reproduce:

root@photon-6 [ ~/sdb/linux ]# echo 'p:sock_getattr 0xffffffff9b55cef0 sk=%di' > /sys/kernel/tracing/kprobe_events
root@photon-6 [ ~/sdb/linux ]# ls /sys/kernel/debug/tracing/events/kprobes/sock_getattr/
enable  filter  format  id  trigger
root@photon-6 [ ~/sdb/linux ]# echo '-:sock_getattr 0xffffffff9b55cef0 sk=%di' > /sys/kernel/tracing/kprobe_events
-bash: echo: write error: No such file or directory

I have doubt on call_srcu(), it may first end the grace period for parent then for child. If this is true then free_list
will have unordered list and could cause problem.
 

> 
> v3 gives me (and my updates too)
> 
> ======================================================
> WARNING: possible circular locking dependency detected
> 6.5.0-rc1-test+ #576 Not tainted
> ------------------------------------------------------
> trace-cmd/840 is trying to acquire lock:
> ffff8881007e5de0 (&sb->s_type->i_mutex_key#5){++++}-{3:3}, at: dcache_dir_open_wrapper+0xc1/0x1b0
> 
> but task is already holding lock:
> ffff888103ad7e70 (eventfs_rwsem/1){.+.+}-{3:3}, at: dcache_dir_open_wrapper+0x6f/0x1b0
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (eventfs_rwsem/1){.+.+}-{3:3}:
>        down_read_nested+0x41/0x180
>        eventfs_root_lookup+0x42/0x120
>        __lookup_slow+0xff/0x1b0
>        walk_component+0xdb/0x150
>        path_lookupat+0x67/0x1a0
>        filename_lookup+0xe4/0x1f0
>        vfs_statx+0x9e/0x180
>        vfs_fstatat+0x51/0x70
>        __do_sys_newfstatat+0x3f/0x80
>        do_syscall_64+0x3a/0xc0
>        entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> 
> -> #0 (&sb->s_type->i_mutex_key#5){++++}-{3:3}:
>        __lock_acquire+0x165d/0x2390
>        lock_acquire+0xd4/0x2d0
>        down_write+0x3b/0xd0
>        dcache_dir_open_wrapper+0xc1/0x1b0
>        do_dentry_open+0x20c/0x510
>        path_openat+0x7ad/0xc60
>        do_filp_open+0xaf/0x160
>        do_sys_openat2+0xab/0xe0
>        __x64_sys_openat+0x6a/0xa0
>        do_syscall_64+0x3a/0xc0
>        entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> 
> other info that might help us debug this:
> 
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   rlock(eventfs_rwsem/1);
>                                lock(&sb->s_type->i_mutex_key#5);
>                                lock(eventfs_rwsem/1);
>   lock(&sb->s_type->i_mutex_key#5);
> 
>  *** DEADLOCK ***
> 
> 1 lock held by trace-cmd/840:
>  #0: ffff888103ad7e70 (eventfs_rwsem/1){.+.+}-{3:3}, at: dcache_dir_open_wrapper+0x6f/0x1b0
> 
> stack backtrace:
> CPU: 7 PID: 840 Comm: trace-cmd Not tainted 6.5.0-rc1-test+ #576
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> Call Trace:
>  <TASK>
>  dump_stack_lvl+0x57/0x90
>  check_noncircular+0x14b/0x160
>  __lock_acquire+0x165d/0x2390
>  lock_acquire+0xd4/0x2d0
>  ? dcache_dir_open_wrapper+0xc1/0x1b0
>  down_write+0x3b/0xd0
>  ? dcache_dir_open_wrapper+0xc1/0x1b0
>  dcache_dir_open_wrapper+0xc1/0x1b0
>  ? __pfx_dcache_dir_open_wrapper+0x10/0x10
>  do_dentry_open+0x20c/0x510
>  path_openat+0x7ad/0xc60
>  do_filp_open+0xaf/0x160
>  do_sys_openat2+0xab/0xe0
>  __x64_sys_openat+0x6a/0xa0
>  do_syscall_64+0x3a/0xc0
>  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> RIP: 0033:0x7f1743267e41
> Code: 44 24 18 31 c0 41 83 e2 40 75 3e 89 f0 25 00 00 41 00 3d 00 00 41 00 74 30 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 3f 48 8b 54 24 18 64 48 2b 14 25 28 00 00 00
> RSP: 002b:00007ffec10ff5d0 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1743267e41
> RDX: 0000000000090800 RSI: 00007ffec10ffdb0 RDI: 00000000ffffff9c
> RBP: 00007ffec10ffda0 R08: 00007ffec11003e0 R09: 0000000000000040
> R10: 0000000000000000 R11: 0000000000000287 R12: 00007ffec11003e0
> R13: 0000000000000040 R14: 0000000000000000 R15: 00007ffec110034b
>  </TASK>
> 

This is expected from v3 (just ignore as of now), if eventfs_set_ef_status_free crash not
reproduced on v3 then it’s v4 issue.

-Ajay

> 
> I moved the code around a bit, and it appears that kprobes is getting
> dput() more than once.
> 
> I moved the d_invalidate() and dput() into the workqueue function, and on
> kprobes going away, d_invalidate() frees it, and dput() is now corrupted.
> 
> Still investigating. The VFS layer is a magic box that needs the right
> wizard hat to deal with, but I unfortunately am waiting on back order to
> retrieve that specific hat :-p
> 
> -- Steve
> 
> !! External Email: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender.
Steven Rostedt July 19, 2023, 6:40 p.m. UTC | #7
On Wed, 19 Jul 2023 18:37:12 +0000
Ajay Kaher <akaher@vmware.com> wrote:

> > Here's the reproducer (of both v3 splat and the bug I'm hitting now).
> > 
> > ~# echo 'p:sock_getattr 0xffffffff9b55cef0 sk=%di' > /sys/kernel/tracing/kprobe_events
> > ~# ls /sys/kernel/debug/tracing/events/kprobes/sock_getattr/
> > ~# echo '-:sock_getattr 0xffffffff9b55cef0 sk=%di' > /sys/kernel/tracing/kprobe_events  
> 
> I tried above steps on v4 but couldn’t reproduce:
> 
> root@photon-6 [ ~/sdb/linux ]# echo 'p:sock_getattr 0xffffffff9b55cef0 sk=%di' > /sys/kernel/tracing/kprobe_events
> root@photon-6 [ ~/sdb/linux ]# ls /sys/kernel/debug/tracing/events/kprobes/sock_getattr/
> enable  filter  format  id  trigger
> root@photon-6 [ ~/sdb/linux ]# echo '-:sock_getattr 0xffffffff9b55cef0 sk=%di' > /sys/kernel/tracing/kprobe_events
> -bash: echo: write error: No such file or directory
> 
> I have doubt on call_srcu(), it may first end the grace period for parent then for child. If this is true then free_list
> will have unordered list and could cause problem.

I modified the srcu portion a bit. Will post soon. I think I got something
working.

I'm having doubt that the dput()s were needed in the eventfs_remove_rec(),
as the d_invalidate() appears to be enough. I'm still testing.

>  
> 
> > 
> > v3 gives me (and my updates too)
> > 
> > ======================================================
> > WARNING: possible circular locking dependency detected
> > 6.5.0-rc1-test+ #576 Not tainted
> > ------------------------------------------------------
> > trace-cmd/840 is trying to acquire lock:
> > ffff8881007e5de0 (&sb->s_type->i_mutex_key#5){++++}-{3:3}, at: dcache_dir_open_wrapper+0xc1/0x1b0
> > 
> > but task is already holding lock:
> > ffff888103ad7e70 (eventfs_rwsem/1){.+.+}-{3:3}, at: dcache_dir_open_wrapper+0x6f/0x1b0
> > 
> > which lock already depends on the new lock.
> > 
> > 
> > the existing dependency chain (in reverse order) is:
> >   
> > -> #1 (eventfs_rwsem/1){.+.+}-{3:3}:  
> >        down_read_nested+0x41/0x180
> >        eventfs_root_lookup+0x42/0x120
> >        __lookup_slow+0xff/0x1b0
> >        walk_component+0xdb/0x150
> >        path_lookupat+0x67/0x1a0
> >        filename_lookup+0xe4/0x1f0
> >        vfs_statx+0x9e/0x180
> >        vfs_fstatat+0x51/0x70
> >        __do_sys_newfstatat+0x3f/0x80
> >        do_syscall_64+0x3a/0xc0
> >        entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> >   
> > -> #0 (&sb->s_type->i_mutex_key#5){++++}-{3:3}:  
> >        __lock_acquire+0x165d/0x2390
> >        lock_acquire+0xd4/0x2d0
> >        down_write+0x3b/0xd0
> >        dcache_dir_open_wrapper+0xc1/0x1b0
> >        do_dentry_open+0x20c/0x510
> >        path_openat+0x7ad/0xc60
> >        do_filp_open+0xaf/0x160
> >        do_sys_openat2+0xab/0xe0
> >        __x64_sys_openat+0x6a/0xa0
> >        do_syscall_64+0x3a/0xc0
> >        entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > 
> > other info that might help us debug this:
> > 
> >  Possible unsafe locking scenario:
> > 
> >        CPU0                    CPU1
> >        ----                    ----
> >   rlock(eventfs_rwsem/1);
> >                                lock(&sb->s_type->i_mutex_key#5);
> >                                lock(eventfs_rwsem/1);
> >   lock(&sb->s_type->i_mutex_key#5);
> > 
> >  *** DEADLOCK ***
> > 
> > 1 lock held by trace-cmd/840:
> >  #0: ffff888103ad7e70 (eventfs_rwsem/1){.+.+}-{3:3}, at: dcache_dir_open_wrapper+0x6f/0x1b0
> > 
> > stack backtrace:
> > CPU: 7 PID: 840 Comm: trace-cmd Not tainted 6.5.0-rc1-test+ #576
> > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > Call Trace:
> >  <TASK>
> >  dump_stack_lvl+0x57/0x90
> >  check_noncircular+0x14b/0x160
> >  __lock_acquire+0x165d/0x2390
> >  lock_acquire+0xd4/0x2d0
> >  ? dcache_dir_open_wrapper+0xc1/0x1b0
> >  down_write+0x3b/0xd0
> >  ? dcache_dir_open_wrapper+0xc1/0x1b0
> >  dcache_dir_open_wrapper+0xc1/0x1b0
> >  ? __pfx_dcache_dir_open_wrapper+0x10/0x10
> >  do_dentry_open+0x20c/0x510
> >  path_openat+0x7ad/0xc60
> >  do_filp_open+0xaf/0x160
> >  do_sys_openat2+0xab/0xe0
> >  __x64_sys_openat+0x6a/0xa0
> >  do_syscall_64+0x3a/0xc0
> >  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > RIP: 0033:0x7f1743267e41
> > Code: 44 24 18 31 c0 41 83 e2 40 75 3e 89 f0 25 00 00 41 00 3d 00 00 41 00 74 30 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 3f 48 8b 54 24 18 64 48 2b 14 25 28 00 00 00
> > RSP: 002b:00007ffec10ff5d0 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
> > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1743267e41
> > RDX: 0000000000090800 RSI: 00007ffec10ffdb0 RDI: 00000000ffffff9c
> > RBP: 00007ffec10ffda0 R08: 00007ffec11003e0 R09: 0000000000000040
> > R10: 0000000000000000 R11: 0000000000000287 R12: 00007ffec11003e0
> > R13: 0000000000000040 R14: 0000000000000000 R15: 00007ffec110034b
> >  </TASK>
> >   
> 
> This is expected from v3 (just ignore as of now), if eventfs_set_ef_status_free crash not
> reproduced on v3 then it’s v4 issue.

The issue comes from fixing the above ;-)

-- Steve
Steven Rostedt July 21, 2023, 1:18 p.m. UTC | #8
[ Resending because claws-mail is messing with the Cc again. It doesn't like quotes :-p ]

On Wed, 19 Jul 2023 14:40:46 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> > I have doubt on call_srcu(), it may first end the grace period for parent then for child. If this is true then free_list
> > will have unordered list and could cause problem.  
> 
> I modified the srcu portion a bit. Will post soon. I think I got something
> working.
> 
> I'm having doubt that the dput()s were needed in the eventfs_remove_rec(),
> as the d_invalidate() appears to be enough. I'm still testing.

OK, I got this working and it appears to pass all my tests. I actually got
it working Wednesday night, but I tried a different approach on Thursday
that got rid of the evenfs_file and only used eventfs_inodes and makes the
files more dynamic. There's still a couple of corner cases that are not
working with this approach (the dentry counters are getting out of sync).
This should not stop this from going in. I'll continue working on that
approach for the next merge cycle. But for now, here's the patch to this
series that works.

I'm going to post it here, and then reply to it with comments about my
changes.

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 4db048250cdb..2718de1533e6 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -36,16 +36,36 @@ struct eventfs_file {
 	const struct file_operations	*fop;
 	const struct inode_operations	*iop;
 	union {
+		struct list_head	del_list;
 		struct rcu_head		rcu;
-		struct llist_node	llist;  /* For freeing after RCU */
+		unsigned long		is_freed; /* Freed if one of the above is set */
 	};
 	void				*data;
 	umode_t				mode;
-	bool				created;
+	unsigned int			flags;
 };
 
 static DEFINE_MUTEX(eventfs_mutex);
 DEFINE_STATIC_SRCU(eventfs_srcu);
+
+static struct dentry *eventfs_root_lookup(struct inode *dir,
+					  struct dentry *dentry,
+					  unsigned int flags);
+static int dcache_dir_open_wrapper(struct inode *inode, struct file *file);
+static int eventfs_release(struct inode *inode, struct file *file);
+
+static const struct inode_operations eventfs_root_dir_inode_operations = {
+	.lookup		= eventfs_root_lookup,
+};
+
+static const struct file_operations eventfs_file_operations = {
+	.open           = dcache_dir_open_wrapper,
+	.read		= generic_read_dir,
+	.iterate_shared	= dcache_readdir,
+	.llseek		= generic_file_llseek,
+	.release        = eventfs_release,
+};
+
 /**
  * create_file - create a file in the tracefs filesystem
  * @name: the name of the file to create.
@@ -123,17 +143,12 @@ static struct dentry *create_file(const char *name, umode_t mode,
  * If tracefs is not enabled in the kernel, the value -%ENODEV will be
  * returned.
  */
-static struct dentry *create_dir(const char *name, umode_t mode,
-				 struct dentry *parent, void *data,
-				 const struct file_operations *fop,
-				 const struct inode_operations *iop)
+static struct dentry *create_dir(const char *name, struct dentry *parent, void *data)
 {
 	struct tracefs_inode *ti;
 	struct dentry *dentry;
 	struct inode *inode;
 
-	WARN_ON(!S_ISDIR(mode));
-
 	dentry = eventfs_start_creating(name, parent);
 	if (IS_ERR(dentry))
 		return dentry;
@@ -142,9 +157,9 @@ static struct dentry *create_dir(const char *name, umode_t mode,
 	if (unlikely(!inode))
 		return eventfs_failed_creating(dentry);
 
-	inode->i_mode = mode;
-	inode->i_op = iop;
-	inode->i_fop = fop;
+	inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
+	inode->i_op = &eventfs_root_dir_inode_operations;
+	inode->i_fop = &eventfs_file_operations;
 	inode->i_private = data;
 
 	ti = get_tracefs(inode);
@@ -169,15 +184,27 @@ void eventfs_set_ef_status_free(struct dentry *dentry)
 	struct tracefs_inode *ti_parent;
 	struct eventfs_file *ef;
 
+	mutex_lock(&eventfs_mutex);
 	ti_parent = get_tracefs(dentry->d_parent->d_inode);
 	if (!ti_parent || !(ti_parent->flags & TRACEFS_EVENT_INODE))
-		return;
+		goto out;
 
 	ef = dentry->d_fsdata;
 	if (!ef)
-		return;
-	ef->created = false;
+		goto out;
+	/*
+	 * If ef was freed, then the LSB bit is set for d_fsdata.
+	 * But this should not happen, as it should still have a
+	 * ref count that prevents it. Warn in case it does.
+	 */
+	if (WARN_ON_ONCE((unsigned long)ef & 1))
+		goto out;
+
+	dentry->d_fsdata = NULL;
+
 	ef->dentry = NULL;
+ out:
+	mutex_unlock(&eventfs_mutex);
 }
 
 /**
@@ -202,6 +229,79 @@ static void eventfs_post_create_dir(struct eventfs_file *ef)
 	ti->private = ef->ei;
 }
 
+static struct dentry *
+create_dentry(struct eventfs_file *ef, struct dentry *parent, bool lookup)
+{
+	bool invalidate = false;
+	struct dentry *dentry;
+
+	mutex_lock(&eventfs_mutex);
+	if (ef->is_freed) {
+		mutex_unlock(&eventfs_mutex);
+		return NULL;
+	}
+	if (ef->dentry) {
+		dentry = ef->dentry;
+		/* On dir open, up the ref count */
+		if (!lookup)
+			dget(dentry);
+		mutex_unlock(&eventfs_mutex);
+		return dentry;
+	}
+	mutex_unlock(&eventfs_mutex);
+
+	if (!lookup)
+		inode_lock(parent->d_inode);
+
+	if (ef->ei)
+		dentry = create_dir(ef->name, parent, ef->data);
+	else
+		dentry = create_file(ef->name, ef->mode, parent,
+				     ef->data, ef->fop);
+
+	if (!lookup)
+		inode_unlock(parent->d_inode);
+
+	mutex_lock(&eventfs_mutex);
+	if (IS_ERR_OR_NULL(dentry)) {
+		/* If the ef was already updated get it */
+		dentry = ef->dentry;
+		if (dentry && !lookup)
+			dget(dentry);
+		mutex_unlock(&eventfs_mutex);
+		return dentry;
+	}
+
+	if (!ef->dentry && !ef->is_freed) {
+		ef->dentry = dentry;
+		if (ef->ei)
+			eventfs_post_create_dir(ef);
+		dentry->d_fsdata = ef;
+	} else {
+		/* A race here, should try again (unless freed) */
+		invalidate = true;
+	}
+	mutex_unlock(&eventfs_mutex);
+	if (invalidate)
+		d_invalidate(dentry);
+
+	if (lookup || invalidate)
+		dput(dentry);
+
+	return invalidate ? NULL : dentry;
+}
+
+static bool match_event_file(struct eventfs_file *ef, const char *name)
+{
+	bool ret;
+
+	mutex_lock(&eventfs_mutex);
+	ret = !ef->is_freed && strcmp(ef->name, name) == 0;
+	mutex_unlock(&eventfs_mutex);
+
+	return ret;
+}
+
 /**
  * eventfs_root_lookup - lookup routine to create file/dir
  * @dir: directory in which lookup to be done
@@ -211,7 +311,6 @@ static void eventfs_post_create_dir(struct eventfs_file *ef)
  * Used to create dynamic file/dir with-in @dir, search with-in ei
  * list, if @dentry found go ahead and create the file/dir
  */
-
 static struct dentry *eventfs_root_lookup(struct inode *dir,
 					  struct dentry *dentry,
 					  unsigned int flags)
@@ -230,30 +329,10 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
 	idx = srcu_read_lock(&eventfs_srcu);
 	list_for_each_entry_srcu(ef, &ei->e_top_files, list,
 				 srcu_read_lock_held(&eventfs_srcu)) {
-		if (strcmp(ef->name, dentry->d_name.name))
+		if (!match_event_file(ef, dentry->d_name.name))
 			continue;
 		ret = simple_lookup(dir, dentry, flags);
-		if (ef->created)
-			continue;
-		mutex_lock(&eventfs_mutex);
-		ef->created = true;
-		if (ef->ei)
-			ef->dentry = create_dir(ef->name, ef->mode, ef->d_parent,
-						ef->data, ef->fop, ef->iop);
-		else
-			ef->dentry = create_file(ef->name, ef->mode, ef->d_parent,
-						 ef->data, ef->fop);
-
-		if (IS_ERR_OR_NULL(ef->dentry)) {
-			ef->created = false;
-			mutex_unlock(&eventfs_mutex);
-		} else {
-			if (ef->ei)
-				eventfs_post_create_dir(ef);
-			ef->dentry->d_fsdata = ef;
-			mutex_unlock(&eventfs_mutex);
-			dput(ef->dentry);
-		}
+		create_dentry(ef, ef->d_parent, true);
 		break;
 	}
 	srcu_read_unlock(&eventfs_srcu, idx);
@@ -270,6 +349,7 @@ static int eventfs_release(struct inode *inode, struct file *file)
 	struct tracefs_inode *ti;
 	struct eventfs_inode *ei;
 	struct eventfs_file *ef;
+	struct dentry *dentry;
 	int idx;
 
 	ti = get_tracefs(inode);
@@ -280,8 +360,11 @@ static int eventfs_release(struct inode *inode, struct file *file)
 	idx = srcu_read_lock(&eventfs_srcu);
 	list_for_each_entry_srcu(ef, &ei->e_top_files, list,
 				 srcu_read_lock_held(&eventfs_srcu)) {
-		if (ef->created)
-			dput(ef->dentry);
+		mutex_lock(&eventfs_mutex);
+		dentry = ef->dentry;
+		mutex_unlock(&eventfs_mutex);
+		if (dentry)
+			dput(dentry);
 	}
 	srcu_read_unlock(&eventfs_srcu, idx);
 	return dcache_dir_close(inode, file);
@@ -312,47 +395,12 @@ static int dcache_dir_open_wrapper(struct inode *inode, struct file *file)
 	ei = ti->private;
 	idx = srcu_read_lock(&eventfs_srcu);
 	list_for_each_entry_rcu(ef, &ei->e_top_files, list) {
-		if (ef->created) {
-			dget(ef->dentry);
-			continue;
-		}
-		mutex_lock(&eventfs_mutex);
-		ef->created = true;
-
-		inode_lock(dentry->d_inode);
-		if (ef->ei)
-			ef->dentry = create_dir(ef->name, ef->mode, dentry,
-						ef->data, ef->fop, ef->iop);
-		else
-			ef->dentry = create_file(ef->name, ef->mode, dentry,
-						 ef->data, ef->fop);
-		inode_unlock(dentry->d_inode);
-
-		if (IS_ERR_OR_NULL(ef->dentry)) {
-			ef->created = false;
-		} else {
-			if (ef->ei)
-				eventfs_post_create_dir(ef);
-			ef->dentry->d_fsdata = ef;
-		}
-		mutex_unlock(&eventfs_mutex);
+		create_dentry(ef, dentry, false);
 	}
 	srcu_read_unlock(&eventfs_srcu, idx);
 	return dcache_dir_open(inode, file);
 }
 
-static const struct file_operations eventfs_file_operations = {
-	.open           = dcache_dir_open_wrapper,
-	.read		= generic_read_dir,
-	.iterate_shared	= dcache_readdir,
-	.llseek		= generic_file_llseek,
-	.release        = eventfs_release,
-};
-
-static const struct inode_operations eventfs_root_dir_inode_operations = {
-	.lookup		= eventfs_root_lookup,
-};
-
 /**
  * eventfs_prepare_ef - helper function to prepare eventfs_file
  * @name: the name of the file/directory to create.
@@ -470,11 +518,7 @@ struct eventfs_file *eventfs_add_subsystem_dir(const char *name,
 	ti_parent = get_tracefs(parent->d_inode);
 	ei_parent = ti_parent->private;
 
-	ef = eventfs_prepare_ef(name,
-		S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO,
-		&eventfs_file_operations,
-		&eventfs_root_dir_inode_operations, NULL);
-
+	ef = eventfs_prepare_ef(name, S_IFDIR, NULL, NULL, NULL);
 	if (IS_ERR(ef))
 		return ef;
 
@@ -502,11 +546,7 @@ struct eventfs_file *eventfs_add_dir(const char *name,
 	if (!ef_parent)
 		return ERR_PTR(-EINVAL);
 
-	ef = eventfs_prepare_ef(name,
-		S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO,
-		&eventfs_file_operations,
-		&eventfs_root_dir_inode_operations, NULL);
-
+	ef = eventfs_prepare_ef(name, S_IFDIR, NULL, NULL, NULL);
 	if (IS_ERR(ef))
 		return ef;
 
@@ -601,37 +641,15 @@ int eventfs_add_file(const char *name, umode_t mode,
 	return 0;
 }
 
-static LLIST_HEAD(free_list);
-
-static void eventfs_workfn(struct work_struct *work)
-{
-	struct eventfs_file *ef, *tmp;
-	struct llist_node *llnode;
-
-	llnode = llist_del_all(&free_list);
-	llist_for_each_entry_safe(ef, tmp, llnode, llist) {
-		if (ef->created && ef->dentry)
-			dput(ef->dentry);
-		kfree(ef->name);
-		kfree(ef->ei);
-		kfree(ef);
-	}
-}
-
-DECLARE_WORK(eventfs_work, eventfs_workfn);
-
 static void free_ef(struct rcu_head *head)
 {
 	struct eventfs_file *ef = container_of(head, struct eventfs_file, rcu);
 
-	if (!llist_add(&ef->llist, &free_list))
-		return;
-
-	queue_work(system_unbound_wq, &eventfs_work);
+	kfree(ef->name);
+	kfree(ef->ei);
+	kfree(ef);
 }
 
-
-
 /**
  * eventfs_remove_rec - remove eventfs dir or file from list
  * @ef: eventfs_file to be removed.
@@ -639,7 +657,7 @@ static void free_ef(struct rcu_head *head)
  * This function recursively remove eventfs_file which
  * contains info of file or dir.
  */
-static void eventfs_remove_rec(struct eventfs_file *ef, int level)
+static void eventfs_remove_rec(struct eventfs_file *ef, struct list_head *head, int level)
 {
 	struct eventfs_file *ef_child;
 
@@ -659,15 +677,12 @@ static void eventfs_remove_rec(struct eventfs_file *ef, int level)
 		/* search for nested folders or files */
 		list_for_each_entry_srcu(ef_child, &ef->ei->e_top_files, list,
 					 lockdep_is_held(&eventfs_mutex)) {
-			eventfs_remove_rec(ef_child, level + 1);
+			eventfs_remove_rec(ef_child, head, level + 1);
 		}
 	}
 
-	if (ef->created && ef->dentry)
-		d_invalidate(ef->dentry);
-
 	list_del_rcu(&ef->list);
-	call_srcu(&eventfs_srcu, &ef->rcu, free_ef);
+	list_add_tail(&ef->del_list, head);
 }
 
 /**
@@ -678,12 +693,62 @@ static void eventfs_remove_rec(struct eventfs_file *ef, int level)
  */
 void eventfs_remove(struct eventfs_file *ef)
 {
+	struct eventfs_file *tmp;
+	LIST_HEAD(ef_del_list);
+	struct dentry *dentry_list = NULL;
+	struct dentry *dentry;
+
 	if (!ef)
 		return;
 
 	mutex_lock(&eventfs_mutex);
-	eventfs_remove_rec(ef, 0);
+	eventfs_remove_rec(ef, &ef_del_list, 0);
+
+	list_for_each_entry_safe(ef, tmp, &ef_del_list, del_list) {
+		if (ef->dentry) {
+			unsigned long ptr = (unsigned long)dentry_list;
+
+			/* Keep the dentry from being freed yet */
+			dget(ef->dentry);
+
+			/*
+			 * Paranoid: The dget() above should prevent the dentry
+			 * from being freed and calling eventfs_set_ef_status_free().
+			 * But just in case, set the link list LSB pointer to 1
+			 * and have eventfs_set_ef_status_free() check that to
+			 * make sure that if it does happen, it will not think
+			 * the d_fsdata is an event_file.
+			 *
+			 * For this to work, no event_file should be allocated
+			 * on a odd space, as the ef should always be allocated
+			 * to be at least word aligned. Check for that too.
+			 */
+			WARN_ON_ONCE(ptr & 1);
+
+			ef->dentry->d_fsdata = (void *)(ptr | 1);
+			dentry_list = ef->dentry;
+			ef->dentry = NULL;
+		}
+		call_srcu(&eventfs_srcu, &ef->rcu, free_ef);
+	}
 	mutex_unlock(&eventfs_mutex);
+
+	while (dentry_list) {
+		unsigned long ptr;
+
+		dentry = dentry_list;
+		ptr = (unsigned long)dentry->d_fsdata & ~1UL;
+		dentry_list = (struct dentry *)ptr;
+		dentry->d_fsdata = NULL;
+		d_invalidate(dentry);
+		mutex_lock(&eventfs_mutex);
+		/* dentry should now have at least a single reference */
+		WARN_ONCE((int)d_count(dentry) < 1,
+			  "dentry %px less than one reference (%d) after invalidate\n",
+			  dentry, d_count(dentry));
+		mutex_unlock(&eventfs_mutex);
+		dput(dentry);
+	}
 }
 
 /**
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index c443a0c32a8c..1b880b5cd29d 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -22,4 +22,6 @@ struct dentry *tracefs_end_creating(struct dentry *dentry);
 struct dentry *tracefs_failed_creating(struct dentry *dentry);
 struct inode *tracefs_get_inode(struct super_block *sb);
 
+void eventfs_set_ef_status_free(struct dentry *dentry);
+
 #endif /* _TRACEFS_INTERNAL_H */
diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h
index 4d30b0cafc5f..47c1b4d21735 100644
--- a/include/linux/tracefs.h
+++ b/include/linux/tracefs.h
@@ -51,8 +51,6 @@ void eventfs_remove(struct eventfs_file *ef);
 
 void eventfs_remove_events_dir(struct dentry *dentry);
 
-void eventfs_set_ef_status_free(struct dentry *dentry);
-
 struct dentry *tracefs_create_file(const char *name, umode_t mode,
 				   struct dentry *parent, void *data,
 				   const struct file_operations *fops);
Steven Rostedt July 21, 2023, 1:19 p.m. UTC | #9
[ Resending because claws-mail is messing with the Cc again. It doesn't like quotes :-p ]

On Fri, 21 Jul 2023 08:48:39 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
> index 4db048250cdb..2718de1533e6 100644
> --- a/fs/tracefs/event_inode.c
> +++ b/fs/tracefs/event_inode.c
> @@ -36,16 +36,36 @@ struct eventfs_file {
>  	const struct file_operations	*fop;
>  	const struct inode_operations	*iop;
>  	union {
> +		struct list_head	del_list;
>  		struct rcu_head		rcu;
> -		struct llist_node	llist;  /* For freeing after RCU */
> +		unsigned long		is_freed; /* Freed if one of the above is set */

I changed the freeing around. The dentries are freed before returning from
eventfs_remove_dir().

I also added a "is_freed" field that is part of the union and is set if
list elements have content. Note, since the union was criticized before, I
will state the entire purpose of doing this patch set is to save memory.
This structure will be used for every event file. What's the point of
getting rid of dentries if we are replacing it with something just as big?
Anyway, struct dentry does the exact same thing!

>  	};
>  	void				*data;
>  	umode_t				mode;
> -	bool				created;
> +	unsigned int			flags;

Bah, I forgot to remove flags (one iteration replaced the created with
flags to set both created and freed). I removed the freed with the above
"is_freed" and noticed that created is set if and only if ef->dentry is
set. So instead of using the created boolean, just test ef->dentry.

The flags isn't used and can be removed. I just forgot to do so.

>  };
>  
>  static DEFINE_MUTEX(eventfs_mutex);
>  DEFINE_STATIC_SRCU(eventfs_srcu);
> +
> +static struct dentry *eventfs_root_lookup(struct inode *dir,
> +					  struct dentry *dentry,
> +					  unsigned int flags);
> +static int dcache_dir_open_wrapper(struct inode *inode, struct file *file);
> +static int eventfs_release(struct inode *inode, struct file *file);
> +
> +static const struct inode_operations eventfs_root_dir_inode_operations = {
> +	.lookup		= eventfs_root_lookup,
> +};
> +
> +static const struct file_operations eventfs_file_operations = {
> +	.open           = dcache_dir_open_wrapper,
> +	.read		= generic_read_dir,
> +	.iterate_shared	= dcache_readdir,
> +	.llseek		= generic_file_llseek,
> +	.release        = eventfs_release,
> +};
> +

In preparing for getting rid of eventfs_file, I noticed that all
directories are set to the above ops. In create_dir() instead of passing in
ef->*ops, just use these directly. This does help with future work.

>  /**
>   * create_file - create a file in the tracefs filesystem
>   * @name: the name of the file to create.
> @@ -123,17 +143,12 @@ static struct dentry *create_file(const char *name, umode_t mode,
>   * If tracefs is not enabled in the kernel, the value -%ENODEV will be
>   * returned.
>   */
> -static struct dentry *create_dir(const char *name, umode_t mode,
> -				 struct dentry *parent, void *data,
> -				 const struct file_operations *fop,
> -				 const struct inode_operations *iop)
> +static struct dentry *create_dir(const char *name, struct dentry *parent, void *data)
>  {

As stated, the directories always used the same *op values, so I just hard
coded it.

>  	struct tracefs_inode *ti;
>  	struct dentry *dentry;
>  	struct inode *inode;
>  
> -	WARN_ON(!S_ISDIR(mode));
> -
>  	dentry = eventfs_start_creating(name, parent);
>  	if (IS_ERR(dentry))
>  		return dentry;
> @@ -142,9 +157,9 @@ static struct dentry *create_dir(const char *name, umode_t mode,
>  	if (unlikely(!inode))
>  		return eventfs_failed_creating(dentry);
>  
> -	inode->i_mode = mode;
> -	inode->i_op = iop;
> -	inode->i_fop = fop;
> +	inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
> +	inode->i_op = &eventfs_root_dir_inode_operations;
> +	inode->i_fop = &eventfs_file_operations;
>  	inode->i_private = data;
>  
>  	ti = get_tracefs(inode);
> @@ -169,15 +184,27 @@ void eventfs_set_ef_status_free(struct dentry *dentry)
>  	struct tracefs_inode *ti_parent;
>  	struct eventfs_file *ef;
>  
> +	mutex_lock(&eventfs_mutex);

To synchronize with the removals, I needed to add locking here.

>  	ti_parent = get_tracefs(dentry->d_parent->d_inode);
>  	if (!ti_parent || !(ti_parent->flags & TRACEFS_EVENT_INODE))
> -		return;
> +		goto out;
>  
>  	ef = dentry->d_fsdata;
>  	if (!ef)
> -		return;
> -	ef->created = false;
> +		goto out;
> +	/*
> +	 * If ef was freed, then the LSB bit is set for d_fsdata.
> +	 * But this should not happen, as it should still have a
> +	 * ref count that prevents it. Warn in case it does.
> +	 */
> +	if (WARN_ON_ONCE((unsigned long)ef & 1))
> +		goto out;

During the remove, a dget() is done to keep the dentry from freeing. To
make sure that it doesn't get freed, I added this test.

> +
> +	dentry->d_fsdata = NULL;
> +
>  	ef->dentry = NULL;
> + out:
> +	mutex_unlock(&eventfs_mutex);
>  }
>  
>  /**
> @@ -202,6 +229,79 @@ static void eventfs_post_create_dir(struct eventfs_file *ef)
>  	ti->private = ef->ei;
>  }
>  
> +static struct dentry *
> +create_dentry(struct eventfs_file *ef, struct dentry *parent, bool lookup)
> +{

Because both the lookup and the dir_open_wrapper did basically the same
thing, I created a helper function so that I didn't have to update both
locations.

> +	bool invalidate = false;
> +	struct dentry *dentry;
> +
> +	mutex_lock(&eventfs_mutex);
> +	if (ef->is_freed) {
> +		mutex_unlock(&eventfs_mutex);
> +		return NULL;
> +	}

Ignore if the ef is on its way to be freed.

> +	if (ef->dentry) {
> +		dentry = ef->dentry;

If the ef already has a dentry (created) then use it.

> +		/* On dir open, up the ref count */
> +		if (!lookup)
> +			dget(dentry);
> +		mutex_unlock(&eventfs_mutex);
> +		return dentry;
> +	}
> +	mutex_unlock(&eventfs_mutex);
> +
> +	if (!lookup)
> +		inode_lock(parent->d_inode);
> +
> +	if (ef->ei)
> +		dentry = create_dir(ef->name, parent, ef->data);
> +	else
> +		dentry = create_file(ef->name, ef->mode, parent,
> +				     ef->data, ef->fop);
> +
> +	if (!lookup)
> +		inode_unlock(parent->d_inode);
> +
> +	mutex_lock(&eventfs_mutex);
> +	if (IS_ERR_OR_NULL(dentry)) {

With the lock dropped, the dentry could have been created causing it to
fail. Check if the ef->dentry exists, and if so, use it instead.

Note, if the ef is freed, it should not have a dentry.

> +		/* If the ef was already updated get it */
> +		dentry = ef->dentry;
> +		if (dentry && !lookup)
> +			dget(dentry);
> +		mutex_unlock(&eventfs_mutex);
> +		return dentry;
> +	}
> +
> +	if (!ef->dentry && !ef->is_freed) {

With the lock dropped, the dentry could have been filled too. If so, drop
the created dentry and use the one owned by the ef->dentry.

> +		ef->dentry = dentry;
> +		if (ef->ei)
> +			eventfs_post_create_dir(ef);
> +		dentry->d_fsdata = ef;
> +	} else {
> +		/* A race here, should try again (unless freed) */
> +		invalidate = true;

I had a WARN_ON() once here. Probably could add a:

		WARN_ON_ONCE(!ef->is_freed);

> +	}
> +	mutex_unlock(&eventfs_mutex);
> +	if (invalidate)
> +		d_invalidate(dentry);
> +
> +	if (lookup || invalidate)
> +		dput(dentry);
> +
> +	return invalidate ? NULL : dentry;
> +}
> +
> +static bool match_event_file(struct eventfs_file *ef, const char *name)
> +{

A bit of a paranoid helper function. I wanted to make sure to synchronize
with the removals.

> +	bool ret;
> +
> +	mutex_lock(&eventfs_mutex);
> +	ret = !ef->is_freed && strcmp(ef->name, name) == 0;
> +	mutex_unlock(&eventfs_mutex);
> +
> +	return ret;
> +}
> +
>  /**
>   * eventfs_root_lookup - lookup routine to create file/dir
>   * @dir: directory in which lookup to be done
> @@ -211,7 +311,6 @@ static void eventfs_post_create_dir(struct eventfs_file *ef)
>   * Used to create dynamic file/dir with-in @dir, search with-in ei
>   * list, if @dentry found go ahead and create the file/dir
>   */
> -
>  static struct dentry *eventfs_root_lookup(struct inode *dir,
>  					  struct dentry *dentry,
>  					  unsigned int flags)
> @@ -230,30 +329,10 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
>  	idx = srcu_read_lock(&eventfs_srcu);
>  	list_for_each_entry_srcu(ef, &ei->e_top_files, list,
>  				 srcu_read_lock_held(&eventfs_srcu)) {
> -		if (strcmp(ef->name, dentry->d_name.name))
> +		if (!match_event_file(ef, dentry->d_name.name))
>  			continue;
>  		ret = simple_lookup(dir, dentry, flags);
> -		if (ef->created)
> -			continue;
> -		mutex_lock(&eventfs_mutex);
> -		ef->created = true;
> -		if (ef->ei)
> -			ef->dentry = create_dir(ef->name, ef->mode, ef->d_parent,
> -						ef->data, ef->fop, ef->iop);
> -		else
> -			ef->dentry = create_file(ef->name, ef->mode, ef->d_parent,
> -						 ef->data, ef->fop);
> -
> -		if (IS_ERR_OR_NULL(ef->dentry)) {
> -			ef->created = false;
> -			mutex_unlock(&eventfs_mutex);
> -		} else {
> -			if (ef->ei)
> -				eventfs_post_create_dir(ef);
> -			ef->dentry->d_fsdata = ef;
> -			mutex_unlock(&eventfs_mutex);
> -			dput(ef->dentry);
> -		}
> +		create_dentry(ef, ef->d_parent, true);
>  		break;
>  	}
>  	srcu_read_unlock(&eventfs_srcu, idx);
> @@ -270,6 +349,7 @@ static int eventfs_release(struct inode *inode, struct file *file)
>  	struct tracefs_inode *ti;
>  	struct eventfs_inode *ei;
>  	struct eventfs_file *ef;
> +	struct dentry *dentry;
>  	int idx;
>  
>  	ti = get_tracefs(inode);
> @@ -280,8 +360,11 @@ static int eventfs_release(struct inode *inode, struct file *file)
>  	idx = srcu_read_lock(&eventfs_srcu);
>  	list_for_each_entry_srcu(ef, &ei->e_top_files, list,
>  				 srcu_read_lock_held(&eventfs_srcu)) {
> -		if (ef->created)
> -			dput(ef->dentry);
> +		mutex_lock(&eventfs_mutex);
> +		dentry = ef->dentry;
> +		mutex_unlock(&eventfs_mutex);
> +		if (dentry)
> +			dput(dentry);
>  	}
>  	srcu_read_unlock(&eventfs_srcu, idx);
>  	return dcache_dir_close(inode, file);
> @@ -312,47 +395,12 @@ static int dcache_dir_open_wrapper(struct inode *inode, struct file *file)
>  	ei = ti->private;
>  	idx = srcu_read_lock(&eventfs_srcu);
>  	list_for_each_entry_rcu(ef, &ei->e_top_files, list) {
> -		if (ef->created) {
> -			dget(ef->dentry);
> -			continue;
> -		}
> -		mutex_lock(&eventfs_mutex);
> -		ef->created = true;
> -
> -		inode_lock(dentry->d_inode);
> -		if (ef->ei)
> -			ef->dentry = create_dir(ef->name, ef->mode, dentry,
> -						ef->data, ef->fop, ef->iop);
> -		else
> -			ef->dentry = create_file(ef->name, ef->mode, dentry,
> -						 ef->data, ef->fop);
> -		inode_unlock(dentry->d_inode);
> -
> -		if (IS_ERR_OR_NULL(ef->dentry)) {
> -			ef->created = false;
> -		} else {
> -			if (ef->ei)
> -				eventfs_post_create_dir(ef);
> -			ef->dentry->d_fsdata = ef;
> -		}
> -		mutex_unlock(&eventfs_mutex);
> +		create_dentry(ef, dentry, false);
>  	}
>  	srcu_read_unlock(&eventfs_srcu, idx);
>  	return dcache_dir_open(inode, file);
>  }
>  
> -static const struct file_operations eventfs_file_operations = {
> -	.open           = dcache_dir_open_wrapper,
> -	.read		= generic_read_dir,
> -	.iterate_shared	= dcache_readdir,
> -	.llseek		= generic_file_llseek,
> -	.release        = eventfs_release,
> -};
> -
> -static const struct inode_operations eventfs_root_dir_inode_operations = {
> -	.lookup		= eventfs_root_lookup,
> -};
> -
>  /**
>   * eventfs_prepare_ef - helper function to prepare eventfs_file
>   * @name: the name of the file/directory to create.
> @@ -470,11 +518,7 @@ struct eventfs_file *eventfs_add_subsystem_dir(const char *name,
>  	ti_parent = get_tracefs(parent->d_inode);
>  	ei_parent = ti_parent->private;
>  
> -	ef = eventfs_prepare_ef(name,
> -		S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO,
> -		&eventfs_file_operations,
> -		&eventfs_root_dir_inode_operations, NULL);
> -
> +	ef = eventfs_prepare_ef(name, S_IFDIR, NULL, NULL, NULL);

For directories, just use the hard coded values.

>  	if (IS_ERR(ef))
>  		return ef;
>  
> @@ -502,11 +546,7 @@ struct eventfs_file *eventfs_add_dir(const char *name,
>  	if (!ef_parent)
>  		return ERR_PTR(-EINVAL);
>  
> -	ef = eventfs_prepare_ef(name,
> -		S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO,
> -		&eventfs_file_operations,
> -		&eventfs_root_dir_inode_operations, NULL);
> -
> +	ef = eventfs_prepare_ef(name, S_IFDIR, NULL, NULL, NULL);

ditto.

>  	if (IS_ERR(ef))
>  		return ef;
>  
> @@ -601,37 +641,15 @@ int eventfs_add_file(const char *name, umode_t mode,
>  	return 0;
>  }
>  
> -static LLIST_HEAD(free_list);
> -
> -static void eventfs_workfn(struct work_struct *work)
> -{
> -	struct eventfs_file *ef, *tmp;
> -	struct llist_node *llnode;
> -
> -	llnode = llist_del_all(&free_list);
> -	llist_for_each_entry_safe(ef, tmp, llnode, llist) {
> -		if (ef->created && ef->dentry)
> -			dput(ef->dentry);
> -		kfree(ef->name);
> -		kfree(ef->ei);
> -		kfree(ef);
> -	}
> -}
> -
> -DECLARE_WORK(eventfs_work, eventfs_workfn);
> -
>  static void free_ef(struct rcu_head *head)
>  {
>  	struct eventfs_file *ef = container_of(head, struct eventfs_file, rcu);
>  
> -	if (!llist_add(&ef->llist, &free_list))
> -		return;
> -
> -	queue_work(system_unbound_wq, &eventfs_work);
> +	kfree(ef->name);
> +	kfree(ef->ei);
> +	kfree(ef);

Since I did not do the dput() or d_invalidate() here I don't need call this
from task context. This simplifies the process.

>  }
>  
> -
> -
>  /**
>   * eventfs_remove_rec - remove eventfs dir or file from list
>   * @ef: eventfs_file to be removed.
> @@ -639,7 +657,7 @@ static void free_ef(struct rcu_head *head)
>   * This function recursively remove eventfs_file which
>   * contains info of file or dir.
>   */
> -static void eventfs_remove_rec(struct eventfs_file *ef, int level)
> +static void eventfs_remove_rec(struct eventfs_file *ef, struct list_head *head, int level)
>  {
>  	struct eventfs_file *ef_child;
>  
> @@ -659,15 +677,12 @@ static void eventfs_remove_rec(struct eventfs_file *ef, int level)
>  		/* search for nested folders or files */
>  		list_for_each_entry_srcu(ef_child, &ef->ei->e_top_files, list,
>  					 lockdep_is_held(&eventfs_mutex)) {
> -			eventfs_remove_rec(ef_child, level + 1);
> +			eventfs_remove_rec(ef_child, head, level + 1);
>  		}
>  	}
>  
> -	if (ef->created && ef->dentry)
> -		d_invalidate(ef->dentry);
> -
>  	list_del_rcu(&ef->list);
> -	call_srcu(&eventfs_srcu, &ef->rcu, free_ef);
> +	list_add_tail(&ef->del_list, head);

Hold off on freeing the ef. Add it to a link list to do so later.

>  }
>  
>  /**
> @@ -678,12 +693,62 @@ static void eventfs_remove_rec(struct eventfs_file *ef, int level)
>   */
>  void eventfs_remove(struct eventfs_file *ef)
>  {
> +	struct eventfs_file *tmp;
> +	LIST_HEAD(ef_del_list);
> +	struct dentry *dentry_list = NULL;
> +	struct dentry *dentry;
> +
>  	if (!ef)
>  		return;
>  
>  	mutex_lock(&eventfs_mutex);
> -	eventfs_remove_rec(ef, 0);
> +	eventfs_remove_rec(ef, &ef_del_list, 0);

The above returns back with ef_del_list holding all the ef's to be freed.

I probably could have just passed the dentry_list down instead, but I
wanted the below complexity done in a non recursive function.

> +
> +	list_for_each_entry_safe(ef, tmp, &ef_del_list, del_list) {
> +		if (ef->dentry) {
> +			unsigned long ptr = (unsigned long)dentry_list;
> +
> +			/* Keep the dentry from being freed yet */
> +			dget(ef->dentry);
> +
> +			/*
> +			 * Paranoid: The dget() above should prevent the dentry
> +			 * from being freed and calling eventfs_set_ef_status_free().
> +			 * But just in case, set the link list LSB pointer to 1
> +			 * and have eventfs_set_ef_status_free() check that to
> +			 * make sure that if it does happen, it will not think
> +			 * the d_fsdata is an event_file.
> +			 *
> +			 * For this to work, no event_file should be allocated
> +			 * on a odd space, as the ef should always be allocated
> +			 * to be at least word aligned. Check for that too.
> +			 */
> +			WARN_ON_ONCE(ptr & 1);
> +
> +			ef->dentry->d_fsdata = (void *)(ptr | 1);

Set the d_fsdata to be a link list. The comment above needs to say to say
struct eventfs_file and struct dentry should be word aligned. Anyway, while
the eventfs_mutex is held, set all the dentries belonging to eventfs_files
to the dentry_list and clear the ef->dentry.


> +			dentry_list = ef->dentry;
> +			ef->dentry = NULL;
> +		}
> +		call_srcu(&eventfs_srcu, &ef->rcu, free_ef);
> +	}
>  	mutex_unlock(&eventfs_mutex);
> +
> +	while (dentry_list) {
> +		unsigned long ptr;
> +
> +		dentry = dentry_list;
> +		ptr = (unsigned long)dentry->d_fsdata & ~1UL;
> +		dentry_list = (struct dentry *)ptr;
> +		dentry->d_fsdata = NULL;

With the mutex released, it is safe to free the dentries here. This also
must be done before returning from this function, as when I had it done in
the workqueue, it was failing some tests that would remove a dynamic event
and still see that the directory was still around!

> +		d_invalidate(dentry);
> +		mutex_lock(&eventfs_mutex);
> +		/* dentry should now have at least a single reference */
> +		WARN_ONCE((int)d_count(dentry) < 1,
> +			  "dentry %px less than one reference (%d) after invalidate\n",

I did update the above to:

		WARN_ONCE((int)d_count(dentry) < 1,
			  "dentry %px (%s) less than one reference (%d) after invalidate\n",
			  dentry, dentry->d_name.name, d_count(dentry));

To include the name of the dentry (my current work is triggering this still).

> +			  dentry, d_count(dentry));
> +		mutex_unlock(&eventfs_mutex);
> +		dput(dentry);
> +	}
>  }
>  
>  /**
> diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
> index c443a0c32a8c..1b880b5cd29d 100644
> --- a/fs/tracefs/internal.h
> +++ b/fs/tracefs/internal.h
> @@ -22,4 +22,6 @@ struct dentry *tracefs_end_creating(struct dentry *dentry);
>  struct dentry *tracefs_failed_creating(struct dentry *dentry);
>  struct inode *tracefs_get_inode(struct super_block *sb);
>  
> +void eventfs_set_ef_status_free(struct dentry *dentry);
> +
>  #endif /* _TRACEFS_INTERNAL_H */
> diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h
> index 4d30b0cafc5f..47c1b4d21735 100644
> --- a/include/linux/tracefs.h
> +++ b/include/linux/tracefs.h
> @@ -51,8 +51,6 @@ void eventfs_remove(struct eventfs_file *ef);
>  
>  void eventfs_remove_events_dir(struct dentry *dentry);
>  
> -void eventfs_set_ef_status_free(struct dentry *dentry);
> -

Oh, and eventfs_set_ef_status_free() should not be exported to outside the
tracefs system.

-- Steve


>  struct dentry *tracefs_create_file(const char *name, umode_t mode,
>  				   struct dentry *parent, void *data,
>  				   const struct file_operations *fops);
Nadav Amit July 21, 2023, 5:17 p.m. UTC | #10
> On Jul 21, 2023, at 6:19 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
>>      union {
>> +             struct list_head        del_list;
>>              struct rcu_head         rcu;
>> -             struct llist_node       llist;  /* For freeing after RCU */
>> +             unsigned long           is_freed; /* Freed if one of the above is set */
> 
> I changed the freeing around. The dentries are freed before returning from
> eventfs_remove_dir().
> 
> I also added a "is_freed" field that is part of the union and is set if
> list elements have content. Note, since the union was criticized before, I
> will state the entire purpose of doing this patch set is to save memory.
> This structure will be used for every event file. What's the point of
> getting rid of dentries if we are replacing it with something just as big?
> Anyway, struct dentry does the exact same thing!

Hey, don’t shoot me…

[And admittedly, I didn’t review the whole series after v1.]

I understand your position, but I think that at least is_freed should not
be in the union, and you can just put it after umode_t.

Even for the matter of size, it should not matter in most architectures
since umode_t is 16-bit, as natural alignment is at least 32-bits.

[ And “bool" is clearer type for is_freed. ]
Steven Rostedt July 21, 2023, 5:24 p.m. UTC | #11
On Fri, 21 Jul 2023 17:17:24 +0000
Nadav Amit <namit@vmware.com> wrote:


> > I also added a "is_freed" field that is part of the union and is set if
> > list elements have content. Note, since the union was criticized before, I
> > will state the entire purpose of doing this patch set is to save memory.
> > This structure will be used for every event file. What's the point of
> > getting rid of dentries if we are replacing it with something just as big?
> > Anyway, struct dentry does the exact same thing!  
> 
> Hey, don’t shoot me…

 ;-)

> 
> [And admittedly, I didn’t review the whole series after v1.]
> 
> I understand your position, but I think that at least is_freed should not
> be in the union, and you can just put it after umode_t.
> 
> Even for the matter of size, it should not matter in most architectures
> since umode_t is 16-bit, as natural alignment is at least 32-bits.

My new code I'm working on removes the umode_t and will require this then.
I rather have that part tested before adding other drastic changes.

-- Steve


> 
> [ And “bool" is clearer type for is_freed. ]
>
Steven Rostedt July 21, 2023, 5:30 p.m. UTC | #12
On Fri, 21 Jul 2023 17:17:24 +0000
Nadav Amit <namit@vmware.com> wrote:

> [ And “bool" is clearer type for is_freed. ]

Oh, and to answer why I didn't use bool. I want to make sure that it
doesn't use just part of the other elements in the union. If the compiler
decides to use one byte for the bool (which it is perfectly valid to do
so), and it happens to map over a zero of the other elements in the union,
then it will give a false negative.

By using unsigned long, it will be guaranteed to contain some content.

-- Steve
Steven Rostedt July 21, 2023, 7:14 p.m. UTC | #13
On Fri, 21 Jul 2023 09:18:43 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> OK, I got this working and it appears to pass all my tests. I actually got
> it working Wednesday night, but I tried a different approach on Thursday
> that got rid of the evenfs_file and only used eventfs_inodes and makes the
> files more dynamic. There's still a couple of corner cases that are not
> working with this approach (the dentry counters are getting out of sync).
> This should not stop this from going in. I'll continue working on that
> approach for the next merge cycle. But for now, here's the patch to this
> series that works.

Just figured out my bug with my new design. It was in the eventfs_release()
code, where I have a loop that does the dput on the children, but I wasn't
dput(child), I was dput(parent) in that loop!!

Anyway, for this merge window I much rather have this code in, and for the
next merge window I'll add this update.

I can post the new design too, but first let's focus on this.

-- Steve
Steven Rostedt July 21, 2023, 8:40 p.m. UTC | #14
On Fri, 21 Jul 2023 09:19:47 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> > +	} else {
> > +		/* A race here, should try again (unless freed) */
> > +		invalidate = true;  
> 
> I had a WARN_ON() once here. Probably could add a:
> 
> 		WARN_ON_ONCE(!ef->is_freed);

Yeah this should have a WARN_ON_ONCE() because the only way to get here
with having a dentry and the ef->dentry being set is if we have two
dentries with the same name in the same directory. Which should never
happen.

I think we can add:

		/*
		 * Should never happen unless we get here due to being freed.
		 * Otherwise it means two dentries exist with the same name.
		 */
		WARN_ON_ONCE(!ef->is_freed);

> 
> > +	}


-- Steve
Ajay Kaher July 26, 2023, 6:54 p.m. UTC | #15
> On 22-Jul-2023, at 2:10 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> !! External Email
> 
> On Fri, 21 Jul 2023 09:19:47 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
>>> +   } else {
>>> +           /* A race here, should try again (unless freed) */
>>> +           invalidate = true;
>> 
>> I had a WARN_ON() once here. Probably could add a:
>> 
>>              WARN_ON_ONCE(!ef->is_freed);
> 
> Yeah this should have a WARN_ON_ONCE() because the only way to get here
> with having a dentry and the ef->dentry being set is if we have two
> dentries with the same name in the same directory. Which should never
> happen.
> 
> I think we can add:
> 
>                /*
>                 * Should never happen unless we get here due to being freed.
>                 * Otherwise it means two dentries exist with the same name.
>                 */
>                WARN_ON_ONCE(!ef->is_freed);

I missed WARN_ON_ONCE in v5 06/10, I will add in v6 06/10.

- Ajay