
[v2,0/3] fs: introduce file_ref_t

Message ID 20241007-brauner-file-rcuref-v2-0-387e24dc9163@kernel.org (mailing list archive)

Message

Christian Brauner Oct. 7, 2024, 2:23 p.m. UTC
As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop, it
has O(N^2) behaviour under contention with N concurrent operations, and
it sits in a hot path in __fget_files_rcu().

The rcuref infrastructure remedies this problem by using an
unconditional increment, relying on safe and dead zones in the counter
space to make this work, and requiring RCU protection for the data
structure in question. This not only scales better, it also introduces
overflow protection.

However, in contrast to generic rcuref, files require a memory barrier
and thus cannot rely on *_relaxed() atomic operations. They also need
to be built on atomic_long_t, as massive amounts of references aren't
unheard of, even if only as an attack.

As suggested by Linus, add a file specific variant instead of making
this a generic library.

I've been testing this with will-it-scale using a multi-threaded fstat()
on the same file descriptor on a machine that Jens gave me access to
(thank you very much!):

processor       : 511
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 160
model name      : AMD EPYC 9754 128-Core Processor

and I consistently get a 3-5% improvement on workloads with 256 or more
threads, comparing a v6.12-rc1 base with and without these patches
applied.

In vfs.file now.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v2:
- Don't introduce a separate rcuref_long_t library; implement it only
  for files for now.
- Retain memory barrier by using atomic_long_add_negative() instead of
  atomic_long_add_negative_relaxed() to order the loads before and after
  the increment in __fget_files_rcu().
- Link to v1: https://lore.kernel.org/r/20241005-brauner-file-rcuref-v1-0-725d5e713c86@kernel.org

---
Christian Brauner (3):
      fs: protect backing files with rcu
      fs: add file_ref
      fs: port files to file_ref

 drivers/gpu/drm/i915/gt/shmem_utils.c |   2 +-
 drivers/gpu/drm/vmwgfx/ttm_object.c   |   2 +-
 fs/eventpoll.c                        |   2 +-
 fs/file.c                             | 120 ++++++++++++++++++++++++++++++++--
 fs/file_table.c                       |  23 +++++--
 include/linux/file_ref.h              | 116 ++++++++++++++++++++++++++++++++
 include/linux/fs.h                    |   9 +--
 7 files changed, 253 insertions(+), 21 deletions(-)
---
base-commit: 9852d85ec9d492ebef56dc5f229416c925758edc
change-id: 20240927-brauner-file-rcuref-bfa4a4ba915b

Comments

Jens Axboe Oct. 7, 2024, 6:27 p.m. UTC | #1
On 10/7/24 8:23 AM, Christian Brauner wrote:
> [...]
> 
> I've been testing this with will-it-scale using a multi-threaded fstat()
> on the same file descriptor on a machine that Jens gave me access to
> (thank you very much!):
> 
> processor       : 511
> vendor_id       : AuthenticAMD
> cpu family      : 25
> model           : 160
> model name      : AMD EPYC 9754 128-Core Processor
> 
> and I consistently get a 3-5% improvement on workloads with 256 or more
> threads, comparing a v6.12-rc1 base with and without these patches
> applied.

FWIW, I ran this on another box, a 2-socket system with these CPUs:

AMD EPYC 7763 64-Core Processor

hence 128 cores, 256 threads. I ran my usual max iops test case, which
is 24 threads, each driving a fast drive. If I run without io_uring
direct descriptors, then fget/fput is hit decently hard. In that case, I
see a net reduction of about 1.2% CPU time for the fget/fput parts. So
not as huge a win as mentioned above, but it's also using way fewer
threads and different file descriptors. I'd say that's a pretty
noticeable win!