diff mbox series

[v5,3/3] fs/file.c: add fast path in find_next_fd()

Message ID 20240717145018.3972922-4-yu.ma@intel.com (mailing list archive)
State New
Headers show
Series fs/file.c: optimize the critical section of file_lock in | expand

Commit Message

Ma, Yu July 17, 2024, 2:50 p.m. UTC
Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
word contains next_fd, as:
(1) next_fd indicates the lower bound for the first free fd.
(2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
searching.
(3) After fdt is expanded (the bitmap size doubled for each time of expansion),
it would never be shrunk. The search size increases but there are few open fds
available here.

This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
versions. And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
 fs/file.c | 9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Mateusz Guzik July 19, 2024, 5:53 p.m. UTC | #1
On Wed, Jul 17, 2024 at 4:24 PM Yu Ma <yu.ma@intel.com> wrote:
>
> Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
> word contains next_fd, as:
> (1) next_fd indicates the lower bound for the first free fd.
> (2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> searching.

this is stale -- now the fast path searches up to 64 fds in the lower bitmap

> (3) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
>

> This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
> Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
> versions.

I think this paragraph is droppable. You already got an ack from Jan
below, so stating he agrees with the patch is redundant. As for me I
don't think this warrants mentioning. Just remove it, perhaps
Christian will be willing to massage it by himself to avoid another
series posting.

> And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
> 8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.
>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
>  fs/file.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/fs/file.c b/fs/file.c
> index 1be2a5bcc7c4..729c07a4fc28 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -491,6 +491,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>         unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
>         unsigned int maxbit = maxfd / BITS_PER_LONG;
>         unsigned int bitbit = start / BITS_PER_LONG;
> +       unsigned int bit;
> +
> +       /*
> +        * Try to avoid looking at the second level bitmap
> +        */
> +       bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> +                                start & (BITS_PER_LONG - 1));
> +       if (bit < BITS_PER_LONG)
> +               return bit + bitbit * BITS_PER_LONG;
>
>         bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
>         if (bitbit >= maxfd)
> --
> 2.43.0
>
Ma, Yu July 20, 2024, 12:57 p.m. UTC | #2
On 7/20/2024 1:53 AM, Mateusz Guzik wrote:
> On Wed, Jul 17, 2024 at 4:24 PM Yu Ma <yu.ma@intel.com> wrote:
>> Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
>> word contains next_fd, as:
>> (1) next_fd indicates the lower bound for the first free fd.
>> (2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
>> searching.
> this is stale -- now the fast path searches up to 64 fds in the lower bitmap

Nope, this is still valid, as the searching size of the fast path inside 
of find_next_fd() is always 64, it will execute the fast path inside of 
find_next_zero_bit().


>
>> (3) After fdt is expanded (the bitmap size doubled for each time of expansion),
>> it would never be shrunk. The search size increases but there are few open fds
>> available here.
>>
>> This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
>> Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
>> versions.
> I think this paragraph is droppable. You already got an ack from Jan
> below, so stating he agrees with the patch is redundant. As for me I
> don't think this warrants mentioning. Just remove it, perhaps
> Christian will be willing to massage it by himself to avoid another
> series posting.

The idea of fast path for the word contains next_fd is from you, 
although this patch is small, I think it is reasonable to record here 
out of my respect. Appreciate for your guide and comments on this patch, 
I've learned a lot on the way of resolving problems :)


Regards

Yu

>> And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
>> 8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.
>>
>> Reviewed-by: Jan Kara <jack@suse.cz>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>> ---
>>   fs/file.c | 9 +++++++++
>>   1 file changed, 9 insertions(+)
>>
>> diff --git a/fs/file.c b/fs/file.c
>> index 1be2a5bcc7c4..729c07a4fc28 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -491,6 +491,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>>          unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
>>          unsigned int maxbit = maxfd / BITS_PER_LONG;
>>          unsigned int bitbit = start / BITS_PER_LONG;
>> +       unsigned int bit;
>> +
>> +       /*
>> +        * Try to avoid looking at the second level bitmap
>> +        */
>> +       bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
>> +                                start & (BITS_PER_LONG - 1));
>> +       if (bit < BITS_PER_LONG)
>> +               return bit + bitbit * BITS_PER_LONG;
>>
>>          bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
>>          if (bitbit >= maxfd)
>> --
>> 2.43.0
>>
>
Mateusz Guzik July 20, 2024, 2:22 p.m. UTC | #3
I think this is getting too much fluff traffic at this point, which is
partially my fault.

I'm buggering off.

Overall the patchset looks good, I don't see any technical reasons to
avoid merging it.

On Sat, Jul 20, 2024 at 2:57 PM Ma, Yu <yu.ma@intel.com> wrote:
>
>
> On 7/20/2024 1:53 AM, Mateusz Guzik wrote:
> > On Wed, Jul 17, 2024 at 4:24 PM Yu Ma <yu.ma@intel.com> wrote:
> >> Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
> >> word contains next_fd, as:
> >> (1) next_fd indicates the lower bound for the first free fd.
> >> (2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> >> searching.
> > this is stale -- now the fast path searches up to 64 fds in the lower bitmap
>
> Nope, this is still valid, as the searching size of the fast path inside
> of find_next_fd() is always 64, it will execute the fast path inside of
> find_next_zero_bit().
>
>
> >
> >> (3) After fdt is expanded (the bitmap size doubled for each time of expansion),
> >> it would never be shrunk. The search size increases but there are few open fds
> >> available here.
> >>
> >> This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
> >> Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
> >> versions.
> > I think this paragraph is droppable. You already got an ack from Jan
> > below, so stating he agrees with the patch is redundant. As for me I
> > don't think this warrants mentioning. Just remove it, perhaps
> > Christian will be willing to massage it by himself to avoid another
> > series posting.
>
> The idea of fast path for the word contains next_fd is from you,
> although this patch is small, I think it is reasonable to record here
> out of my respect. Appreciate for your guide and comments on this patch,
> I've learned a lot on the way of resolving problems :)
>
>
> Regards
>
> Yu
>
> >> And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
> >> 8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.
> >>
> >> Reviewed-by: Jan Kara <jack@suse.cz>
> >> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> >> Signed-off-by: Yu Ma <yu.ma@intel.com>
> >> ---
> >>   fs/file.c | 9 +++++++++
> >>   1 file changed, 9 insertions(+)
> >>
> >> diff --git a/fs/file.c b/fs/file.c
> >> index 1be2a5bcc7c4..729c07a4fc28 100644
> >> --- a/fs/file.c
> >> +++ b/fs/file.c
> >> @@ -491,6 +491,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> >>          unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> >>          unsigned int maxbit = maxfd / BITS_PER_LONG;
> >>          unsigned int bitbit = start / BITS_PER_LONG;
> >> +       unsigned int bit;
> >> +
> >> +       /*
> >> +        * Try to avoid looking at the second level bitmap
> >> +        */
> >> +       bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> >> +                                start & (BITS_PER_LONG - 1));
> >> +       if (bit < BITS_PER_LONG)
> >> +               return bit + bitbit * BITS_PER_LONG;
> >>
> >>          bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> >>          if (bitbit >= maxfd)
> >> --
> >> 2.43.0
> >>
> >
kernel test robot Aug. 6, 2024, 1:48 p.m. UTC | #4
Hello,

kernel test robot noticed a 6.3% improvement of will-it-scale.per_thread_ops on:


commit: b8decf0015a8b1ff02cdac61c0aa54355d8e73d7 ("[PATCH v5 3/3] fs/file.c: add fast path in find_next_fd()")
url: https://github.com/intel-lab-lkp/linux/commits/Yu-Ma/fs-file-c-remove-sanity_check-and-add-likely-unlikely-in-alloc_fd/20240717-224830
base: https://git.kernel.org/cgit/linux/kernel/git/vfs/vfs.git vfs.all
patch link: https://lore.kernel.org/all/20240717145018.3972922-4-yu.ma@intel.com/
patch subject: [PATCH v5 3/3] fs/file.c: add fast path in find_next_fd()

testcase: will-it-scale
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 512G memory
parameters:

	nr_task: 100%
	mode: thread
	test: open3
	cpufreq_governor: performance






Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240806/202408062152.7e5b5d6d-oliver.sang@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-13/performance/x86_64-rhel-8.3/thread/100%/debian-12-x86_64-20240206.cgz/lkp-spr-2sp4/open3/will-it-scale

commit: 
  5bb3423bf9 ("fs/file.c: conditionally clear full_fds")
  b8decf0015 ("fs/file.c: add fast path in find_next_fd()")

5bb3423bf9f9d91e b8decf0015a8b1ff02cdac61c0a 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
    848151            +6.2%     901119 ±  2%  will-it-scale.224.threads
      3785            +6.3%       4022 ±  2%  will-it-scale.per_thread_ops
    848151            +6.2%     901119 ±  2%  will-it-scale.workload
      0.28 ±  4%     +13.3%       0.32 ±  3%  perf-stat.i.MPKI
     31.31 ±  3%      +2.0       33.28        perf-stat.i.cache-miss-rate%
  14955855 ±  4%     +13.6%   16995785 ±  4%  perf-stat.i.cache-misses
  49676581            +6.7%   53009444 ±  3%  perf-stat.i.cache-references
     43955 ±  4%     -12.3%      38549 ±  4%  perf-stat.i.cycles-between-cache-misses
      0.28 ±  4%     +13.4%       0.32 ±  4%  perf-stat.overall.MPKI
     29.84 ±  3%      +1.9       31.78 ±  2%  perf-stat.overall.cache-miss-rate%
     43445 ±  4%     -12.1%      38200 ±  4%  perf-stat.overall.cycles-between-cache-misses
  19005976            -5.4%   17972604 ±  2%  perf-stat.overall.path-length
  14869677 ±  4%     +13.6%   16898438 ±  4%  perf-stat.ps.cache-misses
  49821402            +6.7%   53168235 ±  3%  perf-stat.ps.cache-references
     49.42            -0.1       49.34        perf-profile.calltrace.cycles-pp.alloc_fd.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
     49.40            -0.1       49.32        perf-profile.calltrace.cycles-pp.file_close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
     49.25            -0.1       49.18        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.file_close_fd.__x64_sys_close.do_syscall_64
     49.20            -0.1       49.13        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_fd.do_sys_openat2.__x64_sys_openat
     49.33            -0.1       49.26        perf-profile.calltrace.cycles-pp._raw_spin_lock.file_close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
     49.28            -0.1       49.22        perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_fd.do_sys_openat2.__x64_sys_openat.do_syscall_64
     50.14            +0.0       50.18        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64
     50.17            +0.0       50.21        perf-profile.calltrace.cycles-pp.open64
      0.64 ±  5%      +0.1        0.75 ±  6%  perf-profile.calltrace.cycles-pp.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.62 ±  5%      +0.1        0.74 ±  6%  perf-profile.calltrace.cycles-pp.path_openat.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64
     49.42            -0.1       49.34        perf-profile.children.cycles-pp.alloc_fd
     49.40            -0.1       49.32        perf-profile.children.cycles-pp.file_close_fd
      0.06            -0.0        0.05        perf-profile.children.cycles-pp.file_close_fd_locked
      0.15 ±  5%      +0.0        0.17 ±  4%  perf-profile.children.cycles-pp.init_file
      0.22 ±  3%      +0.0        0.25 ±  3%  perf-profile.children.cycles-pp.alloc_empty_file
      0.18 ±  6%      +0.0        0.22 ±  6%  perf-profile.children.cycles-pp.__fput
     50.14            +0.0       50.18        perf-profile.children.cycles-pp.__x64_sys_openat
     50.18            +0.0       50.22        perf-profile.children.cycles-pp.open64
      0.18 ± 14%      +0.0        0.23 ±  7%  perf-profile.children.cycles-pp.do_dentry_open
      0.30 ±  8%      +0.1        0.36 ±  8%  perf-profile.children.cycles-pp.do_open
      0.64 ±  5%      +0.1        0.75 ±  6%  perf-profile.children.cycles-pp.do_filp_open
      0.63 ±  5%      +0.1        0.75 ±  6%  perf-profile.children.cycles-pp.path_openat
      0.06            -0.0        0.05        perf-profile.self.cycles-pp.file_close_fd_locked
      0.16 ±  2%      +0.0        0.18 ±  2%  perf-profile.self.cycles-pp._raw_spin_lock
      0.08 ± 12%      +0.0        0.10 ±  4%  perf-profile.self.cycles-pp.__fput
      0.05 ±  7%      +0.1        0.10 ±  4%  perf-profile.self.cycles-pp.alloc_fd




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
diff mbox series

Patch

diff --git a/fs/file.c b/fs/file.c
index 1be2a5bcc7c4..729c07a4fc28 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -491,6 +491,15 @@  static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
 	unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
 	unsigned int maxbit = maxfd / BITS_PER_LONG;
 	unsigned int bitbit = start / BITS_PER_LONG;
+	unsigned int bit;
+
+	/*
+	 * Try to avoid looking at the second level bitmap
+	 */
+	bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
+				 start & (BITS_PER_LONG - 1));
+	if (bit < BITS_PER_LONG)
+		return bit + bitbit * BITS_PER_LONG;
 
 	bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
 	if (bitbit >= maxfd)