diff mbox series

vfs: Delete the associated dentry when deleting a file

Message ID 20240515091727.22034-1-laoar.shao@gmail.com (mailing list archive)
State New, archived
Headers show
Series vfs: Delete the associated dentry when deleting a file | expand

Commit Message

Yafang Shao May 15, 2024, 9:17 a.m. UTC
Our applications, built on Elasticsearch[0], frequently create and delete
files. These applications operate within containers, some with a memory
limit exceeding 100GB. Over prolonged periods, the accumulation of negative
dentries within these containers can amount to tens of gigabytes.

Upon container exit, directories are deleted. However, due to the numerous
associated dentries, this process can be time-consuming. Our users have
expressed frustration with this prolonged exit duration, which constitutes
our first issue.

Simultaneously, other processes may attempt to access the parent directory
of the Elasticsearch directories. Since the task responsible for deleting
the dentries holds the inode lock, processes attempting directory lookup
experience significant delays. This issue, our second problem, is easily
demonstrated:

  - Task 1 generates negative dentries:
  $ pwd
  ~/test
  $ mkdir es && cd es/ && ./create_and_delete_files.sh

  [ After generating tens of GB dentries ]

  $ cd ~/test && rm -rf es

  [ It will take a long duration to finish ]

  - Task 2 attempts to lookup the 'test/' directory
  $ pwd
  ~/test
  $ ls

  The 'ls' command in Task 2 experiences prolonged execution as Task 1
  is deleting the dentries.

We've devised a solution to address both issues by deleting associated
dentry when removing a file. Interestingly, we've noted that a similar
patch was proposed years ago[1], although it was rejected citing the
absence of tangible issues caused by negative dentries. Given our current
challenges, we're resubmitting the proposal. All relevant stakeholders from
previous discussions have been included for reference.

Some alternative solutions are also under discussion[2][3], such as
shrinking child dentries outside of the parent inode lock or even
asynchronously shrinking child dentries. However, given the straightforward
nature of the current solution, I believe this approach is still necessary.

[0]. https://github.com/elastic/elasticsearch
[1]. https://patchwork.kernel.org/project/linux-fsdevel/patch/1502099673-31620-1-git-send-email-wangkai86@huawei.com
[2]. https://lore.kernel.org/linux-fsdevel/20240511200240.6354-2-torvalds@linux-foundation.org/
[3]. https://lore.kernel.org/linux-fsdevel/CAHk-=wjEMf8Du4UFzxuToGDnF3yLaMcrYeyNAaH1NJWa6fwcNQ@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Wangkai <wangkai86@huawei.com>
Cc: Colin Walters <walters@verbum.org>
---
 fs/dcache.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

Comments

Linus Torvalds May 15, 2024, 4:05 p.m. UTC | #1
Oliver,
 is there any chance you could run this through the test robot
performance suite? The original full patch at

    https://lore.kernel.org/all/20240515091727.22034-1-laoar.shao@gmail.com/

and it would be interesting if the test robot could see if the patch
makes any difference on any other loads?

Thanks,
                     Linus

On Wed, 15 May 2024 at 02:17, Yafang Shao <laoar.shao@gmail.com> wrote:
>
> Our applications, built on Elasticsearch[0], frequently create and delete
> files. These applications operate within containers, some with a memory
> limit exceeding 100GB. Over prolonged periods, the accumulation of negative
> dentries within these containers can amount to tens of gigabytes.
>
> Upon container exit, directories are deleted. However, due to the numerous
> associated dentries, this process can be time-consuming. Our users have
> expressed frustration with this prolonged exit duration, which constitutes
> our first issue.
Oliver Sang May 16, 2024, 1:44 p.m. UTC | #2
hi, Linus,

On Wed, May 15, 2024 at 09:05:24AM -0700, Linus Torvalds wrote:
> Oliver,
>  is there any chance you could run this through the test robot
> performance suite? The original full patch at
> 
>     https://lore.kernel.org/all/20240515091727.22034-1-laoar.shao@gmail.com/
> 
> and it would be interesting if the test robot could see if the patch
> makes any difference on any other loads?

sure. it's our great pleasure. thanks a lot to involve us.

the tests are started, but due to resource contraint, it may cost 2-3 days
to run enough tests.

will keep you guys updated, Thanks!

> 
> Thanks,
>                      Linus
> 
> On Wed, 15 May 2024 at 02:17, Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > Our applications, built on Elasticsearch[0], frequently create and delete
> > files. These applications operate within containers, some with a memory
> > limit exceeding 100GB. Over prolonged periods, the accumulation of negative
> > dentries within these containers can amount to tens of gigabytes.
> >
> > Upon container exit, directories are deleted. However, due to the numerous
> > associated dentries, this process can be time-consuming. Our users have
> > expressed frustration with this prolonged exit duration, which constitutes
> > our first issue.
Oliver Sang May 22, 2024, 8:11 a.m. UTC | #3
Hello,

kernel test robot noticed a 6.7% improvement of stress-ng.touch.ops_per_sec on:


commit: 3681ce364442ce2ec7c7fbe90ad0aca651db475b ("[PATCH] vfs: Delete the associated dentry when deleting a file")
url: https://github.com/intel-lab-lkp/linux/commits/Yafang-Shao/vfs-Delete-the-associated-dentry-when-deleting-a-file/20240516-144340
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 3c999d1ae3c75991902a1a7dad0cb62c2a3008b4
patch link: https://lore.kernel.org/all/20240515091727.22034-1-laoar.shao@gmail.com/
patch subject: [PATCH] vfs: Delete the associated dentry when deleting a file

testcase: stress-ng
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
parameters:

	nr_threads: 100%
	disk: 1HDD
	testtime: 60s
	fs: ext4
	test: touch
	cpufreq_governor: performance



Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240522/202405221518.ecea2810-oliver.sang@intel.com

=========================================================================================
compiler/cpufreq_governor/disk/fs/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  gcc-13/performance/1HDD/ext4/x86_64-rhel-8.3/100%/debian-12-x86_64-20240206.cgz/lkp-icl-2sp8/touch/stress-ng/60s

commit: 
  3c999d1ae3 ("Merge tag 'wq-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq")
  3681ce3644 ("vfs: Delete the associated dentry when deleting a file")

3c999d1ae3c75991 3681ce364442ce2ec7c7fbe90ad 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
 9.884e+08           -49.1%  5.027e+08 ±  4%  cpuidle..time
     23.46 ±  2%     -38.3%      14.47 ±  5%  iostat.cpu.idle
     75.71           +11.7%      84.60        iostat.cpu.system
    620100 ±  5%     -24.0%     471157 ±  6%  numa-numastat.node0.local_node
    650333 ±  3%     -22.3%     505122 ±  5%  numa-numastat.node0.numa_hit
    736690 ±  3%     -21.7%     577182 ±  4%  numa-numastat.node1.local_node
    772759 ±  3%     -21.1%     609537 ±  3%  numa-numastat.node1.numa_hit
     23.42           -38.1%      14.50 ±  5%  vmstat.cpu.id
     54.39 ±  3%     +13.4%      61.67 ±  2%  vmstat.procs.r
     75507 ±  6%     +24.6%      94113 ±  5%  vmstat.system.cs
    177863           +14.2%     203176        vmstat.system.in
     21.30 ±  2%      -9.5       11.76 ±  7%  mpstat.cpu.all.idle%
      0.37 ±  2%      +0.1        0.42 ±  2%  mpstat.cpu.all.irq%
      0.14            -0.0        0.11 ±  2%  mpstat.cpu.all.soft%
     77.37            +9.4       86.79        mpstat.cpu.all.sys%
      0.81            +0.1        0.92 ±  2%  mpstat.cpu.all.usr%
     68.24           -11.5%      60.40        stress-ng.time.elapsed_time
     68.24           -11.5%      60.40        stress-ng.time.elapsed_time.max
    175478            +5.3%     184765        stress-ng.time.involuntary_context_switches
      5038           +12.4%       5663        stress-ng.time.percent_of_cpu_this_job_got
    259551 ±  2%      +6.7%     277010        stress-ng.touch.ops_per_sec
    454143 ±  2%      -9.2%     412504 ±  2%  meminfo.Active
    445748 ±  2%      -9.3%     404154 ±  2%  meminfo.Active(anon)
   1714332 ±  2%     -92.8%     123761 ±  2%  meminfo.KReclaimable
   7111721           -23.7%    5423125        meminfo.Memused
   1714332 ±  2%     -92.8%     123761 ±  2%  meminfo.SReclaimable
    291963           -35.6%     188122        meminfo.SUnreclaim
   2006296           -84.5%     311883        meminfo.Slab
   7289014           -24.0%    5539852        meminfo.max_used_kB
    875347 ±  3%     -91.3%      75880 ± 30%  numa-meminfo.node0.KReclaimable
    875347 ±  3%     -91.3%      75880 ± 30%  numa-meminfo.node0.SReclaimable
    161663 ±  7%     -31.3%     111087 ± 12%  numa-meminfo.node0.SUnreclaim
   1037010 ±  3%     -82.0%     186968 ± 16%  numa-meminfo.node0.Slab
    838678 ±  3%     -94.3%      48016 ± 50%  numa-meminfo.node1.KReclaimable
    838678 ±  3%     -94.3%      48016 ± 50%  numa-meminfo.node1.SReclaimable
    130259 ±  8%     -40.8%      77055 ± 18%  numa-meminfo.node1.SUnreclaim
    968937 ±  3%     -87.1%     125072 ± 25%  numa-meminfo.node1.Slab
    218872 ±  3%     -91.3%      18961 ± 31%  numa-vmstat.node0.nr_slab_reclaimable
     40425 ±  7%     -31.3%      27772 ± 12%  numa-vmstat.node0.nr_slab_unreclaimable
    649973 ±  3%     -22.3%     504715 ±  5%  numa-vmstat.node0.numa_hit
    619740 ±  5%     -24.0%     470750 ±  6%  numa-vmstat.node0.numa_local
    209697 ±  3%     -94.3%      12019 ± 50%  numa-vmstat.node1.nr_slab_reclaimable
     32567 ±  8%     -40.8%      19264 ± 18%  numa-vmstat.node1.nr_slab_unreclaimable
    771696 ±  3%     -21.1%     608506 ±  3%  numa-vmstat.node1.numa_hit
    735626 ±  3%     -21.7%     576154 ±  4%  numa-vmstat.node1.numa_local
    111469 ±  2%      -9.3%     101103 ±  2%  proc-vmstat.nr_active_anon
    446.10           -36.6%     283.05        proc-vmstat.nr_dirtied
     18812            +2.5%      19287        proc-vmstat.nr_kernel_stack
    428165 ±  2%     -92.8%      30970 ±  2%  proc-vmstat.nr_slab_reclaimable
     72954           -35.5%      47032        proc-vmstat.nr_slab_unreclaimable
    111469 ±  2%      -9.3%     101103 ±  2%  proc-vmstat.nr_zone_active_anon
   1424982           -21.7%    1116423        proc-vmstat.numa_hit
   1358679           -22.7%    1050103 ±  2%  proc-vmstat.numa_local
   4261989 ±  3%     -14.9%    3625822 ±  3%  proc-vmstat.pgalloc_normal
    399826            -4.6%     381396        proc-vmstat.pgfault
   3996977 ±  3%     -15.9%    3359964 ±  3%  proc-vmstat.pgfree
     13500            -5.9%      12708        proc-vmstat.pgpgout
      0.13 ±  3%      -0.0        0.08 ±  4%  perf-profile.children.cycles-pp.d_lookup
      0.12 ±  4%      -0.0        0.08        perf-profile.children.cycles-pp.__d_lookup
      0.11 ±  2%      +0.0        0.13 ±  3%  perf-profile.children.cycles-pp.irq_exit_rcu
      0.06 ±  7%      +0.0        0.09 ±  2%  perf-profile.children.cycles-pp.__slab_free
      0.02 ±100%      +0.0        0.06 ±  8%  perf-profile.children.cycles-pp.__memcg_slab_free_hook
      0.07 ±  7%      +0.0        0.12 ±  4%  perf-profile.children.cycles-pp.run_ksoftirqd
      0.09 ±  5%      +0.0        0.14 ±  4%  perf-profile.children.cycles-pp.smpboot_thread_fn
      0.18 ±  3%      +0.1        0.24 ±  3%  perf-profile.children.cycles-pp.kmem_cache_free
      0.10 ±  7%      +0.1        0.16 ±  7%  perf-profile.children.cycles-pp.kthread
      0.10 ±  7%      +0.1        0.16 ±  7%  perf-profile.children.cycles-pp.ret_from_fork
      0.10 ±  7%      +0.1        0.16 ±  7%  perf-profile.children.cycles-pp.ret_from_fork_asm
      0.14 ±  4%      +0.1        0.20 ±  3%  perf-profile.children.cycles-pp.rcu_do_batch
      0.15 ±  3%      +0.1        0.21 ±  2%  perf-profile.children.cycles-pp.rcu_core
      0.18 ±  3%      +0.1        0.24 ±  2%  perf-profile.children.cycles-pp.handle_softirqs
      0.00            +0.1        0.06 ±  7%  perf-profile.children.cycles-pp.list_lru_del
      0.82 ±  2%      +0.1        0.89 ±  3%  perf-profile.children.cycles-pp._raw_spin_lock
      0.01 ±238%      +0.1        0.08 ±  4%  perf-profile.children.cycles-pp.__call_rcu_common
      0.00            +0.1        0.08 ±  3%  perf-profile.children.cycles-pp.d_lru_del
      0.00            +0.2        0.17 ±  3%  perf-profile.children.cycles-pp.__dentry_kill
      0.33 ± 11%      +0.2        0.50 ±  6%  perf-profile.children.cycles-pp.dput
      0.09 ±  4%      -0.0        0.05 ± 22%  perf-profile.self.cycles-pp.__d_lookup
      0.06 ±  5%      +0.0        0.09 ±  5%  perf-profile.self.cycles-pp.__slab_free
      0.74 ±  2%      +0.1        0.79        perf-profile.self.cycles-pp._raw_spin_lock
      2.55 ±  4%     -40.6%       1.51 ±  5%  perf-stat.i.MPKI
 1.219e+10           +12.8%  1.376e+10        perf-stat.i.branch-instructions
      0.27 ±  2%      -0.1        0.21 ±  2%  perf-stat.i.branch-miss-rate%
  22693772           +14.3%   25941855        perf-stat.i.branch-misses
     39.43            -4.6       34.83        perf-stat.i.cache-miss-rate%
  95834044 ±  3%     +10.9%  1.063e+08 ±  4%  perf-stat.i.cache-misses
 2.658e+08 ±  2%     +14.8%  3.051e+08 ±  4%  perf-stat.i.cache-references
     77622 ±  6%     +25.5%      97395 ±  5%  perf-stat.i.context-switches
      2.82            +3.3%       2.91        perf-stat.i.cpi
 1.821e+11           +12.5%  2.049e+11        perf-stat.i.cpu-cycles
      9159 ±  3%     +22.9%      11257 ±  3%  perf-stat.i.cpu-migrations
 6.223e+10           +12.8%  7.019e+10        perf-stat.i.instructions
      0.36            -4.8%       0.34        perf-stat.i.ipc
      1.19 ±  9%     +27.6%       1.51 ±  5%  perf-stat.i.metric.K/sec
      4716            +6.3%       5011        perf-stat.i.minor-faults
      4716            +6.3%       5011        perf-stat.i.page-faults
     35.93            -1.2       34.77        perf-stat.overall.cache-miss-rate%
 1.214e+10           +11.5%  1.353e+10        perf-stat.ps.branch-instructions
  22276184           +13.3%   25240906        perf-stat.ps.branch-misses
  95133798 ±  3%      +9.8%  1.045e+08 ±  4%  perf-stat.ps.cache-misses
 2.648e+08 ±  2%     +13.5%  3.004e+08 ±  4%  perf-stat.ps.cache-references
     77612 ±  6%     +23.8%      96101 ±  5%  perf-stat.ps.context-switches
 1.813e+11           +11.1%  2.015e+11        perf-stat.ps.cpu-cycles
      9093 ±  3%     +21.5%      11047 ±  2%  perf-stat.ps.cpu-migrations
 6.192e+10           +11.5%  6.901e+10        perf-stat.ps.instructions
      4639            +5.6%       4899        perf-stat.ps.minor-faults
      4640            +5.6%       4899        perf-stat.ps.page-faults




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
Oliver Sang May 22, 2024, 8:51 a.m. UTC | #4
hi, Linus, hi, Yafang Shao,


On Wed, May 15, 2024 at 09:05:24AM -0700, Linus Torvalds wrote:
> Oliver,
>  is there any chance you could run this through the test robot
> performance suite? The original full patch at
> 
>     https://lore.kernel.org/all/20240515091727.22034-1-laoar.shao@gmail.com/
> 
> and it would be interesting if the test robot could see if the patch
> makes any difference on any other loads?
> 

we just reported a stress-ng performance improvement by this patch [1]

test robot applied this patch upon
  3c999d1ae3 ("Merge tag 'wq-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq")

filesystem is not our team's major domain, so we just made some limited review
of the results, and decided to send out the report FYI.

at first stage, we decided to check below catagories of tests as priority:

stress-ng filesystem
filebench mailserver
reaim fileserver

we also pick sysbench-fileio, blogbench into coverage.

here is a summary.

for stress-ng, besided [1] which was reported, we got below data that are
about this patch comparing to 3c999d1ae3.

either there is no significant performance change, or the change is smaller
than the noise which will make test robot's bisect fail, so these information
is just FYI. and if you have any doubt about any subtests, could you let us know
then we could check further?

(also included some net test results)

      12.87 ±  6%      -0.6%      12.79        stress-ng.xattr.ops_per_sec
       6721 ±  5%      +7.5%       7224 ± 27%  stress-ng.rawdev.ops_per_sec
       9002 ±  7%      -8.7%       8217        stress-ng.dirmany.ops_per_sec
    8594743 ±  4%      -3.0%    8337417        stress-ng.rawsock.ops_per_sec
       2056 ±  3%      +2.9%       2116        stress-ng.dirdeep.ops_per_sec
       4307 ± 21%      -6.9%       4009        stress-ng.dir.ops_per_sec
     137946 ± 18%      +5.8%     145942        stress-ng.fiemap.ops_per_sec
   22413006 ±  2%      +2.5%   22982512 ±  2%  stress-ng.sockdiag.ops_per_sec
     286714 ±  2%      -3.8%     275876 ±  5%  stress-ng.udp-flood.ops_per_sec
      82904 ± 46%     -31.6%      56716        stress-ng.sctp.ops_per_sec
    9853408            -0.3%    9826387        stress-ng.ping-sock.ops_per_sec
      84667 ± 12%     -26.7%      62050 ± 17%  stress-ng.dccp.ops_per_sec
      61750 ± 25%     -24.2%      46821 ± 38%  stress-ng.open.ops_per_sec
     583443 ±  3%      -3.4%     563822        stress-ng.file-ioctl.ops_per_sec
      11919 ± 28%     -34.3%       7833        stress-ng.dentry.ops_per_sec
      18.59 ± 12%     -23.9%      14.15 ± 27%  stress-ng.swap.ops_per_sec
     246.37 ±  2%     +15.9%     285.58 ± 12%  stress-ng.aiol.ops_per_sec
       7.45            -4.8%       7.10 ±  7%  stress-ng.fallocate.ops_per_sec
     207.97 ±  7%      +5.2%     218.70        stress-ng.copy-file.ops_per_sec
      69.87 ±  7%      +5.8%      73.93 ±  5%  stress-ng.fpunch.ops_per_sec
       0.25 ± 21%     +24.0%       0.31        stress-ng.inode-flags.ops_per_sec
     849.35 ±  6%      +1.4%     861.51        stress-ng.mknod.ops_per_sec
     926144 ±  4%      -5.2%     877558        stress-ng.lease.ops_per_sec
      82924            -2.1%      81220        stress-ng.fcntl.ops_per_sec
       6.19 ±124%     -50.7%       3.05        stress-ng.chattr.ops_per_sec
     676.90 ±  4%      -1.9%     663.94 ±  5%  stress-ng.iomix.ops_per_sec
       0.93 ±  6%      +5.6%       0.98 ±  7%  stress-ng.symlink.ops_per_sec
    1703608            -3.8%    1639057 ±  3%  stress-ng.eventfd.ops_per_sec
    1735861            -0.6%    1726072        stress-ng.sockpair.ops_per_sec
      85440            -2.0%      83705        stress-ng.rawudp.ops_per_sec
       6198            +0.6%       6236        stress-ng.sockabuse.ops_per_sec
      39226            +0.0%      39234        stress-ng.sock.ops_per_sec
       1358            +0.3%       1363        stress-ng.tun.ops_per_sec
    9794021            -1.7%    9623340        stress-ng.icmp-flood.ops_per_sec
    1324728            +0.3%    1328244        stress-ng.epoll.ops_per_sec
     146150            -2.0%     143231        stress-ng.rawpkt.ops_per_sec
    6381112            -0.4%    6352696        stress-ng.udp.ops_per_sec
    1234258            +0.2%    1236738        stress-ng.sockfd.ops_per_sec
      23954            -0.1%      23932        stress-ng.sockmany.ops_per_sec
     257030            -0.1%     256860        stress-ng.netdev.ops_per_sec
    6337097            +0.1%    6341130        stress-ng.flock.ops_per_sec
     173212            -0.3%     172728        stress-ng.rename.ops_per_sec
     199.69            +0.6%     200.82        stress-ng.sync-file.ops_per_sec
     606.57            +0.8%     611.53        stress-ng.chown.ops_per_sec
     183549            -0.9%     181975        stress-ng.handle.ops_per_sec
       1299            +0.0%       1299        stress-ng.hdd.ops_per_sec
   98371066            +0.2%   98571113        stress-ng.lockofd.ops_per_sec
      25.49            -4.3%      24.39        stress-ng.ioprio.ops_per_sec
   96745191            -1.5%   95333632        stress-ng.locka.ops_per_sec
     582.35            +0.1%     582.86        stress-ng.chmod.ops_per_sec
    2075897            -2.2%    2029552        stress-ng.getdent.ops_per_sec
      60.47            -1.9%      59.34        stress-ng.metamix.ops_per_sec
      14161            -0.3%      14123        stress-ng.io.ops_per_sec
      23.98            -1.5%      23.61        stress-ng.link.ops_per_sec
      27514            +0.0%      27528        stress-ng.filename.ops_per_sec
      44955            +1.6%      45678        stress-ng.dnotify.ops_per_sec
     160.94            +0.4%     161.51        stress-ng.inotify.ops_per_sec
    2452224            +4.0%    2549607        stress-ng.lockf.ops_per_sec
       6761            +0.3%       6779        stress-ng.fsize.ops_per_sec
     775083            -1.5%     763487        stress-ng.fanotify.ops_per_sec
     309124            -4.2%     296285        stress-ng.utime.ops_per_sec
      25567            -0.1%      25530        stress-ng.dup.ops_per_sec
       1858            +0.9%       1876        stress-ng.procfs.ops_per_sec
     105804            -3.9%     101658        stress-ng.access.ops_per_sec
       1.04            -1.9%       1.02        stress-ng.chdir.ops_per_sec
      82753            -0.3%      82480        stress-ng.fstat.ops_per_sec
     681128            +3.7%     706375        stress-ng.acl.ops_per_sec
      11892            -0.1%      11875        stress-ng.bind-mount.ops_per_sec


for filebench, similar results, but data is less unstable than stress-ng, which
means for most of them, we regarded them as that they should not be impacted by
this patch.

for reaim/sysbench-fileio/blogbench, the data are quite stable, and we didn't
notice any significant performance changes. we even doubt whether they go
through the code path changed by this patch.

so for these, we don't list full results here.

BTW, besides filesystem tests, this patch is also piped into other performance
test categories such like net, scheduler, mm and others, and since it also goes
into our so-called hourly kernels, it could run by full other performance test
suites which test robot supports. so in following 2-3 weeks, it's still possible
for us to report other results including regression.

thanks

[1] https://lore.kernel.org/all/202405221518.ecea2810-oliver.sang@intel.com/


> Thanks,
>                      Linus
> 
> On Wed, 15 May 2024 at 02:17, Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > Our applications, built on Elasticsearch[0], frequently create and delete
> > files. These applications operate within containers, some with a memory
> > limit exceeding 100GB. Over prolonged periods, the accumulation of negative
> > dentries within these containers can amount to tens of gigabytes.
> >
> > Upon container exit, directories are deleted. However, due to the numerous
> > associated dentries, this process can be time-consuming. Our users have
> > expressed frustration with this prolonged exit duration, which constitutes
> > our first issue.
Linus Torvalds May 22, 2024, 4 p.m. UTC | #5
On Wed, 22 May 2024 at 01:12, kernel test robot <oliver.sang@intel.com> wrote:
>
> kernel test robot noticed a 6.7% improvement of stress-ng.touch.ops_per_sec on:
>
> commit: 3681ce364442ce2ec7c7fbe90ad0aca651db475b ("[PATCH] vfs: Delete the associated dentry when deleting a file")

Ok, since everything else is at least tentatively in the noise, and
the only hard numbers we have are the ones from Yafang's Elasticsearch
load and this - both of which say that this is a good patch - I
decided to just apply this ASAP just to get more testing.

I just wanted to note very explicitly that this is very much
tentative: this will be reverted very aggressively if somebody reports
some other real-world load performance issues, and we'll have to look
at other solutions. But I just don't think we'll get much more actual
testing of this without just saying "let's try it".

Also, I ended up removing the part of the patch that stopped clearing
the DCACHE_CANT_MOUNT bit. I think it's right, but it's really
unrelated to the actual problem at hand, and there are other cleanups
- like the unnecessary dget/dput pair - in this path that could also
be looked at.

Anyway, let's see if somebody notices any issues with this. And I
think we should look at the "shrink dentries" case anyway for other
reasons, since it's easy to create a ton of negative dentries with
just lots of lookups (rather than lots of unlinking of existing
files).

Of course, if you do billions of lookups of different files that do
not exist in the same directory, I suspect you just have yourself to
blame, so the "lots of negative lookups" load doesn't sound
particularly realistic.

TLDR; I applied it for testing because we're in the merge window and
there's no reason not to try.

                 Linus
Matthew Wilcox May 22, 2024, 5:13 p.m. UTC | #6
On Wed, May 22, 2024 at 09:00:03AM -0700, Linus Torvalds wrote:
> Of course, if you do billions of lookups of different files that do
> not exist in the same directory, I suspect you just have yourself to
> blame, so the "lots of negative lookups" load doesn't sound
> particularly realistic.

Oh no.  We have real customers that this hits and it's not even stupid.

Every time an "event" happens, they look up something like a hash in three
different directories (like $PATH or multiple -I flags to the compiler).
Order is important, so they can't just look it up in the directory that
it's most likely to exist in.  It usually fails to exist in directory A
and B, so we create dentries that say it doesn't.  And those dentries are
literally never referenced again.  Then directory C has the file they're
looking for (or it doesn't and it gets created because the process has
write access to C and not A or B).

plan9 handles this so much better because it has that union-mount stuff
instead of search paths.  So it creates one dentry that tells it which of
those directories it actually exists in.  But we're stuck with unix-style
search paths, so life is pain.
Linus Torvalds May 22, 2024, 6:11 p.m. UTC | #7
On Wed, 22 May 2024 at 10:14, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, May 22, 2024 at 09:00:03AM -0700, Linus Torvalds wrote:
> > Of course, if you do billions of lookups of different files that do
> > not exist in the same directory, I suspect you just have yourself to
> > blame, so the "lots of negative lookups" load doesn't sound
> > particularly realistic.
>
> Oh no.  We have real customers that this hits and it's not even stupid.

Oh, negative dentries exist, and yes, they are a major feature on some
loads. Exactly because of the kinds of situations you describe.

In fact, that's the _point_. Things like PATH lookup require negative
dentries for good performance, because otherwise all the missed cases
would force a lookup all the way out to the filesystem.

So having thousands of negative dentries is normal and expected. And
it will grow for bigger machines and loads, of course.

That said, I don't think we've had people on real loads complain about
them being in the hundreds of millions like Yafang's case.

> plan9 handles this so much better because it has that union-mount stuff
> instead of search paths.  So it creates one dentry that tells it which of
> those directories it actually exists in.  But we're stuck with unix-style
> search paths, so life is pain.

I suspect you are just not as aware of the downsides of the plan9 models.

People tend to think plan9 was great. It had huge and fundamental
design mistakes.

Union mounts à la plan9 aren't hugely wonderful, and there's a reason
overlayfs does things differently (not that that is hugely wonderful
either).

             Linus
Yafang Shao May 23, 2024, 2:21 a.m. UTC | #8
On Wed, May 22, 2024 at 4:51 PM Oliver Sang <oliver.sang@intel.com> wrote:
>
>
> hi, Linus, hi, Yafang Shao,
>
>
> On Wed, May 15, 2024 at 09:05:24AM -0700, Linus Torvalds wrote:
> > Oliver,
> >  is there any chance you could run this through the test robot
> > performance suite? The original full patch at
> >
> >     https://lore.kernel.org/all/20240515091727.22034-1-laoar.shao@gmail.com/
> >
> > and it would be interesting if the test robot could see if the patch
> > makes any difference on any other loads?
> >
>
> we just reported a stress-ng performance improvement by this patch [1]

Awesome!

>
> test robot applied this patch upon
>   3c999d1ae3 ("Merge tag 'wq-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq")
>
> filesystem is not our team's major domain, so we just made some limited review
> of the results, and decided to send out the report FYI.
>
> at first stage, we decided to check below catagories of tests as priority:
>
> stress-ng filesystem
> filebench mailserver
> reaim fileserver
>
> we also pick sysbench-fileio, blogbench into coverage.
>
> here is a summary.
>
> for stress-ng, besided [1] which was reported, we got below data that are
> about this patch comparing to 3c999d1ae3.
>
> either there is no significant performance change, or the change is smaller
> than the noise which will make test robot's bisect fail, so these information
> is just FYI. and if you have any doubt about any subtests, could you let us know
> then we could check further?
>
> (also included some net test results)
>
>       12.87 ą  6%      -0.6%      12.79        stress-ng.xattr.ops_per_sec
>        6721 ą  5%      +7.5%       7224 ą 27%  stress-ng.rawdev.ops_per_sec
>        9002 ą  7%      -8.7%       8217        stress-ng.dirmany.ops_per_sec
>     8594743 ą  4%      -3.0%    8337417        stress-ng.rawsock.ops_per_sec
>        2056 ą  3%      +2.9%       2116        stress-ng.dirdeep.ops_per_sec
>        4307 ą 21%      -6.9%       4009        stress-ng.dir.ops_per_sec
>      137946 ą 18%      +5.8%     145942        stress-ng.fiemap.ops_per_sec
>    22413006 ą  2%      +2.5%   22982512 ą  2%  stress-ng.sockdiag.ops_per_sec
>      286714 ą  2%      -3.8%     275876 ą  5%  stress-ng.udp-flood.ops_per_sec
>       82904 ą 46%     -31.6%      56716        stress-ng.sctp.ops_per_sec
>     9853408            -0.3%    9826387        stress-ng.ping-sock.ops_per_sec
>       84667 ą 12%     -26.7%      62050 ą 17%  stress-ng.dccp.ops_per_sec
>       61750 ą 25%     -24.2%      46821 ą 38%  stress-ng.open.ops_per_sec
>      583443 ą  3%      -3.4%     563822        stress-ng.file-ioctl.ops_per_sec
>       11919 ą 28%     -34.3%       7833        stress-ng.dentry.ops_per_sec
>       18.59 ą 12%     -23.9%      14.15 ą 27%  stress-ng.swap.ops_per_sec
>      246.37 ą  2%     +15.9%     285.58 ą 12%  stress-ng.aiol.ops_per_sec
>        7.45            -4.8%       7.10 ą  7%  stress-ng.fallocate.ops_per_sec
>      207.97 ą  7%      +5.2%     218.70        stress-ng.copy-file.ops_per_sec
>       69.87 ą  7%      +5.8%      73.93 ą  5%  stress-ng.fpunch.ops_per_sec
>        0.25 ą 21%     +24.0%       0.31        stress-ng.inode-flags.ops_per_sec
>      849.35 ą  6%      +1.4%     861.51        stress-ng.mknod.ops_per_sec
>      926144 ą  4%      -5.2%     877558        stress-ng.lease.ops_per_sec
>       82924            -2.1%      81220        stress-ng.fcntl.ops_per_sec
>        6.19 ą124%     -50.7%       3.05        stress-ng.chattr.ops_per_sec
>      676.90 ą  4%      -1.9%     663.94 ą  5%  stress-ng.iomix.ops_per_sec
>        0.93 ą  6%      +5.6%       0.98 ą  7%  stress-ng.symlink.ops_per_sec
>     1703608            -3.8%    1639057 ą  3%  stress-ng.eventfd.ops_per_sec
>     1735861            -0.6%    1726072        stress-ng.sockpair.ops_per_sec
>       85440            -2.0%      83705        stress-ng.rawudp.ops_per_sec
>        6198            +0.6%       6236        stress-ng.sockabuse.ops_per_sec
>       39226            +0.0%      39234        stress-ng.sock.ops_per_sec
>        1358            +0.3%       1363        stress-ng.tun.ops_per_sec
>     9794021            -1.7%    9623340        stress-ng.icmp-flood.ops_per_sec
>     1324728            +0.3%    1328244        stress-ng.epoll.ops_per_sec
>      146150            -2.0%     143231        stress-ng.rawpkt.ops_per_sec
>     6381112            -0.4%    6352696        stress-ng.udp.ops_per_sec
>     1234258            +0.2%    1236738        stress-ng.sockfd.ops_per_sec
>       23954            -0.1%      23932        stress-ng.sockmany.ops_per_sec
>      257030            -0.1%     256860        stress-ng.netdev.ops_per_sec
>     6337097            +0.1%    6341130        stress-ng.flock.ops_per_sec
>      173212            -0.3%     172728        stress-ng.rename.ops_per_sec
>      199.69            +0.6%     200.82        stress-ng.sync-file.ops_per_sec
>      606.57            +0.8%     611.53        stress-ng.chown.ops_per_sec
>      183549            -0.9%     181975        stress-ng.handle.ops_per_sec
>        1299            +0.0%       1299        stress-ng.hdd.ops_per_sec
>    98371066            +0.2%   98571113        stress-ng.lockofd.ops_per_sec
>       25.49            -4.3%      24.39        stress-ng.ioprio.ops_per_sec
>    96745191            -1.5%   95333632        stress-ng.locka.ops_per_sec
>      582.35            +0.1%     582.86        stress-ng.chmod.ops_per_sec
>     2075897            -2.2%    2029552        stress-ng.getdent.ops_per_sec
>       60.47            -1.9%      59.34        stress-ng.metamix.ops_per_sec
>       14161            -0.3%      14123        stress-ng.io.ops_per_sec
>       23.98            -1.5%      23.61        stress-ng.link.ops_per_sec
>       27514            +0.0%      27528        stress-ng.filename.ops_per_sec
>       44955            +1.6%      45678        stress-ng.dnotify.ops_per_sec
>      160.94            +0.4%     161.51        stress-ng.inotify.ops_per_sec
>     2452224            +4.0%    2549607        stress-ng.lockf.ops_per_sec
>        6761            +0.3%       6779        stress-ng.fsize.ops_per_sec
>      775083            -1.5%     763487        stress-ng.fanotify.ops_per_sec
>      309124            -4.2%     296285        stress-ng.utime.ops_per_sec
>       25567            -0.1%      25530        stress-ng.dup.ops_per_sec
>        1858            +0.9%       1876        stress-ng.procfs.ops_per_sec
>      105804            -3.9%     101658        stress-ng.access.ops_per_sec
>        1.04            -1.9%       1.02        stress-ng.chdir.ops_per_sec
>       82753            -0.3%      82480        stress-ng.fstat.ops_per_sec
>      681128            +3.7%     706375        stress-ng.acl.ops_per_sec
>       11892            -0.1%      11875        stress-ng.bind-mount.ops_per_sec
>
>
> for filebench, similar results, but data is less unstable than stress-ng, which
> means for most of them, we regarded them as that they should not be impacted by
> this patch.
>
> for reaim/sysbench-fileio/blogbench, the data are quite stable, and we didn't
> notice any significant performance changes. we even doubt whether they go
> through the code path changed by this patch.
>
> so for these, we don't list full results here.
>
> BTW, besides filesystem tests, this patch is also piped into other performance
> test categories such like net, scheduler, mm and others, and since it also goes
> into our so-called hourly kernels, it could run by full other performance test
> suites which test robot supports. so in following 2-3 weeks, it's still possible
> for us to report other results including regression.
>

That's great. Many thanks for your help.
diff mbox series

Patch

diff --git a/fs/dcache.c b/fs/dcache.c
index 71a8e943a0fa..2ffdb98e9166 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2360,19 +2360,17 @@  EXPORT_SYMBOL(d_hash_and_lookup);
  * - unhash this dentry and free it.
  *
  * Usually, we want to just turn this into
- * a negative dentry, but if anybody else is
- * currently using the dentry or the inode
- * we can't do that and we fall back on removing
- * it from the hash queues and waiting for
- * it to be deleted later when it has no users
+ * a negative dentry, but certain workloads can
+ * generate a large number of negative dentries.
+ * Therefore, it would be better to simply
+ * unhash it.
  */
  
 /**
  * d_delete - delete a dentry
  * @dentry: The dentry to delete
  *
- * Turn the dentry into a negative dentry if possible, otherwise
- * remove it from the hash queues so it can be deleted later
+ * Remove the dentry from the hash queues so it can be deleted later.
  */
  
 void d_delete(struct dentry * dentry)
@@ -2381,14 +2379,14 @@  void d_delete(struct dentry * dentry)
 
 	spin_lock(&inode->i_lock);
 	spin_lock(&dentry->d_lock);
+	__d_drop(dentry);
+
 	/*
 	 * Are we the only user?
 	 */
 	if (dentry->d_lockref.count == 1) {
-		dentry->d_flags &= ~DCACHE_CANT_MOUNT;
 		dentry_unlink_inode(dentry);
 	} else {
-		__d_drop(dentry);
 		spin_unlock(&dentry->d_lock);
 		spin_unlock(&inode->i_lock);
 	}