mbox series

[v3,0/3] symlink length caching

Message ID 20241120112037.822078-1-mjguzik@gmail.com (mailing list archive)
Headers show
Series symlink length caching | expand

Message

Mateusz Guzik Nov. 20, 2024, 11:20 a.m. UTC
quote:
    When utilized it dodges strlen() in vfs_readlink(), giving about 1.5%
    speed up when issuing readlink on /initrd.img on ext4.

The size is stored in a union with i_devices, which is never looked at
unless the inode is for a device.

Benchmark code at the bottom.

ext4 and tmpfs are patched, other filesystems can also get there with
some more work.

Arguably the current get_link API should be patched to let the fs return
the size, but that's not a churn I'm interested into diving in.

On my v1 Jan remarked 1.5% is not a particularly high win questioning
whether doing this makes sense. I noted the value is only this small
because of other slowdowns.

To elaborate here are highlights while benching on Sapphire Rapids:
1. putname using atomics (over 3.5% on my profile)

sounds like Al has plans to do something here(?), I'm not touching it if
it can be helped. the atomic definitely does not have to be there in the
common case.

2. kmem_cache_alloc_noprof/kmem_cache_free (over 7% combined) 

They are both dog slow due to cmpxchg16b. A patchset was posted which
adds a different allocation/free fast path which should whack majority
of the problem, see: https://lore.kernel.org/linux-mm/20241112-slub-percpu-caches-v1-0-ddc0bdc27e05@suse.cz/

If this lands I'll definitely try to make the pathname allocs use it,
should drop about 5-6 percentage points on this sucker.

3. __legitimize_mnt issues a full fence (again over 3%)

As far as avoiding the fence is concerned waiting on rcu grace period on
unmount should do the trick. However, I found there is a bunch
complexity there to sort out before doing this will be feasible (notably
there are multiple mounts freed in one go, this needs to be batched).
There may be other issues which I missed and which make this not worth
it, but the fence is definitely avoidable in principle and I would be
surprised if there was no way to sensibly get there. No ETA, for now I'm
just pointing this out.

There is also the idea to speculatively elide lockref, but when I tried
that last time I ran into significant LSM-related trouble.

All that aside there is also quite a bit of branching and func calling
which does not need to be there (example: make vfsuid/vfsgid, could be
combined into one routine etc.).

Ultimately there is single-threaded perf left on the table in various
spots.

v3:
- use a union instead of a dedicated field, used up with i_devices

v2:
- add a dedicated field, flag and a helper instead of using i_size
https://lore.kernel.org/linux-fsdevel/20241119094555.660666-1-mjguzik@gmail.com/

v1 can be found here:
https://lore.kernel.org/linux-fsdevel/20241118085357.494178-1-mjguzik@gmail.com/

benchmark:
plug into will-it-scale into tests/readlink1.c:

#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <assert.h>
#include <string.h>

char *testcase_description = "readlink /initrd.img";

void testcase(unsigned long long *iterations, unsigned long nr)
{
        char *tmplink = "/initrd.img";
        char buf[1024];

        while (1) {
                int error = readlink(tmplink, buf, sizeof(buf));
                assert(error > 0);

                (*iterations)++;
        }
}

Mateusz Guzik (3):
  vfs: support caching symlink lengths in inodes
  ext4: use inode_set_cached_link()
  tmpfs: use inode_set_cached_link()

 fs/ext4/inode.c                |  3 ++-
 fs/ext4/namei.c                |  4 +++-
 fs/namei.c                     | 34 +++++++++++++++++++---------------
 fs/proc/namespaces.c           |  2 +-
 include/linux/fs.h             | 15 +++++++++++++--
 mm/shmem.c                     |  6 ++++--
 security/apparmor/apparmorfs.c |  2 +-
 7 files changed, 43 insertions(+), 23 deletions(-)

Comments

Christian Brauner Nov. 21, 2024, 12:34 p.m. UTC | #1
On Wed, 20 Nov 2024 12:20:33 +0100, Mateusz Guzik wrote:
> quote:
>     When utilized it dodges strlen() in vfs_readlink(), giving about 1.5%
>     speed up when issuing readlink on /initrd.img on ext4.
> 
> The size is stored in a union with i_devices, which is never looked at
> unless the inode is for a device.
> 
> [...]

Applied to the vfs-6.14.misc branch of the vfs/vfs.git tree.
Patches in the vfs-6.14.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.14.misc

[1/3] vfs: support caching symlink lengths in inodes
      https://git.kernel.org/vfs/vfs/c/5cbb3c7e0051
[2/3] ext4: use inode_set_cached_link()
      https://git.kernel.org/vfs/vfs/c/740456f67017
[3/3] tmpfs: use inode_set_cached_link()
      https://git.kernel.org/vfs/vfs/c/30071e02c163
Jeff Layton Nov. 21, 2024, 2:16 p.m. UTC | #2
On Wed, 2024-11-20 at 12:20 +0100, Mateusz Guzik wrote:
> quote:
>     When utilized it dodges strlen() in vfs_readlink(), giving about 1.5%
>     speed up when issuing readlink on /initrd.img on ext4.
> 
> The size is stored in a union with i_devices, which is never looked at
> unless the inode is for a device.
> 
> Benchmark code at the bottom.
> 
> ext4 and tmpfs are patched, other filesystems can also get there with
> some more work.
> 
> Arguably the current get_link API should be patched to let the fs return
> the size, but that's not a churn I'm interested into diving in.
> 
> On my v1 Jan remarked 1.5% is not a particularly high win questioning
> whether doing this makes sense. I noted the value is only this small
> because of other slowdowns.
> 
> To elaborate here are highlights while benching on Sapphire Rapids:
> 1. putname using atomics (over 3.5% on my profile)
> 
> sounds like Al has plans to do something here(?), I'm not touching it if
> it can be helped. the atomic definitely does not have to be there in the
> common case.
> 
> 2. kmem_cache_alloc_noprof/kmem_cache_free (over 7% combined) 
> 
> They are both dog slow due to cmpxchg16b. A patchset was posted which
> adds a different allocation/free fast path which should whack majority
> of the problem, see: https://lore.kernel.org/linux-mm/20241112-slub-percpu-caches-v1-0-ddc0bdc27e05@suse.cz/
> 
> If this lands I'll definitely try to make the pathname allocs use it,
> should drop about 5-6 percentage points on this sucker.
> 
> 3. __legitimize_mnt issues a full fence (again over 3%)
> 
> As far as avoiding the fence is concerned waiting on rcu grace period on
> unmount should do the trick. However, I found there is a bunch
> complexity there to sort out before doing this will be feasible (notably
> there are multiple mounts freed in one go, this needs to be batched).
> There may be other issues which I missed and which make this not worth
> it, but the fence is definitely avoidable in principle and I would be
> surprised if there was no way to sensibly get there. No ETA, for now I'm
> just pointing this out.
> 
> There is also the idea to speculatively elide lockref, but when I tried
> that last time I ran into significant LSM-related trouble.
> 
> All that aside there is also quite a bit of branching and func calling
> which does not need to be there (example: make vfsuid/vfsgid, could be
> combined into one routine etc.).
> 
> Ultimately there is single-threaded perf left on the table in various
> spots.
> 
> v3:
> - use a union instead of a dedicated field, used up with i_devices
> 
> v2:
> - add a dedicated field, flag and a helper instead of using i_size
> https://lore.kernel.org/linux-fsdevel/20241119094555.660666-1-mjguzik@gmail.com/
> 
> v1 can be found here:
> https://lore.kernel.org/linux-fsdevel/20241118085357.494178-1-mjguzik@gmail.com/
> 
> benchmark:
> plug into will-it-scale into tests/readlink1.c:
> 
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <assert.h>
> #include <string.h>
> 
> char *testcase_description = "readlink /initrd.img";
> 
> void testcase(unsigned long long *iterations, unsigned long nr)
> {
>         char *tmplink = "/initrd.img";
>         char buf[1024];
> 
>         while (1) {
>                 int error = readlink(tmplink, buf, sizeof(buf));
>                 assert(error > 0);
> 
>                 (*iterations)++;
>         }
> }
> 
> Mateusz Guzik (3):
>   vfs: support caching symlink lengths in inodes
>   ext4: use inode_set_cached_link()
>   tmpfs: use inode_set_cached_link()
> 
>  fs/ext4/inode.c                |  3 ++-
>  fs/ext4/namei.c                |  4 +++-
>  fs/namei.c                     | 34 +++++++++++++++++++---------------
>  fs/proc/namespaces.c           |  2 +-
>  include/linux/fs.h             | 15 +++++++++++++--
>  mm/shmem.c                     |  6 ++++--
>  security/apparmor/apparmorfs.c |  2 +-
>  7 files changed, 43 insertions(+), 23 deletions(-)
> 

Nice work, Mateusz!

Reviewed-by: Jeff Layton <jlayton@kernel.org>