xfs_release lock contention

Message ID ejy4ska7orznl75364ehskucg7ibo3j3biwkui6q6mv42im6o5@pzl7pwwxjrg3 (mailing list archive)
State Accepted, archived
Series: xfs_release lock contention

Commit Message

Mateusz Guzik Aug. 7, 2024, 4:27 a.m. UTC
I'm looking at false-sharing problems concerning the multicore open + read +
close cycle on one inode, and during my survey I found that xfs heavily
serializes on a spinlock in xfs_release, making it perform the worst
of the btrfs/ext4/xfs trio.

A trivial test case plopped into will-it-scale is at the end.

bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] = count(); }' tells me:
[snip]
@[
    __pv_queued_spin_lock_slowpath+5
    _raw_spin_lock_irqsave+49
    rwsem_wake.isra.0+57
    up_write+69
    xfs_iunlock+244
    xfs_release+175
    __fput+238
    __x64_sys_close+60
    do_syscall_64+82
    entry_SYSCALL_64_after_hwframe+118
]: 41132
@[
    __pv_queued_spin_lock_slowpath+5
    _raw_spin_lock_irq+42
    rwsem_down_read_slowpath+164
    down_read+72
    xfs_ilock+125
    xfs_file_buffered_read+71
    xfs_file_read_iter+115
    vfs_read+604
    ksys_read+103
    do_syscall_64+82
    entry_SYSCALL_64_after_hwframe+118
]: 137639
@[
    __pv_queued_spin_lock_slowpath+5
    _raw_spin_lock+41
    xfs_release+196
    __fput+238
    __x64_sys_close+60
    do_syscall_64+82
    entry_SYSCALL_64_after_hwframe+118
]: 1432766

The xfs_release -> _raw_spin_lock thing is the XFS_ITRUNCATED flag test.
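
For reference, the spinlock in question is ip->i_flags_lock; the flag
helper in fs/xfs/xfs_inode.h looks roughly like this (sketched from
memory, the exact code may differ by kernel version):

static inline int
xfs_iflags_test_and_clear(xfs_inode_t *ip, unsigned long flags)
{
        int ret;

        /* taken unconditionally, even when the flag is not set */
        spin_lock(&ip->i_flags_lock);
        ret = ip->i_flags & flags;
        if (ret)
                ip->i_flags &= ~flags;
        spin_unlock(&ip->i_flags_lock);
        return ret;
}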

Also note how the eofblocks code taking a write trylock on the iolock
gets in the way of the reads (first two stacks above).
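
For context, the close path in fs/xfs/xfs_inode.c does roughly the
following (again a from-memory sketch, details vary by version): the
eofblocks trim takes the iolock exclusive, the same rwsem the readers
take shared:

        if (xfs_can_free_eofblocks(ip, false)) {
                /* readers hold this shared in xfs_file_buffered_read() */
                if (xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
                        error = xfs_free_eofblocks(ip);
                        /* up_write -> rwsem_wake in the first stack */
                        xfs_iunlock(ip, XFS_IOLOCK_EXCL);
                }
        }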

A general note: for most real files there are no blocks past EOF and no
truncation going on, so there is presumably a way to handle this
locklessly, which should also reduce single-threaded overhead.

For testing purposes I wrote a total hack which merely branches on
i_flags and i_delayed_blks == 0, but I have no idea if that's any good
-- *something* definitely is doable here; I leave that to people who
know the fs.

When running a kernel with a change whacking one lockref cycle on open:
https://lore.kernel.org/linux-fsdevel/20240806163256.882140-1-mjguzik@gmail.com/T/#u

... and toggling the short-circuit outlined above, I'm seeing +50% ops/s,
putting xfs above btrfs. The top of the profile is then the false sharing
I'm already looking at with the other filesystems.

So that would be my report. Whatever you guys do with it is your
business; I'm bailing to go work on the false-sharing problem.

Should you address this in a committable manner, there is no need to
credit or cc me.

Cheers.

the not patch is in the "Patch" section at the end.


test case (plop into will-it-scale, say tests/openreadclose3.c and run
./openreadclose3_processes -t 24):

#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <assert.h>
#include <string.h>	/* for memset() */

#define BUFSIZE 1024

static char tmpfile[] = "/tmp/willitscale.XXXXXX";

char *testcase_description = "Same file open/read/close";

void testcase_prepare(unsigned long nr_tasks)
{
        char buf[BUFSIZE];
        int fd = mkstemp(tmpfile);

        assert(fd >= 0);
        memset(buf, 'A', sizeof(buf));
        assert(write(fd, buf, sizeof(buf)) == sizeof(buf));
        close(fd);
}

void testcase(unsigned long long *iterations, unsigned long nr)
{
        char buf[BUFSIZE];

        while (1) {
                int fd = open(tmpfile, O_RDONLY);
                assert(fd >= 0);
                assert(read(fd, buf, sizeof(buf)) == sizeof(buf));
                close(fd);

                (*iterations)++;
        }
}

void testcase_cleanup(void)
{
        unlink(tmpfile);
}
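
Not part of the original test case: if you want to run it outside
will-it-scale, a minimal standalone driver along these lines should do
(testcase() loops forever, so run it under timeout(1) or similar):

static unsigned long long iterations;

int main(void)
{
        testcase_prepare(1);
        testcase(&iterations, 0);       /* never returns */
        testcase_cleanup();             /* unreachable, kept for symmetry */
        return 0;
}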

Comments

Dave Chinner Aug. 7, 2024, 5:43 a.m. UTC | #1
On Wed, Aug 07, 2024 at 06:27:21AM +0200, Mateusz Guzik wrote:
> I'm looking at false-sharing problems concerning the multicore open + read +
> close cycle on one inode, and during my survey I found that xfs heavily
> serializes on a spinlock in xfs_release, making it perform the worst
> of the btrfs/ext4/xfs trio.
> 
> A trivial test case plopped into will-it-scale is at the end.
> 
> bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] = count(); }' tells me:
> [snip]
> @[
>     __pv_queued_spin_lock_slowpath+5
>     _raw_spin_lock_irqsave+49
>     rwsem_wake.isra.0+57
>     up_write+69
>     xfs_iunlock+244
>     xfs_release+175
>     __fput+238
>     __x64_sys_close+60
>     do_syscall_64+82
>     entry_SYSCALL_64_after_hwframe+118
> ]: 41132
> @[
>     __pv_queued_spin_lock_slowpath+5
>     _raw_spin_lock_irq+42
>     rwsem_down_read_slowpath+164
>     down_read+72
>     xfs_ilock+125
>     xfs_file_buffered_read+71
>     xfs_file_read_iter+115
>     vfs_read+604
>     ksys_read+103
>     do_syscall_64+82
>     entry_SYSCALL_64_after_hwframe+118
> ]: 137639
> @[
>     __pv_queued_spin_lock_slowpath+5
>     _raw_spin_lock+41
>     xfs_release+196
>     __fput+238
>     __x64_sys_close+60
>     do_syscall_64+82
>     entry_SYSCALL_64_after_hwframe+118
> ]: 1432766
> 
> The xfs_release -> _raw_spin_lock thing is the XFS_ITRUNCATED flag test.

Yeah, these all ring old bells in the back of my skull.

> 
> test case (plop into will-it-scale, say tests/openreadclose3.c and run
> ./openreadclose3_processes -t 24):
> 
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <assert.h>
> #include <string.h>	/* for memset() */
> 
> #define BUFSIZE 1024
> 
> static char tmpfile[] = "/tmp/willitscale.XXXXXX";
> 
> char *testcase_description = "Same file open/read/close";
> 
> void testcase_prepare(unsigned long nr_tasks)
> {
>         char buf[BUFSIZE];
>         int fd = mkstemp(tmpfile);
> 
>         assert(fd >= 0);
>         memset(buf, 'A', sizeof(buf));
>         assert(write(fd, buf, sizeof(buf)) == sizeof(buf));
>         close(fd);
> }
> 
> void testcase(unsigned long long *iterations, unsigned long nr)
> {
>         char buf[BUFSIZE];
> 
>         while (1) {
>                 int fd = open(tmpfile, O_RDONLY);
>                 assert(fd >= 0);
>                 assert(read(fd, buf, sizeof(buf)) == sizeof(buf));
>                 close(fd);

Oh, yeah, I definitely sent patches once upon a time to address
this.

<scrummage around old patch stacks>

Yep, there it is:

https://lore.kernel.org/linux-xfs/20190207050813.24271-4-david@fromorbit.com/

This would completely remove the rwsem traffic from O_RDONLY file
closes.

None of it would address the XFS_ITRUNCATED contention issue, but
that's just another of those "test, test-and-clear" cases that avoid
the atomic ops by testing if the bit is set without the lock
first....
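
A rough, untested sketch of that pattern for xfs_release(), using the
existing lockless __xfs_iflags_test() helper:

        /* peek locklessly; only take i_flags_lock if the bit may be set */
        if (__xfs_iflags_test(ip, XFS_ITRUNCATED) &&
            xfs_iflags_test_and_clear(ip, XFS_ITRUNCATED)) {
                /* ... existing XFS_ITRUNCATED handling ... */
        }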

Hmmm, I thought I saw these patches go past on the list again
recently.  Yeah, that was a couple of months ago:

https://lore.kernel.org/linux-xfs/20240623053532.857496-1-hch@lst.de/

Christoph, any progress on merging that patchset?

-Dave.
Christoph Hellwig Aug. 7, 2024, 2:37 p.m. UTC | #2
On Wed, Aug 07, 2024 at 03:43:53PM +1000, Dave Chinner wrote:
> Hmmm, I thought I saw these patches go past on the list again
> recently.  Yeah, that was a couple of months ago:
> 
> https://lore.kernel.org/linux-xfs/20240623053532.857496-1-hch@lst.de/
> 
> Christoph, any progress on merging that patchset?

I expected them to go into 6.11 but they didn't.  I'll resend them
after retesting them as there weren't any actionable items except for
a few typo fixes.
Darrick J. Wong Aug. 7, 2024, 2:56 p.m. UTC | #3
On Wed, Aug 07, 2024 at 04:37:59PM +0200, Christoph Hellwig wrote:
> On Wed, Aug 07, 2024 at 03:43:53PM +1000, Dave Chinner wrote:
> > Hmmm, I thought I saw these patches go past on the list again
> > recently.  Yeah, that was a couple of months ago:
> > 
> > https://lore.kernel.org/linux-xfs/20240623053532.857496-1-hch@lst.de/
> > 
> > Christoph, any progress on merging that patchset?
> 
> I expected them to go into 6.11 but they didn't.  I'll resend them
> after retesting them as there weren't any actionable items except for
> a few typo fixes.

Heh, I think there's like one patch out of the whole bunch that didn't
get an RVB tag, so that's probably why it didn't go in. :(

--D

Patch

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 7dc6f326936c..1cc62c21e709 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1079,6 +1079,8 @@  xfs_itruncate_extents_flags(
        return error;
 }

+extern unsigned long magic_tunable;
+
 int
 xfs_release(
        xfs_inode_t     *ip)
@@ -1089,6 +1091,13 @@  xfs_release(
        if (!S_ISREG(VFS_I(ip)->i_mode) || (VFS_I(ip)->i_mode == 0))
                return 0;

+       if (magic_tunable) {
+               if (!(ip->i_flags & XFS_ITRUNCATED))
+                       return 0;
+               if (ip->i_delayed_blks == 0)
+                       return 0;
+       }
+
        /* If this is a read-only mount, don't do this (would generate I/O) */
        if (xfs_is_readonly(mp))
                return 0;
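
(Note: magic_tunable has no definition anywhere in this not-patch, so it
won't link as posted. A hypothetical way to wire it up for testing would
be a module parameter, e.g.:)

/* hypothetical knob, not part of the posted hack; with <linux/module.h>
 * this is writable via /sys/module/xfs/parameters/magic_tunable */
unsigned long magic_tunable;
module_param(magic_tunable, ulong, 0644);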