mbox series

[v9,0/7] fuse: full atomic open and atomic-open-revalidate

Message ID 20230920173445.3943581-1-bschubert@ddn.com (mailing list archive)
Headers show
Series fuse: full atomic open and atomic-open-revalidate | expand

Message

Bernd Schubert Sept. 20, 2023, 5:34 p.m. UTC
In FUSE, as of now, uncached lookups are expensive over the wire.
E.g additional latencies and stressing (meta data) servers from
thousands of clients. With atomic-open lookup before open
can be avoided.

Here is the link to performance numbers
https://lore.kernel.org/linux-fsdevel/20220322121212.5087-1-dharamhans87@gmail.com/

Here is the libfuse pull request
https://github.com/libfuse/libfuse/pull/813

The patches are passing passthrough_hp xfstests (libfuse part applied),
although we had to introduce umount retries into xfstests, as recent
kernels/xfstests fail umount in some tests with
EBUSY - independent of atomic open. (Although outstanding for v7)

I'm especially interested in Al's and Christians opinion about the
atomic open dentry revalidation in v7. If the vfs changes are
acceptable, would it be possible to also look at the other patches
and their vfs/dcache interaction? I __hope__ I got it right and I hope
the vfs can be accepted.

v9:
    - Followed Miklos suggestion and added another patch to further
      optimize atomic revalidate/open, which avoids dentry
      acquire/release and also avoids double call into ->d_revalidate
    - Updates following Miklos' review
    - Dropped a temporary comment in patch 2/7 (accidental leftover)

v8: - Another slight indentation fix in _fuse_atomic_open
    - Fix compilation error in patch 4 (fuse atomic revalidate)
    - Remove LOOKUP_ATOMIC_REVALIDATE
    - Switch from DCACHE_ATOMIC_OPEN flag to return value and
      and introduce an enum for d_revalidate return values.
    - checkpatch fixes

v7: - Indentation and style fixes for _fuse_atomic_open.
    - Remodel atomic open to avoid races with parallel lookup, similar
      to NFS commit c94c09535c4debcc439f55b5b6d9ebe57bd4665a and what
      is done in _nfs4_open_and_get_state()
      A WARN_ONCE() and fallback is added to ensure operation is on
      negative dentries only.
    - Error handling is done via the fallback fuse_create_open()
      to reduce complexity and code duplication.
    - Remove entry cache invalidation on ENOENT in the atomic-open
      patch, as atomic-open so far operates on negative dentries only.
    - Remove fuse_advise_use_readdirplus() in _fuse_atomic_open
      (Thanks Miklos)
    - Add forgotten free_ext_value() (Thanks Miklos).
    - Declare struct fuse_inode per condition as the value needs to
      be retrieved anyway per condition.
    - Added atomic open-revalidation and required vfs changes
    - Added myself (Bernd) as Co-developed-by to Dharmendras patches, as
      I did substantial modifications.
    - More as reminder for myself, so far these tests below are
      done manually or with custom scripts, I think we need xfstests
      for these.

        With updated libfuse /scratch/dest is mounted by:
        passthrough_hp -o allow_other --foreground --debug-fuse /scratch/source /scratch/dest

        1) Test atomic open (file create) and negative dentry open

            rm -f /scratch/source/file # ensure file does not exist
            mount /scratch/dest  # overlay of /scratch source
            echo a > /scratch/dest/file # non-existing file
            umount and mount /scratch/test (clear cache)
            cat /scratch/dest/file
            rm -f /scratch/dest/file

        2) Test dir open

            mkdir /scratch/source/dir
            mount /scratch/dest  # overlay of /scratch source
            cat /scratch/source/dir
            rmdir /scratch/source/dir

        3)  Test revalidate without file change

            mount /scratch/dest
            echo "a" > /scratch/dest/file
            echo "b" >> /scratch/dest/file
            echo "c" >> /scratch/dest/file
            cat /scratch/dest/file
            rm -f /scratch/dest/file

        4)  Test revalidate by underlying file change

            mount /scratch/dest
            echo "a" > /scratch/dest/file
            cat /scratch/dest/file
            rm -f /scratch/source/file # unknown to dest mount
            str="b"
            echo "${str}" > /scratch/source/file
            reval=$(cat /scratch/dest/file)
            if [ "$str" != "reval" ]; then
                echo "String mismatch after revalidation"
                exit 1
            fi
            rm -f /scratch/dest/file

        5) Test revalidate by underlying file change, but with
           O_CREATE included. Tests dentry creation by the atomic
           revalidate

            mount /scratch/dest
            echo "a" >> /scratch/dest/file
            rm -f /scratch/source/file
            echo "b" > /scratch/source/file

            # revalidate includes O_CREATE
            echo "c" >> /scratch/dest/file

        6) Repeat above tests, but with additional "--nocache"
           passthrough_hp option

v6: Addressed Miklos comments and rewrote atomic open into its own
    function. Also dropped for now is the revalidate optimization, we
    have the code/patch, but it needs more testing. Also easier to
    first agree on atomic open and then to land the next optimization.

v5: Addressed comments

v3: Addressed comments

v4: Addressed all comments and refactored the code into 3 separate patches
    respectively for Atomic create, Atomic open, optimizing lookup in
    d_revalidate().

v3: handle review comments

v2: fixed a memory leak in <fuse_atomic_open_common>

Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dharmendra Singh <dsingh@ddn.com>
Cc: Horst Birthelmer <hbirthelmer@ddn.com>
Cc: linux-fsdevel@vger.kernel.org

Bernd Schubert (6):
  fuse: rename fuse_create_open
  fuse: introduce atomic open
  vfs: Optimize atomic_open() on positive dentry
  fuse: Revalidate positive entries in fuse_atomic_open
  fuse: Return D_REVALIDATE_ATOMIC for cached dentries
  fuse: Avoid code duplication in atomic open

Miklos Szeredi (1):
  [RFC] Allow atomic_open() on positive dentry

 fs/fuse/dir.c             | 390 +++++++++++++++++++++++++++++++++++++-
 fs/fuse/fuse_i.h          |   6 +
 fs/namei.c                |  66 ++++++-
 include/linux/namei.h     |   7 +
 include/uapi/linux/fuse.h |   3 +
 5 files changed, 461 insertions(+), 11 deletions(-)

Comments

Amir Goldstein Sept. 21, 2023, 9:33 a.m. UTC | #1
On Thu, Sep 21, 2023 at 9:31 AM Bernd Schubert <bschubert@ddn.com> wrote:
>
> In FUSE, as of now, uncached lookups are expensive over the wire.
> E.g additional latencies and stressing (meta data) servers from
> thousands of clients. With atomic-open lookup before open
> can be avoided.
>
> Here is the link to performance numbers
> https://lore.kernel.org/linux-fsdevel/20220322121212.5087-1-dharamhans87@gmail.com/
>
> Here is the libfuse pull request
> https://github.com/libfuse/libfuse/pull/813
>
> The patches are passing passthrough_hp xfstests (libfuse part applied),
> although we had to introduce umount retries into xfstests, as recent
> kernels/xfstests fail umount in some tests with
> EBUSY - independent of atomic open. (Although outstanding for v7)

Hi Bernd!

I was using xfstests to test passthrough_hp (for FUSE kernel passthrough).
FYI, I have made some improvements to the mount helper
in libfuse [1] to support remount, which helps pass a few tests.

So far, I have all the tests in group -g quick.rw pass with the baseline
passthrough_hp (over xfs).

Do you have a baseline for the entire quick/auto group to share with me?
Can you share the patch that you are using to avoid the EBUSY errors?

Note that Chritian has suggested a method to use inotify
IN_UNMOUNT event to wait for sb shutdown in fstests [2].

Thanks,
Amir.

[1] https://github.com/amir73il/libfuse/commits/fuse-backing-fd
[2] https://lore.kernel.org/linux-fsdevel/20230908-verflachen-neudefinition-4da649d673a9@brauner/
Bernd Schubert Sept. 21, 2023, 11:59 a.m. UTC | #2
On 9/21/23 11:33, Amir Goldstein wrote:
> On Thu, Sep 21, 2023 at 9:31 AM Bernd Schubert <bschubert@ddn.com> wrote:
>>
>> In FUSE, as of now, uncached lookups are expensive over the wire.
>> E.g additional latencies and stressing (meta data) servers from
>> thousands of clients. With atomic-open lookup before open
>> can be avoided.
>>
>> Here is the link to performance numbers
>> https://lore.kernel.org/linux-fsdevel/20220322121212.5087-1-dharamhans87@gmail.com/
>>
>> Here is the libfuse pull request
>> https://github.com/libfuse/libfuse/pull/813
>>
>> The patches are passing passthrough_hp xfstests (libfuse part applied),
>> although we had to introduce umount retries into xfstests, as recent
>> kernels/xfstests fail umount in some tests with
>> EBUSY - independent of atomic open. (Although outstanding for v7)
> 
> Hi Bernd!
> 
> I was using xfstests to test passthrough_hp (for FUSE kernel passthrough).
> FYI, I have made some improvements to the mount helper
> in libfuse [1] to support remount, which helps pass a few tests.

Thanks, just asked there to send it separate to upstream.

> 
> So far, I have all the tests in group -g quick.rw pass with the baseline
> passthrough_hp (over xfs).
> 
> Do you have a baseline for the entire quick/auto group to share with me?

Please find my results attached. I have opened a libfuse issue for generic/477,
(open_by_handle_at tests) but I'm not sure if this is passthrough_hp only (it
trusts the passed node id, without checking if there is an inode object for it).
Possibly fuse.ko passes an invalide node id - this is something for a rainy
weekend (or so) to investigate...


> Can you share the patch that you are using to avoid the EBUSY errors?


The simple version to avoid _most_ of EBUSY is this


diff --git a/common/rc b/common/rc
index 741579af..a40fca3b 100644
--- a/common/rc
+++ b/common/rc
@@ -305,6 +305,7 @@ _scratch_mount_idmapped()
  
  _scratch_unmount()
  {
+       sync
         case "$FSTYP" in
         overlay)
                 _overlay_scratch_unmount



The better version is this
https://github.com/kdave/xfstests/commit/33a15af07bb044e2773a83df1c7e0a0df280a4b7

> 
> Note that Chritian has suggested a method to use inotify
> IN_UNMOUNT event to wait for sb shutdown in fstests [2].

Thanks, I had seen the discussion. Although I (silently) wondered if something
like MNT_BLOCk as umount2 flag wouldn't be easier.

> 
> Thanks,
> Amir.
> 
> [1] https://github.com/amir73il/libfuse/commits/fuse-backing-fd
> [2] https://lore.kernel.org/linux-fsdevel/20230908-verflachen-neudefinition-4da649d673a9@brauner/


Thanks,
Bernd
Amir Goldstein Sept. 21, 2023, 2:24 p.m. UTC | #3
On Thu, Sep 21, 2023 at 3:00 PM Bernd Schubert <bschubert@ddn.com> wrote:
>
> On 9/21/23 11:33, Amir Goldstein wrote:
> > On Thu, Sep 21, 2023 at 9:31 AM Bernd Schubert <bschubert@ddn.com> wrote:
> >>
> >> In FUSE, as of now, uncached lookups are expensive over the wire.
> >> E.g additional latencies and stressing (meta data) servers from
> >> thousands of clients. With atomic-open lookup before open
> >> can be avoided.
> >>
> >> Here is the link to performance numbers
> >> https://lore.kernel.org/linux-fsdevel/20220322121212.5087-1-dharamhans87@gmail.com/
> >>
> >> Here is the libfuse pull request
> >> https://github.com/libfuse/libfuse/pull/813
> >>
> >> The patches are passing passthrough_hp xfstests (libfuse part applied),
> >> although we had to introduce umount retries into xfstests, as recent
> >> kernels/xfstests fail umount in some tests with
> >> EBUSY - independent of atomic open. (Although outstanding for v7)
> >
> > Hi Bernd!
> >
> > I was using xfstests to test passthrough_hp (for FUSE kernel passthrough).
> > FYI, I have made some improvements to the mount helper
> > in libfuse [1] to support remount, which helps pass a few tests.
>
> Thanks, just asked there to send it separate to upstream.
>
> >
> > So far, I have all the tests in group -g quick.rw pass with the baseline
> > passthrough_hp (over xfs).
> >
> > Do you have a baseline for the entire quick/auto group to share with me?
>
> Please find my results attached.

Not too bad.
3 more tests can pass with my mount helper fix for remount ;)

> I have opened a libfuse issue for generic/477,
> (open_by_handle_at tests) but I'm not sure if this is passthrough_hp only (it
> trusts the passed node id, without checking if there is an inode object for it).
> Possibly fuse.ko passes an invalide node id - this is something for a rainy
> weekend (or so) to investigate...

Stale file handles after mount cycle are expected.
FUSE is not equipped to handle this correctly.

NFS clients may even get access to the wrong inode
after FUSE restart/reexport, if FUSE is exported with the same
NFS fsid.

See this discussion [3] about how this could be solved hackishly
with existing FUSE protocol (for fs that know how to open by ino)
and about the LOOKUP_HANDLE protocol command that is
needed to solve this in a generic way.

>
>
> > Can you share the patch that you are using to avoid the EBUSY errors?
>
>
> The simple version to avoid _most_ of EBUSY is this
>
>
> diff --git a/common/rc b/common/rc
> index 741579af..a40fca3b 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -305,6 +305,7 @@ _scratch_mount_idmapped()
>
>   _scratch_unmount()
>   {
> +       sync
>          case "$FSTYP" in
>          overlay)
>                  _overlay_scratch_unmount
>
>
>
> The better version is this
> https://github.com/kdave/xfstests/commit/33a15af07bb044e2773a83df1c7e0a0df280a4b7
>
> >
> > Note that Chritian has suggested a method to use inotify
> > IN_UNMOUNT event to wait for sb shutdown in fstests [2].
>
> Thanks, I had seen the discussion. Although I (silently) wondered if something
> like MNT_BLOCk as umount2 flag wouldn't be easier.
>

You'd better keep wondering silently unless you want to upset Christian ;)

Thanks,
Amir.

> > [1] https://github.com/amir73il/libfuse/commits/fuse-backing-fd
> > [2] https://lore.kernel.org/linux-fsdevel/20230908-verflachen-neudefinition-4da649d673a9@brauner/
[3] https://lore.kernel.org/linux-fsdevel/CAOQ4uxg_FV8U833qVkgPaAWJ4MNcnGoy9Gci41bmak4_ROSc3g@mail.gmail.com/
Bernd Schubert Sept. 21, 2023, 2:44 p.m. UTC | #4
On 9/21/23 16:24, Amir Goldstein wrote:
> On Thu, Sep 21, 2023 at 3:00 PM Bernd Schubert <bschubert@ddn.com> wrote:
>>
>> On 9/21/23 11:33, Amir Goldstein wrote:
>>> On Thu, Sep 21, 2023 at 9:31 AM Bernd Schubert <bschubert@ddn.com> wrote:
>>>>
>>>> In FUSE, as of now, uncached lookups are expensive over the wire.
>>>> E.g additional latencies and stressing (meta data) servers from
>>>> thousands of clients. With atomic-open lookup before open
>>>> can be avoided.
>>>>
>>>> Here is the link to performance numbers
>>>> https://lore.kernel.org/linux-fsdevel/20220322121212.5087-1-dharamhans87@gmail.com/
>>>>
>>>> Here is the libfuse pull request
>>>> https://github.com/libfuse/libfuse/pull/813
>>>>
>>>> The patches are passing passthrough_hp xfstests (libfuse part applied),
>>>> although we had to introduce umount retries into xfstests, as recent
>>>> kernels/xfstests fail umount in some tests with
>>>> EBUSY - independent of atomic open. (Although outstanding for v7)
>>>
>>> Hi Bernd!
>>>
>>> I was using xfstests to test passthrough_hp (for FUSE kernel passthrough).
>>> FYI, I have made some improvements to the mount helper
>>> in libfuse [1] to support remount, which helps pass a few tests.
>>
>> Thanks, just asked there to send it separate to upstream.
>>
>>>
>>> So far, I have all the tests in group -g quick.rw pass with the baseline
>>> passthrough_hp (over xfs).
>>>
>>> Do you have a baseline for the entire quick/auto group to share with me?
>>
>> Please find my results attached.
> 
> Not too bad.
> 3 more tests can pass with my mount helper fix for remount ;)
> 
>> I have opened a libfuse issue for generic/477,
>> (open_by_handle_at tests) but I'm not sure if this is passthrough_hp only (it
>> trusts the passed node id, without checking if there is an inode object for it).
>> Possibly fuse.ko passes an invalide node id - this is something for a rainy
>> weekend (or so) to investigate...
> 
> Stale file handles after mount cycle are expected.
> FUSE is not equipped to handle this correctly.

I know and I don't have a problem with that. Issue is that the test triggers a
heap buffer overflow,  see the ASAN report here

https://github.com/libfuse/libfuse/issues/838

A possible reason might be an invalid node id by open_by_handle_at, or
lookup/release is not right. As I said, will investigate once I have a free
minute.

> 
> NFS clients may even get access to the wrong inode
> after FUSE restart/reexport, if FUSE is exported with the same
> NFS fsid.
> 
> See this discussion [3] about how this could be solved hackishly
> with existing FUSE protocol (for fs that know how to open by ino)
> and about the LOOKUP_HANDLE protocol command that is
> needed to solve this in a generic way.

I will read through it later. I would prefer adding support up to
MAX_HANDLE_SZ - our file systems typically exceed 64 bit inode sizes.
Without having it read, I would just expose exportfs methods to userspace
(which might be the LOOKUP_HANDLE protocol).


> 
>>
>>
>>> Can you share the patch that you are using to avoid the EBUSY errors?
>>
>>
>> The simple version to avoid _most_ of EBUSY is this
>>
>>
>> diff --git a/common/rc b/common/rc
>> index 741579af..a40fca3b 100644
>> --- a/common/rc
>> +++ b/common/rc
>> @@ -305,6 +305,7 @@ _scratch_mount_idmapped()
>>
>>    _scratch_unmount()
>>    {
>> +       sync
>>           case "$FSTYP" in
>>           overlay)
>>                   _overlay_scratch_unmount
>>
>>
>>
>> The better version is this
>> https://github.com/kdave/xfstests/commit/33a15af07bb044e2773a83df1c7e0a0df280a4b7
>>
>>>
>>> Note that Chritian has suggested a method to use inotify
>>> IN_UNMOUNT event to wait for sb shutdown in fstests [2].
>>
>> Thanks, I had seen the discussion. Although I (silently) wondered if something
>> like MNT_BLOCk as umount2 flag wouldn't be easier.
>>
> 
> You'd better keep wondering silently unless you want to upset Christian ;)

Ouch, Christian is in CC, inotify is fine ;)


Thanks,
Bernd
Amir Goldstein Sept. 29, 2023, 11:34 a.m. UTC | #5
On Thu, Sep 21, 2023 at 5:24 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Thu, Sep 21, 2023 at 3:00 PM Bernd Schubert <bschubert@ddn.com> wrote:
> >
> > On 9/21/23 11:33, Amir Goldstein wrote:
> > > On Thu, Sep 21, 2023 at 9:31 AM Bernd Schubert <bschubert@ddn.com> wrote:
> > >>
> > >> In FUSE, as of now, uncached lookups are expensive over the wire.
> > >> E.g additional latencies and stressing (meta data) servers from
> > >> thousands of clients. With atomic-open lookup before open
> > >> can be avoided.
> > >>
> > >> Here is the link to performance numbers
> > >> https://lore.kernel.org/linux-fsdevel/20220322121212.5087-1-dharamhans87@gmail.com/
> > >>
> > >> Here is the libfuse pull request
> > >> https://github.com/libfuse/libfuse/pull/813
> > >>
> > >> The patches are passing passthrough_hp xfstests (libfuse part applied),
> > >> although we had to introduce umount retries into xfstests, as recent
> > >> kernels/xfstests fail umount in some tests with
> > >> EBUSY - independent of atomic open. (Although outstanding for v7)
> > >
> > > Hi Bernd!
> > >
> > > I was using xfstests to test passthrough_hp (for FUSE kernel passthrough).
> > > FYI, I have made some improvements to the mount helper
> > > in libfuse [1] to support remount, which helps pass a few tests.
> >
> > Thanks, just asked there to send it separate to upstream.

Now upstream. Thanks for your help!

> >
> > >
> > > So far, I have all the tests in group -g quick.rw pass with the baseline
> > > passthrough_hp (over xfs).
> > >
> > > Do you have a baseline for the entire quick/auto group to share with me?
> >
> > Please find my results attached.
>
> Not too bad.
> 3 more tests can pass with my mount helper fix for remount ;)
>

FYI, here is a wdiff of my -g auto passthough_hp test run compared to yours:

[-unpatched-6.5-]{+upatched-6.6-rc3+}
Failures: generic/003 [-generic/020-] {+generic/099+} generic/184
generic/192 generic/263 [-generic/294 generic/306-] {+generic/317
generic/318 generic/319 generic/375+} generic/401 {+generic/423+}
generic/426 [-generic/427-] generic/434 [-generic/452-]
{+generic/444+} generic/467 [-generic/468-] generic/477
[-generic/478-] generic/617 {+generic/532+} generic/631 generic/633
generic/683 [-generic/688-]

Some of my {+NEW+} failures are because I have POSIX_ACL support enabled
in Kconfig, so the same tests are [not run] in your results.
I suspect that several permission related tests that PASS for you and FAIL
for me may also be because of enabled POSIX_ACL.
I was also running passthouhg_hp with -odefault_permissions, but AFAIK
this did not change the fstests results.

> >
> >
> > > Can you share the patch that you are using to avoid the EBUSY errors?
> >
> >
> > The simple version to avoid _most_ of EBUSY is this
> >

You know, I am testing passthrough_hp with kernel 6.6-rc3 and I did
not encounter any EBUST errors.

Maybe there is some relevant vfs fix in 6.6-rc3, because you were testing 6.5?
Or maybe it's because my test VM has only 2 cpus.

Thanks,
Amir.