mbox series

[v2,0/3] io_uring: add getdents64 support

Message ID 20211124231700.1158521-1-shr@fb.com (mailing list archive)
Headers show
Series io_uring: add getdents64 support | expand

Message

Stefan Roesch Nov. 24, 2021, 11:16 p.m. UTC
This series adds support for getdents64 in liburing. The intent is to
provide a more complete I/O interface for io_uring.

Patch 1: fs: add parameter use_fpos to iterate_dir()
  This adds a new parameter to the function iterate_dir() so the
  caller can specify if the position is the file position or the
  position stored in the buffer context.

Patch 2: fs: split off vfs_getdents function from getdents64 system call
  This splits of the iterate_dir part of the syscall in its own
  dedicated function. This allows to call the function directly from
  liburing.

Patch 3: io_uring: add support for getdents64
  Adds the functions to io_uring to support getdents64.

There is also a patch series for the changes to liburing. This includes
a new test. The patch series is called "liburing: add getdents support."

The following tests have been performed:
- new liburing getdents test program has been run
- xfstests have been run
- both tests have been repeated with the kernel memory leak checker
  and no leaks have been reported.

Signed-off-by: Stefan Roesch <shr@fb.com>
---
V2: Updated the iterate_dir calls in fs/ksmbd, fs/ecryptfs and arch/alpha with
    the additional parameter.

Stefan Roesch (3):
  fs: add parameter use_fpos to iterate_dir function
  fs: split off vfs_getdents function of getdents64 syscall
  io_uring: add support for getdents64

 arch/alpha/kernel/osf_sys.c   |  2 +-
 fs/ecryptfs/file.c            |  2 +-
 fs/exportfs/expfs.c           |  2 +-
 fs/internal.h                 |  8 +++++
 fs/io_uring.c                 | 52 ++++++++++++++++++++++++++++
 fs/ksmbd/smb2pdu.c            |  2 +-
 fs/ksmbd/vfs.c                |  4 +--
 fs/nfsd/nfs4recover.c         |  2 +-
 fs/nfsd/vfs.c                 |  2 +-
 fs/overlayfs/readdir.c        |  6 ++--
 fs/readdir.c                  | 64 ++++++++++++++++++++++++++---------
 include/linux/fs.h            |  2 +-
 include/uapi/linux/io_uring.h |  1 +
 13 files changed, 121 insertions(+), 28 deletions(-)


base-commit: f0afafc21027c39544a2c1d889b0cff75b346932

Comments

Clay Harris Nov. 25, 2021, 4:42 a.m. UTC | #1
I seem to recall making a few comments the last time a getdents64
for io_uring was proposed; in particular I wanted to bring up one
here.  This applies only to altering the internal interface, which
io_uring would use, although wiring up a new syscall would be a nice
addition.

The current interface has 2 issues:

1)
getdents64 requires at least two calls to read a directory.
One or more to get the dents and a final call to see the EOF.
With small directories, this means literally 50% of the calls
are wasted.

2)
The fpos cannot be changed atomically with a read, so it is not
possible to to safely perform concurrent reads on the same fd.

But, the kernel knows (most, if not all of the time) that it is at
EOF at the time it returns the last buffer.  So, it would be very
useful to get an EOF indicator back with the final buffer.  This
could just a flag, or for instance make an fpos parameter which is
both input and output, returning the (post read) fpos or zero at
EOF.

Futhermore, for input, one could supply:
	0:	Start from directory beginning
	-1:	Read from current position
	other:	(output from previous call) Read from here

On Wed, Nov 24 2021 at 15:16:57 -0800, Stefan Roesch quoth thus:

> This series adds support for getdents64 in liburing. The intent is to
> provide a more complete I/O interface for io_uring.
> 
> Patch 1: fs: add parameter use_fpos to iterate_dir()
>   This adds a new parameter to the function iterate_dir() so the
>   caller can specify if the position is the file position or the
>   position stored in the buffer context.
> 
> Patch 2: fs: split off vfs_getdents function from getdents64 system call
>   This splits of the iterate_dir part of the syscall in its own
>   dedicated function. This allows to call the function directly from
>   liburing.
> 
> Patch 3: io_uring: add support for getdents64
>   Adds the functions to io_uring to support getdents64.
> 
> There is also a patch series for the changes to liburing. This includes
> a new test. The patch series is called "liburing: add getdents support."
> 
> The following tests have been performed:
> - new liburing getdents test program has been run
> - xfstests have been run
> - both tests have been repeated with the kernel memory leak checker
>   and no leaks have been reported.
> 
> Signed-off-by: Stefan Roesch <shr@fb.com>
> ---
> V2: Updated the iterate_dir calls in fs/ksmbd, fs/ecryptfs and arch/alpha with
>     the additional parameter.
> 
> Stefan Roesch (3):
>   fs: add parameter use_fpos to iterate_dir function
>   fs: split off vfs_getdents function of getdents64 syscall
>   io_uring: add support for getdents64
> 
>  arch/alpha/kernel/osf_sys.c   |  2 +-
>  fs/ecryptfs/file.c            |  2 +-
>  fs/exportfs/expfs.c           |  2 +-
>  fs/internal.h                 |  8 +++++
>  fs/io_uring.c                 | 52 ++++++++++++++++++++++++++++
>  fs/ksmbd/smb2pdu.c            |  2 +-
>  fs/ksmbd/vfs.c                |  4 +--
>  fs/nfsd/nfs4recover.c         |  2 +-
>  fs/nfsd/vfs.c                 |  2 +-
>  fs/overlayfs/readdir.c        |  6 ++--
>  fs/readdir.c                  | 64 ++++++++++++++++++++++++++---------
>  include/linux/fs.h            |  2 +-
>  include/uapi/linux/io_uring.h |  1 +
>  13 files changed, 121 insertions(+), 28 deletions(-)
> 
> 
> base-commit: f0afafc21027c39544a2c1d889b0cff75b346932
> -- 
> 2.30.2
Stefan Roesch Dec. 1, 2021, 6:01 a.m. UTC | #2
On 11/24/21 8:42 PM, Clay Harris wrote:
> 
> I seem to recall making a few comments the last time a getdents64
> for io_uring was proposed; in particular I wanted to bring up one
> here.  This applies only to altering the internal interface, which
> io_uring would use, although wiring up a new syscall would be a nice
> addition.
> 
> The current interface has 2 issues:
> 
> 1)
> getdents64 requires at least two calls to read a directory.
> One or more to get the dents and a final call to see the EOF.
> With small directories, this means literally 50% of the calls
> are wasted.
> 
> 2)
> The fpos cannot be changed atomically with a read, so it is not
> possible to to safely perform concurrent reads on the same fd.
> 
> But, the kernel knows (most, if not all of the time) that it is at
> EOF at the time it returns the last buffer.  So, it would be very
> useful to get an EOF indicator back with the final buffer.  This
> could just a flag, or for instance make an fpos parameter which is
> both input and output, returning the (post read) fpos or zero at
> EOF.
> 
> Futhermore, for input, one could supply:
> 	0:	Start from directory beginning
> 	-1:	Read from current position
> 	other:	(output from previous call) Read from here
> 

While I can understand the wish to optimize the getdents call, this
has its own set of challenges:

- The getdents API is following the logic of other read API's. None
  of these API's has the logic you described above. This would be
  inconsistent.
- The eof needs to be stored in another field. The dirent structure
  does not have space in the field, so a new data structure needs to be defined.
- However the goal is to provide a familiar interface to the user.
- If the user wants to reduce the number of calls he can still provide
  a bigger user buffer.

> On Wed, Nov 24 2021 at 15:16:57 -0800, Stefan Roesch quoth thus:
> 
>> This series adds support for getdents64 in liburing. The intent is to
>> provide a more complete I/O interface for io_uring.
>>
>> Patch 1: fs: add parameter use_fpos to iterate_dir()
>>   This adds a new parameter to the function iterate_dir() so the
>>   caller can specify if the position is the file position or the
>>   position stored in the buffer context.
>>
>> Patch 2: fs: split off vfs_getdents function from getdents64 system call
>>   This splits of the iterate_dir part of the syscall in its own
>>   dedicated function. This allows to call the function directly from
>>   liburing.
>>
>> Patch 3: io_uring: add support for getdents64
>>   Adds the functions to io_uring to support getdents64.
>>
>> There is also a patch series for the changes to liburing. This includes
>> a new test. The patch series is called "liburing: add getdents support."
>>
>> The following tests have been performed:
>> - new liburing getdents test program has been run
>> - xfstests have been run
>> - both tests have been repeated with the kernel memory leak checker
>>   and no leaks have been reported.
>>
>> Signed-off-by: Stefan Roesch <shr@fb.com>
>> ---
>> V2: Updated the iterate_dir calls in fs/ksmbd, fs/ecryptfs and arch/alpha with
>>     the additional parameter.
>>
>> Stefan Roesch (3):
>>   fs: add parameter use_fpos to iterate_dir function
>>   fs: split off vfs_getdents function of getdents64 syscall
>>   io_uring: add support for getdents64
>>
>>  arch/alpha/kernel/osf_sys.c   |  2 +-
>>  fs/ecryptfs/file.c            |  2 +-
>>  fs/exportfs/expfs.c           |  2 +-
>>  fs/internal.h                 |  8 +++++
>>  fs/io_uring.c                 | 52 ++++++++++++++++++++++++++++
>>  fs/ksmbd/smb2pdu.c            |  2 +-
>>  fs/ksmbd/vfs.c                |  4 +--
>>  fs/nfsd/nfs4recover.c         |  2 +-
>>  fs/nfsd/vfs.c                 |  2 +-
>>  fs/overlayfs/readdir.c        |  6 ++--
>>  fs/readdir.c                  | 64 ++++++++++++++++++++++++++---------
>>  include/linux/fs.h            |  2 +-
>>  include/uapi/linux/io_uring.h |  1 +
>>  13 files changed, 121 insertions(+), 28 deletions(-)
>>
>>
>> base-commit: f0afafc21027c39544a2c1d889b0cff75b346932
>> -- 
>> 2.30.2
Clay Harris Dec. 1, 2021, 7:11 a.m. UTC | #3
On Tue, Nov 30 2021 at 22:01:30 -0800, Stefan Roesch quoth thus:

> 
> 
> On 11/24/21 8:42 PM, Clay Harris wrote:
> > 
> > I seem to recall making a few comments the last time a getdents64
> > for io_uring was proposed; in particular I wanted to bring up one
> > here.  This applies only to altering the internal interface, which
> > io_uring would use, although wiring up a new syscall would be a nice
> > addition.
> > 
> > The current interface has 2 issues:
> > 
> > 1)
> > getdents64 requires at least two calls to read a directory.
> > One or more to get the dents and a final call to see the EOF.
> > With small directories, this means literally 50% of the calls
> > are wasted.
> > 
> > 2)
> > The fpos cannot be changed atomically with a read, so it is not
> > possible to to safely perform concurrent reads on the same fd.
> > 
> > But, the kernel knows (most, if not all of the time) that it is at
> > EOF at the time it returns the last buffer.  So, it would be very
> > useful to get an EOF indicator back with the final buffer.  This
> > could just a flag, or for instance make an fpos parameter which is
> > both input and output, returning the (post read) fpos or zero at
> > EOF.
> > 
> > Futhermore, for input, one could supply:
> > 	0:	Start from directory beginning
> > 	-1:	Read from current position
> > 	other:	(output from previous call) Read from here
> > 

Thank you for taking the time to respond!

> While I can understand the wish to optimize the getdents call, this
> has its own set of challenges:
> 
> - The getdents API is following the logic of other read API's. None
>   of these API's has the logic you described above. This would be
>   inconsistent.

The point was that the familiar interface can be improved by changing its API.

Consider as if you were implementing a new syscall:
ssize_t getdents2(int fd, void *buf, size_t count, off_t *offset, int flags);

This is close enough to pread() that no one would be slowed down by its
unfamiliarity.  It would work just like getdents64(), except that the fpos
at the end of the read would always be returned in offset.  At EOF, fpos 0
would be returned.  On the calling side, one would set a flag and set offset
to a value other than -1 to cause it to atomically change fpos before the read.

Just considering an io_uring (only) call, io_uring is already a new interface,
so the inconsistency point doesn't really seem to apply.

> - The eof needs to be stored in another field. The dirent structure
>   does not have space in the field, so a new data structure needs to be defined.

I believe the important part of the idea is the EOF processing, so I'll drop
the fpos discussion for now.  In this case you could use exactly the old
getdents64() API for io_uring with the minor addition of a returned flag bit.
My understanding, which I'll admit could be flawed, is that the directory
iterator should be testable for the EOF condition.  If correct, I'd think
that testing it and setting a bit in the cqe flags field is all you'd need.
So, as long as the internal call has access to the iterator, no new structures
are required.

> - However the goal is to provide a familiar interface to the user.
> - If the user wants to reduce the number of calls he can still provide
>   a bigger user buffer.

In io_uring, if there is no EOF flag as suggested above, the best you
could probably due is to queue two requests, IOSQE_IO_HARDLINK-ed together,
every time you want some directory entries, and then look for a zero length
return.  That's extra logic, the extra wasted call, and already doubling the
required buffering.

> 
> > On Wed, Nov 24 2021 at 15:16:57 -0800, Stefan Roesch quoth thus:
> > 
> >> This series adds support for getdents64 in liburing. The intent is to
> >> provide a more complete I/O interface for io_uring.
> >>
> >> Patch 1: fs: add parameter use_fpos to iterate_dir()
> >>   This adds a new parameter to the function iterate_dir() so the
> >>   caller can specify if the position is the file position or the
> >>   position stored in the buffer context.
> >>
> >> Patch 2: fs: split off vfs_getdents function from getdents64 system call
> >>   This splits of the iterate_dir part of the syscall in its own
> >>   dedicated function. This allows to call the function directly from
> >>   liburing.
> >>
> >> Patch 3: io_uring: add support for getdents64
> >>   Adds the functions to io_uring to support getdents64.
> >>
> >> There is also a patch series for the changes to liburing. This includes
> >> a new test. The patch series is called "liburing: add getdents support."
> >>
> >> The following tests have been performed:
> >> - new liburing getdents test program has been run
> >> - xfstests have been run
> >> - both tests have been repeated with the kernel memory leak checker
> >>   and no leaks have been reported.
> >>
> >> Signed-off-by: Stefan Roesch <shr@fb.com>
> >> ---
> >> V2: Updated the iterate_dir calls in fs/ksmbd, fs/ecryptfs and arch/alpha with
> >>     the additional parameter.
> >>
> >> Stefan Roesch (3):
> >>   fs: add parameter use_fpos to iterate_dir function
> >>   fs: split off vfs_getdents function of getdents64 syscall
> >>   io_uring: add support for getdents64
> >>
> >>  arch/alpha/kernel/osf_sys.c   |  2 +-
> >>  fs/ecryptfs/file.c            |  2 +-
> >>  fs/exportfs/expfs.c           |  2 +-
> >>  fs/internal.h                 |  8 +++++
> >>  fs/io_uring.c                 | 52 ++++++++++++++++++++++++++++
> >>  fs/ksmbd/smb2pdu.c            |  2 +-
> >>  fs/ksmbd/vfs.c                |  4 +--
> >>  fs/nfsd/nfs4recover.c         |  2 +-
> >>  fs/nfsd/vfs.c                 |  2 +-
> >>  fs/overlayfs/readdir.c        |  6 ++--
> >>  fs/readdir.c                  | 64 ++++++++++++++++++++++++++---------
> >>  include/linux/fs.h            |  2 +-
> >>  include/uapi/linux/io_uring.h |  1 +
> >>  13 files changed, 121 insertions(+), 28 deletions(-)
> >>
> >>
> >> base-commit: f0afafc21027c39544a2c1d889b0cff75b346932
> >> -- 
> >> 2.30.2
Jens Axboe Dec. 15, 2021, 9:09 p.m. UTC | #4
On 11/24/21 4:16 PM, Stefan Roesch wrote:
> This series adds support for getdents64 in liburing. The intent is to
> provide a more complete I/O interface for io_uring.
> 
> Patch 1: fs: add parameter use_fpos to iterate_dir()
>   This adds a new parameter to the function iterate_dir() so the
>   caller can specify if the position is the file position or the
>   position stored in the buffer context.
> 
> Patch 2: fs: split off vfs_getdents function from getdents64 system call
>   This splits of the iterate_dir part of the syscall in its own
>   dedicated function. This allows to call the function directly from
>   liburing.
> 
> Patch 3: io_uring: add support for getdents64
>   Adds the functions to io_uring to support getdents64.
> 
> There is also a patch series for the changes to liburing. This includes
> a new test. The patch series is called "liburing: add getdents support."

Al, ping on this one as well, same question on the VFS side.
Al Viro Dec. 31, 2021, 11:27 p.m. UTC | #5
On Wed, Dec 15, 2021 at 02:09:52PM -0700, Jens Axboe wrote:
> On 11/24/21 4:16 PM, Stefan Roesch wrote:
> > This series adds support for getdents64 in liburing. The intent is to
> > provide a more complete I/O interface for io_uring.
> > 
> > Patch 1: fs: add parameter use_fpos to iterate_dir()
> >   This adds a new parameter to the function iterate_dir() so the
> >   caller can specify if the position is the file position or the
> >   position stored in the buffer context.
> > 
> > Patch 2: fs: split off vfs_getdents function from getdents64 system call
> >   This splits of the iterate_dir part of the syscall in its own
> >   dedicated function. This allows to call the function directly from
> >   liburing.
> > 
> > Patch 3: io_uring: add support for getdents64
> >   Adds the functions to io_uring to support getdents64.
> > 
> > There is also a patch series for the changes to liburing. This includes
> > a new test. The patch series is called "liburing: add getdents support."
> 
> Al, ping on this one as well, same question on the VFS side.

First of all, apologies for being MIA - had been sick for a couple of
months, so right now I'm digging through the huge pile of mail.

The problem is not with VFS side - it's with ->iterate_dir() *instances*.
The logics for validation of file position in there is not pretty, and
it's based upon the assumption that all position changes (other than
those from ->iterate_dir() itself) come through lseek(2), with its
per-opened-file exclusion with getdents(2).

Your API just shoves an arbitrary number at them, with no indication
as far as struct file instance is concerned that something might need
to be checked.  Hell, the file might've been just opened with no
directory-changing operations done since then; of course its position
is valid.  And has nothing whatsoever with the number you put into
ctx->pos.

Normalization is sufficiently costly to show up on real-world loads.
Even "assume it bogus" flag somewhere in ctx will seriously cost on
io_uring side.  And that doesn't help with the case of tmpfs et.al.
where position is ignored - there's a per-instance cursor moved
around on lseek/readdir and that's the authoritative thing there.

So please, use a saner userland ABI - pread-like stuff is a really,
really bad idea for directories.