mbox series

[GIT,PULL] io_uring xattr support

Message ID d94a4e55-c4f2-73d8-9e2c-e55ae8436622@kernel.dk (mailing list archive)
State New
Headers show
Series [GIT,PULL] io_uring xattr support | expand

Pull-request

git://git.kernel.dk/linux-block.git tags/for-5.19/io_uring-xattr-2022-05-22

Message

Jens Axboe May 22, 2022, 9:26 p.m. UTC
Hi Linus,

On top of the core io_uring changes, this pull request includes support
for the xattr variants.

Please pull!


The following changes since commit 155bc9505dbd6613585abbf0be6466f1c21536c4:

  io_uring: return an error when cqe is dropped (2022-04-24 18:18:18 -0600)

are available in the Git repository at:

  git://git.kernel.dk/linux-block.git tags/for-5.19/io_uring-xattr-2022-05-22

for you to fetch changes up to 4ffaa94b9c047fe0e82b1f271554f31f0e2e2867:

  io_uring: cleanup error-handling around io_req_complete (2022-04-24 18:29:33 -0600)

----------------------------------------------------------------
for-5.19/io_uring-xattr-2022-05-22

----------------------------------------------------------------
Jens Axboe (1):
      io_uring: fix trace for reduced sqe padding

Kanchan Joshi (1):
      io_uring: cleanup error-handling around io_req_complete

Stefan Roesch (4):
      fs: split off setxattr_copy and do_setxattr function from setxattr
      fs: split off do_getxattr from getxattr
      io_uring: add fsetxattr and setxattr support
      io_uring: add fgetxattr and getxattr support

 fs/internal.h                   |  29 ++++
 fs/io_uring.c                   | 322 ++++++++++++++++++++++++++++++++++++----
 fs/xattr.c                      | 143 ++++++++++++------
 include/trace/events/io_uring.h |   9 +-
 include/uapi/linux/io_uring.h   |   8 +-
 5 files changed, 434 insertions(+), 77 deletions(-)

Comments

Linus Torvalds May 23, 2022, 7:41 p.m. UTC | #1
On Sun, May 22, 2022 at 2:26 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On top of the core io_uring changes, this pull request includes support
> for the xattr variants.

So I don't mind the code (having seen the earlier versions), but
looking at this all I *do* end up reacting to this part:

    [torvalds@ryzen linux]$ wc -l fs/io_uring.c
    12744 fs/io_uring.c

and no, this is not due to this xattr pull, but the xattr code did add
another few hundred lines of "io_uring command boilerplate for another
command" to this file that is a nasty file from hell.

I really think that it might be time to start thinking about splitting
that io_uring.c file up. Make it a directory, and have the core
command engine in io_uring/core.c, and then have the different actual
IO_URING_OP_xyz handling in separate files.

And yes, that would probably necessitate making the OP handling use
more of a dispatch table approach, but wouldn't that be good anyway?
That io_uring.c file is starting to have a lot of *big* switch
statements for the different cases.

Wouldn't it be nice to have a "op descriptor array" instead of the

        switch (req->opcode) {
        ...
        case IORING_OP_WRITE:
                return io_prep_rw(req, sqe);
        ...

kind of tables?

Yes, the compiler may end up generating a binary-tree
compare-and-branch thing for a switch like that, and it might be
better than an indirect branch in these days of spectre costs for
branch prediction safety, but if we're talking a few tens of cycles
per op, that's probably not really a big deal.

And from a maintenenace standpoint, I really think it would be good to
try to try to walk away from those "case IORING_OP_xyz" things, and
try to split things up into more manageable pieces.

Hmm?

               Linus
Jens Axboe May 23, 2022, 7:59 p.m. UTC | #2
On 5/23/22 1:41 PM, Linus Torvalds wrote:
> On Sun, May 22, 2022 at 2:26 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On top of the core io_uring changes, this pull request includes support
>> for the xattr variants.
> 
> So I don't mind the code (having seen the earlier versions), but
> looking at this all I *do* end up reacting to this part:
> 
>     [torvalds@ryzen linux]$ wc -l fs/io_uring.c
>     12744 fs/io_uring.c
> 
> and no, this is not due to this xattr pull, but the xattr code did add
> another few hundred lines of "io_uring command boilerplate for another
> command" to this file that is a nasty file from hell.

I know, it really is a monster file at this point...

> I really think that it might be time to start thinking about splitting
> that io_uring.c file up. Make it a directory, and have the core
> command engine in io_uring/core.c, and then have the different actual
> IO_URING_OP_xyz handling in separate files.

I've been pondering that for a while actually, and yes I agree it's time
to organize this a bit differently. When you are in this code all the
time you notice less as you know where everything is, but it would be
nice to take the time to split it into some manageable and separately
readable/maintainable pieces.

> And yes, that would probably necessitate making the OP handling use
> more of a dispatch table approach, but wouldn't that be good anyway?
> That io_uring.c file is starting to have a lot of *big* switch
> statements for the different cases.
> 
> Wouldn't it be nice to have a "op descriptor array" instead of the
> 
>         switch (req->opcode) {
>         ...
>         case IORING_OP_WRITE:
>                 return io_prep_rw(req, sqe);
>         ...
> 
> kind of tables?
> 
> Yes, the compiler may end up generating a binary-tree
> compare-and-branch thing for a switch like that, and it might be
> better than an indirect branch in these days of spectre costs for
> branch prediction safety, but if we're talking a few tens of cycles
> per op, that's probably not really a big deal.

I was resistant to the indirect function call initially because of the
spectre overhead, but that was when the table was a lot smaller. The
tides may indeed have shifted on this now that the table has grown to
the size that it has. Plus we have both a prep handler and issue handler
for each, so you end up with two massive switches to deal with that.

> And from a maintenenace standpoint, I really think it would be good to
> try to try to walk away from those "case IORING_OP_xyz" things, and
> try to split things up into more manageable pieces.
> 
> Hmm?

As mentioned, it's something that I have myself been thinking about for
the past few releases. It's not difficult work and can be done in a
sequential kind of manner, but it will add some pain in terms of
backports. Nothing _really_ major, but... Longer term it'll be nicer for
sure, which is the most important bit.

I've got some ideas on how to split the core bits, and the
related-op-per file kind of idea for the rest makes sense. Eg net
related bits can go in one, or maybe we can even go finer grained and
(almost) do per-op.

I'll spend some time after the merge window to try and get this sorted
out.
pr-tracker-bot@kernel.org May 23, 2022, 8:42 p.m. UTC | #3
The pull request you sent on Sun, 22 May 2022 15:26:02 -0600:

> git://git.kernel.dk/linux-block.git tags/for-5.19/io_uring-xattr-2022-05-22

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/09beaff75e4c4cee590b0e547c7587ff6cadb5db

Thank you!
Jens Axboe May 25, 2022, 6:04 p.m. UTC | #4
On 5/23/22 1:59 PM, Jens Axboe wrote:
>> And from a maintenenace standpoint, I really think it would be good to
>> try to try to walk away from those "case IORING_OP_xyz" things, and
>> try to split things up into more manageable pieces.
>>
>> Hmm?
> 
> As mentioned, it's something that I have myself been thinking about for
> the past few releases. It's not difficult work and can be done in a
> sequential kind of manner, but it will add some pain in terms of
> backports. Nothing _really_ major, but... Longer term it'll be nicer for
> sure, which is the most important bit.
> 
> I've got some ideas on how to split the core bits, and the
> related-op-per file kind of idea for the rest makes sense. Eg net
> related bits can go in one, or maybe we can even go finer grained and
> (almost) do per-op.
> 
> I'll spend some time after the merge window to try and get this sorted
> out.

Well I spent some time yesterday, and 53 patches later, I think it looks
pretty sane. At this point everything is split out, except read/write
and poll handling. There's still some things we can split out, like the
io_uring_register() side, but as it stands, ~35% of the code has been
split into separate files for either opcodes or infrastructure that goes
together. We can definitely get this down further, but it's a good
start.

Anyway, I pitted -git vs -git + this merged in, to test with and without
spectre mitigations as we now have an indirect ->prep() and ->issue()
for each opcode in the fast path. I ran my usual peak testing, which is
basically just polled async random reads from 3 devices. The box in
question is a 12900K, so recent cpu. Mitigation status:

/sys/devices/system/cpu/vulnerabilities/spectre_v1:Mitigation: usercopy/swapgs barriers and __user pointer sanitization
/sys/devices/system/cpu/vulnerabilities/spectre_v2:Vulnerable: eIBRS with unprivileged eBPF

Out of 5 runs, here are the results:

-git, RETPOLINE=n, mitigations=off
Maximum IOPS=12837K
Maximum IOPS=12812K
Maximum IOPS=12834K
Maximum IOPS=12856K
Maximum IOPS=12809K

for-5.20/io_uring, RETPOLINE=n, mitigations=off
Maximum IOPS=12877K
Maximum IOPS=12870K
Maximum IOPS=12848K
Maximum IOPS=12837K
Maximum IOPS=12899K

-git, RETPOLINE=y, mitigations=on
Maximum IOPS=12869K
Maximum IOPS=12766K
Maximum IOPS=12817K
Maximum IOPS=12866K
Maximum IOPS=12828K

for-5.20/io_uring, RETPOLINE=y, mitigations=on
Maximum IOPS=12859K
Maximum IOPS=12921K
Maximum IOPS=12899K
Maximum IOPS=12859K
Maximum IOPS=12904K

If anything, the new code seems a smidge faster for both mitigations=off
vs RETPOLINE=y && mitigations=on. Hmm? But at least from a first test
this is promising and it may be a viable path forward to make it a bit
saner.

If you're curious, git tree here:

https://git.kernel.dk/cgit/linux-block/log/?h=for-5.20/io_uring

with the first 5 patches being staged for 5.19 as well as they are just
consistency cleanups.