
[PATCHSET v2 0/6] io_uring: add support for open/close

Message ID 20200107170034.16165-1-axboe@kernel.dk

Message

Jens Axboe Jan. 7, 2020, 5 p.m. UTC
Sending this out separately, as I rebased it on top of the work.openat2
branch from Al to resolve some of the conflicts with the differences in
how open flags are built.

Al, you had objections on patch 1 in this series. Are you fine with this
version?

Comments

Stefan Metzmacher Jan. 8, 2020, 9:17 p.m. UTC | #1
On 07.01.20 at 18:00, Jens Axboe wrote:
> Sending this out separately, as I rebased it on top of the work.openat2
> branch from Al to resolve some of the conflicts with the differences in
> how open flags are built.

Now that you rebased on top of openat2, wouldn't it be better to add
openat2 to io_uring instead of the old openat call?

metze
Jens Axboe Jan. 8, 2020, 10:57 p.m. UTC | #2
On 1/8/20 2:17 PM, Stefan Metzmacher wrote:
> On 07.01.20 at 18:00, Jens Axboe wrote:
>> Sending this out separately, as I rebased it on top of the work.openat2
>> branch from Al to resolve some of the conflicts with the differences in
>> how open flags are built.
> 
> Now that you rebased on top of openat2, wouldn't it be better to add
> openat2 to io_uring instead of the old openat call?

The IORING_OP_OPENAT already exists, so it would probably make more sense
to add IORING_OP_OPENAT2 alongside that. Or I could just change it. Don't
really feel that strongly about it, I'll probably just add openat2 and
leave openat alone, openat will just be a wrapper around openat2 anyway.
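To illustrate the wrapper relationship Jens mentions: openat(dfd, path, flags, mode) can be expressed as openat2(dfd, path, &how, sizeof(how)) with a struct open_how built from the legacy arguments. A rough userspace sketch of that mapping (not the actual kernel code; the helper name is made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <linux/openat2.h>

/* Sketch only: build the open_how that makes openat(dfd, path, flags, mode)
 * behave like openat2(dfd, path, &how, sizeof(how)).  This is why the old
 * call can stay a thin wrapper around the new one. */
static struct open_how open_how_from_openat(int flags, mode_t mode)
{
	struct open_how how;

	memset(&how, 0, sizeof(how));
	how.flags = flags;			/* O_RDONLY, O_CREAT, ... unchanged */
	if (flags & (O_CREAT | O_TMPFILE))
		how.mode = mode & 07777;	/* mode only matters when creating */
	/* how.resolve stays 0: openat imposes no RESOLVE_* restrictions */
	return how;
}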
Stefan Metzmacher Jan. 8, 2020, 11:05 p.m. UTC | #3
On 08.01.20 at 23:57, Jens Axboe wrote:
> On 1/8/20 2:17 PM, Stefan Metzmacher wrote:
>> On 07.01.20 at 18:00, Jens Axboe wrote:
>>> Sending this out separately, as I rebased it on top of the work.openat2
>>> branch from Al to resolve some of the conflicts with the differences in
>>> how open flags are built.
>>
>> Now that you rebased on top of openat2, wouldn't it be better to add
>> openat2 to io_uring instead of the old openat call?
> 
> The IORING_OP_OPENAT already exists, so it would probably make more sense
> to add IORING_OP_OPENAT2 alongside that. Or I could just change it. Don't
> really feel that strongly about it, I'll probably just add openat2 and
> leave openat alone, openat will just be a wrapper around openat2 anyway.

Great, thanks!
metze
Jens Axboe Jan. 9, 2020, 1:02 a.m. UTC | #4
On 1/8/20 4:05 PM, Stefan Metzmacher wrote:
> On 08.01.20 at 23:57, Jens Axboe wrote:
>> On 1/8/20 2:17 PM, Stefan Metzmacher wrote:
>>> On 07.01.20 at 18:00, Jens Axboe wrote:
>>>> Sending this out separately, as I rebased it on top of the work.openat2
>>>> branch from Al to resolve some of the conflicts with the differences in
>>>> how open flags are built.
>>>
>>> Now that you rebased on top of openat2, wouldn't it be better to add
>>> openat2 to io_uring instead of the old openat call?
>>
>> The IORING_OP_OPENAT already exists, so it would probably make more sense
>> to add IORING_OP_OPENAT2 alongside that. Or I could just change it. Don't
>> really feel that strongly about it, I'll probably just add openat2 and
>> leave openat alone, openat will just be a wrapper around openat2 anyway.
> 
> Great, thanks!

Here:

https://git.kernel.dk/cgit/linux-block/log/?h=for-5.6/io_uring-vfs

Not tested yet, will wire this up in liburing and write a test case
as well.
Jens Axboe Jan. 9, 2020, 2:03 a.m. UTC | #5
On 1/8/20 6:02 PM, Jens Axboe wrote:
> On 1/8/20 4:05 PM, Stefan Metzmacher wrote:
>> On 08.01.20 at 23:57, Jens Axboe wrote:
>>> On 1/8/20 2:17 PM, Stefan Metzmacher wrote:
>>>> On 07.01.20 at 18:00, Jens Axboe wrote:
>>>>> Sending this out separately, as I rebased it on top of the work.openat2
>>>>> branch from Al to resolve some of the conflicts with the differences in
>>>>> how open flags are built.
>>>>
>>>> Now that you rebased on top of openat2, wouldn't it be better to add
>>>> openat2 to io_uring instead of the old openat call?
>>>
>>> The IORING_OP_OPENAT already exists, so it would probably make more sense
>>> to add IORING_OP_OPENAT2 alongside that. Or I could just change it. Don't
>>> really feel that strongly about it, I'll probably just add openat2 and
>>> leave openat alone, openat will just be a wrapper around openat2 anyway.
>>
>> Great, thanks!
> 
> Here:
> 
> https://git.kernel.dk/cgit/linux-block/log/?h=for-5.6/io_uring-vfs
> 
> Not tested yet, will wire this up in liburing and write a test case
> as well.

Wrote a basic test case, and used my openbench as well. Seems to work
fine for me. Pushed prep etc support to liburing.
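Driving the new opcode through liburing looks roughly like this; a minimal sketch assuming the just-pushed prep support, with error handling omitted:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <liburing.h>
#include <linux/openat2.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct open_how how = { .flags = O_RDONLY };

	io_uring_queue_init(8, &ring, 0);

	/* Queue a single IORING_OP_OPENAT2 request and submit it. */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_openat2(sqe, AT_FDCWD, "/etc/hostname", &how);
	io_uring_submit(&ring);

	/* The completion carries the new fd (or -errno) in cqe->res. */
	io_uring_wait_cqe(&ring, &cqe);
	printf("openat2: %d\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);

	io_uring_queue_exit(&ring);
	return 0;
}

The close side is symmetrical: prep an IORING_OP_CLOSE request on the returned fd and reap its completion the same way.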
Stefan Metzmacher Jan. 16, 2020, 10:50 p.m. UTC | #6
On 09.01.20 at 03:03, Jens Axboe wrote:
> On 1/8/20 6:02 PM, Jens Axboe wrote:
>> On 1/8/20 4:05 PM, Stefan Metzmacher wrote:
>>> On 08.01.20 at 23:57, Jens Axboe wrote:
>>>> On 1/8/20 2:17 PM, Stefan Metzmacher wrote:
>>>>> On 07.01.20 at 18:00, Jens Axboe wrote:
>>>>>> Sending this out separately, as I rebased it on top of the work.openat2
>>>>>> branch from Al to resolve some of the conflicts with the differences in
>>>>>> how open flags are built.
>>>>>
>>>>> Now that you rebased on top of openat2, wouldn't it be better to add
>>>>> openat2 to io_uring instead of the old openat call?
>>>>
>>>> The IORING_OP_OPENAT already exists, so it would probably make more sense
>>>> to add IORING_OP_OPENAT2 alongside that. Or I could just change it. Don't
>>>> really feel that strongly about it, I'll probably just add openat2 and
>>>> leave openat alone, openat will just be a wrapper around openat2 anyway.
>>>
>>> Great, thanks!
>>
>> Here:
>>
>> https://git.kernel.dk/cgit/linux-block/log/?h=for-5.6/io_uring-vfs
>>
>> Not tested yet, will wire this up in liburing and write a test case
>> as well.
> 
> Wrote a basic test case, and used my openbench as well. Seems to work
> fine for me. Pushed prep etc support to liburing.

Thanks!

Another great feature would be the possibility to make use of the
generated fd in the following request.

This is a feature that's also available in the SMB3 protocol
called compound related requests.

The client can compound a chain with open, getinfo, read and close;
getinfo, read and close get a file handle of -1 and implicitly
use the fd generated/used by the previous request.

metze
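For context on the gap Stefan describes: io_uring can already order such a chain with IOSQE_IO_LINK, but every SQE must carry a concrete fd at submission time. A sketch, assuming the openat/close prep helpers from this series:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>
#include <liburing.h>

/* Sketch: a linked open -> readv -> close chain as io_uring stands in
 * this thread.  IOSQE_IO_LINK enforces ordering, but the readv and close
 * SQEs still need an fd known when the chain is submitted -- there is no
 * analogue of SMB3's "handle of -1 means the fd from the previous
 * request", which is the feature being asked for here. */
static void queue_chain(struct io_uring *ring, int fd, struct iovec *iov)
{
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_openat(sqe, AT_FDCWD, "file.txt", O_RDONLY, 0);
	sqe->flags |= IOSQE_IO_LINK;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_readv(sqe, fd, iov, 1, 0);  /* fd must already exist */
	sqe->flags |= IOSQE_IO_LINK;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_close(sqe, fd);             /* same limitation */

	io_uring_submit(ring);
}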
Jens Axboe Jan. 17, 2020, 12:18 a.m. UTC | #7
On 1/16/20 3:50 PM, Stefan Metzmacher wrote:
> On 09.01.20 at 03:03, Jens Axboe wrote:
>> On 1/8/20 6:02 PM, Jens Axboe wrote:
>>> On 1/8/20 4:05 PM, Stefan Metzmacher wrote:
>>>> On 08.01.20 at 23:57, Jens Axboe wrote:
>>>>> On 1/8/20 2:17 PM, Stefan Metzmacher wrote:
>>>>>> On 07.01.20 at 18:00, Jens Axboe wrote:
>>>>>>> Sending this out separately, as I rebased it on top of the work.openat2
>>>>>>> branch from Al to resolve some of the conflicts with the differences in
>>>>>>> how open flags are built.
>>>>>>
>>>>>> Now that you rebased on top of openat2, wouldn't it be better to add
>>>>>> openat2 to io_uring instead of the old openat call?
>>>>>
>>>>> The IORING_OP_OPENAT already exists, so it would probably make more sense
>>>>> to add IORING_OP_OPENAT2 alongside that. Or I could just change it. Don't
>>>>> really feel that strongly about it, I'll probably just add openat2 and
>>>>> leave openat alone, openat will just be a wrapper around openat2 anyway.
>>>>
>>>> Great, thanks!
>>>
>>> Here:
>>>
>>> https://git.kernel.dk/cgit/linux-block/log/?h=for-5.6/io_uring-vfs
>>>
>>> Not tested yet, will wire this up in liburing and write a test case
>>> as well.
>>
>> Wrote a basic test case, and used my openbench as well. Seems to work
>> fine for me. Pushed prep etc support to liburing.
> 
> Thanks!
> 
> Another great feature would be the possibility to make use of the
> generated fd in the following request.
>
> This is a feature that's also available in the SMB3 protocol
> called compound related requests.
>
> The client can compound a chain with open, getinfo, read and close;
> getinfo, read and close get a file handle of -1 and implicitly
> use the fd generated/used by the previous request.

Right, the "plan" there is to utilize BPF to make this programmable.
We really need something more expressive to be able to pass information
between SQEs that are linked, or even to decide which link to run
depending on the outcome of the parent.

There's a lot of potential there!
Colin Walters Jan. 17, 2020, 12:44 a.m. UTC | #8
On Thu, Jan 16, 2020, at 5:50 PM, Stefan Metzmacher wrote:
>
> The client can compound a chain with open, getinfo, read and close;
> getinfo, read and close get a file handle of -1 and implicitly
> use the fd generated/used by the previous request.

Sounds similar to https://capnproto.org/rpc.html too.

But that seems most valuable in a situation with nontrivial latency.
Jens Axboe Jan. 17, 2020, 12:51 a.m. UTC | #9
On 1/16/20 5:44 PM, Colin Walters wrote:
> 
> 
> On Thu, Jan 16, 2020, at 5:50 PM, Stefan Metzmacher wrote:
>>
>> The client can compound a chain with open, getinfo, read and close;
>> getinfo, read and close get a file handle of -1 and implicitly
>> use the fd generated/used by the previous request.
>
> Sounds similar to https://capnproto.org/rpc.html too.

Never heard of it, but I don't see how you can do any of this in an
efficient manner without kernel support. Unless it's just wrapping the
communication in its own protocol and using that as the basis for the
on-wire part. It has "infinitely faster" on a yellow sticker, so it must
work :-)

> But that seems most valuable in a situation with nontrivial latency.

Which is basically what Stefan is looking at; over the network, open,
read, close etc. is pretty nondeterministic. But even for local storage,
being able to set up a bundle like that and have it automagically work
makes for easier programming and more efficient communication with the
kernel.
Pavel Begunkov Jan. 17, 2020, 9:32 a.m. UTC | #10
On 1/17/2020 3:44 AM, Colin Walters wrote:
> On Thu, Jan 16, 2020, at 5:50 PM, Stefan Metzmacher wrote:
>> The client can compound a chain with open, getinfo, read and close;
>> getinfo, read and close get a file handle of -1 and implicitly
>> use the fd generated/used by the previous request.
>
> Sounds similar to https://capnproto.org/rpc.html too.
> 
Looks like just grouping a pack of operations for RPC.
With io_uring we could implement more interesting stuff. I've been
thinking about eBPF in io_uring for a while as well, and apparently it
could be _really_ powerful, and would allow almost zero-context-switches
for some usecases.

1. full flow control with eBPF
- dropping requests (links)
- emitting reqs/links (e.g. after completions of another req)
- chaining/redirecting
of course, all of that with fast intermediate computations in between

2. do long eBPF programs by introducing a new opcode (punted to async).
(though, there would be problems with that)

Could even allow dynamically registering new opcodes within the kernel
and extending this to eBPF, if there's demand for such things.
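None of this exists yet; purely as a thought experiment, the per-completion control Pavel describes above might look like the following, where the context struct and the IOURING_LINK_* verdicts are invented for illustration:

#include <linux/types.h>

/* Entirely hypothetical -- io_uring has no eBPF hook at this point.
 * Invented verdicts a per-completion program could return to steer a
 * link chain. */
enum iouring_link_verdict {
	IOURING_LINK_CONTINUE,	/* run the next linked SQE as-is */
	IOURING_LINK_DROP,	/* cancel the remainder of the chain */
};

/* Hypothetical context handed to the program on each completion. */
struct iouring_cqe_ctx {
	__s32 res;	/* completion result, e.g. the fd from an open */
	__u8  opcode;	/* opcode of the request that just completed */
};

/* Drop the rest of the chain on failure, otherwise let it run -- e.g. a
 * failed open should suppress the dependent read/close. */
static enum iouring_link_verdict on_complete(struct iouring_cqe_ctx *ctx)
{
	if (ctx->res < 0)
		return IOURING_LINK_DROP;
	return IOURING_LINK_CONTINUE;
}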
Jens Axboe Jan. 17, 2020, 3:21 p.m. UTC | #11
On 1/17/20 2:32 AM, Pavel Begunkov wrote:
> On 1/17/2020 3:44 AM, Colin Walters wrote:
>> On Thu, Jan 16, 2020, at 5:50 PM, Stefan Metzmacher wrote:
>>> The client can compound a chain with open, getinfo, read and close;
>>> getinfo, read and close get a file handle of -1 and implicitly
>>> use the fd generated/used by the previous request.
>>
>> Sounds similar to https://capnproto.org/rpc.html too.
>>
> Looks like just grouping a pack of operations for RPC.
> With io_uring we could implement more interesting stuff. I've been
> thinking about eBPF in io_uring for a while as well, and apparently it
> could be _really_ powerful, and would allow almost zero-context-switches
> for some usecases.
> 
> 1. full flow control with eBPF
> - dropping requests (links)
> - emitting reqs/links (e.g. after completions of another req)
> - chaining/redirecting
> of course, all of that with fast intermediate computations in between
> 
> 2. do long eBPF programs by introducing a new opcode (punted to async).
> (though, there would be problems with that)
> 
> Could even allow dynamically registering new opcodes within the kernel
> and extending this to eBPF, if there's demand for such things.

We're also looking into exactly that at Facebook, nothing concrete yet
though. But it's clear we need it to take full advantage of links at
least, and it's also clear that it would unlock a lot of really cool
functionality once we do.

Pavel, I'd strongly urge you to submit a talk to LSF/MM/BPF about this.
It's the perfect venue to have some concrete planning around this topic
and get things rolling.

https://lore.kernel.org/bpf/20191122172502.vffyfxlqejthjib6@macbook-pro-91.dhcp.thefacebook.com/
Pavel Begunkov Jan. 17, 2020, 10:27 p.m. UTC | #12
On 17/01/2020 18:21, Jens Axboe wrote:
> On 1/17/20 2:32 AM, Pavel Begunkov wrote:
>> On 1/17/2020 3:44 AM, Colin Walters wrote:
>>> On Thu, Jan 16, 2020, at 5:50 PM, Stefan Metzmacher wrote:
>>>> The client can compound a chain with open, getinfo, read and close;
>>>> getinfo, read and close get a file handle of -1 and implicitly
>>>> use the fd generated/used by the previous request.
>>>
>>> Sounds similar to https://capnproto.org/rpc.html too.
>>>
>> Looks like just grouping a pack of operations for RPC.
>> With io_uring we could implement more interesting stuff. I've been
>> thinking about eBPF in io_uring for a while as well, and apparently it
>> could be _really_ powerful, and would allow almost zero-context-switches
>> for some usecases.
>>
>> 1. full flow control with eBPF
>> - dropping requests (links)
>> - emitting reqs/links (e.g. after completions of another req)
>> - chaining/redirecting
>> of course, all of that with fast intermediate computations in between
>>
>> 2. do long eBPF programs by introducing a new opcode (punted to async).
>> (though, there would be problems with that)
>>
>>> Could even allow dynamically registering new opcodes within the kernel
>>> and extending this to eBPF, if there's demand for such things.
> 
> We're also looking into exactly that at Facebook, nothing concrete yet
> though. But it's clear we need it to take full advantage of links at
> least, and it's also clear that it would unlock a lot of really cool
> functionality once we do.
> 
> Pavel, I'd strongly urge you to submit a talk to LSF/MM/BPF about this.
> It's the perfect venue to have some concrete planning around this topic
> and get things rolling.

Sounds interesting, I'll try this, but didn't you intend to do it yourself?
And thanks for the tip!

> 
> https://lore.kernel.org/bpf/20191122172502.vffyfxlqejthjib6@macbook-pro-91.dhcp.thefacebook.com/
>
Jens Axboe Jan. 17, 2020, 10:36 p.m. UTC | #13
On 1/17/20 3:27 PM, Pavel Begunkov wrote:
> On 17/01/2020 18:21, Jens Axboe wrote:
>> On 1/17/20 2:32 AM, Pavel Begunkov wrote:
>>> On 1/17/2020 3:44 AM, Colin Walters wrote:
>>>> On Thu, Jan 16, 2020, at 5:50 PM, Stefan Metzmacher wrote:
>>>>> The client can compound a chain with open, getinfo, read and close;
>>>>> getinfo, read and close get a file handle of -1 and implicitly
>>>>> use the fd generated/used by the previous request.
>>>>
>>>> Sounds similar to https://capnproto.org/rpc.html too.
>>>>
>>> Looks like just grouping a pack of operations for RPC.
>>> With io_uring we could implement more interesting stuff. I've been
>>> thinking about eBPF in io_uring for a while as well, and apparently it
>>> could be _really_ powerful, and would allow almost zero-context-switches
>>> for some usecases.
>>>
>>> 1. full flow control with eBPF
>>> - dropping requests (links)
>>> - emitting reqs/links (e.g. after completions of another req)
>>> - chaining/redirecting
>>> of course, all of that with fast intermediate computations in between
>>>
>>> 2. do long eBPF programs by introducing a new opcode (punted to async).
>>> (though, there would be problems with that)
>>>
>>> Could even allow dynamically registering new opcodes within the kernel
>>> and extending this to eBPF, if there's demand for such things.
>>
>> We're also looking into exactly that at Facebook, nothing concrete yet
>> though. But it's clear we need it to take full advantage of links at
>> least, and it's also clear that it would unlock a lot of really cool
>> functionality once we do.
>>
>> Pavel, I'd strongly urge you to submit a talk to LSF/MM/BPF about this.
>> It's the perfect venue to have some concrete planning around this topic
>> and get things rolling.
> 
> Sounds interesting, I'll try this, but didn't you intend to do it
> yourself?  And thanks for the tip!

Just trying to delegate a bit, and I think you'd be a great candidate to
drive this. I'll likely do some other io_uring related topic there.
Stefan Metzmacher Jan. 20, 2020, 12:15 p.m. UTC | #14
Hi Jens,

>> Thanks!
>>
>> Another great feature would be the possibility to make use of the
>> generated fd in the following request.
>>
>> This is a feature that's also available in the SMB3 protocol
>> called compound related requests.
>>
>> The client can compound a chain with open, getinfo, read and close;
>> getinfo, read and close get a file handle of -1 and implicitly
>> use the fd generated/used by the previous request.
> 
> Right, the "plan" there is to utilize BPF to make this programmable.
> We really need something more expressive to be able to pass information
> between SQEs that are linked, or even to decide which link to run
> depending on the outcome of the parent.
> 
> There's a lot of potential there!

I guess so, but I don't yet understand how BPF works in real life.

Is it possible to do that as a normal user without special privileges?

My naive way would be using some flags, getting the res, and passing the fd by reference.

metze
Pavel Begunkov Jan. 20, 2020, 1:04 p.m. UTC | #15
On 1/20/2020 3:15 PM, Stefan Metzmacher wrote:
> Hi Jens,
> 
>>> Thanks!
>>>
>>> Another great feature would be the possibility to make use of the
>>> generated fd in the following request.
>>>
>>> This is a feature that's also available in the SMB3 protocol
>>> called compound related requests.
>>>
>>> The client can compound a chain with open, getinfo, read and close;
>>> getinfo, read and close get a file handle of -1 and implicitly
>>> use the fd generated/used by the previous request.
>>
>> Right, the "plan" there is to utilize BPF to make this programmable.
>> We really need something more expressive to be able to pass information
>> between SQEs that are linked, or even to decide which link to run
>> depending on the outcome of the parent.
>>
>> There's a lot of potential there!
> 
> I guess so, but I don't yet understand how BPF works in real life.
> 
> Is it possible to do that as a normal user without special privileges?
>
> My naive way would be using some flags, getting the res, and passing the fd by reference.
> 
We've just been discussing related stuff; see the link if you're curious:
https://github.com/axboe/liburing/issues/58

To summarise, there won't be enough flags to cover all use cases, and
they would slow down the common path. There should be something with
zero overhead if the feature is not used, and that's not the case with
flags. That's why it'd be great to have a custom (in-kernel) eBPF
program controlling what to do next and how.

I don't know much about eBPF internals, but we will probably be able to
attach an eBPF program to an io_uring instance. Though I'm not sure
whether it could be done without privileges.