[RFC,0/3] Add BPF for io_uring

Message ID: cover.1731285516.git.asml.silence@gmail.com

Message

Pavel Begunkov Nov. 11, 2024, 1:50 a.m. UTC
WARNING: it's an early prototype that is likely broken and unsafe
to run. It also most probably doesn't do the right thing from the
modern BPF perspective, but that's fine, as I want to get some numbers
first and only then consult with BPF folks and brush it up.

A comeback of the io_uring BPF proposal, put on top of new infrastructure.
Instead of executing BPF as a new request type, it now runs in the io_uring
waiting loop. The program is called to react every time we get a new
event, like a queued task_work or an interrupt. Patch 3 adds some helpers
the BPF program can use to interact with io_uring, like submitting new
requests and looking at CQEs. The program also decides when to return
control back to user space by returning one of IOU_BPF_RET_{OK,STOP}, and
sets the task_work batching size, i.e. how many CQEs to wait for before it
is run again, via a kfunc helper. We need to be able to sleep to submit
requests, hence only sleepable BPF is allowed.
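
To give a feel for the shape of such a program, here's an illustrative
sketch. The section name and the kfunc declarations below are invented
for the example and don't necessarily match what patch 3 actually
provides; only IOU_BPF_RET_{OK,STOP} comes from this series, with OK
read as "keep waiting" and STOP as "return to user space":

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical kfuncs standing in for the helpers added in patch 3 */
extern struct io_uring_cqe *bpf_io_uring_peek_cqe(struct io_ring_ctx *ctx) __ksym;
extern int bpf_io_uring_submit(struct io_ring_ctx *ctx) __ksym;
extern void bpf_io_uring_set_wait_cqes(struct io_ring_ctx *ctx, unsigned int nr) __ksym;

char LICENSE[] SEC("license") = "GPL";

SEC("iouring.s")	/* sleepable, submission may need to sleep */
int wait_loop(struct io_ring_ctx *ctx)
{
	struct io_uring_cqe *cqe = bpf_io_uring_peek_cqe(ctx);

	/* no CQE yet: keep waiting in the kernel */
	if (!cqe)
		return IOU_BPF_RET_OK;
	/* a request failed: hand control back to user space */
	if (cqe->res < 0)
		return IOU_BPF_RET_STOP;

	/* queue the next request and ask to be run after one more CQE */
	bpf_io_uring_submit(ctx);
	bpf_io_uring_set_wait_cqes(ctx, 1);
	return IOU_BPF_RET_OK;
}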

BPF can help to create arbitrary relations between requests from
within the kernel and later help with tuning the wait loop batching.
E.g. with minor extensions we can implement batch wait timeouts.
We can also use it to let the user safely access internal resources
and maybe even do a more elaborate request setup than the SQE allows.

The benchmark is primitive: the non-BPF baseline issues a link of two
nop requests at a time and waits for them to complete. The BPF version
runs them (2 * N requests) one by one. Numbers with mitigations on:

# nice -n -20 taskset -c 0 ./minimal 0 50000000
type 2-LINK, requests to run 50000000
sec 10, total (ms) 10314
# nice -n -20 taskset -c 0 ./minimal 1 50000000
type BPF, requests to run 50000000
sec 6, total (ms) 6808

It needs to be better tested, especially with asynchronous requests
like reads and with other hardware. It can also be further optimised,
e.g. we can avoid extra locking by taking the lock once for both BPF
and task_work_run.

The test (see examples-bpf/minimal[.bpf].c)
https://github.com/isilence/liburing.git io_uring-bpf
https://github.com/isilence/liburing/tree/io_uring-bpf

Pavel Begunkov (3):
  bpf/io_uring: add io_uring program type
  io_uring/bpf: allow to register and run BPF programs
  io_uring/bpf: add kfuncs for BPF programs

 include/linux/bpf.h               |   1 +
 include/linux/bpf_types.h         |   4 +
 include/linux/io_uring/bpf.h      |  10 ++
 include/linux/io_uring_types.h    |   4 +
 include/uapi/linux/bpf.h          |   1 +
 include/uapi/linux/io_uring.h     |   9 ++
 include/uapi/linux/io_uring/bpf.h |  22 ++++
 io_uring/Makefile                 |   1 +
 io_uring/bpf.c                    | 205 ++++++++++++++++++++++++++++++
 io_uring/bpf.h                    |  43 +++++++
 io_uring/io_uring.c               |  16 +++
 io_uring/register.c               |   7 +
 kernel/bpf/btf.c                  |   3 +
 kernel/bpf/syscall.c              |   1 +
 kernel/bpf/verifier.c             |  10 +-
 15 files changed, 336 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/io_uring/bpf.h
 create mode 100644 include/uapi/linux/io_uring/bpf.h
 create mode 100644 io_uring/bpf.c
 create mode 100644 io_uring/bpf.h

Comments

Ming Lei Nov. 13, 2024, 8:13 a.m. UTC | #1
On Mon, Nov 11, 2024 at 01:50:43AM +0000, Pavel Begunkov wrote:
> WARNING: it's an early prototype that is likely broken and unsafe
> to run. It also most probably doesn't do the right thing from the
> modern BPF perspective, but that's fine, as I want to get some numbers
> first and only then consult with BPF folks and brush it up.
> 
> A comeback of the io_uring BPF proposal, put on top of new infrastructure.
> Instead of executing BPF as a new request type, it now runs in the io_uring
> waiting loop. The program is called to react every time we get a new
> event, like a queued task_work or an interrupt. Patch 3 adds some helpers
> the BPF program can use to interact with io_uring, like submitting new
> requests and looking at CQEs. The program also decides when to return
> control back to user space by returning one of IOU_BPF_RET_{OK,STOP}, and
> sets the task_work batching size, i.e. how many CQEs to wait for before it
> is run again, via a kfunc helper. We need to be able to sleep to submit
> requests, hence only sleepable BPF is allowed.

I guess this way may break the existing interface of io_uring_enter(),
or at least one flag should be added to tell the kernel that the wait
behavior will be overridden by the bpf prog.

Also, can you share how these perfect parameters may be calculated
by the bpf prog? And why isn't the io_uring kernel code capable of
doing that?

> 
> BPF can help to create arbitrary relations between requests from
> within the kernel

Can you explain in more detail what the `arbitrary relations` are?

> and later help with tuning the wait loop batching.
> E.g. with minor extensions we can implement batch wait timeouts.
> We can also use it to let the user safely access internal resources
> and maybe even do a more elaborate request setup than the SQE allows.
> 
> The benchmark is primitive: the non-BPF baseline issues a link of two
> nop requests at a time and waits for them to complete. The BPF version
> runs them (2 * N requests) one by one. Numbers with mitigations on:
> 
> # nice -n -20 taskset -c 0 ./minimal 0 50000000
> type 2-LINK, requests to run 50000000
> sec 10, total (ms) 10314
> # nice -n -20 taskset -c 0 ./minimal 1 50000000
> type BPF, requests to run 50000000
> sec 6, total (ms) 6808
> 
> It needs to be better tested, especially with asynchronous requests
> like reads and with other hardware. It can also be further optimised,
> e.g. we can avoid extra locking by taking the lock once for both BPF
> and task_work_run.
> 
> The test (see examples-bpf/minimal[.bpf].c)
> https://github.com/isilence/liburing.git io_uring-bpf
> https://github.com/isilence/liburing/tree/io_uring-bpf

Looks like you pull bpftool & libbpf code into the example; just
wondering, why not link the example with libbpf directly?


Thanks, 
Ming
Pavel Begunkov Nov. 13, 2024, 1:09 p.m. UTC | #2
On 11/13/24 08:13, Ming Lei wrote:
> On Mon, Nov 11, 2024 at 01:50:43AM +0000, Pavel Begunkov wrote:
>> WARNING: it's an early prototype that is likely broken and unsafe
>> to run. It also most probably doesn't do the right thing from the
>> modern BPF perspective, but that's fine, as I want to get some numbers
>> first and only then consult with BPF folks and brush it up.
>>
>> A comeback of the io_uring BPF proposal, put on top of new infrastructure.
>> Instead of executing BPF as a new request type, it now runs in the io_uring
>> waiting loop. The program is called to react every time we get a new
>> event, like a queued task_work or an interrupt. Patch 3 adds some helpers
>> the BPF program can use to interact with io_uring, like submitting new
>> requests and looking at CQEs. The program also decides when to return
>> control back to user space by returning one of IOU_BPF_RET_{OK,STOP}, and
>> sets the task_work batching size, i.e. how many CQEs to wait for before it
>> is run again, via a kfunc helper. We need to be able to sleep to submit
>> requests, hence only sleepable BPF is allowed.
> 
> I guess this way may break the existing interface of io_uring_enter(),
> or at least one flag should be added to tell the kernel that the wait
> behavior will be overridden by the bpf prog.

It doesn't change anything if there is no BPF registered, and a user
who adds BPF should expect the change of behaviour dictated by
their own BPF program. Unlike some other BPF hooks, it should be
installed by the ring user and not from outside.
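
To make that concrete, registration from the ring owner's side could
look roughly like the sketch below. IORING_REGISTER_BPF (and its value)
and the program name are assumed purely for illustration; the real
opcode is whatever patch 2 defines:

#include <unistd.h>
#include <sys/syscall.h>
#include <liburing.h>
#include <bpf/libbpf.h>

/* hypothetical opcode, standing in for the one added in patch 2 */
#define IORING_REGISTER_BPF	99

static int register_bpf_prog(struct io_uring *ring, const char *obj_path)
{
	struct bpf_object *obj;
	struct bpf_program *prog;
	int prog_fd;

	obj = bpf_object__open_file(obj_path, NULL);
	if (!obj || bpf_object__load(obj))
		return -1;

	prog = bpf_object__find_program_by_name(obj, "wait_loop");
	if (!prog)
		return -1;
	prog_fd = bpf_program__fd(prog);

	/* the owner installs the program on its own ring fd */
	return syscall(__NR_io_uring_register, ring->ring_fd,
		       IORING_REGISTER_BPF, &prog_fd, 1);
}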

> Also, can you share how these perfect parameters may be calculated
> by the bpf prog?

In terms of knowledge it should be on par with normal user space,
so it's not about how to calculate a certain parameter but rather
how to pass it to the kernel, and whether we want to carve a path
through the io_uring_enter syscall for that. With kfuncs it's easier
to keep it out of the generic path, and they're also more lax about
removals.

> And why isn't the io_uring kernel code capable of doing that?

Take a look at the min_wait feature: it passes two timeout values to
io_uring_enter, which wouldn't be needed if the same were implemented
in BPF (with additional helpers). But what if you want to go a step
further and, let's say, have "if the first timeout expires without any
new CQEs, retry with the waiting time doubled"? You either do it less
efficiently or further extend the API.
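
With BPF the whole policy could live in the program. A sketch, with the
caveat that both kfuncs below are invented and would be part of those
additional helpers:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

/* Invented kfuncs: waiting with a timeout isn't in this series yet */
extern void bpf_io_uring_set_wait_timeout(struct io_ring_ctx *ctx, __u64 ns) __ksym;
extern unsigned int bpf_io_uring_cq_ready(struct io_ring_ctx *ctx) __ksym;

char LICENSE[] SEC("license") = "GPL";

/* current waiting time in ns, doubled on every fruitless wake-up */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} wait_ns SEC(".maps");

SEC("iouring.s")
int wait_retry(struct io_ring_ctx *ctx)
{
	__u32 key = 0;
	__u64 *ns = bpf_map_lookup_elem(&wait_ns, &key);

	if (!ns)
		return IOU_BPF_RET_STOP;
	/* got CQEs: let user space process them */
	if (bpf_io_uring_cq_ready(ctx))
		return IOU_BPF_RET_STOP;

	/* timeout expired with nothing ready: double it and wait again */
	*ns *= 2;
	bpf_io_uring_set_wait_timeout(ctx, *ns);
	return IOU_BPF_RET_OK;
}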

>> BPF can help to create arbitrary relations between requests from
>> within the kernel
> 
> Can you explain it in details about the `arbitrary relations`?

For example, we could've implemented IOSQE_IO_LINK this way and spared
ourselves a lot of misery; unfortunately, that's past tense now. No code
for assembling links, no extra worries about CQE ordering, no
complications around cancellations, no problems with when we bind files
and buffers (prep vs issue), no hacky state machine for DRAIN+LINK, no
troubles with interpreting request results as errors or not and the
subsequent decision of breaking links (see IOSQE_IO_HARDLINK,
min_ret/MSG_WAITALL handling in net.c, IORING_TIMEOUT_ETIME_SUCCESS,
etc.). Simpler kernel code, easier to maintain, fewer bugs, and it
would even work faster.

By arbitrary relations, think of a directed acyclic graph setting the
execution ordering between requests. With the error handling questions
involved, I don't believe a hard-coded kernel version would ever be
viable.
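
As a toy sketch of that (helpers invented as before, and a single
dependent per node for brevity): a map encodes the graph edges, each
completion submits its successor, and the error policy is fully up to
the program:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

/* Invented kfuncs, as in the earlier sketch */
extern struct io_uring_cqe *bpf_io_uring_peek_cqe(struct io_ring_ctx *ctx) __ksym;
extern int bpf_io_uring_submit_sqe(struct io_ring_ctx *ctx, __u32 idx) __ksym;

char LICENSE[] SEC("license") = "GPL";

/* graph edges: deps[node] = SQE index to run once `node` completes */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, __u32);
} deps SEC(".maps");

SEC("iouring.s")
int run_dag(struct io_ring_ctx *ctx)
{
	struct io_uring_cqe *cqe = bpf_io_uring_peek_cqe(ctx);
	__u32 node, *next;

	if (!cqe)
		return IOU_BPF_RET_OK;
	/* breaking the chain on error is the program's decision */
	if (cqe->res < 0)
		return IOU_BPF_RET_STOP;

	node = (__u32)cqe->user_data;
	next = bpf_map_lookup_elem(&deps, &node);
	/* leaf node: the graph is done, return to user space */
	if (!next)
		return IOU_BPF_RET_STOP;

	bpf_io_uring_submit_sqe(ctx, *next);
	return IOU_BPF_RET_OK;
}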

>> and later help with tuning the wait loop batching.
>> E.g. with minor extensions we can implement batch wait timeouts.
>> We can also use it to let the user safely access internal resources
>> and maybe even do a more elaborate request setup than the SQE allows.
>>
>> The benchmark is primitive: the non-BPF baseline issues a link of two
>> nop requests at a time and waits for them to complete. The BPF version
>> runs them (2 * N requests) one by one. Numbers with mitigations on:
>>
>> # nice -n -20 taskset -c 0 ./minimal 0 50000000
>> type 2-LINK, requests to run 50000000
>> sec 10, total (ms) 10314
>> # nice -n -20 taskset -c 0 ./minimal 1 50000000
>> type BPF, requests to run 50000000
>> sec 6, total (ms) 6808
>>
>> It needs to be better tested, especially with asynchronous requests
>> like reads and with other hardware. It can also be further optimised,
>> e.g. we can avoid extra locking by taking the lock once for both BPF
>> and task_work_run.
>>
>> The test (see examples-bpf/minimal[.bpf].c)
>> https://github.com/isilence/liburing.git io_uring-bpf
>> https://github.com/isilence/liburing/tree/io_uring-bpf
> 
> Looks like you pull bpftool & libbpf code into the example; just
> wondering, why not link the example with libbpf directly?

It needs liburing and is more useful down the road as a liburing
example. I'll clean up the mess into submodules and separate commits.
Eventually it'd be turned into tests with system deps, but that
wouldn't be convenient for now.