[bpf-next,v3,00/10] Introduce BPF iterators for io_uring and epoll

Message ID 20211201042333.2035153-1-memxor@gmail.com (mailing list archive)

Kumar Kartikeya Dwivedi Dec. 1, 2021, 4:23 a.m. UTC
The CRIU [0] project developers are exploring potential uses of the BPF
subsystem to do complicated tasks that are difficult to add support for in the
kernel using existing interfaces.  Even when such tasks are implemented using
procfs or kcmp, it is difficult to make them perform well without some kind of
programmable introspection into the kernel data structures. Moreover, for
procfs based state inspection, the output format once agreed upon is set in
stone and hard to extend, and at the same time inefficient to consume from
programs (where it is first converted from machine readable form to human
readable form, only to be converted again to machine readable form).  In
addition, the kcmp based file set matching algorithm performs poorly, since
each file in one set needs to be compared to each file in the other set to
determine struct file equivalence.
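
As a rough illustration of the pairwise matching cost described above, here is
a minimal userspace sketch. It is not CRIU's actual code: count_shared_files()
and same_file_fn are hypothetical names used only for this example. The only
real interface relied upon is kcmp(2) with KCMP_FILE, which returns 0 when two
descriptors refer to the same open file.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/kcmp.h>   /* KCMP_FILE */
#include <stddef.h>
#include <sys/syscall.h>  /* SYS_kcmp */
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical callback type: non-zero if both fds share one struct file. */
typedef int (*same_file_fn)(pid_t pid_a, int fd_a, pid_t pid_b, int fd_b);

/*
 * Comparator backed by kcmp(2). The syscall returns 0 when the two
 * descriptors refer to the same open file; any other result (including
 * an error such as ENOSYS or EPERM) is treated here as "not the same".
 */
int kcmp_same_file(pid_t pid_a, int fd_a, pid_t pid_b, int fd_b)
{
	long ret = syscall(SYS_kcmp, pid_a, pid_b, KCMP_FILE,
			   (unsigned long)fd_a, (unsigned long)fd_b);
	return ret == 0;
}

/*
 * Naive set matching: every fd of task A is compared against every fd
 * of task B, so the comparator runs na * nb times. Returns the number
 * of matching pairs and stores the number of comparisons in *ncmp.
 */
size_t count_shared_files(pid_t pid_a, const int *fds_a, size_t na,
			  pid_t pid_b, const int *fds_b, size_t nb,
			  same_file_fn same, size_t *ncmp)
{
	size_t shared = 0;

	assert(same != NULL && ncmp != NULL);
	*ncmp = 0;
	for (size_t i = 0; i < na; i++) {
		for (size_t j = 0; j < nb; j++) {
			(*ncmp)++;
			if (same(pid_a, fds_a[i], pid_b, fds_b[j]))
				shared++;
		}
	}
	return shared;
}
```

With two tasks holding N and M descriptors, the sketch issues N * M syscalls.
kcmp(2) also reports an ordering between the compared objects, which lets a
caller keep them sorted and reduce the number of calls; the sketch ignores that
and shows only the pairwise scheme.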

This set adds an io_uring file iterator (for registered files), an io_uring
ubuf iterator (for registered buffers), and an epoll iterator (for registered
items, i.e. files registered using EPOLL_CTL_ADD) to overcome these
limitations.  Combined with the existing task, task_file, and task_vma
iterators, these can significantly enhance and speed up the task dumping
procedure.
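
For readers less familiar with epoll, the registered items mentioned above are
created from userspace roughly as in the sketch below. The helper name
epoll_register() is hypothetical; epoll_ctl(2) and EPOLL_CTL_ADD are the real
interface. The (target fd, event mask, user data) triple passed here is the
kind of per-item state a checkpoint has to recover in order to replay the
registration on restore.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: register target fd `fd` in the epoll instance
 * `epfd` with the given event mask and an opaque 64-bit user cookie.
 * Returns 0 on success, -1 with errno set on failure (same as epoll_ctl).
 */
int epoll_register(int epfd, int fd, uint32_t events, uint64_t cookie)
{
	struct epoll_event ev = {
		.events = events,
		.data = { .u64 = cookie },
	};

	assert(epfd >= 0 && fd >= 0);
	return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}
```

Registering the same fd twice in one epoll instance fails with EEXIST, so on
restore each item has to be replayed exactly once.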

The two immediate use cases are io_uring checkpoint/restore support and epoll
checkpoint/restore support. The first is unimplemented, and the second is being
expedited using a new epoll iterator. In the future, more stages of the
checkpointing sequence can be offloaded to eBPF programs to reduce process
downtime, e.g. in the pre-dump stage, before the task is seized.
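
For comparison, the procfs route for epoll state means parsing text from
/proc/<pid>/fdinfo/<epfd>. The sketch below parses one such line, assuming the
"tfd: ... events: ... data: ... pos:... ino:... sdev:..." layout documented in
proc(5) for recent kernels. struct epoll_item and parse_epoll_fdinfo_line() are
hypothetical names for this example, not taken from CRIU or from this series.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for one epoll item line from fdinfo. */
struct epoll_item {
	int tfd;                 /* target file descriptor number */
	unsigned int events;     /* event mask, printed in hex */
	unsigned long long data; /* user data, printed in hex */
	long long pos;           /* file position of the target file */
	unsigned long ino;       /* inode number, printed in hex */
	unsigned int sdev;       /* superblock device, printed in hex */
};

/*
 * Parse one "tfd:" line. Returns 0 if all six fields were converted,
 * -1 otherwise (e.g. for the "pos:", "flags:" or "mnt_id:" header lines).
 */
int parse_epoll_fdinfo_line(const char *line, struct epoll_item *it)
{
	int n;

	assert(line != NULL && it != NULL);
	n = sscanf(line,
		   "tfd: %d events: %x data: %llx pos:%lli ino:%lx sdev:%x",
		   &it->tfd, &it->events, &it->data,
		   &it->pos, &it->ino, &it->sdev);
	return n == 6 ? 0 : -1;
}
```

Every field here starts as a binary value in the kernel, is formatted as text,
and is then converted back by sscanf(), which is the round trip the cover
letter describes.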

The io_uring file iterator is even more important now due to the advent of
descriptorless files in io_uring [1], which make dumping a task's files a lot
harder for CRIU, since there is no visibility into these hidden descriptors
that the task depends upon for operation. Similarly, the io_uring_ubuf
iterator is useful in case the original VMA used to register a buffer has been
destroyed.

The set includes a sample showing how these iterators, along with the
task_file iterator, can be used to restore an io_uring instance, implementing a
simplified version of the code we are planning to adopt for CRIU. Patch 10 is
not meant for submission, only exposition, hence explicitly marked RFC. It
implements the missing features noted in [2].

Please see the individual patches for more details.

[ Note (for Yonghong): I am still unsure what will be useful in show_fdinfo,
  fill_link_info for epoll, so that has been left out. I was reminded that
  io_uring now uses anon_inode_getfile_secure, which we also use in CRIU to
  determine source fd of ring mapping, so this should be enough to identify
  the io_uring fd in userspace, hence I implemented it for io_uring in v2.   ]

  [0]: https://criu.org/Main_Page
  [1]: https://lwn.net/Articles/863071
  [2]: https://github.com/checkpoint-restore/criu/pull/1597

Changelog:
----------
v2 -> v3:
v2: https://lore.kernel.org/bpf/20211122225352.618453-1-memxor@gmail.com

 * Make show_fdinfo/fill_link_info functions static (Kernel Test Robot)
 * Minor memory leak fixes for bpf_cr
 * Use proper names instead of -2, -1 for denoting epoll iterator state

v1 -> v2:
v1: https://lore.kernel.org/bpf/20211116054237.100814-1-memxor@gmail.com

 * Add example showing how iterator is useful in C/R of io_uring (Alexei)
 * Change type of index from unsigned long to u64 (Yonghong)
 * Fix build error for CONFIG_IO_URING=n (Kernel Test Robot)
  * Move bpf_page_to_pfn out of CONFIG_IO_URING (Yonghong)
 * Add comment to bpf_iter_aux_info for map member (Yonghong)
 * show_fdinfo/fill_link_info for io_uring (Yonghong)
 * Fix other nits

Kumar Kartikeya Dwivedi (10):
  io_uring: Implement eBPF iterator for registered buffers
  bpf: Add bpf_page_to_pfn helper
  io_uring: Implement eBPF iterator for registered files
  epoll: Implement eBPF iterator for registered items
  bpftool: Output io_uring iterator info
  selftests/bpf: Add test for io_uring BPF iterators
  selftests/bpf: Add test for epoll BPF iterator
  selftests/bpf: Test partial reads for io_uring, epoll iterators
  selftests/bpf: Fix btf_dump test for bpf_iter_link_info
  samples/bpf: Add example to checkpoint/restore io_uring

 fs/eventpoll.c                                | 201 ++++-
 fs/io_uring.c                                 | 345 +++++++++
 include/linux/bpf.h                           |  16 +
 include/uapi/linux/bpf.h                      |  18 +
 kernel/trace/bpf_trace.c                      |  19 +
 samples/bpf/.gitignore                        |   1 +
 samples/bpf/Makefile                          |   8 +-
 samples/bpf/bpf_cr.bpf.c                      | 185 +++++
 samples/bpf/bpf_cr.c                          | 688 ++++++++++++++++++
 samples/bpf/bpf_cr.h                          |  48 ++
 samples/bpf/hbm_kern.h                        |   2 -
 scripts/bpf_doc.py                            |   2 +
 tools/bpf/bpftool/link.c                      |  10 +
 tools/include/uapi/linux/bpf.h                |  18 +
 .../selftests/bpf/prog_tests/bpf_iter.c       | 387 +++++++++-
 .../selftests/bpf/prog_tests/btf_dump.c       |   4 +-
 .../selftests/bpf/progs/bpf_iter_epoll.c      |  33 +
 .../selftests/bpf/progs/bpf_iter_io_uring.c   |  50 ++
 18 files changed, 2027 insertions(+), 8 deletions(-)
 create mode 100644 samples/bpf/bpf_cr.bpf.c
 create mode 100644 samples/bpf/bpf_cr.c
 create mode 100644 samples/bpf/bpf_cr.h
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_epoll.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_io_uring.c