[RFC,00/16] bpf: Checkpoint/Restore In eBPF (CRIB)

Message ID AM6PR03MB58480B81F491E8A34241EB3E99A42@AM6PR03MB5848.eurprd03.prod.outlook.com (mailing list archive)

Juntong Deng July 10, 2024, 6:40 p.m. UTC

Overview
--------

This patch series adds a new bpf program type CRIB (Checkpoint/Restore
In eBPF) for better checkpoint/restore of processes. CRIB provides a new
way to dump/restore process information for better performance, more
flexibility, more extensibility (easier support for dumping/restoring
more information), and more elegant implementation.

Motivation
----------

The original goal of the CRIU (Checkpoint/Restore In Userspace) project
was to implement most of the checkpoint/restore functionality in
userspace [0], avoiding placing most of the implementation in the kernel.
The CRIU project achieves this goal and is currently widely used for
live migration in the cloud and works well in most scenarios. However,
the current technology that CRIU relies on is not optimal and has
some problems.

[0]: https://lwn.net/Articles/451916/

1. CRIU relies heavily on procfs to get process information (checkpoint)

Procfs is not really a good place to use for checkpointing processes
(same for sysfs).

- Lots of system calls, lots of context switches (each file needs to
be opened, read, and closed)

- Variety of formats (each file format is different and parsers need to
be implemented for each format)

- Fixed return information (if the information needed is not currently
supported by procfs, even if it is just a struct member, the upstream
kernel code still needs to be modified to add it)

- Non-extensible formats (the format of some files in the procfs cannot
be extended without breaking backward compatibility)

- Lots of extra information, slow to read (not all information in some
files is useful for checkpoint, and text parsing is inefficient)

More detailed summary of why procfs is not suitable for checkpointing
can be found in [1].

[1]: https://criu.org/Task-diag
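
To make the cost concrete, here is a minimal userspace sketch (not
taken from CRIU or from this series) that reads a few fields from
/proc/<pid>/stat. struct task_stat and read_task_stat() are
hypothetical names chosen for illustration. It needs three system
calls for one file and hand-written text parsing for just four fields:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical record: a few fields a checkpoint tool might want. */
struct task_stat {
	int pid;
	char comm[64];
	char state;
	int ppid;
};

/*
 * Read /proc/<pid>/stat and parse it into *out.
 * Returns 0 on success, -1 on failure.
 */
static int read_task_stat(pid_t pid, struct task_stat *out)
{
	char path[64], buf[1024];
	char *lparen, *rparen;
	size_t len;
	ssize_t n;
	int fd;

	assert(out != NULL);

	snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
	fd = open(path, O_RDONLY);		/* system call 1 */
	if (fd < 0)
		return -1;
	n = read(fd, buf, sizeof(buf) - 1);	/* system call 2 */
	close(fd);				/* system call 3 */
	if (n <= 0)
		return -1;
	buf[n] = '\0';

	/* comm may itself contain spaces or ')', so use the last ')'. */
	lparen = strchr(buf, '(');
	rparen = strrchr(buf, ')');
	if (!lparen || !rparen || rparen < lparen)
		return -1;

	if (sscanf(buf, "%d", &out->pid) != 1)
		return -1;

	len = (size_t)(rparen - lparen - 1);
	if (len >= sizeof(out->comm))
		len = sizeof(out->comm) - 1;
	memcpy(out->comm, lparen + 1, len);
	out->comm[len] = '\0';

	/* The fields after comm are space separated: state, ppid, ... */
	if (sscanf(rparen + 1, " %c %d", &out->state, &out->ppid) != 2)
		return -1;
	return 0;
}
```

Calling read_task_stat(getpid(), &st) fills st for the calling process.
A real dump has to repeat this pattern for every procfs file it needs
and for every task, each file with its own format.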

Andrey tried to replace the insufficient procfs with netlink
(task_diag) [2], but it was not accepted upstream for the following
reasons [3][4][5][6]:

- netlink is unable to elegantly obtain the pidns and userns
of processes

- Since the namespace issue cannot be resolved elegantly,
obtaining process information via netlink can lead to credential
security issues.

[2]: https://lwn.net/Articles/650243/
[3]: https://lore.kernel.org/linux-kernel//CALCETrVg5AyeXW_AGguFoGCPK9_2zeobEgT9JJFsakH6PyQf_A@mail.gmail.com/
[4]: https://lore.kernel.org/linux-kernel//CALCETrVSRkMSAVPz9JW4XCV7DmrgkyGK54HRUrue2R756f5C=Q@mail.gmail.com/
[5]: https://lore.kernel.org/linux-kernel//CALCETrW4LU3M2OAWjnckFR-rqenBjV+ROBi8B3eOo=Y_mCWfGQ@mail.gmail.com/
[6]: https://lore.kernel.org/linux-kernel//CALCETrUzOBybH0-rcgvzMNazjadZpuxkBZLkoUDY30X_-cqBzg@mail.gmail.com/

2. Some process status information is difficult to dump/restore through
normal interfaces

One example is checkpoint/restore for TCP sockets, where we are unable
to get the underlying protocol information for TCP sockets through
procfs (or sysfs), or through the normal socket API. For this, TCP
repair mode [7][8] had to be added, which works but is not an elegant
approach.

In TCP repair mode, we need to change (hijack) the behaviour of the
system calls, including recvmsg and sendmsg, used to dump/restore
packets in the socket write/receive queue. In TCP repair mode,
additional getsockopt/setsockopt optnames need to be introduced to
dump/restore the underlying TCP socket information such as sequence
number, send window, receive window, max window.

[7]: https://lwn.net/Articles/495304/
[8]: https://criu.org/TCP_connection

The above approach of extending system calls may be feasible, but it
is not good practice:

- The structure of the data returned by each system call API is roughly
fixed at the moment it is added. If we need to add new members, then we
may need data structures V1 and V2. If we want to remove members we no
longer need, it would be painful because we need to maintain backward
compatibility. More often we need new extensions to system calls,
such as the new getsockopt optnames.

- We need case-by-case extensions to system calls. As more and more
features are added to the kernel (e.g. io_uring, bpf),
checkpointing/restoring these features via the normal API will become
more and more difficult (or even impossible). We have to keep adding
(extending) lots of single-purpose (perhaps only for
checkpoint/restore) interfaces for various kernel features:
more xxx repair modes, ioctl commands, getxxxopt/setxxxopt optnames.
Obviously, these interfaces are not elegant and may even be
considered cumbersome.

CRIB introduction
-----------------

CRIB is a new bpf program type that is not attached to any hooks
(similar to BPF_PROG_TYPE_SYSCALL), runs through BPF_PROG_RUN, and is
called by userspace programs as an eBPF API for dumping/restoring
process information.

CRIB as a whole consists of three parts: CRIB kfuncs, CRIB ebpf
programs, and the CRIB userspace program.

- CRIB kfuncs provide low-level APIs. Each kfunc is only responsible
for one small task, such as getting a specific file object based on
the file descriptor of a process.

- CRIB ebpf programs provide high-level APIs. Each CRIB ebpf program
obtains process information in the kernel by calling the CRIB kfuncs
and returns the data to the userspace program through ringbuf.
Each CRIB ebpf API is responsible for a relatively complex task,
such as getting all the socket information of a process.

- The CRIB userspace program is responsible for loading the CRIB ebpf
program and calling the CRIB ebpf API, deciding what needs to be dumped
and what needs to be restored, and saving the dumped information so that
it can be read during restoration.

With the above CRIB design, the CRIB kfunc API in the kernel can be kept
simple enough that it does not require much modification even in the
future. Each kfunc can easily be kept reliable without a lot of
complicated code.

Complex ebpf programs and userspace programs are maintained outside
the kernel, and CRIB ebpf programs are maintained with
CRIB userspace programs.

My current positioning of CRIB is as a new engine for CRIU: CRIU acts
as the CRIB userspace program, and together with the CRIB ebpf programs
it gets a new and better way to dump/restore processes, one that has
higher performance and can dump/restore more information.

Why is CRIB better?
-------------------

1. More elegant way to get process information

If xxx repair mode, ioctl, getxxxopt, and setxxxopt are like using a
gastroscope, a colonoscope, and a nasal endoscope, where we need to
keep looking for (adding) more "holes" in the kernel for physical
examination (dump/restore information), then using CRIB is like putting
an intelligent micro physical examination robot (ebpf) into the kernel
and letting it work inside the kernel to collect all the information
and return it.

We no longer need to open more inelegant "holes" in the kernel, and we
no longer need to add more interfaces that are only used for
checkpoint/restore.

2. More flexible and extensible

CRIB ebpf programs are maintained with CRIB userspace programs,
which means that CRIB ebpf programs do not need to provide stable APIs,
do not need stable structures, and can continue to change flexibly with
the needs of CRIB userspace programs.

Most of the information in kernel data structures can be obtained
through BPF_CORE_READ, so there is no need to add trivial CRIB kfuncs,
and the trivial code for obtaining the structure members can be kept
outside the kernel in the CRIB ebpf program. This means that this part of
the code can be added or removed flexibly.

CRIB kfuncs focuses on implementing dump/restore that cannot be done by
simple data structure operations.

3. Higher performance

- Since CRIB is very flexible (CRIB ebpf programs are changeable), we
can dump/restore exactly the information that is needed and nothing
more.

- CRIB ebpf programs can return binary data (not text) via ringbuf,
which means no additional conversion or parsing is required.

- With BPF ringbuf, we avoid lots of system calls, lots of context
switches, and lots of memory copying (between kernel space and
user space).
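
The difference between binary records and text can be sketched in plain
userspace C. This is an illustration only: struct task_event and both
decode functions are hypothetical and are not part of this series. With
a real BPF ringbuf, the sample would arrive as a (data, size) pair in
the consumer's callback. The binary path is a single bounded copy,
while the procfs-style text path has to scan and convert every line:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical fixed-layout record shared by producer and consumer. */
struct task_event {
	uint32_t pid;
	uint32_t ppid;
	uint32_t uid;
	uint32_t gid;
};

/* Binary path: one bounded copy out of the sample, no parsing. */
static int decode_binary(const void *sample, size_t size,
			 struct task_event *ev)
{
	assert(sample != NULL && ev != NULL);
	if (size < sizeof(*ev))
		return -1;
	memcpy(ev, sample, sizeof(*ev));
	return 0;
}

/* Text path: "Key:\tvalue" lines in the style of /proc/<pid>/status. */
static int decode_text(const char *text, struct task_event *ev)
{
	const char *line = text;
	unsigned int found = 0;

	assert(text != NULL && ev != NULL);
	memset(ev, 0, sizeof(*ev));

	while (line && *line) {
		/* Each key is matched at the start of the current line. */
		if (sscanf(line, "Pid:\t%" SCNu32, &ev->pid) == 1)
			found |= 1u << 0;
		else if (sscanf(line, "PPid:\t%" SCNu32, &ev->ppid) == 1)
			found |= 1u << 1;
		else if (sscanf(line, "Uid:\t%" SCNu32, &ev->uid) == 1)
			found |= 1u << 2;
		else if (sscanf(line, "Gid:\t%" SCNu32, &ev->gid) == 1)
			found |= 1u << 3;

		/* Advance to the next line, if there is one. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return found == 0xf ? 0 : -1;
}
```

Both functions produce the same struct, but only the text path has to
be rewritten whenever the producer's output format changes.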

4. Better support for namespaces and credentials

Since CRIB ebpf programs can access the task_struct of a process,
it is simple for CRIB ebpf programs to know the current namespace
(e.g., pidns, userns) and credentials of a process, and there is no
situation where CRIB cannot know that a process has dropped privileges.

The problems in the netlink method mentioned earlier do not exist
in CRIB.

Proof of Concept
----------------

I have currently added three selftest programs to demonstrate the
functionality of CRIB.

- dump_task shows the performance comparison between CRIB and procfs.
CRIB takes only 20-30% of the time procfs takes to obtain the same
process information (roughly a 3x to 5x speedup).

- dump_all_socket shows that CRIB does not need to rely on procfs to
get all the socket information of a process, and can get the
underlying protocol information (e.g., sequence number, send window)
of TCP sockets without using getsockopt.

- restore_udp_socket shows that CRIB can dump/restore packets from
the write queue and receive queue of UDP sockets without adding
additional system call interfaces and without UDP repair mode.

Shortcoming?
------------

Yes, obviously, loading the ebpf programs takes time.

However, in most scenarios, CRIU runs as a service and is integrated
into other software (via RPC or C API) such as OpenVZ, Docker, k8s,
rather than as a standalone tool.

This means that in most scenarios CRIU handles multiple
checkpoints/restores, while the CRIB ebpf programs only need to be
loaded once and can subsequently be used like normal APIs.

Overall, it is worth it.

More?
-----

In restore_udp_socket I had to add a struct bpf_crib_skb_info for
restoring packets because there is currently no BPF_CORE_WRITE.

I am not sure what the current attitude of the kernel community
towards BPF_CORE_WRITE is. Personally, I think it is well worth adding,
as we need a portable way to change values in the kernel.

This would not only allow more of the complexity in the CRIB restoring
part to be moved from CRIB kfuncs to CRIB ebpf programs, but would also
allow ebpf to unlock more possible application scenarios.

At the end
----------

This patch series is not the final patch series. It is still a
proof of concept, incomplete in functionality and probably buggy,
but I think it is enough to show the power of CRIB, which is a
meaningful innovation.

(I know I did not pay attention to the coding style of the test cases
in selftest, as these are only a proof of concept, not real tests.)

This is not only a new checkpoint/restore method, but also allows us
to think about what more eBPF might be able to do, and what more we
can unlock with eBPF.

I would like to get some feedback. Discussion is welcome!

Signed-off-by: Juntong Deng <juntong.deng@outlook.com>

Juntong Deng (16):
  bpf: Introduce BPF_PROG_TYPE_CRIB
  bpf: Add KF_ITER_GETTER and KF_ITER_SETTER flags
  bpf: Improve bpf kfuncs pointer arguments chain of trust
  bpf: Add bpf_task_from_vpid() kfunc
  bpf/crib: Add struct file related CRIB kfuncs
  bpf/crib: Introduce task_file open-coded iterator kfuncs
  bpf/crib: Add struct sock related CRIB kfuncs
  bpf/crib: Add CRIB kfuncs for getting pointer to often-used
    socket-related structures
  bpf/crib: Add CRIB kfuncs for getting socket source/destination
    addresses
  bpf/crib: Add struct sk_buff related CRIB kfuncs
  bpf/crib: Introduce skb open-coded iterator kfuncs
  bpf/crib: Introduce skb_data open-coded iterator kfuncs
  bpf/crib: Add CRIB kfuncs for restoring data in skb
  selftests/crib: Add test for getting basic information of the process
  selftests/crib: Add test for getting all socket information of the
    process
  selftests/crib: Add test for dumping/restoring UDP socket packets

 include/linux/bpf_crib.h                      |  62 +++
 include/linux/bpf_types.h                     |   4 +
 include/linux/btf.h                           |   5 +-
 include/uapi/linux/bpf.h                      |   1 +
 kernel/bpf/Kconfig                            |   2 +
 kernel/bpf/Makefile                           |   2 +
 kernel/bpf/btf.c                              |  34 +-
 kernel/bpf/crib/Kconfig                       |  14 +
 kernel/bpf/crib/Makefile                      |   3 +
 kernel/bpf/crib/bpf_checkpoint.c              | 360 ++++++++++++++++
 kernel/bpf/crib/bpf_crib.c                    | 397 ++++++++++++++++++
 kernel/bpf/crib/bpf_restore.c                 |  80 ++++
 kernel/bpf/helpers.c                          |  21 +
 kernel/bpf/syscall.c                          |   1 +
 kernel/bpf/verifier.c                         |  15 +-
 tools/include/uapi/linux/bpf.h                |   1 +
 tools/lib/bpf/libbpf.c                        |   2 +
 tools/lib/bpf/libbpf_probes.c                 |   1 +
 tools/testing/selftests/crib/.gitignore       |   1 +
 tools/testing/selftests/crib/Makefile         | 136 ++++++
 tools/testing/selftests/crib/config           |   7 +
 .../selftests/crib/test_dump_all_socket.bpf.c | 252 +++++++++++
 .../selftests/crib/test_dump_all_socket.c     | 375 +++++++++++++++++
 .../selftests/crib/test_dump_all_socket.h     |  69 +++
 .../selftests/crib/test_dump_task.bpf.c       | 125 ++++++
 tools/testing/selftests/crib/test_dump_task.c | 337 +++++++++++++++
 tools/testing/selftests/crib/test_dump_task.h |  90 ++++
 .../crib/test_restore_udp_socket.bpf.c        | 311 ++++++++++++++
 .../selftests/crib/test_restore_udp_socket.c  | 333 +++++++++++++++
 .../selftests/crib/test_restore_udp_socket.h  |  51 +++
 30 files changed, 3080 insertions(+), 12 deletions(-)
 create mode 100644 include/linux/bpf_crib.h
 create mode 100644 kernel/bpf/crib/Kconfig
 create mode 100644 kernel/bpf/crib/Makefile
 create mode 100644 kernel/bpf/crib/bpf_checkpoint.c
 create mode 100644 kernel/bpf/crib/bpf_crib.c
 create mode 100644 kernel/bpf/crib/bpf_restore.c
 create mode 100644 tools/testing/selftests/crib/.gitignore
 create mode 100644 tools/testing/selftests/crib/Makefile
 create mode 100644 tools/testing/selftests/crib/config
 create mode 100644 tools/testing/selftests/crib/test_dump_all_socket.bpf.c
 create mode 100644 tools/testing/selftests/crib/test_dump_all_socket.c
 create mode 100644 tools/testing/selftests/crib/test_dump_all_socket.h
 create mode 100644 tools/testing/selftests/crib/test_dump_task.bpf.c
 create mode 100644 tools/testing/selftests/crib/test_dump_task.c
 create mode 100644 tools/testing/selftests/crib/test_dump_task.h
 create mode 100644 tools/testing/selftests/crib/test_restore_udp_socket.bpf.c
 create mode 100644 tools/testing/selftests/crib/test_restore_udp_socket.c
 create mode 100644 tools/testing/selftests/crib/test_restore_udp_socket.h