mbox series

[RFC,bpf-next,RESEND,00/16] bpf: Checkpoint/Restore In eBPF (CRIB)

Message ID AM6PR03MB58488045E4D0FA6AEDC8BDE099A52@AM6PR03MB5848.eurprd03.prod.outlook.com (mailing list archive)
Headers show
Series bpf: Checkpoint/Restore In eBPF (CRIB) | expand

Message

Juntong Deng July 11, 2024, 11:10 a.m. UTC
Overview
--------

This patch series adds a new bpf program type CRIB (Checkpoint/Restore
In eBPF) for better checkpoint/restore of processes. CRIB provides a new
way to dump/restore process information for better performance, more
flexibility, more extensibility (easier support for dumping/restoring
more information), and more elegant implementation.

Motivation
----------

The original goal of the CRIU (Checkpoint/Restore In Userspace) project
was to implement most of the checkpoint/restore functionality in
userspace [0], avoiding placing most of the implementation in the kernel.
The CRIU project achieves this goal and is currently widely used for
live migration in the cloud and works well in most scenarios. However,
the current technology that CRIU relies on is not optimal and has
some problems.

[0]: https://lwn.net/Articles/451916/

1. CRIU relies heavily on procfs to get process information (checkpoint)

Procfs is not really a good place to use for checkpointing processes
(same for sysfs).

- Lots of system calls, lots of context switches (each file needs to
open, read, close)

- Variety of formats (each file format is different and parsers need to
be implemented for each format)

- Fixed return information (if the information needed is not currently
supported by procfs, even if it is just a struct member, the upstream
kernel code still needs to be modified to add it)

- Non-extensible formats (the format of some files in the procfs cannot
be extended without breaking backward compatibility)

- Lots of extra information, slow to read (not all information in some
files is useful for checkpoint, and text parsing is inefficient)

More detailed summary of why procfs is not suitable for checkpointing
can be found in [1].

[1]: https://criu.org/Task-diag

Andrey has tried to replace insufficient procfs by using netlink 
(task_diag) [2], but it was not accepted by upstream for reasons
[3][4][5][6]:

- netlink is unable to elegantly obtain the pidns and userns
of processes

- Since the namespace issue cannot be resolved elegantly,
obtaining process information via netlink can lead to credential
security issues.

[2]: https://lwn.net/Articles/650243/
[3]: https://lore.kernel.org/linux-kernel//CALCETrVg5AyeXW_AGguFoGCPK9_2zeobEgT9JJFsakH6PyQf_A@mail.gmail.com/
[4]: https://lore.kernel.org/linux-kernel//CALCETrVSRkMSAVPz9JW4XCV7DmrgkyGK54HRUrue2R756f5C=Q@mail.gmail.com/
[5]: https://lore.kernel.org/linux-kernel//CALCETrW4LU3M2OAWjnckFR-rqenBjV+ROBi8B3eOo=Y_mCWfGQ@mail.gmail.com/
[6]: https://lore.kernel.org/linux-kernel//CALCETrUzOBybH0-rcgvzMNazjadZpuxkBZLkoUDY30X_-cqBzg@mail.gmail.com/

2. Some process status information is difficult to dump/restore through
normal interfaces

One example is checkpoint/restore for TCP sockets, where we are unable
to get the underlying protocol information for TCP sockets through
procfs (or sysfs), or through the normal socket API. Here we need to
add TCP repair mode [7][8], which works but is not an elegant approach. 

In TCP repair mode, we need to change (hijack) the behaviour of the
system calls, including recvmsg and sendmsg, used to dump/restore
packets in the socket write/receive queue. In TCP repair mode,
additional getsockopt/setsockopt optnames need to be introduced to
dump/restore the underlying TCP socket information such as sequence
number, send window, receive window, max window.

[7]: https://lwn.net/Articles/495304/
[8]: https://criu.org/TCP_connection

The above approach to extending system calls may be feasible, but not
good practice:

- The structure of the data returned by each system call API is roughly
fixed at the moment it is added. If we need to add new members, then we
may need data structures V1 and V2. If we want to remove members we no
longer need, it would be painful because we need to maintain backward
compatibility. More often we need new extensions to system calls,
such as the new getsockopt optnames.

- We need case-by-case extensions to system calls. As more and more
features are added to the kernel (e.g. io uring, bpf),
checkpointing/restoring these features via the normal API will become
more and more difficult (or even impossible). We have had to continue
to add (extend) lots of single-purpose (perhaps only for
checkpoint/restore) interfaces for various kernel features ,
more xxx repair modes, ioctl commands, getxxxopt/setxxxopt optnames.
Obviously, these interfaces are not elegant and may even be
considered cumbersome.

CRIB introduction
-----------------

CRIB is a new bpf program type that is not attached to any hooks
(similar to BPF_PROG_TYPE_SYSCALL), runs through BPF_PROG_RUN, and is
called by userspace programs as eBPF API for dumping/restoring
process information.

The entire CRIB consists of three parts, CRIB kfuncs, CRIB ebpf programs,
and CRIB user space program.

- CRIB kfuncs provides low-level APIs. Each kfuncs low-level API is only
responsible for one small task, such as getting a specific file object
based on the file descriptor of a process.

- CRIB ebpf program provides high-level APIs. Each CRIB ebpf program
obtains process information in the kernel by calling the CRIB kfuncs
API and returns the data to the userspace program through ringbuf.
Each CRIB ebpf API is responsible for some relatively complex tasks,
such as getting all the socket information of a process.

- The CRIB userspace program is responsible for loading the CRIB ebpf
program and calling the CRIB ebpf API, deciding what needs to be dumped
and what needs to be restored, and saving the dumped information so that
it can be read during restoration.

With the above CRIB design, the CRIB kfunc API in the kernel can be kept
simple enough that it does not require much modification even in the
future. Each kfuncs can be easily kept reliable without a lot of
complicated code.

Complex ebpf programs and userspace programs are maintained outside
the kernel, and CRIB ebpf programs are maintained with
CRIB userspace programs.

My current positioning of CRIB is that CRIU as CRIB userspace program
and CRIB ebpf program can be used as a new engine for CRIU, a new
and better way to dump/restore processes which has higher performance
and can dump/restore more information.

Why CRIB is better?
-------------------

1. More elegant way to get process information

If xxx repair mode, ioctl, getxxxopt, setxxxopt are like using
gastroscope, colonoscope, nasal endoscope, and we need to keep looking
for (add) more "holes" in the kernel for physical examination
(dump/restore information), then using CRIB is like putting an
intelligent micro physical examination robot (ebpf) into the kernel
and letting it work inside the kernel to collect all the information
and return.

We no longer need to open more inelegant "holes" in the kernel, and we
no longer need to add more interfaces that are only used for
checkpoint/restore.

2. More flexible and extensible

CRIB ebpf programs are maintained with CRIB userspace programs,
which means that CRIB ebpf programs do not need to provide stable APIs,
do not need stable structures, and can continue to change flexibly with
the needs of CRIB userspace programs.

Most of the information in kernel data structures can be obtained
through BPF_CORE_READ, so there is no need to add trivial CRIB kfuncs,
and the trivial code for obtaining the structure members can be kept
outside the kernel in the CRIB ebpf program. This means that this part of
the code can be added or removed flexibly.

CRIB kfuncs focuses on implementing dump/restore that cannot be done by
simple data structure operations.

3. Higher performance

- Since CRIB is very flexible (CRIB ebpf programs are changeable), we
can dump/restore just enough information and no additional information
is needed. 

- CRIB ebpf programs can return binary data (not text) via ringbuf,
which means no additional conversion or parsing is required.

- With BPF ringbuf, we avoid lots of system calls, lots of context
switches, and lots of memory copying (between kernel space and
user space).

4. Better support for namespaces and credentials

Since CRIB ebpf programs can access the task_struct of a process,
it is simple for CRIB ebpf programs to know the current namespace
(e.g., pidns, userns) and credentials of a process, and there is no
situation where CRIB cannot know that a process has dropped privileges.

The problems in the netlink method mentioned earlier do not exist
in CRIB.

Proof of Concept
----------------

I have currently added three selftest programs to demonstrate the
functionality of CRIB.

- dump_task shows the performance comparison between CRIB and procfs.
CRIB takes only 20-30% of the time of the procfs to obtain the same
process information.

- dump_all_socket shows that CRIB does not need to rely on procfs to
get all the socket information of a process, and can get the
underlying protocol information (e.g., sequence number, send window)
of TCP sockets without using getsockopt.

- restore_udp_socket shows that CRIB can dump/restore packets from
the write queue and receive queue of UDP sockets without adding
additional system call interfaces and without UDP repair mode.

Shortcoming?
------------

Yes, obviously, loading the ebpf programs takes time.

However, in most scenarios, CRIU runs as a service and is integrated
into other software (via RPC or C API) such as OpenVZ , docker, k8s,
rather than as a standalone tool.

This means that in most scenarios CRIU will handle multiple
checkpoints/restores, but in this case CRIB ebpf programs only need
to be loaded once, and can be subsequently used like normal APIs.

Overall, it is worth it.

More?
-----

In restore_udp_socket I had to add a struct bpf_crib_skb_info for
restoring packets, this is because there is currently no BPF_CORE_WRITE.

I am not sure what the current attitude of the kernel community
towards BPF_CORE_WRITE is, personally I think it is well worth adding,
as we need a portable way to change the value in the kernel.

This not only allows more complexity in the CRIB restoring part to
be transferred from CRIB kfuncs to CRIB ebpf programs, but also allows
ebpf to unlock more possible application scenarios. 

At the end
----------

This patch series is not the final patch series, this is still a
proof of concept, incomplete in functionality and probably buggy,
but I think it is enough to show the power of CRIB, which is a
meaningful innovation.

(I know I did not pay attention to the coding style of the test cases
in selftest, as these are only for proof of concept, not real testing)

This is not only a new checkpoint/restore method, but also allows us
to think about what more eBPF might be able to do, and what more we
can unlock with eBPF.

I would like to get some feedback, welcome to discuss!

(This resend is used to fix mail thread that was messed up by outlook.)

Signed-off-by: Juntong Deng <juntong.deng@outlook.com>

Juntong Deng (16):
  bpf: Introduce BPF_PROG_TYPE_CRIB
  bpf: Add KF_ITER_GETTER and KF_ITER_SETTER flags
  bpf: Improve bpf kfuncs pointer arguments chain of trust
  bpf: Add bpf_task_from_vpid() kfunc
  bpf/crib: Add struct file related CRIB kfuncs
  bpf/crib: Introduce task_file open-coded iterator kfuncs
  bpf/crib: Add struct sock related CRIB kfuncs
  bpf/crib: Add CRIB kfuncs for getting pointer to often-used
    socket-related structures
  bpf/crib: Add CRIB kfuncs for getting socket source/destination
    addresses
  bpf/crib: Add struct sk_buff related CRIB kfuncs
  bpf/crib: Introduce skb open-coded iterator kfuncs
  bpf/crib: Introduce skb_data open-coded iterator kfuncs
  bpf/crib: Add CRIB kfuncs for restoring data in skb
  selftests/crib: Add test for getting basic information of the process
  selftests/crib: Add test for getting all socket information of the
    process
  selftests/crib: Add test for dumping/restoring UDP socket packets

 include/linux/bpf_crib.h                      |  62 +++
 include/linux/bpf_types.h                     |   4 +
 include/linux/btf.h                           |   5 +-
 include/uapi/linux/bpf.h                      |   1 +
 kernel/bpf/Kconfig                            |   2 +
 kernel/bpf/Makefile                           |   2 +
 kernel/bpf/btf.c                              |  34 +-
 kernel/bpf/crib/Kconfig                       |  14 +
 kernel/bpf/crib/Makefile                      |   3 +
 kernel/bpf/crib/bpf_checkpoint.c              | 360 ++++++++++++++++
 kernel/bpf/crib/bpf_crib.c                    | 397 ++++++++++++++++++
 kernel/bpf/crib/bpf_restore.c                 |  80 ++++
 kernel/bpf/helpers.c                          |  21 +
 kernel/bpf/syscall.c                          |   1 +
 kernel/bpf/verifier.c                         |  15 +-
 tools/include/uapi/linux/bpf.h                |   1 +
 tools/lib/bpf/libbpf.c                        |   2 +
 tools/lib/bpf/libbpf_probes.c                 |   1 +
 tools/testing/selftests/crib/.gitignore       |   1 +
 tools/testing/selftests/crib/Makefile         | 136 ++++++
 tools/testing/selftests/crib/config           |   7 +
 .../selftests/crib/test_dump_all_socket.bpf.c | 252 +++++++++++
 .../selftests/crib/test_dump_all_socket.c     | 375 +++++++++++++++++
 .../selftests/crib/test_dump_all_socket.h     |  69 +++
 .../selftests/crib/test_dump_task.bpf.c       | 125 ++++++
 tools/testing/selftests/crib/test_dump_task.c | 337 +++++++++++++++
 tools/testing/selftests/crib/test_dump_task.h |  90 ++++
 .../crib/test_restore_udp_socket.bpf.c        | 311 ++++++++++++++
 .../selftests/crib/test_restore_udp_socket.c  | 333 +++++++++++++++
 .../selftests/crib/test_restore_udp_socket.h  |  51 +++
 30 files changed, 3080 insertions(+), 12 deletions(-)
 create mode 100644 include/linux/bpf_crib.h
 create mode 100644 kernel/bpf/crib/Kconfig
 create mode 100644 kernel/bpf/crib/Makefile
 create mode 100644 kernel/bpf/crib/bpf_checkpoint.c
 create mode 100644 kernel/bpf/crib/bpf_crib.c
 create mode 100644 kernel/bpf/crib/bpf_restore.c
 create mode 100644 tools/testing/selftests/crib/.gitignore
 create mode 100644 tools/testing/selftests/crib/Makefile
 create mode 100644 tools/testing/selftests/crib/config
 create mode 100644 tools/testing/selftests/crib/test_dump_all_socket.bpf.c
 create mode 100644 tools/testing/selftests/crib/test_dump_all_socket.c
 create mode 100644 tools/testing/selftests/crib/test_dump_all_socket.h
 create mode 100644 tools/testing/selftests/crib/test_dump_task.bpf.c
 create mode 100644 tools/testing/selftests/crib/test_dump_task.c
 create mode 100644 tools/testing/selftests/crib/test_dump_task.h
 create mode 100644 tools/testing/selftests/crib/test_restore_udp_socket.bpf.c
 create mode 100644 tools/testing/selftests/crib/test_restore_udp_socket.c
 create mode 100644 tools/testing/selftests/crib/test_restore_udp_socket.h

Comments

Alexei Starovoitov July 22, 2024, 11:47 p.m. UTC | #1
On Thu, Jul 11, 2024 at 12:10:17PM +0100, Juntong Deng wrote:
> 
> In restore_udp_socket I had to add a struct bpf_crib_skb_info for
> restoring packets, this is because there is currently no BPF_CORE_WRITE.
> 
> I am not sure what the current attitude of the kernel community
> towards BPF_CORE_WRITE is, personally I think it is well worth adding,
> as we need a portable way to change the value in the kernel.
> 
> This not only allows more complexity in the CRIB restoring part to
> be transferred from CRIB kfuncs to CRIB ebpf programs, but also allows
> ebpf to unlock more possible application scenarios. 

There are lots of interesting ideas in this patch set, but it seems they are
doing the 'C-checkpoint' part of CRIx and something like BPF_CORE_WRITE
is necessary for 'R-restore'.
I'm afraid BPF_CORE_WRITE cannot be introduced without breaking all safety nets.
It will make bpf just as unsafe as any kernel module if bpf progs can start
writing into arbitrary kernel data structures. So it's a show stopper.
If you think there is a value in adding all these iterators for 'checkpoint'
part alone we can discuss and generalize individual patches.

High level feedback:

- no need for BPF_PROG_TYPE_CRIB program type. Existing syscall type should fit.

- proposed file/socket iterators are somewhat unnecessary in this open coded form.
  there is already file/socket iterator. From the selftests it looks like it
  can be used to do 'checkpoint' part already.

- KF_ITER_GETTER is a good addition, but we should be able to do it without these flags.
  kfunc-s should be able to accept iterator as an argument. Some __suffix annotation
  may be necessary to help verifier if BTF type alone of the argument won't be enough.

- KF_OBTAIN looks like a broken hammer to bypass safety. Like:

  > Currently we cannot pass the pointer returned by the iterator next
  > method as argument to the KF_TRUSTED_ARGS kfuncs, because the pointer
  > returned by the iterator next method is not "valid".

  It's true, but should be fixable directly. Make return pointer of iter_next() to be trusted.

- iterators for skb data don't feel right. bpf_dynptr_from_skb() should do the trick already.

- start with a small patch set.
  30 files changed, 3080 insertions(+), 12 deletions(-)
  isn't really reviewable.
Kumar Kartikeya Dwivedi July 23, 2024, 12:49 a.m. UTC | #2
On Tue, 23 Jul 2024 at 01:47, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jul 11, 2024 at 12:10:17PM +0100, Juntong Deng wrote:
> >
> > In restore_udp_socket I had to add a struct bpf_crib_skb_info for
> > restoring packets, this is because there is currently no BPF_CORE_WRITE.
> >
> > I am not sure what the current attitude of the kernel community
> > towards BPF_CORE_WRITE is, personally I think it is well worth adding,
> > as we need a portable way to change the value in the kernel.
> >
> > This not only allows more complexity in the CRIB restoring part to
> > be transferred from CRIB kfuncs to CRIB ebpf programs, but also allows
> > ebpf to unlock more possible application scenarios.
>
> There are lots of interesting ideas in this patch set, but it seems they are
> doing the 'C-checkpoint' part of CRIx and something like BPF_CORE_WRITE
> is necessary for 'R-restore'.
> I'm afraid BPF_CORE_WRITE cannot be introduced without breaking all safety nets.
> It will make bpf just as unsafe as any kernel module if bpf progs can start
> writing into arbitrary kernel data structures. So it's a show stopper.
> If you think there is a value in adding all these iterators for 'checkpoint'
> part alone we can discuss and generalize individual patches.

I think it would be better to focus on the particular problem Juntong
wants to solve, and go from there.
That might help in cutting down the size of the patch set.
It seems the main problem was restoring UDP sockets, but it got lost
among all the other stuff.
It's better to begin the discussion from there, which can still be
rooted in what you believe CRIB in general is useful for.

Also, information is missing on what the previous attempts at solving
this UDP problem were, and why they were insufficient such that BPF
was necessary.
What motivates the examples included as part of this set?
I think this particular GSoC project is not new, so what were the
limitations in previous attempts at restoring UDP sockets?
Adding kfuncs makes it easier to checkpoint and restore state, but it
also carries a maintenance cost.

Using BPF to speed up task state dump is going to be beneficial, but
is an orthogonal problem (and doesn't have to be CRIU specific, the
primitives that CRIU requires can be generic and used by others as
well).

You're also skirting all kinds of compatibility concerns if you encode
state to restore into structs, not getting into specifics, but if this
pattern is followed, what happens on a kernel where say a particular
field isn't available? It is a possibility that kfuncs may change
their behavior due to kernel changes (not CRIB changes particularly),
so how does user space respond to that? Sometimes, patches are
backported, how does feature detection work?

What happens when the struct used to restore is grown to accomodate
more state to restore? Kfuncs will have to detect the size of the
structure and work with multiple versions (like what nf_conntrack_bpf
kfuncs try to do with opts__sz).

I tried to add io_uring and epoll iterators for capturing state
(https://lore.kernel.org/bpf/20211201042333.2035153-1-memxor@gmail.com)
a couple of years back, although I didn't have time to pursue it
further after GSoC. But I tried to minimize the restoration interfaces
exposed precisely because the above is hard to ensure. The more kfuncs
you expose to restore state, the deeper the hole becomes, since it's
meant to be a relatively user-friendly interface for CRIU to use, and
work across different kernel versions.

Can the values passed through the struct to restore state be trusted?
I'm not very well versed with the net/, but I think
bpf_restore_skb_rcv_queue isn't doing much sanitization and taking in
whatever was passed by the program. It would be helpful to explain why
that is or is not ok.

It's easier to review if we just focus on a particular problem. I
think let's start with the UDP case, and then look at everything else
later.

>
> High level feedback:
>
> - no need for BPF_PROG_TYPE_CRIB program type. Existing syscall type should fit.
>

+1

> - proposed file/socket iterators are somewhat unnecessary in this open coded form.
>   there is already file/socket iterator. From the selftests it looks like it
>   can be used to do 'checkpoint' part already.

+1

>
> - KF_ITER_GETTER is a good addition, but we should be able to do it without these flags.
>   kfunc-s should be able to accept iterator as an argument. Some __suffix annotation
>   may be necessary to help verifier if BTF type alone of the argument won't be enough.
>
> - KF_OBTAIN looks like a broken hammer to bypass safety. Like:
>
>   > Currently we cannot pass the pointer returned by the iterator next
>   > method as argument to the KF_TRUSTED_ARGS kfuncs, because the pointer
>   > returned by the iterator next method is not "valid".

I've replied to this particular patch to explain what exact unsafety
it might introduce.
I also think the 2nd use case might be fixed by a recent patch.

[...]
Alexei Starovoitov July 23, 2024, 12:58 a.m. UTC | #3
On Mon, Jul 22, 2024 at 5:50 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> >
> >   > Currently we cannot pass the pointer returned by the iterator next
> >   > method as argument to the KF_TRUSTED_ARGS kfuncs, because the pointer
> >   > returned by the iterator next method is not "valid".
>
> I've replied to this particular patch to explain what exact unsafety
> it might introduce.

What do you mean?
I think we can make the return value from iter_next() trusted in
certain cases.
For example bpf_iter_task_next() returns task_struct and it
can be safely marked as MEM_RCU, since the whole iterator is
KF_RCU_PROTECTED.
Kumar Kartikeya Dwivedi July 23, 2024, 1 a.m. UTC | #4
On Tue, 23 Jul 2024 at 02:58, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jul 22, 2024 at 5:50 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > >
> > >   > Currently we cannot pass the pointer returned by the iterator next
> > >   > method as argument to the KF_TRUSTED_ARGS kfuncs, because the pointer
> > >   > returned by the iterator next method is not "valid".
> >
> > I've replied to this particular patch to explain what exact unsafety
> > it might introduce.
>
> What do you mean?
> I think we can make the return value from iter_next() trusted in
> certain cases.
> For example bpf_iter_task_next() returns task_struct and it
> can be safely marked as MEM_RCU, since the whole iterator is
> KF_RCU_PROTECTED.

Yes, iter_next is ok (see reply here:
https://lore.kernel.org/bpf/CAP01T74pq7pozpMi_LJUA8wehjpATMR3oM4vj7HHxohBPb0LbA@mail.gmail.com/).
But number 1 doesn't seem ok. Number 2 should now be possible if I'm
not mistaken.
Juntong Deng July 31, 2024, 1:09 p.m. UTC | #5
On 2024/7/23 00:47, Alexei Starovoitov wrote:
> On Thu, Jul 11, 2024 at 12:10:17PM +0100, Juntong Deng wrote:
>>
>> In restore_udp_socket I had to add a struct bpf_crib_skb_info for
>> restoring packets, this is because there is currently no BPF_CORE_WRITE.
>>
>> I am not sure what the current attitude of the kernel community
>> towards BPF_CORE_WRITE is, personally I think it is well worth adding,
>> as we need a portable way to change the value in the kernel.
>>
>> This not only allows more complexity in the CRIB restoring part to
>> be transferred from CRIB kfuncs to CRIB ebpf programs, but also allows
>> ebpf to unlock more possible application scenarios.
> 
> There are lots of interesting ideas in this patch set, but it seems they are
> doing the 'C-checkpoint' part of CRIx and something like BPF_CORE_WRITE
> is necessary for 'R-restore'.
> I'm afraid BPF_CORE_WRITE cannot be introduced without breaking all safety nets.
> It will make bpf just as unsafe as any kernel module if bpf progs can start
> writing into arbitrary kernel data structures. So it's a show stopper.
> If you think there is a value in adding all these iterators for 'checkpoint'
> part alone we can discuss and generalize individual patches.
> 

Thanks for your review!

I agree, BPF_CORE_WRITE will compromise the safety of ebpf programs,
which may be a Pandora's box.

But without BPF_CORE_WRITE, CRIB cannot achieve portable restoration,
so the restoration part is put on hold for now.

In the next version of the patch set, I will focus on implementing
checkpointing (dumping) via CRIB for better dumping performance and more
extensibility (which still has a lot of benefits).

> High level feedback:
> 
> - no need for BPF_PROG_TYPE_CRIB program type. Existing syscall type should fit.
> 

- If we use BPF_PROG_TYPE_SYSCALL for CRIB programs, will it cause
confusion in the functionality of bpf program types?
(BPF_PROG_TYPE_SYSCALL was originally designed to execute syscalls)

- Is it good to expose all kfuncs needed for checkpointing to
BPF_PROG_TYPE_SYSCALL? (Maybe we need a separate ebpf program type to
restrict the kfuncs that can be used)

- Maybe CRIB needs more specific ebpf program running restrictions?
(for example, not allowed to modify the context, dumped data can only
be returned via ringbuf, context is only used to identify the process
that needs to dump and the part of the data that needs to be dumped)

The above three points were my considerations when I originally added
BPF_PROG_TYPE_CRIB, maybe we can have more discussion?

> - proposed file/socket iterators are somewhat unnecessary in this open coded form.
>    there is already file/socket iterator. From the selftests it looks like it
>    can be used to do 'checkpoint' part already.
> 

If you mean iterators like iter/task_file, iter/tcp, etc., then I think
that is not flexible enough for checkpointing.

This is because the context of bpf iterators is fixed and bpf iterators
cannot be nested. This means that a bpf iterator program can only
complete a specific small iterative dump task, and cannot dump
multi-level data.

An example, when we need to dump all the sockets of a process, we need
to iterate over all the files (sockets) of the process, and iterate over
the all packets in the queue of each socket, and iterate over all data
in each packet.

If we use bpf iterator, since the iterator can not be nested, we need to
use socket iterator program to get all the basic information of all
sockets (pass pid as filter), and then use packet iterator program to
get the basic information of all packets of a specific socket (pass pid,
fd as filter), and then use packet data iterator program to get all the
data of a specific packet (pass pid, fd, packet index as filter).

This would be complicated and require a lot of (each iteration)
bpf program startup and exit (leading to poor performance).

By comparison, open coded iterator is much more flexible, we can iterate
in any context, at any time, and iteration can be nested, so we can
achieve more flexible and more elegant dumping through open coded
iterators.

With open coded iterators, all of the above can be done in a single
bpf program, and with nested iterators, everything becomes compact
and simple.

Also, bpf iterators transmit data to user space through seq_file,
which involves a lot of open (bpf_iter_create), read, close syscalls,
context switching, memory copying, and cannot achieve the performance
of using ringbuf.

The bpf iterator is more like an advanced procfs, but still not CRIB.

> - KF_ITER_GETTER is a good addition, but we should be able to do it without these flags.
>    kfunc-s should be able to accept iterator as an argument. Some __suffix annotation
>    may be necessary to help verifier if BTF type alone of the argument won't be enough.
> 

I agree, kfuncs can accept iterators as arguments and we can
use __suffix.

But here is a question, should we consider these kfuncs as iter kfuncs?

That is, should we impose specific constraints on these functions?
For example, specific naming patterns (bpf_iter_<type>_ prefix),
GETTER methods cannot take extra arguments (like next methods), etc.

Currently the verifier applies these constraints based on flags.

> - KF_OBTAIN looks like a broken hammer to bypass safety. Like:
> 
>    > Currently we cannot pass the pointer returned by the iterator next
>    > method as argument to the KF_TRUSTED_ARGS kfuncs, because the pointer
>    > returned by the iterator next method is not "valid".
> 
>    It's true, but should be fixable directly. Make return pointer of iter_next() to be trusted.
> 

I agree that KF_OBTAIN currently is not a good solution.

For case 1, I tried the ref_obj_id method mentioned by Kumar and
it worked, solving the ownership and lifetime problems. I will include
it in the next version of the patch.

For case 2, Kumar mentioned that it had been fixed by Matt, but I found
there are still some problems.

More details can be found in my reply to Kumar (in the same email thread)

For iter_next(), I currently have an idea to add new flags to allow
iter_next() to decide whether the return value is trusted or not,
such as KF_RET_TRUSTED.

What do you think?

Also, for these improvements to the chain of trust, do you think I
should send them out as separate patches? (rather than as part of
the CRIB patch set)

> - iterators for skb data don't feel right. bpf_dynptr_from_skb() should do the trick already.
> 

I agree that using bpf_dynptr would be better, but probably would
not change much...

This is because, we cannot guarantee that the user provided a large
enough buffer, the buffer provided by the user may not be able to hold
all the data of the packet (but we need to dump the whole packet, the
packet may be very large, which is different from the case of reading
only a fixed size protocol header for filtering), which means we need to
read the data in batches (iteratively), so we still need an iterator.

(Back to the BPF_PROG_TYPE_CRIB discussion, BPF_PROG_TYPE_SYSCALL cannot
use bpf_dynptr_from_skb, but should we expose bpf_dynptr_from_skb to
BPF_PROG_TYPE_SYSCALL? Maybe we need a separate program type...)

> - start with a small patch set.
>    30 files changed, 3080 insertions(+), 12 deletions(-)
>    isn't really reviewable.

Sorry, I will reduce the size of the patch set in the next version.

I will remove the proof of concept part, keep only the real tests,
and start trying to integrate CRIB with CRIU.
Juntong Deng July 31, 2024, 1:53 p.m. UTC | #6
On 7/23/24 01:49, Kumar Kartikeya Dwivedi wrote:
> On Tue, 23 Jul 2024 at 01:47, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>> On Thu, Jul 11, 2024 at 12:10:17PM +0100, Juntong Deng wrote:
>>>
>>> In restore_udp_socket I had to add a struct bpf_crib_skb_info for
>>> restoring packets, this is because there is currently no BPF_CORE_WRITE.
>>>
>>> I am not sure what the current attitude of the kernel community
>>> towards BPF_CORE_WRITE is, personally I think it is well worth adding,
>>> as we need a portable way to change the value in the kernel.
>>>
>>> This not only allows more complexity in the CRIB restoring part to
>>> be transferred from CRIB kfuncs to CRIB ebpf programs, but also allows
>>> ebpf to unlock more possible application scenarios.
>>
>> There are lots of interesting ideas in this patch set, but it seems they are
>> doing the 'C-checkpoint' part of CRIx and something like BPF_CORE_WRITE
>> is necessary for 'R-restore'.
>> I'm afraid BPF_CORE_WRITE cannot be introduced without breaking all safety nets.
>> It will make bpf just as unsafe as any kernel module if bpf progs can start
>> writing into arbitrary kernel data structures. So it's a show stopper.
>> If you think there is a value in adding all these iterators for 'checkpoint'
>> part alone we can discuss and generalize individual patches.
> 
> I think it would be better to focus on the particular problem Juntong
> wants to solve, and go from there.
> That might help in cutting down the size of the patch set.
> It seems the main problem was restoring UDP sockets, but it got lost
> among all the other stuff.
> It's better to begin the discussion from there, which can still be
> rooted in what you believe CRIB in general is useful for.
> 
> Also, information is missing on what the previous attempts at solving
> this UDP problem were, and why they were insufficient such that BPF
> was necessary.
> What motivates the examples included as part of this set?
> I think this particular GSoC project is not new, so what were the
> limitations in previous attempts at restoring UDP sockets?

Yes, this idea originated from the GSoC task of dumping CORK-ed
UDP socket.

While I was solving this task I realized that ebpf has a much greater
potential to completely change the way we checkpoint/restore processes,
and can achieve better performance and more extensibility,
and that is CRIB.

For now, restoring CORK-ed UDP sockets is just one of the problems that
CRIB can solve, and it is not the main problem (that is, CRIB is not
designed around solving UDP problem).

(The difficulty with restoring CORK-ed UDP is that we do not have a
simple and elegant way to read back UDP packets in the write queue
before, but this is a simple task in CRIB.)

This is why I did not mention the GSoC task and the previous attempts
to solve the UDP problem in the patch set, because it is not the
same problem, and the previous solution to the UDP problem has nothing
to do with ebpf (CRIB).

But if adding this information would be useful, I can add it in the next
version of the patch set.

> Adding kfuncs makes it easier to checkpoint and restore state, but it
> also carries a maintenance cost.
> 
> Using BPF to speed up task state dump is going to be beneficial, but
> is an orthogonal problem (and doesn't have to be CRIU specific, the
> primitives that CRIU requires can be generic and used by others as
> well).
> 
> You're also skirting all kinds of compatibility concerns if you encode
> state to restore into structs, not getting into specifics, but if this
> pattern is followed, what happens on a kernel where say a particular
> field isn't available? It is a possibility that kfuncs may change
> their behavior due to kernel changes (not CRIB changes particularly),
> so how does user space respond to that? Sometimes, patches are
> backported, how does feature detection work?
> 
> What happens when the struct used to restore is grown to accomodate
> more state to restore? Kfuncs will have to detect the size of the
> structure and work with multiple versions (like what nf_conntrack_bpf
> kfuncs try to do with opts__sz).
> 

You are right, so CRIB needs BPF_CORE_READ and BPF_CORE_WRITE because we
need a portable way to read/write kernel structure values, and achieving
portability only through kfuncs would be a complex tough problem.

But since BPF_CORE_WRITE cannot be introduced, we put the restoration
part on hold and focus on dumping part first, which we can achieve
portability with BPF_CORE_READ.

> I tried to add io_uring and epoll iterators for capturing state
> (https://lore.kernel.org/bpf/20211201042333.2035153-1-memxor@gmail.com)
> a couple of years back, although I didn't have time to pursue it
> further after GSoC. But I tried to minimize the restoration interfaces
> exposed precisely because the above is hard to ensure. The more kfuncs
> you expose to restore state, the deeper the hole becomes, since it's
> meant to be a relatively user-friendly interface for CRIU to use, and
> work across different kernel versions.
> 

This was a great attempt!

I have always believed that checkpoint/restore via ebpf has great
potential (that is why I created CRIB).

If CRIB is successfully merged to the mainline, maybe we can retry
dumping io_uring and epoll via ebpf.

> Can the values passed through the struct to restore state be trusted?
> I'm not very well versed with the net/, but I think
> bpf_restore_skb_rcv_queue isn't doing much sanitization and taking in
> whatever was passed by the program. It would be helpful to explain why
> that is or is not ok.
> 

Of course it cannot be trusted, but since this is an RFC patch set,
I did not put too much effort into security checking (sanitization),
I mainly wanted to show the idea (proof of concept) and get feedback.

I will put more effort into security in subsequent versions of
the patch set.

> It's easier to review if we just focus on a particular problem. I
> think let's start with the UDP case, and then look at everything else
> later.
> 

Yes, it is always good to start with a particular problem.

I will focus next on solving the socket dump problem via CRIB and try to
integrate it into the CRIU project (in a personal branch).

If the above patch set is not too large, maybe I can also try to solve
one or two problems via CRIB that cannot be well dumped via procfs
in CRIU (poor performance or incomplete information).

Anyway, I will keep the next version of the patch set small and easy
to review.

>>
>> High level feedback:
>>
>> - no need for BPF_PROG_TYPE_CRIB program type. Existing syscall type should fit.
>>
> 
> +1
> 
>> - proposed file/socket iterators are somewhat unnecessary in this open coded form.
>>    there is already file/socket iterator. From the selftests it looks like it
>>    can be used to do 'checkpoint' part already.
> 
> +1
> 
>>
>> - KF_ITER_GETTER is a good addition, but we should be able to do it without these flags.
>>    kfunc-s should be able to accept iterator as an argument. Some __suffix annotation
>>    may be necessary to help verifier if BTF type alone of the argument won't be enough.
>>
>> - KF_OBTAIN looks like a broken hammer to bypass safety. Like:
>>
>>    > Currently we cannot pass the pointer returned by the iterator next
>>    > method as argument to the KF_TRUSTED_ARGS kfuncs, because the pointer
>>    > returned by the iterator next method is not "valid".
> 
> I've replied to this particular patch to explain what exact unsafety
> it might introduce.
> I also think the 2nd use case might be fixed by a recent patch.
> 
> [...]
Kumar Kartikeya Dwivedi July 31, 2024, 4:16 p.m. UTC | #7
On Wed, 31 Jul 2024 at 15:10, Juntong Deng <juntong.deng@outlook.com> wrote:
>
> On 2024/7/23 00:47, Alexei Starovoitov wrote:
> > On Thu, Jul 11, 2024 at 12:10:17PM +0100, Juntong Deng wrote:
> >>
> >> In restore_udp_socket I had to add a struct bpf_crib_skb_info for
> >> restoring packets, this is because there is currently no BPF_CORE_WRITE.
> >>
> >> I am not sure what the current attitude of the kernel community
> >> towards BPF_CORE_WRITE is, personally I think it is well worth adding,
> >> as we need a portable way to change the value in the kernel.
> >>
> >> This not only allows more complexity in the CRIB restoring part to
> >> be transferred from CRIB kfuncs to CRIB ebpf programs, but also allows
> >> ebpf to unlock more possible application scenarios.
> >
> > There are lots of interesting ideas in this patch set, but it seems they are
> > doing the 'C-checkpoint' part of CRIx and something like BPF_CORE_WRITE
> > is necessary for 'R-restore'.
> > I'm afraid BPF_CORE_WRITE cannot be introduced without breaking all safety nets.
> > It will make bpf just as unsafe as any kernel module if bpf progs can start
> > writing into arbitrary kernel data structures. So it's a show stopper.
> > If you think there is a value in adding all these iterators for 'checkpoint'
> > part alone we can discuss and generalize individual patches.
> >
>
> Thanks for your review!
>
> I agree, BPF_CORE_WRITE will compromise the safety of ebpf programs,
> which may be a Pandora's box.
>
> But without BPF_CORE_WRITE, CRIB cannot achieve portable restoration,
> so the restoration part is put on hold for now.

I think then you should rethink how to do restoration in CRIB.
It is better to anticipate some solution, even if you don't implement
it right away.
It will be necessary for an end-to-end solution.

I think BPF_CORE_WRITE will be a non-starter.
The best that can be done is teaching the kernel specific fields which
are writable, but that doesn't scale.

The conventional method with CRIU was to replicate the same steps that
led to a certain state for a kernel object.
The way you propose is more akin to direct serialization/deserialization.

It's better to rely on restoration helpers if it's necessary.

>
> In the next version of the patch set, I will focus on implementing
> checkpointing (dumping) via CRIB for better dumping performance and more
> extensibility (which still has a lot of benefits).
>
> > High level feedback:
> >
> > - no need for BPF_PROG_TYPE_CRIB program type. Existing syscall type should fit.
> >
>
> - If we use BPF_PROG_TYPE_SYSCALL for CRIB programs, will it cause
> confusion in the functionality of bpf program types?
> (BPF_PROG_TYPE_SYSCALL was originally designed to execute syscalls)
>
> - Is it good to expose all kfuncs needed for checkpointing to
> BPF_PROG_TYPE_SYSCALL? (Maybe we need a separate ebpf program type to
> restrict the kfuncs that can be used)
>

I think it's become a more generic program type to invoke kfuncs from
task context.
E.g. the upcoming sched-ext uses it to invoke kfuncs to gracefully
exit a scheduler etc.
We didn't have a separate BPF_PROG_TYPE_SCX_TASK_CTX.

Also, you should not think about the kfuncs as being specific to
checkpointing or CRIB.
Try to keep them generic so they could be useful beyond use cases that
you have thought of right now.
Place reasonable constraints that are limited to enforcing safe use
from BPF programs.
Just as an illustration, all of the existing BPF infra/iterators you
used to implement CRIB were never designed with checkpointing in mind.

> - Maybe CRIB needs more specific ebpf program running restrictions?
> (for example, not allowed to modify the context, dumped data can only
> be returned via ringbuf, context is only used to identify the process
> that needs to dump and the part of the data that needs to be dumped)
>

Why would you want to do that?
Maybe someone needs a different way or comes up with a better way of
communicating with userspace?

There are different ways of doing checkpointing.
Some checkpointing systems propose ideas of keeping histories of a
process's state.
This allows possibly reverting process state to some point in the past
(as a way to rollback execution).
There, when you checkpoint stuff, you would store all state as part of
your BPF program in maps or arenas, and may not even send it to user
space.
Then it can be finally dumped when necessary to a user space agent, or
discarded.

> The above three points were my considerations when I originally added
> BPF_PROG_TYPE_CRIB, maybe we can have more discussion?
>

I think it feels unnecessary. So far I don't see a strong reason.
Adding a new program type means remembering everywhere in the verifier
how it needs to be handled.
Or whether it requires a special consideration for some check (esp.
since you want to restrict a lot of stuff).
It's better to decompose CRIB into individual useful parts that are
useful beyond checkpointing.
It's better to think in terms of providing useful basic building
blocks, how people use them shouldn't strongly be dictated by us.

> > - proposed file/socket iterators are somewhat unnecessary in this open coded form.
> >    there is already file/socket iterator. From the selftests it looks like it
> >    can be used to do 'checkpoint' part already.
> >
>
> If you mean iterators like iter/task_file, iter/tcp, etc., then I think
> that is not flexible enough for checkpointing.
>
> This is because the context of bpf iterators is fixed and bpf iterators
> cannot be nested. This means that a bpf iterator program can only
> complete a specific small iterative dump task, and cannot dump
> multi-level data.
>
> An example, when we need to dump all the sockets of a process, we need
> to iterate over all the files (sockets) of the process, and iterate over
> the all packets in the queue of each socket, and iterate over all data
> in each packet.
>
> If we use bpf iterator, since the iterator can not be nested, we need to
> use socket iterator program to get all the basic information of all
> sockets (pass pid as filter), and then use packet iterator program to
> get the basic information of all packets of a specific socket (pass pid,
> fd as filter), and then use packet data iterator program to get all the
> data of a specific packet (pass pid, fd, packet index as filter).
>
> This would be complicated and require a lot of (each iteration)
> bpf program startup and exit (leading to poor performance).
>
> By comparison, open coded iterator is much more flexible, we can iterate
> in any context, at any time, and iteration can be nested, so we can
> achieve more flexible and more elegant dumping through open coded
> iterators.
>
> With open coded iterators, all of the above can be done in a single
> bpf program, and with nested iterators, everything becomes compact
> and simple.

This does make sense to me, but I'd still try to limit what you
introduce to only what you need initially, for the next version, and
let's go from there.

>
> Also, bpf iterators transmit data to user space through seq_file,
> which involves a lot of open (bpf_iter_create), read, close syscalls,
> context switching, memory copying, and cannot achieve the performance
> of using ringbuf.
>
> The bpf iterator is more like an advanced procfs, but still not CRIB.
>
> > - KF_ITER_GETTER is a good addition, but we should be able to do it without these flags.
> >    kfunc-s should be able to accept iterator as an argument. Some __suffix annotation
> >    may be necessary to help verifier if BTF type alone of the argument won't be enough.
> >
>
> I agree, kfuncs can accept iterators as arguments and we can
> use __suffix.
>
> But here is a question, should we consider these kfuncs as iter kfuncs?
>
> That is, should we impose specific constraints on these functions?
> For example, specific naming patterns (bpf_iter_<type>_ prefix),
> GETTER methods cannot take extra arguments (like next methods), etc.
>
> Currently the verifier applies these constraints based on flags.
>
> > - KF_OBTAIN looks like a broken hammer to bypass safety. Like:
> >
> >    > Currently we cannot pass the pointer returned by the iterator next
> >    > method as argument to the KF_TRUSTED_ARGS kfuncs, because the pointer
> >    > returned by the iterator next method is not "valid".
> >
> >    It's true, but should be fixable directly. Make return pointer of iter_next() to be trusted.
> >
>
> I agree that KF_OBTAIN currently is not a good solution.
>
> For case 1, I tried the ref_obj_id method mentioned by Kumar and
> it worked, solving the ownership and lifetime problems. I will include
> it in the next version of the patch.
>
> For case 2, Kumar mentioned that it had been fixed by Matt, but I found
> there are still some problems.
>
> More details can be found in my reply to Kumar (in the same email thread)
>
> For iter_next(), I currently have an idea to add new flags to allow
> iter_next() to decide whether the return value is trusted or not,
> such as KF_RET_TRUSTED.
>
> What do you think?

Why shouldn't the return value always be trusted?
We eventually want to switch over to trusted by default everywhere.
It would be nice not to go further in the opposite direction (i.e.
having to manually annotate trusted) if we can avoid it.

>
> Also, for these improvements to the chain of trust, do you think I
> should send them out as separate patches? (rather than as part of
> the CRIB patch set)
>
> > - iterators for skb data don't feel right. bpf_dynptr_from_skb() should do the trick already.
> >
>
> I agree that using bpf_dynptr would be better, but probably would
> not change much...
>
> This is because, we cannot guarantee that the user provided a large
> enough buffer, the buffer provided by the user may not be able to hold
> all the data of the packet (but we need to dump the whole packet, the
> packet may be very large, which is different from the case of reading
> only a fixed size protocol header for filtering), which means we need to
> read the data in batches (iteratively), so we still need an iterator.
>
> (Back to the BPF_PROG_TYPE_CRIB discussion, BPF_PROG_TYPE_SYSCALL cannot
> use bpf_dynptr_from_skb, but should we expose bpf_dynptr_from_skb to
> BPF_PROG_TYPE_SYSCALL? Maybe we need a separate program type...)

It can be exposed, we'd just have to ensure syscall programs get
access to an skb that can be passed into the kfunc that is
well-formed.
Someone who wants to call that kfunc on an skb will instead use
prog_type_crib from user space, if we separate them. So it adds no
value.