Message ID | AM6PR03MB58488045E4D0FA6AEDC8BDE099A52@AM6PR03MB5848.eurprd03.prod.outlook.com (mailing list archive) |
---|---|
Headers | show |
Series | bpf: Checkpoint/Restore In eBPF (CRIB) | expand |
On Thu, Jul 11, 2024 at 12:10:17PM +0100, Juntong Deng wrote: > > In restore_udp_socket I had to add a struct bpf_crib_skb_info for > restoring packets, this is because there is currently no BPF_CORE_WRITE. > > I am not sure what the current attitude of the kernel community > towards BPF_CORE_WRITE is, personally I think it is well worth adding, > as we need a portable way to change the value in the kernel. > > This not only allows more complexity in the CRIB restoring part to > be transferred from CRIB kfuncs to CRIB ebpf programs, but also allows > ebpf to unlock more possible application scenarios. There are lots of interesting ideas in this patch set, but it seems they are doing the 'C-checkpoint' part of CRIx and something like BPF_CORE_WRITE is necessary for 'R-restore'. I'm afraid BPF_CORE_WRITE cannot be introduced without breaking all safety nets. It will make bpf just as unsafe as any kernel module if bpf progs can start writing into arbitrary kernel data structures. So it's a show stopper. If you think there is a value in adding all these iterators for 'checkpoint' part alone we can discuss and generalize individual patches. High level feedback: - no need for BPF_PROG_TYPE_CRIB program type. Existing syscall type should fit. - proposed file/socket iterators are somewhat unnecessary in this open coded form. there is already file/socket iterator. From the selftests it looks like it can be used to do 'checkpoint' part already. - KF_ITER_GETTER is a good addition, but we should be able to do it without these flags. kfunc-s should be able to accept iterator as an argument. Some __suffix annotation may be necessary to help verifier if BTF type alone of the argument won't be enough. - KF_OBTAIN looks like a broken hammer to bypass safety. Like: > Currently we cannot pass the pointer returned by the iterator next > method as argument to the KF_TRUSTED_ARGS kfuncs, because the pointer > returned by the iterator next method is not "valid". It's true, but should be fixable directly. Make return pointer of iter_next() to be trusted. - iterators for skb data don't feel right. bpf_dynptr_from_skb() should do the trick already. - start with a small patch set. 30 files changed, 3080 insertions(+), 12 deletions(-) isn't really reviewable.
On Tue, 23 Jul 2024 at 01:47, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Thu, Jul 11, 2024 at 12:10:17PM +0100, Juntong Deng wrote: > > > > In restore_udp_socket I had to add a struct bpf_crib_skb_info for > > restoring packets, this is because there is currently no BPF_CORE_WRITE. > > > > I am not sure what the current attitude of the kernel community > > towards BPF_CORE_WRITE is, personally I think it is well worth adding, > > as we need a portable way to change the value in the kernel. > > > > This not only allows more complexity in the CRIB restoring part to > > be transferred from CRIB kfuncs to CRIB ebpf programs, but also allows > > ebpf to unlock more possible application scenarios. > > There are lots of interesting ideas in this patch set, but it seems they are > doing the 'C-checkpoint' part of CRIx and something like BPF_CORE_WRITE > is necessary for 'R-restore'. > I'm afraid BPF_CORE_WRITE cannot be introduced without breaking all safety nets. > It will make bpf just as unsafe as any kernel module if bpf progs can start > writing into arbitrary kernel data structures. So it's a show stopper. > If you think there is a value in adding all these iterators for 'checkpoint' > part alone we can discuss and generalize individual patches. I think it would be better to focus on the particular problem Juntong wants to solve, and go from there. That might help in cutting down the size of the patch set. It seems the main problem was restoring UDP sockets, but it got lost among all the other stuff. It's better to begin the discussion from there, which can still be rooted in what you believe CRIB in general is useful for. Also, information is missing on what the previous attempts at solving this UDP problem were, and why they were insufficient such that BPF was necessary. What motivates the examples included as part of this set? I think this particular GSoC project is not new, so what were the limitations in previous attempts at restoring UDP sockets? Adding kfuncs makes it easier to checkpoint and restore state, but it also carries a maintenance cost. Using BPF to speed up task state dump is going to be beneficial, but is an orthogonal problem (and doesn't have to be CRIU specific, the primitives that CRIU requires can be generic and used by others as well). You're also skirting all kinds of compatibility concerns if you encode state to restore into structs, not getting into specifics, but if this pattern is followed, what happens on a kernel where say a particular field isn't available? It is a possibility that kfuncs may change their behavior due to kernel changes (not CRIB changes particularly), so how does user space respond to that? Sometimes, patches are backported, how does feature detection work? What happens when the struct used to restore is grown to accomodate more state to restore? Kfuncs will have to detect the size of the structure and work with multiple versions (like what nf_conntrack_bpf kfuncs try to do with opts__sz). I tried to add io_uring and epoll iterators for capturing state (https://lore.kernel.org/bpf/20211201042333.2035153-1-memxor@gmail.com) a couple of years back, although I didn't have time to pursue it further after GSoC. But I tried to minimize the restoration interfaces exposed precisely because the above is hard to ensure. The more kfuncs you expose to restore state, the deeper the hole becomes, since it's meant to be a relatively user-friendly interface for CRIU to use, and work across different kernel versions. Can the values passed through the struct to restore state be trusted? I'm not very well versed with the net/, but I think bpf_restore_skb_rcv_queue isn't doing much sanitization and taking in whatever was passed by the program. It would be helpful to explain why that is or is not ok. It's easier to review if we just focus on a particular problem. I think let's start with the UDP case, and then look at everything else later. > > High level feedback: > > - no need for BPF_PROG_TYPE_CRIB program type. Existing syscall type should fit. > +1 > - proposed file/socket iterators are somewhat unnecessary in this open coded form. > there is already file/socket iterator. From the selftests it looks like it > can be used to do 'checkpoint' part already. +1 > > - KF_ITER_GETTER is a good addition, but we should be able to do it without these flags. > kfunc-s should be able to accept iterator as an argument. Some __suffix annotation > may be necessary to help verifier if BTF type alone of the argument won't be enough. > > - KF_OBTAIN looks like a broken hammer to bypass safety. Like: > > > Currently we cannot pass the pointer returned by the iterator next > > method as argument to the KF_TRUSTED_ARGS kfuncs, because the pointer > > returned by the iterator next method is not "valid". I've replied to this particular patch to explain what exact unsafety it might introduce. I also think the 2nd use case might be fixed by a recent patch. [...]
On Mon, Jul 22, 2024 at 5:50 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > > > > > Currently we cannot pass the pointer returned by the iterator next > > > method as argument to the KF_TRUSTED_ARGS kfuncs, because the pointer > > > returned by the iterator next method is not "valid". > > I've replied to this particular patch to explain what exact unsafety > it might introduce. What do you mean? I think we can make the return value from iter_next() trusted in certain cases. For example bpf_iter_task_next() returns task_struct and it can be safely marked as MEM_RCU, since the whole iterator is KF_RCU_PROTECTED.
On Tue, 23 Jul 2024 at 02:58, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Mon, Jul 22, 2024 at 5:50 PM Kumar Kartikeya Dwivedi > <memxor@gmail.com> wrote: > > > > > > > > > Currently we cannot pass the pointer returned by the iterator next > > > > method as argument to the KF_TRUSTED_ARGS kfuncs, because the pointer > > > > returned by the iterator next method is not "valid". > > > > I've replied to this particular patch to explain what exact unsafety > > it might introduce. > > What do you mean? > I think we can make the return value from iter_next() trusted in > certain cases. > For example bpf_iter_task_next() returns task_struct and it > can be safely marked as MEM_RCU, since the whole iterator is > KF_RCU_PROTECTED. Yes, iter_next is ok (see reply here: https://lore.kernel.org/bpf/CAP01T74pq7pozpMi_LJUA8wehjpATMR3oM4vj7HHxohBPb0LbA@mail.gmail.com/). But number 1 doesn't seem ok. Number 2 should now be possible if I'm not mistaken.
On 2024/7/23 00:47, Alexei Starovoitov wrote: > On Thu, Jul 11, 2024 at 12:10:17PM +0100, Juntong Deng wrote: >> >> In restore_udp_socket I had to add a struct bpf_crib_skb_info for >> restoring packets, this is because there is currently no BPF_CORE_WRITE. >> >> I am not sure what the current attitude of the kernel community >> towards BPF_CORE_WRITE is, personally I think it is well worth adding, >> as we need a portable way to change the value in the kernel. >> >> This not only allows more complexity in the CRIB restoring part to >> be transferred from CRIB kfuncs to CRIB ebpf programs, but also allows >> ebpf to unlock more possible application scenarios. > > There are lots of interesting ideas in this patch set, but it seems they are > doing the 'C-checkpoint' part of CRIx and something like BPF_CORE_WRITE > is necessary for 'R-restore'. > I'm afraid BPF_CORE_WRITE cannot be introduced without breaking all safety nets. > It will make bpf just as unsafe as any kernel module if bpf progs can start > writing into arbitrary kernel data structures. So it's a show stopper. > If you think there is a value in adding all these iterators for 'checkpoint' > part alone we can discuss and generalize individual patches. > Thanks for your review! I agree, BPF_CORE_WRITE will compromise the safety of ebpf programs, which may be a Pandora's box. But without BPF_CORE_WRITE, CRIB cannot achieve portable restoration, so the restoration part is put on hold for now. In the next version of the patch set, I will focus on implementing checkpointing (dumping) via CRIB for better dumping performance and more extensibility (which still has a lot of benefits). > High level feedback: > > - no need for BPF_PROG_TYPE_CRIB program type. Existing syscall type should fit. > - If we use BPF_PROG_TYPE_SYSCALL for CRIB programs, will it cause confusion in the functionality of bpf program types? (BPF_PROG_TYPE_SYSCALL was originally designed to execute syscalls) - Is it good to expose all kfuncs needed for checkpointing to BPF_PROG_TYPE_SYSCALL? (Maybe we need a separate ebpf program type to restrict the kfuncs that can be used) - Maybe CRIB needs more specific ebpf program running restrictions? (for example, not allowed to modify the context, dumped data can only be returned via ringbuf, context is only used to identify the process that needs to dump and the part of the data that needs to be dumped) The above three points were my considerations when I originally added BPF_PROG_TYPE_CRIB, maybe we can have more discussion? > - proposed file/socket iterators are somewhat unnecessary in this open coded form. > there is already file/socket iterator. From the selftests it looks like it > can be used to do 'checkpoint' part already. > If you mean iterators like iter/task_file, iter/tcp, etc., then I think that is not flexible enough for checkpointing. This is because the context of bpf iterators is fixed and bpf iterators cannot be nested. This means that a bpf iterator program can only complete a specific small iterative dump task, and cannot dump multi-level data. An example, when we need to dump all the sockets of a process, we need to iterate over all the files (sockets) of the process, and iterate over the all packets in the queue of each socket, and iterate over all data in each packet. If we use bpf iterator, since the iterator can not be nested, we need to use socket iterator program to get all the basic information of all sockets (pass pid as filter), and then use packet iterator program to get the basic information of all packets of a specific socket (pass pid, fd as filter), and then use packet data iterator program to get all the data of a specific packet (pass pid, fd, packet index as filter). This would be complicated and require a lot of (each iteration) bpf program startup and exit (leading to poor performance). By comparison, open coded iterator is much more flexible, we can iterate in any context, at any time, and iteration can be nested, so we can achieve more flexible and more elegant dumping through open coded iterators. With open coded iterators, all of the above can be done in a single bpf program, and with nested iterators, everything becomes compact and simple. Also, bpf iterators transmit data to user space through seq_file, which involves a lot of open (bpf_iter_create), read, close syscalls, context switching, memory copying, and cannot achieve the performance of using ringbuf. The bpf iterator is more like an advanced procfs, but still not CRIB. > - KF_ITER_GETTER is a good addition, but we should be able to do it without these flags. > kfunc-s should be able to accept iterator as an argument. Some __suffix annotation > may be necessary to help verifier if BTF type alone of the argument won't be enough. > I agree, kfuncs can accept iterators as arguments and we can use __suffix. But here is a question, should we consider these kfuncs as iter kfuncs? That is, should we impose specific constraints on these functions? For example, specific naming patterns (bpf_iter_<type>_ prefix), GETTER methods cannot take extra arguments (like next methods), etc. Currently the verifier applies these constraints based on flags. > - KF_OBTAIN looks like a broken hammer to bypass safety. Like: > > > Currently we cannot pass the pointer returned by the iterator next > > method as argument to the KF_TRUSTED_ARGS kfuncs, because the pointer > > returned by the iterator next method is not "valid". > > It's true, but should be fixable directly. Make return pointer of iter_next() to be trusted. > I agree that KF_OBTAIN currently is not a good solution. For case 1, I tried the ref_obj_id method mentioned by Kumar and it worked, solving the ownership and lifetime problems. I will include it in the next version of the patch. For case 2, Kumar mentioned that it had been fixed by Matt, but I found there are still some problems. More details can be found in my reply to Kumar (in the same email thread) For iter_next(), I currently have an idea to add new flags to allow iter_next() to decide whether the return value is trusted or not, such as KF_RET_TRUSTED. What do you think? Also, for these improvements to the chain of trust, do you think I should send them out as separate patches? (rather than as part of the CRIB patch set) > - iterators for skb data don't feel right. bpf_dynptr_from_skb() should do the trick already. > I agree that using bpf_dynptr would be better, but probably would not change much... This is because, we cannot guarantee that the user provided a large enough buffer, the buffer provided by the user may not be able to hold all the data of the packet (but we need to dump the whole packet, the packet may be very large, which is different from the case of reading only a fixed size protocol header for filtering), which means we need to read the data in batches (iteratively), so we still need an iterator. (Back to the BPF_PROG_TYPE_CRIB discussion, BPF_PROG_TYPE_SYSCALL cannot use bpf_dynptr_from_skb, but should we expose bpf_dynptr_from_skb to BPF_PROG_TYPE_SYSCALL? Maybe we need a separate program type...) > - start with a small patch set. > 30 files changed, 3080 insertions(+), 12 deletions(-) > isn't really reviewable. Sorry, I will reduce the size of the patch set in the next version. I will remove the proof of concept part, keep only the real tests, and start trying to integrate CRIB with CRIU.
On 7/23/24 01:49, Kumar Kartikeya Dwivedi wrote: > On Tue, 23 Jul 2024 at 01:47, Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: >> >> On Thu, Jul 11, 2024 at 12:10:17PM +0100, Juntong Deng wrote: >>> >>> In restore_udp_socket I had to add a struct bpf_crib_skb_info for >>> restoring packets, this is because there is currently no BPF_CORE_WRITE. >>> >>> I am not sure what the current attitude of the kernel community >>> towards BPF_CORE_WRITE is, personally I think it is well worth adding, >>> as we need a portable way to change the value in the kernel. >>> >>> This not only allows more complexity in the CRIB restoring part to >>> be transferred from CRIB kfuncs to CRIB ebpf programs, but also allows >>> ebpf to unlock more possible application scenarios. >> >> There are lots of interesting ideas in this patch set, but it seems they are >> doing the 'C-checkpoint' part of CRIx and something like BPF_CORE_WRITE >> is necessary for 'R-restore'. >> I'm afraid BPF_CORE_WRITE cannot be introduced without breaking all safety nets. >> It will make bpf just as unsafe as any kernel module if bpf progs can start >> writing into arbitrary kernel data structures. So it's a show stopper. >> If you think there is a value in adding all these iterators for 'checkpoint' >> part alone we can discuss and generalize individual patches. > > I think it would be better to focus on the particular problem Juntong > wants to solve, and go from there. > That might help in cutting down the size of the patch set. > It seems the main problem was restoring UDP sockets, but it got lost > among all the other stuff. > It's better to begin the discussion from there, which can still be > rooted in what you believe CRIB in general is useful for. > > Also, information is missing on what the previous attempts at solving > this UDP problem were, and why they were insufficient such that BPF > was necessary. > What motivates the examples included as part of this set? > I think this particular GSoC project is not new, so what were the > limitations in previous attempts at restoring UDP sockets? Yes, this idea originated from the GSoC task of dumping CORK-ed UDP socket. While I was solving this task I realized that ebpf has a much greater potential to completely change the way we checkpoint/restore processes, and can achieve better performance and more extensibility, and that is CRIB. For now, restoring CORK-ed UDP sockets is just one of the problems that CRIB can solve, and it is not the main problem (that is, CRIB is not designed around solving UDP problem). (The difficulty with restoring CORK-ed UDP is that we do not have a simple and elegant way to read back UDP packets in the write queue before, but this is a simple task in CRIB.) This is why I did not mention the GSoC task and the previous attempts to solve the UDP problem in the patch set, because it is not the same problem, and the previous solution to the UDP problem has nothing to do with ebpf (CRIB). But if adding this information would be useful, I can add it in the next version of the patch set. > Adding kfuncs makes it easier to checkpoint and restore state, but it > also carries a maintenance cost. > > Using BPF to speed up task state dump is going to be beneficial, but > is an orthogonal problem (and doesn't have to be CRIU specific, the > primitives that CRIU requires can be generic and used by others as > well). > > You're also skirting all kinds of compatibility concerns if you encode > state to restore into structs, not getting into specifics, but if this > pattern is followed, what happens on a kernel where say a particular > field isn't available? It is a possibility that kfuncs may change > their behavior due to kernel changes (not CRIB changes particularly), > so how does user space respond to that? Sometimes, patches are > backported, how does feature detection work? > > What happens when the struct used to restore is grown to accomodate > more state to restore? Kfuncs will have to detect the size of the > structure and work with multiple versions (like what nf_conntrack_bpf > kfuncs try to do with opts__sz). > You are right, so CRIB needs BPF_CORE_READ and BPF_CORE_WRITE because we need a portable way to read/write kernel structure values, and achieving portability only through kfuncs would be a complex tough problem. But since BPF_CORE_WRITE cannot be introduced, we put the restoration part on hold and focus on dumping part first, which we can achieve portability with BPF_CORE_READ. > I tried to add io_uring and epoll iterators for capturing state > (https://lore.kernel.org/bpf/20211201042333.2035153-1-memxor@gmail.com) > a couple of years back, although I didn't have time to pursue it > further after GSoC. But I tried to minimize the restoration interfaces > exposed precisely because the above is hard to ensure. The more kfuncs > you expose to restore state, the deeper the hole becomes, since it's > meant to be a relatively user-friendly interface for CRIU to use, and > work across different kernel versions. > This was a great attempt! I have always believed that checkpoint/restore via ebpf has great potential (that is why I created CRIB). If CRIB is successfully merged to the mainline, maybe we can retry dumping io_uring and epoll via ebpf. > Can the values passed through the struct to restore state be trusted? > I'm not very well versed with the net/, but I think > bpf_restore_skb_rcv_queue isn't doing much sanitization and taking in > whatever was passed by the program. It would be helpful to explain why > that is or is not ok. > Of course it cannot be trusted, but since this is an RFC patch set, I did not put too much effort into security checking (sanitization), I mainly wanted to show the idea (proof of concept) and get feedback. I will put more effort into security in subsequent versions of the patch set. > It's easier to review if we just focus on a particular problem. I > think let's start with the UDP case, and then look at everything else > later. > Yes, it is always good to start with a particular problem. I will focus next on solving the socket dump problem via CRIB and try to integrate it into the CRIU project (in a personal branch). If the above patch set is not too large, maybe I can also try to solve one or two problems via CRIB that cannot be well dumped via procfs in CRIU (poor performance or incomplete information). Anyway, I will keep the next version of the patch set small and easy to review. >> >> High level feedback: >> >> - no need for BPF_PROG_TYPE_CRIB program type. Existing syscall type should fit. >> > > +1 > >> - proposed file/socket iterators are somewhat unnecessary in this open coded form. >> there is already file/socket iterator. From the selftests it looks like it >> can be used to do 'checkpoint' part already. > > +1 > >> >> - KF_ITER_GETTER is a good addition, but we should be able to do it without these flags. >> kfunc-s should be able to accept iterator as an argument. Some __suffix annotation >> may be necessary to help verifier if BTF type alone of the argument won't be enough. >> >> - KF_OBTAIN looks like a broken hammer to bypass safety. Like: >> >> > Currently we cannot pass the pointer returned by the iterator next >> > method as argument to the KF_TRUSTED_ARGS kfuncs, because the pointer >> > returned by the iterator next method is not "valid". > > I've replied to this particular patch to explain what exact unsafety > it might introduce. > I also think the 2nd use case might be fixed by a recent patch. > > [...]
On Wed, 31 Jul 2024 at 15:10, Juntong Deng <juntong.deng@outlook.com> wrote: > > On 2024/7/23 00:47, Alexei Starovoitov wrote: > > On Thu, Jul 11, 2024 at 12:10:17PM +0100, Juntong Deng wrote: > >> > >> In restore_udp_socket I had to add a struct bpf_crib_skb_info for > >> restoring packets, this is because there is currently no BPF_CORE_WRITE. > >> > >> I am not sure what the current attitude of the kernel community > >> towards BPF_CORE_WRITE is, personally I think it is well worth adding, > >> as we need a portable way to change the value in the kernel. > >> > >> This not only allows more complexity in the CRIB restoring part to > >> be transferred from CRIB kfuncs to CRIB ebpf programs, but also allows > >> ebpf to unlock more possible application scenarios. > > > > There are lots of interesting ideas in this patch set, but it seems they are > > doing the 'C-checkpoint' part of CRIx and something like BPF_CORE_WRITE > > is necessary for 'R-restore'. > > I'm afraid BPF_CORE_WRITE cannot be introduced without breaking all safety nets. > > It will make bpf just as unsafe as any kernel module if bpf progs can start > > writing into arbitrary kernel data structures. So it's a show stopper. > > If you think there is a value in adding all these iterators for 'checkpoint' > > part alone we can discuss and generalize individual patches. > > > > Thanks for your review! > > I agree, BPF_CORE_WRITE will compromise the safety of ebpf programs, > which may be a Pandora's box. > > But without BPF_CORE_WRITE, CRIB cannot achieve portable restoration, > so the restoration part is put on hold for now. I think then you should rethink how to do restoration in CRIB. It is better to anticipate some solution, even if you don't implement it right away. It will be necessary for an end-to-end solution. I think BPF_CORE_WRITE will be a non-starter. The best that can be done is teaching the kernel specific fields which are writable, but that doesn't scale. The conventional method with CRIU was to replicate the same steps that led to a certain state for a kernel object. The way you propose is more akin to direct serialization/deserialization. It's better to rely on restoration helpers if it's necessary. > > In the next version of the patch set, I will focus on implementing > checkpointing (dumping) via CRIB for better dumping performance and more > extensibility (which still has a lot of benefits). > > > High level feedback: > > > > - no need for BPF_PROG_TYPE_CRIB program type. Existing syscall type should fit. > > > > - If we use BPF_PROG_TYPE_SYSCALL for CRIB programs, will it cause > confusion in the functionality of bpf program types? > (BPF_PROG_TYPE_SYSCALL was originally designed to execute syscalls) > > - Is it good to expose all kfuncs needed for checkpointing to > BPF_PROG_TYPE_SYSCALL? (Maybe we need a separate ebpf program type to > restrict the kfuncs that can be used) > I think it's become a more generic program type to invoke kfuncs from task context. E.g. the upcoming sched-ext uses it to invoke kfuncs to gracefully exit a scheduler etc. We didn't have a separate BPF_PROG_TYPE_SCX_TASK_CTX. Also, you should not think about the kfuncs as being specific to checkpointing or CRIB. Try to keep them generic so they could be useful beyond use cases that you have thought of right now. Place reasonable constraints that are limited to enforcing safe use from BPF programs. Just as an illustration, all of the existing BPF infra/iterators you used to implement CRIB were never designed with checkpointing in mind. > - Maybe CRIB needs more specific ebpf program running restrictions? > (for example, not allowed to modify the context, dumped data can only > be returned via ringbuf, context is only used to identify the process > that needs to dump and the part of the data that needs to be dumped) > Why would you want to do that? Maybe someone needs a different way or comes up with a better way of communicating with userspace? There are different ways of doing checkpointing. Some checkpointing systems propose ideas of keeping histories of a process's state. This allows possibly reverting process state to some point in the past (as a way to rollback execution). There, when you checkpoint stuff, you would store all state as part of your BPF program in maps or arenas, and may not even send it to user space. Then it can be finally dumped when necessary to a user space agent, or discarded. > The above three points were my considerations when I originally added > BPF_PROG_TYPE_CRIB, maybe we can have more discussion? > I think it feels unnecessary. So far I don't see a strong reason. Adding a new program type means remembering everywhere in the verifier how it needs to be handled. Or whether it requires a special consideration for some check (esp. since you want to restrict a lot of stuff). It's better to decompose CRIB into individual useful parts that are useful beyond checkpointing. It's better to think in terms of providing useful basic building blocks, how people use them shouldn't strongly be dictated by us. > > - proposed file/socket iterators are somewhat unnecessary in this open coded form. > > there is already file/socket iterator. From the selftests it looks like it > > can be used to do 'checkpoint' part already. > > > > If you mean iterators like iter/task_file, iter/tcp, etc., then I think > that is not flexible enough for checkpointing. > > This is because the context of bpf iterators is fixed and bpf iterators > cannot be nested. This means that a bpf iterator program can only > complete a specific small iterative dump task, and cannot dump > multi-level data. > > An example, when we need to dump all the sockets of a process, we need > to iterate over all the files (sockets) of the process, and iterate over > the all packets in the queue of each socket, and iterate over all data > in each packet. > > If we use bpf iterator, since the iterator can not be nested, we need to > use socket iterator program to get all the basic information of all > sockets (pass pid as filter), and then use packet iterator program to > get the basic information of all packets of a specific socket (pass pid, > fd as filter), and then use packet data iterator program to get all the > data of a specific packet (pass pid, fd, packet index as filter). > > This would be complicated and require a lot of (each iteration) > bpf program startup and exit (leading to poor performance). > > By comparison, open coded iterator is much more flexible, we can iterate > in any context, at any time, and iteration can be nested, so we can > achieve more flexible and more elegant dumping through open coded > iterators. > > With open coded iterators, all of the above can be done in a single > bpf program, and with nested iterators, everything becomes compact > and simple. This does make sense to me, but I'd still try to limit what you introduce to only what you need initially, for the next version, and let's go from there. > > Also, bpf iterators transmit data to user space through seq_file, > which involves a lot of open (bpf_iter_create), read, close syscalls, > context switching, memory copying, and cannot achieve the performance > of using ringbuf. > > The bpf iterator is more like an advanced procfs, but still not CRIB. > > > - KF_ITER_GETTER is a good addition, but we should be able to do it without these flags. > > kfunc-s should be able to accept iterator as an argument. Some __suffix annotation > > may be necessary to help verifier if BTF type alone of the argument won't be enough. > > > > I agree, kfuncs can accept iterators as arguments and we can > use __suffix. > > But here is a question, should we consider these kfuncs as iter kfuncs? > > That is, should we impose specific constraints on these functions? > For example, specific naming patterns (bpf_iter_<type>_ prefix), > GETTER methods cannot take extra arguments (like next methods), etc. > > Currently the verifier applies these constraints based on flags. > > > - KF_OBTAIN looks like a broken hammer to bypass safety. Like: > > > > > Currently we cannot pass the pointer returned by the iterator next > > > method as argument to the KF_TRUSTED_ARGS kfuncs, because the pointer > > > returned by the iterator next method is not "valid". > > > > It's true, but should be fixable directly. Make return pointer of iter_next() to be trusted. > > > > I agree that KF_OBTAIN currently is not a good solution. > > For case 1, I tried the ref_obj_id method mentioned by Kumar and > it worked, solving the ownership and lifetime problems. I will include > it in the next version of the patch. > > For case 2, Kumar mentioned that it had been fixed by Matt, but I found > there are still some problems. > > More details can be found in my reply to Kumar (in the same email thread) > > For iter_next(), I currently have an idea to add new flags to allow > iter_next() to decide whether the return value is trusted or not, > such as KF_RET_TRUSTED. > > What do you think? Why shouldn't the return value always be trusted? We eventually want to switch over to trusted by default everywhere. It would be nice not to go further in the opposite direction (i.e. having to manually annotate trusted) if we can avoid it. > > Also, for these improvements to the chain of trust, do you think I > should send them out as separate patches? (rather than as part of > the CRIB patch set) > > > - iterators for skb data don't feel right. bpf_dynptr_from_skb() should do the trick already. > > > > I agree that using bpf_dynptr would be better, but probably would > not change much... > > This is because, we cannot guarantee that the user provided a large > enough buffer, the buffer provided by the user may not be able to hold > all the data of the packet (but we need to dump the whole packet, the > packet may be very large, which is different from the case of reading > only a fixed size protocol header for filtering), which means we need to > read the data in batches (iteratively), so we still need an iterator. > > (Back to the BPF_PROG_TYPE_CRIB discussion, BPF_PROG_TYPE_SYSCALL cannot > use bpf_dynptr_from_skb, but should we expose bpf_dynptr_from_skb to > BPF_PROG_TYPE_SYSCALL? Maybe we need a separate program type...) It can be exposed, we'd just have to ensure syscall programs get access to an skb that can be passed into the kfunc that is well-formed. Someone who wants to call that kfunc on an skb will instead use prog_type_crib from user space, if we separate them. So it adds no value.
Overview -------- This patch series adds a new bpf program type CRIB (Checkpoint/Restore In eBPF) for better checkpoint/restore of processes. CRIB provides a new way to dump/restore process information for better performance, more flexibility, more extensibility (easier support for dumping/restoring more information), and more elegant implementation. Motivation ---------- The original goal of the CRIU (Checkpoint/Restore In Userspace) project was to implement most of the checkpoint/restore functionality in userspace [0], avoiding placing most of the implementation in the kernel. The CRIU project achieves this goal and is currently widely used for live migration in the cloud and works well in most scenarios. However, the current technology that CRIU relies on is not optimal and has some problems. [0]: https://lwn.net/Articles/451916/ 1. CRIU relies heavily on procfs to get process information (checkpoint) Procfs is not really a good place to use for checkpointing processes (same for sysfs). - Lots of system calls, lots of context switches (each file needs to open, read, close) - Variety of formats (each file format is different and parsers need to be implemented for each format) - Fixed return information (if the information needed is not currently supported by procfs, even if it is just a struct member, the upstream kernel code still needs to be modified to add it) - Non-extensible formats (the format of some files in the procfs cannot be extended without breaking backward compatibility) - Lots of extra information, slow to read (not all information in some files is useful for checkpoint, and text parsing is inefficient) More detailed summary of why procfs is not suitable for checkpointing can be found in [1]. [1]: https://criu.org/Task-diag Andrey has tried to replace insufficient procfs by using netlink (task_diag) [2], but it was not accepted by upstream for reasons [3][4][5][6]: - netlink is unable to elegantly obtain the pidns and userns of processes - Since the namespace issue cannot be resolved elegantly, obtaining process information via netlink can lead to credential security issues. [2]: https://lwn.net/Articles/650243/ [3]: https://lore.kernel.org/linux-kernel//CALCETrVg5AyeXW_AGguFoGCPK9_2zeobEgT9JJFsakH6PyQf_A@mail.gmail.com/ [4]: https://lore.kernel.org/linux-kernel//CALCETrVSRkMSAVPz9JW4XCV7DmrgkyGK54HRUrue2R756f5C=Q@mail.gmail.com/ [5]: https://lore.kernel.org/linux-kernel//CALCETrW4LU3M2OAWjnckFR-rqenBjV+ROBi8B3eOo=Y_mCWfGQ@mail.gmail.com/ [6]: https://lore.kernel.org/linux-kernel//CALCETrUzOBybH0-rcgvzMNazjadZpuxkBZLkoUDY30X_-cqBzg@mail.gmail.com/ 2. Some process status information is difficult to dump/restore through normal interfaces One example is checkpoint/restore for TCP sockets, where we are unable to get the underlying protocol information for TCP sockets through procfs (or sysfs), or through the normal socket API. Here we need to add TCP repair mode [7][8], which works but is not an elegant approach. In TCP repair mode, we need to change (hijack) the behaviour of the system calls, including recvmsg and sendmsg, used to dump/restore packets in the socket write/receive queue. In TCP repair mode, additional getsockopt/setsockopt optnames need to be introduced to dump/restore the underlying TCP socket information such as sequence number, send window, receive window, max window. [7]: https://lwn.net/Articles/495304/ [8]: https://criu.org/TCP_connection The above approach to extending system calls may be feasible, but not good practice: - The structure of the data returned by each system call API is roughly fixed at the moment it is added. If we need to add new members, then we may need data structures V1 and V2. If we want to remove members we no longer need, it would be painful because we need to maintain backward compatibility. More often we need new extensions to system calls, such as the new getsockopt optnames. - We need case-by-case extensions to system calls. As more and more features are added to the kernel (e.g. io uring, bpf), checkpointing/restoring these features via the normal API will become more and more difficult (or even impossible). We have had to continue to add (extend) lots of single-purpose (perhaps only for checkpoint/restore) interfaces for various kernel features , more xxx repair modes, ioctl commands, getxxxopt/setxxxopt optnames. Obviously, these interfaces are not elegant and may even be considered cumbersome. CRIB introduction ----------------- CRIB is a new bpf program type that is not attached to any hooks (similar to BPF_PROG_TYPE_SYSCALL), runs through BPF_PROG_RUN, and is called by userspace programs as eBPF API for dumping/restoring process information. The entire CRIB consists of three parts, CRIB kfuncs, CRIB ebpf programs, and CRIB user space program. - CRIB kfuncs provides low-level APIs. Each kfuncs low-level API is only responsible for one small task, such as getting a specific file object based on the file descriptor of a process. - CRIB ebpf program provides high-level APIs. Each CRIB ebpf program obtains process information in the kernel by calling the CRIB kfuncs API and returns the data to the userspace program through ringbuf. Each CRIB ebpf API is responsible for some relatively complex tasks, such as getting all the socket information of a process. - The CRIB userspace program is responsible for loading the CRIB ebpf program and calling the CRIB ebpf API, deciding what needs to be dumped and what needs to be restored, and saving the dumped information so that it can be read during restoration. With the above CRIB design, the CRIB kfunc API in the kernel can be kept simple enough that it does not require much modification even in the future. Each kfuncs can be easily kept reliable without a lot of complicated code. Complex ebpf programs and userspace programs are maintained outside the kernel, and CRIB ebpf programs are maintained with CRIB userspace programs. My current positioning of CRIB is that CRIU as CRIB userspace program and CRIB ebpf program can be used as a new engine for CRIU, a new and better way to dump/restore processes which has higher performance and can dump/restore more information. Why CRIB is better? ------------------- 1. More elegant way to get process information If xxx repair mode, ioctl, getxxxopt, setxxxopt are like using gastroscope, colonoscope, nasal endoscope, and we need to keep looking for (add) more "holes" in the kernel for physical examination (dump/restore information), then using CRIB is like putting an intelligent micro physical examination robot (ebpf) into the kernel and letting it work inside the kernel to collect all the information and return. We no longer need to open more inelegant "holes" in the kernel, and we no longer need to add more interfaces that are only used for checkpoint/restore. 2. More flexible and extensible CRIB ebpf programs are maintained with CRIB userspace programs, which means that CRIB ebpf programs do not need to provide stable APIs, do not need stable structures, and can continue to change flexibly with the needs of CRIB userspace programs. Most of the information in kernel data structures can be obtained through BPF_CORE_READ, so there is no need to add trivial CRIB kfuncs, and the trivial code for obtaining the structure members can be kept outside the kernel in the CRIB ebpf program. This means that this part of the code can be added or removed flexibly. CRIB kfuncs focuses on implementing dump/restore that cannot be done by simple data structure operations. 3. Higher performance - Since CRIB is very flexible (CRIB ebpf programs are changeable), we can dump/restore just enough information and no additional information is needed. - CRIB ebpf programs can return binary data (not text) via ringbuf, which means no additional conversion or parsing is required. - With BPF ringbuf, we avoid lots of system calls, lots of context switches, and lots of memory copying (between kernel space and user space). 4. Better support for namespaces and credentials Since CRIB ebpf programs can access the task_struct of a process, it is simple for CRIB ebpf programs to know the current namespace (e.g., pidns, userns) and credentials of a process, and there is no situation where CRIB cannot know that a process has dropped privileges. The problems in the netlink method mentioned earlier do not exist in CRIB. Proof of Concept ---------------- I have currently added three selftest programs to demonstrate the functionality of CRIB. - dump_task shows the performance comparison between CRIB and procfs. CRIB takes only 20-30% of the time of the procfs to obtain the same process information. - dump_all_socket shows that CRIB does not need to rely on procfs to get all the socket information of a process, and can get the underlying protocol information (e.g., sequence number, send window) of TCP sockets without using getsockopt. - restore_udp_socket shows that CRIB can dump/restore packets from the write queue and receive queue of UDP sockets without adding additional system call interfaces and without UDP repair mode. Shortcoming? ------------ Yes, obviously, loading the ebpf programs takes time. However, in most scenarios, CRIU runs as a service and is integrated into other software (via RPC or C API) such as OpenVZ , docker, k8s, rather than as a standalone tool. This means that in most scenarios CRIU will handle multiple checkpoints/restores, but in this case CRIB ebpf programs only need to be loaded once, and can be subsequently used like normal APIs. Overall, it is worth it. More? ----- In restore_udp_socket I had to add a struct bpf_crib_skb_info for restoring packets, this is because there is currently no BPF_CORE_WRITE. I am not sure what the current attitude of the kernel community towards BPF_CORE_WRITE is, personally I think it is well worth adding, as we need a portable way to change the value in the kernel. This not only allows more complexity in the CRIB restoring part to be transferred from CRIB kfuncs to CRIB ebpf programs, but also allows ebpf to unlock more possible application scenarios. At the end ---------- This patch series is not the final patch series, this is still a proof of concept, incomplete in functionality and probably buggy, but I think it is enough to show the power of CRIB, which is a meaningful innovation. (I know I did not pay attention to the coding style of the test cases in selftest, as these are only for proof of concept, not real testing) This is not only a new checkpoint/restore method, but also allows us to think about what more eBPF might be able to do, and what more we can unlock with eBPF. I would like to get some feedback, welcome to discuss! (This resend is used to fix mail thread that was messed up by outlook.) Signed-off-by: Juntong Deng <juntong.deng@outlook.com> Juntong Deng (16): bpf: Introduce BPF_PROG_TYPE_CRIB bpf: Add KF_ITER_GETTER and KF_ITER_SETTER flags bpf: Improve bpf kfuncs pointer arguments chain of trust bpf: Add bpf_task_from_vpid() kfunc bpf/crib: Add struct file related CRIB kfuncs bpf/crib: Introduce task_file open-coded iterator kfuncs bpf/crib: Add struct sock related CRIB kfuncs bpf/crib: Add CRIB kfuncs for getting pointer to often-used socket-related structures bpf/crib: Add CRIB kfuncs for getting socket source/destination addresses bpf/crib: Add struct sk_buff related CRIB kfuncs bpf/crib: Introduce skb open-coded iterator kfuncs bpf/crib: Introduce skb_data open-coded iterator kfuncs bpf/crib: Add CRIB kfuncs for restoring data in skb selftests/crib: Add test for getting basic information of the process selftests/crib: Add test for getting all socket information of the process selftests/crib: Add test for dumping/restoring UDP socket packets include/linux/bpf_crib.h | 62 +++ include/linux/bpf_types.h | 4 + include/linux/btf.h | 5 +- include/uapi/linux/bpf.h | 1 + kernel/bpf/Kconfig | 2 + kernel/bpf/Makefile | 2 + kernel/bpf/btf.c | 34 +- kernel/bpf/crib/Kconfig | 14 + kernel/bpf/crib/Makefile | 3 + kernel/bpf/crib/bpf_checkpoint.c | 360 ++++++++++++++++ kernel/bpf/crib/bpf_crib.c | 397 ++++++++++++++++++ kernel/bpf/crib/bpf_restore.c | 80 ++++ kernel/bpf/helpers.c | 21 + kernel/bpf/syscall.c | 1 + kernel/bpf/verifier.c | 15 +- tools/include/uapi/linux/bpf.h | 1 + tools/lib/bpf/libbpf.c | 2 + tools/lib/bpf/libbpf_probes.c | 1 + tools/testing/selftests/crib/.gitignore | 1 + tools/testing/selftests/crib/Makefile | 136 ++++++ tools/testing/selftests/crib/config | 7 + .../selftests/crib/test_dump_all_socket.bpf.c | 252 +++++++++++ .../selftests/crib/test_dump_all_socket.c | 375 +++++++++++++++++ .../selftests/crib/test_dump_all_socket.h | 69 +++ .../selftests/crib/test_dump_task.bpf.c | 125 ++++++ tools/testing/selftests/crib/test_dump_task.c | 337 +++++++++++++++ tools/testing/selftests/crib/test_dump_task.h | 90 ++++ .../crib/test_restore_udp_socket.bpf.c | 311 ++++++++++++++ .../selftests/crib/test_restore_udp_socket.c | 333 +++++++++++++++ .../selftests/crib/test_restore_udp_socket.h | 51 +++ 30 files changed, 3080 insertions(+), 12 deletions(-) create mode 100644 include/linux/bpf_crib.h create mode 100644 kernel/bpf/crib/Kconfig create mode 100644 kernel/bpf/crib/Makefile create mode 100644 kernel/bpf/crib/bpf_checkpoint.c create mode 100644 kernel/bpf/crib/bpf_crib.c create mode 100644 kernel/bpf/crib/bpf_restore.c create mode 100644 tools/testing/selftests/crib/.gitignore create mode 100644 tools/testing/selftests/crib/Makefile create mode 100644 tools/testing/selftests/crib/config create mode 100644 tools/testing/selftests/crib/test_dump_all_socket.bpf.c create mode 100644 tools/testing/selftests/crib/test_dump_all_socket.c create mode 100644 tools/testing/selftests/crib/test_dump_all_socket.h create mode 100644 tools/testing/selftests/crib/test_dump_task.bpf.c create mode 100644 tools/testing/selftests/crib/test_dump_task.c create mode 100644 tools/testing/selftests/crib/test_dump_task.h create mode 100644 tools/testing/selftests/crib/test_restore_udp_socket.bpf.c create mode 100644 tools/testing/selftests/crib/test_restore_udp_socket.c create mode 100644 tools/testing/selftests/crib/test_restore_udp_socket.h