Message ID | 20231015141644.260646-2-akihiko.odaki@daynix.com (mailing list archive) |
---|---|
State | RFC |
Delegated to: | BPF |
Headers | show |
Series | tun: Introduce virtio-net hashing feature | expand |
On Sun, Oct 15, 2023 at 7:17 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 0448700890f7..298634556fab 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -988,6 +988,7 @@ enum bpf_prog_type { > BPF_PROG_TYPE_SK_LOOKUP, > BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */ > BPF_PROG_TYPE_NETFILTER, > + BPF_PROG_TYPE_VNET_HASH, Sorry, we do not add new stable program types anymore. > @@ -6111,6 +6112,10 @@ struct __sk_buff { > __u8 tstamp_type; > __u32 :24; /* Padding, future use. */ > __u64 hwtstamp; > + > + __u32 vnet_hash_value; > + __u16 vnet_hash_report; > + __u16 vnet_rss_queue; > }; we also do not add anything to uapi __sk_buff. > +const struct bpf_verifier_ops vnet_hash_verifier_ops = { > + .get_func_proto = sk_filter_func_proto, > + .is_valid_access = sk_filter_is_valid_access, > + .convert_ctx_access = bpf_convert_ctx_access, > + .gen_ld_abs = bpf_gen_ld_abs, > +}; and we don't do ctx rewrites like this either. Please see how hid-bpf and cgroup rstat are hooking up bpf in _unstable_ way.
On 2023/10/16 1:07, Alexei Starovoitov wrote: > On Sun, Oct 15, 2023 at 7:17 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >> index 0448700890f7..298634556fab 100644 >> --- a/include/uapi/linux/bpf.h >> +++ b/include/uapi/linux/bpf.h >> @@ -988,6 +988,7 @@ enum bpf_prog_type { >> BPF_PROG_TYPE_SK_LOOKUP, >> BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */ >> BPF_PROG_TYPE_NETFILTER, >> + BPF_PROG_TYPE_VNET_HASH, > > Sorry, we do not add new stable program types anymore. > >> @@ -6111,6 +6112,10 @@ struct __sk_buff { >> __u8 tstamp_type; >> __u32 :24; /* Padding, future use. */ >> __u64 hwtstamp; >> + >> + __u32 vnet_hash_value; >> + __u16 vnet_hash_report; >> + __u16 vnet_rss_queue; >> }; > > we also do not add anything to uapi __sk_buff. > >> +const struct bpf_verifier_ops vnet_hash_verifier_ops = { >> + .get_func_proto = sk_filter_func_proto, >> + .is_valid_access = sk_filter_is_valid_access, >> + .convert_ctx_access = bpf_convert_ctx_access, >> + .gen_ld_abs = bpf_gen_ld_abs, >> +}; > > and we don't do ctx rewrites like this either. > > Please see how hid-bpf and cgroup rstat are hooking up bpf > in _unstable_ way. Can you describe what "stable" and "unstable" mean here? I'm new to BPF and I'm worried if it may mean the interface stability. Let me describe the context. QEMU bundles an eBPF program that is used for the "eBPF steering program" feature of tun. Now I'm proposing to extend the feature to allow to return some values to the userspace and vhost_net. As such, the extension needs to be done in a way that ensures interface stability.
On Sun, Oct 15, 2023 at 10:10 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > On 2023/10/16 1:07, Alexei Starovoitov wrote: > > On Sun, Oct 15, 2023 at 7:17 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >> > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > >> index 0448700890f7..298634556fab 100644 > >> --- a/include/uapi/linux/bpf.h > >> +++ b/include/uapi/linux/bpf.h > >> @@ -988,6 +988,7 @@ enum bpf_prog_type { > >> BPF_PROG_TYPE_SK_LOOKUP, > >> BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */ > >> BPF_PROG_TYPE_NETFILTER, > >> + BPF_PROG_TYPE_VNET_HASH, > > > > Sorry, we do not add new stable program types anymore. > > > >> @@ -6111,6 +6112,10 @@ struct __sk_buff { > >> __u8 tstamp_type; > >> __u32 :24; /* Padding, future use. */ > >> __u64 hwtstamp; > >> + > >> + __u32 vnet_hash_value; > >> + __u16 vnet_hash_report; > >> + __u16 vnet_rss_queue; > >> }; > > > > we also do not add anything to uapi __sk_buff. > > > >> +const struct bpf_verifier_ops vnet_hash_verifier_ops = { > >> + .get_func_proto = sk_filter_func_proto, > >> + .is_valid_access = sk_filter_is_valid_access, > >> + .convert_ctx_access = bpf_convert_ctx_access, > >> + .gen_ld_abs = bpf_gen_ld_abs, > >> +}; > > > > and we don't do ctx rewrites like this either. > > > > Please see how hid-bpf and cgroup rstat are hooking up bpf > > in _unstable_ way. > > Can you describe what "stable" and "unstable" mean here? I'm new to BPF > and I'm worried if it may mean the interface stability. > > Let me describe the context. QEMU bundles an eBPF program that is used > for the "eBPF steering program" feature of tun. Now I'm proposing to > extend the feature to allow to return some values to the userspace and > vhost_net. As such, the extension needs to be done in a way that ensures > interface stability. bpf is not an option then. we do not add stable bpf program types or hooks any more. If a kernel subsystem wants to use bpf it needs to accept the fact that such bpf extensibility will be unstable and subsystem maintainers can decide to remove such bpf support in the future.
On Mon, Oct 16, 2023 at 7:53 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Sun, Oct 15, 2023 at 10:10 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > > > On 2023/10/16 1:07, Alexei Starovoitov wrote: > > > On Sun, Oct 15, 2023 at 7:17 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > >> > > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > >> index 0448700890f7..298634556fab 100644 > > >> --- a/include/uapi/linux/bpf.h > > >> +++ b/include/uapi/linux/bpf.h > > >> @@ -988,6 +988,7 @@ enum bpf_prog_type { > > >> BPF_PROG_TYPE_SK_LOOKUP, > > >> BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */ > > >> BPF_PROG_TYPE_NETFILTER, > > >> + BPF_PROG_TYPE_VNET_HASH, > > > > > > Sorry, we do not add new stable program types anymore. > > > > > >> @@ -6111,6 +6112,10 @@ struct __sk_buff { > > >> __u8 tstamp_type; > > >> __u32 :24; /* Padding, future use. */ > > >> __u64 hwtstamp; > > >> + > > >> + __u32 vnet_hash_value; > > >> + __u16 vnet_hash_report; > > >> + __u16 vnet_rss_queue; > > >> }; > > > > > > we also do not add anything to uapi __sk_buff. > > > > > >> +const struct bpf_verifier_ops vnet_hash_verifier_ops = { > > >> + .get_func_proto = sk_filter_func_proto, > > >> + .is_valid_access = sk_filter_is_valid_access, > > >> + .convert_ctx_access = bpf_convert_ctx_access, > > >> + .gen_ld_abs = bpf_gen_ld_abs, > > >> +}; > > > > > > and we don't do ctx rewrites like this either. > > > > > > Please see how hid-bpf and cgroup rstat are hooking up bpf > > > in _unstable_ way. > > > > Can you describe what "stable" and "unstable" mean here? I'm new to BPF > > and I'm worried if it may mean the interface stability. > > > > Let me describe the context. QEMU bundles an eBPF program that is used > > for the "eBPF steering program" feature of tun. Now I'm proposing to > > extend the feature to allow to return some values to the userspace and > > vhost_net. As such, the extension needs to be done in a way that ensures > > interface stability. > > bpf is not an option then. > we do not add stable bpf program types or hooks any more. > If a kernel subsystem wants to use bpf it needs to accept the fact > that such bpf extensibility will be unstable and subsystem maintainers > can decide to remove such bpf support in the future. Based on hooks for tracepoints and kfuncs, correct? Perhaps the existing stable flow dissector type is extensible to optionally compute the hash and report hash and hash type. Else we probably should revisit the previous version of this series.
On Tue, Oct 17, 2023 at 7:53 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Sun, Oct 15, 2023 at 10:10 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > > > On 2023/10/16 1:07, Alexei Starovoitov wrote: > > > On Sun, Oct 15, 2023 at 7:17 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > >> > > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > >> index 0448700890f7..298634556fab 100644 > > >> --- a/include/uapi/linux/bpf.h > > >> +++ b/include/uapi/linux/bpf.h > > >> @@ -988,6 +988,7 @@ enum bpf_prog_type { > > >> BPF_PROG_TYPE_SK_LOOKUP, > > >> BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */ > > >> BPF_PROG_TYPE_NETFILTER, > > >> + BPF_PROG_TYPE_VNET_HASH, > > > > > > Sorry, we do not add new stable program types anymore. > > > > > >> @@ -6111,6 +6112,10 @@ struct __sk_buff { > > >> __u8 tstamp_type; > > >> __u32 :24; /* Padding, future use. */ > > >> __u64 hwtstamp; > > >> + > > >> + __u32 vnet_hash_value; > > >> + __u16 vnet_hash_report; > > >> + __u16 vnet_rss_queue; > > >> }; > > > > > > we also do not add anything to uapi __sk_buff. > > > > > >> +const struct bpf_verifier_ops vnet_hash_verifier_ops = { > > >> + .get_func_proto = sk_filter_func_proto, > > >> + .is_valid_access = sk_filter_is_valid_access, > > >> + .convert_ctx_access = bpf_convert_ctx_access, > > >> + .gen_ld_abs = bpf_gen_ld_abs, > > >> +}; > > > > > > and we don't do ctx rewrites like this either. > > > > > > Please see how hid-bpf and cgroup rstat are hooking up bpf > > > in _unstable_ way. > > > > Can you describe what "stable" and "unstable" mean here? I'm new to BPF > > and I'm worried if it may mean the interface stability. > > > > Let me describe the context. QEMU bundles an eBPF program that is used > > for the "eBPF steering program" feature of tun. Now I'm proposing to > > extend the feature to allow to return some values to the userspace and > > vhost_net. As such, the extension needs to be done in a way that ensures > > interface stability. > > bpf is not an option then. > we do not add stable bpf program types or hooks any more. Does this mean eBPF could not be used for any new use cases other than the existing ones? > If a kernel subsystem wants to use bpf it needs to accept the fact > that such bpf extensibility will be unstable and subsystem maintainers > can decide to remove such bpf support in the future. I don't see how it is different from the existing ones. Thanks >
On Mon, Oct 16, 2023 at 7:38 PM Jason Wang <jasowang@redhat.com> wrote: > > On Tue, Oct 17, 2023 at 7:53 AM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Sun, Oct 15, 2023 at 10:10 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > > > > > On 2023/10/16 1:07, Alexei Starovoitov wrote: > > > > On Sun, Oct 15, 2023 at 7:17 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > > >> > > > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > > >> index 0448700890f7..298634556fab 100644 > > > >> --- a/include/uapi/linux/bpf.h > > > >> +++ b/include/uapi/linux/bpf.h > > > >> @@ -988,6 +988,7 @@ enum bpf_prog_type { > > > >> BPF_PROG_TYPE_SK_LOOKUP, > > > >> BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */ > > > >> BPF_PROG_TYPE_NETFILTER, > > > >> + BPF_PROG_TYPE_VNET_HASH, > > > > > > > > Sorry, we do not add new stable program types anymore. > > > > > > > >> @@ -6111,6 +6112,10 @@ struct __sk_buff { > > > >> __u8 tstamp_type; > > > >> __u32 :24; /* Padding, future use. */ > > > >> __u64 hwtstamp; > > > >> + > > > >> + __u32 vnet_hash_value; > > > >> + __u16 vnet_hash_report; > > > >> + __u16 vnet_rss_queue; > > > >> }; > > > > > > > > we also do not add anything to uapi __sk_buff. > > > > > > > >> +const struct bpf_verifier_ops vnet_hash_verifier_ops = { > > > >> + .get_func_proto = sk_filter_func_proto, > > > >> + .is_valid_access = sk_filter_is_valid_access, > > > >> + .convert_ctx_access = bpf_convert_ctx_access, > > > >> + .gen_ld_abs = bpf_gen_ld_abs, > > > >> +}; > > > > > > > > and we don't do ctx rewrites like this either. > > > > > > > > Please see how hid-bpf and cgroup rstat are hooking up bpf > > > > in _unstable_ way. > > > > > > Can you describe what "stable" and "unstable" mean here? I'm new to BPF > > > and I'm worried if it may mean the interface stability. > > > > > > Let me describe the context. QEMU bundles an eBPF program that is used > > > for the "eBPF steering program" feature of tun. Now I'm proposing to > > > extend the feature to allow to return some values to the userspace and > > > vhost_net. As such, the extension needs to be done in a way that ensures > > > interface stability. > > > > bpf is not an option then. > > we do not add stable bpf program types or hooks any more. > > Does this mean eBPF could not be used for any new use cases other than > the existing ones? It means that any new use of bpf has to be unstable for the time being. > > If a kernel subsystem wants to use bpf it needs to accept the fact > > that such bpf extensibility will be unstable and subsystem maintainers > > can decide to remove such bpf support in the future. > > I don't see how it is different from the existing ones. Can we remove BPF_CGROUP_RUN_PROG_INET_INGRESS hook along with BPF_PROG_TYPE_CGROUP_SKB program type? Obviously not. We can refactor it. We can move it around, but not remove. That's the difference in stable vs unstable.
On 2023/10/18 4:03, Alexei Starovoitov wrote: > On Mon, Oct 16, 2023 at 7:38 PM Jason Wang <jasowang@redhat.com> wrote: >> >> On Tue, Oct 17, 2023 at 7:53 AM Alexei Starovoitov >> <alexei.starovoitov@gmail.com> wrote: >>> >>> On Sun, Oct 15, 2023 at 10:10 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>> >>>> On 2023/10/16 1:07, Alexei Starovoitov wrote: >>>>> On Sun, Oct 15, 2023 at 7:17 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>> >>>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >>>>>> index 0448700890f7..298634556fab 100644 >>>>>> --- a/include/uapi/linux/bpf.h >>>>>> +++ b/include/uapi/linux/bpf.h >>>>>> @@ -988,6 +988,7 @@ enum bpf_prog_type { >>>>>> BPF_PROG_TYPE_SK_LOOKUP, >>>>>> BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */ >>>>>> BPF_PROG_TYPE_NETFILTER, >>>>>> + BPF_PROG_TYPE_VNET_HASH, >>>>> >>>>> Sorry, we do not add new stable program types anymore. >>>>> >>>>>> @@ -6111,6 +6112,10 @@ struct __sk_buff { >>>>>> __u8 tstamp_type; >>>>>> __u32 :24; /* Padding, future use. */ >>>>>> __u64 hwtstamp; >>>>>> + >>>>>> + __u32 vnet_hash_value; >>>>>> + __u16 vnet_hash_report; >>>>>> + __u16 vnet_rss_queue; >>>>>> }; >>>>> >>>>> we also do not add anything to uapi __sk_buff. >>>>> >>>>>> +const struct bpf_verifier_ops vnet_hash_verifier_ops = { >>>>>> + .get_func_proto = sk_filter_func_proto, >>>>>> + .is_valid_access = sk_filter_is_valid_access, >>>>>> + .convert_ctx_access = bpf_convert_ctx_access, >>>>>> + .gen_ld_abs = bpf_gen_ld_abs, >>>>>> +}; >>>>> >>>>> and we don't do ctx rewrites like this either. >>>>> >>>>> Please see how hid-bpf and cgroup rstat are hooking up bpf >>>>> in _unstable_ way. >>>> >>>> Can you describe what "stable" and "unstable" mean here? I'm new to BPF >>>> and I'm worried if it may mean the interface stability. >>>> >>>> Let me describe the context. QEMU bundles an eBPF program that is used >>>> for the "eBPF steering program" feature of tun. Now I'm proposing to >>>> extend the feature to allow to return some values to the userspace and >>>> vhost_net. As such, the extension needs to be done in a way that ensures >>>> interface stability. >>> >>> bpf is not an option then. >>> we do not add stable bpf program types or hooks any more. >> >> Does this mean eBPF could not be used for any new use cases other than >> the existing ones? > > It means that any new use of bpf has to be unstable for the time being. Can you elaborate more about making new use unstable "for the time being?" Is it a temporary situation? What is the rationale for that? Such information will help devise a solution that is best for both of the BPF and network subsystems. I would also appreciate if you have some documentation or link to relevant discussions on the mailing list. That will avoid having same discussion you may already have done in the past.
On 2023/10/18 4:19, Akihiko Odaki wrote: > On 2023/10/18 4:03, Alexei Starovoitov wrote: >> On Mon, Oct 16, 2023 at 7:38 PM Jason Wang <jasowang@redhat.com> wrote: >>> >>> On Tue, Oct 17, 2023 at 7:53 AM Alexei Starovoitov >>> <alexei.starovoitov@gmail.com> wrote: >>>> >>>> On Sun, Oct 15, 2023 at 10:10 AM Akihiko Odaki >>>> <akihiko.odaki@daynix.com> wrote: >>>>> >>>>> On 2023/10/16 1:07, Alexei Starovoitov wrote: >>>>>> On Sun, Oct 15, 2023 at 7:17 AM Akihiko Odaki >>>>>> <akihiko.odaki@daynix.com> wrote: >>>>>>> >>>>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >>>>>>> index 0448700890f7..298634556fab 100644 >>>>>>> --- a/include/uapi/linux/bpf.h >>>>>>> +++ b/include/uapi/linux/bpf.h >>>>>>> @@ -988,6 +988,7 @@ enum bpf_prog_type { >>>>>>> BPF_PROG_TYPE_SK_LOOKUP, >>>>>>> BPF_PROG_TYPE_SYSCALL, /* a program that can execute >>>>>>> syscalls */ >>>>>>> BPF_PROG_TYPE_NETFILTER, >>>>>>> + BPF_PROG_TYPE_VNET_HASH, >>>>>> >>>>>> Sorry, we do not add new stable program types anymore. >>>>>> >>>>>>> @@ -6111,6 +6112,10 @@ struct __sk_buff { >>>>>>> __u8 tstamp_type; >>>>>>> __u32 :24; /* Padding, future use. */ >>>>>>> __u64 hwtstamp; >>>>>>> + >>>>>>> + __u32 vnet_hash_value; >>>>>>> + __u16 vnet_hash_report; >>>>>>> + __u16 vnet_rss_queue; >>>>>>> }; >>>>>> >>>>>> we also do not add anything to uapi __sk_buff. >>>>>> >>>>>>> +const struct bpf_verifier_ops vnet_hash_verifier_ops = { >>>>>>> + .get_func_proto = sk_filter_func_proto, >>>>>>> + .is_valid_access = sk_filter_is_valid_access, >>>>>>> + .convert_ctx_access = bpf_convert_ctx_access, >>>>>>> + .gen_ld_abs = bpf_gen_ld_abs, >>>>>>> +}; >>>>>> >>>>>> and we don't do ctx rewrites like this either. >>>>>> >>>>>> Please see how hid-bpf and cgroup rstat are hooking up bpf >>>>>> in _unstable_ way. >>>>> >>>>> Can you describe what "stable" and "unstable" mean here? I'm new to >>>>> BPF >>>>> and I'm worried if it may mean the interface stability. >>>>> >>>>> Let me describe the context. QEMU bundles an eBPF program that is used >>>>> for the "eBPF steering program" feature of tun. Now I'm proposing to >>>>> extend the feature to allow to return some values to the userspace and >>>>> vhost_net. As such, the extension needs to be done in a way that >>>>> ensures >>>>> interface stability. >>>> >>>> bpf is not an option then. >>>> we do not add stable bpf program types or hooks any more. >>> >>> Does this mean eBPF could not be used for any new use cases other than >>> the existing ones? >> >> It means that any new use of bpf has to be unstable for the time being. > > Can you elaborate more about making new use unstable "for the time > being?" Is it a temporary situation? What is the rationale for that? > Such information will help devise a solution that is best for both of > the BPF and network subsystems. > > I would also appreciate if you have some documentation or link to > relevant discussions on the mailing list. That will avoid having same > discussion you may already have done in the past. Hi, The discussion has been stuck for a month, but I'd still like to continue figuring out the way best for the whole kernel to implement this feature. I summarize the current situation and question that needs to be answered before push this forward: The goal of this RFC is to allow to report hash values calculated with eBPF steering program. It's essentially just to report 4 bytes from the kernel to the userspace. Unfortunately, however, it is not acceptable for the BPF subsystem because the "stable" BPF is completely fixed these days. The "unstable/kfunc" BPF is an alternative, but the eBPF program will be shipped with a portable userspace program (QEMU)[1] so the lack of interface stability is not tolerable. Another option is to hardcode the algorithm that was conventionally implemented with eBPF steering program in the kernel[2]. It is possible because the algorithm strictly follows the virtio-net specification[3]. However, there are proposals to add different algorithms to the specification[4], and hardcoding the algorithm to the kernel will require to add more UAPIs and code each time such a specification change happens, which is not good for tuntap. In short, the proposed feature requires to make either of three compromises: 1. Compromise on the BPF side: Relax the "stable" BPF feature freeze once and allow eBPF steering program to report 4 more bytes to the kernel. 2. Compromise on the tuntap side: Implement the algorithm to the kernel, and abandon the capability to update the algorithm without changing the kernel. IMHO, I think it's better to make a compromise on the BPF side (option 1). We should minimize the total UAPI changes in the whole kernel, and option 1 is much superior in that sense. Yet I have to note that such a compromise on the BPF side can risk the "stable" BPF feature freeze fragile and let other people complain like "you allowed to change stable BPF for this, why do you reject [some other request to change stable BPF]?" It is bad for BPF maintainers. (I can imagine that introducing and maintaining widely different BPF interfaces is too much burden.) And, of course, this requires an approval from BPF maintainers. So I'd like to ask you that which of these compromises you think worse. Please also tell me if you have another idea. Regards, Akihiko Odaki [1] https://qemu.readthedocs.io/en/v8.1.0/devel/ebpf_rss.html [2] https://lore.kernel.org/all/20231008052101.144422-1-akihiko.odaki@daynix.com/ [3] https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-2400003 [4] https://lore.kernel.org/all/CACGkMEuBbGKssxNv5AfpaPpWQfk2BHR83rM5AHXN-YVMf2NvpQ@mail.gmail.com/
Hi, A few rookie questions below. On Sat, Nov 18, 2023 at 2:39 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > On 2023/10/18 4:19, Akihiko Odaki wrote: > > On 2023/10/18 4:03, Alexei Starovoitov wrote: [...] > > > > I would also appreciate if you have some documentation or link to > > relevant discussions on the mailing list. That will avoid having same > > discussion you may already have done in the past. > > Hi, > > The discussion has been stuck for a month, but I'd still like to > continue figuring out the way best for the whole kernel to implement > this feature. I summarize the current situation and question that needs > to be answered before push this forward: > > The goal of this RFC is to allow to report hash values calculated with > eBPF steering program. It's essentially just to report 4 bytes from the > kernel to the userspace. AFAICT, the proposed design is to have BPF generate some data (namely hash, but could be anything afaict) and consume it from user space. Instead of updating __sk_buff, can we have the user space to fetch the data/hash from a bpf map? If this is an option, I guess we can implement the same feature with BPF tracing programs? > > Unfortunately, however, it is not acceptable for the BPF subsystem > because the "stable" BPF is completely fixed these days. The > "unstable/kfunc" BPF is an alternative, but the eBPF program will be > shipped with a portable userspace program (QEMU)[1] so the lack of > interface stability is not tolerable. bpf kfuncs are as stable as exported symbols. Is exported symbols like stability enough for the use case? (I would assume yes.) > > Another option is to hardcode the algorithm that was conventionally > implemented with eBPF steering program in the kernel[2]. It is possible > because the algorithm strictly follows the virtio-net specification[3]. > However, there are proposals to add different algorithms to the > specification[4], and hardcoding the algorithm to the kernel will > require to add more UAPIs and code each time such a specification change > happens, which is not good for tuntap. The requirement looks similar to hid-bpf. Could you explain why that model is not enough? HID also requires some stability AFAICT. Thanks, Song > > In short, the proposed feature requires to make either of three compromises: > > 1. Compromise on the BPF side: Relax the "stable" BPF feature freeze > once and allow eBPF steering program to report 4 more bytes to the kernel. > > 2. Compromise on the tuntap side: Implement the algorithm to the kernel, > and abandon the capability to update the algorithm without changing the > kernel. > > IMHO, I think it's better to make a compromise on the BPF side (option > 1). We should minimize the total UAPI changes in the whole kernel, and > option 1 is much superior in that sense. > > Yet I have to note that such a compromise on the BPF side can risk the > "stable" BPF feature freeze fragile and let other people complain like > "you allowed to change stable BPF for this, why do you reject [some > other request to change stable BPF]?" It is bad for BPF maintainers. (I > can imagine that introducing and maintaining widely different BPF > interfaces is too much burden.) And, of course, this requires an > approval from BPF maintainers. > > So I'd like to ask you that which of these compromises you think worse. > Please also tell me if you have another idea. > > Regards, > Akihiko Odaki > > [1] https://qemu.readthedocs.io/en/v8.1.0/devel/ebpf_rss.html > [2] > https://lore.kernel.org/all/20231008052101.144422-1-akihiko.odaki@daynix.com/ > [3] > https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-2400003 > [4] > https://lore.kernel.org/all/CACGkMEuBbGKssxNv5AfpaPpWQfk2BHR83rM5AHXN-YVMf2NvpQ@mail.gmail.com/
On 2023/11/19 1:08, Song Liu wrote: > Hi, > > A few rookie questions below. Thanks for questions. > > On Sat, Nov 18, 2023 at 2:39 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >> On 2023/10/18 4:19, Akihiko Odaki wrote: >>> On 2023/10/18 4:03, Alexei Starovoitov wrote: > [...] >>> >>> I would also appreciate if you have some documentation or link to >>> relevant discussions on the mailing list. That will avoid having same >>> discussion you may already have done in the past. >> >> Hi, >> >> The discussion has been stuck for a month, but I'd still like to >> continue figuring out the way best for the whole kernel to implement >> this feature. I summarize the current situation and question that needs >> to be answered before push this forward: >> >> The goal of this RFC is to allow to report hash values calculated with >> eBPF steering program. It's essentially just to report 4 bytes from the >> kernel to the userspace. > > AFAICT, the proposed design is to have BPF generate some data > (namely hash, but could be anything afaict) and consume it from > user space. Instead of updating __sk_buff, can we have the user > space to fetch the data/hash from a bpf map? If this is an option, > I guess we can implement the same feature with BPF tracing > programs? Unfortunately no. The communication with the userspace can be done with two different means: - usual socket read/write - vhost for direct interaction with a KVM guest The BPF map may be a valid option for socket read/write, but it is not for vhost. In-kernel vhost may fetch hash from the BPF map, but I guess it's not a standard way to have an interaction between the kernel code and a BPF program. > >> >> Unfortunately, however, it is not acceptable for the BPF subsystem >> because the "stable" BPF is completely fixed these days. The >> "unstable/kfunc" BPF is an alternative, but the eBPF program will be >> shipped with a portable userspace program (QEMU)[1] so the lack of >> interface stability is not tolerable. > > bpf kfuncs are as stable as exported symbols. Is exported symbols > like stability enough for the use case? (I would assume yes.) > >> >> Another option is to hardcode the algorithm that was conventionally >> implemented with eBPF steering program in the kernel[2]. It is possible >> because the algorithm strictly follows the virtio-net specification[3]. >> However, there are proposals to add different algorithms to the >> specification[4], and hardcoding the algorithm to the kernel will >> require to add more UAPIs and code each time such a specification change >> happens, which is not good for tuntap. > > The requirement looks similar to hid-bpf. Could you explain why that > model is not enough? HID also requires some stability AFAICT. I have little knowledge with hid-bpf, but I assume it is more like a "safe" kernel module; in my understanding, it affects the system state and is intended to be loaded with some kind of a system daemon. It is fine to have the same lifecycle with the kernel for such a BPF program; whenever the kernel is updated, the distributor can recompile the BPF program with the new kernel headers and ship it along with the kernel just as like a kernel module. In contrast, our intended use case is more like a normal application. So, for example, a user may download a container and run QEMU (including the BPF program) installed in the container. As such, it is nice if the ABI is stable across kernel releases, but it is not guaranteed for kfuncs. Such a use case is already covered with the eBPF steering program so I want to maintain it if possible. Regards, Akihiko Odaki
On Sun, Nov 19, 2023 at 12:03 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > [...] > > Unfortunately no. The communication with the userspace can be done with > two different means: > - usual socket read/write > - vhost for direct interaction with a KVM guest > > The BPF map may be a valid option for socket read/write, but it is not > for vhost. In-kernel vhost may fetch hash from the BPF map, but I guess > it's not a standard way to have an interaction between the kernel code > and a BPF program. I am very new to areas like vhost and KVM. So I don't really follow. Does this mean we have the guest kernel reading data from host eBPF programs (loaded by Qemu)? > > > >> > >> Unfortunately, however, it is not acceptable for the BPF subsystem > >> because the "stable" BPF is completely fixed these days. The > >> "unstable/kfunc" BPF is an alternative, but the eBPF program will be > >> shipped with a portable userspace program (QEMU)[1] so the lack of > >> interface stability is not tolerable. > > > > bpf kfuncs are as stable as exported symbols. Is exported symbols > > like stability enough for the use case? (I would assume yes.) > > > >> > >> Another option is to hardcode the algorithm that was conventionally > >> implemented with eBPF steering program in the kernel[2]. It is possible > >> because the algorithm strictly follows the virtio-net specification[3]. > >> However, there are proposals to add different algorithms to the > >> specification[4], and hardcoding the algorithm to the kernel will > >> require to add more UAPIs and code each time such a specification change > >> happens, which is not good for tuntap. > > > > The requirement looks similar to hid-bpf. Could you explain why that > > model is not enough? HID also requires some stability AFAICT. > > I have little knowledge with hid-bpf, but I assume it is more like a > "safe" kernel module; in my understanding, it affects the system state > and is intended to be loaded with some kind of a system daemon. It is > fine to have the same lifecycle with the kernel for such a BPF program; > whenever the kernel is updated, the distributor can recompile the BPF > program with the new kernel headers and ship it along with the kernel > just as like a kernel module. > > In contrast, our intended use case is more like a normal application. > So, for example, a user may download a container and run QEMU (including > the BPF program) installed in the container. As such, it is nice if the > ABI is stable across kernel releases, but it is not guaranteed for > kfuncs. Such a use case is already covered with the eBPF steering > program so I want to maintain it if possible. TBH, I don't think stability should be a concern for kfuncs used by QEMU. Many core BPF APIs are now implemented as kfuncs: bpf_dynptr_*, bpf_rcu_*, etc. As long as there are valid use cases,these kfuncs will be supported. Thanks, Song
On 2023/11/20 6:02, Song Liu wrote: > On Sun, Nov 19, 2023 at 12:03 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> > [...] >> >> Unfortunately no. The communication with the userspace can be done with >> two different means: >> - usual socket read/write >> - vhost for direct interaction with a KVM guest >> >> The BPF map may be a valid option for socket read/write, but it is not >> for vhost. In-kernel vhost may fetch hash from the BPF map, but I guess >> it's not a standard way to have an interaction between the kernel code >> and a BPF program. > > I am very new to areas like vhost and KVM. So I don't really follow. > Does this mean we have the guest kernel reading data from host eBPF > programs (loaded by Qemu)? Yes, the guest will read hashes calculated by the host, and the interface is strictly defined with the virtio-net specification. > >>> >>>> >>>> Unfortunately, however, it is not acceptable for the BPF subsystem >>>> because the "stable" BPF is completely fixed these days. The >>>> "unstable/kfunc" BPF is an alternative, but the eBPF program will be >>>> shipped with a portable userspace program (QEMU)[1] so the lack of >>>> interface stability is not tolerable. >>> >>> bpf kfuncs are as stable as exported symbols. Is exported symbols >>> like stability enough for the use case? (I would assume yes.) >>> >>>> >>>> Another option is to hardcode the algorithm that was conventionally >>>> implemented with eBPF steering program in the kernel[2]. It is possible >>>> because the algorithm strictly follows the virtio-net specification[3]. >>>> However, there are proposals to add different algorithms to the >>>> specification[4], and hardcoding the algorithm to the kernel will >>>> require to add more UAPIs and code each time such a specification change >>>> happens, which is not good for tuntap. >>> >>> The requirement looks similar to hid-bpf. Could you explain why that >>> model is not enough? HID also requires some stability AFAICT. >> >> I have little knowledge with hid-bpf, but I assume it is more like a >> "safe" kernel module; in my understanding, it affects the system state >> and is intended to be loaded with some kind of a system daemon. It is >> fine to have the same lifecycle with the kernel for such a BPF program; >> whenever the kernel is updated, the distributor can recompile the BPF >> program with the new kernel headers and ship it along with the kernel >> just as like a kernel module. >> >> In contrast, our intended use case is more like a normal application. >> So, for example, a user may download a container and run QEMU (including >> the BPF program) installed in the container. As such, it is nice if the >> ABI is stable across kernel releases, but it is not guaranteed for >> kfuncs. Such a use case is already covered with the eBPF steering >> program so I want to maintain it if possible. > > TBH, I don't think stability should be a concern for kfuncs used by QEMU. > Many core BPF APIs are now implemented as kfuncs: bpf_dynptr_*, > bpf_rcu_*, etc. As long as there are valid use cases,these kfuncs will > be supported. Documentation/bpf/kfuncs.rst still says: > kfuncs provide a kernel <-> kernel API, and thus are not bound by any > of the strict stability restrictions associated with kernel <-> user > UAPIs. Is it possible to change the statement like as follows: "Most kfuncs provide a kernel <-> kernel API, and thus are not bound by any of the strict stability restrictions associated with kernel <-> user UAPIs. kfuncs that have same stability restrictions associated with UAPIs are exceptional, and must be carefully reviewed by subsystem (and BPF?) maintainers as any other UAPIs are." Regards, Akihiko Odaki
On Mon, Nov 20, 2023 at 12:05 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > On 2023/11/20 6:02, Song Liu wrote: [...] > >> In contrast, our intended use case is more like a normal application. > >> So, for example, a user may download a container and run QEMU (including > >> the BPF program) installed in the container. As such, it is nice if the > >> ABI is stable across kernel releases, but it is not guaranteed for > >> kfuncs. Such a use case is already covered with the eBPF steering > >> program so I want to maintain it if possible. > > > > TBH, I don't think stability should be a concern for kfuncs used by QEMU. > > Many core BPF APIs are now implemented as kfuncs: bpf_dynptr_*, > > bpf_rcu_*, etc. As long as there are valid use cases,these kfuncs will > > be supported. > > Documentation/bpf/kfuncs.rst still says: > > kfuncs provide a kernel <-> kernel API, and thus are not bound by any > > of the strict stability restrictions associated with kernel <-> user > > UAPIs. > > Is it possible to change the statement like as follows: > "Most kfuncs provide a kernel <-> kernel API, and thus are not bound by > any of the strict stability restrictions associated with kernel <-> user > UAPIs. kfuncs that have same stability restrictions associated with > UAPIs are exceptional, and must be carefully reviewed by subsystem (and > BPF?) maintainers as any other UAPIs are." I am afraid this is against the intention to not guarantee UAPI-level stability for kfuncs. Thanks, Song
On 2023/11/22 14:25, Song Liu wrote: > On Mon, Nov 20, 2023 at 12:05 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >> On 2023/11/20 6:02, Song Liu wrote: > [...] >>>> In contrast, our intended use case is more like a normal application. >>>> So, for example, a user may download a container and run QEMU (including >>>> the BPF program) installed in the container. As such, it is nice if the >>>> ABI is stable across kernel releases, but it is not guaranteed for >>>> kfuncs. Such a use case is already covered with the eBPF steering >>>> program so I want to maintain it if possible. >>> >>> TBH, I don't think stability should be a concern for kfuncs used by QEMU. >>> Many core BPF APIs are now implemented as kfuncs: bpf_dynptr_*, >>> bpf_rcu_*, etc. As long as there are valid use cases,these kfuncs will >>> be supported. >> >> Documentation/bpf/kfuncs.rst still says: >> > kfuncs provide a kernel <-> kernel API, and thus are not bound by any >> > of the strict stability restrictions associated with kernel <-> user >> > UAPIs. >> >> Is it possible to change the statement like as follows: >> "Most kfuncs provide a kernel <-> kernel API, and thus are not bound by >> any of the strict stability restrictions associated with kernel <-> user >> UAPIs. kfuncs that have same stability restrictions associated with >> UAPIs are exceptional, and must be carefully reviewed by subsystem (and >> BPF?) maintainers as any other UAPIs are." > > I am afraid this is against the intention to not guarantee UAPI-level stability > for kfuncs. Is it possible to ensure that a QEMU binary with the eBPF program included works on different kernel versions without UAPI-level stability then? Otherwise, I think we need to think of the minimal UAPI addition that exposes the feature I propose, and the two options I presented first are the candidates of such: the stable BPF change or tuntap interface change. Regards, Akihiko Odaki
On 2023/11/22 14:36, Akihiko Odaki wrote: > On 2023/11/22 14:25, Song Liu wrote: >> On Mon, Nov 20, 2023 at 12:05 AM Akihiko Odaki >> <akihiko.odaki@daynix.com> wrote: >>> >>> On 2023/11/20 6:02, Song Liu wrote: >> [...] >>>>> In contrast, our intended use case is more like a normal application. >>>>> So, for example, a user may download a container and run QEMU >>>>> (including >>>>> the BPF program) installed in the container. As such, it is nice if >>>>> the >>>>> ABI is stable across kernel releases, but it is not guaranteed for >>>>> kfuncs. Such a use case is already covered with the eBPF steering >>>>> program so I want to maintain it if possible. >>>> >>>> TBH, I don't think stability should be a concern for kfuncs used by >>>> QEMU. >>>> Many core BPF APIs are now implemented as kfuncs: bpf_dynptr_*, >>>> bpf_rcu_*, etc. As long as there are valid use cases,these kfuncs will >>>> be supported. >>> >>> Documentation/bpf/kfuncs.rst still says: >>> > kfuncs provide a kernel <-> kernel API, and thus are not bound by >>> any >>> > of the strict stability restrictions associated with kernel <-> user >>> > UAPIs. >>> >>> Is it possible to change the statement like as follows: >>> "Most kfuncs provide a kernel <-> kernel API, and thus are not bound by >>> any of the strict stability restrictions associated with kernel <-> user >>> UAPIs. kfuncs that have same stability restrictions associated with >>> UAPIs are exceptional, and must be carefully reviewed by subsystem (and >>> BPF?) maintainers as any other UAPIs are." >> >> I am afraid this is against the intention to not guarantee UAPI-level >> stability >> for kfuncs. > > Is it possible to ensure that a QEMU binary with the eBPF program > included works on different kernel versions without UAPI-level stability > then? Otherwise, I think we need to think of the minimal UAPI addition > that exposes the feature I propose, and the two options I presented > first are the candidates of such: the stable BPF change or tuntap > interface change. > > Regards, > Akihiko Odaki Now the discussion is stale again so let me summarize the discussion: A tuntap device can have an eBPF steering program to let the userspace decide which tuntap queue should be used for each packet. QEMU uses this feature to implement the RSS algorithm for virtio-net emulation. Now, the virtio specification has a new feature to report hash values calculated with the RSS algorithm. The goal of this RFC is to report such hash values from the eBPF steering program to the userspace. There are currently three ideas to implement the proposal: 1. Abandon eBPF steering program and implement RSS in the kernel. It is possible to implement the RSS algorithm in the kernel as it's strictly defined in the specification. However, there are proposals for relevant virtio specification changes, and abandoning eBPF steering program will loose the ability to implement those changes in the userspace. There are concerns that this lead to more UAPI changes in the end. 2. Add BPF kfuncs. Adding BPF kfuncs is *the* standard way to add BPF interfaces. hid-bpf is a good reference for this. The problem with BPF kfuncs is that kfuncs are not considered as stable as UAPI. In my understanding, it is not problematic for things like hid-bpf because programs using those kfuncs affect the entire system state and expected to be centrally managed. Such BPF programs can be updated along with the kernel in a manner similar to kernel modules. The use case of tuntap steering/hash reporting is somewhat different though; the eBPF program is more like a part of application (QEMU or potentially other VMM) and thus needs to be portable. For example, a user may expect a Debian container with QEMU installed to work on Fedora. BPF kfuncs do still provide some level of stability, but there is no documentation that tell how stable they are. The worst case scenario I can imagine is that a future legitimate BPF change breaks QEMU, letting the "no regressions" rule force the change to be reverted. Some assurance that kind scenario will not happen is necessary in my opinion. 3. Add BPF program type derived from the conventional steering program type In principle, it's just to add a feature to report four more bytes to the conventional steering program. However, BPF program types are frozen for feature additions and the proposed change will break the feature freeze. So what's next? I'm inclined to option 3 due to its minimal ABI/API change, but I'm also fine with option 2 if it is possible to guarantee the ABI/API stability necessary to run pre-built QEMUs on future kernel versions by e.g., explicitly stating the stability of kfuncs. If no objection arises, I'll resend this series with the RFC prefix dropped for upstream inclusion. If it's decided to go for option 1 or 2, I'll post a new version of the series implementing the idea. Regards, Akihiko Odaki
On Sat, Dec 9, 2023 at 11:03 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > On 2023/11/22 14:36, Akihiko Odaki wrote: > > On 2023/11/22 14:25, Song Liu wrote: [...] > > Now the discussion is stale again so let me summarize the discussion: > > A tuntap device can have an eBPF steering program to let the userspace > decide which tuntap queue should be used for each packet. QEMU uses this > feature to implement the RSS algorithm for virtio-net emulation. Now, > the virtio specification has a new feature to report hash values > calculated with the RSS algorithm. The goal of this RFC is to report > such hash values from the eBPF steering program to the userspace. > > There are currently three ideas to implement the proposal: > > 1. Abandon eBPF steering program and implement RSS in the kernel. > > It is possible to implement the RSS algorithm in the kernel as it's > strictly defined in the specification. However, there are proposals for > relevant virtio specification changes, and abandoning eBPF steering > program will loose the ability to implement those changes in the > userspace. There are concerns that this lead to more UAPI changes in the > end. > > 2. Add BPF kfuncs. > > Adding BPF kfuncs is *the* standard way to add BPF interfaces. hid-bpf > is a good reference for this. > > The problem with BPF kfuncs is that kfuncs are not considered as stable > as UAPI. In my understanding, it is not problematic for things like > hid-bpf because programs using those kfuncs affect the entire system > state and expected to be centrally managed. Such BPF programs can be > updated along with the kernel in a manner similar to kernel modules. > > The use case of tuntap steering/hash reporting is somewhat different > though; the eBPF program is more like a part of application (QEMU or > potentially other VMM) and thus needs to be portable. For example, a > user may expect a Debian container with QEMU installed to work on Fedora. > > BPF kfuncs do still provide some level of stability, but there is no > documentation that tell how stable they are. The worst case scenario I > can imagine is that a future legitimate BPF change breaks QEMU, letting > the "no regressions" rule force the change to be reverted. Some > assurance that kind scenario will not happen is necessary in my opinion. I don't think we can provide stability guarantees before seeing something being used in the field. How do we know it will be useful forever? If a couple years later, there is only one person using it somewhere in the world, why should we keep supporting it? If there are millions of virtual machines using it, why would you worry about it being removed? > > 3. Add BPF program type derived from the conventional steering program type > > In principle, it's just to add a feature to report four more bytes to > the conventional steering program. However, BPF program types are frozen > for feature additions and the proposed change will break the feature freeze. > > So what's next? I'm inclined to option 3 due to its minimal ABI/API > change, but I'm also fine with option 2 if it is possible to guarantee > the ABI/API stability necessary to run pre-built QEMUs on future kernel > versions by e.g., explicitly stating the stability of kfuncs. If no > objection arises, I'll resend this series with the RFC prefix dropped > for upstream inclusion. If it's decided to go for option 1 or 2, I'll > post a new version of the series implementing the idea. Probably a dumb question, but does this RFC fall into option 3? If that's the case, I seriously don't think it's gonna happen. I would recommend you give option 2 a try and share the code. This is probably the best way to move the discussion forward. Thanks, Song
On 2023/12/11 10:40, Song Liu wrote: > On Sat, Dec 9, 2023 at 11:03 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >> On 2023/11/22 14:36, Akihiko Odaki wrote: >>> On 2023/11/22 14:25, Song Liu wrote: > [...] >> >> Now the discussion is stale again so let me summarize the discussion: >> >> A tuntap device can have an eBPF steering program to let the userspace >> decide which tuntap queue should be used for each packet. QEMU uses this >> feature to implement the RSS algorithm for virtio-net emulation. Now, >> the virtio specification has a new feature to report hash values >> calculated with the RSS algorithm. The goal of this RFC is to report >> such hash values from the eBPF steering program to the userspace. >> >> There are currently three ideas to implement the proposal: >> >> 1. Abandon eBPF steering program and implement RSS in the kernel. >> >> It is possible to implement the RSS algorithm in the kernel as it's >> strictly defined in the specification. However, there are proposals for >> relevant virtio specification changes, and abandoning eBPF steering >> program will loose the ability to implement those changes in the >> userspace. There are concerns that this lead to more UAPI changes in the >> end. >> >> 2. Add BPF kfuncs. >> >> Adding BPF kfuncs is *the* standard way to add BPF interfaces. hid-bpf >> is a good reference for this. >> >> The problem with BPF kfuncs is that kfuncs are not considered as stable >> as UAPI. In my understanding, it is not problematic for things like >> hid-bpf because programs using those kfuncs affect the entire system >> state and expected to be centrally managed. Such BPF programs can be >> updated along with the kernel in a manner similar to kernel modules. >> >> The use case of tuntap steering/hash reporting is somewhat different >> though; the eBPF program is more like a part of application (QEMU or >> potentially other VMM) and thus needs to be portable. For example, a >> user may expect a Debian container with QEMU installed to work on Fedora. >> >> BPF kfuncs do still provide some level of stability, but there is no >> documentation that tell how stable they are. The worst case scenario I >> can imagine is that a future legitimate BPF change breaks QEMU, letting >> the "no regressions" rule force the change to be reverted. Some >> assurance that kind scenario will not happen is necessary in my opinion. > > I don't think we can provide stability guarantees before seeing something > being used in the field. How do we know it will be useful forever? If a > couple years later, there is only one person using it somewhere in the > world, why should we keep supporting it? If there are millions of virtual > machines using it, why would you worry about it being removed? I have a different opinion about providing stability guarantees; I believe it is safe to provide such a guarantee without actual use in a field. We develop features expecting there are real uses, and if it turns out otherwise, we can break the stated guarantee since there is no real use cases anyway. It is fine even breaking UAPIs in such a case, which is stated in Documentation/admin-guide/reporting-regressions.rst. So I rather feel easy about guaranteeing UAPI stability; we can just guarantee the UAPI-level stability for a particular kfunc and use it from QEMU expecting the stability. If the feature is found not useful, QEMU and the kernel can just remove it. I'm more concerned about the other case, which means that there will be wide uses of this feature. A kernel developer may assume the stability of the interface is like one of kernel internal APIs (Documentation/bpf/kfuncs.rst says kfuncs are like EXPORT_SYMBOL_GPL) and decide to change it, breaking old QEMU binaries and that's something I would like to avoid. Regarding the breakage scenario, I think we can avoid the kfuncs removal just by saying "we won't remove them". I'm more worried the case that a change in the BPF kfunc infrastucture requires to recompile the binary. So, in short, I don't think we can say "kfuncs are like EXPORT_SYMBOL_GPL" and "you can freely use kfuncs in a normal userspace application like QEMU" at the same time. > >> >> 3. Add BPF program type derived from the conventional steering program type >> >> In principle, it's just to add a feature to report four more bytes to >> the conventional steering program. However, BPF program types are frozen >> for feature additions and the proposed change will break the feature freeze. >> >> So what's next? I'm inclined to option 3 due to its minimal ABI/API >> change, but I'm also fine with option 2 if it is possible to guarantee >> the ABI/API stability necessary to run pre-built QEMUs on future kernel >> versions by e.g., explicitly stating the stability of kfuncs. If no >> objection arises, I'll resend this series with the RFC prefix dropped >> for upstream inclusion. If it's decided to go for option 1 or 2, I'll >> post a new version of the series implementing the idea. > > Probably a dumb question, but does this RFC fall into option 3? If > that's the case, I seriously don't think it's gonna happen. Yes, it's option 3. > > I would recommend you give option 2 a try and share the code. This is > probably the best way to move the discussion forward. I'd like to add a documentation change to say the added kfuncs are exceptional cases that are not like EXPORT_SYMBOL_GPL in that case. Will it work? Regards, Akihiko Odaki
On Sun, Dec 10, 2023 at 9:04 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > [...] > > > > I don't think we can provide stability guarantees before seeing something > > being used in the field. How do we know it will be useful forever? If a > > couple years later, there is only one person using it somewhere in the > > world, why should we keep supporting it? If there are millions of virtual > > machines using it, why would you worry about it being removed? > > I have a different opinion about providing stability guarantees; I > believe it is safe to provide such a guarantee without actual use in a > field. We develop features expecting there are real uses, and if it > turns out otherwise, we can break the stated guarantee since there is no > real use cases anyway. It is fine even breaking UAPIs in such a case, > which is stated in Documentation/admin-guide/reporting-regressions.rst. > > So I rather feel easy about guaranteeing UAPI stability; we can just > guarantee the UAPI-level stability for a particular kfunc and use it > from QEMU expecting the stability. If the feature is found not useful, > QEMU and the kernel can just remove it. It appears that we more or less agree that this feature may not be something we will support forever. > I'm more concerned about the other case, which means that there will be > wide uses of this feature. A kernel developer may assume the stability > of the interface is like one of kernel internal APIs > (Documentation/bpf/kfuncs.rst says kfuncs are like EXPORT_SYMBOL_GPL) > and decide to change it, breaking old QEMU binaries and that's something > I would like to avoid. > > Regarding the breakage scenario, I think we can avoid the kfuncs removal > just by saying "we won't remove them". I'm more worried the case that a > change in the BPF kfunc infrastucture requires to recompile the binary. > > So, in short, I don't think we can say "kfuncs are like > EXPORT_SYMBOL_GPL" and "you can freely use kfuncs in a normal userspace > application like QEMU" at the same time. > > > [...] > > > > > I would recommend you give option 2 a try and share the code. This is > > probably the best way to move the discussion forward. > > I'd like to add a documentation change to say the added kfuncs are > exceptional cases that are not like EXPORT_SYMBOL_GPL in that case. Will > it work? It will not. The BPF community had a lot of discussions about the stability of BPF APIs, for example in [1]. Therefore, this is not a light decision. AFAICT, what is being proposed in this RFC is way less stable than many kfuncs we added recently. We are not changing the stability guarantee for this. Let's invest our time wisely and work on more meaningful things, for example, a prototype that may actually get accepted. Thanks, Song [1] https://lore.kernel.org/bpf/20221207205537.860248-1-joannelkoong@gmail.com/
diff --git a/Documentation/bpf/bpf_prog_run.rst b/Documentation/bpf/bpf_prog_run.rst index 4868c909df5c..0d108d867c03 100644 --- a/Documentation/bpf/bpf_prog_run.rst +++ b/Documentation/bpf/bpf_prog_run.rst @@ -39,6 +39,7 @@ following types: - ``BPF_PROG_TYPE_STRUCT_OPS`` - ``BPF_PROG_TYPE_RAW_TRACEPOINT`` - ``BPF_PROG_TYPE_SYSCALL`` +- ``BPF_PROG_TYPE_VNET_HASH`` When using the ``BPF_PROG_RUN`` command, userspace supplies an input context object and (for program types operating on network packets) a buffer containing diff --git a/Documentation/bpf/libbpf/program_types.rst b/Documentation/bpf/libbpf/program_types.rst index ad4d4d5eecb0..6be53201f91b 100644 --- a/Documentation/bpf/libbpf/program_types.rst +++ b/Documentation/bpf/libbpf/program_types.rst @@ -171,6 +171,8 @@ described in more detail in the footnotes. + +----------------------------------------+----------------------------------+-----------+ | | ``BPF_TRACE_RAW_TP`` | ``tp_btf+`` [#fentry]_ | | +-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_VNET_HASH`` | | ``vnet_hash`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ | ``BPF_PROG_TYPE_XDP`` | ``BPF_XDP_CPUMAP`` | ``xdp.frags/cpumap`` | | + + +----------------------------------+-----------+ | | | ``xdp/cpumap`` | | diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h index fc0d6f32c687..dec83d495e82 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -34,6 +34,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg, struct sk_msg_md, struct sk_msg) BPF_PROG_TYPE(BPF_PROG_TYPE_FLOW_DISSECTOR, flow_dissector, struct __sk_buff, struct bpf_flow_dissector) +BPF_PROG_TYPE(BPF_PROG_TYPE_VNET_HASH, vnet_hash, + struct __sk_buff, struct sk_buff) #endif #ifdef CONFIG_BPF_EVENTS BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe, diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 0448700890f7..298634556fab 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -988,6 +988,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_SK_LOOKUP, BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */ BPF_PROG_TYPE_NETFILTER, + BPF_PROG_TYPE_VNET_HASH, }; enum bpf_attach_type { @@ -6111,6 +6112,10 @@ struct __sk_buff { __u8 tstamp_type; __u32 :24; /* Padding, future use. */ __u64 hwtstamp; + + __u32 vnet_hash_value; + __u16 vnet_hash_report; + __u16 vnet_rss_queue; }; struct bpf_tunnel_key { diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index bb78212fa5b2..fd6d842635d2 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -14373,6 +14373,7 @@ static bool may_access_skb(enum bpf_prog_type type) case BPF_PROG_TYPE_SOCKET_FILTER: case BPF_PROG_TYPE_SCHED_CLS: case BPF_PROG_TYPE_SCHED_ACT: + case BPF_PROG_TYPE_VNET_HASH: return true; default: return false; @@ -16973,6 +16974,11 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env, return -EINVAL; } + if (prog_type == BPF_PROG_TYPE_VNET_HASH) { + verbose(env, "vnet hash progs cannot use bpf_spin_lock yet\n"); + return -EINVAL; + } + if (is_tracing_prog_type(prog_type)) { verbose(env, "tracing progs cannot use bpf_spin_lock yet\n"); return -EINVAL; diff --git a/net/core/filter.c b/net/core/filter.c index a094694899c9..867edbc628de 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -10967,6 +10967,17 @@ const struct bpf_prog_ops flow_dissector_prog_ops = { .test_run = bpf_prog_test_run_flow_dissector, }; +const struct bpf_verifier_ops vnet_hash_verifier_ops = { + .get_func_proto = sk_filter_func_proto, + .is_valid_access = sk_filter_is_valid_access, + .convert_ctx_access = bpf_convert_ctx_access, + .gen_ld_abs = bpf_gen_ld_abs, +}; + +const struct bpf_prog_ops vnet_hash_prog_ops = { + .test_run = bpf_prog_test_run_skb, +}; + int sk_detach_filter(struct sock *sk) { int ret = -ENOENT; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 0448700890f7..60976fe86247 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -988,6 +988,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_SK_LOOKUP, BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */ BPF_PROG_TYPE_NETFILTER, + BPF_PROG_TYPE_VNET_HASH, }; enum bpf_attach_type { diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 96ff1aa4bf6a..e74d136eae07 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -209,6 +209,7 @@ static const char * const prog_type_name[] = { [BPF_PROG_TYPE_SK_LOOKUP] = "sk_lookup", [BPF_PROG_TYPE_SYSCALL] = "syscall", [BPF_PROG_TYPE_NETFILTER] = "netfilter", + [BPF_PROG_TYPE_VNET_HASH] = "vnet_hash", }; static int __base_pr(enum libbpf_print_level level, const char *format, @@ -8858,6 +8859,7 @@ static const struct bpf_sec_def section_defs[] = { SEC_DEF("struct_ops.s+", STRUCT_OPS, 0, SEC_SLEEPABLE), SEC_DEF("sk_lookup", SK_LOOKUP, BPF_SK_LOOKUP, SEC_ATTACHABLE), SEC_DEF("netfilter", NETFILTER, BPF_NETFILTER, SEC_NONE), + SEC_DEF("vnet_hash", VNET_HASH, 0, SEC_NONE), }; int libbpf_register_prog_handler(const char *sec,
This new program type will be used by tun to determine the queues to deliver packets and the hash values and types reported with virtio-net headers. Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> --- Documentation/bpf/bpf_prog_run.rst | 1 + Documentation/bpf/libbpf/program_types.rst | 2 ++ include/linux/bpf_types.h | 2 ++ include/uapi/linux/bpf.h | 5 +++++ kernel/bpf/verifier.c | 6 ++++++ net/core/filter.c | 11 +++++++++++ tools/include/uapi/linux/bpf.h | 1 + tools/lib/bpf/libbpf.c | 2 ++ 8 files changed, 30 insertions(+)