mbox series

[RFC,v4,0/9] tun: Introduce virtio-net hashing feature

Message ID 20240924-rss-v4-0-84e932ec0e6c@daynix.com (mailing list archive)
Headers show
Series tun: Introduce virtio-net hashing feature | expand

Message

Akihiko Odaki Sept. 24, 2024, 9:01 a.m. UTC
virtio-net have two usage of hashes: one is RSS and another is hash
reporting. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.

Introduce the code to compute hashes to the kernel in order to overcome
thse challenges.

An alternative solution is to extend the eBPF steering program so that it
will be able to report to the userspace, but it is based on context
rewrites, which is in feature freeze. We can adopt kfuncs, but they will
not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
and vhost_net).

The patches for QEMU to use this new feature was submitted as RFC and
is available at:
https://patchew.org/QEMU/20240915-hash-v3-0-79cb08d28647@daynix.com/

This work was presented at LPC 2024:
https://lpc.events/event/18/contributions/1963/

V1 -> V2:
  Changed to introduce a new BPF program type.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
Changes in v4:
- Moved tun_vnet_hash_ext to if_tun.h.
- Renamed virtio_net_toeplitz() to virtio_net_toeplitz_calc().
- Replaced htons() with cpu_to_be16().
- Changed virtio_net_hash_rss() to return void.
- Reordered variable declarations in virtio_net_hash_rss().
- Removed virtio_net_hdr_v1_hash_from_skb().
- Updated messages of "tap: Pad virtio header with zero" and
  "tun: Pad virtio header with zero".
- Fixed vnet_hash allocation size.
- Ensured to free vnet_hash when destructing tun_struct.
- Link to v3: https://lore.kernel.org/r/20240915-rss-v3-0-c630015db082@daynix.com

Changes in v3:
- Reverted back to add ioctl.
- Split patch "tun: Introduce virtio-net hashing feature" into
  "tun: Introduce virtio-net hash reporting feature" and
  "tun: Introduce virtio-net RSS".
- Changed to reuse hash values computed for automq instead of performing
  RSS hashing when hash reporting is requested but RSS is not.
- Extracted relevant data from struct tun_struct to keep it minimal.
- Added kernel-doc.
- Changed to allow calling TUNGETVNETHASHCAP before TUNSETIFF.
- Initialized num_buffers with 1.
- Added a test case for unclassified packets.
- Fixed error handling in tests.
- Changed tests to verify that the queue index will not overflow.
- Rebased.
- Link to v2: https://lore.kernel.org/r/20231015141644.260646-1-akihiko.odaki@daynix.com

---
Akihiko Odaki (9):
      skbuff: Introduce SKB_EXT_TUN_VNET_HASH
      virtio_net: Add functions for hashing
      net: flow_dissector: Export flow_keys_dissector_symmetric
      tap: Pad virtio header with zero
      tun: Pad virtio header with zero
      tun: Introduce virtio-net hash reporting feature
      tun: Introduce virtio-net RSS
      selftest: tun: Add tests for virtio-net hashing
      vhost/net: Support VIRTIO_NET_F_HASH_REPORT

 Documentation/networking/tuntap.rst  |   7 +
 drivers/net/Kconfig                  |   1 +
 drivers/net/tap.c                    |   2 +-
 drivers/net/tun.c                    | 255 ++++++++++++--
 drivers/vhost/net.c                  |  16 +-
 include/linux/if_tun.h               |   5 +
 include/linux/skbuff.h               |   3 +
 include/linux/virtio_net.h           | 174 +++++++++
 include/net/flow_dissector.h         |   1 +
 include/uapi/linux/if_tun.h          |  71 ++++
 net/core/flow_dissector.c            |   3 +-
 net/core/skbuff.c                    |   4 +
 tools/testing/selftests/net/Makefile |   2 +-
 tools/testing/selftests/net/tun.c    | 666 ++++++++++++++++++++++++++++++++++-
 14 files changed, 1170 insertions(+), 40 deletions(-)
---
base-commit: 752ebcbe87aceeb6334e846a466116197711a982
change-id: 20240403-rss-e737d89efa77

Best regards,

Comments

Jason Wang Sept. 25, 2024, 3:30 a.m. UTC | #1
On Tue, Sep 24, 2024 at 5:01 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> virtio-net have two usage of hashes: one is RSS and another is hash
> reporting. Conventionally the hash calculation was done by the VMM.
> However, computing the hash after the queue was chosen defeats the
> purpose of RSS.
>
> Another approach is to use eBPF steering program. This approach has
> another downside: it cannot report the calculated hash due to the
> restrictive nature of eBPF.
>
> Introduce the code to compute hashes to the kernel in order to overcome
> thse challenges.
>
> An alternative solution is to extend the eBPF steering program so that it
> will be able to report to the userspace, but it is based on context
> rewrites, which is in feature freeze. We can adopt kfuncs, but they will
> not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
> and vhost_net).
>

I wonder if we could clone the skb and reuse some to store the hash,
then the steering eBPF program can access these fields without
introducing full RSS in the kernel?

Thanks
Akihiko Odaki Sept. 27, 2024, 2:11 a.m. UTC | #2
On 2024/09/25 12:30, Jason Wang wrote:
> On Tue, Sep 24, 2024 at 5:01 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>
>> virtio-net have two usage of hashes: one is RSS and another is hash
>> reporting. Conventionally the hash calculation was done by the VMM.
>> However, computing the hash after the queue was chosen defeats the
>> purpose of RSS.
>>
>> Another approach is to use eBPF steering program. This approach has
>> another downside: it cannot report the calculated hash due to the
>> restrictive nature of eBPF.
>>
>> Introduce the code to compute hashes to the kernel in order to overcome
>> thse challenges.
>>
>> An alternative solution is to extend the eBPF steering program so that it
>> will be able to report to the userspace, but it is based on context
>> rewrites, which is in feature freeze. We can adopt kfuncs, but they will
>> not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
>> and vhost_net).
>>
> 
> I wonder if we could clone the skb and reuse some to store the hash,
> then the steering eBPF program can access these fields without
> introducing full RSS in the kernel?

I don't get how cloning the skb can solve the issue.

We can certainly implement Toeplitz function in the kernel or even with 
tc-bpf to store a hash value that can be used for eBPF steering program 
and virtio hash reporting. However we don't have a means of storing a 
hash type, which is specific to virtio hash reporting and lacks a 
corresponding skb field.

Regards,
Akihiko Odaki
Jason Wang Sept. 27, 2024, 4:31 a.m. UTC | #3
On Fri, Sep 27, 2024 at 10:11 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> On 2024/09/25 12:30, Jason Wang wrote:
> > On Tue, Sep 24, 2024 at 5:01 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>
> >> virtio-net have two usage of hashes: one is RSS and another is hash
> >> reporting. Conventionally the hash calculation was done by the VMM.
> >> However, computing the hash after the queue was chosen defeats the
> >> purpose of RSS.
> >>
> >> Another approach is to use eBPF steering program. This approach has
> >> another downside: it cannot report the calculated hash due to the
> >> restrictive nature of eBPF.
> >>
> >> Introduce the code to compute hashes to the kernel in order to overcome
> >> thse challenges.
> >>
> >> An alternative solution is to extend the eBPF steering program so that it
> >> will be able to report to the userspace, but it is based on context
> >> rewrites, which is in feature freeze. We can adopt kfuncs, but they will
> >> not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
> >> and vhost_net).
> >>
> >
> > I wonder if we could clone the skb and reuse some to store the hash,
> > then the steering eBPF program can access these fields without
> > introducing full RSS in the kernel?
>
> I don't get how cloning the skb can solve the issue.
>
> We can certainly implement Toeplitz function in the kernel or even with
> tc-bpf to store a hash value that can be used for eBPF steering program
> and virtio hash reporting. However we don't have a means of storing a
> hash type, which is specific to virtio hash reporting and lacks a
> corresponding skb field.

I may miss something but looking at sk_filter_is_valid_access(). It
looks to me we can make use of skb->cb[0..4]?

Thanks

>
> Regards,
> Akihiko Odaki
>
Akihiko Odaki Sept. 27, 2024, 7:50 a.m. UTC | #4
On 2024/09/27 13:31, Jason Wang wrote:
> On Fri, Sep 27, 2024 at 10:11 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>
>> On 2024/09/25 12:30, Jason Wang wrote:
>>> On Tue, Sep 24, 2024 at 5:01 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>
>>>> virtio-net have two usage of hashes: one is RSS and another is hash
>>>> reporting. Conventionally the hash calculation was done by the VMM.
>>>> However, computing the hash after the queue was chosen defeats the
>>>> purpose of RSS.
>>>>
>>>> Another approach is to use eBPF steering program. This approach has
>>>> another downside: it cannot report the calculated hash due to the
>>>> restrictive nature of eBPF.
>>>>
>>>> Introduce the code to compute hashes to the kernel in order to overcome
>>>> thse challenges.
>>>>
>>>> An alternative solution is to extend the eBPF steering program so that it
>>>> will be able to report to the userspace, but it is based on context
>>>> rewrites, which is in feature freeze. We can adopt kfuncs, but they will
>>>> not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
>>>> and vhost_net).
>>>>
>>>
>>> I wonder if we could clone the skb and reuse some to store the hash,
>>> then the steering eBPF program can access these fields without
>>> introducing full RSS in the kernel?
>>
>> I don't get how cloning the skb can solve the issue.
>>
>> We can certainly implement Toeplitz function in the kernel or even with
>> tc-bpf to store a hash value that can be used for eBPF steering program
>> and virtio hash reporting. However we don't have a means of storing a
>> hash type, which is specific to virtio hash reporting and lacks a
>> corresponding skb field.
> 
> I may miss something but looking at sk_filter_is_valid_access(). It
> looks to me we can make use of skb->cb[0..4]?

I didn't opt to using cb. Below is the rationale:

cb is for tail call so it means we reuse the field for a different 
purpose. The context rewrite allows adding a field without increasing 
the size of the underlying storage (the real sk_buff) so we should add a 
new field instead of reusing an existing field to avoid confusion.

We are however no longer allowed to add a new field. In my 
understanding, this is because it is an UAPI, and eBPF maintainers found 
it is difficult to maintain its stability.

Reusing cb for hash reporting is a workaround to avoid having a new 
field, but it does not solve the underlying problem (i.e., keeping eBPF 
as stable as UAPI is unreasonably hard). In my opinion, adding an ioctl 
is a reasonable option to keep the API as stable as other virtualization 
UAPIs while respecting the underlying intention of the context rewrite 
feature freeze.

Regards,
Akihiko Odaki
Jason Wang Sept. 29, 2024, 2:07 a.m. UTC | #5
On Fri, Sep 27, 2024 at 3:51 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> On 2024/09/27 13:31, Jason Wang wrote:
> > On Fri, Sep 27, 2024 at 10:11 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>
> >> On 2024/09/25 12:30, Jason Wang wrote:
> >>> On Tue, Sep 24, 2024 at 5:01 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>>>
> >>>> virtio-net have two usage of hashes: one is RSS and another is hash
> >>>> reporting. Conventionally the hash calculation was done by the VMM.
> >>>> However, computing the hash after the queue was chosen defeats the
> >>>> purpose of RSS.
> >>>>
> >>>> Another approach is to use eBPF steering program. This approach has
> >>>> another downside: it cannot report the calculated hash due to the
> >>>> restrictive nature of eBPF.
> >>>>
> >>>> Introduce the code to compute hashes to the kernel in order to overcome
> >>>> thse challenges.
> >>>>
> >>>> An alternative solution is to extend the eBPF steering program so that it
> >>>> will be able to report to the userspace, but it is based on context
> >>>> rewrites, which is in feature freeze. We can adopt kfuncs, but they will
> >>>> not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
> >>>> and vhost_net).
> >>>>
> >>>
> >>> I wonder if we could clone the skb and reuse some to store the hash,
> >>> then the steering eBPF program can access these fields without
> >>> introducing full RSS in the kernel?
> >>
> >> I don't get how cloning the skb can solve the issue.
> >>
> >> We can certainly implement Toeplitz function in the kernel or even with
> >> tc-bpf to store a hash value that can be used for eBPF steering program
> >> and virtio hash reporting. However we don't have a means of storing a
> >> hash type, which is specific to virtio hash reporting and lacks a
> >> corresponding skb field.
> >
> > I may miss something but looking at sk_filter_is_valid_access(). It
> > looks to me we can make use of skb->cb[0..4]?
>
> I didn't opt to using cb. Below is the rationale:
>
> cb is for tail call so it means we reuse the field for a different
> purpose. The context rewrite allows adding a field without increasing
> the size of the underlying storage (the real sk_buff) so we should add a
> new field instead of reusing an existing field to avoid confusion.
>
> We are however no longer allowed to add a new field. In my
> understanding, this is because it is an UAPI, and eBPF maintainers found
> it is difficult to maintain its stability.
>
> Reusing cb for hash reporting is a workaround to avoid having a new
> field, but it does not solve the underlying problem (i.e., keeping eBPF
> as stable as UAPI is unreasonably hard). In my opinion, adding an ioctl
> is a reasonable option to keep the API as stable as other virtualization
> UAPIs while respecting the underlying intention of the context rewrite
> feature freeze.

Fair enough.

Btw, I remember DPDK implements tuntap RSS via eBPF as well (probably
via cls or other). It might worth to see if anything we miss here.

Thanks

>
> Regards,
> Akihiko Odaki
>
Akihiko Odaki Sept. 29, 2024, 7:10 a.m. UTC | #6
On 2024/09/29 11:07, Jason Wang wrote:
> On Fri, Sep 27, 2024 at 3:51 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>
>> On 2024/09/27 13:31, Jason Wang wrote:
>>> On Fri, Sep 27, 2024 at 10:11 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>
>>>> On 2024/09/25 12:30, Jason Wang wrote:
>>>>> On Tue, Sep 24, 2024 at 5:01 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>>>
>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash
>>>>>> reporting. Conventionally the hash calculation was done by the VMM.
>>>>>> However, computing the hash after the queue was chosen defeats the
>>>>>> purpose of RSS.
>>>>>>
>>>>>> Another approach is to use eBPF steering program. This approach has
>>>>>> another downside: it cannot report the calculated hash due to the
>>>>>> restrictive nature of eBPF.
>>>>>>
>>>>>> Introduce the code to compute hashes to the kernel in order to overcome
>>>>>> thse challenges.
>>>>>>
>>>>>> An alternative solution is to extend the eBPF steering program so that it
>>>>>> will be able to report to the userspace, but it is based on context
>>>>>> rewrites, which is in feature freeze. We can adopt kfuncs, but they will
>>>>>> not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
>>>>>> and vhost_net).
>>>>>>
>>>>>
>>>>> I wonder if we could clone the skb and reuse some to store the hash,
>>>>> then the steering eBPF program can access these fields without
>>>>> introducing full RSS in the kernel?
>>>>
>>>> I don't get how cloning the skb can solve the issue.
>>>>
>>>> We can certainly implement Toeplitz function in the kernel or even with
>>>> tc-bpf to store a hash value that can be used for eBPF steering program
>>>> and virtio hash reporting. However we don't have a means of storing a
>>>> hash type, which is specific to virtio hash reporting and lacks a
>>>> corresponding skb field.
>>>
>>> I may miss something but looking at sk_filter_is_valid_access(). It
>>> looks to me we can make use of skb->cb[0..4]?
>>
>> I didn't opt to using cb. Below is the rationale:
>>
>> cb is for tail call so it means we reuse the field for a different
>> purpose. The context rewrite allows adding a field without increasing
>> the size of the underlying storage (the real sk_buff) so we should add a
>> new field instead of reusing an existing field to avoid confusion.
>>
>> We are however no longer allowed to add a new field. In my
>> understanding, this is because it is an UAPI, and eBPF maintainers found
>> it is difficult to maintain its stability.
>>
>> Reusing cb for hash reporting is a workaround to avoid having a new
>> field, but it does not solve the underlying problem (i.e., keeping eBPF
>> as stable as UAPI is unreasonably hard). In my opinion, adding an ioctl
>> is a reasonable option to keep the API as stable as other virtualization
>> UAPIs while respecting the underlying intention of the context rewrite
>> feature freeze.
> 
> Fair enough.
> 
> Btw, I remember DPDK implements tuntap RSS via eBPF as well (probably
> via cls or other). It might worth to see if anything we miss here.

Thanks for the information. I wonder why they used cls instead of 
steering program. Perhaps it may be due to compatibility with macvtap 
and ipvtap, which don't steering program.

Their RSS implementation looks cleaner so I will improve my RSS 
implementation accordingly.

Regards,
Akihiko Odaki
Stephen Hemminger Sept. 29, 2024, 3:33 p.m. UTC | #7
On Sun, 29 Sep 2024 16:10:47 +0900
Akihiko Odaki <akihiko.odaki@daynix.com> wrote:

> On 2024/09/29 11:07, Jason Wang wrote:
> > On Fri, Sep 27, 2024 at 3:51 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:  
> >>
> >> On 2024/09/27 13:31, Jason Wang wrote:  
> >>> On Fri, Sep 27, 2024 at 10:11 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:  
> >>>>
> >>>> On 2024/09/25 12:30, Jason Wang wrote:  
> >>>>> On Tue, Sep 24, 2024 at 5:01 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:  
> >>>>>>
> >>>>>> virtio-net have two usage of hashes: one is RSS and another is hash
> >>>>>> reporting. Conventionally the hash calculation was done by the VMM.
> >>>>>> However, computing the hash after the queue was chosen defeats the
> >>>>>> purpose of RSS.
> >>>>>>
> >>>>>> Another approach is to use eBPF steering program. This approach has
> >>>>>> another downside: it cannot report the calculated hash due to the
> >>>>>> restrictive nature of eBPF.
> >>>>>>
> >>>>>> Introduce the code to compute hashes to the kernel in order to overcome
> >>>>>> thse challenges.
> >>>>>>
> >>>>>> An alternative solution is to extend the eBPF steering program so that it
> >>>>>> will be able to report to the userspace, but it is based on context
> >>>>>> rewrites, which is in feature freeze. We can adopt kfuncs, but they will
> >>>>>> not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
> >>>>>> and vhost_net).
> >>>>>>  
> >>>>>
> >>>>> I wonder if we could clone the skb and reuse some to store the hash,
> >>>>> then the steering eBPF program can access these fields without
> >>>>> introducing full RSS in the kernel?  
> >>>>
> >>>> I don't get how cloning the skb can solve the issue.
> >>>>
> >>>> We can certainly implement Toeplitz function in the kernel or even with
> >>>> tc-bpf to store a hash value that can be used for eBPF steering program
> >>>> and virtio hash reporting. However we don't have a means of storing a
> >>>> hash type, which is specific to virtio hash reporting and lacks a
> >>>> corresponding skb field.  
> >>>
> >>> I may miss something but looking at sk_filter_is_valid_access(). It
> >>> looks to me we can make use of skb->cb[0..4]?  
> >>
> >> I didn't opt to using cb. Below is the rationale:
> >>
> >> cb is for tail call so it means we reuse the field for a different
> >> purpose. The context rewrite allows adding a field without increasing
> >> the size of the underlying storage (the real sk_buff) so we should add a
> >> new field instead of reusing an existing field to avoid confusion.
> >>
> >> We are however no longer allowed to add a new field. In my
> >> understanding, this is because it is an UAPI, and eBPF maintainers found
> >> it is difficult to maintain its stability.
> >>
> >> Reusing cb for hash reporting is a workaround to avoid having a new
> >> field, but it does not solve the underlying problem (i.e., keeping eBPF
> >> as stable as UAPI is unreasonably hard). In my opinion, adding an ioctl
> >> is a reasonable option to keep the API as stable as other virtualization
> >> UAPIs while respecting the underlying intention of the context rewrite
> >> feature freeze.  
> > 
> > Fair enough.
> > 
> > Btw, I remember DPDK implements tuntap RSS via eBPF as well (probably
> > via cls or other). It might worth to see if anything we miss here.  
> 
> Thanks for the information. I wonder why they used cls instead of 
> steering program. Perhaps it may be due to compatibility with macvtap 
> and ipvtap, which don't steering program.
> 
> Their RSS implementation looks cleaner so I will improve my RSS 
> implementation accordingly.
> 

DPDK needs to support flow rules. The specific case is where packets
are classified by a flow, then RSS is done across a subset of the queues.
The support for flow in TUN driver is more academic than useful,
I fixed it for current BPF, but doubt anyone is using it really.

A full steering program would be good, but would require much more
complexity to take a general set of flow rules then communicate that
to the steering program.
Akihiko Odaki Oct. 1, 2024, 5:54 a.m. UTC | #8
On 2024/09/30 0:33, Stephen Hemminger wrote:
> On Sun, 29 Sep 2024 16:10:47 +0900
> Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> 
>> On 2024/09/29 11:07, Jason Wang wrote:
>>> On Fri, Sep 27, 2024 at 3:51 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>
>>>> On 2024/09/27 13:31, Jason Wang wrote:
>>>>> On Fri, Sep 27, 2024 at 10:11 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>>>
>>>>>> On 2024/09/25 12:30, Jason Wang wrote:
>>>>>>> On Tue, Sep 24, 2024 at 5:01 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>>>>>
>>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash
>>>>>>>> reporting. Conventionally the hash calculation was done by the VMM.
>>>>>>>> However, computing the hash after the queue was chosen defeats the
>>>>>>>> purpose of RSS.
>>>>>>>>
>>>>>>>> Another approach is to use eBPF steering program. This approach has
>>>>>>>> another downside: it cannot report the calculated hash due to the
>>>>>>>> restrictive nature of eBPF.
>>>>>>>>
>>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome
>>>>>>>> thse challenges.
>>>>>>>>
>>>>>>>> An alternative solution is to extend the eBPF steering program so that it
>>>>>>>> will be able to report to the userspace, but it is based on context
>>>>>>>> rewrites, which is in feature freeze. We can adopt kfuncs, but they will
>>>>>>>> not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
>>>>>>>> and vhost_net).
>>>>>>>>   
>>>>>>>
>>>>>>> I wonder if we could clone the skb and reuse some to store the hash,
>>>>>>> then the steering eBPF program can access these fields without
>>>>>>> introducing full RSS in the kernel?
>>>>>>
>>>>>> I don't get how cloning the skb can solve the issue.
>>>>>>
>>>>>> We can certainly implement Toeplitz function in the kernel or even with
>>>>>> tc-bpf to store a hash value that can be used for eBPF steering program
>>>>>> and virtio hash reporting. However we don't have a means of storing a
>>>>>> hash type, which is specific to virtio hash reporting and lacks a
>>>>>> corresponding skb field.
>>>>>
>>>>> I may miss something but looking at sk_filter_is_valid_access(). It
>>>>> looks to me we can make use of skb->cb[0..4]?
>>>>
>>>> I didn't opt to using cb. Below is the rationale:
>>>>
>>>> cb is for tail call so it means we reuse the field for a different
>>>> purpose. The context rewrite allows adding a field without increasing
>>>> the size of the underlying storage (the real sk_buff) so we should add a
>>>> new field instead of reusing an existing field to avoid confusion.
>>>>
>>>> We are however no longer allowed to add a new field. In my
>>>> understanding, this is because it is an UAPI, and eBPF maintainers found
>>>> it is difficult to maintain its stability.
>>>>
>>>> Reusing cb for hash reporting is a workaround to avoid having a new
>>>> field, but it does not solve the underlying problem (i.e., keeping eBPF
>>>> as stable as UAPI is unreasonably hard). In my opinion, adding an ioctl
>>>> is a reasonable option to keep the API as stable as other virtualization
>>>> UAPIs while respecting the underlying intention of the context rewrite
>>>> feature freeze.
>>>
>>> Fair enough.
>>>
>>> Btw, I remember DPDK implements tuntap RSS via eBPF as well (probably
>>> via cls or other). It might worth to see if anything we miss here.
>>
>> Thanks for the information. I wonder why they used cls instead of
>> steering program. Perhaps it may be due to compatibility with macvtap
>> and ipvtap, which don't steering program.
>>
>> Their RSS implementation looks cleaner so I will improve my RSS
>> implementation accordingly.
>>
> 
> DPDK needs to support flow rules. The specific case is where packets
> are classified by a flow, then RSS is done across a subset of the queues.
> The support for flow in TUN driver is more academic than useful,
> I fixed it for current BPF, but doubt anyone is using it really.
> 
> A full steering program would be good, but would require much more
> complexity to take a general set of flow rules then communicate that
> to the steering program.
> 

It reminded me of RSS context and flow filter. Some physical NICs 
support to use a dedicated RSS context for packets matched with flow 
filter, and virtio is also gaining corresponding features.

RSS context: https://github.com/oasis-tcs/virtio-spec/issues/178
Flow filter: https://github.com/oasis-tcs/virtio-spec/issues/179

I considered about the possibility of supporting these features with tc 
instead of adding ioctls to tuntap, but it seems not appropriate for 
virtualization use case.

In a virtualization use case, tuntap is configured according to requests 
of guests, and the code processing these requests need to have minimal 
permissions for security. This goal is achieved by passing a file 
descriptor that represents a tuntap from a privileged process (e.g., 
libvirt) to the process handling guest requests (e.g., QEMU).

However, tc is configured with rtnetlink, which does not seem to have an 
interface to delegate a permission for one particular device to another 
process.

For now I'll continue working on the current approach that is based on 
ioctl and lacks RSS context and flow filter features. Eventually they 
are also likely to require new ioctls if they are to be supported with 
vhost_net.

Regards,
Akihiko Odaki
Stephen Hemminger Oct. 1, 2024, 4:31 p.m. UTC | #9
On Tue, 1 Oct 2024 14:54:29 +0900
Akihiko Odaki <akihiko.odaki@daynix.com> wrote:

> On 2024/09/30 0:33, Stephen Hemminger wrote:
> > On Sun, 29 Sep 2024 16:10:47 +0900
> > Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >   
> >> On 2024/09/29 11:07, Jason Wang wrote:  
> >>> On Fri, Sep 27, 2024 at 3:51 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:  
> >>>>
> >>>> On 2024/09/27 13:31, Jason Wang wrote:  
> >>>>> On Fri, Sep 27, 2024 at 10:11 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:  
> >>>>>>
> >>>>>> On 2024/09/25 12:30, Jason Wang wrote:  
> >>>>>>> On Tue, Sep 24, 2024 at 5:01 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:  
> >>>>>>>>
> >>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash
> >>>>>>>> reporting. Conventionally the hash calculation was done by the VMM.
> >>>>>>>> However, computing the hash after the queue was chosen defeats the
> >>>>>>>> purpose of RSS.
> >>>>>>>>
> >>>>>>>> Another approach is to use eBPF steering program. This approach has
> >>>>>>>> another downside: it cannot report the calculated hash due to the
> >>>>>>>> restrictive nature of eBPF.
> >>>>>>>>
> >>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome
> >>>>>>>> thse challenges.
> >>>>>>>>
> >>>>>>>> An alternative solution is to extend the eBPF steering program so that it
> >>>>>>>> will be able to report to the userspace, but it is based on context
> >>>>>>>> rewrites, which is in feature freeze. We can adopt kfuncs, but they will
> >>>>>>>> not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
> >>>>>>>> and vhost_net).
> >>>>>>>>     
> >>>>>>>
> >>>>>>> I wonder if we could clone the skb and reuse some to store the hash,
> >>>>>>> then the steering eBPF program can access these fields without
> >>>>>>> introducing full RSS in the kernel?  
> >>>>>>
> >>>>>> I don't get how cloning the skb can solve the issue.
> >>>>>>
> >>>>>> We can certainly implement Toeplitz function in the kernel or even with
> >>>>>> tc-bpf to store a hash value that can be used for eBPF steering program
> >>>>>> and virtio hash reporting. However we don't have a means of storing a
> >>>>>> hash type, which is specific to virtio hash reporting and lacks a
> >>>>>> corresponding skb field.  
> >>>>>
> >>>>> I may miss something but looking at sk_filter_is_valid_access(). It
> >>>>> looks to me we can make use of skb->cb[0..4]?  
> >>>>
> >>>> I didn't opt to using cb. Below is the rationale:
> >>>>
> >>>> cb is for tail call so it means we reuse the field for a different
> >>>> purpose. The context rewrite allows adding a field without increasing
> >>>> the size of the underlying storage (the real sk_buff) so we should add a
> >>>> new field instead of reusing an existing field to avoid confusion.
> >>>>
> >>>> We are however no longer allowed to add a new field. In my
> >>>> understanding, this is because it is an UAPI, and eBPF maintainers found
> >>>> it is difficult to maintain its stability.
> >>>>
> >>>> Reusing cb for hash reporting is a workaround to avoid having a new
> >>>> field, but it does not solve the underlying problem (i.e., keeping eBPF
> >>>> as stable as UAPI is unreasonably hard). In my opinion, adding an ioctl
> >>>> is a reasonable option to keep the API as stable as other virtualization
> >>>> UAPIs while respecting the underlying intention of the context rewrite
> >>>> feature freeze.  
> >>>
> >>> Fair enough.
> >>>
> >>> Btw, I remember DPDK implements tuntap RSS via eBPF as well (probably
> >>> via cls or other). It might worth to see if anything we miss here.  
> >>
> >> Thanks for the information. I wonder why they used cls instead of
> >> steering program. Perhaps it may be due to compatibility with macvtap
> >> and ipvtap, which don't steering program.
> >>
> >> Their RSS implementation looks cleaner so I will improve my RSS
> >> implementation accordingly.
> >>  
> > 
> > DPDK needs to support flow rules. The specific case is where packets
> > are classified by a flow, then RSS is done across a subset of the queues.
> > The support for flow in TUN driver is more academic than useful,
> > I fixed it for current BPF, but doubt anyone is using it really.
> > 
> > A full steering program would be good, but would require much more
> > complexity to take a general set of flow rules then communicate that
> > to the steering program.
> >   
> 
> It reminded me of RSS context and flow filter. Some physical NICs 
> support to use a dedicated RSS context for packets matched with flow 
> filter, and virtio is also gaining corresponding features.
> 
> RSS context: https://github.com/oasis-tcs/virtio-spec/issues/178
> Flow filter: https://github.com/oasis-tcs/virtio-spec/issues/179
> 
> I considered about the possibility of supporting these features with tc 
> instead of adding ioctls to tuntap, but it seems not appropriate for 
> virtualization use case.
> 
> In a virtualization use case, tuntap is configured according to requests 
> of guests, and the code processing these requests need to have minimal 
> permissions for security. This goal is achieved by passing a file 
> descriptor that represents a tuntap from a privileged process (e.g., 
> libvirt) to the process handling guest requests (e.g., QEMU).
> 
> However, tc is configured with rtnetlink, which does not seem to have an 
> interface to delegate a permission for one particular device to another 
> process.
> 
> For now I'll continue working on the current approach that is based on 
> ioctl and lacks RSS context and flow filter features. Eventually they 
> are also likely to require new ioctls if they are to be supported with 
> vhost_net.

The DPDK flow handling (rte_flow) was started by Mellanox and many of
the features are to support what that NIC can do. Would be good to have
a tc way to configure that (or devlink).
Akihiko Odaki Oct. 2, 2024, 5:26 a.m. UTC | #10
On 2024/10/02 1:31, Stephen Hemminger wrote:
> On Tue, 1 Oct 2024 14:54:29 +0900
> Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> 
>> On 2024/09/30 0:33, Stephen Hemminger wrote:
>>> On Sun, 29 Sep 2024 16:10:47 +0900
>>> Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>    
>>>> On 2024/09/29 11:07, Jason Wang wrote:
>>>>> On Fri, Sep 27, 2024 at 3:51 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>>>
>>>>>> On 2024/09/27 13:31, Jason Wang wrote:
>>>>>>> On Fri, Sep 27, 2024 at 10:11 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>>>>>
>>>>>>>> On 2024/09/25 12:30, Jason Wang wrote:
>>>>>>>>> On Tue, Sep 24, 2024 at 5:01 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>>>>>>>
>>>>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash
>>>>>>>>>> reporting. Conventionally the hash calculation was done by the VMM.
>>>>>>>>>> However, computing the hash after the queue was chosen defeats the
>>>>>>>>>> purpose of RSS.
>>>>>>>>>>
>>>>>>>>>> Another approach is to use eBPF steering program. This approach has
>>>>>>>>>> another downside: it cannot report the calculated hash due to the
>>>>>>>>>> restrictive nature of eBPF.
>>>>>>>>>>
>>>>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome
>>>>>>>>>> thse challenges.
>>>>>>>>>>
>>>>>>>>>> An alternative solution is to extend the eBPF steering program so that it
>>>>>>>>>> will be able to report to the userspace, but it is based on context
>>>>>>>>>> rewrites, which is in feature freeze. We can adopt kfuncs, but they will
>>>>>>>>>> not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
>>>>>>>>>> and vhost_net).
>>>>>>>>>>      
>>>>>>>>>
>>>>>>>>> I wonder if we could clone the skb and reuse some to store the hash,
>>>>>>>>> then the steering eBPF program can access these fields without
>>>>>>>>> introducing full RSS in the kernel?
>>>>>>>>
>>>>>>>> I don't get how cloning the skb can solve the issue.
>>>>>>>>
>>>>>>>> We can certainly implement Toeplitz function in the kernel or even with
>>>>>>>> tc-bpf to store a hash value that can be used for eBPF steering program
>>>>>>>> and virtio hash reporting. However we don't have a means of storing a
>>>>>>>> hash type, which is specific to virtio hash reporting and lacks a
>>>>>>>> corresponding skb field.
>>>>>>>
>>>>>>> I may miss something but looking at sk_filter_is_valid_access(). It
>>>>>>> looks to me we can make use of skb->cb[0..4]?
>>>>>>
>>>>>> I didn't opt to using cb. Below is the rationale:
>>>>>>
>>>>>> cb is for tail call so it means we reuse the field for a different
>>>>>> purpose. The context rewrite allows adding a field without increasing
>>>>>> the size of the underlying storage (the real sk_buff) so we should add a
>>>>>> new field instead of reusing an existing field to avoid confusion.
>>>>>>
>>>>>> We are however no longer allowed to add a new field. In my
>>>>>> understanding, this is because it is an UAPI, and eBPF maintainers found
>>>>>> it is difficult to maintain its stability.
>>>>>>
>>>>>> Reusing cb for hash reporting is a workaround to avoid having a new
>>>>>> field, but it does not solve the underlying problem (i.e., keeping eBPF
>>>>>> as stable as UAPI is unreasonably hard). In my opinion, adding an ioctl
>>>>>> is a reasonable option to keep the API as stable as other virtualization
>>>>>> UAPIs while respecting the underlying intention of the context rewrite
>>>>>> feature freeze.
>>>>>
>>>>> Fair enough.
>>>>>
>>>>> Btw, I remember DPDK implements tuntap RSS via eBPF as well (probably
>>>>> via cls or other). It might worth to see if anything we miss here.
>>>>
>>>> Thanks for the information. I wonder why they used cls instead of
>>>> steering program. Perhaps it may be due to compatibility with macvtap
>>>> and ipvtap, which don't steering program.
>>>>
>>>> Their RSS implementation looks cleaner so I will improve my RSS
>>>> implementation accordingly.
>>>>   
>>>
>>> DPDK needs to support flow rules. The specific case is where packets
>>> are classified by a flow, then RSS is done across a subset of the queues.
>>> The support for flow in TUN driver is more academic than useful,
>>> I fixed it for current BPF, but doubt anyone is using it really.
>>>
>>> A full steering program would be good, but would require much more
>>> complexity to take a general set of flow rules then communicate that
>>> to the steering program.
>>>    
>>
>> It reminded me of RSS context and flow filter. Some physical NICs
>> support to use a dedicated RSS context for packets matched with flow
>> filter, and virtio is also gaining corresponding features.
>>
>> RSS context: https://github.com/oasis-tcs/virtio-spec/issues/178
>> Flow filter: https://github.com/oasis-tcs/virtio-spec/issues/179
>>
>> I considered about the possibility of supporting these features with tc
>> instead of adding ioctls to tuntap, but it seems not appropriate for
>> virtualization use case.
>>
>> In a virtualization use case, tuntap is configured according to requests
>> of guests, and the code processing these requests need to have minimal
>> permissions for security. This goal is achieved by passing a file
>> descriptor that represents a tuntap from a privileged process (e.g.,
>> libvirt) to the process handling guest requests (e.g., QEMU).
>>
>> However, tc is configured with rtnetlink, which does not seem to have an
>> interface to delegate a permission for one particular device to another
>> process.
>>
>> For now I'll continue working on the current approach that is based on
>> ioctl and lacks RSS context and flow filter features. Eventually they
>> are also likely to require new ioctls if they are to be supported with
>> vhost_net.
> 
> The DPDK flow handling (rte_flow) was started by Mellanox and many of
> the features are to support what that NIC can do. Would be good to have
> a tc way to configure that (or devlink).

Yes, but I would rather only implement the ioctl without flow handling 
for now. My purpose of implementing RSS in the kernel is to report hash 
values to the guest that has its own network stack in the virtualization 
context. tc-bpf would be suffice for DPDK, which does not have such a 
requirement.

Having an access to the in-kernel RSS implementation also saves the 
trouble of implementing an eBPF program for RSS, but DPDK already does 
have such a program so it makes little difference. There may be also 
performance improvement because I'm optimizing the in-kernel RSS 
implementation with ffs(), which is currently not available to eBPF, but 
there is also a proposal to expose ffs() to eBPF*.

For now, I will keep the patch series small by having only the ioctl 
interface. Anyone who finds the feature useful for tc can take it and 
add a tc interface in the future.

Regards,
Akihiko Odaki

* https://lore.kernel.org/bpf/20240131155607.51157-1-hffilwlqm@gmail.com/#t