
[RFC,net-next,00/18] virtio_net XDP offload

Message ID 20191126100744.5083-1-prashantbhole.linux@gmail.com

Message

Prashant Bhole Nov. 26, 2019, 10:07 a.m. UTC
Note: This RFC has been sent to netdev as well as qemu-devel lists

This series introduces XDP offloading from virtio_net. It is based on
the following work by Jason Wang:
https://netdevconf.info/0x13/session.html?xdp-offload-with-virtio-net

Current XDP performance in virtio-net is far from what we can achieve
on the host. Several major factors cause the difference:
- Cost of virtualization
- Cost of virtio (populating the virtqueue and context switching)
- Cost of vhost, which needs more optimization
- Cost of data copy
For these reasons there is a need to offload the XDP program to the
host. This set is an attempt to implement XDP offload from the guest.


* High level design:

virtio_net exposes itself as an offload-capable device and works as a
transport for the commands that load the program on the host. When
offload is requested, it sends the program to Qemu. Qemu then loads
the program and attaches it to the corresponding tap device.
Similarly, virtio_net sends control commands to create and control
maps. The tap device runs the XDP prog in its Tx path. The fast
datapath remains on the host, whereas the slow path, in which the user
program reads/updates map values, remains in the guest.

When offloading to actual hardware, the program needs to be translated
and JITed for the target hardware. In the case of offloading from the
guest we pass an almost raw program to the host, and the verifier on
the host verifies the offloaded program.


* Implementation in Kernel


virtio_net
==========
Creates a bpf offload device and registers itself as an
offload-capable device. It also implements bpf_map_dev_ops to handle
offloaded maps. A new command structure is defined for communicating
with qemu.
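
A minimal registration sketch (the callback and field names here are
illustrative, not necessarily those used in the series; 'setup' is the
new hook this set introduces, described under "Program offload"):

/* Register virtio-net as a BPF offload-capable device. */
static const struct bpf_prog_offload_ops virtnet_bpf_dev_ops = {
        .setup     = virtnet_bpf_verifier_setup, /* new callback, this set */
        .prepare   = virtnet_bpf_verifier_prep,
        .translate = virtnet_bpf_translate,
        .destroy   = virtnet_bpf_destroy,
};

static int virtnet_bpf_init(struct virtnet_info *vi)
{
        /* vi->bpf_dev is an assumed field holding the offload device. */
        vi->bpf_dev = bpf_offload_dev_create(&virtnet_bpf_dev_ops, NULL);
        if (IS_ERR(vi->bpf_dev))
                return PTR_ERR(vi->bpf_dev);

        return bpf_offload_dev_netdev_register(vi->bpf_dev, vi->dev);
}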

Map offload:
- In the offload sequence maps are always offloaded before the
  program. In the map offloading stage, virtio_net sends control
  commands to qemu to create a map and return a map fd which is valid
  on the host. This fd is stored in a driver-specific map structure. A
  list of such maps is maintained (a sketch of this bookkeeping
  follows this list).

- Currently BPF_MAP_TYPE_ARRAY and BPF_MAP_TYPE_HASH are supported.
  Offloading a per-CPU array from guest to host doesn't make sense.
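
A minimal sketch of the driver-side map state described above (struct
and field names are hypothetical; struct bpf_offloaded_map and
struct bpf_map_dev_ops are the existing kernel types):

/* One offloaded map; linked into a per-device list. */
struct virtnet_bpf_map {
        struct bpf_offloaded_map *offmap; /* guest-visible offloaded map */
        u32 host_map_fd;                  /* fd returned by qemu, valid on host */
        struct list_head list;
};

/* Guest map accesses are forwarded to qemu as control commands. */
static const struct bpf_map_dev_ops virtnet_bpf_map_ops = {
        .map_get_next_key = virtnet_bpf_map_get_next_key,
        .map_lookup_elem  = virtnet_bpf_map_lookup_elem,
        .map_update_elem  = virtnet_bpf_map_update_elem,
        .map_delete_elem  = virtnet_bpf_map_delete_elem,
};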

Program offload:
- In general, the verifier in the guest replaces map fds in the
  user-submitted program with map pointers, and then the
  bpf_prog_offload_ops callbacks are called.

- This set introduces a new program offload callback, 'setup()', which
  the verifier calls before replacing map fds with map pointers. This
  way virtio_net can create a copy of the program that still carries
  the guest map fds. This is needed because virtio_net wants to derive
  driver-specific map data from the guest map fds. Each guest map fd
  is then replaced with the corresponding host map fd in the copy of
  the program, so the copy that is finally submitted to the host has
  valid host map fds.

- Alternatively, if the prep() call in the verifier could be moved to
  before the map fd replacement happens, there would be no need to
  introduce the 'setup()' callback.

- As per the current implementation of the 'setup()' callback in
  virtio_net, it checks the full program for allowed helper functions
  and performs the map fd replacement mentioned above (see the sketch
  after this list).

- A list of allowed helper functions is maintained; it is currently
  experimental and will be updated later as needed. Using this list we
  can filter out most non-XDP-type programs to some extent. We also
  prevent the guest from collecting host-specific information by
  disallowing certain helper calls.

- XDP_PROG_SETUP_HW is called after successful program verification.
  In this call a control buffer is prepared, the program instructions
  are appended to the buffer, and it is sent to qemu.
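
A compact sketch of such a 'setup()' pass (virtnet_bpf_host_fd() and
the allowed-helper set are hypothetical; the instruction encodings are
the standard eBPF ones, where a map reference is a two-insn
BPF_LD_IMM64 whose src_reg is BPF_PSEUDO_MAP_FD and whose imm holds
the map fd):

static bool virtnet_bpf_helper_allowed(u32 func_id)
{
        switch (func_id) {
        case BPF_FUNC_map_lookup_elem:
        case BPF_FUNC_map_update_elem:
        case BPF_FUNC_map_delete_elem:
                return true;
        default:
                return false; /* host-probing helpers stay blocked */
        }
}

static int virtnet_bpf_setup_prog(struct bpf_prog *prog)
{
        struct bpf_insn *insn = prog->insnsi;
        int i;

        for (i = 0; i < prog->len; i++, insn++) {
                if (insn->code == (BPF_LD | BPF_IMM | BPF_DW) &&
                    insn->src_reg == BPF_PSEUDO_MAP_FD) {
                        /* imm holds the guest map fd; substitute the
                         * host fd recorded at map offload time. */
                        insn->imm = virtnet_bpf_host_fd(insn->imm);
                        i++;
                        insn++; /* ld_imm64 is a two-insn pair */
                        continue;
                }
                if (insn->code == (BPF_JMP | BPF_CALL) &&
                    insn->src_reg != BPF_PSEUDO_CALL &&
                    !virtnet_bpf_helper_allowed(insn->imm))
                        return -EOPNOTSUPP;
        }
        return 0;
}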

tun
===
This set makes changes in tun to run an XDP prog in the Tx path. This
will be the program offloaded from the guest, and it can be set using
the tun ioctl interface. There were multiple candidate places where
this program could be executed:
- tun_net_xmit
- tun_xdp_xmit
- tun_recvmsg
tun_recvmsg was chosen because it runs in process context; the other
two run in bh context. Running in process context helps in setting up
service chaining using XDP redirect.
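
A minimal sketch of that hook (the offloaded_xdp_prog field is an
assumed name):

/* Run the guest-offloaded program on a packet leaving tun towards
 * vhost/userspace; called from tun_recvmsg(), i.e. process context. */
static u32 tun_run_offloaded_xdp(struct tun_struct *tun,
                                 struct xdp_buff *xdp)
{
        struct bpf_prog *prog;
        u32 act = XDP_PASS;

        rcu_read_lock();
        prog = rcu_dereference(tun->offloaded_xdp_prog); /* assumed field */
        if (prog)
                act = bpf_prog_run_xdp(prog, xdp);
        rcu_read_unlock();

        return act;
}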

The XDP_REDIRECT action of the offloaded program isn't handled,
because the target interface's ndo_xdp_xmit is called when we redirect
a packet. In the offload case the target interface will be some tap
interface, and any packet redirected towards it would be sent back to
the guest, which is not what we expect. Such redirects will need
special handling in the kernel.

The XDP_TX action of the offloaded program is handled: the packet is
injected into the Rx path in this case. Care is taken so that the
tap's native Rx path XDP is still executed for such packets.


* Implementation in Qemu

Qemu is modified to handle control commands from the guest. When a
program offload command is received, it loads the program into the
host OS and attaches the program fd to the tap device. All program and
map operations are performed using libbpf APIs.
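
A sketch of that qemu-side handling, using the libbpf API of the time
(TUNSETOFFLOADEDXDP is a hypothetical ioctl standing in for whatever
attach interface the series actually defines):

#include <bpf/bpf.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Load the guest-supplied instructions on the host and attach the
 * resulting prog fd to the tap device's Tx path. */
static int tap_offload_xdp_prog(int tap_fd, const struct bpf_insn *insns,
                                size_t insn_cnt)
{
        char log[4096];
        int prog_fd;

        prog_fd = bpf_load_program(BPF_PROG_TYPE_XDP, insns, insn_cnt,
                                   "GPL", 0, log, sizeof(log));
        if (prog_fd < 0)
                return -errno;

        /* Hypothetical ioctl; see the tun patches for the real one. */
        if (ioctl(tap_fd, TUNSETOFFLOADEDXDP, &prog_fd) < 0) {
                close(prog_fd);
                return -errno;
        }
        return prog_fd;
}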


* Performance numbers

Single-flow tests were performed; the diagram below shows the setup.
The xdp1 and xdp2 sample programs were modified to use
BPF_MAP_TYPE_ARRAY instead of a per-CPU array (a sketch of that change
follows), and xdp1_user.c was modified to take a hardware offload
parameter.
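
For reference, the sample's map definition changes along these lines
(a sketch based on the samples/bpf xdp1 program of that era):

/* rxcnt in xdp1_kern.c, switched from a per-CPU array so that the
 * map type is one the offload path supports. */
struct bpf_map_def SEC("maps") rxcnt = {
        .type        = BPF_MAP_TYPE_ARRAY, /* was BPF_MAP_TYPE_PERCPU_ARRAY */
        .key_size    = sizeof(u32),
        .value_size  = sizeof(long),
        .max_entries = 256,
};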

                     (Rx path XDP to drop      (Tx path XDP.
                      XDP_TX'ed pkts from       Program offloaded
                      tun Tx path XDP)          from virtio_net)
                          XDP_DROP ----------.  XDP_DROP/XDP_TX
                                              \   |
                                    (Case 2)   \  |   XDP_DROP/XDP_TX
 pktgen ---> 10G-NIC === 10G-NIC --- bridge --- tun --- virtio-net
|<------ netns ------>|    |                     ^   |<----guest---->|
                           v                     |
                           '---- XDP_REDIRECT----'
                                  (Case 1)

Case 1: Packets XDP_REDIRECT'ed towards tun.
                        Non-offload        Offload
  xdp1 (XDP_DROP)        2.46 Mpps        12.90 Mpps
  xdp2 (XDP_TX)          1.50 Mpps         7.26 Mpps

Case 2: Packets are not redirected. They pass through the bridge.
                        Non-offload        Offload
  xdp1 (XDP_DROP)        1.03 Mpps         1.01 Mpps
  xdp2 (XDP_TX)          1.10 Mpps         0.99 Mpps

  In case 2, the offload performance is low. In this case the
  producer function is tun_net_xmit. It puts a single packet in the
  ptr ring and spends most of its time waking up the vhost thread. On
  the other hand, each time the vhost thread wakes up it calls
  tun_recvmsg. Since the Tx path XDP runs in tun_recvmsg, vhost
  doesn't see any packets; it sleeps frequently, and the producer
  function spends most of its time waking it up. vhost polling
  improves these numbers, but then the non-offload performance also
  improves and remains higher than the offload case. Performance in
  this case can be improved later in separate work.

Since this set makes changes in virtio_net, tun and vhost_net, it was
necessary to measure the performance difference after applying this
set. Performance numbers are in the table below (STREAM results are
throughput in Mbps, RR results are transactions/s, per netperf
defaults):

   Netperf Test           Before      After    Difference
  UDP_STREAM 18byte        89.43      90.74        +1.46%
  UDP_STREAM 1472byte       6882       7026        +2.09%
  TCP_STREAM                9403       9407        +0.04%
  UDP_RR                   13520      13478        -0.31%
  TCP_RR                   13120      12918        -1.53%


* Points for improvement (TODO)

- In the current implementation, qemu passes a host map fd to the
  guest, which means the guest is poking at host information. This can
  be avoided by moving the map fd replacement task from the guest to
  qemu.

- Currently there is no way on the host side to show whether a tap
  interface has an offloaded XDP program attached.

- When sending program- and map-related control commands from guest
  to host, it would be better if we also passed metadata about the
  program and maps, for example BTF data.

- In the future, virtio can have a feature bit for the offloading
  capability.

- TUNGETFEATURES should have a flag to advertise the offloading
  capability.

- Submit a virtio spec patch describing the XDP offloading feature.

- When offloading is enabled, it should be a migration blocker.

- DoS: Offloaded maps use host memory beyond what has been allocated
  for the guest, so offloading many large maps could be a DoS
  strategy. Hence qemu should have a parameter to limit how many maps
  a guest can offload or how much memory the offloaded maps may use.


* Other dependencies

- Loading a bpf program requires the CAP_SYS_ADMIN capability. We
  tested this set by running qemu as root or by adding CAP_SYS_ADMIN
  to the qemu binary; otherwise qemu doesn't have this capability.
  Alexei's recent CAP_BPF work could be a solution to this problem;
  that work is still being discussed on the mailing list.

Jason Wang (9):
  bpf: introduce bpf_prog_offload_verifier_setup()
  net: core: rename netif_receive_generic_xdp() to do_generic_xdp_core()
  net: core: export do_xdp_generic_core()
  tun: set offloaded xdp program
  virtio-net: store xdp_prog in device
  virtio_net: add XDP prog offload infrastructure
  virtio_net: implement XDP prog offload functionality
  bpf: export function __bpf_map_get
  virtio_net: implement XDP map offload functionality

Prashant Bhole (9):
  tuntap: check tun_msg_ctl type at necessary places
  vhost_net: use tap recvmsg api to access ptr ring
  tuntap: remove usage of ptr ring in vhost_net
  tun: run offloaded XDP program in Tx path
  tun: add a way to inject Tx path packet into Rx path
  tun: handle XDP_TX action of offloaded program
  tun: run xdp prog when tun is read from file interface
  virtio_net: use XDP attachment helpers
  virtio_net: restrict bpf helper calls from offloaded program

 drivers/net/tap.c               |  42 ++-
 drivers/net/tun.c               | 257 +++++++++++++--
 drivers/net/virtio_net.c        | 552 +++++++++++++++++++++++++++++---
 drivers/vhost/net.c             |  77 ++---
 include/linux/bpf.h             |   1 +
 include/linux/bpf_verifier.h    |   1 +
 include/linux/if_tap.h          |   5 -
 include/linux/if_tun.h          |  23 +-
 include/linux/netdevice.h       |   2 +
 include/uapi/linux/if_tun.h     |   1 +
 include/uapi/linux/virtio_net.h |  50 +++
 kernel/bpf/offload.c            |  14 +
 kernel/bpf/syscall.c            |   1 +
 kernel/bpf/verifier.c           |   6 +
 net/core/dev.c                  |   8 +-
 15 files changed, 901 insertions(+), 139 deletions(-)

Comments

Jakub Kicinski Nov. 26, 2019, 8:35 p.m. UTC | #1
On Tue, 26 Nov 2019 19:07:26 +0900, Prashant Bhole wrote:
> Note: This RFC has been sent to netdev as well as qemu-devel lists
> 
> This series introduces XDP offloading from virtio_net. It is based on
> the following work by Jason Wang:
> https://netdevconf.info/0x13/session.html?xdp-offload-with-virtio-net
> 
> Current XDP performance in virtio-net is far from what we can achieve
> on host. Several major factors cause the difference:
> - Cost of virtualization
> - Cost of virtio (populating virtqueue and context switching)
> - Cost of vhost, it needs more optimization
> - Cost of data copy
> Because of above reasons there is a need of offloading XDP program to
> host. This set is an attempt to implement XDP offload from the guest.

This turns the guest kernel into a uAPI proxy.

BPF uAPI calls related to the "offloaded" BPF objects are forwarded
to the hypervisor; they pop up in QEMU, which makes the requested call
to the hypervisor kernel. Today it's the Linux kernel; tomorrow it may
be someone's proprietary "SmartNIC" implementation.

Why can't those calls be forwarded at a higher layer? Why do they
have to go through the guest kernel?

If the kernel performs no significant work (or "adds value", pardon
the expression), and the problem can easily be solved otherwise, we
shouldn't do the work of maintaining the mechanism.

The approach of kernel generating actual machine code which is then
loaded into a sandbox on the hypervisor/SmartNIC is another story.

I'd appreciate it if others could chime in.
Jason Wang Nov. 27, 2019, 2:59 a.m. UTC | #2
Hi Jakub:

On 2019/11/27 4:35 AM, Jakub Kicinski wrote:
> On Tue, 26 Nov 2019 19:07:26 +0900, Prashant Bhole wrote:
>> [...]
> This turns the guest kernel into a uAPI proxy.
>
> BPF uAPI calls related to the "offloaded" BPF objects are forwarded
> to the hypervisor, they pop up in QEMU which makes the requested call
> to the hypervisor kernel. Today it's the Linux kernel tomorrow it may
> be someone's proprietary "SmartNIC" implementation.
>
> Why can't those calls be forwarded at the higher layer? Why do they
> have to go through the guest kernel?


I think doing the forwarding at a higher layer has the following issues:

- It needs a dedicated library (probably libbpf), but applications may
choose to do the eBPF syscall directly
- It depends on a guest agent to work
- It can't work for virtio-net hardware, since that still requires a
hardware interface for carrying the offloading information
- Implementing it at the kernel level may help future extensions like
BPF object pinning, eBPF helpers, etc.

Basically, this series is trying to have an implementation of
transporting eBPF through virtio, so it's not necessarily guest to
host but driver to device. The device could be either a virtual one
(as done in qemu) or real hardware.


>
> If kernel performs no significant work (or "adds value", pardon the
> expression), and problem can easily be solved otherwise we shouldn't
> do the work of maintaining the mechanism.


My understanding is that it should not be much different from other
offloading technologies.


>
> The approach of kernel generating actual machine code which is then
> loaded into a sandbox on the hypervisor/SmartNIC is another story.


We've considered such a way, but actual machine code is not as
portable as eBPF bytecode, considering we may want to:

- Support migration
- Further offload the program to a smart NIC (e.g. through macvtap
passthrough mode etc).

Thanks


> I'd appreciate if others could chime in.
>
Jakub Kicinski Nov. 27, 2019, 7:49 p.m. UTC | #3
On Wed, 27 Nov 2019 10:59:37 +0800, Jason Wang wrote:
> On 2019/11/27 上午4:35, Jakub Kicinski wrote:
> > On Tue, 26 Nov 2019 19:07:26 +0900, Prashant Bhole wrote:  
> >> [...]
> > This turns the guest kernel into a uAPI proxy.
> >
> > BPF uAPI calls related to the "offloaded" BPF objects are forwarded
> > to the hypervisor, they pop up in QEMU which makes the requested call
> > to the hypervisor kernel. Today it's the Linux kernel tomorrow it may
> > be someone's proprietary "SmartNIC" implementation.
> >
> > Why can't those calls be forwarded at the higher layer? Why do they
> > have to go through the guest kernel?  
> 
> 
> I think doing forwarding at higher layer have the following issues:
> 
> - Need a dedicated library (probably libbpf) but application may choose 
>   to do eBPF syscall directly
> - Depends on guest agent to work

This can be said about any user space functionality.

> - Can't work for virtio-net hardware, since it still requires a hardware 
> interface for carrying  offloading information

The HW virtio-net presumably still has a PF and hopefully reprs for
VFs, so why can't it attach the program there?

> - Implement at the level of kernel may help for future extension like 
>   BPF object pinning and eBPF helper etc.

No idea what you mean by this.

> Basically, this series is trying to have an implementation of 
> transporting eBPF through virtio, so it's not necessarily a guest to 
> host but driver and device. For device, it could be either a virtual one 
> (as done in qemu) or a real hardware.

A SmartNIC with multi-core 64-bit ARM CPUs is as much of a host as
the x86 hypervisor side is. This set turns the kernel into a uAPI
forwarder.

3 years ago my answer to this proposal would have been very different.
Today, after all the CPU bugs, it seems like the SmartNICs (which are
just another CPU running proprietary code) may just take off...

> > If kernel performs no significant work (or "adds value", pardon the
> > expression), and problem can easily be solved otherwise we shouldn't
> > do the work of maintaining the mechanism.  
> 
> My understanding is that it should not be much difference compared to 
> other offloading technology.

I presume you mean TC offloads? In virtualization there is inherently a
hypervisor which will receive the request, be it an IO hub/SmartNIC or
the traditional hypervisor on the same CPU.

The ACL/routing offloads differ significantly, because it's either the 
driver that does all the HW register poking directly or the complexity
of programming a rule into a HW table is quite low.

Same is true for the NFP BPF offload, BTW, the driver does all the
heavy lifting and compiles the final machine code image.

You can't say verifying and JITing BPF code into machine code entirely
in the hypervisor is similarly simple.

So no, there is a huge difference.

> > The approach of kernel generating actual machine code which is then
> > loaded into a sandbox on the hypervisor/SmartNIC is another story.  
> 
> We've considered such way, but actual machine code is not as portable as 
> eBPF bytecode consider we may want:
> 
> - Support migration
> - Further offload the program to smart NIC (e.g through macvtap 
>   passthrough mode etc).

You can re-JIT or JIT for the SmartNIC..? Having the BPF bytecode does
not guarantee migration either, if the environment is expected to be
running different versions of HW and SW. But yes, JITing in the guest
kernel when you don't know what to JIT for may be hard; I was just
saying that I don't mean to discourage people from implementing
sandboxes which run JITed code on SmartNICs. My criticism is (as
always?) against turning the kernel into a one-to-one uAPI forwarder
into unknown platform code.

For cloud use cases I believe the higher layer should solve this.
Michael S. Tsirkin Nov. 27, 2019, 8:32 p.m. UTC | #4
On Tue, Nov 26, 2019 at 12:35:14PM -0800, Jakub Kicinski wrote:
> On Tue, 26 Nov 2019 19:07:26 +0900, Prashant Bhole wrote:
> > [...]
> 
> This turns the guest kernel into a uAPI proxy.
> 
> BPF uAPI calls related to the "offloaded" BPF objects are forwarded 
> to the hypervisor, they pop up in QEMU which makes the requested call
> to the hypervisor kernel. Today it's the Linux kernel tomorrow it may 
> be someone's proprietary "SmartNIC" implementation.
> 
> Why can't those calls be forwarded at the higher layer? Why do they
> have to go through the guest kernel?

Well everyone is writing these programs and attaching them to NICs.

For better or worse that's how userspace is written.

Yes, in the simple case where everything is passed through, it could
instead be passed through some other channel just as well, but then
userspace would need significant changes just to make it work with
virtio.



> If kernel performs no significant work (or "adds value", pardon the
> expression), and problem can easily be solved otherwise we shouldn't 
> do the work of maintaining the mechanism.
> 
> The approach of kernel generating actual machine code which is then
> loaded into a sandbox on the hypervisor/SmartNIC is another story.

But that's transparent to guest userspace. Making userspace care whether
it's a SmartNIC or a software device breaks part of virtualization's
appeal, which is that it looks like a hardware box to the guest.

> I'd appreciate if others could chime in.
Jakub Kicinski Nov. 27, 2019, 11:40 p.m. UTC | #5
On Wed, 27 Nov 2019 15:32:17 -0500, Michael S. Tsirkin wrote:
> On Tue, Nov 26, 2019 at 12:35:14PM -0800, Jakub Kicinski wrote:
> > On Tue, 26 Nov 2019 19:07:26 +0900, Prashant Bhole wrote:  
> > > [...]
> > 
> > This turns the guest kernel into a uAPI proxy.
> > 
> > BPF uAPI calls related to the "offloaded" BPF objects are forwarded 
> > to the hypervisor, they pop up in QEMU which makes the requested call
> > to the hypervisor kernel. Today it's the Linux kernel tomorrow it may 
> > be someone's proprietary "SmartNIC" implementation.
> > 
> > Why can't those calls be forwarded at the higher layer? Why do they
> > have to go through the guest kernel?  
> 
> Well everyone is writing these programs and attaching them to NICs.

Who's everyone?

> For better or worse that's how userspace is written.

HW offload requires modifying the user space, too. The offload is not
transparent. Do you know that?

> Yes, in the simple case where everything is passed through, it could
> instead be passed through some other channel just as well, but then
> userspace would need significant changes just to make it work with
> virtio.

There is a recently spawned effort to create an "XDP daemon" or
otherwise a control application which would among other things link
separate XDP apps to share a NIC attachment point.

Making use of cloud APIs would make a perfect addition to that.

Obviously if one asks a kernel guy to solve a problem one'll get kernel
code as an answer. And writing higher layer code requires companies to
actually organize their teams and have "full stack" strategies.

We've seen this story already with the net_failover wart. At least
that time we weren't risking building a proxy to someone's proprietary
FW.

> > If kernel performs no significant work (or "adds value", pardon the
> > expression), and problem can easily be solved otherwise we shouldn't 
> > do the work of maintaining the mechanism.
> > 
> > The approach of kernel generating actual machine code which is then
> > loaded into a sandbox on the hypervisor/SmartNIC is another story.  
> 
> But that's transparent to guest userspace. Making userspace care whether
> it's a SmartNIC or a software device breaks part of virtualization's
> appeal, which is that it looks like a hardware box to the guest.

It's not hardware unless you JITed machine code for it, it's just
someone else's software.

I'm not arguing with the appeal. I'm arguing the risk/benefit ratio
doesn't justify opening this can of worms.

> > I'd appreciate if others could chime in.
Alexei Starovoitov Nov. 28, 2019, 3:32 a.m. UTC | #6
On Tue, Nov 26, 2019 at 12:35:14PM -0800, Jakub Kicinski wrote:
> 
> I'd appreciate if others could chime in.

The performance improvements are quite appealing.
In general offloading from higher layers into lower layers is necessary long term.

But the approach taken by patches 15 and 17 is a dead end. I don't see how it
can ever catch up with the pace of bpf development. As presented this approach
works for the most basic programs and simple maps. No line info, no BTF, no
debuggability. There are no tail_calls either. I don't think I've seen a single
production XDP program that doesn't use tail calls. Static and dynamic linking
is coming. Wrapping one bpf feature at a time with a virtio api is never going
to be complete. How are FDs going to be passed back? OBJ_GET_INFO_BY_FD?
OBJ_PIN/GET? Where is bpffs going to live? Any realistic XDP application will
be using a lot more than a single self-contained XDP prog with hash and array
maps. It feels like the whole of sys_bpf needs to be forwarded from guest into
host. In the case of true hw offload the host is managing the HW, so it
doesn't forward syscalls into the driver. The offload from guest into host is
different: BPF can be seen as a resource that the host provides, and the guest
kernel plus qemu would be forwarding requests between guest user space and host
kernel. For example, sys_bpf(BPF_MAP_CREATE) can pass through into the host
directly. The FD that the host sees would need a corresponding mirror FD in the
guest. There are still questions about bpffs paths, but the main issue of
one-feature-at-a-time would be addressed in such an approach. There could be
other solutions, of course.
Jason Wang Nov. 28, 2019, 3:41 a.m. UTC | #7
On 2019/11/28 3:49 AM, Jakub Kicinski wrote:
> On Wed, 27 Nov 2019 10:59:37 +0800, Jason Wang wrote:
>> On 2019/11/27 4:35 AM, Jakub Kicinski wrote:
>>> On Tue, 26 Nov 2019 19:07:26 +0900, Prashant Bhole wrote:
>>>> [...]
>>> This turns the guest kernel into a uAPI proxy.
>>>
>>> BPF uAPI calls related to the "offloaded" BPF objects are forwarded
>>> to the hypervisor, they pop up in QEMU which makes the requested call
>>> to the hypervisor kernel. Today it's the Linux kernel tomorrow it may
>>> be someone's proprietary "SmartNIC" implementation.
>>>
>>> Why can't those calls be forwarded at the higher layer? Why do they
>>> have to go through the guest kernel?
>>
>> I think doing forwarding at higher layer have the following issues:
>>
>> - Need a dedicated library (probably libbpf) but application may choose
>>    to do eBPF syscall directly
>> - Depends on guest agent to work
> This can be said about any user space functionality.


Yes, but the feature may have too many unnecessary dependencies: a
dedicated library, a guest agent, a host agent, etc. This can only
work for some specific setups and will lead to vendor-specific
implementations.


>
>> - Can't work for virtio-net hardware, since it still requires a hardware
>> interface for carrying  offloading information
> The HW virtio-net presumably still has a PF and hopefully reprs for
> VFs, so why can't it attach the program there?


Then you still need an interface for carrying such information. With
a virtio-net VF with reprs it would work like:

libbpf(guest) -> guest agent -> host agent -> libbpf(host) -> BPF
syscall -> VF reprs/PF driver -> VF/PF reprs -> virtio-net VF

You would still need a vendor-specific way of passing eBPF commands
from the driver to the reprs/PF, and possibly it would still be a
virtio interface there.

In this proposal it will work out of box as simple as:

libbpf(guest) -> guest kernel -> virtio-net driver -> virtio-net VF

If the request comes from the host (e.g. flow offloading,
configuration etc), VF reprs are a perfect fit. But if the request
comes from the guest, the much longer journey looks quite like a
burden (dependencies, bugs etc).

What's more important, we cannot assume how the virtio-net HW is
structured; it might not even be an SR-IOV or PCI card.


>
>> - Implement at the level of kernel may help for future extension like
>>    BPF object pinning and eBPF helper etc.
> No idea what you mean by this.


My understanding is that we should narrow the gap between
non-offloaded and offloaded eBPF programs. Making maps or progs
visible to the kernel may help to preserve a unified API, e.g. object
pinning through sysfs, tracepoints, debugging etc.


>
>> Basically, this series is trying to have an implementation of
>> transporting eBPF through virtio, so it's not necessarily a guest to
>> host but driver and device. For device, it could be either a virtual one
>> (as done in qemu) or a real hardware.
> SmartNIC with a multi-core 64bit ARM CPUs is as much of a host as
> is the x86 hypervisor side. This set turns the kernel into a uAPI
> forwarder.


Not necessarily. As with what NFP has done, the driver filters out
the features that are not supported, and the bpf object is still
visible in the kernel (also see the comment above).


>
> 3 years ago my answer to this proposal would have been very different.
> Today after all the CPU bugs it seems like the SmartNICs (which are
> just another CPU running proprietary code) may just take off..
>

That's interesting, but vendors may choose to use an FPGA rather than
an SoC in this case. Anyhow, discussion like this is somewhat out of
the scope of this series.


>>> If kernel performs no significant work (or "adds value", pardon the
>>> expression), and problem can easily be solved otherwise we shouldn't
>>> do the work of maintaining the mechanism.
>> My understanding is that it should not be much difference compared to
>> other offloading technology.
> I presume you mean TC offloads? In virtualization there is inherently a
> hypervisor which will receive the request, be it an IO hub/SmartNIC or
> the traditional hypervisor on the same CPU.
>
> The ACL/routing offloads differ significantly, because it's either the
> driver that does all the HW register poking directly or the complexity
> of programming a rule into a HW table is quite low.
>
> Same is true for the NFP BPF offload, BTW, the driver does all the
> heavy lifting and compiles the final machine code image.


Yes, and this series benefits from the infrastructure invented for
NFP. But I'm not sure this is a good point since, technically, the
machine code could be generated by the smart NIC as well.


>
> You can't say verifying and JITing BPF code into machine code entirely
> in the hypervisor is similarly simple.


Yes, and that's why we chose to do it on the device (host) side, to
simplify things.


>
> So no, there is a huge difference.
>

>>> The approach of kernel generating actual machine code which is then
>>> loaded into a sandbox on the hypervisor/SmartNIC is another story.
>> We've considered such way, but actual machine code is not as portable as
>> eBPF bytecode consider we may want:
>>
>> - Support migration
>> - Further offload the program to smart NIC (e.g through macvtap
>>    passthrough mode etc).
> You can re-JIT or JIT for SmartNIC..? Having the BPF bytecode does not
> guarantee migration either,


Yes, but it's more portable than machine code.


> if the environment is expected to be
> running different version of HW and SW.


Right, we plan to have feature negotiation.


> But yes, JITing in the guest
> kernel when you don't know what to JIT for may be hard,


Yes.


> I was just
> saying that I don't mean to discourage people from implementing
> sandboxes which run JITed code on SmartNICs. My criticism is (as
> always?) against turning the kernel into a one-to-one uAPI forwarder
> into unknown platform code.


We have FUSE, and I think this is not only a forwarder; we may do
much more work on top of it in the future. As for unknown platform
code, I'm not sure why we need to care about that. There's no way for
us to prevent such implementations, and if we try to formalize it
through a specification (the virtio spec and probably an eBPF spec),
that may actually help.


>
> For cloud use cases I believe the higher layer should solve this.
>

Technically possible, but it has lots of drawbacks.

Thanks
Jason Wang Nov. 28, 2019, 4:18 a.m. UTC | #8
On 2019/11/28 11:32 AM, Alexei Starovoitov wrote:
> On Tue, Nov 26, 2019 at 12:35:14PM -0800, Jakub Kicinski wrote:
>> I'd appreciate if others could chime in.
> The performance improvements are quite appealing.
> In general offloading from higher layers into lower layers is necessary long term.
>
> But the approach taken by patches 15 and 17 is a dead end. I don't see how it
> can ever catch up with the pace of bpf development.


This applies to any hardware offloading feature, doesn't it?


>   As presented this approach
> works for the most basic programs and simple maps. No line info, no BTF, no
> debuggability. There are no tail_calls either.


If I understand correctly, none of the above were implemented in NFP.
We can collaborate to find solutions for all of them.


>   I don't think I've seen a single
> production XDP program that doesn't use tail calls.


It looks to me like we can manage to add this support.


> Static and dynamic linking
> is coming. Wraping one bpf feature at a time with virtio api is never going to
> be complete.


It's a common problem for any hardware that wants to implement eBPF
offloading, not a virtio-specific one.


> How FDs are going to be passed back? OBJ_GET_INFO_BY_FD ?
> OBJ_PIN/GET ? Where bpffs is going to live ?


If we want pinning to work in the virt case, it should probably live
in both the host and the guest.


>   Any realistic XDP application will
> be using a lot more than single self contained XDP prog with hash and array
> maps.


It's possible, if we want to use XDP offloading to accelerate VNFs,
which often have simple logic.


> It feels that the whole sys_bpf needs to be forwarded as a whole from
> guest into host. In case of true hw offload the host is managing HW. So it
> doesn't forward syscalls into the driver. The offload from guest into host is
> different. BPF can be seen as a resource that host provides and guest kernel
> plus qemu would be forwarding requests between guest user space and host
> kernel. Like sys_bpf(BPF_MAP_CREATE) can passthrough into the host directly.
> The FD that hosts sees would need a corresponding mirror FD in the guest. There
> are still questions about bpffs paths, but the main issue of
> one-feature-at-a-time will be addressed in such approach.


We tried to follow what NFP did by starting from a fraction of the
full eBPF feature set. It would be very hard to have all eBPF features
implemented from the start. It would be helpful to clarify what
minimal set of features you want to see from the start.


> There could be other
> solutions, of course.
>
>

Suggestions are welcomed.

Thanks
David Ahern Dec. 1, 2019, 4:54 p.m. UTC | #9
On 11/27/19 10:18 PM, Jason Wang wrote:
> We try to follow what NFP did by starting from a fraction of the whole
> eBPF features. It would be very hard to have all eBPF features
> implemented from the start.  It would be helpful to clarify what's the
> minimal set of features that you want to have from the start.

Offloading guest programs needs to prevent a guest XDP program from
running bpf helpers that access host kernel data, e.g. bpf_fib_lookup.
Jason Wang Dec. 2, 2019, 2:48 a.m. UTC | #10
On 2019/12/2 12:54 AM, David Ahern wrote:
> On 11/27/19 10:18 PM, Jason Wang wrote:
>> We try to follow what NFP did by starting from a fraction of the whole
>> eBPF features. It would be very hard to have all eBPF features
>> implemented from the start.  It would be helpful to clarify what's the
>> minimal set of features that you want to have from the start.
> Offloading guest programs needs to prevent a guest XDP program from
> running bpf helpers that access host kernel data. e.g., bpf_fib_lookup


Right, so we probably need a new type of eBPF program on the host and 
filter out the unsupported helpers there.

Thanks


Michael S. Tsirkin Dec. 2, 2019, 3:29 p.m. UTC | #11
On Wed, Nov 27, 2019 at 03:40:14PM -0800, Jakub Kicinski wrote:
> > For better or worse that's how userspace is written.
> 
> HW offload requires modifying the user space, too. The offload is not
> transparent. Do you know that?

It's true, offload of the program itself isn't transparent. Adding a
3rd interface (software/hardware/host) isn't welcome though, IMHO.