
[RFC,v2,00/13] Introduce VDUSE - vDPA Device in Userspace

Message ID 20201222145221.711-1-xieyongji@bytedance.com (mailing list archive)

Message

Yongji Xie Dec. 22, 2020, 2:52 p.m. UTC
This series introduces a framework that can be used to implement
vDPA devices in a userspace program. The work consists of two parts:
control path forwarding and data path offloading.

In the control path, the VDUSE driver makes use of a message
mechanism to forward config operations from the vdpa bus driver
to userspace. Userspace can use read()/write() to receive and
reply to those control messages.
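
For illustration, a VDUSE daemon's control loop could look roughly
like this (the message layout and emulation helper below are invented
for this sketch, not the actual UAPI in include/uapi/linux/vduse.h):

    #include <stdint.h>
    #include <unistd.h>

    /* Illustrative only: this layout is a stand-in, not the actual
     * UAPI defined in include/uapi/linux/vduse.h. */
    struct vduse_msg {
        uint32_t type;          /* e.g. set/get status, config access */
        uint32_t request_id;
        uint8_t  payload[256];
    };

    static void emulate_request(struct vduse_msg *msg);  /* hypothetical */

    static void message_loop(int dev_fd)
    {
        struct vduse_msg msg;

        for (;;) {
            /* Block until the VDUSE driver forwards a config
             * operation from the vdpa bus driver. */
            if (read(dev_fd, &msg, sizeof(msg)) != sizeof(msg))
                break;

            emulate_request(&msg);

            /* Send the (possibly filled-in) reply back. */
            if (write(dev_fd, &msg, sizeof(msg)) != sizeof(msg))
                break;
        }
    }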

In the data path, the core idea is mapping the DMA buffer into the
VDUSE daemon's address space, which can be implemented in different
ways depending on the vdpa bus to which the vDPA device is attached.

In the virtio-vdpa case, we implement an MMU-based on-chip IOMMU driver
with a bounce-buffering mechanism to achieve that. In the vhost-vdpa
case, the DMA buffer resides in a userspace memory region which can be
shared with the VDUSE userspace process by transferring the shmfd.
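
In both cases, the daemon ends up mmap()ing a file descriptor to reach
the I/O buffers, along these lines (a minimal sketch; the parameters
are illustrative):

    #include <stddef.h>
    #include <sys/types.h>
    #include <sys/mman.h>

    /* Sketch: map a DMA region into the daemon's address space. The fd
     * is either the VDUSE device fd backed by kernel bounce pages
     * (virtio-vdpa) or a shmfd received from the vhost-vdpa client;
     * offset and size are illustrative. */
    static void *map_dma_region(int fd, off_t offset, size_t size)
    {
        void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, offset);

        return addr == MAP_FAILED ? NULL : addr;
    }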

The details and our use case are shown below:

------------------------    -------------------------   ----------------------------------------------
|            Container |    |              QEMU(VM) |   |                               VDUSE daemon |
|       ---------      |    |  -------------------  |   | ------------------------- ---------------- |
|       |dev/vdx|      |    |  |/dev/vhost-vdpa-x|  |   | | vDPA device emulation | | block driver | |
------------+-----------     -----------+------------   -------------+----------------------+---------
            |                           |                            |                      |
            |                           |                            |                      |
------------+---------------------------+----------------------------+----------------------+---------
|    | block device |           |  vhost device |            | vduse driver |          | TCP/IP |    |
|    -------+--------           --------+--------            -------+--------          -----+----    |
|           |                           |                           |                       |        |
| ----------+----------       ----------+-----------         -------+-------                |        |
| | virtio-blk driver |       |  vhost-vdpa driver |         | vdpa device |                |        |
| ----------+----------       ----------+-----------         -------+-------                |        |
|           |      virtio bus           |                           |                       |        |
|   --------+----+-----------           |                           |                       |        |
|                |                      |                           |                       |        |
|      ----------+----------            |                           |                       |        |
|      | virtio-blk device |            |                           |                       |        |
|      ----------+----------            |                           |                       |        |
|                |                      |                           |                       |        |
|     -----------+-----------           |                           |                       |        |
|     |  virtio-vdpa driver |           |                           |                       |        |
|     -----------+-----------           |                           |                       |        |
|                |                      |                           |    vdpa bus           |        |
|     -----------+----------------------+---------------------------+------------           |        |
|                                                                                        ---+---     |
-----------------------------------------------------------------------------------------| NIC |------
                                                                                         ---+---
                                                                                            |
                                                                                   ---------+---------
                                                                                   | Remote Storages |
                                                                                   -------------------

We use it to implement a block device that connects to our
distributed storage and can be used both in containers and VMs.
Thus, we can have a unified technology stack in these two cases.

To test it with null-blk:

  $ qemu-storage-daemon \
      --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
      --monitor chardev=charmonitor \
      --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0 \
      --export vduse-blk,id=test,node-name=disk0,writable=on,vduse-id=1,num-queues=16,queue-size=128

The qemu-storage-daemon can be found at https://github.com/bytedance/qemu/tree/vduse
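
Once the export is created, binding the resulting vdpa device to the
virtio-vdpa bus driver (e.g. by loading the virtio_vdpa module) should
expose it on the host as a virtio-blk disk such as /dev/vdx, which can
then be exercised with dd or fio; attaching it to a VM instead goes
through /dev/vhost-vdpa-x as shown in the diagram above.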

Future work:
  - Improve performance (e.g. zero copy implementation in datapath)
  - Config interrupt support
  - Userspace library (find a way to reuse device emulation code in qemu/rust-vmm)

This series is based on the following series:
https://lore.kernel.org/netdev/20201112064005.349268-1-parav@nvidia.com/

V1 to V2:
- Add vhost-vdpa support
- Add some documents
- Based on the vdpa management tool
- Introduce a workqueue for irq injection
- Replace interval tree with array map to store the iova_map

Xie Yongji (13):
  mm: export zap_page_range() for driver use
  eventfd: track eventfd_signal() recursion depth separately in different cases
  eventfd: Increase the recursion depth of eventfd_signal()
  vdpa: Remove the restriction that only supports virtio-net devices
  vdpa: Pass the netlink attributes to ops.dev_add()
  vduse: Introduce VDUSE - vDPA Device in Userspace
  vduse: support get/set virtqueue state
  vdpa: Introduce process_iotlb_msg() in vdpa_config_ops
  vduse: Add support for processing vhost iotlb message
  vduse: grab the module's references until there is no vduse device
  vduse/iova_domain: Support reclaiming bounce pages
  vduse: Add memory shrinker to reclaim bounce pages
  vduse: Introduce a workqueue for irq injection

 Documentation/driver-api/vduse.rst                 |   91 ++
 Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
 drivers/vdpa/Kconfig                               |    8 +
 drivers/vdpa/Makefile                              |    1 +
 drivers/vdpa/vdpa.c                                |    2 +-
 drivers/vdpa/vdpa_sim/vdpa_sim.c                   |    3 +-
 drivers/vdpa/vdpa_user/Makefile                    |    5 +
 drivers/vdpa/vdpa_user/eventfd.c                   |  229 ++++
 drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
 drivers/vdpa/vdpa_user/iova_domain.c               |  517 ++++++++
 drivers/vdpa/vdpa_user/iova_domain.h               |  103 ++
 drivers/vdpa/vdpa_user/vduse.h                     |   59 +
 drivers/vdpa/vdpa_user/vduse_dev.c                 | 1373 ++++++++++++++++++++
 drivers/vhost/vdpa.c                               |   34 +-
 fs/aio.c                                           |    3 +-
 fs/eventfd.c                                       |   20 +-
 include/linux/eventfd.h                            |    5 +-
 include/linux/vdpa.h                               |   11 +-
 include/uapi/linux/vdpa.h                          |    1 +
 include/uapi/linux/vduse.h                         |  119 ++
 mm/memory.c                                        |    1 +
 21 files changed, 2598 insertions(+), 36 deletions(-)
 create mode 100644 Documentation/driver-api/vduse.rst
 create mode 100644 drivers/vdpa/vdpa_user/Makefile
 create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
 create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
 create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
 create mode 100644 drivers/vdpa/vdpa_user/vduse.h
 create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
 create mode 100644 include/uapi/linux/vduse.h

Comments

Jason Wang Dec. 23, 2020, 6:38 a.m. UTC | #1
On 2020/12/22 10:52 PM, Xie Yongji wrote:
> This series introduces a framework that can be used to implement
> vDPA devices in a userspace program. The work consists of two parts:
> control path forwarding and data path offloading.
>
> In the control path, the VDUSE driver makes use of a message
> mechanism to forward config operations from the vdpa bus driver
> to userspace. Userspace can use read()/write() to receive and
> reply to those control messages.
>
> In the data path, the core idea is mapping the DMA buffer into the
> VDUSE daemon's address space, which can be implemented in different
> ways depending on the vdpa bus to which the vDPA device is attached.
>
> In the virtio-vdpa case, we implement an MMU-based on-chip IOMMU
> driver with a bounce-buffering mechanism to achieve that.


Rethinking the bounce buffer stuff: instead of using kernel pages with
mmap(), I wonder how about just using userspace pages like what vhost
did?

It means we'd need a worker to do the bouncing, but we wouldn't need to
care about annoying stuff like page reclaiming?
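
Something like this hand-wavy sketch (all names invented, not a real
patch):

    /* Hand-wavy sketch: bounce into pinned userspace pages from a
     * workqueue instead of from the DMA mapping routines. */
    struct bounce_work {
        struct work_struct work;
        struct page **upages;   /* pinned with pin_user_pages() */
        void *kbuf;             /* kernel-side data to bounce */
        size_t len;
    };

    static void bounce_fn(struct work_struct *work)
    {
        struct bounce_work *bw = container_of(work, struct bounce_work,
                                              work);
        size_t done = 0;

        while (done < bw->len) {
            size_t n = min_t(size_t, PAGE_SIZE, bw->len - done);
            void *dst = kmap(bw->upages[done >> PAGE_SHIFT]);

            memcpy(dst, bw->kbuf + done, n);
            kunmap(bw->upages[done >> PAGE_SHIFT]);
            done += n;
        }
    }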


> [...]
>
> V1 to V2:
> - Add vhost-vdpa support


I may have missed something, but I don't see any code to support that.
E.g. neither set_map nor dma_map/unmap is implemented in the config ops.

Thanks


Jason Wang Dec. 23, 2020, 8:14 a.m. UTC | #2
On 2020/12/23 2:38 PM, Jason Wang wrote:
>>
>> V1 to V2:
>> - Add vhost-vdpa support
>
>
> I may miss something but I don't see any code to support that. E.g 
> neither set_map nor dma_map/unmap is implemented in the config ops.
>
> Thanks 


Spoke too fast :(

I see a new config op was introduced.

Let me dive into that.

Thanks
Yongji Xie Dec. 23, 2020, 10:59 a.m. UTC | #3
On Wed, Dec 23, 2020 at 2:38 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2020/12/22 10:52 PM, Xie Yongji wrote:
> > [...]
> >
> > In the virtio-vdpa case, we implement an MMU-based on-chip IOMMU
> > driver with a bounce-buffering mechanism to achieve that.
>
>
> Rethinking the bounce buffer stuff: instead of using kernel pages
> with mmap(), I wonder how about just using userspace pages like what
> vhost did?
>
> It means we'd need a worker to do the bouncing, but we wouldn't need
> to care about annoying stuff like page reclaiming?
>

Currently the I/O bouncing is done in the streaming DMA mapping
routines, which can be called from interrupt context. If we put this
into a kworker, that means we would need to synchronize with the
kworker from interrupt context. I don't think that can work.
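
For reference, the copy happens inline in the map routine itself,
roughly like this (a simplified sketch with hypothetical helpers, not
the actual VDUSE code):

    /* Simplified sketch: the bounce copy runs inside the map routine,
     * whose callers may hold spinlocks or run in interrupt context,
     * so it cannot wait on a kworker. */
    static dma_addr_t vduse_map_page(struct device *dev, struct page *page,
                                     unsigned long offset, size_t size,
                                     enum dma_data_direction dir,
                                     unsigned long attrs)
    {
        void *bounce = alloc_bounce_slot(size);  /* hypothetical helper */

        if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
            memcpy(bounce, page_address(page) + offset, size);

        return bounce_slot_iova(bounce);         /* hypothetical helper */
    }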

Thanks,
Yongji
Jason Wang Dec. 24, 2020, 2:24 a.m. UTC | #4
On 2020/12/23 6:59 PM, Yongji Xie wrote:
> On Wed, Dec 23, 2020 at 2:38 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2020/12/22 10:52 PM, Xie Yongji wrote:
>>> [...]
>>
>> Rethinking the bounce buffer stuff: instead of using kernel pages
>> with mmap(), I wonder how about just using userspace pages like
>> what vhost did?
>>
>> It means we'd need a worker to do the bouncing, but we wouldn't
>> need to care about annoying stuff like page reclaiming?
>>
> Currently the I/O bouncing is done in the streaming DMA mapping
> routines, which can be called from interrupt context. If we put this
> into a kworker, that means we would need to synchronize with the
> kworker from interrupt context. I don't think that can work.


We just need to make sure the buffers are ready before the user tries
to access them.

But I admit it would be tricky (it would require a shadow virtqueue,
etc.), which is probably not a good idea.

Thanks

