[RFC,0/5] VirtIO RDMA

Message ID 20210902130625.25277-1-weijunji@bytedance.com (mailing list archive)

Message

Junji Wei Sept. 2, 2021, 1:06 p.m. UTC
Hi all,

This RFC aims to reopen the discussion of Virtio RDMA.
It is based on Yuval Shaia's RFC "VirtIO RDMA", which
implemented a framework for Virtio RDMA and a simple
control path (we are not sure whether Yuval Shaia has any
further plans for it).

We extend this work with a simple data path and a complete
control path. It currently works with SEND, RECV and REG_MR
in the kernel. This series includes a simple test module
that can communicate with ibv_rc_pingpong from rdma-core.

While doing this work, we ran into some problems and would
like to ask the community for suggestions:
1. Each QP needs two VQs, but QEMU supports only 1024 VQs by default.
   I think it is possible to multiplex the VQs, since
   cmd_post_send carries the QPN in the request.

2. The virtio-rdma device's GID should equal the host RDMA
   device's GID. This means that we cannot use the GID cache in
   the RDMA subsystem. And theoretically the GID should also equal
   the IP address of the device's netdev. How can we deal with
   this conflict?

3. How can we support DMA MRs? The verbs on the host cannot support
   them, and it seems hard to pin the whole guest physical memory
   in QEMU.

4. The FRMR API needs to set the key of an MR through IB_WR_REG_MR,
   but it is impossible to change the key of an MR using uverbs.
   In our implementation, we change the key in the WR during post_send,
   but this means the MR can only work with SEND and RECV, since we
   cannot change the key on the remote side. The final solution may be
   to implement a uRDMA device based on rxe in QEMU, which would give
   us full control of MRs.

5. GSI is not supported yet. We think the problem is that
   when the host receives a GSI packet, it does not know which
   device the packet belongs to.

Any further thoughts are very welcome. We also noticed that
there seems to be no existing work on a virtio-rdma spec; we are
happy to start it from this RFC.

How to test with the test module:

1. Set the test module's SERVER_ADDR and SERVER_PORT
2. Build the kernel and QEMU
3. Build rdmacm-mux in qemu/contrib and run it in backend
4. Boot the kernel in QEMU through libvirt with the following configuration:
<interface type='bridge'>
  <mac address='00:16:3e:5d:aa:a8'/>
  <source bridge='virbr0'/>
  <target dev='vnet1'/>
  <model type='virtio'/>
  <alias name='net0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x02'
   function='0x0' multifunction='on'/>
</interface>

<qemu:commandline>
  <qemu:arg value='-chardev'/>
  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
  <qemu:arg value='-device'/>
  <qemu:arg value='virtio-rdma-pci,disable-legacy=on,addr=2.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
  <qemu:arg value='-object'/>
  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
  <qemu:arg value='-numa'/>
  <qemu:arg value='node,memdev=mb1'/>
</qemu:commandline>

Note that virtio-net and virtio-rdma should be function 0 and
function 1 of the same PCI slot.

5. Run "ibv_rc_pingpong -g 1 -n 500 -s 20480" as the server
6. Run "insmod virtio_rdma_rc_pingping_client.ko" in the guest
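
As a small illustration of the slot/function note in step 4: the <address> element places virtio-net at slot 0x02, function 0x0, and addr=2.1 places virtio-rdma at slot 2, function 1. The sketch below packs and unpacks that pair using the same 5-bit slot / 3-bit function layout as the Linux PCI_DEVFN(), PCI_SLOT() and PCI_FUNC() macros. The lowercase helper names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a PCI slot (0..31) and function (0..7) into one devfn byte.
 * Hypothetical helper mirroring the layout of PCI_DEVFN(). */
static inline uint8_t pci_devfn(unsigned int slot, unsigned int func)
{
	assert(slot < 32 && func < 8);
	return (uint8_t)((slot << 3) | func);
}

/* Extract the slot number from a devfn byte. */
static inline unsigned int pci_slot(uint8_t devfn)
{
	return (devfn >> 3) & 0x1f;
}

/* Extract the function number from a devfn byte. */
static inline unsigned int pci_func(uint8_t devfn)
{
	return devfn & 0x07;
}
```

With this layout, virtio-net (slot 2, function 0) is devfn 0x10 and virtio-rdma (slot 2, function 1) is devfn 0x11, so both share a slot and differ only in the function bits.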

One note regarding the patchset:
we know it is not standard to combine patches from two repos in one
series, but we did so in order to show the whole of the Virtio RDMA work.

Thanks.

patch1: RDMA/virtio-rdma Introduce a new core cap prot (linux)
patch2: RDMA/virtio-rdma: VirtIO RDMA driver (linux)
        The main patch of the virtio-rdma driver in the Linux kernel
patch3: RDMA/virtio-rdma: VirtIO RDMA test module (linux)
        A test module
patch4: virtio-net: Move some virtio-net-pci decl to include/hw/virtio (qemu)
        Patch from Yuval Shaia
patch5: hw/virtio-rdma: VirtIO rdma device (qemu)
        The main patch of the virtio-rdma device in QEMU

Comments

Jason Wang Sept. 3, 2021, 12:57 a.m. UTC | #1
On Thu, Sep 2, 2021 at 9:07 PM Junji Wei <weijunji@bytedance.com> wrote:
>
> Hi all,
>
> This RFC aims to reopen the discussion of Virtio RDMA.
> Now this is based on Yuval Shaia's RFC "VirtIO RDMA"
> which implemented a frame for Virtio RDMA and a simple
> control path (Not sure if Yuval Shaia has any further
> plan for it).
>
> We try to extend this work and implement a simple
> data-path and a completed control path. Now this can
> work with SEND, RECV and REG_MR in kernel. There is a
> simple test module in this patch that can communicate
> with ibv_rc_pingpong in rdma-core.
>
> During doing this work, we have found some problems and
> would like to ask for some suggestions from community:

I think it would be beneficial if you can post a spec patch.

Thanks
Junji Wei Sept. 3, 2021, 7:41 a.m. UTC | #2
> On Sep 3, 2021, at 8:57 AM, Jason Wang <jasowang@redhat.com> wrote:
> 
> On Thu, Sep 2, 2021 at 9:07 PM Junji Wei <weijunji@bytedance.com> wrote:
>> 
>> Hi all,
>> 
>> This RFC aims to reopen the discussion of Virtio RDMA.
>> Now this is based on Yuval Shaia's RFC "VirtIO RDMA"
>> which implemented a frame for Virtio RDMA and a simple
>> control path (Not sure if Yuval Shaia has any further
>> plan for it).
>> 
>> We try to extend this work and implement a simple
>> data-path and a completed control path. Now this can
>> work with SEND, RECV and REG_MR in kernel. There is a
>> simple test module in this patch that can communicate
>> with ibv_rc_pingpong in rdma-core.
>> 
>> During doing this work, we have found some problems and
>> would like to ask for some suggestions from community:
> 
> I think it would be beneficial if you can post a spec patch.

Ok, I will do it.

Thanks
Jason Gunthorpe Sept. 15, 2021, 1:43 p.m. UTC | #3
On Thu, Sep 02, 2021 at 09:06:20PM +0800, Junji Wei wrote:
> Hi all,
> 
> This RFC aims to reopen the discussion of Virtio RDMA.
> Now this is based on Yuval Shaia's RFC "VirtIO RDMA"
> which implemented a frame for Virtio RDMA and a simple
> control path (Not sure if Yuval Shaia has any further
> plan for it).
> 
> We try to extend this work and implement a simple
> data-path and a completed control path. Now this can
> work with SEND, RECV and REG_MR in kernel. There is a
> simple test module in this patch that can communicate
> with ibv_rc_pingpong in rdma-core.
> 
> During doing this work, we have found some problems and
> would like to ask for some suggestions from community:

These seem like serious problems! Shouldn't these be solved before
sending patches?

> 1. Each qp need two VQ, but qemu default only support 1024 VQ.
>    I think it is possible to multiplex the VQ, since the
>    cmd_post_send carry the qpn in request.

QPs and CQs need to have predictable fixed WQE sizes, I don't know how
you can reasonably expect to map them to a shared queue.

> 2. The virtio-rdma device's gid should equal to host rdma
>    device's gid. This means that we cannot use gid cache in
>    rdma subsystem. And theoretically the gid should also equal
>    to the device's netdev's ip address, how can we deal with
>    this conflict.

You have to follow the correct semantics: the GID flows from the guest
into the host and updates the host's GID table, not the other way
around.
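
As an illustration of why the GID follows the guest's IP configuration: for RoCE v2 over IPv4, the GID is normally the IPv4-mapped IPv6 form of the netdev address (::ffff:a.b.c.d). The sketch below builds such a GID. The struct and helper names are hypothetical and are not taken from the patches.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit GID container for this example. */
struct gid {
	uint8_t raw[16];
};

/* Build the IPv4-mapped GID ::ffff:a.b.c.d from four address bytes
 * given in network order. */
static inline struct gid gid_from_ipv4(const uint8_t ip[4])
{
	struct gid g;

	assert(ip != NULL);
	memset(g.raw, 0, sizeof(g.raw)); /* bytes 0..9 stay zero */
	g.raw[10] = 0xff;
	g.raw[11] = 0xff;
	memcpy(&g.raw[12], ip, 4);       /* bytes 12..15 carry the address */
	return g;
}
```

When the guest changes its netdev address, a GID built this way changes with it, and that new value is what has to be pushed into the host's GID table.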
 
> 3. How to support DMA mr? The verbs in host cannot support it.
>    And it seems hard to ping whole guest physical memory in qemu.

Either you have to trap the FRWR in the hypervisor and pin the memory,
remap the MR, etc., or you have to pin the entire guest and rely on
something like memory windows to emulate FRWR.
 
> 4. The FRMR api need to set key of MR through IB_WR_REG_MR.
>    But it is impossible to change a key of mr using uverbs.

FRMR is more like memory windows in user space, you can't support it
using just regular MRs.

>    In our implementation, we change the key of WR while post_send,
>    but this means the MR can only work with SEND and RECV since we
>    cannot change the key in the remote.

Yes, this is not a realistic solution.
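
To show why the key rewrite only helps locally, here is a minimal sketch of a guest-key to host-key lookup applied while posting a work request. All names are hypothetical and the table layout is an assumption, not what the patches do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping from the key the guest set via IB_WR_REG_MR
 * to the lkey the host assigned to the backing MR. */
struct key_map_entry {
	uint32_t guest_key;
	uint32_t host_lkey;
};

/* Look up guest_key and store the host key in *host_key.
 * Returns 0 on success, -1 if the guest key is unknown. */
static inline int translate_key(const struct key_map_entry *map, size_t n,
				uint32_t guest_key, uint32_t *host_key)
{
	assert(host_key != NULL);
	for (size_t i = 0; i < n; i++) {
		if (map[i].guest_key == guest_key) {
			*host_key = map[i].host_lkey;
			return 0;
		}
	}
	return -1;
}
```

A lookup like this can patch the SGE keys of a local SEND or RECV, because the backend sees every work request before the host does. An rkey used by a remote peer for RDMA READ or WRITE arrives on the wire and is checked on the host side without passing through such a table, which is the limitation described above.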

> 5. The GSI is not supported now. And we think it's a problem that
>    when the host receive a GSI package, it doesn't know which
>    device it belongs to.

Of course, GSI packets are not virtualized. You need to somehow
capture GSI messages for the entire GID that the guest is using. We
don't have any API to do this in userspace.

Jason
Junji Wei Sept. 22, 2021, 12:08 p.m. UTC | #4
> On Sep 15, 2021, at 9:43 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> On Thu, Sep 02, 2021 at 09:06:20PM +0800, Junji Wei wrote:
>> Hi all,
>> 
>> This RFC aims to reopen the discussion of Virtio RDMA.
>> Now this is based on Yuval Shaia's RFC "VirtIO RDMA"
>> which implemented a frame for Virtio RDMA and a simple
>> control path (Not sure if Yuval Shaia has any further
>> plan for it).
>> 
>> We try to extend this work and implement a simple
>> data-path and a completed control path. Now this can
>> work with SEND, RECV and REG_MR in kernel. There is a
>> simple test module in this patch that can communicate
>> with ibv_rc_pingpong in rdma-core.
>> 
>> During doing this work, we have found some problems and
>> would like to ask for some suggestions from community:
> 
> These seem like serious problems! Shouldn't these be solved before
> sending patches?
> 
>> 1. Each qp need two VQ, but qemu default only support 1024 VQ.
>>   I think it is possible to multiplex the VQ, since the
>>   cmd_post_send carry the qpn in request.
> 
> QPs and CQs need to have predictable fixed WQE sizes, I don't know how
> you can reasonably expect to map them to a shared queue.

Yes, it is a bad idea to multiplex the VQs. If we need more VQs,
we can extend QEMU and the virtio spec.
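
For a rough sense of the limit, assuming two VQs per QP and a cap of 1024 VQs per device as stated above, the QP budget can be sketched as follows. The helper name and the idea of a "reserved" count for control and completion queues are assumptions made for this example.

```c
#include <assert.h>

/* Number of QPs that fit when each QP consumes two virtqueues
 * (send + recv) and `reserved` virtqueues are set aside for other
 * uses such as the control path. Hypothetical helper. */
static inline unsigned int max_qps(unsigned int total_vqs,
				   unsigned int reserved)
{
	assert(reserved <= total_vqs);
	return (total_vqs - reserved) / 2;
}
```

With nothing reserved, 1024 VQs give at most 512 QPs, and every VQ used for a CQ or the control path lowers that number further.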

>> 2. The virtio-rdma device's gid should equal to host rdma
>>   device's gid. This means that we cannot use gid cache in
>>   rdma subsystem. And theoretically the gid should also equal
>>   to the device's netdev's ip address, how can we deal with
>>   this conflict.
> 
> You have to follow the correct semantics, the GID flows from the guest
> into the host and updates the hosts GID table, not the other way
> around.

Sure, this was my misunderstanding.

>> 3. How to support DMA mr? The verbs in host cannot support it.
>>   And it seems hard to ping whole guest physical memory in qemu.
> 
> Either you have to trap the FRWR in the hypervisor and pin the memory,
> remap the MR, etc or you have to pin the entire guest and rely on
> something like memory windows to emulate FRWR.

We want to implement an emulated RDMA device in userspace. Since
we can directly access the guest's physical memory in QEMU, it will be
easy to support DMA MRs.

>> 4. The FRMR api need to set key of MR through IB_WR_REG_MR.
>>   But it is impossible to change a key of mr using uverbs.
> 
> FRMR is more like memory windows in user space, you can't support it
> using just regular MRs.

It is hard to support this using uverbs, but it is easy to support
with uRDMA, where we have full control of MRs.

>> 5. The GSI is not supported now. And we think it's a problem that
>>   when the host receive a GSI package, it doesn't know which
>>   device it belongs to.
> 
> Of course, GSI packets are not virtualized. You need to somehow
> capture GSI messages for the entire GID that the guest is using. We
> don't have any API to do this in userspace.

If we implement a uRDMA device in QEMU, there is no need to distinguish
which device a packet belongs to, because there is only one device.

Thanks.

Junji
Leon Romanovsky Sept. 22, 2021, 1:06 p.m. UTC | #5
On Wed, Sep 22, 2021 at 08:08:44PM +0800, Junji Wei wrote:
> > On Sep 15, 2021, at 9:43 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:

<...>

> >> 4. The FRMR api need to set key of MR through IB_WR_REG_MR.
> >>   But it is impossible to change a key of mr using uverbs.
> > 
> > FRMR is more like memory windows in user space, you can't support it
> > using just regular MRs.
> 
> It is hard to support this using uverbs, but it is easy to support
> with uRDMA that we can get full control of mrs.

What is uRDMA?

Thanks
Junji Wei Sept. 22, 2021, 1:37 p.m. UTC | #6
On Wed, Sep 22, 2021 at 9:06 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Wed, Sep 22, 2021 at 08:08:44PM +0800, Junji Wei wrote:
> > > On Sep 15, 2021, at 9:43 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> <...>
>
> > >> 4. The FRMR api need to set key of MR through IB_WR_REG_MR.
> > >>   But it is impossible to change a key of mr using uverbs.
> > >
> > > FRMR is more like memory windows in user space, you can't support it
> > > using just regular MRs.
> >
> > It is hard to support this using uverbs, but it is easy to support
> > with uRDMA that we can get full control of mrs.
>
> What is uRDMA?

uRDMA is a software implementation of the RoCEv2 protocol like rxe.
We will implement it in QEMU with VFIO or DPDK.
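
As a rough idea of what a software RoCEv2 implementation has to handle on receive: each packet is a UDP datagram to port 4791 whose payload starts with a 12-byte Base Transport Header (BTH). The sketch below decodes a few BTH fields. The struct and function names are hypothetical and are not part of rxe or QEMU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROCEV2_UDP_DPORT 4791 /* UDP destination port used by RoCEv2 */
#define BTH_LEN 12            /* Base Transport Header size in bytes */

/* Hypothetical decoded view of the BTH fields used in this example. */
struct bth {
	uint8_t  opcode;
	uint16_t pkey;
	uint32_t dest_qpn; /* 24-bit destination QP number */
	uint32_t psn;      /* 24-bit packet sequence number */
	int      ack_req;  /* AckReq bit */
};

/* Decode the BTH at the start of a RoCEv2 UDP payload.
 * Returns 0 on success, -1 if the buffer is too short. */
static inline int bth_parse(const uint8_t *p, size_t len, struct bth *out)
{
	assert(p != NULL && out != NULL);
	if (len < BTH_LEN)
		return -1;
	out->opcode   = p[0];
	out->pkey     = (uint16_t)((p[2] << 8) | p[3]);
	out->dest_qpn = ((uint32_t)p[5] << 16) | ((uint32_t)p[6] << 8) | p[7];
	out->ack_req  = (p[8] >> 7) & 1;
	out->psn      = ((uint32_t)p[9] << 16) | ((uint32_t)p[10] << 8) | p[11];
	return 0;
}
```

Because the device itself parses the destination QP number and any keys carried in later headers, it can apply its own MR and key checks, which is the "full control of MRs" mentioned earlier in the thread.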

Thanks.
Junji
Leon Romanovsky Sept. 22, 2021, 1:59 p.m. UTC | #7
On Wed, Sep 22, 2021 at 09:37:37PM +0800, 魏俊吉 wrote:
> On Wed, Sep 22, 2021 at 9:06 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Wed, Sep 22, 2021 at 08:08:44PM +0800, Junji Wei wrote:
> > > > On Sep 15, 2021, at 9:43 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > <...>
> >
> > > >> 4. The FRMR api need to set key of MR through IB_WR_REG_MR.
> > > >>   But it is impossible to change a key of mr using uverbs.
> > > >
> > > > FRMR is more like memory windows in user space, you can't support it
> > > > using just regular MRs.
> > >
> > > It is hard to support this using uverbs, but it is easy to support
> > > with uRDMA that we can get full control of mrs.
> >
> > What is uRDMA?
> 
> uRDMA is a software implementation of the RoCEv2 protocol like rxe.
> We will implement it in QEMU with VFIO or DPDK.

ok, thanks

> 
> Thanks.
> Junji