[rdma-next,00/11] Elastic RDMA Adapter (ERDMA) driver

Message ID 20211221024858.25938-1-chengyou@linux.alibaba.com (mailing list archive)

Message

Cheng Xu Dec. 21, 2021, 2:48 a.m. UTC
Hello all,

This patch set introduces the Elastic RDMA Adapter (ERDMA) driver, which
was announced by Alibaba at the Apsara Conference 2021.

ERDMA enables large-scale RDMA acceleration in the Alibaba ECS
environment, initially offered in the g7re instance family. It can
significantly improve the efficiency of large-scale distributed computing
and communication, and scales dynamically with the cluster size of
Alibaba Cloud.

ERDMA is an RDMA networking adapter based on Alibaba MOC hardware. It
works in the VPC network environment (overlay network) and uses the
iWARP transport protocol. ERDMA supports reliable connection (RC), and
supports both kernel-space and user-space verbs. We already support
HPC/AI applications through libfabric, NoF, and some internal verbs
libraries such as xrdma and epsl.

For an ECS instance with RDMA enabled, two kinds of devices are
allocated: one for ERDMA, and one for the original netdev (virtio-net).
They are separate PCI devices. The ERDMA driver learns which netdev it
is attached to from its PCIe BAR space (by MAC address matching).
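
To illustrate the matching step, here is a minimal sketch of what the
lookup could look like (the BAR register offset below is hypothetical
and only for illustration, not the real hardware layout):

/*
 * Illustrative sketch only: find the netdev whose MAC address matches
 * the one the device exposes in its BAR space. ERDMA_REG_BOUND_MAC is
 * a made-up offset; the real driver defines its own register layout.
 */
#include <linux/etherdevice.h>
#include <linux/io.h>
#include <linux/netdevice.h>
#include <linux/rtnetlink.h>

#define ERDMA_REG_BOUND_MAC     0x40    /* hypothetical BAR offset */

static struct net_device *erdma_find_netdev(void __iomem *bar)
{
        struct net_device *ndev, *found = NULL;
        u8 mac[ETH_ALEN];

        /* The device reports the MAC of the ENI it is bound to. */
        memcpy_fromio(mac, bar + ERDMA_REG_BOUND_MAC, ETH_ALEN);

        rtnl_lock();
        for_each_netdev(&init_net, ndev) {
                if (ether_addr_equal(ndev->dev_addr, mac)) {
                        dev_hold(ndev); /* hold a reference for the ibdev */
                        found = ndev;
                        break;
                }
        }
        rtnl_unlock();

        return found;
}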

Thanks,
Cheng Xu

Cheng Xu (11):
  RDMA: Add ERDMA to rdma_driver_id definition
  RDMA/erdma: Add the hardware related definitions
  RDMA/erdma: Add main include file
  RDMA/erdma: Add cmdq implementation
  RDMA/erdma: Add event queue implementation
  RDMA/erdma: Add verbs header file
  RDMA/erdma: Add verbs implementation
  RDMA/erdma: Add connection management (CM) support
  RDMA/erdma: Add the erdma module
  RDMA/erdma: Add the ABI definitions
  RDMA/erdma: Add driver to kernel build environment

 MAINTAINERS                               |    8 +
 drivers/infiniband/Kconfig                |    1 +
 drivers/infiniband/hw/Makefile            |    1 +
 drivers/infiniband/hw/erdma/Kconfig       |   10 +
 drivers/infiniband/hw/erdma/Makefile      |    5 +
 drivers/infiniband/hw/erdma/erdma.h       |  381 +++++
 drivers/infiniband/hw/erdma/erdma_cm.c    | 1585 +++++++++++++++++++++
 drivers/infiniband/hw/erdma/erdma_cm.h    |  158 ++
 drivers/infiniband/hw/erdma/erdma_cmdq.c  |  489 +++++++
 drivers/infiniband/hw/erdma/erdma_cq.c    |  201 +++
 drivers/infiniband/hw/erdma/erdma_debug.c |  314 ++++
 drivers/infiniband/hw/erdma/erdma_debug.h |   18 +
 drivers/infiniband/hw/erdma/erdma_eq.c    |  346 +++++
 drivers/infiniband/hw/erdma/erdma_hw.h    |  474 ++++++
 drivers/infiniband/hw/erdma/erdma_main.c  |  711 +++++++++
 drivers/infiniband/hw/erdma/erdma_qp.c    |  624 ++++++++
 drivers/infiniband/hw/erdma/erdma_verbs.c | 1477 +++++++++++++++++++
 drivers/infiniband/hw/erdma/erdma_verbs.h |  366 +++++
 include/uapi/rdma/erdma-abi.h             |   49 +
 include/uapi/rdma/ib_user_ioctl_verbs.h   |    1 +
 20 files changed, 7219 insertions(+)
 create mode 100644 drivers/infiniband/hw/erdma/Kconfig
 create mode 100644 drivers/infiniband/hw/erdma/Makefile
 create mode 100644 drivers/infiniband/hw/erdma/erdma.h
 create mode 100644 drivers/infiniband/hw/erdma/erdma_cm.c
 create mode 100644 drivers/infiniband/hw/erdma/erdma_cm.h
 create mode 100644 drivers/infiniband/hw/erdma/erdma_cmdq.c
 create mode 100644 drivers/infiniband/hw/erdma/erdma_cq.c
 create mode 100644 drivers/infiniband/hw/erdma/erdma_debug.c
 create mode 100644 drivers/infiniband/hw/erdma/erdma_debug.h
 create mode 100644 drivers/infiniband/hw/erdma/erdma_eq.c
 create mode 100644 drivers/infiniband/hw/erdma/erdma_hw.h
 create mode 100644 drivers/infiniband/hw/erdma/erdma_main.c
 create mode 100644 drivers/infiniband/hw/erdma/erdma_qp.c
 create mode 100644 drivers/infiniband/hw/erdma/erdma_verbs.c
 create mode 100644 drivers/infiniband/hw/erdma/erdma_verbs.h
 create mode 100644 include/uapi/rdma/erdma-abi.h

Comments

Leon Romanovsky Dec. 21, 2021, 1:09 p.m. UTC | #1
On Tue, Dec 21, 2021 at 10:48:47AM +0800, Cheng Xu wrote:
> Hello all,
> 
> This patch set introduces the Elastic RDMA Adapter (ERDMA) driver, which
> was announced by Alibaba at the Apsara Conference 2021.
> 
> ERDMA enables large-scale RDMA acceleration in the Alibaba ECS
> environment, initially offered in the g7re instance family. It can
> significantly improve the efficiency of large-scale distributed computing
> and communication, and scales dynamically with the cluster size of
> Alibaba Cloud.
> 
> ERDMA is an RDMA networking adapter based on Alibaba MOC hardware. It
> works in the VPC network environment (overlay network) and uses the
> iWARP transport protocol. ERDMA supports reliable connection (RC), and
> supports both kernel-space and user-space verbs. We already support
> HPC/AI applications through libfabric, NoF, and some internal verbs
> libraries such as xrdma and epsl.

We will need to get the erdma provider implementation into rdma-core
too, in order to consider merging it.

> 
> For an ECS instance with RDMA enabled, two kinds of devices are
> allocated: one for ERDMA, and one for the original netdev (virtio-net).
> They are separate PCI devices. The ERDMA driver learns which netdev it
> is attached to from its PCIe BAR space (by MAC address matching).

This is very questionable. The netdev part should be kept in the
drivers/ethernet/... part of the kernel.

Thanks

Cheng Xu Dec. 22, 2021, 3:35 a.m. UTC | #2
On 12/21/21 9:09 PM, Leon Romanovsky wrote:
> On Tue, Dec 21, 2021 at 10:48:47AM +0800, Cheng Xu wrote:
>> Hello all,
>>
>> This patch set introduces the Elastic RDMA Adapter (ERDMA) driver, which
>> was announced by Alibaba at the Apsara Conference 2021.
>>
>> ERDMA enables large-scale RDMA acceleration in the Alibaba ECS
>> environment, initially offered in the g7re instance family. It can
>> significantly improve the efficiency of large-scale distributed computing
>> and communication, and scales dynamically with the cluster size of
>> Alibaba Cloud.
>>
>> ERDMA is an RDMA networking adapter based on Alibaba MOC hardware. It
>> works in the VPC network environment (overlay network) and uses the
>> iWARP transport protocol. ERDMA supports reliable connection (RC), and
>> supports both kernel-space and user-space verbs. We already support
>> HPC/AI applications through libfabric, NoF, and some internal verbs
>> libraries such as xrdma and epsl.
> 
> We will need to get the erdma provider implementation into rdma-core
> too, in order to consider merging it.

Sure, I will submit the erdma userspace provider implementation within
two days.

>>
>> For an ECS instance with RDMA enabled, two kinds of devices are
>> allocated: one for ERDMA, and one for the original netdev (virtio-net).
>> They are separate PCI devices. The ERDMA driver learns which netdev it
>> is attached to from its PCIe BAR space (by MAC address matching).
> 
> This is very questionable. The netdev part should be kept in the
> drivers/ethernet/... part of the kernel.
> 
> Thanks

The net device used in an Alibaba ECS instance is a virtio-net device,
driven by the virtio-pci/virtio-net drivers. The ERDMA device does not
need its own net device; it is attached to an existing virtio-net
device. The relationship between the ibdev and the netdev in erdma is
similar to siw/rxe.

Leon Romanovsky Dec. 23, 2021, 10:23 a.m. UTC | #3
On Wed, Dec 22, 2021 at 11:35:44AM +0800, Cheng Xu wrote:
> 

<...>

> > > 
> > > For an ECS instance with RDMA enabled, two kinds of devices are
> > > allocated: one for ERDMA, and one for the original netdev (virtio-net).
> > > They are separate PCI devices. The ERDMA driver learns which netdev it
> > > is attached to from its PCIe BAR space (by MAC address matching).
> > 
> > This is very questionable. The netdev part should be kept in the
> > drivers/ethernet/... part of the kernel.
> > 
> > Thanks
> 
> The net device used in an Alibaba ECS instance is a virtio-net device,
> driven by the virtio-pci/virtio-net drivers. The ERDMA device does not
> need its own net device; it is attached to an existing virtio-net
> device. The relationship between the ibdev and the netdev in erdma is
> similar to siw/rxe.

siw/rxe bind through the RDMA_NLDEV_CMD_NEWLINK netlink command, not
through MAC matching.
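
For reference, a rough sketch of the siw/rxe-style plumbing on the
driver side (the erdma names here are hypothetical, just to show the
shape of it):

#include <rdma/rdma_netlink.h>

/* Sketch of explicit NEWLINK binding as siw/rxe implement it. */
static int erdma_newlink(const char *ibdev_name, struct net_device *ndev)
{
        /* allocate the ib_device, associate it with ndev, register it */
        return 0;
}

static struct rdma_link_ops erdma_link_ops = {
        .type = "erdma",
        .newlink = erdma_newlink,
};

static int __init erdma_init_module(void)
{
        rdma_link_register(&erdma_link_ops);
        return 0;
}

/* Userspace then binds explicitly, e.g.:
 *   $ rdma link add erdma0 type erdma netdev eth0
 */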

Thanks
Cheng Xu Dec. 23, 2021, 12:59 p.m. UTC | #4
On 12/23/21 6:23 PM, Leon Romanovsky wrote:
> On Wed, Dec 22, 2021 at 11:35:44AM +0800, Cheng Xu wrote:
>>
> 
> <...>
> 
>>>>
>>>> For an ECS instance with RDMA enabled, two kinds of devices are
>>>> allocated: one for ERDMA, and one for the original netdev (virtio-net).
>>>> They are separate PCI devices. The ERDMA driver learns which netdev it
>>>> is attached to from its PCIe BAR space (by MAC address matching).
>>>
>>> This is very questionable. The netdev part should be kept in the
>>> drivers/ethernet/... part of the kernel.
>>>
>>> Thanks
>>
>> The net device used in an Alibaba ECS instance is a virtio-net device,
>> driven by the virtio-pci/virtio-net drivers. The ERDMA device does not
>> need its own net device; it is attached to an existing virtio-net
>> device. The relationship between the ibdev and the netdev in erdma is
>> similar to siw/rxe.
> 
> siw/rxe bind through the RDMA_NLDEV_CMD_NEWLINK netlink command, not
> through MAC matching.
> 
> Thanks

None of siw/rxe/erdma needs to implement the netdev part; this is what
I wanted to express when I said 'similar'.
What you mentioned (the bind mechanism) is one major difference between
erdma and siw/rxe. For siw/rxe, the user can attach an ibdev to any
netdev they want, but that is not true for erdma. When users buy the
erdma service, they must specify which ENI (elastic network interface)
it is bound to, which means the attached erdma device can only be bound
to that specific netdev. Since MAC addresses are unique within our ECS
instances, we use the MAC address as the identifier, so the driver
knows which netdev it should bind to.

Thanks,
Cheng Xu
Leon Romanovsky Dec. 23, 2021, 1:44 p.m. UTC | #5
On Thu, Dec 23, 2021 at 08:59:14PM +0800, Cheng Xu wrote:
> 
> 
> On 12/23/21 6:23 PM, Leon Romanovsky wrote:
> > On Wed, Dec 22, 2021 at 11:35:44AM +0800, Cheng Xu wrote:
> > > 
> > 
> > <...>
> > 
> > > > > 
> > > > > For an ECS instance with RDMA enabled, two kinds of devices are
> > > > > allocated: one for ERDMA, and one for the original netdev (virtio-net).
> > > > > They are separate PCI devices. The ERDMA driver learns which netdev it
> > > > > is attached to from its PCIe BAR space (by MAC address matching).
> > > > 
> > > > This is very questionable. The netdev part should be kept in the
> > > > drivers/ethernet/... part of the kernel.
> > > > 
> > > > Thanks
> > > 
> > > The net device used in an Alibaba ECS instance is a virtio-net device,
> > > driven by the virtio-pci/virtio-net drivers. The ERDMA device does not
> > > need its own net device; it is attached to an existing virtio-net
> > > device. The relationship between the ibdev and the netdev in erdma is
> > > similar to siw/rxe.
> > 
> > siw/rxe bind through the RDMA_NLDEV_CMD_NEWLINK netlink command, not
> > through MAC matching.
> > 
> > Thanks
> 
> None of siw/rxe/erdma needs to implement the netdev part; this is what
> I wanted to express when I said 'similar'.
> What you mentioned (the bind mechanism) is one major difference between
> erdma and siw/rxe. For siw/rxe, the user can attach an ibdev to any
> netdev they want, but that is not true for erdma. When users buy the
> erdma service, they must specify which ENI (elastic network interface)
> it is bound to, which means the attached erdma device can only be bound
> to that specific netdev. Since MAC addresses are unique within our ECS
> instances, we use the MAC address as the identifier, so the driver
> knows which netdev it should bind to.

Nothing prohibits you from implementing this MAC check in RDMA_NLDEV_CMD_NEWLINK.
I personally don't like the idea of bind logic being performed "magically".
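
A sketch of how that check could look inside a newlink callback;
erdma_find_dev_by_mac() and erdma_ib_register() are hypothetical
helpers, not the actual driver API:

/* Keep the explicit NEWLINK flow, but still enforce the MAC
 * constraint the hardware reports. Helper names are made up.
 */
static int erdma_newlink(const char *ibdev_name, struct net_device *ndev)
{
        struct erdma_dev *dev;

        /* Walk the probed erdma PCI devices and pick the one whose
         * BAR-reported MAC equals this netdev's address.
         */
        dev = erdma_find_dev_by_mac(ndev->dev_addr);
        if (!dev)
                return -ENODEV; /* no erdma device bound to this ENI */

        return erdma_ib_register(dev, ibdev_name, ndev);
}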

BTW,
1. No module parameters
2. No driver versions

Thanks

> 
> Thanks,
> Cheng Xu
Cheng Xu Dec. 24, 2021, 7:07 a.m. UTC | #6
On 12/23/21 9:44 PM, Leon Romanovsky wrote:
> On Thu, Dec 23, 2021 at 08:59:14PM +0800, Cheng Xu wrote:
>>
>>
>> On 12/23/21 6:23 PM, Leon Romanovsky wrote:
>>> On Wed, Dec 22, 2021 at 11:35:44AM +0800, Cheng Xu wrote:
>>>>
>>>
>>> <...>
>>>
>>>>>>
>>>>>> For an ECS instance with RDMA enabled, two kinds of devices are
>>>>>> allocated: one for ERDMA, and one for the original netdev (virtio-net).
>>>>>> They are separate PCI devices. The ERDMA driver learns which netdev it
>>>>>> is attached to from its PCIe BAR space (by MAC address matching).
>>>>>
>>>>> This is very questionable. The netdev part should be kept in the
>>>>> drivers/ethernet/... part of the kernel.
>>>>>
>>>>> Thanks
>>>>
>>>> The net device used in an Alibaba ECS instance is a virtio-net device,
>>>> driven by the virtio-pci/virtio-net drivers. The ERDMA device does not
>>>> need its own net device; it is attached to an existing virtio-net
>>>> device. The relationship between the ibdev and the netdev in erdma is
>>>> similar to siw/rxe.
>>>
>>> siw/rxe bind through the RDMA_NLDEV_CMD_NEWLINK netlink command, not
>>> through MAC matching.
>>>
>>> Thanks
>>
>> None of siw/rxe/erdma needs to implement the netdev part; this is what
>> I wanted to express when I said 'similar'.
>> What you mentioned (the bind mechanism) is one major difference between
>> erdma and siw/rxe. For siw/rxe, the user can attach an ibdev to any
>> netdev they want, but that is not true for erdma. When users buy the
>> erdma service, they must specify which ENI (elastic network interface)
>> it is bound to, which means the attached erdma device can only be bound
>> to that specific netdev. Since MAC addresses are unique within our ECS
>> instances, we use the MAC address as the identifier, so the driver
>> knows which netdev it should bind to.
> 
> Nothing prohibits you from implementing this MAC check in RDMA_NLDEV_CMD_NEWLINK.
> I personally don't like the idea of bind logic being performed "magically".
> 

OK, I agree with you that using RDMA_NLDEV_CMD_NEWLINK is better. But
it means that erdma will not be ready to use like other RDMA HCAs until
the user configures the link manually, which may not be friendly to
them. I'm not sure whether our current method is acceptable. If you
strongly recommend that we use RDMA_NLDEV_CMD_NEWLINK, we will change
to it.

Thanks,
Cheng Xu

> BTW,
> 1. No module parameters
> 2. No driver versions
> 

Will fix them.

> Thanks
> 
>>
>> Thanks,
>> Cheng Xu
Leon Romanovsky Dec. 24, 2021, 6:26 p.m. UTC | #7
On Fri, Dec 24, 2021 at 03:07:57PM +0800, Cheng Xu wrote:
> 
> 
> On 12/23/21 9:44 PM, Leon Romanovsky wrote:
> > On Thu, Dec 23, 2021 at 08:59:14PM +0800, Cheng Xu wrote:
> > > 
> > > 
> > > On 12/23/21 6:23 PM, Leon Romanovsky wrote:
> > > > On Wed, Dec 22, 2021 at 11:35:44AM +0800, Cheng Xu wrote:
> > > > > 
> > > > 
> > > > <...>
> > > > 
> > > > > > > 
> > > > > > > For an ECS instance with RDMA enabled, two kinds of devices are
> > > > > > > allocated: one for ERDMA, and one for the original netdev (virtio-net).
> > > > > > > They are separate PCI devices. The ERDMA driver learns which netdev it
> > > > > > > is attached to from its PCIe BAR space (by MAC address matching).
> > > > > > 
> > > > > > This is very questionable. The netdev part should be kept in the
> > > > > > drivers/ethernet/... part of the kernel.
> > > > > > 
> > > > > > Thanks
> > > > > 
> > > > > The net device used in an Alibaba ECS instance is a virtio-net device,
> > > > > driven by the virtio-pci/virtio-net drivers. The ERDMA device does not
> > > > > need its own net device; it is attached to an existing virtio-net
> > > > > device. The relationship between the ibdev and the netdev in erdma is
> > > > > similar to siw/rxe.
> > > > 
> > > > siw/rxe bind through the RDMA_NLDEV_CMD_NEWLINK netlink command, not
> > > > through MAC matching.
> > > > 
> > > > Thanks
> > > 
> > > None of siw/rxe/erdma needs to implement the netdev part; this is what
> > > I wanted to express when I said 'similar'.
> > > What you mentioned (the bind mechanism) is one major difference between
> > > erdma and siw/rxe. For siw/rxe, the user can attach an ibdev to any
> > > netdev they want, but that is not true for erdma. When users buy the
> > > erdma service, they must specify which ENI (elastic network interface)
> > > it is bound to, which means the attached erdma device can only be bound
> > > to that specific netdev. Since MAC addresses are unique within our ECS
> > > instances, we use the MAC address as the identifier, so the driver
> > > knows which netdev it should bind to.
> > 
> > Nothing prohibits you from implementing this MAC check in RDMA_NLDEV_CMD_NEWLINK.
> > I personally don't like the idea of bind logic being performed "magically".
> > 
> 
> OK, I agree with you that using RDMA_NLDEV_CMD_NEWLINK is better. But
> it means that erdma will not be ready to use like other RDMA HCAs until
> the user configures the link manually, which may not be friendly to
> them. I'm not sure whether our current method is acceptable. If you
> strongly recommend that we use RDMA_NLDEV_CMD_NEWLINK, we will change
> to it.

Before you rush to change that logic, could you please explain the
security model of this binding?

I, as the owner of a VM, can replace the kernel code with any code I
want and remove your MAC matching (or replace it with something
different). How will you protect against such a flow?

If you don't trust the VM, you should perform the binding in the
hypervisor, and the erdma driver will work out of the box in the VM.

Thanks

> 
> Thanks,
> Cheng Xu
> 
> > BTW,
> > 1. No module parameters
> > 2. No driver versions
> > 
> 
> Will fix them.
> 
> > Thanks
> > 
> > > 
> > > Thanks,
> > > Cheng Xu
Cheng Xu Dec. 25, 2021, 3:03 a.m. UTC | #10
On 12/25/21 2:26 AM, Leon Romanovsky wrote:
> On Fri, Dec 24, 2021 at 03:07:57PM +0800, Cheng Xu wrote:
>>
>>
>> On 12/23/21 9:44 PM, Leon Romanovsky wrote:
>>> On Thu, Dec 23, 2021 at 08:59:14PM +0800, Cheng Xu wrote:
>>>>
>>>>
>>>> On 12/23/21 6:23 PM, Leon Romanovsky wrote:
>>>>> On Wed, Dec 22, 2021 at 11:35:44AM +0800, Cheng Xu wrote:
>>>>>>
>>>>>
>>>>> <...>
>>>>>
>>>>>>>>
> > > > > > > > For an ECS instance with RDMA enabled, two kinds of devices are
> > > > > > > > allocated: one for ERDMA, and one for the original netdev (virtio-net).
> > > > > > > > They are separate PCI devices. The ERDMA driver learns which netdev it
> > > > > > > > is attached to from its PCIe BAR space (by MAC address matching).
>>>>>>>
>>>>>>> This is very questionable. The netdev part should be kept in the
>>>>>>> drivers/ethernet/... part of the kernel.
>>>>>>>
>>>>>>> Thanks
>>>>>>
> > > > > > The net device used in an Alibaba ECS instance is a virtio-net device,
> > > > > > driven by the virtio-pci/virtio-net drivers. The ERDMA device does not
> > > > > > need its own net device; it is attached to an existing virtio-net
> > > > > > device. The relationship between the ibdev and the netdev in erdma is
> > > > > > similar to siw/rxe.
>>>>>
> > > > > siw/rxe bind through the RDMA_NLDEV_CMD_NEWLINK netlink command, not
> > > > > through MAC matching.
>>>>>
>>>>> Thanks
>>>>
> > > > None of siw/rxe/erdma needs to implement the netdev part; this is what
> > > > I wanted to express when I said 'similar'.
> > > > What you mentioned (the bind mechanism) is one major difference between
> > > > erdma and siw/rxe. For siw/rxe, the user can attach an ibdev to any
> > > > netdev they want, but that is not true for erdma. When users buy the
> > > > erdma service, they must specify which ENI (elastic network interface)
> > > > it is bound to, which means the attached erdma device can only be bound
> > > > to that specific netdev. Since MAC addresses are unique within our ECS
> > > > instances, we use the MAC address as the identifier, so the driver
> > > > knows which netdev it should bind to.
>>>
> > > Nothing prohibits you from implementing this MAC check in RDMA_NLDEV_CMD_NEWLINK.
> > > I personally don't like the idea of bind logic being performed "magically".
>>>
>>
> > OK, I agree with you that using RDMA_NLDEV_CMD_NEWLINK is better. But
> > it means that erdma will not be ready to use like other RDMA HCAs until
> > the user configures the link manually, which may not be friendly to
> > them. I'm not sure whether our current method is acceptable. If you
> > strongly recommend that we use RDMA_NLDEV_CMD_NEWLINK, we will change
> > to it.
> 
> Before you rush to change that logic, could you please explain the
> security model of this binding?
> 
> I, as the owner of a VM, can replace the kernel code with any code I
> want and remove your MAC matching (or replace it with something
> different). How will you protect against such a flow?

I think this topic falls under attack prevention. One principle of
attack prevention in our cloud is that an attacker MUST NOT be able to
affect any users other than themselves.

Before I answer the question, I want to describe some more details of
our architecture.

In our MOC architecture, the virtio-net device (i.e., the virtio-net
back-end) is fully offloaded to the MOC, not implemented in the host
hypervisor. Each virtio-net device belongs to a vport, and if it has a
peer erdma device, the erdma device also belongs to that vport. The
protocol headers of the network flows through the virtio-net and erdma
devices must be consistent with the vport configuration (MAC address,
IP, etc.), which is enforced by checking the OVS rules.

Back to the question: we cannot prevent attackers from modifying the
code, binding devices incorrectly in the front end, or, in worse cases,
making the driver send invalid commands to the device. If the binding
is wrong, the erdma network will simply be unreachable, because the OVS
module in the MOC hardware can detect this situation and drop all the
invalid network packets, and this has no influence on other users.

> If you don't trust the VM, you should perform the binding in the
> hypervisor, and the erdma driver will work out of the box in the VM.

As mentioned above, we also have the binding configuration in the back
end (i.e., the MOC hardware); erdma works properly only when the
front-end configuration matches it.

Thanks,
Cheng Xu

> Thanks
>
Jason Gunthorpe Jan. 7, 2022, 2:24 p.m. UTC | #11
On Thu, Dec 23, 2021 at 08:59:14PM +0800, Cheng Xu wrote:
> 
> 
> On 12/23/21 6:23 PM, Leon Romanovsky wrote:
> > On Wed, Dec 22, 2021 at 11:35:44AM +0800, Cheng Xu wrote:
> > > 
> > 
> > <...>
> > 
> > > > > 
> > > > > For an ECS instance with RDMA enabled, two kinds of devices are
> > > > > allocated: one for ERDMA, and one for the original netdev (virtio-net).
> > > > > They are separate PCI devices. The ERDMA driver learns which netdev it
> > > > > is attached to from its PCIe BAR space (by MAC address matching).
> > > > 
> > > > This is very questionable. The netdev part should be kept in the
> > > > drivers/ethernet/... part of the kernel.
> > > > 
> > > > Thanks
> > > 
> > > The net device used in an Alibaba ECS instance is a virtio-net device,
> > > driven by the virtio-pci/virtio-net drivers. The ERDMA device does not
> > > need its own net device; it is attached to an existing virtio-net
> > > device. The relationship between the ibdev and the netdev in erdma is
> > > similar to siw/rxe.
> > 
> > siw/rxe bind through the RDMA_NLDEV_CMD_NEWLINK netlink command, not
> > through MAC matching.
> > 
> > Thanks
> 
> None of siw/rxe/erdma needs to implement the netdev part; this is what
> I wanted to express when I said 'similar'.
> What you mentioned (the bind mechanism) is one major difference between
> erdma and siw/rxe. For siw/rxe, the user can attach an ibdev to any
> netdev they want, but that is not true for erdma. When users buy the
> erdma service, they must specify which ENI (elastic network interface)
> it is bound to, which means the attached erdma device can only be bound
> to that specific netdev. Since MAC addresses are unique within our ECS
> instances, we use the MAC address as the identifier, so the driver
> knows which netdev it should bind to.

It really doesn't match our driver binding model to rely on MAC
addresses.

Our standard model would expect that the virtio-net driver would
detect that it has RDMA capability and spawn an aux device to link the
two things together.
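
A sketch of that model using the auxiliary bus (virtio-net does not
expose such a device today, so every name below is hypothetical):

#include <linux/auxiliary_bus.h>

/* Producer side, in the netdev driver: publish an aux device. */
static void rdma_adev_release(struct device *dev)
{
        /* real code would free per-device state here */
}

static struct auxiliary_device rdma_adev;

static int expose_rdma_aux(struct device *parent)
{
        int ret;

        rdma_adev.name = "rdma";        /* matched as "virtio_net.rdma" */
        rdma_adev.id = 0;
        rdma_adev.dev.parent = parent;
        rdma_adev.dev.release = rdma_adev_release;

        ret = auxiliary_device_init(&rdma_adev);
        if (ret)
                return ret;
        return auxiliary_device_add(&rdma_adev);
}

/* Consumer side, in the RDMA driver: bind to that aux device. */
static const struct auxiliary_device_id erdma_id_table[] = {
        { .name = "virtio_net.rdma" },
        {}
};
MODULE_DEVICE_TABLE(auxiliary, erdma_id_table);

static int erdma_aux_probe(struct auxiliary_device *adev,
                           const struct auxiliary_device_id *id)
{
        /* adev->dev.parent is the netdev's device, so the driver core
         * now ties the two lifecycles together.
         */
        return 0;
}

static struct auxiliary_driver erdma_aux_driver = {
        .probe = erdma_aux_probe,
        .id_table = erdma_id_table,
};
module_auxiliary_driver(erdma_aux_driver);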

Using net notifiers to try to link the lifecycles together has been a
mess so far.

Jason
Cheng Xu Jan. 10, 2022, 10:07 a.m. UTC | #12
On 1/7/22 10:24 PM, Jason Gunthorpe wrote:
> On Thu, Dec 23, 2021 at 08:59:14PM +0800, Cheng Xu wrote:
>>
>>
>> On 12/23/21 6:23 PM, Leon Romanovsky wrote:
>>> On Wed, Dec 22, 2021 at 11:35:44AM +0800, Cheng Xu wrote:
>>>>
>>>
>>> <...>
>>>
>>>>>>
>>>>>> For an ECS instance with RDMA enabled, two kinds of devices are
>>>>>> allocated: one for ERDMA, and one for the original netdev (virtio-net).
>>>>>> They are separate PCI devices. The ERDMA driver learns which netdev it
>>>>>> is attached to from its PCIe BAR space (by MAC address matching).
>>>>>
>>>>> This is very questionable. The netdev part should be kept in the
>>>>> drivers/ethernet/... part of the kernel.
>>>>>
>>>>> Thanks
>>>>
>>>> The net device used in an Alibaba ECS instance is a virtio-net device,
>>>> driven by the virtio-pci/virtio-net drivers. The ERDMA device does not
>>>> need its own net device; it is attached to an existing virtio-net
>>>> device. The relationship between the ibdev and the netdev in erdma is
>>>> similar to siw/rxe.
>>>
>>> siw/rxe bind through the RDMA_NLDEV_CMD_NEWLINK netlink command, not
>>> through MAC matching.
>>>
>>> Thanks
>>
>> None of siw/rxe/erdma needs to implement the netdev part; this is what
>> I wanted to express when I said 'similar'.
>> What you mentioned (the bind mechanism) is one major difference between
>> erdma and siw/rxe. For siw/rxe, the user can attach an ibdev to any
>> netdev they want, but that is not true for erdma. When users buy the
>> erdma service, they must specify which ENI (elastic network interface)
>> it is bound to, which means the attached erdma device can only be bound
>> to that specific netdev. Since MAC addresses are unique within our ECS
>> instances, we use the MAC address as the identifier, so the driver
>> knows which netdev it should bind to.
> 
> It really doesn't match our driver binding model to rely on MAC
> addresses.
> 
> Our standard model would expect that the virtio-net driver would
> detect that it has RDMA capability and spawn an aux device to link the
> two things together.
> 
> Using net notifiers to try to link the lifecycles together has been a
> mess so far.
Thanks for your explanation.

I guess this model requires that the netdev and its associated ibdev
share the same physical hardware (PCI device or platform device)? ERDMA
is a separate PCI device. It is only because the ENIs in our cloud are
virtio-net devices that we bind ERDMA to virtio-net; it could actually
work with other types of netdev as well.

As you and Leon said, using net notifiers is not a good approach. I am
modifying our bind mechanism to use RDMA_NLDEV_CMD_NEWLINK instead.

Thanks,
Cheng Xu

> Jason