
[RFC,v2] virtio-net: Add RoCE (RDMA over Converged Ethernet) support

Message ID 20220511095900.343-1-xieyongji@bytedance.com (mailing list archive)
State RFC

Commit Message

Yongji Xie May 11, 2022, 9:59 a.m. UTC
Hi all,

This RFC aims to introduce our recent work on enabling RoCE support
for virtio-net device.

To support RoCE, three types of virtqueues including RDMA send virtqueue,
RDMA receive virtqueue and RDMA completion virtqueue are introduced.
And control virtqueue is reused to support the RDMA control messages.

Now we support some basic RDMA semantics such as send/receive
and read/write operation.

To test with our demo:

1. Build Guest kernel [1] with config INFINIBAND_VIRTIO_RDMA

2. Build QEMU [2] with config VHOST_USER_RDMA

3. Build rdma-core [3]

4. Build and install DPDK (NOTE that we only tested on DPDK 20.11.3)

5. Build vhost-user-rdma [4]

6. Run vhost-user-rdma with command:
    $ ./vhost-user-rdma --vdev 'net_tap0' --lcore '1-3' -- -s '/tmp/vhost-rdma0'

7. Run qemu with command:
    $ qemu-system-x86_64 -chardev socket,path=/tmp/vhost-rdma0,id=vrdma \
      -device vhost-user-rdma-pci,page-per-vq,chardev=vrdma ...

[1] https://github.com/bytedance/linux/tree/virtio-net-roce
[2] https://github.com/bytedance/qemu/tree/vhost-user-rdma
[3] https://github.com/YongjiXie/rdma-core/tree/virtio-rdma
[4] https://github.com/YongjiXie/vhost-user-rdma

We have already tested it with ibv_rc_pingpong, ibv_ud_pingpong and some
others in rdma-core.

TODO:

1. Add support for Base Memory Management Extensions

2. Add support for atomic operation

3. Add support for SRQ

4. Add support for virtqueue resize

5. Add support for enabling/disabling virtqueue at runtime

Please review, thanks!

V1 to V2:
- Rework the implementation via extending virtio-net instead of
  introducing a new device type [Jason]
- Add address handle support

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Co-developed-by: Wei Junji <weijunji@bytedance.com>
Signed-off-by: Wei Junji <weijunji@bytedance.com>
---
 content.tex | 858 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 854 insertions(+), 4 deletions(-)

Comments

Jason Wang Aug. 4, 2022, 8:30 a.m. UTC | #1
On Wed, May 11, 2022 at 5:59 PM Xie Yongji <xieyongji@bytedance.com> wrote:
>
> Hi all,
>

I'm not very familiar with RoCE, so I'll try to give some comments from
a general virtio perspective.

> This RFC aims to introduce our recent work on enabling RoCE support
> for virtio-net device.

We need to clarify the RoCE version: is it RoCEv2 or not?

>
> To support RoCE, three types of virtqueues including RDMA send virtqueue,
> RDMA receive virtqueue and RDMA completion virtqueue are introduced.
> And control virtqueue is reused to support the RDMA control messages.
>
> Now we support some basic RDMA semantics such as send/receive
> and read/write operation.

It would be better to explain the advantages of this over the existing
pvrdma approach. I guess one advantage is that using virtio makes it
easier to connect to a userspace dataplane through vDPA/vhost-user?

>
> To test with our demo:
>
> 1. Build Guest kernel [1] with config INFINIBAND_VIRTIO_RDMA
>
> 2. Build QEMU [2] with config VHOST_USER_RDMA
>
> 3. Build rdma-core [3]
>
> 4. Build and install DPDK (NOTE that we only tested on DPDK 20.11.3)
>
> 5. Build vhost-user-rdma [4]
>
> 6. Run vhost-user-rdma with command:
>     $ ./vhost-user-rdma --vdev 'net_tap0' --lcore '1-3' -- -s '/tmp/vhost-rdma0'
>
> 7. Run qemu with command:
>     $ qemu-system-x86_64 -chardev socket,path=/tmp/vhost-rdma0,id=vrdma \
>       -device vhost-user-rdma-pci,page-per-vq,chardev=vrdma ...

It would be better to give some performance numbers (or even compare
it with pvrdma).

>
> [1] https://github.com/bytedance/linux/tree/virtio-net-roce
> [2] https://github.com/bytedance/qemu/tree/vhost-user-rdma
> [3] https://github.com/YongjiXie/rdma-core/tree/virtio-rdma
> [4] https://github.com/YongjiXie/vhost-user-rdma
>
> We have already tested it with ibv_rc_pingpong, ibv_ud_pingpong and some
> others in rdma-core.
>
> TODO:
>

We'd also better consider live migration support. From a quick
glance, it looks to me like trapping the cvq would be sufficient?

> 1. Add support for Base Memory Management Extensions
>
> 2. Add support for atomic operation
>
> 3. Add support for SRQ
>
> 4. Add support for virtqueue resize

Note that this is already supported by the spec via virtqueue reset.

>
> 5. Add support for enabling/disabling virtqueue at runtime

I guess virtqueue reset could help in this case.

>
> Please review, thanks!
>
> V1 to V2:
> - Rework the implementation via extending virtio-net instead of
>   introducing a new device type [Jason]
> - Add address handle support
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> Co-developed-by: Wei Junji <weijunji@bytedance.com>
> Signed-off-by: Wei Junji <weijunji@bytedance.com>
> ---
>  content.tex | 858 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 854 insertions(+), 4 deletions(-)

I wonder if there's some open-source RoCE transport device API that we
can re-use, so that we can just behave like a transport layer instead of
inventing new commands.

>
> diff --git a/content.tex b/content.tex
> index 7508dd1..646d82a 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -3008,7 +3008,10 @@ \section{Network Device}\label{sec:Device Types / Network Device}
>  placed in one virtqueue for receiving packets, and outgoing
>  packets are enqueued into another for transmission in that order.
>  A third command queue is used to control advanced filtering
> -features.
> +features. If the RoCE (RDMA over Converged Ethernet) capability
> +is enabled, the virtio network device can also transmit
> +and receive RDMA messages through the RDMA send, RDMA receive
> +and RDMA completion virtqueues.
>
>  \subsection{Device ID}\label{sec:Device Types / Network Device / Device ID}
>
> @@ -3023,13 +3026,24 @@ \subsection{Virtqueues}\label{sec:Device Types / Network Device / Virtqueues}
>  \item[2(N-1)] receiveqN
>  \item[2(N-1)+1] transmitqN
>  \item[2N] controlq
> +\item[2N+1] rdma_completeq1
> +\item[\ldots]
> +\item[2N+M] rdma_completeqM
> +\item[2N+M+1] rdma_transmitq1
> +\item[2N+M+2] rdma_receiveq1
> +\item[\ldots]
> +\item[2N+M+2L-1] rdma_transmitqL
> +\item[2N+M+2L] rdma_receiveqL
>  \end{description}
>
>   N=1 if neither VIRTIO_NET_F_MQ nor VIRTIO_NET_F_RSS are negotiated, otherwise N is set by
> - \field{max_virtqueue_pairs}.
> + \field{max_virtqueue_pairs}. M is set by \field{max_rdma_cqs} and L is set by
> + \field{max_rdma_qps}.
>
>   controlq only exists if VIRTIO_NET_F_CTRL_VQ set.
>
> + rdma_completeq, rdma_transmitq and rdma_receiveq only exist if VIRTIO_NET_F_ROCE set
> +
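
As a reading aid, the index layout above can be sketched in C. This is only
a sketch: the helper names are hypothetical and not part of the patch, and
it assumes i and j are 1-based exactly as in the list (N net queue pairs,
M = max_rdma_cqs, L = max_rdma_qps).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helpers (names are not from the patch).
 * n = number of net queue pairs (N), m = max_rdma_cqs (M),
 * l = max_rdma_qps (L). i and j are 1-based, as in the list above.
 */

/* Index of rdma_completeq<i>: the completion queues follow controlq (2N). */
static inline uint32_t rdma_completeq_index(uint32_t n, uint32_t m, uint32_t i)
{
        assert(i >= 1 && i <= m);
        return 2 * n + i;
}

/* Index of rdma_transmitq<j>: 2N+M+2j-1. */
static inline uint32_t rdma_transmitq_index(uint32_t n, uint32_t m,
                                            uint32_t l, uint32_t j)
{
        assert(j >= 1 && j <= l);
        return 2 * n + m + 2 * j - 1;
}

/* Index of rdma_receiveq<j>: 2N+M+2j. */
static inline uint32_t rdma_receiveq_index(uint32_t n, uint32_t m,
                                           uint32_t l, uint32_t j)
{
        assert(j >= 1 && j <= l);
        return 2 * n + m + 2 * j;
}
```

Note that, as the list is written, each RDMA pair places the transmit queue
before the receive queue, which is the opposite order of receiveqN/transmitqN.
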
>  \subsection{Feature bits}\label{sec:Device Types / Network Device / Feature bits}
>
>  \begin{description}
> @@ -3084,6 +3098,9 @@ \subsection{Feature bits}\label{sec:Device Types / Network Device / Feature bits
>  \item[VIRTIO_NET_F_CTRL_MAC_ADDR(23)] Set MAC address through control
>      channel.
>
> +\item[VIRTIO_NET_F_ROCE(55)] Device supports RoCE (RDMA over Converged Ethernet)
> +    capability.
> +
>  \item[VIRTIO_NET_F_HOST_USO (56)] Device can receive USO packets. Unlike UFO
>   (fragmenting the packet) the USO splits large UDP packet
>   to several segments when each of these smaller packets has UDP header.
> @@ -3129,6 +3146,7 @@ \subsubsection{Feature bit requirements}\label{sec:Device Types / Network Device
>  \item[VIRTIO_NET_F_GUEST_ANNOUNCE] Requires VIRTIO_NET_F_CTRL_VQ.
>  \item[VIRTIO_NET_F_MQ] Requires VIRTIO_NET_F_CTRL_VQ.
>  \item[VIRTIO_NET_F_CTRL_MAC_ADDR] Requires VIRTIO_NET_F_CTRL_VQ.
> +\item[VIRTIO_NET_F_ROCE] Requires VIRTIO_NET_F_CTRL_VQ.
>  \item[VIRTIO_NET_F_RSC_EXT] Requires VIRTIO_NET_F_HOST_TSO4 or VIRTIO_NET_F_HOST_TSO6.
>  \item[VIRTIO_NET_F_RSS] Requires VIRTIO_NET_F_CTRL_VQ.
>  \end{description}
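
The new dependency can be illustrated with a small check. This is a sketch
only: the function names are hypothetical, bit 55 is the value proposed in
this patch, and bit 17 is the existing VIRTIO_NET_F_CTRL_VQ value from the
virtio spec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_NET_F_CTRL_VQ 17 /* existing feature bit */
#define VIRTIO_NET_F_ROCE    55 /* value proposed in this patch */

/* Return true if the given feature bit is set in a 64-bit feature word. */
static inline bool feature_set(uint64_t features, unsigned int bit)
{
        assert(bit < 64);
        return (features >> bit) & 1;
}

/*
 * Hypothetical validity check: VIRTIO_NET_F_ROCE requires
 * VIRTIO_NET_F_CTRL_VQ, so ROCE without CTRL_VQ is rejected.
 */
static inline bool roce_features_valid(uint64_t features)
{
        if (!feature_set(features, VIRTIO_NET_F_ROCE))
                return true;
        return feature_set(features, VIRTIO_NET_F_CTRL_VQ);
}
```
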
> @@ -3190,6 +3208,8 @@ \subsection{Device configuration layout}\label{sec:Device Types / Network Device
>          u8 rss_max_key_size;
>          le16 rss_max_indirection_table_length;
>          le32 supported_hash_types;
> +        le32 max_rdma_qps;
> +        le32 max_rdma_cqs;
>  };
>  \end{lstlisting}
>  The following field, \field{rss_max_key_size} only exists if VIRTIO_NET_F_RSS or VIRTIO_NET_F_HASH_REPORT is set.
> @@ -3204,11 +3224,23 @@ \subsection{Device configuration layout}\label{sec:Device Types / Network Device
>  Field \field{supported_hash_types} contains the bitmask of supported hash types.
>  See \ref{sec:Device Types / Network Device / Device Operation / Processing of Incoming Packets / Hash calculation for incoming packets / Supported/enabled hash types} for details of supported hash types.
>
> +Field \field{max_rdma_qps} only exists if VIRTIO_NET_F_ROCE is set.
> +It specifies the maximum number of queue pairs (send virtqueue and receive virtqueue) for RoCE usage.
> +
> +Field \field{max_rdma_cqs} only exists if VIRTIO_NET_F_ROCE is set.
> +It specifies the maximum number of completion virtqueues for RoCE usage.
> +
>  \devicenormative{\subsubsection}{Device configuration layout}{Device Types / Network Device / Device configuration layout}
>
>  The device MUST set \field{max_virtqueue_pairs} to between 1 and 0x8000 inclusive,
>  if it offers VIRTIO_NET_F_MQ.
>
> +The device MUST set \field{max_rdma_qps} to between 1 and 16384 inclusive,
> +if it offers VIRTIO_NET_F_ROCE.

I wonder why 16384 is chosen here?

> +
> +The device MUST set \field{max_rdma_cqs} to between 1 and 16384 inclusive,
> +if it offers VIRTIO_NET_F_ROCE.
> +
>  The device MUST set \field{mtu} to between 68 and 65535 inclusive,
>  if it offers VIRTIO_NET_F_MTU.
>
> @@ -3306,6 +3338,12 @@ \subsection{Device Initialization}\label{sec:Device Types / Network Device / Dev
>  \item If the VIRTIO_NET_F_CTRL_VQ feature bit is negotiated,
>    identify the control virtqueue.
>
> +\item If the VIRTIO_NET_F_ROCE feature bit is negotiated,
> +  identify the RDMA completion virtqueues, up to max_rdma_cqs.
> +
> +\item If the VIRTIO_NET_F_ROCE feature bit is negotiated,
> +  identify the RDMA send and receive virtqueues, up to max_rdma_qps.
> +
>  \item Fill the receive queues with buffers: see \ref{sec:Device Types / Network Device / Device Operation / Setting Up Receive Buffers}.
>
>  \item Even with VIRTIO_NET_F_MQ, only receiveq1, transmitq1 and
> @@ -4007,6 +4045,7 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi
>          u8 command;
>          u8 command-specific-data[];
>          u8 ack;
> +        u8 ack-specific-data[];
>  };
>
>  /* ack values */
> @@ -4015,8 +4054,8 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi
>  \end{lstlisting}
>
>  The \field{class}, \field{command} and command-specific-data are set by the
> -driver, and the device sets the \field{ack} byte. There is little it can
> -do except issue a diagnostic if \field{ack} is not
> +driver, and the device sets the \field{ack} byte and ack-specific-data. There
> +is little it can do except issue a diagnostic if \field{ack} is not
>  VIRTIO_NET_OK.
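
To make the extended layout concrete, here is a minimal sketch of how a
driver might encode one of the RoCE commands defined later in this patch
(CREATE_CQ) and decode its reply. The helper names are hypothetical, and the
sketch assumes the fields are laid out back to back with no padding, which
the patch text does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_NET_CTRL_ROCE           6 /* class, from this patch */
#define VIRTIO_NET_CTRL_ROCE_CREATE_CQ 2 /* command, from this patch */
#define VIRTIO_NET_OK                  0

/* Store a 32-bit value in little-endian byte order. */
static void put_le32(uint8_t *p, uint32_t v)
{
        p[0] = v & 0xff;
        p[1] = (v >> 8) & 0xff;
        p[2] = (v >> 16) & 0xff;
        p[3] = (v >> 24) & 0xff;
}

/* Load a 32-bit little-endian value. */
static uint32_t get_le32(const uint8_t *p)
{
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Device-readable part: class, command, then
 * struct virtio_rdma_cmd_create_cq { le32 cqe; }.
 * Returns the number of bytes written.
 */
static size_t encode_create_cq(uint8_t *out, size_t out_len, uint32_t cqe)
{
        assert(out_len >= 6);
        out[0] = VIRTIO_NET_CTRL_ROCE;
        out[1] = VIRTIO_NET_CTRL_ROCE_CREATE_CQ;
        put_le32(out + 2, cqe);
        return 6;
}

/*
 * Device-writable part: ack byte, then
 * struct virtio_rdma_ack_create_cq { le32 cqn; }.
 * Returns 0 and stores the CQ index on success, -1 otherwise.
 */
static int decode_create_cq_ack(const uint8_t *in, size_t in_len,
                                uint32_t *cqn)
{
        assert(in_len >= 5);
        if (in[0] != VIRTIO_NET_OK)
                return -1;
        *cqn = get_le32(in + 1);
        return 0;
}
```

With this layout the ack-specific-data is only meaningful when the ack byte
is VIRTIO_NET_OK, so the decoder checks the ack before reading anything else.
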
>
>  \paragraph{Packet Receive Filtering}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / Packet Receive Filtering}
> @@ -4463,6 +4502,534 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi
>  according to the native endian of the guest rather than
>  (necessarily when not using the legacy interface) little-endian.
>
> +\paragraph{RoCE Configuration}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / RoCE Configuration}
> +
> +If the driver negotiates the VIRTIO_NET_F_ROCE feature bit (depends on VIRTIO_NET_F_CTRL_VQ),
> +it can send control commands for RoCE usage. The following commands are defined now:
> +
> +\begin{lstlisting}
> +#define VIRTIO_NET_CTRL_ROCE    6
> + #define VIRTIO_NET_CTRL_ROCE_QUERY_DEVICE      0
> + #define VIRTIO_NET_CTRL_ROCE_QUERY_PORT        1
> + #define VIRTIO_NET_CTRL_ROCE_CREATE_CQ         2
> + #define VIRTIO_NET_CTRL_ROCE_DESTROY_CQ        3
> + #define VIRTIO_NET_CTRL_ROCE_CREATE_PD         4
> + #define VIRTIO_NET_CTRL_ROCE_DESTROY_PD        5
> + #define VIRTIO_NET_CTRL_ROCE_GET_DMA_MR        6
> + #define VIRTIO_NET_CTRL_ROCE_REG_USER_MR       7
> + #define VIRTIO_NET_CTRL_ROCE_DEREG_MR          8
> + #define VIRTIO_NET_CTRL_ROCE_CREATE_QP         9
> + #define VIRTIO_NET_CTRL_ROCE_MODIFY_QP         10
> + #define VIRTIO_NET_CTRL_ROCE_QUERY_QP          11
> + #define VIRTIO_NET_CTRL_ROCE_DESTROY_QP        12
> + #define VIRTIO_NET_CTRL_ROCE_CREATE_AH         13
> + #define VIRTIO_NET_CTRL_ROCE_DESTROY_AH        14
> + #define VIRTIO_NET_CTRL_ROCE_ADD_GID           15
> + #define VIRTIO_NET_CTRL_ROCE_DEL_GID           16
> + #define VIRTIO_NET_CTRL_ROCE_REQ_NOTIFY_CQ     17
> +\end{lstlisting}
> +
> +\begin{description}
> +\item[VIRTIO_NET_CTRL_ROCE_QUERY_DEVICE] Query the attributes of device.
> +  No command-specific-data;
> +  the ack-specific-data is \field{struct virtio_rdma_ack_query_device}.
> +
> +\begin{lstlisting}
> +struct virtio_rdma_ack_query_device {
> +#define VIRTIO_IB_DEVICE_RC_RNR_NAK_GEN    (1 << 0)

What's the meaning of this capability?

> +        /* Capabilities mask */
> +        le64 device_cap_flags;

Will this introduce a migration compatibility issue? E.g src and dst
have the same features but different capabilities.

> +        /* Largest contiguous block that can be registered */
> +        le64 max_mr_size;
> +        /* Supported memory shift sizes */
> +        le64 page_size_cap;
> +        /* Hardware version */
> +        le32 hw_ver;

What does "hardware version" mean? Is this something that is defined in
the IB spec?

> +        /* Maximum number of outstanding Work Requests (WR) on Send Queue (SQ) and Receive Queue (RQ) */
> +        le32 max_qp_wr;

Is this implied in the virtqueue size? If not, why?

> +        /* Maximum number of scatter/gather (s/g) elements per WR for SQ for non RDMA Read operations */
> +        le32 max_send_sge;
> +        /* Maximum number of s/g elements per WR for RQ for non RDMA Read operations */
> +        le32 max_recv_sge;
> +        /* Maximum number of s/g per WR for RDMA Read operations */
> +        le32 max_sge_rd;
> +        /* Maximum size of Completion Queue (CQ) */
> +        le32 max_cqe;

Need to specify the reason why we can't use the virtqueue size for the
completion queue.

> +        /* Maximum number of Memory Regions (MR) */
> +        le32 max_mr;
> +        /* Maximum number of Protection Domains (PD) */
> +        le32 max_pd;
> +        /* Maximum number of RDMA Read perations that can be outstanding per Queue Pair (QP) */

I guess you mean "operations" here.

> +        le32 max_qp_rd_atom;
> +        /* Maximum depth per QP for initiation of RDMA Read operations */

The member has an "atom" suffix, does it mean "atomic read" or other?

> +        le32 max_qp_init_rd_atom;
> +        /* Maximum number of Address Handles (AH) */
> +        le32 max_ah;
> +        /* Local CA ack delay */
> +        u8 local_ca_ack_delay;
> +        /* Padding */
> +        u8 padding[3];
> +        /* Reserved for future */
> +        le32 reserved[14];
> +};
> +\end{lstlisting}
> +
> +\item[VIRTIO_NET_CTRL_ROCE_QUERY_PORT] Query the attributes of port.
> +  No command-specific-data;
> +  the ack-specific-data is \field{struct virtio_rdma_ack_query_port}.
> +
> +\begin{lstlisting}
> +struct virtio_rdma_ack_query_port {
> +        /* Length of source Global Identifier (GID) table */
> +        le32 gid_tbl_len;
> +        /* Maximum message size */
> +        le32 max_msg_sz;

I guess this applies to all of read/write/send/receive? And is 4GB
sufficient for the future?

> +        /* Reserved for future */
> +        le32 reserved[6];
> +};
> +\end{lstlisting}
> +
> +\item[VIRTIO_NET_CTRL_ROCE_CREATE_CQ] Create a Completion Queue (CQ).
> +  The command-specific-data is \field{struct virtio_rdma_cmd_create_cq};
> +  the ack-specific-data is \field{struct virtio_rdma_ack_create_cq}.
> +
> +\begin{lstlisting}
> +struct virtio_rdma_cmd_create_cq {
> +        /* Size of CQ */
> +        le32 cqe;
> +};
> +
> +struct virtio_rdma_ack_create_cq {
> +        /* The index of CQ */
> +        le32 cqn;
> +};
> +\end{lstlisting}
> +
> +\item[VIRTIO_NET_CTRL_ROCE_DESTROY_CQ] Destroy a Completion Queue.
> +  The command-specific-data is \field{struct virtio_rdma_cmd_destroy_cq};
> +  no ack-specific-data.
> +
> +\begin{lstlisting}
> +struct virtio_rdma_cmd_destroy_cq {
> +        /* The index of CQ */
> +        le32 cqn;
> +};
> +\end{lstlisting}
> +
> +\item[VIRTIO_NET_CTRL_ROCE_CREATE_PD] Create a Protection Domain (PD).
> +  No command-specific-data;
> +  the ack-specific-data is \field{struct virtio_rdma_ack_create_pd}.
> +
> +\begin{lstlisting}
> +struct virtio_rdma_ack_create_pd {
> +        /* The handle of PD */
> +        le32 pdn;
> +};
> +\end{lstlisting}

Can this command always succeed? I mean, is there a limit on the total
number of PDs that a single RoCE device can support?

> +
> +\item[VIRTIO_NET_CTRL_ROCE_DESTROY_PD] Destroy a Protection Domain.
> +  The command-specific-data is \field{virtio_rdma_cmd_destroy_pd};
> +  no ack-specific-data.
> +
> +\begin{lstlisting}
> +struct virtio_rdma_cmd_destroy_pd {
> +        /* The handle of PD */
> +        le32 pdn;
> +};
> +\end{lstlisting}
> +
> +\item[VIRTIO_NET_CTRL_ROCE_GET_DMA_MR] Get the DMA Memory Region (MR)
> +  associated with one protection domain.

I wonder what's the difference between VIRTIO_NET_CTRL_ROCE_GET_DMA_MR
and USR_MR. Can we unify them?

> +  The command-specific-data is \field{virtio_rdma_cmd_get_dma_mr};
> +  the ack-specific-data is \field{virtio_rdma_ack_get_dma_mr}.
> +
> +\begin{lstlisting}
> +enum virtio_ib_access_flags {
> +        VIRTIO_IB_ACCESS_LOCAL_WRITE = (1 << 0),

Is LOCAL_READ implied to work always?

> +        VIRTIO_IB_ACCESS_REMOTE_WRITE = (1 << 1),
> +        VIRTIO_IB_ACCESS_REMOTE_READ = (1 << 2),
> +};
> +
> +struct virtio_rdma_cmd_get_dma_mr {
> +        /* The handle of PD which the MR associated with */
> +        le32 pdn;
> +        /* MR's protection attributes, enum virtio_ib_access_flags */
> +        le32 access_flags;
> +};
> +
> +struct virtio_rdma_ack_get_dma_mr {
> +        /* The handle of MR */
> +        le32 mrn;
> +        /* MR's local access key */
> +        le32 lkey;
> +        /* MR's remote access key */
> +        le32 rkey;
> +};
> +\end{lstlisting}
> +
> +\item[VIRTIO_NET_CTRL_ROCE_REG_USER_MR] Register a user Memory Region
> +  associated with one Protection Domain.
> +  The command-specific-data is \field{virtio_rdma_cmd_reg_user_mr};
> +  the ack-specific-data is \field{virtio_rdma_ack_reg_user_mr}.
> +
> +\begin{lstlisting}
> +struct virtio_rdma_cmd_reg_user_mr {
> +        /* The handle of PD which the MR associated with */
> +        le32 pdn;
> +        /* MR's protection attributes, enum virtio_ib_access_flags */
> +        le32 access_flags;
> +        /* Starting virtual address of MR */
> +        le64 virt_addr;

I guess this is actually the I/O virtual address and the device is in
charge of translating it to the page array below?

> +        /* Length of MR */
> +        le64 length;
> +        /* Size of the below page array */
> +        le32 npages;
> +        /* Padding */
> +        le32 padding;
> +        /* Array to store physical address of each page in MR */
> +        le64 pages[];

How does the device know the size of a page?

> +};

I believe this command can fail, we need to describe the error conditions.

> +
> +struct virtio_rdma_ack_reg_user_mr {
> +        /* The handle of MR */
> +        le32 mrn;
> +        /* MR's local access key */
> +        le32 lkey;
> +        /* MR's remote access key */
> +        le32 rkey;
> +};
> +\end{lstlisting}
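
For illustration, the relationship between virt_addr, length and npages
could look like the sketch below. The page size is passed in explicitly
because the patch does not say how the device learns it; the helper name is
hypothetical, and overflow of offset + length is not handled.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of pages spanned by a region that starts at
 * virt_addr and is length bytes long. page_size is an assumption supplied
 * by the caller and must be a power of two. The result would still have to
 * fit into the le32 npages field.
 */
static inline uint64_t mr_npages(uint64_t virt_addr, uint64_t length,
                                 uint64_t page_size)
{
        assert(length > 0);
        assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

        /* Offset of the first byte within its page. */
        uint64_t offset = virt_addr & (page_size - 1);

        /* Round the covered span up to whole pages. */
        return (offset + length + page_size - 1) / page_size;
}
```

A region that is not page aligned can span one more page than length alone
suggests, which is why the offset within the first page is included.
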
> +
> +\item[VIRTIO_NET_CTRL_ROCE_DEREG_MR] De-register a Memory Region.
> +  The command-specific-data is \field{virtio_rdma_cmd_dereg_mr};
> +  no ack-specific-data.
> +
> +\begin{lstlisting}
> +struct virtio_rdma_cmd_dereg_mr {
> +        /* The handle of MR */
> +        le32 mrn;
> +};
> +\end{lstlisting}
> +
> +\item[VIRTIO_NET_CTRL_ROCE_CREATE_QP] Create a Queue Pair (Send Queue and Receive Queue).
> +  The command-specific-data is \field{virtio_rdma_cmd_create_qp};
> +  the ack-specific-data is \field{virtio_rdma_ack_create_qp}.
> +
> +\begin{lstlisting}
> +struct virtio_rdma_qp_cap {
> +        /* Maximum number of outstanding WRs in SQ */
> +        le32 max_send_wr;
> +        /* Maximum number of outstanding WRs in RQ */
> +        le32 max_recv_wr;
> +        /* Maximum number of s/g elements per WR in SQ */
> +        le32 max_send_sge;
> +        /* Maximum number of s/g elements per WR in RQ */
> +        le32 max_recv_sge;
> +        /* Maximum amount of data (bytes) that can be posted inline to SQ */
> +        le32 max_inline_data;
> +        /* Padding */
> +        le32 padding;
> +};
> +
> +struct virtio_rdma_cmd_create_qp {
> +        /* The handle of PD which the QP associated with */
> +        le32 pdn;
> +#define VIRTIO_IB_QPT_SMI    0
> +#define VIRTIO_IB_QPT_GSI    1
> +#define VIRTIO_IB_QPT_RC     2
> +#define VIRTIO_IB_QPT_UC     3
> +#define VIRTIO_IB_QPT_UD     4
> +        /* QP's type */
> +        u8 qp_type;
> +        /* If set, each WR submitted to the SQ generates a completion entry */
> +        u8 sq_sig_all;
> +        /* Padding */
> +        u8 padding[2];
> +        /* The index of CQ which the SQ associated with */
> +        le32 send_cqn;
> +        /* The index of CQ which the RQ associated with */
> +        le32 recv_cqn;
> +        /* QP's capabilities */
> +        struct virtio_rdma_qp_cap cap;
> +        /* Reserved for future */
> +        le32 reserved[4];
> +};
> +
> +struct virtio_rdma_ack_create_qp {
> +        /* The index of QP */
> +        le32 qpn;
> +};
> +\end{lstlisting}
> +
> +\item[VIRTIO_NET_CTRL_ROCE_MODIFY_QP] Modify the attributes of a Queue Pair.
> +  The command-specific-data is \field{virtio_rdma_cmd_modify_qp};
> +  no ack-specific-data.
> +
> +\begin{lstlisting}
> +struct virtio_rdma_global_route {
> +        /* Destination GID or MGID */
> +        u8 dgid[16];
> +        /* Flow label */
> +        le32 flow_label;
> +        /* Source GID index */
> +        u8 sgid_index;
> +        /* Hop limit */
> +        u8 hop_limit;
> +        /* Traffic class */
> +        u8 traffic_class;
> +        /* Padding */
> +        u8 padding;
> +};
> +
> +struct virtio_rdma_ah_attr {
> +        /* Global Routing Header (GRH) attributes */
> +        struct virtio_rdma_global_route grh;
> +        /* Destination MAC address */
> +        u8 dmac[6];
> +        /* Reserved for future */
> +        u8 reserved[10];
> +};
> +
> +enum virtio_ib_qp_attr_mask {
> +        VIRTIO_IB_QP_STATE = (1 << 0),
> +        VIRTIO_IB_QP_CUR_STATE = (1 << 1),
> +        VIRTIO_IB_QP_ACCESS_FLAGS = (1 << 2),
> +        VIRTIO_IB_QP_QKEY = (1 << 3),
> +        VIRTIO_IB_QP_AV = (1 << 4),
> +        VIRTIO_IB_QP_PATH_MTU = (1 << 5),
> +        VIRTIO_IB_QP_TIMEOUT = (1 << 6),
> +        VIRTIO_IB_QP_RETRY_CNT = (1 << 7),
> +        VIRTIO_IB_QP_RNR_RETRY = (1 << 8),
> +        VIRTIO_IB_QP_RQ_PSN = (1 << 9),
> +        VIRTIO_IB_QP_MAX_QP_RD_ATOMIC = (1 << 10),
> +        VIRTIO_IB_QP_MIN_RNR_TIMER = (1 << 11),
> +        VIRTIO_IB_QP_SQ_PSN = (1 << 12),
> +        VIRTIO_IB_QP_MAX_DEST_RD_ATOMIC = (1 << 13),
> +        VIRTIO_IB_QP_CAP = (1 << 14),
> +        VIRTIO_IB_QP_DEST_QPN = (1 << 15),
> +        VIRTIO_IB_QP_RATE_LIMIT = (1 << 16),
> +};

Do we need to explain the above error codes? Or it's simply a map from IB spec?

> +
> +enum virtio_ib_qp_state {
> +        VIRTIO_IB_QPS_RESET,
> +        VIRTIO_IB_QPS_INIT,
> +        VIRTIO_IB_QPS_RTR,
> +        VIRTIO_IB_QPS_RTS,
> +        VIRTIO_IB_QPS_SQD,
> +        VIRTIO_IB_QPS_SQE,
> +        VIRTIO_IB_QPS_ERR
> +};
> +
> +enum virtio_ib_mtu {
> +        VIRTIO_IB_MTU_256 = 1,
> +        VIRTIO_IB_MTU_512 = 2,
> +        VIRTIO_IB_MTU_1024 = 3,
> +        VIRTIO_IB_MTU_2048 = 4,
> +        VIRTIO_IB_MTU_4096 = 5
> +};
> +
> +struct virtio_rdma_cmd_modify_qp {
> +        /* The index of QP */
> +        le32 qpn;
> +        /* The mask of attributes to be modified, enum virtio_ib_qp_attr_mask */
> +        le32 attr_mask;
> +        /* Move the QP to this state, enum virtio_ib_qp_state */
> +        u8 qp_state;
> +        /* Current QP state, enum virtio_ib_qp_state */
> +        u8 cur_qp_state;
> +        /* Path MTU (valid only for RC/UC QPs), enum virtio_ib_mtu */
> +        u8 path_mtu;
> +        /* Number of outstanding RDMA Read operations on destination QP (valid only for RC QPs) */
> +        u8 max_rd_atomic;
> +        /* Number of responder resources for handling incoming RDMA Read operations (valid only for RC QPs) */
> +        u8 max_dest_rd_atomic;
> +        /* Minimum RNR (Receiver Not Ready) NAK timer (valid only for RC QPs) */
> +        u8 min_rnr_timer;
> +        /* Local ack timeout (valid only for RC QPs) */
> +        u8 timeout;
> +        /* Retry count (valid only for RC QPs) */
> +        u8 retry_cnt;
> +        /* RNR retry (valid only for RC QPs) */
> +        u8 rnr_retry;
> +        /* Padding */
> +        u8 padding[7];
> +        /* Q_Key for the QP (valid only for UD QPs) */
> +        le32 qkey;
> +        /* PSN for RQ (valid only for RC/UC QPs) */
> +        le32 rq_psn;
> +        /* PSN for SQ */
> +        le32 sq_psn;
> +        /* Destination QP number (valid only for RC/UC QPs) */
> +        le32 dest_qp_num;
> +        /* Mask of enabled remote access operations (valid only for RC/UC QPs), enum virtio_ib_access_flags */
> +        le32 qp_access_flags;
> +        /* Rate limit in kbps for packet pacing */
> +        le32 rate_limit;
> +        /* QP capabilities */
> +        struct virtio_rdma_qp_cap cap;
> +        /* Address Vector (valid only for RC/UC QPs) */
> +        struct virtio_rdma_ah_attr ah_attr;
> +        /* Reserved for future */
> +        le32 reserved[4];
> +};
> +\end{lstlisting}
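
The enum virtio_ib_mtu values above encode the path MTU as a power of two.
A minimal sketch of the conversion to bytes follows; the helper name is
hypothetical, and the only assumption is that the enum values are exactly
those listed in the patch (1 for 256 bytes up to 5 for 4096 bytes).

```c
#include <assert.h>
#include <stdint.h>

/* Copied from the patch for self-containment. */
enum virtio_ib_mtu {
        VIRTIO_IB_MTU_256 = 1,
        VIRTIO_IB_MTU_512 = 2,
        VIRTIO_IB_MTU_1024 = 3,
        VIRTIO_IB_MTU_2048 = 4,
        VIRTIO_IB_MTU_4096 = 5
};

/* Hypothetical helper: convert the path_mtu encoding to a size in bytes. */
static inline uint32_t virtio_ib_mtu_to_bytes(uint8_t mtu)
{
        assert(mtu >= VIRTIO_IB_MTU_256 && mtu <= VIRTIO_IB_MTU_4096);
        return 128u << mtu;
}
```
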
> +
> +\item[VIRTIO_NET_CTRL_ROCE_QUERY_QP] Query the attributes of a Queue Pair.
> +  The command-specific-data is \field{virtio_rdma_cmd_query_qp};
> +  the ack-specific-data is \field{virtio_rdma_ack_query_qp}.
> +
> +\begin{lstlisting}
> +struct virtio_rdma_cmd_query_qp {
> +        /* The index of QP */
> +        le32 qpn;
> +        /* The mask of attributes to be queried, enum virtio_ib_qp_attr_mask */
> +        le32 attr_mask;
> +};
> +
> +struct virtio_rdma_ack_query_qp {

Any chance to unify this with virtio_rdma_cmd_modify_qp?

> +        /* Current QP state, enum virtio_ib_qp_state */
> +        u8 qp_state;
> +        /* Path MTU (valid only for RC/UC QPs), enum virtio_ib_mtu */
> +        u8 path_mtu;
> +        /* Is the SQ draining */
> +        u8 sq_draining;
> +        /* Number of outstanding RDMA read operations on destination QP (valid only for RC QPs) */
> +        u8 max_rd_atomic;
> +        /* Number of responder resources for handling incoming RDMA read operations (valid only for RC QPs) */
> +        u8 max_dest_rd_atomic;
> +        /* Minimum RNR NAK timer (valid only for RC QPs) */
> +        u8 min_rnr_timer;
> +        /* Local ack timeout (valid only for RC QPs) */
> +        u8 timeout;
> +        /* Retry count (valid only for RC QPs) */
> +        u8 retry_cnt;
> +        /* RNR retry (valid only for RC QPs) */
> +        u8 rnr_retry;
> +        /* Padding */
> +        u8 padding[7];
> +        /* Q_Key for the QP (valid only for UD QPs) */
> +        le32 qkey;
> +        /* PSN for RQ (valid only for RC/UC QPs) */
> +        le32 rq_psn;
> +        /* PSN for SQ */
> +        le32 sq_psn;
> +        /* Destination QP number (valid only for RC/UC QPs) */
> +        le32 dest_qp_num;
> +        /* Mask of enabled remote access operations (valid only for RC/UC QPs), enum virtio_ib_access_flags */
> +        le32 qp_access_flags;
> +        /* Rate limit in kbps for packet pacing */
> +        le32 rate_limit;
> +        /* QP capabilities */
> +        struct virtio_rdma_qp_cap cap;
> +        /* Address Vector (valid only for RC/UC QPs) */
> +        struct virtio_rdma_ah_attr ah_attr;
> +        /* Reserved for future */
> +        le32 reserved[4];
> +};
> +\end{lstlisting}
> +
> +\item[VIRTIO_NET_CTRL_ROCE_DESTROY_QP] Destroy a Queue Pair.
> +  The command-specific-data is \field{virtio_rdma_cmd_destroy_qp};
> +  no ack-specific-data.

What happens to the pending requests? Will the device wait for their
completion or not?

> +
> +\begin{lstlisting}
> +struct virtio_rdma_cmd_destroy_qp {
> +        /* The index of QP */
> +        le32 qpn;
> +};
> +\end{lstlisting}
> +
> +\item[VIRTIO_NET_CTRL_ROCE_CREATE_AH] Create an Address Handle (AH).
> +  The command-specific-data is \field{virtio_rdma_cmd_create_ah};
> +  the ack-specific-data is \field{virtio_rdma_ack_create_ah}.
> +
> +\begin{lstlisting}
> +struct virtio_rdma_cmd_create_ah {
> +        /* The handle of PD which the AH associated with */
> +        le32 pdn;
> +        /* Padding */
> +        le32 padding;
> +        /* Address Vector */
> +        struct virtio_rdma_ah_attr ah_attr;
> +};
> +
> +struct virtio_rdma_ack_create_ah {
> +        /* The address handle */
> +        le32 ah;
> +};
> +\end{lstlisting}
> +
> +\item[VIRTIO_NET_CTRL_ROCE_DESTROY_AH] Destroy an Address Handle.
> +  The command-specific-data is \field{virtio_rdma_cmd_destroy_ah};
> +  no ack-specific-data.
> +
> +\begin{lstlisting}
> +struct virtio_rdma_cmd_destroy_ah {
> +        /* The handle of PD which the AH associated with */
> +        le32 pdn;
> +        /* The address handle */
> +        le32 ah;
> +};
> +\end{lstlisting}
> +
> +\item[VIRTIO_NET_CTRL_ROCE_ADD_GID] Add a Global Identifier (GID).
> +  The command-specific-data is \field{virtio_rdma_cmd_add_gid};
> +  no ack-specific-data.
> +
> +\begin{lstlisting}
> +struct virtio_rdma_cmd_add_gid {
> +        /* Index of GID */
> +        le16 index;
> +        /* Padding */
> +        le16 padding[3];
> +        /* GID to be added */
> +        u8 gid[16];
> +};
> +\end{lstlisting}
> +
> +\item[VIRTIO_NET_CTRL_ROCE_DEL_GID] Delete a Global Identifier.
> +  The command-specific-data is \field{virtio_rdma_cmd_del_gid};
> +  no ack-specific-data.
> +
> +\begin{lstlisting}
> +struct virtio_rdma_cmd_del_gid {
> +        /* Index of GID */
> +        le16 index;
> +};
> +\end{lstlisting}
> +
> +\item[VIRTIO_NET_CTRL_ROCE_REQ_NOTIFY_CQ] Request a completion notification
> +  on a Completion Queue.
> +  The command-specific-data is \field{virtio_rdma_cmd_req_notify};
> +  no ack-specific-data.
> +
> +\begin{lstlisting}
> +struct virtio_rdma_cmd_req_notify {
> +        /* The index of CQ */
> +        le32 cqn;
> +#define VIRTIO_IB_NOTIFY_SOLICITED (1 << 0)
> +#define VIRTIO_IB_NOTIFY_NEXT_COMPLETION (1 << 1)

We need to describe the difference between these two flags.

> +        /* Notify flags */
> +        le32 flags;
> +};
> +\end{lstlisting}
> +
> +\end{description}
> +
> +\drivernormative{\subparagraph}{RoCE Configuration}{Device Types / Network Device / Device Operation / Control Virtqueue / RoCE Configuration}
> +
> +A driver MUST initialize the completion virtqueue and fill it with
> +enough entries after command VIRTIO_NET_CTRL_ROCE_CREATE_CQ is
> +successfully executed.
> +
> +A driver MUST reset the completion virtqueue after

How is the reset done? Do you mean the driver needs to reset the indices?

> +command VIRTIO_NET_CTRL_ROCE_DESTROY_CQ is successfully executed.
> +
> +A driver MUST initialize the send virtqueue and receive virtqueue after
> +command VIRTIO_NET_CTRL_ROCE_CREATE_QP is successfully executed.
> +
> +A driver MUST reset the send virtqueue and receive virtqueue after
> +command VIRTIO_NET_CTRL_ROCE_DESTROY_QP is successfully executed.
>
>  \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device
>  Types / Network Device / Legacy Interface: Framing Requirements}
> @@ -4496,6 +5063,289 @@ \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device
>  See \ref{sec:Basic
>  Facilities of a Virtio Device / Virtqueues / Message Framing}.
>
> +\subsubsection{RoCE Support}\label{sec:Device Types / Network Device / Device Operation / RoCE Support}
> +
> +RDMA over Converged Ethernet (RoCE) is a network protocol that allows
> +remote direct memory access (RDMA) over an Ethernet network. To support
> +RoCE (if VIRTIO_NET_F_ROCE is negotiated), in addition to the control
> +virtqueue support mentioned in \ref{sec:Device Types / Network Device /
> +Device Operation / Control Virtqueue / RoCE Configuration}, multiple
> +types of virtqueues including send virtqueue, receive virtqueue and
> +completion virtqueue are introduced.
> +
> +The send virtqueue contains elements that describe the data to be
> +transmitted.
> +
> +Requests (device-readable) have the following format:
> +
> +\begin{lstlisting}
> +enum virtio_ib_wr_opcode {
> +        VIRTIO_IB_WR_RDMA_WRITE,
> +        VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM,
> +        VIRTIO_IB_WR_SEND,
> +        VIRTIO_IB_WR_SEND_WITH_IMM,
> +        VIRTIO_IB_WR_RDMA_READ,
> +};
> +
> +struct virtio_rdma_sge {
> +        le64 addr;
> +        le32 length;
> +        le32 lkey;
> +};
> +
> +struct virtio_rdma_sq_req {
> +        /* User defined WR ID */
> +        le64 wr_id;
> +        /* WR opcode, enum virtio_ib_wr_opcode */
> +        u8 opcode;
> +#define VIRTIO_IB_SEND_FENCE        (1 << 0)
> +#define VIRTIO_IB_SEND_SIGNALED     (1 << 1)
> +#define VIRTIO_IB_SEND_SOLICITED    (1 << 2)
> +#define VIRTIO_IB_SEND_INLINE       (1 << 3)
> +        /* Flags of the WR properties */
> +        u8 send_flags;
> +        /* Padding */
> +        le16 padding;
> +        /* Immediate data (in network byte order) to send */
> +        le32 imm_data;
> +        union {
> +                struct {
> +                        /* Start address of remote memory buffer */
> +                        le64 remote_addr;
> +                        /* Key of the remote MR */
> +                        le32 rkey;
> +                } rdma;
> +                struct {
> +                        /* Index of the destination QP */
> +                        le32 remote_qpn;
> +                        /* Q_Key of the destination QP */
> +                        le32 remote_qkey;
> +                        /* Address Handle */
> +                        le32 ah;
> +                } ud;
> +                /* Reserved for future */
> +                le64 reserved[4];
> +        };
> +        /* Inline data */
> +        u8 inline_data[512];
> +        union {
> +                /* Length of sg_list */
> +                le32 num_sge;
> +                /* Length of inline data */
> +                le16 inline_len;
> +        };
> +        /* Reserved for future */
> +        le32 reserved2[3];
> +       /* Scatter/gather list */
> +        struct virtio_rdma_sge sg_list[];
> +};
> +\end{lstlisting}
> +
> +The receive virtqueue contains elements that describe where to place incoming data.
> +
> +Requests (device-readable) have the following format:
> +
> +\begin{lstlisting}
> +struct virtio_rdma_rq_req {
> +        /* User defined WR ID */
> +        le64 wr_id;
> +        /* Length of sg_list */
> +        le32 num_sge;
> +        /* Reserved for future */
> +        le32 reserved[3];
> +        /* Scatter/gather list */
> +        struct virtio_rdma_sge sg_list[];
> +};
> +\end{lstlisting}
> +
> +The completion virtqueue is used to notify the completion of requests in
> +send virtqueue or receive virtqueue.
> +
> +Requests (device-writable) have the following format:
> +
> +\begin{lstlisting}
> +enum virtio_ib_wc_opcode {
> +        VIRTIO_IB_WC_SEND,
> +        VIRTIO_IB_WC_RDMA_WRITE,
> +        VIRTIO_IB_WC_RDMA_READ,
> +        VIRTIO_IB_WC_RECV,
> +        VIRTIO_IB_WC_RECV_RDMA_WITH_IMM,
> +};
> +
> +enum virtio_ib_wc_status {
> +        /* Operation completed successfully */
> +        VIRTIO_IB_WC_SUCCESS,
> +        /* Local Length Error */
> +        VIRTIO_IB_WC_LOC_LEN_ERR,
> +        /* Local QP Operation Error */
> +        VIRTIO_IB_WC_LOC_QP_OP_ERR,
> +        /* Local Protection Error */
> +        VIRTIO_IB_WC_LOC_PROT_ERR,
> +        /* Work Request Flushed Error */
> +        VIRTIO_IB_WC_WR_FLUSH_ERR,
> +        /* Bad Response Error */
> +        VIRTIO_IB_WC_BAD_RESP_ERR,
> +        /* Local Access Error */
> +        VIRTIO_IB_WC_LOC_ACCESS_ERR,
> +        /* Remote Invalid Request Error */
> +        VIRTIO_IB_WC_REM_INV_REQ_ERR,
> +        /* Remote Access Error */
> +        VIRTIO_IB_WC_REM_ACCESS_ERR,
> +        /* Remote Operation Error */
> +        VIRTIO_IB_WC_REM_OP_ERR,
> +        /* Transport Retry Counter Exceeded */
> +        VIRTIO_IB_WC_RETRY_EXC_ERR,
> +        /* RNR Retry Counter Exceeded */
> +        VIRTIO_IB_WC_RNR_RETRY_EXC_ERR,
> +        /* Remote Aborted Error */
> +        VIRTIO_IB_WC_REM_ABORT_ERR,
> +        /* Fatal Error */
> +        VIRTIO_IB_WC_FATAL_ERR,
> +        /* Response Timeout Error */
> +        VIRTIO_IB_WC_RESP_TIMEOUT_ERR,
> +        /* General Error */
> +        VIRTIO_IB_WC_GENERAL_ERR
> +};
> +
> +struct virtio_rdma_cq_req {
> +        /* User defined WR ID */
> +        le64 wr_id;
> +        /* Work completion status, enum virtio_ib_wc_status */
> +        u8 status;
> +        /* WR opcode, enum virtio_ib_wc_opcode */
> +        u8 opcode;
> +        /* Padding */
> +        le16 padding;
> +        /* Vendor error */
> +        le32 vendor_err;
> +        /* Number of bytes transferred */
> +        le32 byte_len;
> +        /* Immediate data (in network byte order) to send */
> +        le32 imm_data;
> +        /* Local QP number of completed WR */
> +        le32 qp_num;
> +        /* Source QP number (remote QP number) of completed WR (valid only for UD QPs) */
> +        le32 src_qp;
> +#define VIRTIO_IB_WC_GRH         (1 << 0)
> +#define VIRTIO_IB_WC_WITH_IMM    (1 << 1)
> +        /* Work completion flag */
> +        le32 wc_flags;
> +        /* Reserved for future */
> +        le32 reserved[3];
> +};
> +\end{lstlisting}
> +
> +\paragraph{Send Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Send Operation}
> +
> +The send operation allows us to send data to a remote QP’s Receive Queue.
> +The receiver MUST have previously posted a receive buffer to receive the data.

The "MUST" keyword belongs in a normative section.

> +
> +To do a send operation, a request with \field{opcode} set to
> +VIRTIO_IB_WR_SEND or VIRTIO_IB_WR_SEND_WITH_IMM MUST be posted to the Send
> +Queue as one output descriptor and the device is notified of the new entry.
> +
> +\drivernormative{\subparagraph}{Send Operation}{Device Types / Network Device / Device Operation / RoCE Support / Send Operation}
> +
> +If VIRTIO_IB_SEND_INLINE is set in \field{send_flags}, the driver MUST fill
> +send buffer into \field{inline_data} field and set \field{inline_len} to the
> +length of the buffer. Otherwise, the driver MUST fill \field{sg_list} to
> +describe the buffer.
> +
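
A minimal sketch of the inline case described above may help readers. It
assumes a trimmed, host-endian mirror of struct virtio_rdma_sq_req (the
real struct has more fields and little-endian members), and the names
sq_req_sketch, sq_req_fill_inline and SQ_INLINE_MAX are hypothetical. The
value 2 for VIRTIO_IB_WR_SEND is only its position in the quoted enum.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VIRTIO_IB_SEND_INLINE (1 << 3)
#define VIRTIO_IB_WR_SEND     2    /* third entry of enum virtio_ib_wr_opcode */
#define SQ_INLINE_MAX         512  /* size of inline_data[] in the patch */

/* Trimmed, host-endian mirror of struct virtio_rdma_sq_req (sketch only). */
struct sq_req_sketch {
        uint64_t wr_id;
        uint8_t  opcode;
        uint8_t  send_flags;
        uint8_t  inline_data[SQ_INLINE_MAX];
        uint16_t inline_len;
};

/* Returns 0 on success, -1 if the payload does not fit inline. */
static int sq_req_fill_inline(struct sq_req_sketch *req, uint64_t wr_id,
                              const void *buf, size_t len)
{
        assert(req && buf);
        if (len > SQ_INLINE_MAX)
                return -1;
        memset(req, 0, sizeof(*req));
        req->wr_id = wr_id;
        req->opcode = VIRTIO_IB_WR_SEND;
        req->send_flags = VIRTIO_IB_SEND_INLINE;
        memcpy(req->inline_data, buf, len);
        req->inline_len = (uint16_t)len;
        return 0;
}
```

A payload larger than 512 bytes would have to go through sg_list instead,
with VIRTIO_IB_SEND_INLINE left clear.
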
> +\devicenormative{\subparagraph}{Send Operation}{Device Types / Network Device / Device Operation / RoCE Support / Send Operation}
> +
> +If \field{opcode} is not set to VIRTIO_IB_WR_SEND_WITH_IMM, the device MUST
> +ignore \field{imm_data}.
> +
> +If the QP type is UD, the device MUST validate \field{ud.ah}.
> +
> +If VIRTIO_IB_SEND_INLINE is not set in \field{send_flags}, the device MUST
> +validate the \field{addr}, \field{length} and \field{lkey} in \field{sg_list}.
> +
> +\paragraph{Receive Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Receive Operation}
> +
> +The receive operation allows us to receive data from remote QP.
> +It's the corresponding operation to a send operation.
> +
> +To do a receive operation, a request MUST be posted to the Receive
> +Queue as one output descriptor and the device is notified of the new entry.
> +

I think we probably need to be more verbose here, as was done for
virtio-net.

That is, describe in detail what needs to be filled in
virtio_rdma_rq_req (and do the same for the other operations).


> +\drivernormative{\subparagraph}{Receive Operation}{Device Types / Network Device / Device Operation / RoCE Support / Receive Operation}
> +
> +The driver MUST fill \field{sg_list} to describe the receive buffer.
> +
> +\devicenormative{\subparagraph}{Receive Operation}{Device Types / Network Device / Device Operation / RoCE Support / Receive Operation}
> +
> +The device MUST validate the \field{addr}, \field{length} and \field{lkey}
> +in \field{sg_list}.
> +
> +\paragraph{Write Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Write Operation}
> +
> +The write operation allows us to write data to the local memory buffer
> +in remote side with no notification. The remote side wouldn't be aware
> +that this operation is being done.
> +
> +To do a write operation, a request with \field{opcode} set to
> +VIRTIO_IB_WR_RDMA_WRITE or VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM MUST be
> +posted to the Send Queue as one output descriptor and the device is
> +notified of the new entry.
> +
> +\drivernormative{\subparagraph}{Write Operation}{Device Types / Network Device / Device Operation / RoCE Support / Write Operation}
> +
> +The driver MUST fill \field{sg_list} to describe the write buffer.

So an sg list is required even if the driver wants to use imm?

> +
> +The driver MUST fill \field{rdma.remote_addr} and \field{rdma.rkey} to
> +identify the remote buffer.
> +
> +\devicenormative{\subparagraph}{Write Operation}{Device Types / Network Device / Device Operation / RoCE Support / Write Operation}
> +
> +If \field{opcode} is not set to VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM, the device
> +MUST ignore \field{imm_data}.
> +
> +The device MUST validate the \field{addr}, \field{length} and \field{lkey}
> +in \field{sg_list}.
> +
> +\paragraph{Read Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Read Operation}
> +
> +The read operation allows us to read data from the local memory buffer
> +in remote side with no notification. The remote side wouldn't be aware
> +that this operation is being done.
> +
> +To do a read operation, a request with \field{opcode} set to
> +VIRTIO_IB_WR_RDMA_READ MUST be posted to the Send Queue as one output
> +descriptor and the device is notified of the new entry.
> +
> +\drivernormative{\subparagraph}{Read Operation}{Device Types / Network Device / Device Operation / RoCE Support / Read Operation}
> +
> +The driver MUST fill \field{sg_list} to describe the read buffer.
> +
> +The driver MUST fill \field{rdma.remote_addr} and \field{rdma.rkey} to
> +identify the remote buffer.
> +
> +\devicenormative{\subparagraph}{Read Operation}{Device Types / Network Device / Device Operation / RoCE Support / Read Operation}
> +
> +The device MUST validate the \field{addr}, \field{length} and \field{lkey}
> +in \field{sg_list}.
> +
> +\paragraph{Completion Notification}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Completion Notification}
> +
> +After above operation is completed, a completion notification MUST
> +be triggered by the device.

For "completion notification", do you mean the virtqueue notification
of the CQ, or marking the buffer that contains the CQE as used?

> To achieve that, the device MUST consume
> +an entry of the Completion Queue associated with the Send Queue/Receive
> +Queue which the operation belongs to.
> +
> +\drivernormative{\subparagraph}{Completion Notification}{Device Types / Network Device / Device Operation / RoCE Support / Completion Notification}
> +
> +The driver MUST fill the Completion Queue with enough entries previously.

What do you mean by "previously"? What happens if there are not enough CQEs?

Thanks

> +
> +\devicenormative{\subparagraph}{Completion Notification}{Device Types / Network Device / Device Operation / RoCE Support / Completion Notification}
> +
> +If \field{imm_data} is valid, the device MUST set VIRTIO_IB_WC_WITH_IMM to
> +\field{wc_flags}.
> +
> +The device MUST set \field{wr_id} to the value of \field{wr_id} of
> +corresponding \field{struct virtio_rdma_sq_req} or
> +\field{struct virtio_rdma_rq_req}.
> +
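
As a rough illustration of the two device requirements above, the sketch
below fills a completion entry from the device side. It assumes a trimmed,
host-endian mirror of struct virtio_rdma_cq_req; cq_req_sketch and
cq_req_complete are hypothetical names, and how the device learns that
immediate data is present is left to the caller.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VIRTIO_IB_WC_WITH_IMM (1 << 1)

/* Trimmed, host-endian mirror of struct virtio_rdma_cq_req (sketch only). */
struct cq_req_sketch {
        uint64_t wr_id;
        uint8_t  status;
        uint8_t  opcode;
        uint32_t imm_data;
        uint32_t wc_flags;
};

/* Copies wr_id from the originating request and flags valid imm_data. */
static void cq_req_complete(struct cq_req_sketch *cqe, uint64_t wr_id,
                            uint8_t status, uint8_t opcode,
                            int has_imm, uint32_t imm_data)
{
        assert(cqe);
        memset(cqe, 0, sizeof(*cqe));
        cqe->wr_id = wr_id;    /* same value as in the sq/rq request */
        cqe->status = status;
        cqe->opcode = opcode;
        if (has_imm) {
                cqe->imm_data = imm_data;
                cqe->wc_flags |= VIRTIO_IB_WC_WITH_IMM;
        }
}
```
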
>  \section{Block Device}\label{sec:Device Types / Block Device}
>
>  The virtio block device is a simple virtual block device (ie.
> --
> 2.11.0
>
Yongji Xie Aug. 9, 2022, 2:22 p.m. UTC | #2
On Thu, Aug 4, 2022 at 4:30 PM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, May 11, 2022 at 5:59 PM Xie Yongji <xieyongji@bytedance.com> wrote:
> >
> > Hi all,
> >
>
> I'm not very familiar with RoCE, so I'll try to give some comments at
> the general virtio level.
>

Thank you!

> > This RFC aims to introduce our recent work on enabling RoCE support
> > for virtio-net device.
>
> We need to clarify the version of RoCE: is it RoCE v2 or not?
>

Yes, it's RoCE v2.

> >
> > To support RoCE, three types of virtqueues including RDMA send virtqueue,
> > RDMA receive virtqueue and RDMA completion virtqueue are introduced.
> > And control virtqueue is reused to support the RDMA control messages.
> >
> > Now we support some basic RDMA semantics such as send/receive
> > and read/write operation.
>
> It would be better to explain the advantages of this over the existing
> pvrdma approach. I guess one advantage is that using virtio makes it
> easier to connect to a userspace dataplane through vDPA/vhost-user?
>

Yes, this is one advantage. Another one is that we don't need a
physical RDMA-capable NIC.

> >
> > To test with our demo:
> >
> > 1. Build Guest kernel [1] with config INFINIBAND_VIRTIO_RDMA
> >
> > 2. Build QEMU [2] with config VHOST_USER_RDMA
> >
> > 3. Build rdma-core [3]
> >
> > 4. Build and install DPDK (NOTE that we only tested on DPDK 20.11.3)
> >
> > 5. Build vhost-user-rdma [4]
> >
> > 6. Run vhost-user-rdma with command:
> >     $ ./vhost-user-rdma --vdev 'net_tap0' --lcore '1-3' -- -s '/tmp/vhost-rdma0'
> >
> > 7. Run qemu with command:
> >     $ qemu-system-x86_64 -chardev socket,path=/tmp/vhost-rdma0,id=vrdma \
> >       -device vhost-user-rdma-pci,page-per-vq,chardev=vrdma ...
>
> It would be better to give some performance numbers (or even compare
> it with pvrdma).
>

OK, will do it in v3.

> >
> > [1] https://github.com/bytedance/linux/tree/virtio-net-roce
> > [2] https://github.com/bytedance/qemu/tree/vhost-user-rdma
> > [3] https://github.com/YongjiXie/rdma-core/tree/virtio-rdma
> > [4] https://github.com/YongjiXie/vhost-user-rdma
> >
> > We have already tested it with ibv_rc_pingpong, ibv_ud_pingpong and some
> > others in rdma-core.
> >
> > TODO:
> >
>
> We should also consider live migration support. At a quick glance, it
> looks to me like trapping the cvq is sufficient?
>

I'm not sure. Each QP has its own state machine, which may also
require save & restore.

> > 1. Add support for Base Memory Management Extensions
> >
> > 2. Add support for atomic operation
> >
> > 3. Add support for SRQ
> >
> > 4. Add support for virtqueue resize
>
> Note that this is already supported by the spec via virtqueue reset.
>

OK.

> >
> > 5. Add support for enabling/disabling virtqueue at runtime
>
> I guess virtqueue reset could help in this case.
>

We might need to do some extension since we want to free the resources
when disabling the queue.

> >
> > Please review, thanks!
> >
> > V1 to V2:
> > - Rework the implementation via extending virtio-net instead of
> >   introducing a new device type [Jason]
> > - Add address handle support
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > Co-developed-by: Wei Junji <weijunji@bytedance.com>
> > Signed-off-by: Wei Junji <weijunji@bytedance.com>
> > ---
> >  content.tex | 858 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 854 insertions(+), 4 deletions(-)
>
> I wonder if there's some open-source RoCE transport device API that we
> can re-use, so that we can just behave like a transport layer instead of
> inventing new commands.
>

That would be better, but I didn't find one.

> >
> > diff --git a/content.tex b/content.tex
> > index 7508dd1..646d82a 100644
> > --- a/content.tex
> > +++ b/content.tex
> > @@ -3008,7 +3008,10 @@ \section{Network Device}\label{sec:Device Types / Network Device}
> >  placed in one virtqueue for receiving packets, and outgoing
> >  packets are enqueued into another for transmission in that order.
> >  A third command queue is used to control advanced filtering
> > -features.
> > +features. And if RoCE (RDMA over Converged Ethernet) capability
> > +is enabled, the virtio network device can also support transmitting
> > +and receiving RDMA message through RDMA send virtqueue, RDMA receive
> > +virtqueue and RDMA completion virtqueue.
> >
> >  \subsection{Device ID}\label{sec:Device Types / Network Device / Device ID}
> >
> > @@ -3023,13 +3026,24 @@ \subsection{Virtqueues}\label{sec:Device Types / Network Device / Virtqueues}
> >  \item[2(N-1)] receiveqN
> >  \item[2(N-1)+1] transmitqN
> >  \item[2N] controlq
> > +\item[2N+1] rdma_completeq1
> > +\item[\ldots]
> > +\item[2N+M] rdma_completeqM
> > +\item[2N+M+1] rdma_transmitq1
> > +\item[2N+M+2] rdma_receiveq1
> > +\item[\ldots]
> > +\item[2N+M+2L-1] rdma_transmitqL
> > +\item[2N+M+2L] rdma_receiveqL
> >  \end{description}
> >
> >   N=1 if neither VIRTIO_NET_F_MQ nor VIRTIO_NET_F_RSS are negotiated, otherwise N is set by
> > - \field{max_virtqueue_pairs}.
> > + \field{max_virtqueue_pairs}. M is set by \field{max_rdma_cqs} and L is set by
> > + \field{max_rdma_qps}.
> >
> >   controlq only exists if VIRTIO_NET_F_CTRL_VQ set.
> >
> > + rdma_completeq, rdma_transmitq and rdma_receiveq only exist if VIRTIO_NET_F_ROCE set
> > +
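
For reference, the numbering above can be sketched as a few index helpers.
This is only an illustration of the list in the patch: N, M and L are the
values of max_virtqueue_pairs, max_rdma_cqs and max_rdma_qps, the j and i
arguments are 1-based as in the list, and the struct and function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three counts used in the list above. */
struct vq_layout {
        uint32_t n;  /* max_virtqueue_pairs (or 1 without MQ/RSS) */
        uint32_t m;  /* max_rdma_cqs */
        uint32_t l;  /* max_rdma_qps */
};

static uint32_t controlq_index(const struct vq_layout *v)
{
        return 2 * v->n;
}

/* rdma_completeq j, 1 <= j <= M, sits at 2N + j. */
static uint32_t rdma_completeq_index(const struct vq_layout *v, uint32_t j)
{
        assert(j >= 1 && j <= v->m);
        return 2 * v->n + j;
}

/* rdma_transmitq i, 1 <= i <= L, sits at 2N + M + 2i - 1. */
static uint32_t rdma_transmitq_index(const struct vq_layout *v, uint32_t i)
{
        assert(i >= 1 && i <= v->l);
        return 2 * v->n + v->m + 2 * i - 1;
}

/* rdma_receiveq i, 1 <= i <= L, sits at 2N + M + 2i. */
static uint32_t rdma_receiveq_index(const struct vq_layout *v, uint32_t i)
{
        assert(i >= 1 && i <= v->l);
        return 2 * v->n + v->m + 2 * i;
}
```

With N=1, M=2 and L=2 this gives controlq at 2, the completion queues at 3
and 4, and the RDMA transmit/receive pairs at 5/6 and 7/8.
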
> >  \subsection{Feature bits}\label{sec:Device Types / Network Device / Feature bits}
> >
> >  \begin{description}
> > @@ -3084,6 +3098,9 @@ \subsection{Feature bits}\label{sec:Device Types / Network Device / Feature bits
> >  \item[VIRTIO_NET_F_CTRL_MAC_ADDR(23)] Set MAC address through control
> >      channel.
> >
> > +\item[VIRTIO_NET_F_ROCE(55)] Device supports RoCE (RDMA over Converged Ethernet)
> > +    capability.
> > +
> >  \item[VIRTIO_NET_F_HOST_USO (56)] Device can receive USO packets. Unlike UFO
> >   (fragmenting the packet) the USO splits large UDP packet
> >   to several segments when each of these smaller packets has UDP header.
> > @@ -3129,6 +3146,7 @@ \subsubsection{Feature bit requirements}\label{sec:Device Types / Network Device
> >  \item[VIRTIO_NET_F_GUEST_ANNOUNCE] Requires VIRTIO_NET_F_CTRL_VQ.
> >  \item[VIRTIO_NET_F_MQ] Requires VIRTIO_NET_F_CTRL_VQ.
> >  \item[VIRTIO_NET_F_CTRL_MAC_ADDR] Requires VIRTIO_NET_F_CTRL_VQ.
> > +\item[VIRTIO_NET_F_ROCE] Requires VIRTIO_NET_F_CTRL_VQ.
> >  \item[VIRTIO_NET_F_RSC_EXT] Requires VIRTIO_NET_F_HOST_TSO4 or VIRTIO_NET_F_HOST_TSO6.
> >  \item[VIRTIO_NET_F_RSS] Requires VIRTIO_NET_F_CTRL_VQ.
> >  \end{description}
> > @@ -3190,6 +3208,8 @@ \subsection{Device configuration layout}\label{sec:Device Types / Network Device
> >          u8 rss_max_key_size;
> >          le16 rss_max_indirection_table_length;
> >          le32 supported_hash_types;
> > +        le32 max_rdma_qps;
> > +        le32 max_rdma_cqs;
> >  };
> >  \end{lstlisting}
> >  The following field, \field{rss_max_key_size} only exists if VIRTIO_NET_F_RSS or VIRTIO_NET_F_HASH_REPORT is set.
> > @@ -3204,11 +3224,23 @@ \subsection{Device configuration layout}\label{sec:Device Types / Network Device
> >  Field \field{supported_hash_types} contains the bitmask of supported hash types.
> >  See \ref{sec:Device Types / Network Device / Device Operation / Processing of Incoming Packets / Hash calculation for incoming packets / Supported/enabled hash types} for details of supported hash types.
> >
> > +Field \field{max_rdma_qps} only exists if VIRTIO_NET_F_ROCE is set.
> > +It specifies the maximum number of queue pairs (send virtqueue and receive virtqueue) for RoCE usage.
> > +
> > +Field \field{max_rdma_cqs} only exists if VIRTIO_NET_F_ROCE is set.
> > +It specifies the maximum number of completion virtqueue for RoCE usage.
> > +
> >  \devicenormative{\subsubsection}{Device configuration layout}{Device Types / Network Device / Device configuration layout}
> >
> >  The device MUST set \field{max_virtqueue_pairs} to between 1 and 0x8000 inclusive,
> >  if it offers VIRTIO_NET_F_MQ.
> >
> > +The device MUST set \field{max_rdma_qps} to between 1 and 16384 inclusive,
> > +if it offers VIRTIO_NET_F_ROCE.
>
> I wonder why 16384 is chosen here?
>

Since the maximum number of virtqueues is limited to 65536 and we have
three types of RDMA queues, the number of queues of each type should be
less than 65536 / 3. We chose 65536 / 4 here.
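
The arithmetic can be checked with a small sketch. It only counts the RDMA
virtqueues (M completion queues plus L send/receive pairs) and ignores the
2N+1 virtqueues of the existing network part; the macro and function names
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTQ_LIMIT       65536u             /* limit mentioned above */
#define ROCE_PER_TYPE_MAX (VIRTQ_LIMIT / 4)  /* the value chosen in the patch */

/* Number of RDMA virtqueues: one per CQ, two (send + receive) per QP. */
static uint32_t roce_vq_count(uint32_t max_rdma_cqs, uint32_t max_rdma_qps)
{
        assert(max_rdma_cqs <= ROCE_PER_TYPE_MAX);
        assert(max_rdma_qps <= ROCE_PER_TYPE_MAX);
        return max_rdma_cqs + 2u * max_rdma_qps;
}
```

With both fields at 16384 this yields 49152 RDMA virtqueues, which leaves
16384 indices below the limit for everything else.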

> > +
> > +The device MUST set \field{max_rdma_cqs} to between 1 and 16384 inclusive,
> > +if it offers VIRTIO_NET_F_ROCE.
> > +
> >  The device MUST set \field{mtu} to between 68 and 65535 inclusive,
> >  if it offers VIRTIO_NET_F_MTU.
> >
> > @@ -3306,6 +3338,12 @@ \subsection{Device Initialization}\label{sec:Device Types / Network Device / Dev
> >  \item If the VIRTIO_NET_F_CTRL_VQ feature bit is negotiated,
> >    identify the control virtqueue.
> >
> > +\item If the VIRTIO_NET_F_ROCE feature bit is negotiated,
> > +  identify the RDMA completion virtqueues, up to max_rdma_cqs.
> > +
> > +\item If the VIRTIO_NET_F_ROCE feature bit is negotiated,
> > +  identify the RDMA send and receive virtqueues, up to max_rdma_qps.
> > +
> >  \item Fill the receive queues with buffers: see \ref{sec:Device Types / Network Device / Device Operation / Setting Up Receive Buffers}.
> >
> >  \item Even with VIRTIO_NET_F_MQ, only receiveq1, transmitq1 and
> > @@ -4007,6 +4045,7 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi
> >          u8 command;
> >          u8 command-specific-data[];
> >          u8 ack;
> > +        u8 ack-specific-data[];
> >  };
> >
> >  /* ack values */
> > @@ -4015,8 +4054,8 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi
> >  \end{lstlisting}
> >
> >  The \field{class}, \field{command} and command-specific-data are set by the
> > -driver, and the device sets the \field{ack} byte. There is little it can
> > -do except issue a diagnostic if \field{ack} is not
> > +driver, and the device sets the \field{ack} byte and ack-specific-data. There
> > +is little it can do except issue a diagnostic if \field{ack} is not
> >  VIRTIO_NET_OK.
> >
> >  \paragraph{Packet Receive Filtering}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / Packet Receive Filtering}
> > @@ -4463,6 +4502,534 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi
> >  according to the native endian of the guest rather than
> >  (necessarily when not using the legacy interface) little-endian.
> >
> > +\paragraph{RoCE Configuration}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / RoCE Configuration}
> > +
> > +If the driver negotiates the VIRTIO_NET_F_ROCE feature bit (depends on VIRTIO_NET_F_CTRL_VQ),
> > +it can send control commands for RoCE usage. The following commands are defined now:
> > +
> > +\begin{lstlisting}
> > +#define VIRTIO_NET_CTRL_ROCE    6
> > + #define VIRTIO_NET_CTRL_ROCE_QUERY_DEVICE      0
> > + #define VIRTIO_NET_CTRL_ROCE_QUERY_PORT        1
> > + #define VIRTIO_NET_CTRL_ROCE_CREATE_CQ         2
> > + #define VIRTIO_NET_CTRL_ROCE_DESTROY_CQ        3
> > + #define VIRTIO_NET_CTRL_ROCE_CREATE_PD         4
> > + #define VIRTIO_NET_CTRL_ROCE_DESTROY_PD        5
> > + #define VIRTIO_NET_CTRL_ROCE_GET_DMA_MR        6
> > + #define VIRTIO_NET_CTRL_ROCE_REG_USER_MR       7
> > + #define VIRTIO_NET_CTRL_ROCE_DEREG_MR          8
> > + #define VIRTIO_NET_CTRL_ROCE_CREATE_QP         9
> > + #define VIRTIO_NET_CTRL_ROCE_MODIFY_QP         10
> > + #define VIRTIO_NET_CTRL_ROCE_QUERY_QP          11
> > + #define VIRTIO_NET_CTRL_ROCE_DESTROY_QP        12
> > + #define VIRTIO_NET_CTRL_ROCE_CREATE_AH         13
> > + #define VIRTIO_NET_CTRL_ROCE_DESTROY_AH        14
> > + #define VIRTIO_NET_CTRL_ROCE_ADD_GID           15
> > + #define VIRTIO_NET_CTRL_ROCE_DEL_GID           16
> > + #define VIRTIO_NET_CTRL_ROCE_REQ_NOTIFY_CQ     17
> > +\end{lstlisting}
> > +
> > +\begin{description}
> > +\item[VIRTIO_NET_CTRL_ROCE_QUERY_DEVICE] Query the attributes of device.
> > +  No command-specific-data;
> > +  the ack-specific-data is \field{struct virtio_rdma_ack_query_device}.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_ack_query_device {
> > +#define VIRTIO_IB_DEVICE_RC_RNR_NAK_GEN    (1 << 0)
>
> What's the meaning of this capability?
>

It indicates whether the device supports RNR-NAK generation for RC
QPs. I will add some comments.

> > +        /* Capabilities mask */
> > +        le64 device_cap_flags;
>
> Will this introduce a migration compatibility issue? E.g src and dst
> have the same features but different capabilities.
>

Should this be controlled by the hypervisor, since all capabilities are
emulated in software?

> > +        /* Largest contiguous block that can be registered */
> > +        le64 max_mr_size;
> > +        /* Supported memory shift sizes */
> > +        le64 page_size_cap;
> > +        /* Hardware version */
> > +        le32 hw_ver;
>
> What did "hardware version" mean? Is this something that is defined in
> the IB spec?
>

Yes, it's defined in IB spec.

> > +        /* Maximum number of outstanding Work Requests (WR) on Send Queue (SQ) and Receive Queue (RQ) */
> > +        le32 max_qp_wr;
>
> Is this implied in the virtqueue size? If not, why?
>

Yes. Will remove it.

> > +        /* Maximum number of scatter/gather (s/g) elements per WR for SQ for non RDMA Read operations */
> > +        le32 max_send_sge;
> > +        /* Maximum number of s/g elements per WR for RQ for non RDMA Read operations */
> > +        le32 max_recv_sge;
> > +        /* Maximum number of s/g per WR for RDMA Read operations */
> > +        le32 max_sge_rd;
> > +        /* Maximum size of Completion Queue (CQ) */
> > +        le32 max_cqe;
>
> Need to specify the reason why we can't use the virtqueue size for the
> completion queue.
>

I think we can. Will remove it.

> > +        /* Maximum number of Memory Regions (MR) */
> > +        le32 max_mr;
> > +        /* Maximum number of Protection Domains (PD) */
> > +        le32 max_pd;
> > +        /* Maximum number of RDMA Read perations that can be outstanding per Queue Pair (QP) */
>
> I guess you mean "operations" here.
>

Yes.

> > +        le32 max_qp_rd_atom;
> > +        /* Maximum depth per QP for initiation of RDMA Read operations */
>
> The member has an "atom" suffix, does it mean "atomic read" or other?
>

It refers to atomic operations, which are unsupported for now. I think we
need to remove it.

> > +        le32 max_qp_init_rd_atom;
> > +        /* Maximum number of Address Handles (AH) */
> > +        le32 max_ah;
> > +        /* Local CA ack delay */
> > +        u8 local_ca_ack_delay;
> > +        /* Padding */
> > +        u8 padding[3];
> > +        /* Reserved for future */
> > +        le32 reserved[14];
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_QUERY_PORT] Query the attributes of port.
> > +  No command-specific-data;
> > +  the ack-specific-data is \field{struct virtio_rdma_ack_query_port}.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_ack_query_port {
> > +        /* Length of source Global Identifier (GID) table */
> > +        le32 gid_tbl_len;
> > +        /* Maximum message size */
> > +        le32 max_msg_sz;
>
> I guess this applies to all of read/write/send/receive? And is 4GB
> sufficient for the future?
>

For now this follows the definition in the Linux kernel and the IB spec.
If we need to extend it in the future, we can add a new field max_msg_sz64?

> > +        /* Reserved for future */
> > +        le32 reserved[6];
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_CREATE_CQ] Create a Completion Queue (CQ).
> > +  The command-specific-data is \field{struct virtio_rdma_cmd_create_cq};
> > +  the ack-specific-data is \field{struct virtio_rdma_ack_create_cq}.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_create_cq {
> > +        /* Size of CQ */
> > +        le32 cqe;
> > +};
> > +
> > +struct virtio_rdma_ack_create_cq {
> > +        /* The index of CQ */
> > +        le32 cqn;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_DESTROY_CQ] Destroy a Completion Queue.
> > +  The command-specific-data is \field{struct virtio_rdma_cmd_destroy_cq};
> > +  no ack-specific-data.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_destroy_cq {
> > +        /* The index of CQ */
> > +        le32 cqn;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_CREATE_PD] Create a Protection Domain (PD).
> > +  No command-specific-data;
> > +  the ack-specific-data is \field{struct virtio_rdma_ack_create_pd}.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_ack_create_pd {
> > +        /* The handle of PD */
> > +        le32 pdn;
> > +};
> > +\end{lstlisting}
>
> Can this command always succeed? I mean, is there a limit on the total
> number of PDs that a single RoCE device can support?
>

Yes, we have the max_pd field in struct virtio_rdma_ack_query_device.

> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_DESTROY_PD] Destroy a Protection Domain.
> > +  The command-specific-data is \field{virtio_rdma_cmd_destroy_pd};
> > +  no ack-specific-data.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_destroy_pd {
> > +        /* The handle of PD */
> > +        le32 pdn;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_GET_DMA_MR] Get the DMA Memory Region (MR)
> > +  associated with one protection domain.
>
> I wonder what's the difference between VIRTIO_NET_CTRL_ROCE_GET_DMA_MR
> and USR_MR. Can we unify them?
>

We need to pass an address for USER_MR. I think we can unify them if we want.

> > +  The command-specific-data is \field{virtio_rdma_cmd_get_dma_mr};
> > +  the ack-specific-data is \field{virtio_rdma_ack_get_dma_mr}.
> > +
> > +\begin{lstlisting}
> > +enum virtio_ib_access_flags {
> > +        VIRTIO_IB_ACCESS_LOCAL_WRITE = (1 << 0),
>
> Is LOCAL_READ implied to work always?
>

Yes, the LOCAL_READ is always supported.

> > +        VIRTIO_IB_ACCESS_REMOTE_WRITE = (1 << 1),
> > +        VIRTIO_IB_ACCESS_REMOTE_READ = (1 << 2),
> > +};
> > +
> > +struct virtio_rdma_cmd_get_dma_mr {
> > +        /* The handle of PD which the MR associated with */
> > +        le32 pdn;
> > +        /* MR's protection attributes, enum virtio_ib_access_flags */
> > +        le32 access_flags;
> > +};
> > +
> > +struct virtio_rdma_ack_get_dma_mr {
> > +        /* The handle of MR */
> > +        le32 mrn;
> > +        /* MR's local access key */
> > +        le32 lkey;
> > +        /* MR's remote access key */
> > +        le32 rkey;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_REG_USER_MR] Register a user Memory Region
> > +  associated with one Protection Domain.
> > +  The command-specific-data is \field{virtio_rdma_cmd_reg_user_mr};
> > +  the ack-specific-data is \field{virtio_rdma_ack_reg_user_mr}.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_reg_user_mr {
> > +        /* The handle of PD which the MR associated with */
> > +        le32 pdn;
> > +        /* MR's protection attributes, enum virtio_ib_access_flags */
> > +        le32 access_flags;
> > +        /* Starting virtual address of MR */
> > +        le64 virt_addr;
>
> I guess this is actually the I/O virtual address and the device is in
> charge of translating it to the page array below?
>

Yes, this address is specified by userspace, which can be a virtual
address or not.

> > +        /* Length of MR */
> > +        le64 length;
> > +        /* Size of the below page array */
> > +        le32 npages;
> > +        /* Padding */
> > +        le32 padding;
> > +        /* Array to store physical address of each page in MR */
> > +        le64 pages[];
>
> How does the device know the size of a page?
>

We have the npages field in this structure.
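
A rough sketch of how npages and the command size relate to virt_addr
and length. It assumes 4 KiB pages, which the patch does not define
(that is exactly the open question above), and it assumes that
virt_addr + length does not wrap. The EXAMPLE_* macros and both
function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: 4 KiB pages; the patch does not specify the page size. */
#define EXAMPLE_PAGE_SHIFT 12

/*
 * Fixed part of struct virtio_rdma_cmd_reg_user_mr:
 * pdn (4) + access_flags (4) + virt_addr (8) + length (8) +
 * npages (4) + padding (4) = 32 bytes.
 */
#define EXAMPLE_REG_USER_MR_HDR_SIZE 32u

/* Pages touched by [virt_addr, virt_addr + length); length must be > 0. */
uint64_t example_mr_npages(uint64_t virt_addr, uint64_t length)
{
        uint64_t first = virt_addr >> EXAMPLE_PAGE_SHIFT;
        uint64_t last = (virt_addr + length - 1) >> EXAMPLE_PAGE_SHIFT;

        return last - first + 1;
}

/* Total command-specific-data size: header plus one le64 per page. */
size_t example_reg_user_mr_cmd_size(uint32_t npages)
{
        return EXAMPLE_REG_USER_MR_HDR_SIZE +
               (size_t)npages * sizeof(uint64_t);
}
```

Note that an unaligned virt_addr can make a short region span two
pages, so npages cannot be derived from length alone.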

> > +};
>
> I believe this command can fail, we need to describe the error conditions.
>

OK.

> > +
> > +struct virtio_rdma_ack_reg_user_mr {
> > +        /* The handle of MR */
> > +        le32 mrn;
> > +        /* MR's local access key */
> > +        le32 lkey;
> > +        /* MR's remote access key */
> > +        le32 rkey;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_DEREG_MR] De-register a Memory Region.
> > +  The command-specific-data is \field{virtio_rdma_cmd_dereg_mr};
> > +  no ack-specific-data.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_dereg_mr {
> > +        /* The handle of MR */
> > +        le32 mrn;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_CREATE_QP] Create a Queue Pair (Send Queue and Receive Queue).
> > +  The command-specific-data is \field{virtio_rdma_cmd_create_qp};
> > +  the ack-specific-data is \field{virtio_rdma_ack_create_qp}.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_qp_cap {
> > +        /* Maximum number of outstanding WRs in SQ */
> > +        le32 max_send_wr;
> > +        /* Maximum number of outstanding WRs in RQ */
> > +        le32 max_recv_wr;
> > +        /* Maximum number of s/g elements per WR in SQ */
> > +        le32 max_send_sge;
> > +        /* Maximum number of s/g elements per WR in RQ */
> > +        le32 max_recv_sge;
> > +        /* Maximum number of data (bytes) that can be posted inline to SQ */
> > +        le32 max_inline_data;
> > +        /* Padding */
> > +        le32 padding;
> > +};
> > +
> > +struct virtio_rdma_cmd_create_qp {
> > +        /* The handle of PD which the QP associated with */
> > +        le32 pdn;
> > +#define VIRTIO_IB_QPT_SMI    0
> > +#define VIRTIO_IB_QPT_GSI    1
> > +#define VIRTIO_IB_QPT_RC     2
> > +#define VIRTIO_IB_QPT_UC     3
> > +#define VIRTIO_IB_QPT_UD     4
> > +        /* QP's type */
> > +        u8 qp_type;
> > +        /* If set, each WR submitted to the SQ generates a completion entry */
> > +        u8 sq_sig_all;
> > +        /* Padding */
> > +        u8 padding[2];
> > +        /* The index of CQ which the SQ associated with */
> > +        le32 send_cqn;
> > +        /* The index of CQ which the RQ associated with */
> > +        le32 recv_cqn;
> > +        /* QP's capabilities */
> > +        struct virtio_rdma_qp_cap cap;
> > +        /* Reserved for future */
> > +        le32 reserved[4];
> > +};
> > +
> > +struct virtio_rdma_ack_create_qp {
> > +        /* The index of QP */
> > +        le32 qpn;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_MODIFY_QP] Modify the attributes of a Queue Pair.
> > +  The command-specific-data is \field{virtio_rdma_cmd_modify_qp};
> > +  no ack-specific-data.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_global_route {
> > +        /* Destination GID or MGID */
> > +        u8 dgid[16];
> > +        /* Flow label */
> > +        le32 flow_label;
> > +        /* Source GID index */
> > +        u8 sgid_index;
> > +        /* Hop limit */
> > +        u8 hop_limit;
> > +        /* Traffic class */
> > +        u8 traffic_class;
> > +        /* Padding */
> > +        u8 padding;
> > +};
> > +
> > +struct virtio_rdma_ah_attr {
> > +        /* Global Routing Header (GRH) attributes */
> > +        struct virtio_rdma_global_route grh;
> > +        /* Destination MAC address */
> > +        u8 dmac[6];
> > +        /* Reserved for future */
> > +        u8 reserved[10];
> > +};
> > +
> > +enum virtio_ib_qp_attr_mask {
> > +        VIRTIO_IB_QP_STATE = (1 << 0),
> > +        VIRTIO_IB_QP_CUR_STATE = (1 << 1),
> > +        VIRTIO_IB_QP_ACCESS_FLAGS = (1 << 2),
> > +        VIRTIO_IB_QP_QKEY = (1 << 3),
> > +        VIRTIO_IB_QP_AV = (1 << 4),
> > +        VIRTIO_IB_QP_PATH_MTU = (1 << 5),
> > +        VIRTIO_IB_QP_TIMEOUT = (1 << 6),
> > +        VIRTIO_IB_QP_RETRY_CNT = (1 << 7),
> > +        VIRTIO_IB_QP_RNR_RETRY = (1 << 8),
> > +        VIRTIO_IB_QP_RQ_PSN = (1 << 9),
> > +        VIRTIO_IB_QP_MAX_QP_RD_ATOMIC = (1 << 10),
> > +        VIRTIO_IB_QP_MIN_RNR_TIMER = (1 << 11),
> > +        VIRTIO_IB_QP_SQ_PSN = (1 << 12),
> > +        VIRTIO_IB_QP_MAX_DEST_RD_ATOMIC = (1 << 13),
> > +        VIRTIO_IB_QP_CAP = (1 << 14),
> > +        VIRTIO_IB_QP_DEST_QPN = (1 << 15),
> > +        VIRTIO_IB_QP_RATE_LIMIT = (1 << 16),
> > +};
>
> Do we need to explain the above error codes? Or is it simply a mapping from the IB spec?
>

Yes, they are defined in the IB spec. But we can add some comments for them too.
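
A small sketch of how attr_mask could be checked against the bits
defined above. Only three of the bits are repeated here; the helper
names and EXAMPLE_QP_ATTR_MASK_ALL are hypothetical, and rejecting
undefined bits is an assumption about device behaviour, not something
the patch states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A subset of enum virtio_ib_qp_attr_mask, same values as the patch. */
#define VIRTIO_IB_QP_STATE      (1u << 0)
#define VIRTIO_IB_QP_PATH_MTU   (1u << 5)
#define VIRTIO_IB_QP_RATE_LIMIT (1u << 16)  /* highest defined bit */

/* All 17 bits from VIRTIO_IB_QP_STATE up to VIRTIO_IB_QP_RATE_LIMIT. */
#define EXAMPLE_QP_ATTR_MASK_ALL ((VIRTIO_IB_QP_RATE_LIMIT << 1) - 1)

/* ASSUMPTION: a mask carrying undefined bits is treated as invalid. */
bool example_qp_attr_mask_valid(uint32_t attr_mask)
{
        return (attr_mask & ~EXAMPLE_QP_ATTR_MASK_ALL) == 0;
}

/* True if the field guarded by bit is selected in attr_mask. */
bool example_qp_attr_selected(uint32_t attr_mask, uint32_t bit)
{
        return (attr_mask & bit) != 0;
}
```

For example, a mask of VIRTIO_IB_QP_STATE | VIRTIO_IB_QP_PATH_MTU
selects qp_state and path_mtu and leaves the other fields of
virtio_rdma_cmd_modify_qp unselected.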

> > +
> > +enum virtio_ib_qp_state {
> > +        VIRTIO_IB_QPS_RESET,
> > +        VIRTIO_IB_QPS_INIT,
> > +        VIRTIO_IB_QPS_RTR,
> > +        VIRTIO_IB_QPS_RTS,
> > +        VIRTIO_IB_QPS_SQD,
> > +        VIRTIO_IB_QPS_SQE,
> > +        VIRTIO_IB_QPS_ERR
> > +};
> > +
> > +enum virtio_ib_mtu {
> > +        VIRTIO_IB_MTU_256 = 1,
> > +        VIRTIO_IB_MTU_512 = 2,
> > +        VIRTIO_IB_MTU_1024 = 3,
> > +        VIRTIO_IB_MTU_2048 = 4,
> > +        VIRTIO_IB_MTU_4096 = 5
> > +};
> > +
> > +struct virtio_rdma_cmd_modify_qp {
> > +        /* The index of QP */
> > +        le32 qpn;
> > +        /* The mask of attributes to be modified, enum virtio_ib_qp_attr_mask */
> > +        le32 attr_mask;
> > +        /* Move the QP to this state, enum virtio_ib_qp_state */
> > +        u8 qp_state;
> > +        /* Current QP state, enum virtio_ib_qp_state */
> > +        u8 cur_qp_state;
> > +        /* Path MTU (valid only for RC/UC QPs), enum virtio_ib_mtu */
> > +        u8 path_mtu;
> > +        /* Number of outstanding RDMA Read operations on destination QP (valid only for RC QPs) */
> > +        u8 max_rd_atomic;
> > +        /* Number of responder resources for handling incoming RDMA Read operations (valid only for RC QPs) */
> > +        u8 max_dest_rd_atomic;
> > +        /* Minimum RNR (Receiver Not Ready) NAK timer (valid only for RC QPs) */
> > +        u8 min_rnr_timer;
> > +        /* Local ack timeout (valid only for RC QPs) */
> > +        u8 timeout;
> > +        /* Retry count (valid only for RC QPs) */
> > +        u8 retry_cnt;
> > +        /* RNR retry (valid only for RC QPs) */
> > +        u8 rnr_retry;
> > +        /* Padding */
> > +        u8 padding[7];
> > +        /* Q_Key for the QP (valid only for UD QPs) */
> > +        le32 qkey;
> > +        /* PSN for RQ (valid only for RC/UC QPs) */
> > +        le32 rq_psn;
> > +        /* PSN for SQ */
> > +        le32 sq_psn;
> > +        /* Destination QP number (valid only for RC/UC QPs) */
> > +        le32 dest_qp_num;
> > +        /* Mask of enabled remote access operations (valid only for RC/UC QPs), enum virtio_ib_access_flags */
> > +        le32 qp_access_flags;
> > +        /* Rate limit in kbps for packet pacing */
> > +        le32 rate_limit;
> > +        /* QP capabilities */
> > +        struct virtio_rdma_qp_cap cap;
> > +        /* Address Vector (valid only for RC/UC QPs) */
> > +        struct virtio_rdma_ah_attr ah_attr;
> > +        /* Reserved for future */
> > +        le32 reserved[4];
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_QUERY_QP] Query the attributes of a Queue Pair.
> > +  The command-specific-data is \field{virtio_rdma_cmd_query_qp};
> > +  the ack-specific-data is \field{virtio_rdma_ack_query_qp}.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_query_qp {
> > +        /* The index of QP */
> > +        le32 qpn;
> > +        /* The mask of attributes to be queried, enum virtio_ib_qp_attr_mask */
> > +        le32 attr_mask;
> > +};
> > +
> > +struct virtio_rdma_ack_query_qp {
>
> Any chance to unify this with virtio_rdma_cmd_modify_qp?
>

It would be a little confusing since some states are only used by modify_qp.

> > +        /* Current QP state, enum virtio_ib_qp_state */
> > +        u8 qp_state;
> > +        /* Path MTU (valid only for RC/UC QPs), enum virtio_ib_mtu */
> > +        u8 path_mtu;
> > +        /* Is the SQ draining */
> > +        u8 sq_draining;
> > +        /* Number of outstanding RDMA read operations on destination QP (valid only for RC QPs) */
> > +        u8 max_rd_atomic;
> > +        /* Number of responder resources for handling incoming RDMA read operations (valid only for RC QPs) */
> > +        u8 max_dest_rd_atomic;
> > +        /* Minimum RNR NAK timer (valid only for RC QPs) */
> > +        u8 min_rnr_timer;
> > +        /* Local ack timeout (valid only for RC QPs) */
> > +        u8 timeout;
> > +        /* Retry count (valid only for RC QPs) */
> > +        u8 retry_cnt;
> > +        /* RNR retry (valid only for RC QPs) */
> > +        u8 rnr_retry;
> > +        /* Padding */
> > +        u8 padding[7];
> > +        /* Q_Key for the QP (valid only for UD QPs) */
> > +        le32 qkey;
> > +        /* PSN for RQ (valid only for RC/UC QPs) */
> > +        le32 rq_psn;
> > +        /* PSN for SQ */
> > +        le32 sq_psn;
> > +        /* Destination QP number (valid only for RC/UC QPs) */
> > +        le32 dest_qp_num;
> > +        /* Mask of enabled remote access operations (valid only for RC/UC QPs), enum virtio_ib_access_flags */
> > +        le32 qp_access_flags;
> > +        /* Rate limit in kbps for packet pacing */
> > +        le32 rate_limit;
> > +        /* QP capabilities */
> > +        struct virtio_rdma_qp_cap cap;
> > +        /* Address Vector (valid only for RC/UC QPs) */
> > +        struct virtio_rdma_ah_attr ah_attr;
> > +        /* Reserved for future */
> > +        le32 reserved[4];
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_DESTROY_QP] Destroy a Queue Pair.
> > +  The command-specific-data is \field{virtio_rdma_cmd_destroy_qp};
> > +  no ack-specific-data.
>
> What happens to the pending requests? Will the device wait for their
> completion or not?
>

They should be discarded according to the IB spec.

> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_destroy_qp {
> > +        /* The index of QP */
> > +        le32 qpn;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_CREATE_AH] Create an Address Handle (AH).
> > +  The command-specific-data is \field{virtio_rdma_cmd_create_ah};
> > +  the ack-specific-data is \field{virtio_rdma_ack_create_ah}.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_create_ah {
> > +        /* The handle of PD which the AH associated with */
> > +        le32 pdn;
> > +        /* Padding */
> > +        le32 padding;
> > +        /* Address Vector */
> > +        struct virtio_rdma_ah_attr ah_attr;
> > +};
> > +
> > +struct virtio_rdma_ack_create_ah {
> > +        /* The address handle */
> > +        le32 ah;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_DESTROY_AH] Destroy an Address Handle.
> > +  The command-specific-data is \field{virtio_rdma_cmd_destroy_ah};
> > +  no ack-specific-data.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_destroy_ah {
> > +        /* The handle of PD which the AH associated with */
> > +        le32 pdn;
> > +        /* The address handle */
> > +        le32 ah;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_ADD_GID] Add a Global Identifier (GID).
> > +  The command-specific-data is \field{virtio_rdma_cmd_add_gid};
> > +  no ack-specific-data.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_add_gid {
> > +        /* Index of GID */
> > +        le16 index;
> > +        /* Padding */
> > +        le16 padding[3];
> > +        /* GID to be added */
> > +        u8 gid[16];
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_DEL_GID] Delete a Global Identifier.
> > +  The command-specific-data is \field{virtio_rdma_cmd_del_gid};
> > +  no ack-specific-data.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_del_gid {
> > +        /* Index of GID */
> > +        le16 index;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_REQ_NOTIFY_CQ] Request a completion notification
> > +  on a Completion Queue.
> > +  The command-specific-data is \field{virtio_rdma_cmd_req_notify};
> > +  no ack-specific-data.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_req_notify {
> > +        /* The index of CQ */
> > +        le32 cqn;
> > +#define VIRTIO_IB_NOTIFY_SOLICITED (1 << 0)
> > +#define VIRTIO_IB_NOTIFY_NEXT_COMPLETION (1 << 1)
>
> Need to describe the differences on those two flags.
>

OK.

> > +        /* Notify flags */
> > +        le32 flags;
> > +};
> > +\end{lstlisting}
> > +
> > +\end{description}
> > +
> > +\drivernormative{\subparagraph}{RoCE Configuration}{Device Types / Network Device / Device Operation / Control Virtqueue / RoCE Configuration}
> > +
> > +A driver MUST initialize the completion virtqueue and fill it with
> > +enough entries after command VIRTIO_NET_CTRL_ROCE_CREATE_CQ is
> > +successfully executed.
> > +
> > +A driver MUST reset the completion virtqueue after
>
> How is the reset done? Do you mean the driver needs to reset the indices?
>

Yes, something like avail_idx, used_idx.
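
Roughly what I have in mind on the driver side, as a sketch only. The
struct and function below are hypothetical bookkeeping for one split
virtqueue, not an existing driver API, and how the device resets its
own view of the ring is not covered here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical driver-side state for one split virtqueue. */
struct example_vq_state {
        uint16_t size;          /* number of descriptors in the queue */
        uint16_t avail_idx;     /* next index published in the avail ring */
        uint16_t last_used_idx; /* last used-ring index the driver processed */
        uint16_t num_free;      /* descriptors not currently in flight */
};

/*
 * Bring the queue back to its initial state after DESTROY_CQ or
 * DESTROY_QP succeeded, so it can be handed out again by a later
 * CREATE_CQ or CREATE_QP.
 */
void example_vq_reset(struct example_vq_state *vq)
{
        vq->avail_idx = 0;
        vq->last_used_idx = 0;
        vq->num_free = vq->size;
}
```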

> > +command VIRTIO_NET_CTRL_ROCE_DESTROY_CQ is successfully executed.
> > +
> > +A driver MUST initialize the send virtqueue and receive virtqueue after
> > +command VIRTIO_NET_CTRL_ROCE_CREATE_QP is successfully executed.
> > +
> > +A driver MUST reset the send virtqueue and receive virtqueue after
> > +command VIRTIO_NET_CTRL_ROCE_DESTROY_QP is successfully executed.
> >
> >  \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device
> >  Types / Network Device / Legacy Interface: Framing Requirements}
> > @@ -4496,6 +5063,289 @@ \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device
> >  See \ref{sec:Basic
> >  Facilities of a Virtio Device / Virtqueues / Message Framing}.
> >
> > +\subsubsection{RoCE Support}\label{sec:Device Types / Network Device / Device Operation / RoCE Support}
> > +
> > +RDMA over Converged Ethernet (RoCE) is a network protocol that allows
> > +remote direct memory access (RDMA) over an Ethernet network. To support
> > +RoCE (if VIRTIO_NET_F_ROCE is negotiated), in addition to the control
> > +virtqueue support mentioned in \ref{sec:Device Types / Network Device /
> > +Device Operation / Control Virtqueue / RoCE Configuration}, multiple
> > +types of virtqueues including send virtqueue, receive virtqueue and
> > +completion virtqueue are introduced.
> > +
> > +The send virtqueue contains elements that describe the data to be
> > +transmitted.
> > +
> > +Requests (device-readable) have the following format:
> > +
> > +\begin{lstlisting}
> > +enum virtio_ib_wr_opcode {
> > +        VIRTIO_IB_WR_RDMA_WRITE,
> > +        VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM,
> > +        VIRTIO_IB_WR_SEND,
> > +        VIRTIO_IB_WR_SEND_WITH_IMM,
> > +        VIRTIO_IB_WR_RDMA_READ,
> > +};
> > +
> > +struct virtio_rdma_sge {
> > +        le64 addr;
> > +        le32 length;
> > +        le32 lkey;
> > +};
> > +
> > +struct virtio_rdma_sq_req {
> > +        /* User defined WR ID */
> > +        le64 wr_id;
> > +        /* WR opcode, enum virtio_ib_wr_opcode */
> > +        u8 opcode;
> > +#define VIRTIO_IB_SEND_FENCE        (1 << 0)
> > +#define VIRTIO_IB_SEND_SIGNALED     (1 << 1)
> > +#define VIRTIO_IB_SEND_SOLICITED    (1 << 2)
> > +#define VIRTIO_IB_SEND_INLINE       (1 << 3)
> > +        /* Flags of the WR properties */
> > +        u8 send_flags;
> > +        /* Padding */
> > +        le16 padding;
> > +        /* Immediate data (in network byte order) to send */
> > +        le32 imm_data;
> > +        union {
> > +                struct {
> > +                        /* Start address of remote memory buffer */
> > +                        le64 remote_addr;
> > +                        /* Key of the remote MR */
> > +                        le32 rkey;
> > +                } rdma;
> > +                struct {
> > +                        /* Index of the destination QP */
> > +                        le32 remote_qpn;
> > +                        /* Q_Key of the destination QP */
> > +                        le32 remote_qkey;
> > +                        /* Address Handle */
> > +                        le32 ah;
> > +                } ud;
> > +                /* Reserved for future */
> > +                le64 reserved[4];
> > +        };
> > +        /* Inline data */
> > +        u8 inline_data[512];
> > +        union {
> > +                /* Length of sg_list */
> > +                le32 num_sge;
> > +                /* Length of inline data */
> > +                le16 inline_len;
> > +        };
> > +        /* Reserved for future */
> > +        le32 reserved2[3];
> > +        /* Scatter/gather list */
> > +        struct virtio_rdma_sge sg_list[];
> > +};
> > +\end{lstlisting}
> > +
> > +The receive virtqueue contains elements that describe where to place incoming data.
> > +
> > +Requests (device-readable) have the following format:
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_rq_req {
> > +        /* User defined WR ID */
> > +        le64 wr_id;
> > +        /* Length of sg_list */
> > +        le32 num_sge;
> > +        /* Reserved for future */
> > +        le32 reserved[3];
> > +        /* Scatter/gather list */
> > +        struct virtio_rdma_sge sg_list[];
> > +};
> > +\end{lstlisting}
> > +
> > +The completion virtqueue is used to notify the completion of requests in
> > +send virtqueue or receive virtqueue.
> > +
> > +Requests (device-writable) have the following format:
> > +
> > +\begin{lstlisting}
> > +enum virtio_ib_wc_opcode {
> > +        VIRTIO_IB_WC_SEND,
> > +        VIRTIO_IB_WC_RDMA_WRITE,
> > +        VIRTIO_IB_WC_RDMA_READ,
> > +        VIRTIO_IB_WC_RECV,
> > +        VIRTIO_IB_WC_RECV_RDMA_WITH_IMM,
> > +};
> > +
> > +enum virtio_ib_wc_status {
> > +        /* Operation completed successfully */
> > +        VIRTIO_IB_WC_SUCCESS,
> > +        /* Local Length Error */
> > +        VIRTIO_IB_WC_LOC_LEN_ERR,
> > +        /* Local QP Operation Error */
> > +        VIRTIO_IB_WC_LOC_QP_OP_ERR,
> > +        /* Local Protection Error */
> > +        VIRTIO_IB_WC_LOC_PROT_ERR,
> > +        /* Work Request Flushed Error */
> > +        VIRTIO_IB_WC_WR_FLUSH_ERR,
> > +        /* Bad Response Error */
> > +        VIRTIO_IB_WC_BAD_RESP_ERR,
> > +        /* Local Access Error */
> > +        VIRTIO_IB_WC_LOC_ACCESS_ERR,
> > +        /* Remote Invalid Request Error */
> > +        VIRTIO_IB_WC_REM_INV_REQ_ERR,
> > +        /* Remote Access Error */
> > +        VIRTIO_IB_WC_REM_ACCESS_ERR,
> > +        /* Remote Operation Error */
> > +        VIRTIO_IB_WC_REM_OP_ERR,
> > +        /* Transport Retry Counter Exceeded */
> > +        VIRTIO_IB_WC_RETRY_EXC_ERR,
> > +        /* RNR Retry Counter Exceeded */
> > +        VIRTIO_IB_WC_RNR_RETRY_EXC_ERR,
> > +        /* Remote Aborted Error */
> > +        VIRTIO_IB_WC_REM_ABORT_ERR,
> > +        /* Fatal Error */
> > +        VIRTIO_IB_WC_FATAL_ERR,
> > +        /* Response Timeout Error */
> > +        VIRTIO_IB_WC_RESP_TIMEOUT_ERR,
> > +        /* General Error */
> > +        VIRTIO_IB_WC_GENERAL_ERR
> > +};
> > +
> > +struct virtio_rdma_cq_req {
> > +        /* User defined WR ID */
> > +        le64 wr_id;
> > +        /* Work completion status, enum virtio_ib_wc_status */
> > +        u8 status;
> > +        /* WR opcode, enum virtio_ib_wc_opcode */
> > +        u8 opcode;
> > +        /* Padding */
> > +        le16 padding;
> > +        /* Vendor error */
> > +        le32 vendor_err;
> > +        /* Number of bytes transferred */
> > +        le32 byte_len;
> > +        /* Immediate data (in network byte order) received */
> > +        le32 imm_data;
> > +        /* Local QP number of completed WR */
> > +        le32 qp_num;
> > +        /* Source QP number (remote QP number) of completed WR (valid only for UD QPs) */
> > +        le32 src_qp;
> > +#define VIRTIO_IB_WC_GRH         (1 << 0)
> > +#define VIRTIO_IB_WC_WITH_IMM    (1 << 1)
> > +        /* Work completion flag */
> > +        le32 wc_flags;
> > +        /* Reserved for future */
> > +        le32 reserved[3];
> > +};
> > +\end{lstlisting}
> > +
> > +\paragraph{Send Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Send Operation}
> > +
> > +The send operation allows us to send data to a remote QP’s Receive Queue.
> > +The receiver MUST have previously posted a receive buffer to receive the data.
>
> "MUST" keyword must belong to the normative section.
>

OK.

> > +
> > +To do a send operation, a request with \field{opcode} set to
> > +VIRTIO_IB_WR_SEND or VIRTIO_IB_WR_SEND_WITH_IMM MUST be posted to the Send
> > +Queue as one output descriptor and the device is notified of the new entry.
> > +
> > +\drivernormative{\subparagraph}{Send Operation}{Device Types / Network Device / Device Operation / RoCE Support / Send Operation}
> > +
> > +If VIRTIO_IB_SEND_INLINE is set in \field{send_flags}, the driver MUST fill
> > +send buffer into \field{inline_data} field and set \field{inline_len} to the
> > +length of the buffer. Otherwise, the driver MUST fill \field{sg_list} to
> > +describe the buffer.
> > +
> > +\devicenormative{\subparagraph}{Send Operation}{Device Types / Network Device / Device Operation / RoCE Support / Send Operation}
> > +
> > +If \field{opcode} is not set to VIRTIO_IB_WR_SEND_WITH_IMM, the device MUST
> > +ignore \field{imm_data}.
> > +
> > +If the QP type is UD, the device MUST validate \field{ud.ah}.
> > +
> > +If VIRTIO_IB_SEND_INLINE is not set in \field{send_flags}, the device MUST
> > +validate the \field{addr}, \field{length} and \field{lkey} in \field{sg_list}.
> > +
> > +\paragraph{Receive Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Receive Operation}
> > +
> > +The receive operation allows us to receive data from remote QP.
> > +It's the corresponding operation to a send operation.
> > +
> > +To do a receive operation, a request MUST be posted to the Receive
> > +Queue as one output descriptor and the device is notified of the new entry.
> > +
>
> I think we probably need to be more verbose, as has been done for
> virtio-net.
>
> That is, describe in detail what needs to be filled in virtio_rdma_rq_req.
> (And do this for the other operations as well.)
>

OK.

>
> > +\drivernormative{\subparagraph}{Receive Operation}{Device Types / Network Device / Device Operation / RoCE Support / Receive Operation}
> > +
> > +The driver MUST fill \field{sg_list} to describe the receive buffer.
> > +
> > +\devicenormative{\subparagraph}{Receive Operation}{Device Types / Network Device / Device Operation / RoCE Support / Receive Operation}
> > +
> > +The device MUST validate the \field{addr}, \field{length} and \field{lkey}
> > +in \field{sg_list}.
> > +
> > +\paragraph{Write Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Write Operation}
> > +
> > +The write operation allows us to write data to the local memory buffer
> > +on the remote side with no notification. The remote side wouldn't be aware
> > +that this operation is being done.
> > +
> > +To do a write operation, a request with \field{opcode} set to
> > +VIRTIO_IB_WR_RDMA_WRITE or VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM MUST be
> > +posted to the Send Queue as one output descriptor and the device is
> > +notified of the new entry.
> > +
> > +\drivernormative{\subparagraph}{Write Operation}{Device Types / Network Device / Device Operation / RoCE Support / Write Operation}
> > +
> > +The driver MUST fill \field{sg_list} to describe the write buffer.
>
> So sg is a must even if the driver wants to use imm?
>

Looks like not. I will fix it.

> > +
> > +The driver MUST fill \field{rdma.remote_addr} and \field{rdma.rkey} to
> > +identify the remote buffer.
> > +
> > +\devicenormative{\subparagraph}{Write Operation}{Device Types / Network Device / Device Operation / RoCE Support / Write Operation}
> > +
> > +If \field{opcode} is not set to VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM, the device
> > +MUST ignore \field{imm_data}.
> > +
> > +The device MUST validate the \field{addr}, \field{length} and \field{lkey}
> > +in \field{sg_list}.
> > +
> > +\paragraph{Read Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Read Operation}
> > +
> > +The read operation allows us to read data from the local memory buffer
> > +on the remote side with no notification. The remote side wouldn't be aware
> > +that this operation is being done.
> > +
> > +To do a read operation, a request with \field{opcode} set to
> > +VIRTIO_IB_WR_RDMA_READ MUST be posted to the Send Queue as one output
> > +descriptor and the device is notified of the new entry.
> > +
> > +\drivernormative{\subparagraph}{Read Operation}{Device Types / Network Device / Device Operation / RoCE Support / Read Operation}
> > +
> > +The driver MUST fill \field{sg_list} to describe the read buffer.
> > +
> > +The driver MUST fill \field{rdma.remote_addr} and \field{rdma.rkey} to
> > +identify the remote buffer.
> > +
> > +\devicenormative{\subparagraph}{Read Operation}{Device Types / Network Device / Device Operation / RoCE Support / Read Operation}
> > +
> > +The device MUST validate the \field{addr}, \field{length} and \field{lkey}
> > +in \field{sg_list}.
> > +
> > +\paragraph{Completion Notification}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Completion Notification}
> > +
> > +After above operation is completed, a completion notification MUST
> > +be triggered by the device.
>
> For "completion notification", do you mean the virtqueue notification
> of the cq or making the buffer that contains the cqe used?
>

Both? Making the buffer that contains cqe used and notify the virtqueue.
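
As a sketch of the first half (filling the cqe before the buffer is
made used), assuming the virtio_rdma_cq_req layout from the listing
above. The example_* helpers are hypothetical; placing the buffer in
the used ring and sending the virtqueue notification are left to the
device's normal virtqueue code and are not shown.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Layout copied from the patch; leXX fields hold little-endian values. */
struct virtio_rdma_cq_req {
        uint64_t wr_id;
        uint8_t  status;
        uint8_t  opcode;
        uint16_t padding;
        uint32_t vendor_err;
        uint32_t byte_len;
        uint32_t imm_data;
        uint32_t qp_num;
        uint32_t src_qp;
        uint32_t wc_flags;
        uint32_t reserved[3];
};

/* Endian-independent conversion of a CPU value to little-endian. */
uint32_t example_cpu_to_le32(uint32_t v)
{
        uint8_t b[4];
        uint32_t out;

        for (int i = 0; i < 4; i++)
                b[i] = (uint8_t)(v >> (8 * i));
        memcpy(&out, b, sizeof(out));
        return out;
}

uint64_t example_cpu_to_le64(uint64_t v)
{
        uint8_t b[8];
        uint64_t out;

        for (int i = 0; i < 8; i++)
                b[i] = (uint8_t)(v >> (8 * i));
        memcpy(&out, b, sizeof(out));
        return out;
}

/* Fill one completion entry in a device-writable buffer. */
void example_fill_cqe(struct virtio_rdma_cq_req *cqe, uint64_t wr_id,
                      uint8_t status, uint8_t opcode,
                      uint32_t qp_num, uint32_t byte_len)
{
        memset(cqe, 0, sizeof(*cqe)); /* padding and reserved stay zero */
        cqe->wr_id = example_cpu_to_le64(wr_id);
        cqe->status = status;
        cqe->opcode = opcode;
        cqe->qp_num = example_cpu_to_le32(qp_num);
        cqe->byte_len = example_cpu_to_le32(byte_len);
}
```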

> > To achieve that, the device MUST consume
> > +an entry of the Completion Queue associated with the Send Queue/Receive
> > +Queue which the operation belongs to.
> > +
> > +\drivernormative{\subparagraph}{Completion Notification}{Device Types / Network Device / Device Operation / RoCE Support / Completion Notification}
> > +
> > +The driver MUST fill the Completion Queue with enough entries previously.
>
> What do you mean by "previously"? What happens if there are not enough cqes?
>

We need to fill the Completion Queue in advance. Otherwise, the driver
would not get completion notification after some operation is
completed.
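
One way a driver could enforce this, sketched under my own assumptions:
it keeps a count of cqe buffers posted to the completion virtqueue and
refuses to post a new WR unless a buffer is still unreserved. The
struct and helpers are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical accounting for one completion virtqueue. */
struct example_cq_credits {
        uint32_t posted;   /* cqe buffers currently in the virtqueue */
        uint32_t reserved; /* buffers promised to outstanding WRs */
};

/* Driver added n device-writable cqe buffers to the virtqueue. */
void example_cq_refill(struct example_cq_credits *cq, uint32_t n)
{
        cq->posted += n;
}

/* Call before posting a WR; false means the completion could be lost. */
bool example_cq_reserve(struct example_cq_credits *cq)
{
        if (cq->reserved >= cq->posted)
                return false;
        cq->reserved++;
        return true;
}

/* Call when the device has used one cqe buffer for a completion. */
void example_cq_complete(struct example_cq_credits *cq)
{
        cq->posted--;
        cq->reserved--;
}
```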

Thanks,
Yongji

Patch

diff --git a/content.tex b/content.tex
index 7508dd1..646d82a 100644
--- a/content.tex
+++ b/content.tex
@@ -3008,7 +3008,10 @@  \section{Network Device}\label{sec:Device Types / Network Device}
 placed in one virtqueue for receiving packets, and outgoing
 packets are enqueued into another for transmission in that order.
 A third command queue is used to control advanced filtering
-features.
+features. If the RoCE (RDMA over Converged Ethernet) capability
+is enabled, the virtio network device can also transmit and
+receive RDMA messages through the RDMA send virtqueue, RDMA receive
+virtqueue and RDMA completion virtqueue.
 
 \subsection{Device ID}\label{sec:Device Types / Network Device / Device ID}
 
@@ -3023,13 +3026,24 @@  \subsection{Virtqueues}\label{sec:Device Types / Network Device / Virtqueues}
 \item[2(N-1)] receiveqN
 \item[2(N-1)+1] transmitqN
 \item[2N] controlq
+\item[2N+1] rdma_completeq1
+\item[\ldots]
+\item[2N+M] rdma_completeqM
+\item[2N+M+1] rdma_transmitq1
+\item[2N+M+2] rdma_receiveq1
+\item[\ldots]
+\item[2N+M+2L-1] rdma_transmitqL
+\item[2N+M+2L] rdma_receiveqL
 \end{description}
 
  N=1 if neither VIRTIO_NET_F_MQ nor VIRTIO_NET_F_RSS are negotiated, otherwise N is set by
- \field{max_virtqueue_pairs}.
+ \field{max_virtqueue_pairs}. M is set by \field{max_rdma_cqs} and L is set by
+ \field{max_rdma_qps}.
 
  controlq only exists if VIRTIO_NET_F_CTRL_VQ set.
 
+ rdma_completeq, rdma_transmitq and rdma_receiveq only exist if VIRTIO_NET_F_ROCE set.
+
 \subsection{Feature bits}\label{sec:Device Types / Network Device / Feature bits}
 
 \begin{description}
@@ -3084,6 +3098,9 @@  \subsection{Feature bits}\label{sec:Device Types / Network Device / Feature bits
 \item[VIRTIO_NET_F_CTRL_MAC_ADDR(23)] Set MAC address through control
     channel.
 
+\item[VIRTIO_NET_F_ROCE(55)] Device supports RoCE (RDMA over Converged Ethernet)
+    capability.
+
 \item[VIRTIO_NET_F_HOST_USO (56)] Device can receive USO packets. Unlike UFO
  (fragmenting the packet) the USO splits large UDP packet
  to several segments when each of these smaller packets has UDP header.
@@ -3129,6 +3146,7 @@  \subsubsection{Feature bit requirements}\label{sec:Device Types / Network Device
 \item[VIRTIO_NET_F_GUEST_ANNOUNCE] Requires VIRTIO_NET_F_CTRL_VQ.
 \item[VIRTIO_NET_F_MQ] Requires VIRTIO_NET_F_CTRL_VQ.
 \item[VIRTIO_NET_F_CTRL_MAC_ADDR] Requires VIRTIO_NET_F_CTRL_VQ.
+\item[VIRTIO_NET_F_ROCE] Requires VIRTIO_NET_F_CTRL_VQ.
 \item[VIRTIO_NET_F_RSC_EXT] Requires VIRTIO_NET_F_HOST_TSO4 or VIRTIO_NET_F_HOST_TSO6.
 \item[VIRTIO_NET_F_RSS] Requires VIRTIO_NET_F_CTRL_VQ.
 \end{description}
@@ -3190,6 +3208,8 @@  \subsection{Device configuration layout}\label{sec:Device Types / Network Device
         u8 rss_max_key_size;
         le16 rss_max_indirection_table_length;
         le32 supported_hash_types;
+        le32 max_rdma_qps;
+        le32 max_rdma_cqs;
 };
 \end{lstlisting}
 The following field, \field{rss_max_key_size} only exists if VIRTIO_NET_F_RSS or VIRTIO_NET_F_HASH_REPORT is set.
@@ -3204,11 +3224,23 @@  \subsection{Device configuration layout}\label{sec:Device Types / Network Device
 Field \field{supported_hash_types} contains the bitmask of supported hash types.
 See \ref{sec:Device Types / Network Device / Device Operation / Processing of Incoming Packets / Hash calculation for incoming packets / Supported/enabled hash types} for details of supported hash types.
 
+Field \field{max_rdma_qps} only exists if VIRTIO_NET_F_ROCE is set.
+It specifies the maximum number of queue pairs (send virtqueue and receive virtqueue) for RoCE usage.
+
+Field \field{max_rdma_cqs} only exists if VIRTIO_NET_F_ROCE is set.
+It specifies the maximum number of completion virtqueues for RoCE usage.
+
 \devicenormative{\subsubsection}{Device configuration layout}{Device Types / Network Device / Device configuration layout}
 
 The device MUST set \field{max_virtqueue_pairs} to between 1 and 0x8000 inclusive,
 if it offers VIRTIO_NET_F_MQ.
 
+The device MUST set \field{max_rdma_qps} to between 1 and 16384 inclusive,
+if it offers VIRTIO_NET_F_ROCE.
+
+The device MUST set \field{max_rdma_cqs} to between 1 and 16384 inclusive,
+if it offers VIRTIO_NET_F_ROCE.
+
 The device MUST set \field{mtu} to between 68 and 65535 inclusive,
 if it offers VIRTIO_NET_F_MTU.
 
@@ -3306,6 +3338,12 @@  \subsection{Device Initialization}\label{sec:Device Types / Network Device / Dev
 \item If the VIRTIO_NET_F_CTRL_VQ feature bit is negotiated,
   identify the control virtqueue.
 
+\item If the VIRTIO_NET_F_ROCE feature bit is negotiated,
+  identify the RDMA completion virtqueues, up to max_rdma_cqs.
+
+\item If the VIRTIO_NET_F_ROCE feature bit is negotiated,
+  identify the RDMA send and receive virtqueues, up to max_rdma_qps.
+
 \item Fill the receive queues with buffers: see \ref{sec:Device Types / Network Device / Device Operation / Setting Up Receive Buffers}.
 
 \item Even with VIRTIO_NET_F_MQ, only receiveq1, transmitq1 and
@@ -4007,6 +4045,7 @@  \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi
         u8 command;
         u8 command-specific-data[];
         u8 ack;
+        u8 ack-specific-data[];
 };
 
 /* ack values */
@@ -4015,8 +4054,8 @@  \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi
 \end{lstlisting}
 
 The \field{class}, \field{command} and command-specific-data are set by the
-driver, and the device sets the \field{ack} byte. There is little it can
-do except issue a diagnostic if \field{ack} is not
+driver, and the device sets the \field{ack} byte and ack-specific-data. There
+is little it can do except issue a diagnostic if \field{ack} is not
 VIRTIO_NET_OK.
 
 \paragraph{Packet Receive Filtering}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / Packet Receive Filtering}
@@ -4463,6 +4502,534 @@  \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi
 according to the native endian of the guest rather than
 (necessarily when not using the legacy interface) little-endian.
 
+\paragraph{RoCE Configuration}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / RoCE Configuration}
+
+If the driver negotiates the VIRTIO_NET_F_ROCE feature bit (depends on VIRTIO_NET_F_CTRL_VQ),
+it can send control commands for RoCE usage. The following commands are currently defined:
+
+\begin{lstlisting}
+#define VIRTIO_NET_CTRL_ROCE    6
+ #define VIRTIO_NET_CTRL_ROCE_QUERY_DEVICE      0
+ #define VIRTIO_NET_CTRL_ROCE_QUERY_PORT        1
+ #define VIRTIO_NET_CTRL_ROCE_CREATE_CQ         2
+ #define VIRTIO_NET_CTRL_ROCE_DESTROY_CQ        3
+ #define VIRTIO_NET_CTRL_ROCE_CREATE_PD         4
+ #define VIRTIO_NET_CTRL_ROCE_DESTROY_PD        5
+ #define VIRTIO_NET_CTRL_ROCE_GET_DMA_MR        6
+ #define VIRTIO_NET_CTRL_ROCE_REG_USER_MR       7
+ #define VIRTIO_NET_CTRL_ROCE_DEREG_MR          8
+ #define VIRTIO_NET_CTRL_ROCE_CREATE_QP         9
+ #define VIRTIO_NET_CTRL_ROCE_MODIFY_QP         10
+ #define VIRTIO_NET_CTRL_ROCE_QUERY_QP          11
+ #define VIRTIO_NET_CTRL_ROCE_DESTROY_QP        12
+ #define VIRTIO_NET_CTRL_ROCE_CREATE_AH         13
+ #define VIRTIO_NET_CTRL_ROCE_DESTROY_AH        14
+ #define VIRTIO_NET_CTRL_ROCE_ADD_GID           15
+ #define VIRTIO_NET_CTRL_ROCE_DEL_GID           16
+ #define VIRTIO_NET_CTRL_ROCE_REQ_NOTIFY_CQ     17
+\end{lstlisting}
+
+\begin{description}
+\item[VIRTIO_NET_CTRL_ROCE_QUERY_DEVICE] Query the attributes of the device.
+  No command-specific-data;
+  the ack-specific-data is \field{struct virtio_rdma_ack_query_device}.
+
+\begin{lstlisting}
+struct virtio_rdma_ack_query_device {
+#define VIRTIO_IB_DEVICE_RC_RNR_NAK_GEN    (1 << 0)
+        /* Capabilities mask */
+        le64 device_cap_flags;
+        /* Largest contiguous block that can be registered */
+        le64 max_mr_size;
+        /* Supported memory shift sizes */
+        le64 page_size_cap;
+        /* Hardware version */
+        le32 hw_ver;
+        /* Maximum number of outstanding Work Requests (WR) on Send Queue (SQ) and Receive Queue (RQ) */
+        le32 max_qp_wr;
+        /* Maximum number of scatter/gather (s/g) elements per WR for SQ for non RDMA Read operations */
+        le32 max_send_sge;
+        /* Maximum number of s/g elements per WR for RQ for non RDMA Read operations */
+        le32 max_recv_sge;
+        /* Maximum number of s/g per WR for RDMA Read operations */
+        le32 max_sge_rd;
+        /* Maximum size of Completion Queue (CQ) */
+        le32 max_cqe;
+        /* Maximum number of Memory Regions (MR) */
+        le32 max_mr;
+        /* Maximum number of Protection Domains (PD) */
+        le32 max_pd;
+        /* Maximum number of RDMA Read operations that can be outstanding per Queue Pair (QP) */
+        le32 max_qp_rd_atom;
+        /* Maximum depth per QP for initiation of RDMA Read operations */
+        le32 max_qp_init_rd_atom;
+        /* Maximum number of Address Handles (AH) */
+        le32 max_ah;
+        /* Local CA ack delay */
+        u8 local_ca_ack_delay;
+        /* Padding */
+        u8 padding[3];
+        /* Reserved for future */
+        le32 reserved[14];
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_QUERY_PORT] Query the attributes of the port.
+  No command-specific-data;
+  the ack-specific-data is \field{struct virtio_rdma_ack_query_port}.
+
+\begin{lstlisting}
+struct virtio_rdma_ack_query_port {
+        /* Length of source Global Identifier (GID) table */
+        le32 gid_tbl_len;
+        /* Maximum message size */
+        le32 max_msg_sz;
+        /* Reserved for future */
+        le32 reserved[6];
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_CREATE_CQ] Create a Completion Queue (CQ).
+  The command-specific-data is \field{struct virtio_rdma_cmd_create_cq};
+  the ack-specific-data is \field{struct virtio_rdma_ack_create_cq}.
+
+\begin{lstlisting}
+struct virtio_rdma_cmd_create_cq {
+        /* Size of CQ */
+        le32 cqe;
+};
+
+struct virtio_rdma_ack_create_cq {
+        /* The index of CQ */
+        le32 cqn;
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_DESTROY_CQ] Destroy a Completion Queue.
+  The command-specific-data is \field{struct virtio_rdma_cmd_destroy_cq};
+  no ack-specific-data.
+
+\begin{lstlisting}
+struct virtio_rdma_cmd_destroy_cq {
+        /* The index of CQ */
+        le32 cqn;
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_CREATE_PD] Create a Protection Domain (PD).
+  No command-specific-data;
+  the ack-specific-data is \field{struct virtio_rdma_ack_create_pd}.
+
+\begin{lstlisting}
+struct virtio_rdma_ack_create_pd {
+        /* The handle of PD */
+        le32 pdn;
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_DESTROY_PD] Destroy a Protection Domain.
+  The command-specific-data is \field{virtio_rdma_cmd_destroy_pd};
+  no ack-specific-data.
+
+\begin{lstlisting}
+struct virtio_rdma_cmd_destroy_pd {
+        /* The handle of PD */
+        le32 pdn;
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_GET_DMA_MR] Get the DMA Memory Region (MR)
+  associated with one Protection Domain.
+  The command-specific-data is \field{virtio_rdma_cmd_get_dma_mr};
+  the ack-specific-data is \field{virtio_rdma_ack_get_dma_mr}.
+
+\begin{lstlisting}
+enum virtio_ib_access_flags {
+        VIRTIO_IB_ACCESS_LOCAL_WRITE = (1 << 0),
+        VIRTIO_IB_ACCESS_REMOTE_WRITE = (1 << 1),
+        VIRTIO_IB_ACCESS_REMOTE_READ = (1 << 2),
+};
+
+struct virtio_rdma_cmd_get_dma_mr {
+        /* The handle of the PD with which the MR is associated */
+        le32 pdn;
+        /* MR's protection attributes, enum virtio_ib_access_flags */
+        le32 access_flags;
+};
+
+struct virtio_rdma_ack_get_dma_mr {
+        /* The handle of MR */
+        le32 mrn;
+        /* MR's local access key */
+        le32 lkey;
+        /* MR's remote access key */
+        le32 rkey;
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_REG_USER_MR] Register a user Memory Region
+  associated with one Protection Domain.
+  The command-specific-data is \field{virtio_rdma_cmd_reg_user_mr};
+  the ack-specific-data is \field{virtio_rdma_ack_reg_user_mr}.
+
+\begin{lstlisting}
+struct virtio_rdma_cmd_reg_user_mr {
+        /* The handle of the PD with which the MR is associated */
+        le32 pdn;
+        /* MR's protection attributes, enum virtio_ib_access_flags */
+        le32 access_flags;
+        /* Starting virtual address of MR */
+        le64 virt_addr;
+        /* Length of MR */
+        le64 length;
+        /* Size of the below page array */
+        le32 npages;
+        /* Padding */
+        le32 padding;
+        /* Array to store physical address of each page in MR */
+        le64 pages[];
+};
+
+struct virtio_rdma_ack_reg_user_mr {
+        /* The handle of MR */
+        le32 mrn;
+        /* MR's local access key */
+        le32 lkey;
+        /* MR's remote access key */
+        le32 rkey;
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_DEREG_MR] De-register a Memory Region.
+  The command-specific-data is \field{virtio_rdma_cmd_dereg_mr};
+  no ack-specific-data.
+
+\begin{lstlisting}
+struct virtio_rdma_cmd_dereg_mr {
+        /* The handle of MR */
+        le32 mrn;
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_CREATE_QP] Create a Queue Pair (Send Queue and Receive Queue).
+  The command-specific-data is \field{virtio_rdma_cmd_create_qp};
+  the ack-specific-data is \field{virtio_rdma_ack_create_qp}.
+
+\begin{lstlisting}
+struct virtio_rdma_qp_cap {
+        /* Maximum number of outstanding WRs in SQ */
+        le32 max_send_wr;
+        /* Maximum number of outstanding WRs in RQ */
+        le32 max_recv_wr;
+        /* Maximum number of s/g elements per WR in SQ */
+        le32 max_send_sge;
+        /* Maximum number of s/g elements per WR in RQ */
+        le32 max_recv_sge;
+        /* Maximum amount of data (in bytes) that can be posted inline to SQ */
+        le32 max_inline_data;
+        /* Padding */
+        le32 padding;
+};
+
+struct virtio_rdma_cmd_create_qp {
+        /* The handle of the PD with which the QP is associated */
+        le32 pdn;
+#define VIRTIO_IB_QPT_SMI    0
+#define VIRTIO_IB_QPT_GSI    1
+#define VIRTIO_IB_QPT_RC     2
+#define VIRTIO_IB_QPT_UC     3
+#define VIRTIO_IB_QPT_UD     4
+        /* QP's type */
+        u8 qp_type;
+        /* If set, each WR submitted to the SQ generates a completion entry */
+        u8 sq_sig_all;
+        /* Padding */
+        u8 padding[2];
+        /* The index of the CQ with which the SQ is associated */
+        le32 send_cqn;
+        /* The index of the CQ with which the RQ is associated */
+        le32 recv_cqn;
+        /* QP's capabilities */
+        struct virtio_rdma_qp_cap cap;
+        /* Reserved for future */
+        le32 reserved[4];
+};
+
+struct virtio_rdma_ack_create_qp {
+        /* The index of QP */
+        le32 qpn;
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_MODIFY_QP] Modify the attributes of a Queue Pair.
+  The command-specific-data is \field{virtio_rdma_cmd_modify_qp};
+  no ack-specific-data.
+
+\begin{lstlisting}
+struct virtio_rdma_global_route {
+        /* Destination GID or MGID */
+        u8 dgid[16];
+        /* Flow label */
+        le32 flow_label;
+        /* Source GID index */
+        u8 sgid_index;
+        /* Hop limit */
+        u8 hop_limit;
+        /* Traffic class */
+        u8 traffic_class;
+        /* Padding */
+        u8 padding;
+};
+
+struct virtio_rdma_ah_attr {
+        /* Global Routing Header (GRH) attributes */
+        struct virtio_rdma_global_route grh;
+        /* Destination MAC address */
+        u8 dmac[6];
+        /* Reserved for future */
+        u8 reserved[10];
+};
+
+enum virtio_ib_qp_attr_mask {
+        VIRTIO_IB_QP_STATE = (1 << 0),
+        VIRTIO_IB_QP_CUR_STATE = (1 << 1),
+        VIRTIO_IB_QP_ACCESS_FLAGS = (1 << 2),
+        VIRTIO_IB_QP_QKEY = (1 << 3),
+        VIRTIO_IB_QP_AV = (1 << 4),
+        VIRTIO_IB_QP_PATH_MTU = (1 << 5),
+        VIRTIO_IB_QP_TIMEOUT = (1 << 6),
+        VIRTIO_IB_QP_RETRY_CNT = (1 << 7),
+        VIRTIO_IB_QP_RNR_RETRY = (1 << 8),
+        VIRTIO_IB_QP_RQ_PSN = (1 << 9),
+        VIRTIO_IB_QP_MAX_QP_RD_ATOMIC = (1 << 10),
+        VIRTIO_IB_QP_MIN_RNR_TIMER = (1 << 11),
+        VIRTIO_IB_QP_SQ_PSN = (1 << 12),
+        VIRTIO_IB_QP_MAX_DEST_RD_ATOMIC = (1 << 13),
+        VIRTIO_IB_QP_CAP = (1 << 14),
+        VIRTIO_IB_QP_DEST_QPN = (1 << 15),
+        VIRTIO_IB_QP_RATE_LIMIT = (1 << 16),
+};
+
+enum virtio_ib_qp_state {
+        VIRTIO_IB_QPS_RESET,
+        VIRTIO_IB_QPS_INIT,
+        VIRTIO_IB_QPS_RTR,
+        VIRTIO_IB_QPS_RTS,
+        VIRTIO_IB_QPS_SQD,
+        VIRTIO_IB_QPS_SQE,
+        VIRTIO_IB_QPS_ERR
+};
+
+enum virtio_ib_mtu {
+        VIRTIO_IB_MTU_256 = 1,
+        VIRTIO_IB_MTU_512 = 2,
+        VIRTIO_IB_MTU_1024 = 3,
+        VIRTIO_IB_MTU_2048 = 4,
+        VIRTIO_IB_MTU_4096 = 5
+};
+
+struct virtio_rdma_cmd_modify_qp {
+        /* The index of QP */
+        le32 qpn;
+        /* The mask of attributes to be modified, enum virtio_ib_qp_attr_mask */
+        le32 attr_mask;
+        /* Move the QP to this state, enum virtio_ib_qp_state */
+        u8 qp_state;
+        /* Current QP state, enum virtio_ib_qp_state */
+        u8 cur_qp_state;
+        /* Path MTU (valid only for RC/UC QPs), enum virtio_ib_mtu */
+        u8 path_mtu;
+        /* Number of outstanding RDMA Read operations on destination QP (valid only for RC QPs) */
+        u8 max_rd_atomic;
+        /* Number of responder resources for handling incoming RDMA Read operations (valid only for RC QPs) */
+        u8 max_dest_rd_atomic;
+        /* Minimum RNR (Receiver Not Ready) NAK timer (valid only for RC QPs) */
+        u8 min_rnr_timer;
+        /* Local ack timeout (valid only for RC QPs) */
+        u8 timeout;
+        /* Retry count (valid only for RC QPs) */
+        u8 retry_cnt;
+        /* RNR retry (valid only for RC QPs) */
+        u8 rnr_retry;
+        /* Padding */
+        u8 padding[7];
+        /* Q_Key for the QP (valid only for UD QPs) */
+        le32 qkey;
+        /* PSN for RQ (valid only for RC/UC QPs) */
+        le32 rq_psn;
+        /* PSN for SQ */
+        le32 sq_psn;
+        /* Destination QP number (valid only for RC/UC QPs) */
+        le32 dest_qp_num;
+        /* Mask of enabled remote access operations (valid only for RC/UC QPs), enum virtio_ib_access_flags */
+        le32 qp_access_flags;
+        /* Rate limit in kbps for packet pacing */
+        le32 rate_limit;
+        /* QP capabilities */
+        struct virtio_rdma_qp_cap cap;
+        /* Address Vector (valid only for RC/UC QPs) */
+        struct virtio_rdma_ah_attr ah_attr;
+        /* Reserved for future */
+        le32 reserved[4];
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_QUERY_QP] Query the attributes of a Queue Pair.
+  The command-specific-data is \field{virtio_rdma_cmd_query_qp};
+  the ack-specific-data is \field{virtio_rdma_ack_query_qp}.
+
+\begin{lstlisting}
+struct virtio_rdma_cmd_query_qp {
+        /* The index of QP */
+        le32 qpn;
+        /* The mask of attributes to be queried, enum virtio_ib_qp_attr_mask */
+        le32 attr_mask;
+};
+
+struct virtio_rdma_ack_query_qp {
+        /* Current QP state, enum virtio_ib_qp_state */
+        u8 qp_state;
+        /* Path MTU (valid only for RC/UC QPs), enum virtio_ib_mtu */
+        u8 path_mtu;
+        /* Is the SQ draining */
+        u8 sq_draining;
+        /* Number of outstanding RDMA read operations on destination QP (valid only for RC QPs) */
+        u8 max_rd_atomic;
+        /* Number of responder resources for handling incoming RDMA read operations (valid only for RC QPs) */
+        u8 max_dest_rd_atomic;
+        /* Minimum RNR NAK timer (valid only for RC QPs) */
+        u8 min_rnr_timer;
+        /* Local ack timeout (valid only for RC QPs) */
+        u8 timeout;
+        /* Retry count (valid only for RC QPs) */
+        u8 retry_cnt;
+        /* RNR retry (valid only for RC QPs) */
+        u8 rnr_retry;
+        /* Padding */
+        u8 padding[7];
+        /* Q_Key for the QP (valid only for UD QPs) */
+        le32 qkey;
+        /* PSN for RQ (valid only for RC/UC QPs) */
+        le32 rq_psn;
+        /* PSN for SQ */
+        le32 sq_psn;
+        /* Destination QP number (valid only for RC/UC QPs) */
+        le32 dest_qp_num;
+        /* Mask of enabled remote access operations (valid only for RC/UC QPs), enum virtio_ib_access_flags */
+        le32 qp_access_flags;
+        /* Rate limit in kbps for packet pacing */
+        le32 rate_limit;
+        /* QP capabilities */
+        struct virtio_rdma_qp_cap cap;
+        /* Address Vector (valid only for RC/UC QPs) */
+        struct virtio_rdma_ah_attr ah_attr;
+        /* Reserved for future */
+        le32 reserved[4];
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_DESTROY_QP] Destroy a Queue Pair.
+  The command-specific-data is \field{virtio_rdma_cmd_destroy_qp};
+  no ack-specific-data.
+
+\begin{lstlisting}
+struct virtio_rdma_cmd_destroy_qp {
+        /* The index of QP */
+        le32 qpn;
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_CREATE_AH] Create an Address Handle (AH).
+  The command-specific-data is \field{virtio_rdma_cmd_create_ah};
+  the ack-specific-data is \field{virtio_rdma_ack_create_ah}.
+
+\begin{lstlisting}
+struct virtio_rdma_cmd_create_ah {
+        /* The handle of the PD with which the AH is associated */
+        le32 pdn;
+        /* Padding */
+        le32 padding;
+        /* Address Vector */
+        struct virtio_rdma_ah_attr ah_attr;
+};
+
+struct virtio_rdma_ack_create_ah {
+        /* The address handle */
+        le32 ah;
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_DESTROY_AH] Destroy an Address Handle.
+  The command-specific-data is \field{virtio_rdma_cmd_destroy_ah};
+  no ack-specific-data.
+
+\begin{lstlisting}
+struct virtio_rdma_cmd_destroy_ah {
+        /* The handle of the PD with which the AH is associated */
+        le32 pdn;
+        /* The address handle */
+        le32 ah;
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_ADD_GID] Add a Global Identifier (GID).
+  The command-specific-data is \field{virtio_rdma_cmd_add_gid};
+  no ack-specific-data.
+
+\begin{lstlisting}
+struct virtio_rdma_cmd_add_gid {
+        /* Index of GID */
+        le16 index;
+        /* Padding */
+        le16 padding[3];
+        /* GID to be added */
+        u8 gid[16];
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_DEL_GID] Delete a Global Identifier.
+  The command-specific-data is \field{virtio_rdma_cmd_del_gid};
+  no ack-specific-data.
+
+\begin{lstlisting}
+struct virtio_rdma_cmd_del_gid {
+        /* Index of GID */
+        le16 index;
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_REQ_NOTIFY_CQ] Request a completion notification
+  on a Completion Queue.
+  The command-specific-data is \field{virtio_rdma_cmd_req_notify};
+  no ack-specific-data.
+
+\begin{lstlisting}
+struct virtio_rdma_cmd_req_notify {
+        /* The index of CQ */
+        le32 cqn;
+#define VIRTIO_IB_NOTIFY_SOLICITED (1 << 0)
+#define VIRTIO_IB_NOTIFY_NEXT_COMPLETION (1 << 1)
+        /* Notify flags */
+        le32 flags;
+};
+\end{lstlisting}
+
+\end{description}
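
To illustrate how a command from the list above is framed on the control virtqueue, the following C sketch serializes a VIRTIO_NET_CTRL_ROCE_CREATE_CQ request and parses its reply. It is only a sketch under stated assumptions: the helper names (`put_le32`, `get_le32`, `roce_encode_create_cq`, `roce_decode_create_cq_ack`) are hypothetical, the command-specific-data is assumed to follow \field{class} and \field{command} with no padding, and the ack-specific-data is assumed to follow the \field{ack} byte with no padding. VIRTIO_NET_OK is 0, as defined in the existing control virtqueue section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_NET_CTRL_ROCE            6
#define VIRTIO_NET_CTRL_ROCE_CREATE_CQ  2
#define VIRTIO_NET_OK                   0

/* Store a 32-bit value in little-endian byte order. */
void put_le32(uint8_t *p, uint32_t v)
{
        p[0] = (uint8_t)(v & 0xff);
        p[1] = (uint8_t)((v >> 8) & 0xff);
        p[2] = (uint8_t)((v >> 16) & 0xff);
        p[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Load a 32-bit value stored in little-endian byte order. */
uint32_t get_le32(const uint8_t *p)
{
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Device-readable part: class, command, struct virtio_rdma_cmd_create_cq.
 * Returns the number of bytes written, or 0 if the buffer is too small.
 */
size_t roce_encode_create_cq(uint8_t *out, size_t cap, uint32_t cqe)
{
        assert(out != NULL);
        if (cap < 6)
                return 0;
        out[0] = VIRTIO_NET_CTRL_ROCE;
        out[1] = VIRTIO_NET_CTRL_ROCE_CREATE_CQ;
        put_le32(out + 2, cqe);
        return 6;
}

/*
 * Device-writable part: ack byte, struct virtio_rdma_ack_create_cq.
 * Returns 0 and stores the CQ index on success, -1 otherwise.
 */
int roce_decode_create_cq_ack(const uint8_t *in, size_t len, uint32_t *cqn)
{
        assert(in != NULL && cqn != NULL);
        if (len < 5 || in[0] != VIRTIO_NET_OK)
                return -1;
        *cqn = get_le32(in + 1);
        return 0;
}
```

Explicit byte-wise serialization keeps the sketch independent of host endianness; a real driver would more likely use its platform's little-endian conversion helpers on the structures directly.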
+
+\drivernormative{\subparagraph}{RoCE Configuration}{Device Types / Network Device / Device Operation / Control Virtqueue / RoCE Configuration}
+
+A driver MUST initialize the completion virtqueue and fill it with
+enough entries after command VIRTIO_NET_CTRL_ROCE_CREATE_CQ is
+successfully executed.
+
+A driver MUST reset the completion virtqueue after
+command VIRTIO_NET_CTRL_ROCE_DESTROY_CQ is successfully executed.
+
+A driver MUST initialize the send virtqueue and receive virtqueue after
+command VIRTIO_NET_CTRL_ROCE_CREATE_QP is successfully executed.
+
+A driver MUST reset the send virtqueue and receive virtqueue after
+command VIRTIO_NET_CTRL_ROCE_DESTROY_QP is successfully executed.
 
 \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device
 Types / Network Device / Legacy Interface: Framing Requirements}
@@ -4496,6 +5063,289 @@  \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device
 See \ref{sec:Basic
 Facilities of a Virtio Device / Virtqueues / Message Framing}.
 
+\subsubsection{RoCE Support}\label{sec:Device Types / Network Device / Device Operation / RoCE Support}
+
+RDMA over Converged Ethernet (RoCE) is a network protocol that allows
+remote direct memory access (RDMA) over an Ethernet network. To support
+RoCE (if VIRTIO_NET_F_ROCE is negotiated), in addition to the control
+virtqueue support mentioned in \ref{sec:Device Types / Network Device /
+Device Operation / Control Virtqueue / RoCE Configuration}, multiple
+types of virtqueues including send virtqueue, receive virtqueue and
+completion virtqueue are introduced.
+
+The send virtqueue contains elements that describe the data to be
+transmitted.
+
+Requests (device-readable) have the following format:
+
+\begin{lstlisting}
+enum virtio_ib_wr_opcode {
+        VIRTIO_IB_WR_RDMA_WRITE,
+        VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM,
+        VIRTIO_IB_WR_SEND,
+        VIRTIO_IB_WR_SEND_WITH_IMM,
+        VIRTIO_IB_WR_RDMA_READ,
+};
+
+struct virtio_rdma_sge {
+        le64 addr;
+        le32 length;
+        le32 lkey;
+};
+
+struct virtio_rdma_sq_req {
+        /* User defined WR ID */
+        le64 wr_id;
+        /* WR opcode, enum virtio_ib_wr_opcode */
+        u8 opcode;
+#define VIRTIO_IB_SEND_FENCE        (1 << 0)
+#define VIRTIO_IB_SEND_SIGNALED     (1 << 1)
+#define VIRTIO_IB_SEND_SOLICITED    (1 << 2)
+#define VIRTIO_IB_SEND_INLINE       (1 << 3)
+        /* Flags of the WR properties */
+        u8 send_flags;
+        /* Padding */
+        le16 padding;
+        /* Immediate data (in network byte order) to send */
+        le32 imm_data;
+        union {
+                struct {
+                        /* Start address of remote memory buffer */
+                        le64 remote_addr;
+                        /* Key of the remote MR */
+                        le32 rkey;
+                } rdma;
+                struct {
+                        /* Index of the destination QP */
+                        le32 remote_qpn;
+                        /* Q_Key of the destination QP */
+                        le32 remote_qkey;
+                        /* Address Handle */
+                        le32 ah;
+                } ud;
+                /* Reserved for future */
+                le64 reserved[4];
+        };
+        /* Inline data */
+        u8 inline_data[512];
+        union {
+                /* Length of sg_list */
+                le32 num_sge;
+                /* Length of inline data */
+                le16 inline_len;
+        };
+        /* Reserved for future */
+        le32 reserved2[3];
+        /* Scatter/gather list */
+        struct virtio_rdma_sge sg_list[];
+};
+\end{lstlisting}
+
+The receive virtqueue contains elements that describe where to place incoming data.
+
+Requests (device-readable) have the following format:
+
+\begin{lstlisting}
+struct virtio_rdma_rq_req {
+        /* User defined WR ID */
+        le64 wr_id;
+        /* Length of sg_list */
+        le32 num_sge;
+        /* Reserved for future */
+        le32 reserved[3];
+        /* Scatter/gather list */
+        struct virtio_rdma_sge sg_list[];
+};
+\end{lstlisting}
+
+The completion virtqueue is used to notify the completion of requests in
+the send virtqueue or the receive virtqueue.
+
+Requests (device-writable) have the following format:
+
+\begin{lstlisting}
+enum virtio_ib_wc_opcode {
+        VIRTIO_IB_WC_SEND,
+        VIRTIO_IB_WC_RDMA_WRITE,
+        VIRTIO_IB_WC_RDMA_READ,
+        VIRTIO_IB_WC_RECV,
+        VIRTIO_IB_WC_RECV_RDMA_WITH_IMM,
+};
+
+enum virtio_ib_wc_status {
+        /* Operation completed successfully */
+        VIRTIO_IB_WC_SUCCESS,
+        /* Local Length Error */
+        VIRTIO_IB_WC_LOC_LEN_ERR,
+        /* Local QP Operation Error */
+        VIRTIO_IB_WC_LOC_QP_OP_ERR,
+        /* Local Protection Error */
+        VIRTIO_IB_WC_LOC_PROT_ERR,
+        /* Work Request Flushed Error */
+        VIRTIO_IB_WC_WR_FLUSH_ERR,
+        /* Bad Response Error */
+        VIRTIO_IB_WC_BAD_RESP_ERR,
+        /* Local Access Error */
+        VIRTIO_IB_WC_LOC_ACCESS_ERR,
+        /* Remote Invalid Request Error */
+        VIRTIO_IB_WC_REM_INV_REQ_ERR,
+        /* Remote Access Error */
+        VIRTIO_IB_WC_REM_ACCESS_ERR,
+        /* Remote Operation Error */
+        VIRTIO_IB_WC_REM_OP_ERR,
+        /* Transport Retry Counter Exceeded */
+        VIRTIO_IB_WC_RETRY_EXC_ERR,
+        /* RNR Retry Counter Exceeded */
+        VIRTIO_IB_WC_RNR_RETRY_EXC_ERR,
+        /* Remote Aborted Error */
+        VIRTIO_IB_WC_REM_ABORT_ERR,
+        /* Fatal Error */
+        VIRTIO_IB_WC_FATAL_ERR,
+        /* Response Timeout Error */
+        VIRTIO_IB_WC_RESP_TIMEOUT_ERR,
+        /* General Error */
+        VIRTIO_IB_WC_GENERAL_ERR
+};
+
+struct virtio_rdma_cq_req {
+        /* User defined WR ID */
+        le64 wr_id;
+        /* Work completion status, enum virtio_ib_wc_status */
+        u8 status;
+        /* WR opcode, enum virtio_ib_wc_opcode */
+        u8 opcode;
+        /* Padding */
+        le16 padding;
+        /* Vendor error */
+        le32 vendor_err;
+        /* Number of bytes transferred */
+        le32 byte_len;
+        /* Immediate data (in network byte order) received with the message */
+        le32 imm_data;
+        /* Local QP number of completed WR */
+        le32 qp_num;
+        /* Source QP number (remote QP number) of completed WR (valid only for UD QPs) */
+        le32 src_qp;
+#define VIRTIO_IB_WC_GRH         (1 << 0)
+#define VIRTIO_IB_WC_WITH_IMM    (1 << 1)
+        /* Work completion flag */
+        le32 wc_flags;
+        /* Reserved for future */
+        le32 reserved[3];
+};
+\end{lstlisting}
+
+\paragraph{Send Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Send Operation}
+
+The send operation allows us to send data to a remote QP’s Receive Queue.
+The receiver MUST have previously posted a receive buffer to receive the data.
+
+To do a send operation, a request with \field{opcode} set to
+VIRTIO_IB_WR_SEND or VIRTIO_IB_WR_SEND_WITH_IMM MUST be posted to the Send
+Queue as one output descriptor and the device is notified of the new entry.
+
+\drivernormative{\subparagraph}{Send Operation}{Device Types / Network Device / Device Operation / RoCE Support / Send Operation}
+
+If VIRTIO_IB_SEND_INLINE is set in \field{send_flags}, the driver MUST copy
+the send buffer into the \field{inline_data} field and set \field{inline_len} to the
+length of the buffer. Otherwise, the driver MUST fill \field{sg_list} to
+describe the buffer.
+
+\devicenormative{\subparagraph}{Send Operation}{Device Types / Network Device / Device Operation / RoCE Support / Send Operation}
+
+If \field{opcode} is not set to VIRTIO_IB_WR_SEND_WITH_IMM, the device MUST
+ignore \field{imm_data}.
+
+If the QP type is UD, the device MUST validate \field{ud.ah}.
+
+If VIRTIO_IB_SEND_INLINE is not set in \field{send_flags}, the device MUST
+validate the \field{addr}, \field{length} and \field{lkey} in \field{sg_list}.
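
The inline and scatter/gather variants of a send request described above could be filled in as in the following C sketch. It is illustrative only: the helper names `sq_req_fill_send_inline` and `sq_req_fill_send_sg` are hypothetical, fixed-width host types stand in for the le16/le32/le64 fields (so a little-endian host is assumed and byte-order conversion is omitted), and the caller is assumed to have allocated room for the scatter/gather entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum virtio_ib_wr_opcode {
        VIRTIO_IB_WR_RDMA_WRITE,
        VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM,
        VIRTIO_IB_WR_SEND,
        VIRTIO_IB_WR_SEND_WITH_IMM,
        VIRTIO_IB_WR_RDMA_READ,
};
#define VIRTIO_IB_SEND_INLINE (1 << 3)

struct virtio_rdma_sge {
        uint64_t addr;
        uint32_t length;
        uint32_t lkey;
};

struct virtio_rdma_sq_req {
        uint64_t wr_id;
        uint8_t opcode;
        uint8_t send_flags;
        uint16_t padding;
        uint32_t imm_data;
        union {
                struct { uint64_t remote_addr; uint32_t rkey; } rdma;
                struct { uint32_t remote_qpn, remote_qkey, ah; } ud;
                uint64_t reserved[4];
        };
        uint8_t inline_data[512];
        union {
                uint32_t num_sge;
                uint16_t inline_len;
        };
        uint32_t reserved2[3];
        struct virtio_rdma_sge sg_list[];
};

/* With natural alignment the fixed part of the request is 576 bytes. */
static_assert(offsetof(struct virtio_rdma_sq_req, sg_list) == 576,
              "unexpected request layout");

/* Inline send: copy the payload into inline_data; -1 if it does not fit. */
int sq_req_fill_send_inline(struct virtio_rdma_sq_req *req, uint64_t wr_id,
                            const void *buf, size_t len)
{
        if (len > sizeof(req->inline_data))
                return -1;
        memset(req, 0, sizeof(*req));
        req->wr_id = wr_id;
        req->opcode = VIRTIO_IB_WR_SEND;
        req->send_flags = VIRTIO_IB_SEND_INLINE;
        memcpy(req->inline_data, buf, len);
        req->inline_len = (uint16_t)len;
        return 0;
}

/* Non-inline send: describe the payload through n scatter/gather entries. */
void sq_req_fill_send_sg(struct virtio_rdma_sq_req *req, uint64_t wr_id,
                         const struct virtio_rdma_sge *sg, uint32_t n)
{
        memset(req, 0, sizeof(*req));
        req->wr_id = wr_id;
        req->opcode = VIRTIO_IB_WR_SEND;
        req->num_sge = n;
        memcpy(req->sg_list, sg, (size_t)n * sizeof(*sg));
}
```

Because \field{num_sge} and \field{inline_len} share a union, only the member that matches VIRTIO_IB_SEND_INLINE is written in each helper.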
+
+\paragraph{Receive Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Receive Operation}
+
+The receive operation allows us to receive data from a remote QP.
+It is the counterpart of the send operation.
+
+To do a receive operation, a request MUST be posted to the Receive
+Queue as one output descriptor and the device is notified of the new entry.
+
+\drivernormative{\subparagraph}{Receive Operation}{Device Types / Network Device / Device Operation / RoCE Support / Receive Operation}
+
+The driver MUST fill \field{sg_list} to describe the receive buffer.
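
As an illustration of the requirement above, the following C sketch builds a receive request with a variable-length \field{sg_list}. The helper names `rq_req_size` and `rq_req_alloc` are hypothetical, host-order fixed-width types stand in for the little-endian fields (a little-endian host is assumed), and heap allocation is used only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct virtio_rdma_sge {
        uint64_t addr;
        uint32_t length;
        uint32_t lkey;
};

struct virtio_rdma_rq_req {
        uint64_t wr_id;
        uint32_t num_sge;
        uint32_t reserved[3];
        struct virtio_rdma_sge sg_list[];
};

/* The fixed header is 24 bytes; each scatter/gather entry adds 16 bytes. */
static_assert(sizeof(struct virtio_rdma_rq_req) == 24, "unexpected layout");
static_assert(sizeof(struct virtio_rdma_sge) == 16, "unexpected layout");

/* Total size in bytes of a receive request carrying num_sge entries. */
size_t rq_req_size(uint32_t num_sge)
{
        return sizeof(struct virtio_rdma_rq_req) +
               (size_t)num_sge * sizeof(struct virtio_rdma_sge);
}

/*
 * Allocate and fill a receive request. The caller frees the result with
 * free(). Returns NULL if the allocation fails.
 */
struct virtio_rdma_rq_req *rq_req_alloc(uint64_t wr_id,
                                        const struct virtio_rdma_sge *sg,
                                        uint32_t n)
{
        struct virtio_rdma_rq_req *req = calloc(1, rq_req_size(n));

        if (req == NULL)
                return NULL;
        req->wr_id = wr_id;
        req->num_sge = n;
        if (n > 0)
                memcpy(req->sg_list, sg, (size_t)n * sizeof(*sg));
        return req;
}
```

The resulting buffer of `rq_req_size(n)` bytes is what would be posted to the receive virtqueue as one device-readable descriptor.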
+
+\devicenormative{\subparagraph}{Receive Operation}{Device Types / Network Device / Device Operation / RoCE Support / Receive Operation}
+
+The device MUST validate the \field{addr}, \field{length} and \field{lkey}
+in \field{sg_list}.
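
The text does not define what validating a scatter/gather entry involves. One plausible reading, shown here purely as an assumption, is that the device checks the \field{lkey}, the access rights and the address range of the entry against a registered Memory Region. The `struct roce_mr` bookkeeping type and the `sge_valid` helper below are hypothetical and not part of the proposal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_IB_ACCESS_LOCAL_WRITE (1 << 0)

struct virtio_rdma_sge {
        uint64_t addr;
        uint32_t length;
        uint32_t lkey;
};

static_assert(sizeof(struct virtio_rdma_sge) == 16, "unexpected layout");

/* Hypothetical device-side record of a registered Memory Region. */
struct roce_mr {
        uint64_t virt_addr;     /* start of the region */
        uint64_t length;        /* size of the region in bytes */
        uint32_t lkey;          /* local access key returned at registration */
        uint32_t access_flags;  /* enum virtio_ib_access_flags bits */
};

/*
 * Returns true if the entry uses the MR's lkey, the MR grants all bits in
 * required_access, and [addr, addr + length) lies inside the region. The
 * range check is written so that it cannot overflow.
 */
bool sge_valid(const struct virtio_rdma_sge *sge, const struct roce_mr *mr,
               uint32_t required_access)
{
        if (sge->lkey != mr->lkey)
                return false;
        if ((mr->access_flags & required_access) != required_access)
                return false;
        if (sge->addr < mr->virt_addr || sge->length > mr->length)
                return false;
        return sge->addr - mr->virt_addr <= mr->length - sge->length;
}
```

For a receive buffer the device writes into guest memory, so a caller would presumably pass VIRTIO_IB_ACCESS_LOCAL_WRITE as `required_access`.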
+
+\paragraph{Write Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Write Operation}
+
+The write operation allows us to write data to a memory buffer on the
+remote side without notifying it. The remote side is not aware that the
+operation is taking place.
+
+To do a write operation, a request with \field{opcode} set to
+VIRTIO_IB_WR_RDMA_WRITE or VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM MUST be
+posted to the Send Queue as one output descriptor and the device is
+notified of the new entry.
+
+\drivernormative{\subparagraph}{Write Operation}{Device Types / Network Device / Device Operation / RoCE Support / Write Operation}
+
+The driver MUST fill \field{sg_list} to describe the write buffer.
+
+The driver MUST fill \field{rdma.remote_addr} and \field{rdma.rkey} to
+identify the remote buffer.
+
+\devicenormative{\subparagraph}{Write Operation}{Device Types / Network Device / Device Operation / RoCE Support / Write Operation}
+
+If \field{opcode} is not set to VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM, the device
+MUST ignore \field{imm_data}.
+
+The device MUST validate the \field{addr}, \field{length} and \field{lkey}
+in \field{sg_list}.
+
+\paragraph{Read Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Read Operation}
+
+The read operation allows us to read data from a memory buffer on the
+remote side without notifying it. The remote side is not aware that the
+operation is taking place.
+
+To do a read operation, a request with \field{opcode} set to
+VIRTIO_IB_WR_RDMA_READ MUST be posted to the Send Queue as one output
+descriptor and the device is notified of the new entry.
+
+\drivernormative{\subparagraph}{Read Operation}{Device Types / Network Device / Device Operation / RoCE Support / Read Operation}
+
+The driver MUST fill \field{sg_list} to describe the read buffer.
+
+The driver MUST fill \field{rdma.remote_addr} and \field{rdma.rkey} to
+identify the remote buffer.
+
+\devicenormative{\subparagraph}{Read Operation}{Device Types / Network Device / Device Operation / RoCE Support / Read Operation}
+
+The device MUST validate the \field{addr}, \field{length} and \field{lkey}
+in \field{sg_list}.
+
+\paragraph{Completion Notification}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Completion Notification}
+
+After any of the above operations completes, a completion notification MUST
+be triggered by the device. To achieve that, the device MUST consume
+an entry of the Completion Queue associated with the Send Queue or Receive
+Queue to which the operation belongs.
+
+\drivernormative{\subparagraph}{Completion Notification}{Device Types / Network Device / Device Operation / RoCE Support / Completion Notification}
+
+The driver MUST fill the Completion Queue with enough entries beforehand.
+
+\devicenormative{\subparagraph}{Completion Notification}{Device Types / Network Device / Device Operation / RoCE Support / Completion Notification}
+
+If \field{imm_data} is valid, the device MUST set VIRTIO_IB_WC_WITH_IMM in
+\field{wc_flags}.
+
+The device MUST set \field{wr_id} to the value of \field{wr_id} of the
+corresponding \field{struct virtio_rdma_sq_req} or
+\field{struct virtio_rdma_rq_req}.
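
On the driver side, the completion rules above could be consumed as in the following C sketch. The helper names `cq_req_completes` and `cq_req_get_imm` are hypothetical, host-order fixed-width types stand in for the little-endian fields (a little-endian host is assumed), and VIRTIO_IB_WC_SUCCESS is taken to be 0 because it is the first entry of `enum virtio_ib_wc_status`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_IB_WC_SUCCESS     0  /* first entry of enum virtio_ib_wc_status */
#define VIRTIO_IB_WC_GRH         (1 << 0)
#define VIRTIO_IB_WC_WITH_IMM    (1 << 1)

struct virtio_rdma_cq_req {
        uint64_t wr_id;
        uint8_t status;
        uint8_t opcode;
        uint16_t padding;
        uint32_t vendor_err;
        uint32_t byte_len;
        uint32_t imm_data;
        uint32_t qp_num;
        uint32_t src_qp;
        uint32_t wc_flags;
        uint32_t reserved[3];
};

/* With natural alignment a completion entry occupies 48 bytes. */
static_assert(sizeof(struct virtio_rdma_cq_req) == 48, "unexpected layout");

/* True if the entry reports successful completion of the given WR. */
bool cq_req_completes(const struct virtio_rdma_cq_req *c, uint64_t wr_id)
{
        return c->wr_id == wr_id && c->status == VIRTIO_IB_WC_SUCCESS;
}

/*
 * Stores the immediate data in *imm and returns true only if the device
 * marked it valid through VIRTIO_IB_WC_WITH_IMM; otherwise returns false
 * and leaves *imm untouched.
 */
bool cq_req_get_imm(const struct virtio_rdma_cq_req *c, uint32_t *imm)
{
        if ((c->wc_flags & VIRTIO_IB_WC_WITH_IMM) == 0)
                return false;
        *imm = c->imm_data;
        return true;
}
```

Matching on \field{wr_id} is what lets the driver pair a completion entry with the send or receive request that produced it.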
+
 \section{Block Device}\label{sec:Device Types / Block Device}
 
 The virtio block device is a simple virtual block device (ie.