diff mbox series

[v1,net-next,02/15] net: Introduce direct data placement tcp offload

Message ID 20201207210649.19194-3-borisp@mellanox.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series nvme-tcp receive offloads | expand

Checks

Context Check Description
netdev/apply fail Patch does not apply to net-next
netdev/tree_selection success Clearly marked for net-next

Commit Message

Boris Pismenny Dec. 7, 2020, 9:06 p.m. UTC
This commit introduces direct data placement offload for TCP.
This capability is accompanied by new net_device operations that
configure hardware contexts. There is a context per socket, and a
context per DDP operation. Additionally, a resynchronization routine
is used to help hardware handle TCP OOO and continue the offload.
Furthermore, we let the offloading driver advertise its maximum hw
sectors/segments.

Using this interface, the NIC hardware will scatter TCP payload directly
to the BIO pages according to the command_id.
To maintain the correctness of the network stack, the driver is expected
to construct SKBs that point to the BIO pages.

Thus, the SKB represents the data on the wire, while pointing
to data that is already placed in the destination buffer.
As a result, data from page frags should not be copied out to
the linear part.

As SKBs that use DDP are already very memory efficient, we modify
skb_condense to avoid copying data from fragments to the linear
part of SKBs that belong to a socket that uses DDP offload.
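
For illustration, the skb_condense change amounts conceptually to the
guard below (a sketch, not the exact hunk; ddp_offload_active() is only
an illustrative name for the check on the socket's ULP DDP state):

/* net/core/skbuff.c (conceptual sketch) */
static bool ddp_offload_active(const struct sk_buff *skb)
{
        return skb->sk &&
               rcu_access_pointer(inet_csk(skb->sk)->icsk_ulp_ddp_data);
}

void skb_condense(struct sk_buff *skb)
{
        if (skb->data_len) {
                if (skb->data_len > skb->end - skb->tail ||
                    skb_cloned(skb) || ddp_offload_active(skb))
                        return; /* frags already sit in their final buffers */

                __pskb_pull_tail(skb, skb->data_len);
        }
        /* truesize accounting unchanged */
        skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
}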

A follow-up patch will use this interface for DDP in NVMe-TCP.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yoray Zack <yorayz@mellanox.com>
---
 include/linux/netdev_features.h    |   2 +
 include/linux/netdevice.h          |   5 ++
 include/net/inet_connection_sock.h |   4 +
 include/net/tcp_ddp.h              | 129 +++++++++++++++++++++++++++++
 net/Kconfig                        |   9 ++
 net/core/skbuff.c                  |   9 +-
 net/ethtool/common.c               |   1 +
 7 files changed, 158 insertions(+), 1 deletion(-)
 create mode 100644 include/net/tcp_ddp.h

Comments

David Ahern Dec. 8, 2020, 12:42 a.m. UTC | #1
On 12/7/20 2:06 PM, Boris Pismenny wrote:
> This commit introduces direct data placement offload for TCP.
> This capability is accompanied by new net_device operations that
> configure hardware contexts. There is a context per socket, and a
> context per DDP operation. Additionally, a resynchronization routine
> is used to help hardware handle TCP OOO and continue the offload.
> Furthermore, we let the offloading driver advertise its maximum hw
> sectors/segments.
> 
> Using this interface, the NIC hardware will scatter TCP payload directly
> to the BIO pages according to the command_id.
> To maintain the correctness of the network stack, the driver is expected
> to construct SKBs that point to the BIO pages.
> 
> Thus, the SKB represents the data on the wire, while pointing
> to data that is already placed in the destination buffer.
> As a result, data from page frags should not be copied out to
> the linear part.
> 
> As SKBs that use DDP are already very memory efficient, we modify
> skb_condense to avoid copying data from fragments to the linear
> part of SKBs that belong to a socket that uses DDP offload.
> 
> A follow-up patch will use this interface for DDP in NVMe-TCP.
> 

You call this Direct Data Placement - which sounds like a marketing name.

Fundamentally, this starts with offloading TCP socket buffers for a
specific flow, so generically a TCP Rx zerocopy for kernel stack managed
sockets (as opposed to AF_XDP's zerocopy). Why is this not building in
that level of infrastructure first and adding ULPs like NVME on top?
Boris Pismenny Dec. 8, 2020, 2:36 p.m. UTC | #2
On 08/12/2020 2:42, David Ahern wrote:
> On 12/7/20 2:06 PM, Boris Pismenny wrote:
>> This commit introduces direct data placement offload for TCP.
>> This capability is accompanied by new net_device operations that
>> configure hardware contexts. There is a context per socket, and a
>> context per DDP operation. Additionally, a resynchronization routine
>> is used to help hardware handle TCP OOO and continue the offload.
>> Furthermore, we let the offloading driver advertise its maximum hw
>> sectors/segments.
>>
>> Using this interface, the NIC hardware will scatter TCP payload directly
>> to the BIO pages according to the command_id.
>> To maintain the correctness of the network stack, the driver is expected
>> to construct SKBs that point to the BIO pages.
>>
>> Thus, the SKB represents the data on the wire, while pointing
>> to data that is already placed in the destination buffer.
>> As a result, data from page frags should not be copied out to
>> the linear part.
>>
>> As SKBs that use DDP are already very memory efficient, we modify
>> skb_condense to avoid copying data from fragments to the linear
>> part of SKBs that belong to a socket that uses DDP offload.
>>
>> A follow-up patch will use this interface for DDP in NVMe-TCP.
>>
> 
> You call this Direct Data Placement - which sounds like a marketing name.
> 

[Re-sending as the previous one didn't hit the mailing list. Sorry for the spam]

Interesting idea. But, unlike SKBTX_DEV_ZEROCOPY this SKB can be inspected/modified by the stack without the need to copy things out. Additionally, the SKB may contain both data that is already placed in its final destination buffer (PDU data) and data that isn't (PDU header); it doesn't matter. Therefore, labeling the entire SKB as zerocopy doesn't convey the desired information. Moreover, skipping copies in the stack to receive zerocopy SKBs will require more invasive changes.

Our goal in this approach was to provide the smallest change that enables the desired functionality while preserving the performance of existing flows that do not care for it. An alternative approach, which we considered and which doesn't affect existing flows at all, was to make a special version of memcpy_to_page to be used by DDP providers (nvme-tcp). That alternative would require creating corresponding special versions for users of this function, such as skb_copy_datagram_iter. This is more invasive, so in this patchset we decided to avoid it.
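
To illustrate that alternative (purely a sketch; the helper name is
hypothetical and not part of this patchset), a DDP-aware copy would
skip the memcpy whenever the NIC has already placed the bytes at their
destination:

/* hypothetical helper for DDP providers such as nvme-tcp */
static void ddp_memcpy_to_page(struct page *page, size_t offset,
                               const char *from, size_t len)
{
        char *to = page_address(page) + offset;

        if (to != from)         /* not already placed by the NIC */
                memcpy(to, from, len);
}

Every helper on the receive path that can end up copying payload
(e.g. skb_copy_datagram_iter and the copy callbacks under it) would
need a similar special version, which is what makes this alternative
more invasive.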

> Fundamentally, this starts with offloading TCP socket buffers for a
> specific flow, so generically a TCP Rx zerocopy for kernel stack managed
> sockets (as opposed to AF_XDP's zerocopy). Why is this not building in
> that level of infrastructure first and adding ULPs like NVME on top?
> 

We aren't using AF_XDP or any of the Rx zerocopy infrastructure, because it is unsuitable for data placement for nvme-tcp, which reorders responses relative to requests for efficiency and requires that data reside in specific destination buffers.
David Ahern Dec. 9, 2020, 12:38 a.m. UTC | #3
On 12/8/20 7:36 AM, Boris Pismenny wrote:
> On 08/12/2020 2:42, David Ahern wrote:
>> On 12/7/20 2:06 PM, Boris Pismenny wrote:
>>> This commit introduces direct data placement offload for TCP.
>>> This capability is accompanied by new net_device operations that
>>> configure hardware contexts. There is a context per socket, and a
>>> context per DDP operation. Additionally, a resynchronization routine
>>> is used to help hardware handle TCP OOO and continue the offload.
>>> Furthermore, we let the offloading driver advertise its maximum hw
>>> sectors/segments.
>>>
>>> Using this interface, the NIC hardware will scatter TCP payload directly
>>> to the BIO pages according to the command_id.
>>> To maintain the correctness of the network stack, the driver is expected
>>> to construct SKBs that point to the BIO pages.
>>>
>>> Thus, the SKB represents the data on the wire, while pointing
>>> to data that is already placed in the destination buffer.
>>> As a result, data from page frags should not be copied out to
>>> the linear part.
>>>
>>> As SKBs that use DDP are already very memory efficient, we modify
>>> skb_condense to avoid copying data from fragments to the linear
>>> part of SKBs that belong to a socket that uses DDP offload.
>>>
>>> A follow-up patch will use this interface for DDP in NVMe-TCP.
>>>
>>
>> You call this Direct Data Placement - which sounds like a marketing name.
>>
> 
> [Re-sending as the previous one didn't hit the mailing list. Sorry for the spam]
> 
> Interesting idea. But, unlike SKBTX_DEV_ZEROCOPY this SKB can be inspected/modified by the stack without the need to copy things out. Additionally, the SKB may contain both data that is already placed in its final destination buffer (PDU data) and data that isn't (PDU header); it doesn't matter. Therefore, labeling the entire SKB as zerocopy doesn't convey the desired information. Moreover, skipping copies in the stack to receive zerocopy SKBs will require more invasive changes.
> 
> Our goal in this approach was to provide the smallest change that enables the desired functionality while preserving the performance of existing flows that do not care for it. An alternative approach, which we considered and which doesn't affect existing flows at all, was to make a special version of memcpy_to_page to be used by DDP providers (nvme-tcp). That alternative would require creating corresponding special versions for users of this function, such as skb_copy_datagram_iter. This is more invasive, so in this patchset we decided to avoid it.
> 
>> Fundamentally, this starts with offloading TCP socket buffers for a
>> specific flow, so generically a TCP Rx zerocopy for kernel stack managed
>> sockets (as opposed to AF_XDP's zerocopy). Why is this not building in
>> that level of infrastructure first and adding ULPs like NVME on top?
>>
> 
> We aren't using AF_XDP or any of the Rx zerocopy infrastructure, because it is unsuitable for data placement for nvme-tcp, which reorders responses relative to requests for efficiency and requires that data reside in specific destination buffers.
> 
> 

The AF_XDP reference was to differentiate one zerocopy use case (all
packets go to userspace) from another (kernel managed TCP socket with
zerocopy payload). You are focusing on a very narrow use case - kernel
based NVMe over TCP - of a more general problem.

You have a TCP socket and a design that only works for kernel owned
sockets. You have specialized queues in the NIC, a flow rule directing
packets to those queues. Presumably some ULP parser in the NIC
associated with the queues to process NVMe packets. Rather than copying
headers (ethernet/ip/tcp) to one buffer and payload to another (which is
similar to what Jonathan Lemon is working on), this design has a ULP
processor that just splits out the TCP payload even more making it
highly selective about which part of the packet is put into which
buffer. Take out the NVMe part, and it is header split with zerocopy for
the payload - a generic feature that can have a wider impact with NVMe
as a special case.
David Ahern Dec. 9, 2020, 12:57 a.m. UTC | #4
On 12/7/20 2:06 PM, Boris Pismenny wrote:
> diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
> index 934de56644e7..fb35dcac03d2 100644
> --- a/include/linux/netdev_features.h
> +++ b/include/linux/netdev_features.h
> @@ -84,6 +84,7 @@ enum {
>  	NETIF_F_GRO_FRAGLIST_BIT,	/* Fraglist GRO */
>  
>  	NETIF_F_HW_MACSEC_BIT,		/* Offload MACsec operations */
> +	NETIF_F_HW_TCP_DDP_BIT,		/* TCP direct data placement offload */
>  
>  	/*
>  	 * Add your fresh new feature above and remember to update
> @@ -157,6 +158,7 @@ enum {
>  #define NETIF_F_GRO_FRAGLIST	__NETIF_F(GRO_FRAGLIST)
>  #define NETIF_F_GSO_FRAGLIST	__NETIF_F(GSO_FRAGLIST)
>  #define NETIF_F_HW_MACSEC	__NETIF_F(HW_MACSEC)
> +#define NETIF_F_HW_TCP_DDP	__NETIF_F(HW_TCP_DDP)

All of the DDP naming seems wrong to me. I realize the specific use case
is targeted payloads of a ULP, but it is still S/W handing H/W specific
buffers for a payload of a flow.


>  
>  /* Finds the next feature with the highest number of the range of start till 0.
>   */
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index a07c8e431f45..755766976408 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -934,6 +934,7 @@ struct dev_ifalias {
>  
>  struct devlink;
>  struct tlsdev_ops;
> +struct tcp_ddp_dev_ops;
>  
>  struct netdev_name_node {
>  	struct hlist_node hlist;
> @@ -1930,6 +1931,10 @@ struct net_device {
>  	const struct tlsdev_ops *tlsdev_ops;
>  #endif
>  
> +#ifdef CONFIG_TCP_DDP
> +	const struct tcp_ddp_dev_ops *tcp_ddp_ops;
> +#endif
> +
>  	const struct header_ops *header_ops;
>  
>  	unsigned int		flags;
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index 7338b3865a2a..a08b85b53aa8 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -66,6 +66,8 @@ struct inet_connection_sock_af_ops {
>   * @icsk_ulp_ops	   Pluggable ULP control hook
>   * @icsk_ulp_data	   ULP private data
>   * @icsk_clean_acked	   Clean acked data hook
> + * @icsk_ulp_ddp_ops	   Pluggable ULP direct data placement control hook
> + * @icsk_ulp_ddp_data	   ULP direct data placement private data

Neither of these socket layer intrusions are needed. All references but
1 -- the skbuff check -- are in the mlx5 driver. Any skb check that is
needed can be handled with a different setting.

>   * @icsk_listen_portaddr_node	hash to the portaddr listener hashtable
>   * @icsk_ca_state:	   Congestion control state
>   * @icsk_retransmits:	   Number of unrecovered [RTO] timeouts
> @@ -94,6 +96,8 @@ struct inet_connection_sock {
>  	const struct tcp_ulp_ops  *icsk_ulp_ops;
>  	void __rcu		  *icsk_ulp_data;
>  	void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);
> +	const struct tcp_ddp_ulp_ops  *icsk_ulp_ddp_ops;
> +	void __rcu		  *icsk_ulp_ddp_data;
>  	struct hlist_node         icsk_listen_portaddr_node;
>  	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
>  	__u8			  icsk_ca_state:5,
> diff --git a/include/net/tcp_ddp.h b/include/net/tcp_ddp.h
> new file mode 100644
> index 000000000000..df3264be4600
> --- /dev/null
> +++ b/include/net/tcp_ddp.h
> @@ -0,0 +1,129 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + *
> + * tcp_ddp.h
> + *	Author:	Boris Pismenny <borisp@mellanox.com>
> + *	Copyright (C) 2020 Mellanox Technologies.
> + */
> +#ifndef _TCP_DDP_H
> +#define _TCP_DDP_H
> +
> +#include <linux/netdevice.h>
> +#include <net/inet_connection_sock.h>
> +#include <net/sock.h>
> +
> +/* limits returned by the offload driver, zero means don't care */
> +struct tcp_ddp_limits {
> +	int	 max_ddp_sgl_len;
> +};
> +
> +enum tcp_ddp_type {
> +	TCP_DDP_NVME = 1,
> +};
> +
> +/**
> + * struct tcp_ddp_config - Generic tcp ddp configuration: tcp ddp IO queue
> + * config implementations must use this as the first member.
> + * Add new instances of tcp_ddp_config below (nvme-tcp, etc.).
> + */
> +struct tcp_ddp_config {
> +	enum tcp_ddp_type    type;
> +	unsigned char        buf[];

you have this variable length buf, but it is not used (as far as I can
tell). But then ...


> +};
> +
> +/**
> + * struct nvme_tcp_ddp_config - nvme tcp ddp configuration for an IO queue
> + *
> + * @pfv:        pdu version (e.g., NVME_TCP_PFV_1_0)
> + * @cpda:       controller pdu data alignment (dwords, 0's based)
> + * @dgst:       digest types enabled.
> + *              The netdev will offload crc if ddp_crc is supported.
> + * @queue_size: number of nvme-tcp IO queue elements
> + * @queue_id:   queue identifier
> + * @cpu_io:     cpu core running the IO thread for this queue
> + */
> +struct nvme_tcp_ddp_config {
> +	struct tcp_ddp_config   cfg;

... how would you use it within another struct like this?

> +
> +	u16			pfv;
> +	u8			cpda;
> +	u8			dgst;
> +	int			queue_size;
> +	int			queue_id;
> +	int			io_cpu;
> +};
> +
> +/**
> + * struct tcp_ddp_io - tcp ddp configuration for an IO request.
> + *
> + * @command_id:  identifier on the wire associated with these buffers
> + * @nents:       number of entries in the sg_table
> + * @sg_table:    describing the buffers for this IO request
> + * @first_sgl:   first SGL in sg_table
> + */
> +struct tcp_ddp_io {
> +	u32			command_id;
> +	int			nents;
> +	struct sg_table		sg_table;
> +	struct scatterlist	first_sgl[SG_CHUNK_SIZE];
> +};
> +
> +/* struct tcp_ddp_dev_ops - operations used by an upper layer protocol to configure ddp offload
> + *
> + * @tcp_ddp_limits:    limit the number of scatter gather entries per IO.
> + *                     the device driver can use this to limit the resources allocated per queue.
> + * @tcp_ddp_sk_add:    add offload for the queue represented by the socket+config pair.
> + *                     this function is used to configure either copy, crc or both offloads.
> + * @tcp_ddp_sk_del:    remove offload from the socket, and release any device related resources.
> + * @tcp_ddp_setup:     request copy offload for buffers associated with a command_id in tcp_ddp_io.
> + * @tcp_ddp_teardown:  release offload resources association between buffers and command_id in
> + *                     tcp_ddp_io.
> + * @tcp_ddp_resync:    respond to the driver's resync_request. Called only if resync is successful.
> + */
> +struct tcp_ddp_dev_ops {
> +	int (*tcp_ddp_limits)(struct net_device *netdev,
> +			      struct tcp_ddp_limits *limits);
> +	int (*tcp_ddp_sk_add)(struct net_device *netdev,
> +			      struct sock *sk,
> +			      struct tcp_ddp_config *config);
> +	void (*tcp_ddp_sk_del)(struct net_device *netdev,
> +			       struct sock *sk);
> +	int (*tcp_ddp_setup)(struct net_device *netdev,
> +			     struct sock *sk,
> +			     struct tcp_ddp_io *io);
> +	int (*tcp_ddp_teardown)(struct net_device *netdev,
> +				struct sock *sk,
> +				struct tcp_ddp_io *io,
> +				void *ddp_ctx);
> +	void (*tcp_ddp_resync)(struct net_device *netdev,
> +			       struct sock *sk, u32 seq);
> +};
> +
> +#define TCP_DDP_RESYNC_REQ (1 << 0)
> +
> +/**
> + * struct tcp_ddp_ulp_ops - Interface to register upper layer Direct Data Placement (DDP) TCP offload
> + */
> +struct tcp_ddp_ulp_ops {
> +	/* NIC requests ulp to indicate if @seq is the start of a message */
> +	bool (*resync_request)(struct sock *sk, u32 seq, u32 flags);
> +	/* NIC driver informs the ulp that ddp teardown is done - used for async completions*/
> +	void (*ddp_teardown_done)(void *ddp_ctx);
> +};
> +
> +/**
> + * struct tcp_ddp_ctx - Generic tcp ddp context: device driver per queue contexts must
> + * use this as the first member.
> + */
> +struct tcp_ddp_ctx {
> +	enum tcp_ddp_type    type;
> +	unsigned char        buf[];

similar to my comment above, I did not see any uses of the buf element.
David Ahern Dec. 9, 2020, 1:11 a.m. UTC | #5
On 12/8/20 5:57 PM, David Ahern wrote:
>> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
>> index 7338b3865a2a..a08b85b53aa8 100644
>> --- a/include/net/inet_connection_sock.h
>> +++ b/include/net/inet_connection_sock.h
>> @@ -66,6 +66,8 @@ struct inet_connection_sock_af_ops {
>>   * @icsk_ulp_ops	   Pluggable ULP control hook
>>   * @icsk_ulp_data	   ULP private data
>>   * @icsk_clean_acked	   Clean acked data hook
>> + * @icsk_ulp_ddp_ops	   Pluggable ULP direct data placement control hook
>> + * @icsk_ulp_ddp_data	   ULP direct data placement private data
> 
> Neither of these socket layer intrusions are needed. All references but
> 1 -- the skbuff check -- are in the mlx5 driver. Any skb check that is
> needed can be handled with a different setting.

missed the nvme ops for the driver to call back to the socket owner.
Boris Pismenny Dec. 9, 2020, 8:15 a.m. UTC | #6
On 09/12/2020 2:38, David Ahern wrote:
> 
> The AF_XDP reference was to differentiate one zerocopy use case (all
> packets go to userspace) from another (kernel managed TCP socket with
> zerocopy payload). You are focusing on a very narrow use case - kernel
> based NVMe over TCP - of a more general problem.
> 

Please note that although our framework implements support for nvme-tcp,
we designed it to fit iscsi as well, and hopefully future protocols too,
keeping it as general as we could. See below for why this could not be
generalized further.

> You have a TCP socket and a design that only works for kernel owned
> sockets. You have specialized queues in the NIC, a flow rule directing
> packets to those queues. Presumably some ULP parser in the NIC
> associated with the queues to process NVMe packets. Rather than copying
> headers (ethernet/ip/tcp) to one buffer and payload to another (which is
> similar to what Jonathan Lemon is working on), this design has a ULP
> processor that just splits out the TCP payload even more making it
> highly selective about which part of the packet is put into which
> buffer. Take out the NVMe part, and it is header split with zerocopy for
> the payload - a generic feature that can have a wider impact with NVMe
> as a special case.
> 

There is more to this than TCP zerocopy that exists in userspace or
inside the kernel. First, please note that the patches include support for
CRC offload as well as data placement. Second, data-placement is not the same
as zerocopy for the following reasons:
(1) The former places buffers *exactly* where the user requests
regardless of the order of response arrivals, while the latter places packets
in anonymous buffers according to packet arrival order. Therefore, zerocopy
can be implemented using data placement, but not vice versa.
(2) Data-placement supports sub-page zerocopy, unlike page-flipping
techniques (i.e., TCP_ZEROCOPY).
(3) Page-flipping can't work for any storage initiator because the
destination buffer is owned by some user pagecache or process using O_DIRECT.
(4) Storage over TCP PDUs are not necessarily aligned to TCP packets,
i.e., the PDU header can be in the middle of a packet, so header-data split
alone isn't enough.

I wish we could do the same using some simpler zerocopy mechanism;
it would indeed simplify things. But, unfortunately, this would severely
restrict generality (no sub-page support, required alignment between PDUs
and packets) and performance (ordering of PDUs).
Boris Pismenny Dec. 9, 2020, 8:25 a.m. UTC | #7
On 09/12/2020 2:57, David Ahern wrote:
> On 12/7/20 2:06 PM, Boris Pismenny wrote:
>> diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
>> index 934de56644e7..fb35dcac03d2 100644
>> --- a/include/linux/netdev_features.h
>> +++ b/include/linux/netdev_features.h
>> @@ -84,6 +84,7 @@ enum {
>>  	NETIF_F_GRO_FRAGLIST_BIT,	/* Fraglist GRO */
>>  
>>  	NETIF_F_HW_MACSEC_BIT,		/* Offload MACsec operations */
>> +	NETIF_F_HW_TCP_DDP_BIT,		/* TCP direct data placement offload */
>>  
>>  	/*
>>  	 * Add your fresh new feature above and remember to update
>> @@ -157,6 +158,7 @@ enum {
>>  #define NETIF_F_GRO_FRAGLIST	__NETIF_F(GRO_FRAGLIST)
>>  #define NETIF_F_GSO_FRAGLIST	__NETIF_F(GSO_FRAGLIST)
>>  #define NETIF_F_HW_MACSEC	__NETIF_F(HW_MACSEC)
>> +#define NETIF_F_HW_TCP_DDP	__NETIF_F(HW_TCP_DDP)
> 
> All of the DDP naming seems wrong to me. I realize the specific use case
> is targeted payloads of a ULP, but it is still S/W handing H/W specific
> buffers for a payload of a flow.
> 
> 

This is intended to be used strictly by ULPs. DDP is how such things were
called in the past. It is more than zerocopy, as explained before, so naming
it zerocopy would be misleading at best. I can propose another name. How about
"Autonomous copy offload (ACO)" and "Autonomous crc offload (ACRC)"?
These would indicate that the offload is independent of other offloads, i.e.
autonomous, while also informing users of the functionality (copy/crc). Other
names are welcome too.


>>   * @icsk_listen_portaddr_node	hash to the portaddr listener hashtable
>>   * @icsk_ca_state:	   Congestion control state
>>   * @icsk_retransmits:	   Number of unrecovered [RTO] timeouts
>> @@ -94,6 +96,8 @@ struct inet_connection_sock {
>>  	const struct tcp_ulp_ops  *icsk_ulp_ops;
>>  	void __rcu		  *icsk_ulp_data;
>>  	void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);
>> +	const struct tcp_ddp_ulp_ops  *icsk_ulp_ddp_ops;
>> +	void __rcu		  *icsk_ulp_ddp_data;
>>  	struct hlist_node         icsk_listen_portaddr_node;
>>  	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
>>  	__u8			  icsk_ca_state:5,
>> diff --git a/include/net/tcp_ddp.h b/include/net/tcp_ddp.h
>> new file mode 100644
>> index 000000000000..df3264be4600
>> --- /dev/null
>> +++ b/include/net/tcp_ddp.h
>> @@ -0,0 +1,129 @@
>> +/* SPDX-License-Identifier: GPL-2.0
>> + *
>> + * tcp_ddp.h
>> + *	Author:	Boris Pismenny <borisp@mellanox.com>
>> + *	Copyright (C) 2020 Mellanox Technologies.
>> + */
>> +#ifndef _TCP_DDP_H
>> +#define _TCP_DDP_H
>> +
>> +#include <linux/netdevice.h>
>> +#include <net/inet_connection_sock.h>
>> +#include <net/sock.h>
>> +
>> +/* limits returned by the offload driver, zero means don't care */
>> +struct tcp_ddp_limits {
>> +	int	 max_ddp_sgl_len;
>> +};
>> +
>> +enum tcp_ddp_type {
>> +	TCP_DDP_NVME = 1,
>> +};
>> +
>> +/**
>> + * struct tcp_ddp_config - Generic tcp ddp configuration: tcp ddp IO queue
>> + * config implementations must use this as the first member.
>> + * Add new instances of tcp_ddp_config below (nvme-tcp, etc.).
>> + */
>> +struct tcp_ddp_config {
>> +	enum tcp_ddp_type    type;
>> +	unsigned char        buf[];
> 
> you have this variable length buf, but it is not used (as far as I can
> tell). But then ...
> 
> 

True. This buf[] is here to indicate that users are expected to extend it with
the ULP-specific data, as in nvme_tcp_ddp_config. We can remove it and leave a
comment if you prefer that.

>> +};
>> +
>> +/**
>> + * struct nvme_tcp_ddp_config - nvme tcp ddp configuration for an IO queue
>> + *
>> + * @pfv:        pdu version (e.g., NVME_TCP_PFV_1_0)
>> + * @cpda:       controller pdu data alignment (dwords, 0's based)
>> + * @dgst:       digest types enabled.
>> + *              The netdev will offload crc if ddp_crc is supported.
>> + * @queue_size: number of nvme-tcp IO queue elements
>> + * @queue_id:   queue identifier
>> + * @cpu_io:     cpu core running the IO thread for this queue
>> + */
>> +struct nvme_tcp_ddp_config {
>> +	struct tcp_ddp_config   cfg;
> 
> ... how would you use it within another struct like this?
> 

You don't.
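
To clarify the intended pattern (an illustrative sketch, not code from
the patchset): the ULP embeds the generic struct as its first member and
passes a pointer to it, and the driver recovers the ULP-specific
configuration after checking the type, e.g. with container_of:

struct nvme_tcp_ddp_config nvme_config = {
        .cfg.type = TCP_DDP_NVME,
        .pfv      = NVME_TCP_PFV_1_0,
        /* ... */
};
/* ULP side: netdev->tcp_ddp_ops->tcp_ddp_sk_add(netdev, sk, &nvme_config.cfg); */

/* driver side */
static struct nvme_tcp_ddp_config *to_nvme_config(struct tcp_ddp_config *cfg)
{
        return container_of(cfg, struct nvme_tcp_ddp_config, cfg);
}

The buf[] member itself is never dereferenced; it only documents that
ULP-specific fields follow the generic header.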

>> +
>> +	u16			pfv;
>> +	u8			cpda;
>> +	u8			dgst;
>> +	int			queue_size;
>> +	int			queue_id;
>> +	int			io_cpu;
>> +};
>> +
>> +/**
>> + * struct tcp_ddp_io - tcp ddp configuration for an IO request.
>> + *
>> + * @command_id:  identifier on the wire associated with these buffers
>> + * @nents:       number of entries in the sg_table
>> + * @sg_table:    describing the buffers for this IO request
>> + * @first_sgl:   first SGL in sg_table
>> + */
>> +struct tcp_ddp_io {
>> +	u32			command_id;
>> +	int			nents;
>> +	struct sg_table		sg_table;
>> +	struct scatterlist	first_sgl[SG_CHUNK_SIZE];
>> +};
>> +
>> +/* struct tcp_ddp_dev_ops - operations used by an upper layer protocol to configure ddp offload
>> + *
>> + * @tcp_ddp_limits:    limit the number of scatter gather entries per IO.
>> + *                     the device driver can use this to limit the resources allocated per queue.
>> + * @tcp_ddp_sk_add:    add offload for the queue represented by the socket+config pair.
>> + *                     this function is used to configure either copy, crc or both offloads.
>> + * @tcp_ddp_sk_del:    remove offload from the socket, and release any device related resources.
>> + * @tcp_ddp_setup:     request copy offload for buffers associated with a command_id in tcp_ddp_io.
>> + * @tcp_ddp_teardown:  release offload resources association between buffers and command_id in
>> + *                     tcp_ddp_io.
>> + * @tcp_ddp_resync:    respond to the driver's resync_request. Called only if resync is successful.
>> + */
>> +struct tcp_ddp_dev_ops {
>> +	int (*tcp_ddp_limits)(struct net_device *netdev,
>> +			      struct tcp_ddp_limits *limits);
>> +	int (*tcp_ddp_sk_add)(struct net_device *netdev,
>> +			      struct sock *sk,
>> +			      struct tcp_ddp_config *config);
>> +	void (*tcp_ddp_sk_del)(struct net_device *netdev,
>> +			       struct sock *sk);
>> +	int (*tcp_ddp_setup)(struct net_device *netdev,
>> +			     struct sock *sk,
>> +			     struct tcp_ddp_io *io);
>> +	int (*tcp_ddp_teardown)(struct net_device *netdev,
>> +				struct sock *sk,
>> +				struct tcp_ddp_io *io,
>> +				void *ddp_ctx);
>> +	void (*tcp_ddp_resync)(struct net_device *netdev,
>> +			       struct sock *sk, u32 seq);
>> +};
>> +
>> +#define TCP_DDP_RESYNC_REQ (1 << 0)
>> +
>> +/**
>> + * struct tcp_ddp_ulp_ops - Interface to register upper layer Direct Data Placement (DDP) TCP offload
>> + */
>> +struct tcp_ddp_ulp_ops {
>> +	/* NIC requests ulp to indicate if @seq is the start of a message */
>> +	bool (*resync_request)(struct sock *sk, u32 seq, u32 flags);
>> +	/* NIC driver informs the ulp that ddp teardown is done - used for async completions*/
>> +	void (*ddp_teardown_done)(void *ddp_ctx);
>> +};
>> +
>> +/**
>> + * struct tcp_ddp_ctx - Generic tcp ddp context: device driver per queue contexts must
>> + * use this as the first member.
>> + */
>> +struct tcp_ddp_ctx {
>> +	enum tcp_ddp_type    type;
>> +	unsigned char        buf[];
> 
> similar to my comment above, I did not see any uses of the buf element.
> 

Same idea, we will remove it and leave a comment.
Boris Pismenny Dec. 9, 2020, 8:28 a.m. UTC | #8
On 09/12/2020 3:11, David Ahern wrote:
> On 12/8/20 5:57 PM, David Ahern wrote:
>>> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
>>> index 7338b3865a2a..a08b85b53aa8 100644
>>> --- a/include/net/inet_connection_sock.h
>>> +++ b/include/net/inet_connection_sock.h
>>> @@ -66,6 +66,8 @@ struct inet_connection_sock_af_ops {
>>>   * @icsk_ulp_ops	   Pluggable ULP control hook
>>>   * @icsk_ulp_data	   ULP private data
>>>   * @icsk_clean_acked	   Clean acked data hook
>>> + * @icsk_ulp_ddp_ops	   Pluggable ULP direct data placement control hook
>>> + * @icsk_ulp_ddp_data	   ULP direct data placement private data
>>
>> Neither of these socket layer intrusions are needed. All references but
>> 1 -- the skbuff check -- are in the mlx5 driver. Any skb check that is
>> needed can be handled with a different setting.
> 
> missed the nvme ops for the driver to callback to the socket owner.
> 

Hopefully it is clear that these are needed, and indeed we use them in
both the driver and nvme-tcp layers.
David Ahern Dec. 10, 2020, 4:26 a.m. UTC | #9
On 12/9/20 1:15 AM, Boris Pismenny wrote:
> On 09/12/2020 2:38, David Ahern wrote:
>>
>> The AF_XDP reference was to differentiate one zerocopy use case (all
>> packets go to userspace) from another (kernel managed TCP socket with
>> zerocopy payload). You are focusing on a very narrow use case - kernel
>> based NVMe over TCP - of a more general problem.
>>
> 
> Please note that although our framework implements support for nvme-tcp,
> we designed it to fit iscsi as well, and hopefully future protocols too,
> as general as we could. For why this could not be generalized further
> see below.
> 
>> You have a TCP socket and a design that only works for kernel owned
>> sockets. You have specialized queues in the NIC, a flow rule directing
>> packets to those queues. Presumably some ULP parser in the NIC
>> associated with the queues to process NVMe packets. Rather than copying
>> headers (ethernet/ip/tcp) to one buffer and payload to another (which is
>> similar to what Jonathan Lemon is working on), this design has a ULP
>> processor that just splits out the TCP payload even more making it
>> highly selective about which part of the packet is put into which
>> buffer. Take out the NVMe part, and it is header split with zerocopy for
>> the payload - a generic feature that can have a wider impact with NVMe
>> as a special case.
>>
> 
> There is more to this than TCP zerocopy that exists in userspace or
> inside the kernel. First, please note that the patches include support for
> CRC offload as well as data placement. Second, data-placement is not the same

Yes, the CRC offload is different, but I think it is orthogonal to the
'where does h/w put the data' problem.

> as zerocopy for the following reasons:
> (1) The former places buffers *exactly* where the user requests
> regardless of the order of response arrivals, while the latter places packets
> in anonymous buffers according to packet arrival order. Therefore, zerocopy
> can be implemented using data placement, but not vice versa.

Fundamentally, it is an SGL and a TCP sequence number. There is a
starting point where seq N == sgl element 0, position 0. Presumably
there is a hardware cursor to track where you are in filling the SGL as
packets are processed. You abort on OOO, so it seems like a fairly
straightforward problem.
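
Purely as an illustration of the cursor I have in mind (my supposition,
not an interface anywhere in this series):

/* per-flow placement state the NIC has to track */
struct ddp_hw_cursor {
        u32  expected_seq;      /* next TCP sequence number to place */
        u16  sgl_index;         /* current element of the posted SGL */
        u32  sgl_offset;        /* bytes already written in that element */
        bool aborted;           /* set on OOO; fall back to normal receive */
};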

> (2) Data-placement supports sub-page zerocopy, unlike page-flipping
> techniques (i.e., TCP_ZEROCOPY).

I am not pushing for or suggesting any page-flipping. I understand the
limitations of that approach.

> (3) Page-flipping can't work for any storage initiator because the
> destination buffer is owned by some user pagecache or process using O_DIRECT.
> (4) Storage over TCP PDUs are not necessarily aligned to TCP packets,
> i.e., the PDU header can be in the middle of a packet, so header-data split
> alone isn't enough.

yes, TCP is a byte stream and you have to have a cursor marking the last
written spot in the SGL. More below.

> 
> I wish we could do the same using some simpler zerocopy mechanism,
> it would indeed simplify things. But, unfortunately this would severely
> restrict generality, no sub-page support and alignment between PDUs
> and packets, and performance (ordering of PDUs).
> 

My biggest concern is that you are adding checks in the fast path for a
very specific use case. If / when Rx zerocopy happens (and I suspect it
has to happen soon to handle the ever increasing speeds), nothing about
this patch set is reusable and worse more checks are needed in the fast
path. I think it is best if you make this more generic — at least
anything touching core code.

For example, you have an iov static key hook managed by a driver for
generic code. There are a few ways around that. One is by adding skb
details to the nvme code — ie., walking the skb fragments, seeing that a
given frag is in your allocated memory and skipping the copy. This would
offer best performance since it skips all unnecessary checks. Another
option is to export __skb_datagram_iter, use it and define your own copy
handler that does the address compare and skips the copy. Key point -
only your code path is affected.
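
A rough sketch of that second option (__skb_datagram_iter is currently
static in net/core/datagram.c, so it would need exporting;
ddp_frag_already_placed() is a made-up name for whatever check the nvme
code would use):

static size_t ddp_copy_to_iter(const void *addr, size_t bytes,
                               void *data, struct iov_iter *i)
{
        /* 'data' carries the nvme-tcp queue / request context */
        if (ddp_frag_already_placed(data, addr, bytes)) {
                iov_iter_advance(i, bytes);     /* bytes already in place */
                return bytes;
        }
        return copy_to_iter(addr, bytes, i);
}

/* nvme-tcp receive path, instead of skb_copy_datagram_iter():
 *      __skb_datagram_iter(skb, offset, &req->iter, len, false,
 *                          ddp_copy_to_iter, queue);
 */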

Similarly for the NVMe SGLs and DDP offload - a more generic solution
allows other use cases to build on this as opposed to the checks you
want for a special case. For example, a split at the protocol headers /
payload boundaries would be a generic solution where kernel managed
protocols get data in one buffer and socket data is put into a given
SGL. I am guessing that you have to be already doing this to put PDU
payloads into an SGL and other headers into other memory to make a
complete packet, so this is not too far off from what you are already doing.

Let me walk through an example with assumptions about your hardware's
capabilities, and you correct me where I am wrong. Assume you have a
'full' command response of this form:

 +------------- ... ----------------+---------+---------+--------+-----+
 |          big data segment        | PDU hdr | TCP hdr | IP hdr | eth |
 +------------- ... ----------------+---------+---------+--------+-----+

but it shows up to the host in 3 packets like this (ideal case):

 +-------------------------+---------+---------+--------+-----+
 |       data - seg 1      | PDU hdr | TCP hdr | IP hdr | eth |
 +-------------------------+---------+---------+--------+-----+
 +-----------------------------------+---------+--------+-----+
 |       data - seg 2                | TCP hdr | IP hdr | eth |
 +-----------------------------------+---------+--------+-----+
                   +-----------------+---------+--------+-----+
                   | payload - seg 3 | TCP hdr | IP hdr | eth |
                   +-----------------+---------+--------+-----+


The hardware splits the eth/IP/tcp headers from payload like this
(again, your hardware has to know these boundaries to accomplish what
you want):

 +-------------------------+---------+     +---------+--------+-----+
 |       data - seg 1      | PDU hdr |     | TCP hdr | IP hdr | eth |
 +-------------------------+---------+     +---------+--------+-----+

 +-----------------------------------+     +---------+--------+-----+
 |       data - seg 2                |     | TCP hdr | IP hdr | eth |
 +-----------------------------------+     +---------+--------+-----+

                   +-----------------+     +---------+--------+-----+
                   | payload - seg 3 |     | TCP hdr | IP hdr | eth |
                   +-----------------+     +---------+--------+-----+

Left side goes into the SGLs posted for this socket / flow; the right
side goes into some other memory resource made available for headers.
This is very close to what you are doing now - with the exception of the
PDU header being put to the right side. NVMe code then just needs to set
the iov offset (or adjust the base_addr) to skip over the PDU header -
standard options for an iov.

Yes, TCP is a byte stream, so the packets could very well show up like this:

 +--------------+---------+-----------+---------+--------+-----+
 | data - seg 1 | PDU hdr | prev data | TCP hdr | IP hdr | eth |
 +--------------+---------+-----------+---------+--------+-----+
 +-----------------------------------+---------+--------+-----+
 |     payload - seg 2               | TCP hdr | IP hdr | eth |
 +-----------------------------------+---------+--------+-----+
 +-------- +-------------------------+---------+--------+-----+
 | PDU hdr |    payload - seg 3      | TCP hdr | IP hdr | eth |
 +---------+-------------------------+---------+--------+-----+

If your hardware can extract the NVMe payload into a targeted SGL like
you want in this set, then it has some logic for parsing headers and
"snapping" an SGL to a new element. ie., it already knows 'prev data'
goes with the in-progress PDU, sees more data, recognizes a new PDU
header and a new payload. That means it already has to handle a
'snap-to-PDU' style argument where the end of the payload closes out an
SGL element and the next PDU hdr starts in a new SGL element (ie., 'prev
data' closes out sgl[i], and the next PDU hdr starts sgl[i+1]). So in
this case, you want 'snap-to-PDU' but that could just as easily be 'no
snap at all', just a byte stream and filling an SGL after the protocol
headers.

Key point here is that this is the start of a generic header / data
split that could work for other applications - not just NVMe. eth/IP/TCP
headers are consumed by the Linux networking stack; data is in
application owned, socket based SGLs to avoid copies.

###

A dump of other comments about this patch set:
- there are a LOT of unnecessary typecasts around tcp_ddp_ctx that can
be avoided by using container_of (see the sketch after this list).

- you have an accessor tcp_ddp_get_ctx but no setter; all uses of
tcp_ddp_get_ctx are within mlx5. why open code the set but use the
accessor for the get? Worse, mlx5e_nvmeotcp_queue_teardown actually has
both — uses the accessor and open codes setting icsk_ulp_ddp_data.

- the driver is storing private data on the socket. Nothing about the
socket layer cares and the mlx5 driver is already tracking that data in
priv->nvmeotcp->queue_hash. As I mentioned in a previous response, I
understand the socket ops are needed for the driver level to call into
the socket layer, but the data part does not seem to be needed.

- nvme_tcp_offload_socket and nvme_tcp_offload_limits both return int
yet the value is ignored

- the build robot found a number of problems (it pulls my github tree
and I pushed this set to it to move across computers).
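
For the container_of point above, something like this (sketch only; the
embedding mirrors what the mlx5 patches appear to do, and the helper
name is mine):

struct mlx5e_nvmeotcp_queue {
        struct tcp_ddp_ctx ddp_ctx;
        /* ... driver-private fields ... */
};

static inline struct mlx5e_nvmeotcp_queue *
nvmeotcp_queue_from_ddp_ctx(struct tcp_ddp_ctx *ctx)
{
        return container_of(ctx, struct mlx5e_nvmeotcp_queue, ddp_ctx);
}

Unlike the open-coded casts, container_of keeps working even if ddp_ctx
is later moved away from the first position.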

I think the patch set would be easier to follow if you restructured the
patches to 1 thing only per patch -- e.g., split patch 2 into netdev
bits and socket bits. Add the netdev feature bit and operations in 1
patch and add the socket ops in a second patch with better commit logs
about why each is needed and what is done.
Jakub Kicinski Dec. 11, 2020, 2:01 a.m. UTC | #10
On Wed, 9 Dec 2020 21:26:05 -0700 David Ahern wrote:
> Yes, TCP is a byte stream, so the packets could very well show up like this:
> 
>  +--------------+---------+-----------+---------+--------+-----+
>  | data - seg 1 | PDU hdr | prev data | TCP hdr | IP hdr | eth |
>  +--------------+---------+-----------+---------+--------+-----+
>  +-----------------------------------+---------+--------+-----+
>  |     payload - seg 2               | TCP hdr | IP hdr | eth |
>  +-----------------------------------+---------+--------+-----+
>  +-------- +-------------------------+---------+--------+-----+
>  | PDU hdr |    payload - seg 3      | TCP hdr | IP hdr | eth |
>  +---------+-------------------------+---------+--------+-----+
> 
> If your hardware can extract the NVMe payload into a targeted SGL like
> you want in this set, then it has some logic for parsing headers and
> "snapping" an SGL to a new element. ie., it already knows 'prev data'
> goes with the in-progress PDU, sees more data, recognizes a new PDU
> header and a new payload. That means it already has to handle a
> 'snap-to-PDU' style argument where the end of the payload closes out an
> SGL element and the next PDU hdr starts in a new SGL element (ie., 'prev
> data' closes out sgl[i], and the next PDU hdr starts sgl[i+1]). So in
> this case, you want 'snap-to-PDU' but that could just as easily be 'no
> snap at all', just a byte stream and filling an SGL after the protocol
> headers.

This 'snap-to-PDU' requirement is something that I don't understand
with the current TCP zero copy. In case of, say, a storage application
which wants to send some headers (whatever RPC info, block number,
etc.) and then a 4k block of data - how does the RX side get just the
4k block into a page so it can zero copy it out to its storage device?

Per-connection state in the NIC, and FW parsing headers is one way,
but I wonder how this record split problem is best resolved generically.
Perhaps by passing hints in the headers somehow?

Sorry for the slight off-topic :)
David Ahern Dec. 11, 2020, 2:43 a.m. UTC | #11
On 12/10/20 7:01 PM, Jakub Kicinski wrote:
> On Wed, 9 Dec 2020 21:26:05 -0700 David Ahern wrote:
>> Yes, TCP is a byte stream, so the packets could very well show up like this:
>>
>>  +--------------+---------+-----------+---------+--------+-----+
>>  | data - seg 1 | PDU hdr | prev data | TCP hdr | IP hdr | eth |
>>  +--------------+---------+-----------+---------+--------+-----+
>>  +-----------------------------------+---------+--------+-----+
>>  |     payload - seg 2               | TCP hdr | IP hdr | eth |
>>  +-----------------------------------+---------+--------+-----+
>>  +-------- +-------------------------+---------+--------+-----+
>>  | PDU hdr |    payload - seg 3      | TCP hdr | IP hdr | eth |
>>  +---------+-------------------------+---------+--------+-----+
>>
>> If your hardware can extract the NVMe payload into a targeted SGL like
>> you want in this set, then it has some logic for parsing headers and
>> "snapping" an SGL to a new element. ie., it already knows 'prev data'
>> goes with the in-progress PDU, sees more data, recognizes a new PDU
>> header and a new payload. That means it already has to handle a
>> 'snap-to-PDU' style argument where the end of the payload closes out an
>> SGL element and the next PDU hdr starts in a new SGL element (ie., 'prev
>> data' closes out sgl[i], and the next PDU hdr starts sgl[i+1]). So in
>> this case, you want 'snap-to-PDU' but that could just as easily be 'no
>> snap at all', just a byte stream and filling an SGL after the protocol
>> headers.
> 
> This 'snap-to-PDU' requirement is something that I don't understand
> with the current TCP zero copy. In case of, say, a storage application

current TCP zero-copy does not handle this and it can't AFAIK. I believe
it requires hardware level support where an Rx queue is dedicated to a
flow / socket and some degree of header and payload splitting (header is
consumed by the kernel stack and payload goes to socket owner's memory).

> which wants to send some headers (whatever RPC info, block number,
> etc.) and then a 4k block of data - how does the RX side get just the
> 4k block into a page so it can zero copy it out to its storage device?
> 
> Per-connection state in the NIC, and FW parsing headers is one way,
> but I wonder how this record split problem is best resolved generically.
> Perhaps by passing hints in the headers somehow?
> 
> Sorry for the slight off-topic :)
> 
Hardware has to be parsing the incoming packets to find the usual
ethernet/IP/TCP headers and TCP payload offset. Then the hardware has to
have some kind of ULP processor to know how to parse the TCP byte stream
at least well enough to find the PDU header and interpret it to get pdu
header length and payload length.

At that point you push the protocol headers (eth/ip/tcp) into one buffer
for the kernel stack protocols and put the payload into another. The
former would be some page owned by the OS and the latter owned by the
process / socket (generically, in this case it is a kernel level
socket). In addition, since the payload is spread across multiple
packets the hardware has to keep track of TCP sequence number and its
current place in the SGL where it is writing the payload to keep the
bytes contiguous and detect out-of-order.

If the ULP processor knows about PDU headers it knows when enough
payload has been found to satisfy that PDU in which case it can tell the
cursor to move on to the next SGL element (or separate SGL). That's what
I meant by 'snap-to-PDU'.

Alternatively, if it is any random application with a byte stream not
understood by hardware, the cursor just keeps moving along the SGL
elements assigned it for this particular flow.

If you have a socket whose payload is getting offloaded to its own queue
(which this set is effectively doing), you can create the queue with
some attribute that says 'NVMe ULP', 'iscsi ULP', 'just a byte stream'
that controls the parsing when you stop writing to one SGL element and
move on to the next. Again, assuming hardware support for such attributes.
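
Something like this hypothetical queue attribute, purely to illustrate:

enum rxq_payload_parser {
        RXQ_PARSER_NONE,        /* plain byte stream, no snapping */
        RXQ_PARSER_NVME_TCP,    /* snap SGL elements at PDU boundaries */
        RXQ_PARSER_ISCSI,
};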

I don't work for Nvidia, so this is all supposition based on what the
patches are doing.
Jakub Kicinski Dec. 11, 2020, 6:45 p.m. UTC | #12
On Thu, 10 Dec 2020 19:43:57 -0700 David Ahern wrote:
> On 12/10/20 7:01 PM, Jakub Kicinski wrote:
> > On Wed, 9 Dec 2020 21:26:05 -0700 David Ahern wrote:  
> >> Yes, TCP is a byte stream, so the packets could very well show up like this:
> >>
> >>  +--------------+---------+-----------+---------+--------+-----+
> >>  | data - seg 1 | PDU hdr | prev data | TCP hdr | IP hdr | eth |
> >>  +--------------+---------+-----------+---------+--------+-----+
> >>  +-----------------------------------+---------+--------+-----+
> >>  |     payload - seg 2               | TCP hdr | IP hdr | eth |
> >>  +-----------------------------------+---------+--------+-----+
> >>  +-------- +-------------------------+---------+--------+-----+
> >>  | PDU hdr |    payload - seg 3      | TCP hdr | IP hdr | eth |
> >>  +---------+-------------------------+---------+--------+-----+
> >>
> >> If your hardware can extract the NVMe payload into a targeted SGL like
> >> you want in this set, then it has some logic for parsing headers and
> >> "snapping" an SGL to a new element. ie., it already knows 'prev data'
> >> goes with the in-progress PDU, sees more data, recognizes a new PDU
> >> header and a new payload. That means it already has to handle a
> >> 'snap-to-PDU' style argument where the end of the payload closes out an
> >> SGL element and the next PDU hdr starts in a new SGL element (ie., 'prev
> >> data' closes out sgl[i], and the next PDU hdr starts sgl[i+1]). So in
> >> this case, you want 'snap-to-PDU' but that could just as easily be 'no
> >> snap at all', just a byte stream and filling an SGL after the protocol
> >> headers.  
> > 
> > This 'snap-to-PDU' requirement is something that I don't understand
> > with the current TCP zero copy. In case of, say, a storage application  
> 
> current TCP zero-copy does not handle this and it can't AFAIK. I believe
> it requires hardware level support where an Rx queue is dedicated to a
> flow / socket and some degree of header and payload splitting (header is
> consumed by the kernel stack and payload goes to socket owner's memory).

Yet, Google claims to use the RX ZC in production, and with CX3 Pro /
mlx4 NICs.

A simple workaround that comes to mind is to have the headers and
payloads on separate TCP streams. That doesn't seem too slick... but
neither is the 4k MSS, so maybe that's what Google does?

> > which wants to send some headers (whatever RPC info, block number,
> > etc.) and then a 4k block of data - how does the RX side get just the
> > > 4k block into a page so it can zero copy it out to its storage device?
> > 
> > Per-connection state in the NIC, and FW parsing headers is one way,
> > but I wonder how this record split problem is best resolved generically.
> > Perhaps by passing hints in the headers somehow?
> > 
> > Sorry for the slight off-topic :)
> >   
> Hardware has to be parsing the incoming packets to find the usual
> ethernet/IP/TCP headers and TCP payload offset. Then the hardware has to
> have some kind of ULP processor to know how to parse the TCP byte stream
> at least well enough to find the PDU header and interpret it to get pdu
> header length and payload length.

The big difference between normal headers and L7 headers is that one is
at ~constant offset, self-contained, and always complete (PDU header
can be split across segments).

Edwin Peer did an implementation of TLS ULP for the NFP, it was
complex. Not to mention it's L7 protocol ossification.

To put it bluntly maybe it's fine for smaller shops but I'm guessing
it's going to be a hard sell to hyperscalers and people who don't like
to be locked in to HW.

> At that point you push the protocol headers (eth/ip/tcp) into one buffer
> for the kernel stack protocols and put the payload into another. The
> former would be some page owned by the OS and the latter owned by the
> process / socket (generically, in this case it is a kernel level
> socket). In addition, since the payload is spread across multiple
> packets the hardware has to keep track of TCP sequence number and its
> current place in the SGL where it is writing the payload to keep the
> bytes contiguous and detect out-of-order.
> 
> If the ULP processor knows about PDU headers it knows when enough
> payload has been found to satisfy that PDU in which case it can tell the
> cursor to move on to the next SGL element (or separate SGL). That's what
> I meant by 'snap-to-PDU'.
> 
> Alternatively, if it is any random application with a byte stream not
> understood by hardware, the cursor just keeps moving along the SGL
> elements assigned it for this particular flow.
> 
> If you have a socket whose payload is getting offloaded to its own queue
> (which this set is effectively doing), you can create the queue with
> some attribute that says 'NVMe ULP', 'iscsi ULP', 'just a byte stream'
> that controls the parsing when you stop writing to one SGL element and
> move on to the next. Again, assuming hardware support for such attributes.
> 
> I don't work for Nvidia, so this is all supposition based on what the
> patches are doing.

Ack, these patches are not exciting (to me), so I'm wondering if there
is a better way. The only reason NIC would have to understand a ULP for
ZC is to parse out header/message lengths. There's gotta be a way to
pass those in header options or such...

And, you know, if we figure something out - maybe we stand a chance
against having 4 different zero copy implementations (this, TCP,
AF_XDP, netgpu) :(
Eric Dumazet Dec. 11, 2020, 6:58 p.m. UTC | #13
On Fri, Dec 11, 2020 at 7:45 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 10 Dec 2020 19:43:57 -0700 David Ahern wrote:
> > On 12/10/20 7:01 PM, Jakub Kicinski wrote:
> > > On Wed, 9 Dec 2020 21:26:05 -0700 David Ahern wrote:
> > >> Yes, TCP is a byte stream, so the packets could very well show up like this:
> > >>
> > >>  +--------------+---------+-----------+---------+--------+-----+
> > >>  | data - seg 1 | PDU hdr | prev data | TCP hdr | IP hdr | eth |
> > >>  +--------------+---------+-----------+---------+--------+-----+
> > >>  +-----------------------------------+---------+--------+-----+
> > >>  |     payload - seg 2               | TCP hdr | IP hdr | eth |
> > >>  +-----------------------------------+---------+--------+-----+
> > >>  +-------- +-------------------------+---------+--------+-----+
> > >>  | PDU hdr |    payload - seg 3      | TCP hdr | IP hdr | eth |
> > >>  +---------+-------------------------+---------+--------+-----+
> > >>
> > >> If your hardware can extract the NVMe payload into a targeted SGL like
> > >> you want in this set, then it has some logic for parsing headers and
> > >> "snapping" an SGL to a new element. ie., it already knows 'prev data'
> > >> goes with the in-progress PDU, sees more data, recognizes a new PDU
> > >> header and a new payload. That means it already has to handle a
> > >> 'snap-to-PDU' style argument where the end of the payload closes out an
> > >> SGL element and the next PDU hdr starts in a new SGL element (ie., 'prev
> > >> data' closes out sgl[i], and the next PDU hdr starts sgl[i+1]). So in
> > >> this case, you want 'snap-to-PDU' but that could just as easily be 'no
> > >> snap at all', just a byte stream and filling an SGL after the protocol
> > >> headers.
> > >
> > > This 'snap-to-PDU' requirement is something that I don't understand
> > > with the current TCP zero copy. In case of, say, a storage application
> >
> > current TCP zero-copy does not handle this and it can't AFAIK. I believe
> > it requires hardware level support where an Rx queue is dedicated to a
> > flow / socket and some degree of header and payload splitting (header is
> > consumed by the kernel stack and payload goes to socket owner's memory).
>
> Yet, Google claims to use the RX ZC in production, and with CX3 Pro /
> mlx4 NICs.

And other proprietary NIC

We do not dedicate RX queues; TCP RX zerocopy is not direct placement
into application space.

These slides (from netdev 0x14) might be helpful

https://netdevconf.info/0x14/pub/slides/62/Implementing%20TCP%20RX%20zero%20copy.pdf


>
> A simple workaround that comes to mind is to have the headers and
> payloads on separate TCP streams. That doesn't seem too slick... but
> neither is the 4k MSS, so maybe that's what Google does?
>
> > > which wants to send some headers (whatever RPC info, block number,
> > > etc.) and then a 4k block of data - how does the RX side get just the
> > > 4k block into a page so it can zero copy it out to its storage device?
> > >
> > > Per-connection state in the NIC, and FW parsing headers is one way,
> > > but I wonder how this record split problem is best resolved generically.
> > > Perhaps by passing hints in the headers somehow?
> > >
> > > Sorry for the slight off-topic :)
> > >
> > Hardware has to be parsing the incoming packets to find the usual
> > ethernet/IP/TCP headers and TCP payload offset. Then the hardware has to
> > have some kind of ULP processor to know how to parse the TCP byte stream
> > at least well enough to find the PDU header and interpret it to get pdu
> > header length and payload length.
>
> The big difference between normal headers and L7 headers is that one is
> at ~constant offset, self-contained, and always complete (PDU header
> can be split across segments).
>
> Edwin Peer did an implementation of TLS ULP for the NFP, it was
> complex. Not to mention it's L7 protocol ossification.
>
> To put it bluntly maybe it's fine for smaller shops but I'm guessing
> it's going to be a hard sell to hyperscalers and people who don't like
> to be locked in to HW.
>
> > At that point you push the protocol headers (eth/ip/tcp) into one buffer
> > for the kernel stack protocols and put the payload into another. The
> > former would be some page owned by the OS and the latter owned by the
> > process / socket (generically, in this case it is a kernel level
> > socket). In addition, since the payload is spread across multiple
> > packets the hardware has to keep track of TCP sequence number and its
> > current place in the SGL where it is writing the payload to keep the
> > bytes contiguous and detect out-of-order.
> >
> > If the ULP processor knows about PDU headers it knows when enough
> > payload has been found to satisfy that PDU in which case it can tell the
> > cursor to move on to the next SGL element (or separate SGL). That's what
> > I meant by 'snap-to-PDU'.
> >
> > Alternatively, if it is any random application with a byte stream not
> > understood by hardware, the cursor just keeps moving along the SGL
> > elements assigned it for this particular flow.
> >
> > If you have a socket whose payload is getting offloaded to its own queue
> > (which this set is effectively doing), you can create the queue with
> > some attribute that says 'NVMe ULP', 'iscsi ULP', 'just a byte stream'
> > that controls the parsing when you stop writing to one SGL element and
> > move on to the next. Again, assuming hardware support for such attributes.
> >
> > I don't work for Nvidia, so this is all supposition based on what the
> > patches are doing.
>
> Ack, these patches are not exciting (to me), so I'm wondering if there
> is a better way. The only reason NIC would have to understand a ULP for
> ZC is to parse out header/message lengths. There's gotta be a way to
> pass those in header options or such...
>
> And, you know, if we figure something out - maybe we stand a chance
> against having 4 different zero copy implementations (this, TCP,
> AF_XDP, netgpu) :(
David Ahern Dec. 11, 2020, 7:59 p.m. UTC | #14
On 12/11/20 11:45 AM, Jakub Kicinski wrote:
> Ack, these patches are not exciting (to me), so I'm wondering if there
> is a better way. The only reason NIC would have to understand a ULP for
> ZC is to parse out header/message lengths. There's gotta be a way to
> pass those in header options or such...
> 
> And, you know, if we figure something out - maybe we stand a chance
> against having 4 different zero copy implementations (this, TCP,
> AF_XDP, netgpu) :(

AF_XDP is for userspace IP/TCP/packet handling.

netgpu is fundamentally a kernel stack socket with zerocopy direct to
userspace buffers. Yes, it has more complications, like the process
running on a GPU, but the essential characteristics are header/payload
split, a kernel-stack-managed socket, and dedicated H/W queues for the
socket, and those are all common to what the nvme patches are doing. So,
can we create a common framework for those characteristics which enables
other use cases (like a more generic Rx zerocopy)?
Jonathan Lemon Dec. 11, 2020, 11:05 p.m. UTC | #15
On Fri, Dec 11, 2020 at 12:59:52PM -0700, David Ahern wrote:
> On 12/11/20 11:45 AM, Jakub Kicinski wrote:
> > Ack, these patches are not exciting (to me), so I'm wondering if there
> > is a better way. The only reason NIC would have to understand a ULP for
> > ZC is to parse out header/message lengths. There's gotta be a way to
> > pass those in header options or such...
> > 
> > And, you know, if we figure something out - maybe we stand a chance
> > against having 4 different zero copy implementations (this, TCP,
> > AF_XDP, netgpu) :(
> 
> AF_XDP is for userspace IP/TCP/packet handling.
> 
> netgpu is fundamentally a kernel stack socket with zerocopy direct to
> userspace buffers. Yes, it has more complications, like the process
> running on a GPU, but the essential characteristics are header/payload
> split, a kernel-stack-managed socket, and dedicated H/W queues for the
> socket, and those are all common to what the nvme patches are doing. So,
> can we create a common framework for those characteristics which enables
> other use cases (like a more generic Rx zerocopy)?

AF_XDP could also be viewed as a special case of netgpu.
Boris Pismenny Dec. 13, 2020, 6:21 p.m. UTC | #16
On 10/12/2020 6:26, David Ahern wrote:
> On 12/9/20 1:15 AM, Boris Pismenny wrote:
>> On 09/12/2020 2:38, David Ahern wrote:
[...]
>>
>> There is more to this than TCP zerocopy that exists in userspace or
>> inside the kernel. First, please note that the patches include support for
>> CRC offload as well as data placement. Second, data-placement is not the same
> 
> Yes, the CRC offload is different, but I think it is orthogonal to the
> 'where does h/w put the data' problem.
> 

I agree

>> as zerocopy for the following reasons:
>> (1) The former places buffers *exactly* where the user requests
>> regardless of the order of response arrivals, while the latter places packets
>> in anonymous buffers according to packet arrival order. Therefore, zerocopy
>> can be implemented using data placement, but not vice versa.
> 
> Fundamentally, it is an SGL and a TCP sequence number. There is a
> starting point where seq N == sgl element 0, position 0. Presumably
> there is a hardware cursor to track where you are in filling the SGL as
> packets are processed. You abort on OOO, so it seems like a fairly
> straightforward problem.
> 

We do not abort on OOO. Moreover, we can keep going as long as
PDU headers are not reordered.
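
For illustration, when the device does lose track of PDU boundaries, the ULP
side of the resync flow in this patch (resync_request plus the tcp_ddp_resync
op) could look roughly like the sketch below; the per-queue bookkeeping and
the sk_user_data lookup are placeholders, not the actual nvme-tcp code.

#include <net/tcp_ddp.h>

/* illustrative per-queue state, not the real nvme-tcp structure */
struct my_ddp_queue {
	struct net_device *netdev;
	u32 resync_req_seq;
	bool resync_pending;
};

/* installed in tcp_ddp_ulp_ops->resync_request: the NIC asks whether it
 * may resume offload at TCP sequence number 'seq' */
static bool my_ulp_resync_request(struct sock *sk, u32 seq, u32 flags)
{
	struct my_ddp_queue *queue = sk->sk_user_data;	/* placeholder lookup */

	if (!(flags & TCP_DDP_RESYNC_REQ))
		return false;

	/* remember the sequence the NIC asked about; it is confirmed below,
	 * once the software PDU parser reaches a header at that offset */
	queue->resync_req_seq = seq;
	queue->resync_pending = true;
	return true;
}

/* called from the ULP's receive path for every PDU header it parses */
static void my_ulp_maybe_resync(struct sock *sk, struct my_ddp_queue *queue,
				u32 pdu_seq)
{
	if (queue->resync_pending && queue->resync_req_seq == pdu_seq) {
		queue->resync_pending = false;
		queue->netdev->tcp_ddp_ops->tcp_ddp_resync(queue->netdev,
							   sk, pdu_seq);
	}
}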

>> (2) Data-placement supports sub-page zerocopy, unlike page-flipping
>> techniques (i.e., TCP_ZEROCOPY).
> 
> I am not pushing for or suggesting any page-flipping. I understand the
> limitations of that approach.
> 
>> (3) Page-flipping can't work for any storage initiator because the
>> destination buffer is owned by some user pagecache or process using O_DIRECT.
>> (4) Storage over TCP PDUs are not necessarily aligned to TCP packets,
>> i.e., the PDU header can be in the middle of a packet, so header-data split
>> alone isn't enough.
> 
> yes, TCP is a byte stream and you have to have a cursor marking last
> written spot in the SGL. More below.
> 
>>
>> I wish we could do the same using some simpler zerocopy mechanism,
>> it would indeed simplify things. But, unfortunately this would severely
>> restrict generality, no sub-page support and alignment between PDUs
>> and packets, and performance (ordering of PDUs).
>>
> 
> My biggest concern is that you are adding checks in the fast path for a
> very specific use case. If / when Rx zerocopy happens (and I suspect it
> has to happen soon to handle the ever increasing speeds), nothing about
> this patch set is reusable and worse more checks are needed in the fast
> path. I think it is best if you make this more generic — at least
> anything touching core code.
> 
> For example, you have an iov static key hook managed by a driver for
> generic code. There are a few ways around that. One is by adding skb
> details to the nvme code — ie., walking the skb fragments, seeing that a
> given frag is in your allocated memory and skipping the copy. This would
> offer best performance since it skips all unnecessary checks. Another
> option is to export __skb_datagram_iter, use it and define your own copy
> handler that does the address compare and skips the copy. Key point -
> only your code path is affected.

I'll submit a V2 that is less invasive to the core code. IMO exporting
__skb_datagram_iter and all of its child functions is the more generic
option, so we'll use that.
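
For illustration, the nvme-tcp receive path could then pass its own copy
callback, roughly like the sketch below. It assumes __skb_datagram_iter() and
its helpers are exported as discussed, and nvme_tcp_ddp_owns_buffer() is a
placeholder for the check against the request's DDP buffers, not an existing
function.

#include <linux/skbuff.h>
#include <linux/uio.h>

/* placeholder: the real check would compare 'addr' against the pages in
 * the request's tcp_ddp_io SGL */
static bool nvme_tcp_ddp_owns_buffer(void *req, const void *addr, size_t bytes)
{
	return false;
}

static size_t nvme_tcp_ddp_copy_to_iter(const void *addr, size_t bytes,
					void *req, struct iov_iter *i)
{
	if (nvme_tcp_ddp_owns_buffer(req, addr, bytes)) {
		iov_iter_advance(i, bytes);	/* already placed by the NIC */
		return bytes;
	}
	return copy_to_iter(addr, bytes, i);	/* normal software copy */
}

static int nvme_tcp_ddp_recv_skb(const struct sk_buff *skb, int offset,
				 struct iov_iter *to, int len, void *req)
{
	/* same call shape as skb_copy_datagram_iter(), with our own
	 * copy callback that can skip frags the HW already wrote */
	return __skb_datagram_iter(skb, offset, to, len, false,
				   nvme_tcp_ddp_copy_to_iter, req);
}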

> 
> Similarly for the NVMe SGLs and DDP offload - a more generic solution
> allows other use cases to build on this as opposed to the checks you
> want for a special case. For example, a split at the protocol headers /
> payload boundaries would be a generic solution where kernel managed
> protocols get data in one buffer and socket data is put into a given
> SGL. I am guessing that you have to be already doing this to put PDU
> payloads into an SGL and other headers into other memory to make a
> complete packet, so this is not too far off from what you are already doing.
> 

Splitting at protocol header boundaries and placing data at socket defined
SGLs is not enough for nvme-tcp because the nvme-tcp protocol can reorder
responses. Here is an example:

the host submits the following requests:
+--------+--------+--------+
| Read 1 | Read 2 | Read 3 |
+--------+--------+--------+

the target responds with the following responses:
+--------+--------+--------+
| Resp 2 | Resp 3 | Resp 1 |
+--------+--------+--------+

Therefore, hardware must hold a mapping between PDU identifiers (command_id)
and the corresponding buffers. This interface is missing in the proposal
above, which is why it won't work for nvme-tcp.
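
To make that concrete, handing the command_id to SGL mapping to the NIC
before sending a read could look roughly like the sketch below, using the
tcp_ddp_io and tcp_ddp_dev_ops types from this patch. The surrounding names
are illustrative, and it assumes the request fits within first_sgl, i.e.
within the limits advertised via tcp_ddp_limits.

#include <linux/blkdev.h>
#include <linux/scatterlist.h>
#include <net/tcp_ddp.h>

/* sketch: register the command_id <-> SGL mapping for a read request */
static int my_ulp_setup_ddp(struct net_device *netdev, struct sock *sk,
			    struct request *rq, u32 command_id,
			    struct tcp_ddp_io *ddp)
{
	ddp->command_id = command_id;
	ddp->sg_table.sgl = ddp->first_sgl;
	sg_init_table(ddp->sg_table.sgl, blk_rq_nr_phys_segments(rq));
	ddp->nents = blk_rq_map_sg(rq->q, rq, ddp->sg_table.sgl);
	ddp->sg_table.nents = ddp->nents;

	/* on failure the ULP simply falls back to a software receive */
	return netdev->tcp_ddp_ops->tcp_ddp_setup(netdev, sk, ddp);
}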

> Let me walk through an example with assumptions about your hardware's
> capabilities, and you correct me where I am wrong. Assume you have a
> 'full' command response of this form:
> 
>  +------------- ... ----------------+---------+---------+--------+-----+
>  |          big data segment        | PDU hdr | TCP hdr | IP hdr | eth |
>  +------------- ... ----------------+---------+---------+--------+-----+
> 
> but it shows up to the host in 3 packets like this (ideal case):
> 
>  +-------------------------+---------+---------+--------+-----+
>  |       data - seg 1      | PDU hdr | TCP hdr | IP hdr | eth |
>  +-------------------------+---------+---------+--------+-----+
>  +-----------------------------------+---------+--------+-----+
>  |       data - seg 2                | TCP hdr | IP hdr | eth |
>  +-----------------------------------+---------+--------+-----+
>                    +-----------------+---------+--------+-----+
>                    | payload - seg 3 | TCP hdr | IP hdr | eth |
>                    +-----------------+---------+--------+-----+
> 
> 
> The hardware splits the eth/IP/tcp headers from payload like this
> (again, your hardware has to know these boundaries to accomplish what
> you want):
> 
>  +-------------------------+---------+     +---------+--------+-----+
>  |       data - seg 1      | PDU hdr |     | TCP hdr | IP hdr | eth |
>  +-------------------------+---------+     +---------+--------+-----+
> 
>  +-----------------------------------+     +---------+--------+-----+
>  |       data - seg 2                |     | TCP hdr | IP hdr | eth |
>  +-----------------------------------+     +---------+--------+-----+
> 
>                    +-----------------+     +---------+--------+-----+
>                    | payload - seg 3 |     | TCP hdr | IP hdr | eth |
>                    +-----------------+     +---------+--------+-----+
> 
> Left side goes into the SGLs posted for this socket / flow; the right
> side goes into some other memory resource made available for headers.
> This is very close to what you are doing now - with the exception of the
> PDU header being put to the right side. NVMe code then just needs to set
> the iov offset (or adjust the base_addr) to skip over the PDU header -
> standard options for an iov.
> 
> Yes, TCP is a byte stream, so the packets could very well show up like this:
> 
>  +--------------+---------+-----------+---------+--------+-----+
>  | data - seg 1 | PDU hdr | prev data | TCP hdr | IP hdr | eth |
>  +--------------+---------+-----------+---------+--------+-----+
>  +-----------------------------------+---------+--------+-----+
>  |     payload - seg 2               | TCP hdr | IP hdr | eth |
>  +-----------------------------------+---------+--------+-----+
>  +-------- +-------------------------+---------+--------+-----+
>  | PDU hdr |    payload - seg 3      | TCP hdr | IP hdr | eth |
>  +---------+-------------------------+---------+--------+-----+
> 
> If your hardware can extract the NVMe payload into a targeted SGL like
> you want in this set, then it has some logic for parsing headers and
> "snapping" an SGL to a new element. ie., it already knows 'prev data'
> goes with the in-progress PDU, sees more data, recognizes a new PDU
> header and a new payload. That means it already has to handle a
> 'snap-to-PDU' style argument where the end of the payload closes out an
> SGL element and the next PDU hdr starts in a new SGL element (ie., 'prev
> data' closes out sgl[i], and the next PDU hdr starts sgl[i+1]). So in
> this case, you want 'snap-to-PDU' but that could just as easily be 'no
> snap at all', just a byte stream and filling an SGL after the protocol
> headers.
> 
> Key point here is that this is the start of a generic header / data
> split that could work for other applications - not just NVMe. eth/IP/TCP
> headers are consumed by the Linux networking stack; data is in
> application owned, socket based SGLs to avoid copies.
> 

I think that the interface we created (tcp_ddp) is sufficiently generic
for the task at hand, which is offloading protocols that can re-order
their responses, a non-trivial task that we claim is important.

We designed it to support other protocols and not just nvme-tcp,
which is merely an example. For instance, I think that supporting iSCSI
would be natural, and that other protocols will fit nicely.

> ###
> 
> A dump of other comments about this patch set:

Thanks for reviewing! We will fix and resubmit.

> - there are a LOT of unnecessary typecasts around tcp_ddp_ctx that can
> be avoided by using container_of.
> 
> - you have an accessor tcp_ddp_get_ctx but no setter; all uses of
> tcp_ddp_get_ctx are within mlx5. why open code the set but use the
> accessor for the get? Worse, mlx5e_nvmeotcp_queue_teardown actually has
> both — uses the accessor and open codes setting icsk_ulp_ddp_data.
> 
> - the driver is storing private data on the socket. Nothing about the
> socket layer cares and the mlx5 driver is already tracking that data in
> priv->nvmeotcp->queue_hash. As I mentioned in a previous response, I
> understand the socket ops are needed for the driver level to call into
> the socket layer, but the data part does not seem to be needed.

The socket layer does care: the socket won't disappear under the driver,
because the socket layer cleans things up and calls back into the driver
before the socket goes away. This part is similar to what we have in tls,
where drivers store some private data per socket to assist offload.
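
For illustration, the driver-side pattern is roughly the sketch below (the
driver struct and helper names are placeholders); the get side also shows the
container_of() form you suggested for tcp_ddp_ctx.

#include <net/tcp_ddp.h>

/* illustrative driver per-queue context */
struct my_drv_ddp_queue {
	struct tcp_ddp_ctx ddp_ctx;	/* generic part, first member */
	/* ... driver specific state ... */
};

static void my_drv_set_ctx(struct sock *sk, struct my_drv_ddp_queue *queue)
{
	struct inet_connection_sock *icsk = inet_csk(sk);

	rcu_assign_pointer(icsk->icsk_ulp_ddp_data, &queue->ddp_ctx);
}

static struct my_drv_ddp_queue *my_drv_get_queue(struct sock *sk)
{
	return container_of(tcp_ddp_get_ctx(sk),
			    struct my_drv_ddp_queue, ddp_ctx);
}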

> 
> - nvme_tcp_offload_socket and nvme_tcp_offload_limits both return int
> yet the value is ignored
> 

This is on purpose. Users can tell whether the offload is working from the
NIC's ethtool counters. The offload is opportunistic and its failure is
non-fatal, and since nvme-tcp has no stats of its own we intentionally
ignore the return values for now.

> - the build robot found a number of problems (it pulls my github tree
> and I pushed this set to it to move across computers).
> 
> I think the patch set would be easier to follow if you restructured the
> patches to 1 thing only per patch -- e.g., split patch 2 into netdev
> bits and socket bits. Add the netdev feature bit and operations in 1
> patch and add the socket ops in a second patch with better commit logs
> about why each is needed and what is done.
>
Boris Pismenny Dec. 13, 2020, 6:34 p.m. UTC | #17
On 11/12/2020 20:45, Jakub Kicinski wrote:
> On Thu, 10 Dec 2020 19:43:57 -0700 David Ahern wrote:
>> On 12/10/20 7:01 PM, Jakub Kicinski wrote:
>>> On Wed, 9 Dec 2020 21:26:05 -0700 David Ahern wrote:  
>>>> Yes, TCP is a byte stream, so the packets could very well show up like this:
>>>>
>>>>  +--------------+---------+-----------+---------+--------+-----+
>>>>  | data - seg 1 | PDU hdr | prev data | TCP hdr | IP hdr | eth |
>>>>  +--------------+---------+-----------+---------+--------+-----+
>>>>  +-----------------------------------+---------+--------+-----+
>>>>  |     payload - seg 2               | TCP hdr | IP hdr | eth |
>>>>  +-----------------------------------+---------+--------+-----+
>>>>  +-------- +-------------------------+---------+--------+-----+
>>>>  | PDU hdr |    payload - seg 3      | TCP hdr | IP hdr | eth |
>>>>  +---------+-------------------------+---------+--------+-----+
>>>>
>>>> If your hardware can extract the NVMe payload into a targeted SGL like
>>>> you want in this set, then it has some logic for parsing headers and
>>>> "snapping" an SGL to a new element. ie., it already knows 'prev data'
>>>> goes with the in-progress PDU, sees more data, recognizes a new PDU
>>>> header and a new payload. That means it already has to handle a
>>>> 'snap-to-PDU' style argument where the end of the payload closes out an
>>>> SGL element and the next PDU hdr starts in a new SGL element (ie., 'prev
>>>> data' closes out sgl[i], and the next PDU hdr starts sgl[i+1]). So in
>>>> this case, you want 'snap-to-PDU' but that could just as easily be 'no
>>>> snap at all', just a byte stream and filling an SGL after the protocol
>>>> headers.  
>>>
>>> This 'snap-to-PDU' requirement is something that I don't understand
>>> with the current TCP zero copy. In case of, say, a storage application  
>>
>> current TCP zero-copy does not handle this and it can't AFAIK. I believe
>> it requires hardware level support where an Rx queue is dedicated to a
>> flow / socket and some degree of header and payload splitting (header is
>> consumed by the kernel stack and payload goes to socket owner's memory).
> 
> Yet, Google claims to use the RX ZC in production, and with a CX3 Pro /
> mlx4 NICs.
> 
> Simple workaround that comes to mind is have the headers and payloads
> on separate TCP streams. That doesn't seem too slick.. but neither is
> the 4k MSS, so maybe that's what Google does?
> 
>>> which wants to send some headers (whatever RPC info, block number,
>>> etc.) and then a 4k block of data - how does the RX side get just the
> >>> 4k block into a page so it can zero copy it out to its storage device?
>>>
>>> Per-connection state in the NIC, and FW parsing headers is one way,
>>> but I wonder how this record split problem is best resolved generically.
>>> Perhaps by passing hints in the headers somehow?
>>>
>>> Sorry for the slight off-topic :)
>>>   
>> Hardware has to be parsing the incoming packets to find the usual
>> ethernet/IP/TCP headers and TCP payload offset. Then the hardware has to
>> have some kind of ULP processor to know how to parse the TCP byte stream
>> at least well enough to find the PDU header and interpret it to get pdu
>> header length and payload length.
> 
> The big difference between normal headers and L7 headers is that one is
> at ~constant offset, self-contained, and always complete (PDU header
> can be split across segments).
> 
> Edwin Peer did an implementation of TLS ULP for the NFP, it was
> complex. Not to mention it's L7 protocol ossification.
> 

Some programmability in the PDU header parsing part will resolve the
ossification, and AFAICT the interfaces in the kernel do not ossify the
protocols.

> To put it bluntly maybe it's fine for smaller shops but I'm guessing
> it's going to be a hard sell to hyperscalers and people who don't like
> to be locked in to HW.
> 
>> At that point you push the protocol headers (eth/ip/tcp) into one buffer
>> for the kernel stack protocols and put the payload into another. The
>> former would be some page owned by the OS and the latter owned by the
>> process / socket (generically, in this case it is a kernel level
>> socket). In addition, since the payload is spread across multiple
>> packets the hardware has to keep track of TCP sequence number and its
>> current place in the SGL where it is writing the payload to keep the
>> bytes contiguous and detect out-of-order.
>>
>> If the ULP processor knows about PDU headers it knows when enough
>> payload has been found to satisfy that PDU in which case it can tell the
>> cursor to move on to the next SGL element (or separate SGL). That's what
>> I meant by 'snap-to-PDU'.
>>
>> Alternatively, if it is any random application with a byte stream not
>> understood by hardware, the cursor just keeps moving along the SGL
>> elements assigned it for this particular flow.
>>
>> If you have a socket whose payload is getting offloaded to its own queue
>> (which this set is effectively doing), you can create the queue with
>> some attribute that says 'NVMe ULP', 'iscsi ULP', 'just a byte stream'
>> that controls the parsing when you stop writing to one SGL element and
>> move on to the next. Again, assuming hardware support for such attributes.
>>
>> I don't work for Nvidia, so this is all supposition based on what the
>> patches are doing.
> 
> Ack, these patches are not exciting (to me), so I'm wondering if there
> is a better way. The only reason NIC would have to understand a ULP for
> ZC is to parse out header/message lengths. There's gotta be a way to
> pass those in header options or such...
> 
> And, you know, if we figure something out - maybe we stand a chance
> against having 4 different zero copy implementations (this, TCP,
> AF_XDP, netgpu) :(
> 

As stated in another thread here: simply splitting header and data
while also placing the payload at some socket buffer address is zerocopy,
but not data placement. The latter handles PDU reordering. I think
it is unjust to place them all in the same category.
David Ahern Dec. 15, 2020, 5:19 a.m. UTC | #18
On 12/13/20 11:21 AM, Boris Pismenny wrote:
>>> as zerocopy for the following reasons:
>>> (1) The former places buffers *exactly* where the user requests
>>> regardless of the order of response arrivals, while the latter places packets
>>> in anonymous buffers according to packet arrival order. Therefore, zerocopy
>>> can be implemented using data placement, but not vice versa.
>>
>> Fundamentally, it is an SGL and a TCP sequence number. There is a
>> starting point where seq N == sgl element 0, position 0. Presumably
>> there is a hardware cursor to track where you are in filling the SGL as
>> packets are processed. You abort on OOO, so it seems like a fairly
>> straightforward problem.
>>
> 
> We do not abort on OOO. Moreover, we can keep going as long as
> PDU headers are not reordered.

Meaning packets received OOO which contain all or part of a PDU header
are aborted, but pure data packets can arrive out-of-order?

Did you measure the effect of OOO packets? e.g., randomly drop 1 in 1000
nvme packets, 1 in 10,000, 1 in 100,000? How does that affect the fio
benchmarks?

>> Similarly for the NVMe SGLs and DDP offload - a more generic solution
>> allows other use cases to build on this as opposed to the checks you
>> want for a special case. For example, a split at the protocol headers /
>> payload boundaries would be a generic solution where kernel managed
>> protocols get data in one buffer and socket data is put into a given
>> SGL. I am guessing that you have to be already doing this to put PDU
>> payloads into an SGL and other headers into other memory to make a
>> complete packet, so this is not too far off from what you are already doing.
>>
> 
> Splitting at protocol header boundaries and placing data at socket defined
> SGLs is not enough for nvme-tcp because the nvme-tcp protocol can reorder
> responses. Here is an example:
> 
> the host submits the following requests:
> +--------+--------+--------+
> | Read 1 | Read 2 | Read 3 |
> +--------+--------+--------+
> 
> the target responds with the following responses:
> +--------+--------+--------+
> | Resp 2 | Resp 3 | Resp 1 |
> +--------+--------+--------+

Does 'Resp N' == 'PDU + data' like this:

 +---------+--------+---------+--------+---------+--------+
 | PDU hdr | Resp 2 | PDU hdr | Resp 3 | PDU hdr | Resp 1 |
 +---------+--------+---------+--------+---------+--------+

or is it 1 PDU hdr + all of the responses?

> 
> I think that the interface we created (tcp_ddp) is sufficiently generic
> for the task at hand, which is offloading protocols that can re-order
> their responses, a non-trivial task that we claim is important.
> 
> We designed it to support other protocols and not just nvme-tcp,
> which is merely an example. For instance, I think that supporting iSCSI
> would be natural, and that other protocols will fit nicely.

It would be good to add documentation that describes the design, its
assumptions and its limitations. tls has several under
Documentation/networking. e.g., one important limitation to note is that
this design only works for TCP sockets owned by kernel modules.

> 
>> ###
>>
>> A dump of other comments about this patch set:
> 
> Thanks for reviewing! We will fix and resubmit.

Another one I noticed today. You have several typecasts like this:

cqe128 = (struct mlx5e_cqe128 *)((char *)cqe - 64);

since cqe is a member of cqe128, container_of should be used instead of
the magic '- 64'.
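
For example, assuming the standard 64-byte CQE sits in the second half of the
128-byte entry (the layout and member name here are illustrative, the real
struct is in your mlx5 patches), that cast becomes:

struct mlx5e_cqe128 {
	u8 rsvd[64];			/* NVMe-TCP specific half */
	struct mlx5_cqe64 cqe64;	/* standard CQE at offset 64 */
};

static inline struct mlx5e_cqe128 *mlx5e_cqe128_from_cqe(struct mlx5_cqe64 *cqe)
{
	return container_of(cqe, struct mlx5e_cqe128, cqe64);
}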
Boris Pismenny Dec. 17, 2020, 7:06 p.m. UTC | #19
On 15/12/2020 7:19, David Ahern wrote:
> On 12/13/20 11:21 AM, Boris Pismenny wrote:
>>>> as zerocopy for the following reasons:
>>>> (1) The former places buffers *exactly* where the user requests
>>>> regardless of the order of response arrivals, while the latter places packets
>>>> in anonymous buffers according to packet arrival order. Therefore, zerocopy
>>>> can be implemented using data placement, but not vice versa.
>>>
>>> Fundamentally, it is an SGL and a TCP sequence number. There is a
>>> starting point where seq N == sgl element 0, position 0. Presumably
>>> there is a hardware cursor to track where you are in filling the SGL as
>>> packets are processed. You abort on OOO, so it seems like a fairly
>>> straightforward problem.
>>>
>>
>> We do not abort on OOO. Moreover, we can keep going as long as
>> PDU headers are not reordered.
> 
> Meaning packets received OOO which contain all or part of a PDU header
> are aborted, but pure data packets can arrive out-of-order?
> 
> Did you measure the effect of OOO packets? e.g., randomly drop 1 in 1000
> nvme packets, 1 in 10,000, 1 in 100,000? How does that affect the fio
> benchmarks?
> 

Yes for TLS, where similar ideas are used, but not for NVMe-TCP yet.
At the worst case we measured (5% OOO), we got the same performance
as pure software TLS under these conditions. We will strive for the
same with nvme-tcp, but we will be able to test this on nvme-tcp only
once we have hardware. For now, we are using a mix of emulation and
simulation to test and benchmark.

>>> Similarly for the NVMe SGLs and DDP offload - a more generic solution
>>> allows other use cases to build on this as opposed to the checks you
>>> want for a special case. For example, a split at the protocol headers /
>>> payload boundaries would be a generic solution where kernel managed
>>> protocols get data in one buffer and socket data is put into a given
>>> SGL. I am guessing that you have to be already doing this to put PDU
>>> payloads into an SGL and other headers into other memory to make a
>>> complete packet, so this is not too far off from what you are already doing.
>>>
>>
>> Splitting at protocol header boundaries and placing data at socket defined
>> SGLs is not enough for nvme-tcp because the nvme-tcp protocol can reorder
>> responses. Here is an example:
>>
>> the host submits the following requests:
>> +--------+--------+--------+
>> | Read 1 | Read 2 | Read 3 |
>> +--------+--------+--------+
>>
>> the target responds with the following responses:
>> +--------+--------+--------+
>> | Resp 2 | Resp 3 | Resp 1 |
>> +--------+--------+--------+
> 
> Does 'Resp N' == 'PDU + data' like this:
> 
>  +---------+--------+---------+--------+---------+--------+
>  | PDU hdr | Resp 2 | PDU hdr | Resp 3 | PDU hdr | Resp 1 |
>  +---------+--------+---------+--------+---------+--------+
> 
> or is it 1 PDU hdr + all of the responses?
> 

Yes, 'RespN = PDU header + PDU data' segmented by TCP whichever way it
chooses to do so. The PDU header's command_id field correlates between
the request and the response. We use that correlation in hardware to
identify the buffers where data needs to be scattered. In other words,
hardware holds a map between command_id and block layer buffers SGL.
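
For completeness, the release side of that mapping goes through
tcp_ddp_teardown and the ddp_teardown_done callback, roughly as in the sketch
below (names outside the tcp_ddp_* API are illustrative):

/* sketch: release the command_id <-> SGL binding once the response has
 * been consumed; completion may be asynchronous via ddp_teardown_done */
static int my_ulp_release_ddp(struct net_device *netdev, struct sock *sk,
			      struct tcp_ddp_io *ddp, void *ddp_ctx)
{
	return netdev->tcp_ddp_ops->tcp_ddp_teardown(netdev, sk, ddp, ddp_ctx);
}

/* installed in tcp_ddp_ulp_ops->ddp_teardown_done */
static void my_ulp_teardown_done(void *ddp_ctx)
{
	/* ddp_ctx identifies the in-flight request; complete it towards
	 * the block layer here */
}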

>>
>> I think that the interface we created (tcp_ddp) is sufficiently generic
>> for the task at hand, which is offloading protocols that can re-order
>> their responses, a non-trivial task that we claim is important.
>>
>> We designed it to support other protocols and not just nvme-tcp,
>> which is merely an example. For instance, I think that supporting iSCSI
>> would be natural, and that other protocols will fit nicely.
> 
> It would be good to add documentation that describes the design, its
> assumptions and its limitations. tls has several under
> Documentation/networking. e.g., one important limitation to note is that
> this design only works for TCP sockets owned by kernel modules.
> 

You are right; I'll do so for tcp_ddp. It indeed works only for kernel
TCP sockets, but maybe future work will extend it.

>>
>>> ###
>>>
>>> A dump of other comments about this patch set:
>>
>> Thanks for reviewing! We will fix and resubmit.
> 
> Another one I noticed today. You have several typecasts like this:
> 
> cqe128 = (struct mlx5e_cqe128 *)((char *)cqe - 64);
> 
> since cqe is a member of cqe128, container_of should be used instead of
> the magic '- 64'.
> 

Will fix
David Ahern Dec. 18, 2020, 12:44 a.m. UTC | #20
On 12/17/20 12:06 PM, Boris Pismenny wrote:
> 
> 
> On 15/12/2020 7:19, David Ahern wrote:
>> On 12/13/20 11:21 AM, Boris Pismenny wrote:
>>>>> as zerocopy for the following reasons:
>>>>> (1) The former places buffers *exactly* where the user requests
>>>>> regardless of the order of response arrivals, while the latter places packets
>>>>> in anonymous buffers according to packet arrival order. Therefore, zerocopy
>>>>> can be implemented using data placement, but not vice versa.
>>>>
>>>> Fundamentally, it is an SGL and a TCP sequence number. There is a
>>>> starting point where seq N == sgl element 0, position 0. Presumably
>>>> there is a hardware cursor to track where you are in filling the SGL as
>>>> packets are processed. You abort on OOO, so it seems like a fairly
>>>> straightforward problem.
>>>>
>>>
>>> We do not abort on OOO. Moreover, we can keep going as long as
>>> PDU headers are not reordered.
>>
>> Meaning packets received OOO which contain all or part of a PDU header
>> are aborted, but pure data packets can arrive out-of-order?
>>
>> Did you measure the effect of OOO packets? e.g., randomly drop 1 in 1000
>> nvme packets, 1 in 10,000, 1 in 100,000? How does that affect the fio
>> benchmarks?
>>
> 
> Yes for TLS, where similar ideas are used, but not for NVMe-TCP yet.
> At the worst case we measured (5% OOO), we got the same performance
> as pure software TLS under these conditions. We will strive for the
> same with nvme-tcp, but we will be able to test this on nvme-tcp only
> once we have hardware. For now, we are using a mix of emulation and
> simulation to test and benchmark.

Interesting. So you don't have hardware today for this feature.

> 
>>>> Similarly for the NVMe SGLs and DDP offload - a more generic solution
>>>> allows other use cases to build on this as opposed to the checks you
>>>> want for a special case. For example, a split at the protocol headers /
>>>> payload boundaries would be a generic solution where kernel managed
>>>> protocols get data in one buffer and socket data is put into a given
>>>> SGL. I am guessing that you have to be already doing this to put PDU
>>>> payloads into an SGL and other headers into other memory to make a
>>>> complete packet, so this is not too far off from what you are already doing.
>>>>
>>>
>>> Splitting at protocol header boundaries and placing data at socket defined
>>> SGLs is not enough for nvme-tcp because the nvme-tcp protocol can reorder
>>> responses. Here is an example:
>>>
>>> the host submits the following requests:
>>> +--------+--------+--------+
>>> | Read 1 | Read 2 | Read 3 |
>>> +--------+--------+--------+
>>>
>>> the target responds with the following responses:
>>> +--------+--------+--------+
>>> | Resp 2 | Resp 3 | Resp 1 |
>>> +--------+--------+--------+
>>
>> Does 'Resp N' == 'PDU + data' like this:
>>
>>  +---------+--------+---------+--------+---------+--------+
>>  | PDU hdr | Resp 2 | PDU hdr | Resp 3 | PDU hdr | Resp 1 |
>>  +---------+--------+---------+--------+---------+--------+
>>
>> or is it 1 PDU hdr + all of the responses?
>>
> 
> Yes, 'RespN = PDU header + PDU data' segmented by TCP whichever way it
> chooses to do so. The PDU header's command_id field correlates between

I thought so, and verified it in the NVMe over Fabrics spec. Not sure why
you brought up the order of responses; that should be irrelevant to the design.

> the request and the response. We use that correlation in hardware to
> identify the buffers where data needs to be scattered. In other words,
> hardware holds a map between command_id and block layer buffers SGL.

Understood. Your design forces a command_id-to-SGL correlation instead of
handing the h/w generic or shaped SGLs for writing the socket data.

> 
>>>
>>> I think that the interface we created (tcp_ddp) is sufficiently generic
>>> for the task at hand, which is offloading protocols that can re-order
>>> their responses, a non-trivial task that we claim is important.
>>>
>>> We designed it to support other protocols and not just nvme-tcp,
>>> which is merely an example. For instance, I think that supporting iSCSI
>>> would be natural, and that other protocols will fit nicely.
>>
>> It would be good to add documentation that describes the design, its
>> assumptions and its limitations. tls has several under
>> Documentation/networking. e.g., one important limitation to note is that
>> this design only works for TCP sockets owned by kernel modules.
>>
> 
> You are right. I'll do so for tcp_ddp. You are right that it works only
> for kernel TCP sockets, but maybe future work will extend it.

Little to nothing about this design can work for userspace sockets or
setups like Jonathan's netgpu. E.g., the memory address compares that
decide whether to copy the data will not work with userspace buffers,
which is why I suggested something more generic like marking the skb as
"h/w offload"; but that does not work for you since your skbs have a
mixed bag of fragments.
diff mbox series

Patch

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 934de56644e7..fb35dcac03d2 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -84,6 +84,7 @@  enum {
 	NETIF_F_GRO_FRAGLIST_BIT,	/* Fraglist GRO */
 
 	NETIF_F_HW_MACSEC_BIT,		/* Offload MACsec operations */
+	NETIF_F_HW_TCP_DDP_BIT,		/* TCP direct data placement offload */
 
 	/*
 	 * Add your fresh new feature above and remember to update
@@ -157,6 +158,7 @@  enum {
 #define NETIF_F_GRO_FRAGLIST	__NETIF_F(GRO_FRAGLIST)
 #define NETIF_F_GSO_FRAGLIST	__NETIF_F(GSO_FRAGLIST)
 #define NETIF_F_HW_MACSEC	__NETIF_F(HW_MACSEC)
+#define NETIF_F_HW_TCP_DDP	__NETIF_F(HW_TCP_DDP)
 
 /* Finds the next feature with the highest number of the range of start till 0.
  */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a07c8e431f45..755766976408 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -934,6 +934,7 @@  struct dev_ifalias {
 
 struct devlink;
 struct tlsdev_ops;
+struct tcp_ddp_dev_ops;
 
 struct netdev_name_node {
 	struct hlist_node hlist;
@@ -1930,6 +1931,10 @@  struct net_device {
 	const struct tlsdev_ops *tlsdev_ops;
 #endif
 
+#ifdef CONFIG_TCP_DDP
+	const struct tcp_ddp_dev_ops *tcp_ddp_ops;
+#endif
+
 	const struct header_ops *header_ops;
 
 	unsigned int		flags;
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 7338b3865a2a..a08b85b53aa8 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -66,6 +66,8 @@  struct inet_connection_sock_af_ops {
  * @icsk_ulp_ops	   Pluggable ULP control hook
  * @icsk_ulp_data	   ULP private data
  * @icsk_clean_acked	   Clean acked data hook
+ * @icsk_ulp_ddp_ops	   Pluggable ULP direct data placement control hook
+ * @icsk_ulp_ddp_data	   ULP direct data placement private data
  * @icsk_listen_portaddr_node	hash to the portaddr listener hashtable
  * @icsk_ca_state:	   Congestion control state
  * @icsk_retransmits:	   Number of unrecovered [RTO] timeouts
@@ -94,6 +96,8 @@  struct inet_connection_sock {
 	const struct tcp_ulp_ops  *icsk_ulp_ops;
 	void __rcu		  *icsk_ulp_data;
 	void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);
+	const struct tcp_ddp_ulp_ops  *icsk_ulp_ddp_ops;
+	void __rcu		  *icsk_ulp_ddp_data;
 	struct hlist_node         icsk_listen_portaddr_node;
 	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
 	__u8			  icsk_ca_state:5,
diff --git a/include/net/tcp_ddp.h b/include/net/tcp_ddp.h
new file mode 100644
index 000000000000..df3264be4600
--- /dev/null
+++ b/include/net/tcp_ddp.h
@@ -0,0 +1,129 @@ 
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * tcp_ddp.h
+ *	Author:	Boris Pismenny <borisp@mellanox.com>
+ *	Copyright (C) 2020 Mellanox Technologies.
+ */
+#ifndef _TCP_DDP_H
+#define _TCP_DDP_H
+
+#include <linux/netdevice.h>
+#include <net/inet_connection_sock.h>
+#include <net/sock.h>
+
+/* limits returned by the offload driver, zero means don't care */
+struct tcp_ddp_limits {
+	int	 max_ddp_sgl_len;
+};
+
+enum tcp_ddp_type {
+	TCP_DDP_NVME = 1,
+};
+
+/**
+ * struct tcp_ddp_config - Generic tcp ddp configuration: tcp ddp IO queue
+ * config implementations must use this as the first member.
+ * Add new instances of tcp_ddp_config below (nvme-tcp, etc.).
+ */
+struct tcp_ddp_config {
+	enum tcp_ddp_type    type;
+	unsigned char        buf[];
+};
+
+/**
+ * struct nvme_tcp_ddp_config - nvme tcp ddp configuration for an IO queue
+ *
+ * @pfv:        pdu version (e.g., NVME_TCP_PFV_1_0)
+ * @cpda:       controller pdu data alignment (dwords, 0's based)
+ * @dgst:       digest types enabled.
+ *              The netdev will offload crc if ddp_crc is supported.
+ * @queue_size: number of nvme-tcp IO queue elements
+ * @queue_id:   queue identifier
+ * @io_cpu:     cpu core running the IO thread for this queue
+ */
+struct nvme_tcp_ddp_config {
+	struct tcp_ddp_config   cfg;
+
+	u16			pfv;
+	u8			cpda;
+	u8			dgst;
+	int			queue_size;
+	int			queue_id;
+	int			io_cpu;
+};
+
+/**
+ * struct tcp_ddp_io - tcp ddp configuration for an IO request.
+ *
+ * @command_id:  identifier on the wire associated with these buffers
+ * @nents:       number of entries in the sg_table
+ * @sg_table:    describing the buffers for this IO request
+ * @first_sgl:   first SGL in sg_table
+ */
+struct tcp_ddp_io {
+	u32			command_id;
+	int			nents;
+	struct sg_table		sg_table;
+	struct scatterlist	first_sgl[SG_CHUNK_SIZE];
+};
+
+/* struct tcp_ddp_dev_ops - operations used by an upper layer protocol to configure ddp offload
+ *
+ * @tcp_ddp_limits:    limit the number of scatter gather entries per IO.
+ *                     the device driver can use this to limit the resources allocated per queue.
+ * @tcp_ddp_sk_add:    add offload for the queue represented by the socket+config pair.
+ *                     this function is used to configure either copy, crc or both offloads.
+ * @tcp_ddp_sk_del:    remove offload from the socket, and release any device related resources.
+ * @tcp_ddp_setup:     request copy offload for buffers associated with a command_id in tcp_ddp_io.
+ * @tcp_ddp_teardown:  release offload resources association between buffers and command_id in
+ *                     tcp_ddp_io.
+ * @tcp_ddp_resync:    respond to the driver's resync_request. Called only if resync is successful.
+ */
+struct tcp_ddp_dev_ops {
+	int (*tcp_ddp_limits)(struct net_device *netdev,
+			      struct tcp_ddp_limits *limits);
+	int (*tcp_ddp_sk_add)(struct net_device *netdev,
+			      struct sock *sk,
+			      struct tcp_ddp_config *config);
+	void (*tcp_ddp_sk_del)(struct net_device *netdev,
+			       struct sock *sk);
+	int (*tcp_ddp_setup)(struct net_device *netdev,
+			     struct sock *sk,
+			     struct tcp_ddp_io *io);
+	int (*tcp_ddp_teardown)(struct net_device *netdev,
+				struct sock *sk,
+				struct tcp_ddp_io *io,
+				void *ddp_ctx);
+	void (*tcp_ddp_resync)(struct net_device *netdev,
+			       struct sock *sk, u32 seq);
+};
+
+#define TCP_DDP_RESYNC_REQ (1 << 0)
+
+/**
+ * struct tcp_ddp_ulp_ops - Interface to register upper layer Direct Data Placement (DDP) TCP offload
+ */
+struct tcp_ddp_ulp_ops {
+	/* NIC requests ulp to indicate if @seq is the start of a message */
+	bool (*resync_request)(struct sock *sk, u32 seq, u32 flags);
+	/* NIC driver informs the ulp that ddp teardown is done - used for async completions */
+	void (*ddp_teardown_done)(void *ddp_ctx);
+};
+
+/**
+ * struct tcp_ddp_ctx - Generic tcp ddp context: device driver per queue contexts must
+ * use this as the first member.
+ */
+struct tcp_ddp_ctx {
+	enum tcp_ddp_type    type;
+	unsigned char        buf[];
+};
+
+static inline struct tcp_ddp_ctx *tcp_ddp_get_ctx(const struct sock *sk)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
+	return (__force struct tcp_ddp_ctx *)icsk->icsk_ulp_ddp_data;
+}
+
+#endif //_TCP_DDP_H
diff --git a/net/Kconfig b/net/Kconfig
index f4c32d982af6..3876861cdc90 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -457,6 +457,15 @@  config ETHTOOL_NETLINK
 	  netlink. It provides better extensibility and some new features,
 	  e.g. notification messages.
 
+config TCP_DDP
+	bool "TCP direct data placement offload"
+	default n
+	help
+	  Direct Data Placement (DDP) offload for TCP enables ULPs, such as
+	  NVMe-TCP/iSCSI, to request the NIC to place TCP payload data
+	  of a command response directly into kernel pages.
+
+
 endif   # if NET
 
 # Used by archs to tell that they support BPF JIT compiler plus which flavour.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ed61eed1195d..75354fb8fe94 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -69,6 +69,7 @@ 
 #include <net/xfrm.h>
 #include <net/mpls.h>
 #include <net/mptcp.h>
+#include <net/tcp_ddp.h>
 
 #include <linux/uaccess.h>
 #include <trace/events/skb.h>
@@ -6135,9 +6136,15 @@  EXPORT_SYMBOL(pskb_extract);
  */
 void skb_condense(struct sk_buff *skb)
 {
+	bool is_ddp = false;
+
+#ifdef CONFIG_TCP_DDP
+	is_ddp = skb->sk && inet_csk(skb->sk) &&
+		 inet_csk(skb->sk)->icsk_ulp_ddp_data;
+#endif
 	if (skb->data_len) {
 		if (skb->data_len > skb->end - skb->tail ||
-		    skb_cloned(skb))
+		    skb_cloned(skb) || is_ddp)
 			return;
 
 		/* Nice, we can free page frag(s) right now */
diff --git a/net/ethtool/common.c b/net/ethtool/common.c
index 24036e3055a1..a2ff7a4a6bbf 100644
--- a/net/ethtool/common.c
+++ b/net/ethtool/common.c
@@ -68,6 +68,7 @@  const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN] = {
 	[NETIF_F_HW_TLS_RX_BIT] =	 "tls-hw-rx-offload",
 	[NETIF_F_GRO_FRAGLIST_BIT] =	 "rx-gro-list",
 	[NETIF_F_HW_MACSEC_BIT] =	 "macsec-hw-offload",
+	[NETIF_F_HW_TCP_DDP_BIT] =	 "tcp-ddp-offload",
 };
 
 const char