Message ID | 20201207210649.19194-1-borisp@mellanox.com (mailing list archive)
---|---
Series | nvme-tcp receive offloads
Hey Boris, sorry for some delays on my end...

I saw some long discussions on this set with David, what is the status here?

I'll look some more into the patches, but if you addressed the feedback from the last iteration I don't expect major issues with this patch set (at least from the nvme-tcp side).

> Changes since RFC v1:
> =========================================
> * Split mlx5 driver patches into several commits
> * Fix nvme-tcp handling of recovery flows. In particular, move queue offload init/teardown to the start/stop functions.

I'm assuming that you tested controller resets and network hiccups during traffic, right?
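As a rough illustration of the changelog item above (moving queue offload init/teardown into the queue start/stop path), here is a minimal standalone sketch. The names (queue_start, queue_stop, the offload fields) are hypothetical, not the actual nvme-tcp or mlx5 symbols; only the ordering is the point.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch of the ordering only; queue_start/queue_stop and the offload
 * fields are illustrative names, not the real nvme-tcp or mlx5 code.
 */
struct queue {
	bool transport_up;
	bool offload_up;
};

static int queue_start(struct queue *q)
{
	q->transport_up = true;
	/* Offload init is done as part of queue start, so restarting the
	 * queue (e.g. after a controller reset) also re-establishes the
	 * HW offload context. */
	q->offload_up = true;
	printf("queue started, offload context installed\n");
	return 0;
}

static void queue_stop(struct queue *q)
{
	/* Teardown mirrors init: the offload context is released together
	 * with the queue during error recovery or reset. */
	q->offload_up = false;
	q->transport_up = false;
	printf("queue stopped, offload context released\n");
}

int main(void)
{
	struct queue q = { 0 };

	queue_start(&q);	/* initial connect */
	queue_stop(&q);		/* e.g. controller reset tears it down */
	queue_start(&q);	/* ...and re-creates it on restart */
	return 0;
}
```

Tying the offload context to queue start/stop would mean that a reset or error-recovery cycle tears down and re-creates the HW context along with the queue, which is what the recovery-flow fix in the changelog appears to be about.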
On 1/13/21 6:27 PM, Sagi Grimberg wrote:
>> Changes since RFC v1:
>> =========================================
>> * Split mlx5 driver patches into several commits
>> * Fix nvme-tcp handling of recovery flows. In particular, move queue offload init/teardown to the start/stop functions.
>
> I'm assuming that you tested controller resets and network hiccups during traffic, right?

I had questions on this part as well -- e.g., what happens on a TCP retry? Packets arrive and the SGL is filled for the command ID, but then a packet is dropped in the stack (e.g., the enqueue backlog is full, so the packet gets dropped).
On 14/01/2021 3:27, Sagi Grimberg wrote:
> Hey Boris, sorry for some delays on my end...
>
> I saw some long discussions on this set with David, what is the status here?
>

The main purpose of this series is to address these discussions.

> I'll look some more into the patches, but if you addressed the feedback from the last iteration I don't expect major issues with this patch set (at least from the nvme-tcp side).
>
>> Changes since RFC v1:
>> =========================================
>> * Split mlx5 driver patches into several commits
>> * Fix nvme-tcp handling of recovery flows. In particular, move queue offload init/teardown to the start/stop functions.
>
> I'm assuming that you tested controller resets and network hiccups during traffic, right?
>

Network hiccups were tested using netem packet drops and reordering. We tested error recovery by taking the controller down and bringing it back up, both while the system is quiescent and during traffic.

If you have another test in mind, please let me know.
On 14/01/2021 6:47, David Ahern wrote:
> On 1/13/21 6:27 PM, Sagi Grimberg wrote:
>>> Changes since RFC v1:
>>> =========================================
>>> * Split mlx5 driver patches into several commits
>>> * Fix nvme-tcp handling of recovery flows. In particular, move queue offload init/teardown to the start/stop functions.
>>
>> I'm assuming that you tested controller resets and network hiccups during traffic, right?
>
> I had questions on this part as well -- e.g., what happens on a TCP retry? Packets arrive and the SGL is filled for the command ID, but then a packet is dropped in the stack (e.g., the enqueue backlog is full, so the packet gets dropped).
>

On re-transmission, the HW context's expected TCP sequence number doesn't match. As a result, the received packet is un-offloaded, and software does the copy/CRC for its data. As a general rule, if the HW context's expected sequence number doesn't match, there is no offload.
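A minimal sketch of the fallback behavior described above, under the assumption that a mismatch between a segment's sequence number and the HW context's expected sequence number simply routes the data through the regular software copy/CRC path. The names (offload_ctx, recv_segment, sw_copy_and_crc) are illustrative only, not the actual mlx5 or nvme-tcp code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative sketch only: the real logic lives in the mlx5 driver and
 * the nvme-tcp offload glue; these names and this simplified flow are
 * assumptions, not the actual kernel code.
 */
struct offload_ctx {
	uint32_t expected_seq;	/* next TCP sequence number the HW context expects */
	bool     active;	/* HW context currently in sync with the stream */
};

/* Software fallback: copy the payload and compute the data digest (CRC). */
static void sw_copy_and_crc(const uint8_t *data, size_t len)
{
	(void)data;
	printf("software path: copy + CRC over %zu bytes\n", len);
}

static void recv_segment(struct offload_ctx *ctx, uint32_t seq,
			 const uint8_t *data, size_t len)
{
	if (!ctx->active || seq != ctx->expected_seq) {
		/* Retransmitted or out-of-order segment: the HW expectation
		 * no longer matches, so the data is handled entirely in
		 * software (copy + CRC), i.e. it is "un-offloaded". */
		sw_copy_and_crc(data, len);
		return;
	}
	/* In-sequence segment: the NIC already placed the data into the
	 * command's SGL (DDP), so software skips the copy/CRC for it. */
	ctx->expected_seq += (uint32_t)len;
	printf("offloaded path: data already placed, %zu bytes\n", len);
}

int main(void)
{
	struct offload_ctx ctx = { .expected_seq = 1000, .active = true };
	uint8_t payload[512] = { 0 };

	recv_segment(&ctx, 1000, payload, sizeof(payload));	/* offloaded */
	recv_segment(&ctx, 1000, payload, sizeof(payload));	/* retransmit -> software */
	return 0;
}
```

In David's scenario (a segment dropped in the stack, then retransmitted), the sequence check fails and the payload goes through the software path, so correctness would not depend on the offload staying in sync with the stream.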
>> Hey Boris, sorry for some delays on my end...
>>
>> I saw some long discussions on this set with David, what is the status here?
>
> The main purpose of this series is to address these discussions.
>
>> I'll look some more into the patches, but if you addressed the feedback from the last iteration I don't expect major issues with this patch set (at least from the nvme-tcp side).
>>
>>> Changes since RFC v1:
>>> =========================================
>>> * Split mlx5 driver patches into several commits
>>> * Fix nvme-tcp handling of recovery flows. In particular, move queue offload init/teardown to the start/stop functions.
>>
>> I'm assuming that you tested controller resets and network hiccups during traffic, right?
>
> Network hiccups were tested using netem packet drops and reordering. We tested error recovery by taking the controller down and bringing it back up, both while the system is quiescent and during traffic.
>
> If you have another test in mind, please let me know.

I suggest also performing interface down/up during traffic, both on the host and the targets. Other than that, we should be in decent shape...