[RFC,v3,00/11] NVMeTCP Offload ULP and QEDN Device Driver

Message ID 20210207181324.11429-1-smalin@marvell.com (mailing list archive)

Message

Shai Malin Feb. 7, 2021, 6:13 p.m. UTC
With the goal of enabling a generic infrastructure that allows NVMe/TCP 
offload devices like NICs to seamlessly plug into the NVMe-oF stack, this 
patch series introduces the nvme-tcp-offload ULP host layer, which will 
be a new transport type called "tcp-offload" and will serve as an 
abstraction layer to work with vendor specific nvme-tcp offload drivers.

NVMeTCP offload is a full offload of the NVMeTCP protocol; this includes
both the TCP level and the NVMeTCP level.

The nvme-tcp-offload transport can co-exist with the existing tcp and 
other transports. The tcp offload was designed so that stack changes are 
kept to a bare minimum: only registering new transports. 
All other APIs, ops etc. are identical to the regular tcp transport.
Representing the TCP offload as a new transport allows clear and manageable
differentiation between the connections which should use the offload path
and those that are not offloaded (even on the same device).


Queue Initialization:
=====================
The nvme-tcp-offload ULP module shall register with the existing 
nvmf_transport_ops (.name = "tcp_offload"), nvme_ctrl_ops and blk_mq_ops.
The nvme-tcp-offload vendor driver shall register to nvme-tcp-offload ULP
with the following ops:
 - claim_dev() - in order to resolve the route to the target according to
                 the net_dev.
 - create_queue() - in order to create offloaded nvme-tcp queue.

The nvme-tcp-offload ULP module shall manage all the controller level
functionalities, call claim_dev and based on the return values shall call
the relevant module create_queue in order to create the admin queue and
the IO queues.
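
The registration and queue-creation flow described above can be sketched in
plain userspace C. This is only an illustration under stated assumptions:
every type and function name below (ofld_dev, ofld_ops, ofld_register_dev,
ofld_connect, the toy_* driver) is hypothetical and is not taken from the
patches, the route lookup is replaced by a simple address-prefix match, and
the queue depths are arbitrary.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified types: NOT the definitions from this series. */
struct ofld_dev;

struct ofld_queue {
	int qid;
	size_t depth;
	struct ofld_dev *dev;
};

/* Ops a vendor driver hands to the ULP at registration time. */
struct ofld_ops {
	const char *name;
	/* Nonzero if this device has a route to the target address. */
	int (*claim_dev)(struct ofld_dev *dev, const char *traddr);
	/* 0 on success, negative errno on failure. */
	int (*create_queue)(struct ofld_dev *dev, struct ofld_queue *q,
			    int qid, size_t depth);
};

struct ofld_dev {
	const struct ofld_ops *ops;
	const char *subnet_prefix;	/* toy stand-in for a route lookup */
};

#define OFLD_MAX_DEVS 4
static struct ofld_dev *ofld_devs[OFLD_MAX_DEVS];
static size_t ofld_nr_devs;

/* Vendor driver -> ULP registration. */
int ofld_register_dev(struct ofld_dev *dev)
{
	assert(dev && dev->ops && dev->ops->claim_dev && dev->ops->create_queue);
	if (ofld_nr_devs == OFLD_MAX_DEVS)
		return -ENOSPC;
	ofld_devs[ofld_nr_devs++] = dev;
	return 0;
}

/*
 * ULP side: ask each registered device to claim the target, then create
 * the admin queue (qid 0) and the IO queues (qid 1..nr_io_queues) on the
 * first device that does. queues[] must hold nr_io_queues + 1 entries.
 */
int ofld_connect(const char *traddr, struct ofld_queue *queues, int nr_io_queues)
{
	for (size_t i = 0; i < ofld_nr_devs; i++) {
		struct ofld_dev *dev = ofld_devs[i];

		if (!dev->ops->claim_dev(dev, traddr))
			continue;
		for (int qid = 0; qid <= nr_io_queues; qid++) {
			size_t depth = qid ? 128 : 32;	/* arbitrary depths */
			int rc = dev->ops->create_queue(dev, &queues[qid], qid, depth);

			if (rc)
				return rc;
		}
		return 0;
	}
	return -ENODEV;	/* no offload device claimed the target */
}

/* Toy vendor driver used to exercise the flow. */
int toy_claim_dev(struct ofld_dev *dev, const char *traddr)
{
	return strncmp(traddr, dev->subnet_prefix, strlen(dev->subnet_prefix)) == 0;
}

int toy_create_queue(struct ofld_dev *dev, struct ofld_queue *q, int qid, size_t depth)
{
	q->qid = qid;
	q->depth = depth;
	q->dev = dev;
	return 0;
}

const struct ofld_ops toy_ops = {
	.name = "toy",
	.claim_dev = toy_claim_dev,
	.create_queue = toy_create_queue,
};
```

In this sketch the ULP owns the loop over devices and queues, and the vendor
driver only answers "can I reach this target?" and "create this queue".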


IO-path:
========
The nvme-tcp-offload shall work at the IO-level - the nvme-tcp-offload 
ULP module shall pass the request (the IO) to the nvme-tcp-offload vendor
driver and later, the nvme-tcp-offload vendor driver returns the request
completion (the IO completion).
No additional handling is needed in between; this design will reduce the
CPU utilization as we will describe below.

The nvme-tcp-offload vendor driver shall register to nvme-tcp-offload ULP
with the following IO-path ops:
 - init_req()
 - map_sg() - in order to map the request sg (similar to
              nvme_rdma_map_data()).
 - send_req() - in order to pass the request to the handling of the
                offload driver that shall pass it to the vendor specific
                device.
 - poll_queue()

Once the IO completes, the nvme-tcp-offload vendor driver shall call 
command.done() that will invoke the nvme-tcp-offload ULP layer to
complete the request.
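
As a rough userspace illustration of this IO path, the sketch below passes a
request through map_sg() and send_req() and completes it from poll_queue()
via a done() callback. All names and struct layouts are hypothetical
stand-ins, not the series' definitions, and the toy driver simply completes
every in-flight request on the next poll.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified types: NOT the definitions from this series. */
struct ofld_queue;

struct ofld_req {
	const void *data;
	size_t len;
	int mapped;
	int completed;
	int status;
	void (*done)(struct ofld_req *req, int status);	/* set by the ULP */
};

struct ofld_io_ops {
	int (*init_req)(struct ofld_req *req);
	int (*map_sg)(struct ofld_queue *q, struct ofld_req *req);
	int (*send_req)(struct ofld_queue *q, struct ofld_req *req);
	int (*poll_queue)(struct ofld_queue *q);
};

#define OFLD_QDEPTH 8
struct ofld_queue {
	const struct ofld_io_ops *ops;
	struct ofld_req *inflight[OFLD_QDEPTH];
	size_t nr_inflight;
};

/* ULP completion handler, invoked by the vendor driver through req->done(). */
void ulp_req_done(struct ofld_req *req, int status)
{
	req->status = status;
	req->completed = 1;
}

/* ULP submission path: map the data, then hand the request to the driver. */
int ulp_queue_rq(struct ofld_queue *q, struct ofld_req *req)
{
	int rc;

	req->done = ulp_req_done;
	rc = q->ops->map_sg(q, req);
	if (rc)
		return rc;
	return q->ops->send_req(q, req);
}

/* Toy vendor driver: the "hardware" completes everything on the next poll. */
int toy_init_req(struct ofld_req *req)
{
	req->mapped = 0;
	req->completed = 0;
	req->status = 0;
	return 0;
}

int toy_map_sg(struct ofld_queue *q, struct ofld_req *req)
{
	(void)q;
	if (!req->data || !req->len)
		return -EINVAL;
	req->mapped = 1;
	return 0;
}

int toy_send_req(struct ofld_queue *q, struct ofld_req *req)
{
	if (!req->mapped)
		return -EINVAL;
	if (q->nr_inflight == OFLD_QDEPTH)
		return -EBUSY;
	q->inflight[q->nr_inflight++] = req;
	return 0;
}

/* Returns the number of requests completed by this poll. */
int toy_poll_queue(struct ofld_queue *q)
{
	int n = 0;

	while (q->nr_inflight) {
		struct ofld_req *req = q->inflight[--q->nr_inflight];

		assert(req->done);
		req->done(req, 0);
		n++;
	}
	return n;
}

const struct ofld_io_ops toy_io_ops = {
	.init_req = toy_init_req,
	.map_sg = toy_map_sg,
	.send_req = toy_send_req,
	.poll_queue = toy_poll_queue,
};
```

The ULP touches the request only at submission and at completion, which is
the property the CPU utilization argument rests on.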


TCP events:
===========
The Marvell FastLinQ NIC HW engine handles all the TCP re-transmissions
and out-of-order (OOO) events.


Teardown and errors:
====================
In case of NVMeTCP queue error the nvme-tcp-offload vendor driver shall
call the nvme_tcp_ofld_report_queue_err.
The nvme-tcp-offload vendor driver shall register to nvme-tcp-offload ULP
with the following teardown ops:
 - drain_queue()
 - destroy_queue()
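
One possible shape for this error and teardown flow is sketched below in
userspace C. The cover letter only names the ops, and the teardown flows are
not part of this RFC yet, so the queue states, the drain-before-destroy
ordering and all identifiers here (including ulp_report_queue_err as a
stand-in for nvme_tcp_ofld_report_queue_err) are assumptions made for
illustration.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical queue states: NOT taken from this series. */
enum ofld_q_state { OFLD_Q_LIVE, OFLD_Q_ERROR, OFLD_Q_DRAINED, OFLD_Q_DESTROYED };

struct ofld_queue;

struct ofld_teardown_ops {
	void (*drain_queue)(struct ofld_queue *q);
	void (*destroy_queue)(struct ofld_queue *q);
};

struct ofld_queue {
	const struct ofld_teardown_ops *ops;
	enum ofld_q_state state;
	int inflight;
};

/* Called by the vendor driver when it detects a queue error. */
void ulp_report_queue_err(struct ofld_queue *q)
{
	if (q->state == OFLD_Q_LIVE)
		q->state = OFLD_Q_ERROR;
}

/* ULP teardown: drain outstanding IO first, then release the queue. */
int ulp_teardown_queue(struct ofld_queue *q)
{
	if (q->state == OFLD_Q_DESTROYED)
		return -EINVAL;
	q->ops->drain_queue(q);
	assert(q->inflight == 0);	/* nothing may be left after a drain */
	q->ops->destroy_queue(q);
	return 0;
}

/* Toy vendor driver. */
void toy_drain_queue(struct ofld_queue *q)
{
	q->inflight = 0;
	q->state = OFLD_Q_DRAINED;
}

void toy_destroy_queue(struct ofld_queue *q)
{
	q->state = OFLD_Q_DESTROYED;
}

const struct ofld_teardown_ops toy_teardown_ops = {
	.drain_queue = toy_drain_queue,
	.destroy_queue = toy_destroy_queue,
};
```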
 

The Marvell FastLinQ NIC HW engine:
===================================
The Marvell NIC HW engine is capable of offloading the entire TCP/IP
stack and managing up to 64K connections per PF. Use cases that are already
implemented and upstream include iWARP (by the Marvell qedr driver) and
iSCSI (by the Marvell qedi driver).
In addition, the Marvell NIC HW engine offloads the NVMeTCP queue layer
and is able to manage the IO level also in case of TCP re-transmissions
and OOO events.
The HW engine enables direct data placement (including the data digest CRC
calculation and validation) and direct data transmission (including data
digest CRC calculation).


The Marvell qedn driver:
========================
The new driver will be added under "drivers/nvme/hw" and will be enabled
by the Kconfig "Marvell NVM Express over Fabrics TCP offload".
As part of the qedn init, the driver will register as a PCI device driver
and will work with the Marvell FastLinQ NIC.
As part of the probe, the driver will register to the nvme_tcp_offload
(ULP) and to the qed module (qed_nvmetcp_ops) - similar to other
"qed_*_ops" which are used by the qede, qedr, qedf and qedi device
drivers.
  
 
The series' patches:
====================
Patches 1-2     Add the nvme-tcp-offload ULP module, including the APIs.
Patches 3-5     nvme-tcp-offload ULP controller level functionalities.
Patch 6         nvme-tcp-offload ULP queue level functionalities.
Patch 7         nvme-tcp-offload ULP IO level functionalities.
Patch 8         nvme-qedn Marvell's NVMeTCP HW offload vendor driver.
Patch 9         net-qed NVMeTCP Offload PF level FW and HW HSI.
Patches 10-11   nvme-qedn probe level functionalities.


Performance:
============
With this implementation on top of the Marvell qedn driver (using the
Marvell FastLinQ NIC), we were able to demonstrate a 3x CPU utilization
improvement for 4K queued read/write IOs and up to 20x for 512K
read/write IOs.
In addition, we were able to demonstrate a latency improvement,
specifically a 99.99th-percentile tail latency improvement of up to 2x-5x
(depending on the queue depth).


Future work:
============
For simplicity, the RFC series does not include the following
functionality, which will be added based on the comments on patches 1-11.
 - nvme-tcp-offload teardown, IO timeout and async flows.
 - qedn device/queue/IO level.
 
 
Long term future work:
======================
 - The nvme-tcp-offload ULP target abstraction layer.
 - The Marvell nvme-tcp-offload "qednt" target driver.


Changes since RFC v1:
=====================
- Fix nvme_tcp_ofld_ops return values.
- Remove NVMF_TRTYPE_TCP_OFFLOAD.
- Add nvme_tcp_ofld_poll() implementation.
- Fix nvme_tcp_ofld_queue_rq() to check map_sg() and send_req() return
  values.


Changes since RFC v2:
=====================
- Add qedn - Marvell's NVMeTCP HW offload vendor driver init and probe
  (patches 8-11).
- Fixes in controller and queue level (patches 3-6).
  

Arie Gershberg (3):
  nvme-fabrics: Move NVMF_ALLOWED_OPTS and NVMF_REQUIRED_OPTS
    definitions
  nvme-tcp-offload: Add controller level implementation
  nvme-tcp-offload: Add controller level error recovery implementation

Dean Balandin (3):
  nvme-tcp-offload: Add device scan implementation
  nvme-tcp-offload: Add queue level implementation
  nvme-tcp-offload: Add IO level implementation

Shai Malin (5):
  nvme-tcp-offload: Add nvme-tcp-offload - NVMeTCP HW offload ULP
  nvme-qedn: Add qedn - Marvell's NVMeTCP HW offload vendor driver
  net-qed: Add NVMeTCP Offload PF Level FW and HW HSI
  nvme-qedn: Add qedn probe
  nvme-qedn: Add IRQ and fast-path resources initializations

 MAINTAINERS                                   |   10 +
 drivers/net/ethernet/qlogic/Kconfig           |    3 +
 drivers/net/ethernet/qlogic/qed/Makefile      |    1 +
 drivers/net/ethernet/qlogic/qed/qed.h         |    3 +
 drivers/net/ethernet/qlogic/qed/qed_hsi.h     |    1 +
 drivers/net/ethernet/qlogic/qed/qed_nvmetcp.c |  269 ++++
 drivers/net/ethernet/qlogic/qed/qed_nvmetcp.h |   48 +
 drivers/net/ethernet/qlogic/qed/qed_sp.h      |    2 +
 drivers/nvme/Kconfig                          |    1 +
 drivers/nvme/Makefile                         |    1 +
 drivers/nvme/host/Kconfig                     |   16 +
 drivers/nvme/host/Makefile                    |    3 +
 drivers/nvme/host/fabrics.c                   |    7 -
 drivers/nvme/host/fabrics.h                   |    7 +
 drivers/nvme/host/tcp-offload.c               | 1123 +++++++++++++++++
 drivers/nvme/host/tcp-offload.h               |  185 +++
 drivers/nvme/hw/Kconfig                       |    9 +
 drivers/nvme/hw/Makefile                      |    3 +
 drivers/nvme/hw/qedn/Makefile                 |    3 +
 drivers/nvme/hw/qedn/qedn.c                   |  652 ++++++++++
 drivers/nvme/hw/qedn/qedn.h                   |   90 ++
 include/linux/qed/common_hsi.h                |    1 +
 include/linux/qed/nvmetcp_common.h            |   47 +
 include/linux/qed/qed_if.h                    |   22 +
 include/linux/qed/qed_nvmetcp_if.h            |   71 ++
 25 files changed, 2571 insertions(+), 7 deletions(-)
 create mode 100644 drivers/net/ethernet/qlogic/qed/qed_nvmetcp.c
 create mode 100644 drivers/net/ethernet/qlogic/qed/qed_nvmetcp.h
 create mode 100644 drivers/nvme/host/tcp-offload.c
 create mode 100644 drivers/nvme/host/tcp-offload.h
 create mode 100644 drivers/nvme/hw/Kconfig
 create mode 100644 drivers/nvme/hw/Makefile
 create mode 100644 drivers/nvme/hw/qedn/Makefile
 create mode 100644 drivers/nvme/hw/qedn/qedn.c
 create mode 100644 drivers/nvme/hw/qedn/qedn.h
 create mode 100644 include/linux/qed/nvmetcp_common.h
 create mode 100644 include/linux/qed/qed_nvmetcp_if.h

Comments

Chris Leech Feb. 12, 2021, 6:06 p.m. UTC | #1
On Sun, Feb 07, 2021 at 08:13:13PM +0200, Shai Malin wrote:
> Queue Initialization:
> =====================
> The nvme-tcp-offload ULP module shall register with the existing 
> nvmf_transport_ops (.name = "tcp_offload"), nvme_ctrl_ops and blk_mq_ops.
> The nvme-tcp-offload vendor driver shall register to nvme-tcp-offload ULP
> with the following ops:
>  - claim_dev() - in order to resolve the route to the target according to
>                  the net_dev.
>  - create_queue() - in order to create offloaded nvme-tcp queue.
> 
> The nvme-tcp-offload ULP module shall manage all the controller level
> functionalities, call claim_dev and based on the return values shall call
> the relevant module create_queue in order to create the admin queue and
> the IO queues.

Hi Shai,

How well does this claim_dev approach work with multipathing?  Is it
expected that providing HOST_TRADDR is sufficient control over which
offload device will be used with multiple valid paths to the controller?

- Chris

Shai Malin Feb. 13, 2021, 4:47 p.m. UTC | #2
On Fri, 12 Feb 2021 at 20:06, Chris Leech wrote:
>
> On Sun, Feb 07, 2021 at 08:13:13PM +0200, Shai Malin wrote:
> > Queue Initialization:
> > =====================
> > The nvme-tcp-offload ULP module shall register with the existing
> > nvmf_transport_ops (.name = "tcp_offload"), nvme_ctrl_ops and blk_mq_ops.
> > The nvme-tcp-offload vendor driver shall register to nvme-tcp-offload ULP
> > with the following ops:
> >  - claim_dev() - in order to resolve the route to the target according to
> >                  the net_dev.
> >  - create_queue() - in order to create offloaded nvme-tcp queue.
> >
> > The nvme-tcp-offload ULP module shall manage all the controller level
> > functionalities, call claim_dev and based on the return values shall call
> > the relevant module create_queue in order to create the admin queue and
> > the IO queues.
>
> Hi Shai,
>
> How well does this claim_dev approach work with multipathing?  Is it
> expected that providing HOST_TRADDR is sufficient control over which
> offload device will be used with multiple valid paths to the controller?
>
> - Chris
>

Hi Chris,

The nvme-tcp-offload multipath behaves the same as the non-offloaded
nvme-tcp. The HOST_TRADDR is sufficient to control which offload device
will be used with multiple valid paths.

- Shai

Shai Malin Feb. 18, 2021, 6:38 p.m. UTC | #3
> 
> With the goal of enabling a generic infrastructure that allows NVMe/TCP
> offload devices like NICs to seamlessly plug into the NVMe-oF stack, this
> patch series introduces the nvme-tcp-offload ULP host layer, which will be a
> new transport type called "tcp-offload" and will serve as an abstraction layer
> to work with vendor specific nvme-tcp offload drivers.
> 
> NVMeTCP offload is a full offload of the NVMeTCP protocol; this includes
> both the TCP level and the NVMeTCP level.
> 
> The nvme-tcp-offload transport can co-exist with the existing tcp and other
> transports. The tcp offload was designed so that stack changes are kept to a
> bare minimum: only registering new transports.
> All other APIs, ops etc. are identical to the regular tcp transport.
> Representing the TCP offload as a new transport allows clear and
> manageable differentiation between the connections which should use the
> offload path and those that are not offloaded (even on the same device).
> 

Sagi, Christoph, Jens, Keith,
So, as there are no more comments / questions, we understand the direction 
is acceptable and will proceed to the full series.

Christoph Hellwig Feb. 19, 2021, 9:12 a.m. UTC | #4
On Thu, Feb 18, 2021 at 06:38:07PM +0000, Shai Malin wrote:
> So, as there are no more comments / questions, we understand the direction 
> is acceptable and will proceed to the full series.

I do not think we should support offloads at all, and certainly not ones
requiring extra drivers.  Those drivers have caused unbelievable pain for
iSCSI and we should not repeat that mistake.

Ariel Elior Feb. 19, 2021, 9:28 p.m. UTC | #5
> On Thu, Feb 18, 2021 at 06:38:07PM +0000, Shai Malin wrote:
> > So, as there are no more comments / questions, we understand the
> > direction is acceptable and will proceed to the full series.
> 
> I do not think we should support offloads at all, and certainly not ones
> requiring extra drivers.  Those drivers have caused unbelievable pain for iSCSI
> and we should not repeat that mistake.

Hi Christoph,

We are fully aware of the challenges the iSCSI offload faced - I was there too
(in bnx2i and qedi). In our mind the heart of that hardship was the iSCSI uio
design, essentially a thin alternative networking stack, which led to no end of
compatibility challenges.

But we were also there for RoCE and iWARP (TCP based) RDMA offloads where a
different approach was used, working with the networking stack instead of around
it. We feel this is a much better approach, and this is what we are attempting
to implement here.

For this reason exactly we designed this offload to be completely seamless.
There is no alternate user stack - we plug directly into the networking
stack and there are zero changes to the regular nvme-tcp.

We are just adding a new transport alongside it, which interacts with the
networking stack when needed, and leaves it alone most of the time. Our
intention is to completely own the maintenance of the new transport, including
any compatibility requirements, and have purposefully designed it to be
streamlined in this aspect.

Protocol offload is at the core of our technology, and our device offloads RoCE,
iWARP, iSCSI and FCoE, all already in upstream drivers (qedr for RoCE and iWARP,
qedi for iSCSI and qedf for FCoE).

We are especially excited about NVMeTCP offload as it brings huge benefits:
RDMA-like latency, tremendous CPU utilization reduction and the reliability of
TCP.

We would be more than happy to incorporate any feedback you may have on the
design, and on how to make it more robust and correct. We are aware of other work
being done in creating special types of offloaded queue, and could model our
design similarly, although our thinking was that this would be more intrusive to
regular nvme over tcp. In our original submission of the RFC we were not adding
a ULP driver, only our own vendor driver, but Sagi pointed us in the direction
of a vendor agnostic ULP layer, which made a lot of sense to us and we think is
a good approach.

Thanks,
Ariel