[v4,00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

Message ID 20190620150337.7847-1-jinpuwang@gmail.com (mailing list archive)
Series InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

Message

Jinpu Wang June 20, 2019, 3:03 p.m. UTC
Hi all,

Here is v4 of the IBNBD/IBTRS patches, which contains only minor changes.

 Changelog
 ---------
v4:
  o Protocol extended to transport IO priorities
  o Support for Mellanox ConnectX-4/X-5
  o Minor sysfs extensions (display access mode on server side)
  o Bug fixes: cleaning up sysfs folders, race on deallocation of resources
  o Style fixes

v3:
  o Sparse fixes:
     - le32 -> le16 conversion
     - incorrect pcpu and RCU declarations
     - sysfs: dynamically allocate the array of sockaddr structures to
       reduce the stack frame size

  o Rename sysfs folder on client and server sides to show source and
    destination addresses of the connection, i.e.:
	   .../<session-name>/paths/<src@dst>/

  o Remove external inclusions from Makefiles.
  * https://lwn.net/Articles/756994/

v2:
  o IBNBD:
     - No legacy request IO mode, only MQ is left.

  o IBTRS:
     - No FMR registration, only FR is left.

  * https://lwn.net/Articles/755075/

v1:
  - IBTRS: load-balancing and IO fail-over using multipath features were added.

  - Major parts of the code were rewritten and simplified, and the overall
    code size was reduced by a quarter.

  * https://lwn.net/Articles/746342/

v0:
  - Initial submission

  * https://lwn.net/Articles/718181/


 Introduction
 -------------

IBTRS (InfiniBand Transport) is a reliable high-speed transport library
which allows establishing connections between client and server
machines via RDMA. It is based on RDMA-CM, so it is expected to also
support RoCE and iWARP, but we have mainly tested it in an InfiniBand
environment. It is optimized to transfer (read/write) IO blocks in the
sense that it follows BIO semantics: it either writes data from a
scatter-gather list to the remote side or requests ("reads") a data
transfer from the remote side into a given set of buffers.

IBTRS is multipath capable and provides I/O fail-over and load-balancing
functionality: in IBTRS terminology, an IBTRS path is a set of RDMA
CMs, and a particular path is selected according to the load-balancing
policy. IBTRS can be used by other components and is not bound to IBNBD.
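
For illustration, a minimal sketch of how a block-layer user could drive
such a transport follows. The names and signatures (struct ibtrs_tag,
ibtrs_clt_get_tag(), ibtrs_clt_request(), ...) are hypothetical
placeholders for this sketch, not the exact public interface submitted
in this series:

/*
 * Illustrative sketch only: the types and prototypes below are
 * hypothetical stand-ins for the IBTRS client interface.
 */
#include <linux/scatterlist.h>
#include <linux/types.h>
#include <linux/gfp.h>
#include <linux/errno.h>

enum ibtrs_ex_dir { IBTRS_EX_READ, IBTRS_EX_WRITE };

struct ibtrs_clt;       /* opaque session handle */
struct ibtrs_tag;       /* permit for one of the session's queue_depth IOs */

/* Hypothetical prototypes standing in for the real public header. */
struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *sess, gfp_t gfp);
void ibtrs_clt_put_tag(struct ibtrs_clt *sess, struct ibtrs_tag *tag);
int ibtrs_clt_request(enum ibtrs_ex_dir dir, struct ibtrs_clt *sess,
                      struct ibtrs_tag *tag, struct scatterlist *sg,
                      unsigned int sg_cnt, size_t len);

static int example_submit_write(struct ibtrs_clt *sess,
                                struct scatterlist *sg,
                                unsigned int sg_cnt, size_t len)
{
        struct ibtrs_tag *tag;
        int err;

        /* Reserve one of the queue_depth in-flight slots of the session. */
        tag = ibtrs_clt_get_tag(sess, GFP_KERNEL);
        if (!tag)
                return -EBUSY;

        /*
         * WRITE: the scatter-gather list is transferred to a buffer the
         * server allocated once at session establishment; the completion
         * callback registered at session open reports the server's result.
         */
        err = ibtrs_clt_request(IBTRS_EX_WRITE, sess, tag, sg, sg_cnt, len);
        if (err)
                ibtrs_clt_put_tag(sess, tag);
        return err;
}

A read goes the other way around: the caller passes the set of buffers
to be filled and completes the IO once the data has arrived.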


IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
(client and server) that allow remote access to a block device on
the server over the IBTRS protocol. After being mapped, the remote block
devices can be accessed on the client side as local block devices.
Internally, IBNBD uses IBTRS as its RDMA transport library.


   - IBNBD/IBTRS was developed in order to map thin-provisioned volumes,
     so the internal protocol is simple.
   - IBTRS was developed as an independent RDMA transport library, which
     supports fail-over and load-balancing policies using multipath, so
     it can be used for IO needs other than block devices.
   - IBNBD/IBTRS is fast.
     Old comparison results:
     https://www.spinics.net/lists/linux-rdma/msg48799.html
     New comparison results: see the performance measurements section below.

Key features of IBTRS transport library and IBNBD block device:

o High throughput and low latency due to:
   - Only two RDMA messages per IO.
   - IMM InfiniBand messages on responses to reduce round-trip latency.
   - Simplified memory management: memory allocation happens once on
     the server side when the IBTRS session is established.

o IO fail-over and load-balancing by using multipath. According to
  our test loads, an additional path brings ~20% more bandwidth (see the
  path-selection sketch after this list).

o Simple configuration of IBNBD:
   - The server side is completely passive: volumes do not need to be
     explicitly exported.
   - Only the IB port GID and the device path are needed on the client
     side to map a block device.
   - A device is remapped automatically, e.g. after a storage reboot.
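
As an illustration of the multipath idea mentioned above, the toy policy
below picks the next connected path in round-robin order. The structures
and the round-robin choice itself are assumptions made for the example
only, not the policy implemented by the patches:

/*
 * Toy example of a load-balancing policy over multiple paths.
 * Everything here is illustrative, not code from this series.
 */
#include <linux/atomic.h>
#include <linux/types.h>

struct ex_path {
        bool connected;         /* false while the path has failed */
};

struct ex_session {
        struct ex_path *paths;
        unsigned int nr_paths;
        atomic_t rr_cursor;     /* shared round-robin cursor */
};

/* Return the next usable path, or NULL if every path is down. */
static struct ex_path *ex_select_path(struct ex_session *sess)
{
        unsigned int start = atomic_inc_return(&sess->rr_cursor);
        unsigned int i;

        for (i = 0; i < sess->nr_paths; i++) {
                struct ex_path *p = &sess->paths[(start + i) % sess->nr_paths];

                if (p->connected)
                        return p;       /* load-balance across healthy paths */
        }
        return NULL;                    /* all paths down: IO must wait or fail */
}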

The kernel commits can be found here:
   https://github.com/ionos-enterprise/ibnbd/tree/linux-5.2-rc3--ibnbd-v4
The out-of-tree modules are here:
   https://github.com/ionos-enterprise/ibnbd

Vault 2017 presentation:
  https://events.static.linuxfound.org/sites/events/files/slides/IBNBD-Vault-2017.pdf

 Performance measurements
 ------------------------

o IBNBD and NVMEoRDMA

  Performance results for the v5.2-rc3 kernel
  link: https://github.com/ionos-enterprise/ibnbd/tree/develop/performance/v4-v5.2-rc3

Roman Pen (25):
  sysfs: export sysfs_remove_file_self()
  ibtrs: public interface header to establish RDMA connections
  ibtrs: private headers with IBTRS protocol structs and helpers
  ibtrs: core: lib functions shared between client and server modules
  ibtrs: client: private header with client structs and functions
  ibtrs: client: main functionality
  ibtrs: client: statistics functions
  ibtrs: client: sysfs interface functions
  ibtrs: server: private header with server structs and functions
  ibtrs: server: main functionality
  ibtrs: server: statistics functions
  ibtrs: server: sysfs interface functions
  ibtrs: include client and server modules into kernel compilation
  ibtrs: a bit of documentation
  ibnbd: private headers with IBNBD protocol structs and helpers
  ibnbd: client: private header with client structs and functions
  ibnbd: client: main functionality
  ibnbd: client: sysfs interface functions
  ibnbd: server: private header with server structs and functions
  ibnbd: server: main functionality
  ibnbd: server: functionality for IO submission to file or block dev
  ibnbd: server: sysfs interface functions
  ibnbd: include client and server modules into kernel compilation
  ibnbd: a bit of documentation
  MAINTAINERS: Add maintainer for IBNBD/IBTRS modules

 MAINTAINERS                                   |   14 +
 drivers/block/Kconfig                         |    2 +
 drivers/block/Makefile                        |    1 +
 drivers/block/ibnbd/Kconfig                   |   24 +
 drivers/block/ibnbd/Makefile                  |   13 +
 drivers/block/ibnbd/README                    |  315 ++
 drivers/block/ibnbd/ibnbd-clt-sysfs.c         |  691 ++++
 drivers/block/ibnbd/ibnbd-clt.c               | 1832 +++++++++++
 drivers/block/ibnbd/ibnbd-clt.h               |  166 +
 drivers/block/ibnbd/ibnbd-log.h               |   59 +
 drivers/block/ibnbd/ibnbd-proto.h             |  378 +++
 drivers/block/ibnbd/ibnbd-srv-dev.c           |  408 +++
 drivers/block/ibnbd/ibnbd-srv-dev.h           |  143 +
 drivers/block/ibnbd/ibnbd-srv-sysfs.c         |  270 ++
 drivers/block/ibnbd/ibnbd-srv.c               |  945 ++++++
 drivers/block/ibnbd/ibnbd-srv.h               |   94 +
 drivers/infiniband/Kconfig                    |    1 +
 drivers/infiniband/ulp/Makefile               |    1 +
 drivers/infiniband/ulp/ibtrs/Kconfig          |   22 +
 drivers/infiniband/ulp/ibtrs/Makefile         |   15 +
 drivers/infiniband/ulp/ibtrs/README           |  385 +++
 .../infiniband/ulp/ibtrs/ibtrs-clt-stats.c    |  447 +++
 .../infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c    |  514 +++
 drivers/infiniband/ulp/ibtrs/ibtrs-clt.c      | 2844 +++++++++++++++++
 drivers/infiniband/ulp/ibtrs/ibtrs-clt.h      |  308 ++
 drivers/infiniband/ulp/ibtrs/ibtrs-log.h      |   84 +
 drivers/infiniband/ulp/ibtrs/ibtrs-pri.h      |  463 +++
 .../infiniband/ulp/ibtrs/ibtrs-srv-stats.c    |  103 +
 .../infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c    |  303 ++
 drivers/infiniband/ulp/ibtrs/ibtrs-srv.c      | 1998 ++++++++++++
 drivers/infiniband/ulp/ibtrs/ibtrs-srv.h      |  170 +
 drivers/infiniband/ulp/ibtrs/ibtrs.c          |  610 ++++
 drivers/infiniband/ulp/ibtrs/ibtrs.h          |  318 ++
 fs/sysfs/file.c                               |    1 +
 34 files changed, 13942 insertions(+)
 create mode 100644 drivers/block/ibnbd/Kconfig
 create mode 100644 drivers/block/ibnbd/Makefile
 create mode 100644 drivers/block/ibnbd/README
 create mode 100644 drivers/block/ibnbd/ibnbd-clt-sysfs.c
 create mode 100644 drivers/block/ibnbd/ibnbd-clt.c
 create mode 100644 drivers/block/ibnbd/ibnbd-clt.h
 create mode 100644 drivers/block/ibnbd/ibnbd-log.h
 create mode 100644 drivers/block/ibnbd/ibnbd-proto.h
 create mode 100644 drivers/block/ibnbd/ibnbd-srv-dev.c
 create mode 100644 drivers/block/ibnbd/ibnbd-srv-dev.h
 create mode 100644 drivers/block/ibnbd/ibnbd-srv-sysfs.c
 create mode 100644 drivers/block/ibnbd/ibnbd-srv.c
 create mode 100644 drivers/block/ibnbd/ibnbd-srv.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/Kconfig
 create mode 100644 drivers/infiniband/ulp/ibtrs/Makefile
 create mode 100644 drivers/infiniband/ulp/ibtrs/README
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-log.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-pri.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs.h

Comments

Danil Kipnis July 9, 2019, 9:55 a.m. UTC | #1
Hello Doug, hello Jason, hello Jens, hello Greg,

Could you please provide some feedback on the IBNBD driver and the
IBTRS library?
So far we have addressed all the requests raised by the community and
continue to keep our code up to date with the upstream kernel,
while having an extra compatibility layer for older kernels in our
out-of-tree repository.
I understand that SRP and NVMEoF, which are in the kernel already, do
provide equivalent functionality for the majority of the use cases.
IBNBD on the other hand is showing higher performance and, more
importantly, includes IBTRS - a general-purpose library to
establish connections and transport BIO-like read/write sg-lists over
RDMA - while SRP targets SCSI and NVMEoF addresses NVMe. While
I believe IBNBD does meet the kernel coding standards, it doesn't have
a lot of users, while SRP and NVMEoF are widely accepted. Do you think
it would make sense for us to rework our patchset and try pushing it
into the staging tree first, so that we can prove that IBNBD is well
maintained and beneficial for the ecosystem, and find a proper location
for it within the block/rdma subsystems? This would make it easier for
people to try it out and would also be a huge step for us in terms of
maintenance effort.
The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
RDMA and is not bound to IB (we will evaluate IBTRS with RoCE in the
near future). Do you think it would make sense to rename the drivers to
RNBD/RTRS?

Thank you,
Best Regards,
Danil

On Thu, Jun 20, 2019 at 5:03 PM Jack Wang <jinpuwang@gmail.com> wrote:
>
> [...]
>
Leon Romanovsky July 9, 2019, 11 a.m. UTC | #2
On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
>
> Could you please provide some feedback to the IBNBD driver and the
> IBTRS library?
> So far we addressed all the requests provided by the community and
> continue to maintain our code up-to-date with the upstream kernel
> while having an extra compatibility layer for older kernels in our
> out-of-tree repository.
> I understand that SRP and NVMEoF which are in the kernel already do
> provide equivalent functionality for the majority of the use cases.
> IBNBD on the other hand is showing higher performance and more
> importantly includes the IBTRS - a general purpose library to
> establish connections and transport BIO-like read/write sg-lists over
> RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While
> I believe IBNBD does meet the kernel coding standards, it doesn't have
> a lot of users, while SRP and NVMEoF are widely accepted. Do you think
> it would make sense for us to rework our patchset and try pushing it
> for staging tree first, so that we can proof IBNBD is well maintained,
> beneficial for the eco-system, find a proper location for it within
> block/rdma subsystems? This would make it easier for people to try it
> out and would also be a huge step for us in terms of maintenance
> effort.
> The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
> RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the
> near future). Do you think it would make sense to rename the driver to
> RNBD/RTRS?

It is better to avoid the "staging" tree, because it will lack the
attention of the relevant people, and your efforts will be lost once you
try to move out of staging. We all remember Lustre and don't want to see
that again.

Back then, you were asked to provide support for the claimed performance
superiority. Can you please share any numbers with us?

Thanks
Greg Kroah-Hartman July 9, 2019, 11:17 a.m. UTC | #3
On Tue, Jul 09, 2019 at 02:00:36PM +0300, Leon Romanovsky wrote:
> On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> >
> > Could you please provide some feedback to the IBNBD driver and the
> > IBTRS library?
> > So far we addressed all the requests provided by the community and
> > continue to maintain our code up-to-date with the upstream kernel
> > while having an extra compatibility layer for older kernels in our
> > out-of-tree repository.
> > I understand that SRP and NVMEoF which are in the kernel already do
> > provide equivalent functionality for the majority of the use cases.
> > IBNBD on the other hand is showing higher performance and more
> > importantly includes the IBTRS - a general purpose library to
> > establish connections and transport BIO-like read/write sg-lists over
> > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While
> > I believe IBNBD does meet the kernel coding standards, it doesn't have
> > a lot of users, while SRP and NVMEoF are widely accepted. Do you think
> > it would make sense for us to rework our patchset and try pushing it
> > for staging tree first, so that we can proof IBNBD is well maintained,
> > beneficial for the eco-system, find a proper location for it within
> > block/rdma subsystems? This would make it easier for people to try it
> > out and would also be a huge step for us in terms of maintenance
> > effort.
> > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
> > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the
> > near future). Do you think it would make sense to rename the driver to
> > RNBD/RTRS?
> 
> It is better to avoid "staging" tree, because it will lack attention of
> relevant people and your efforts will be lost once you will try to move
> out of staging. We are all remembering Lustre and don't want to see it
> again.

That's up to the developers; that had nothing to do with the fact that
the code was in the staging tree.  If the Lustre developers had actually
done the requested work, it would have moved out of the staging tree.

So if these developers are willing to do the work to get something out
of staging, and into the "real" part of the kernel, I will gladly take
it.

But I will note that it is almost always easier to just do the work
ahead of time, and merge it in "correctly" than to go from staging into
the real part of the kernel.  But it's up to the developers what they
want to do.

thanks,

greg k-h
Jinpu Wang July 9, 2019, 11:37 a.m. UTC | #4
Leon Romanovsky <leon@kernel.org> wrote on Tue, Jul 9, 2019 at 1:00 PM:
>
> On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > [...]
>
> It is better to avoid "staging" tree, because it will lack attention of
> relevant people and your efforts will be lost once you will try to move
> out of staging. We are all remembering Lustre and don't want to see it
> again.
>
> Back then, you was asked to provide support for performance superiority.
> Can you please share any numbers with us?
Hi Leon,

Thanks for your feedback.

For performance numbers, Danil did an intensive benchmark and created
some PDFs with graphs here:
https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3

It includes both single-path results and results for different multipath policies.

If you have any questions regarding the results, please let us know.

>
> Thanks

Thanks
Jack Wang
Jinpu Wang July 9, 2019, 11:57 a.m. UTC | #5
Greg KH <gregkh@linuxfoundation.org> wrote on Tue, Jul 9, 2019 at 1:17 PM:
>
> On Tue, Jul 09, 2019 at 02:00:36PM +0300, Leon Romanovsky wrote:
> > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > > [...]
> >
> > It is better to avoid "staging" tree, because it will lack attention of
> > relevant people and your efforts will be lost once you will try to move
> > out of staging. We are all remembering Lustre and don't want to see it
> > again.
>
> That's up to the developers, that had nothing to do with the fact that
> the code was in the staging tree.  If the Lustre developers had actually
> done the requested work, it would have moved out of the staging tree.
>
> So if these developers are willing to do the work to get something out
> of staging, and into the "real" part of the kernel, I will gladly take
> it.
Thanks Greg,

This is encouraging. We ARE willing to do the work to get IBNBD/IBTRS
merged into the upstream kernel. We regularly contribute to the stable
kernel as well as upstream, backport patches, test stable rc releases,
etc. We believe in open source and the power of the community.

Sure, we will try to go with the so-called real part of the kernel; this
is also what we are doing and did in the past, but since v3 we have not
received any real feedback.

We will see how things go.

Thanks again!
Jack Wang @ 1 & 1 IONOS Cloud GmbH


>
> But I will note that it is almost always easier to just do the work
> ahead of time, and merge it in "correctly" than to go from staging into
> the real part of the kernel.  But it's up to the developers what they
> want to do.
>
> thanks,
>
> greg k-h
Jason Gunthorpe July 9, 2019, 12:04 p.m. UTC | #6
On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> 
> Could you please provide some feedback to the IBNBD driver and the
> IBTRS library?

From my perspective you need to get people from the block community to
go over this.

It is the merge window right now, so nobody is really looking at
patches; you may need to resend it after rc1 to get attention.

Jason
Jason Gunthorpe July 9, 2019, 12:06 p.m. UTC | #7
On Tue, Jul 09, 2019 at 01:37:39PM +0200, Jinpu Wang wrote:
> Leon Romanovsky <leon@kernel.org> wrote on Tue, Jul 9, 2019 at 1:00 PM:
> >
> > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > > [...]
> >
> > It is better to avoid "staging" tree, because it will lack attention of
> > relevant people and your efforts will be lost once you will try to move
> > out of staging. We are all remembering Lustre and don't want to see it
> > again.
> >
> > Back then, you was asked to provide support for performance superiority.
> > Can you please share any numbers with us?
> Hi Leon,
> 
> Thanks for you feedback.
> 
> For performance numbers,  Danil did intensive benchmark, and create
> some PDF with graphes here:
> https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3
> 
> It includes both single path results also different multipath policy results.
> 
> If you have any question regarding the results, please let us know.

I kind of recall that last time the perf numbers were skewed toward
IBNBD because the invalidation model for MR was wrong - did this get
fixed?

Jason
Jinpu Wang July 9, 2019, 1:15 p.m. UTC | #8
On Tue, Jul 9, 2019 at 2:06 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
>
> On Tue, Jul 09, 2019 at 01:37:39PM +0200, Jinpu Wang wrote:
> > > Leon Romanovsky <leon@kernel.org> wrote on Tue, Jul 9, 2019 at 1:00 PM:
> > >
> > > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > > > [...]
> > >
> > > It is better to avoid "staging" tree, because it will lack attention of
> > > relevant people and your efforts will be lost once you will try to move
> > > out of staging. We are all remembering Lustre and don't want to see it
> > > again.
> > >
> > > Back then, you was asked to provide support for performance superiority.
> > > Can you please share any numbers with us?
> > Hi Leon,
> >
> > Thanks for you feedback.
> >
> > For performance numbers,  Danil did intensive benchmark, and create
> > some PDF with graphes here:
> > https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3
> >
> > It includes both single path results also different multipath policy results.
> >
> > If you have any question regarding the results, please let us know.
>
> I kind of recall that last time the perf numbers were skewed toward
> IBNBD because the invalidation model for MR was wrong - did this get
> fixed?
>
> Jason

Thanks Jason for the feedback.
Can you be more specific about "the invalidation model for MR was wrong"?

I checked the history of the email thread and only found:
"I think from the RDMA side, before we accept something like this, I'd
like to hear from Christoph, Chuck or Sagi that the dataplane
implementation of this is correct, eg it uses the MRs properly and
invalidates at the right time, sequences with dma_ops as required,
etc.
"
And no reply from any of you since then.

Thanks,
Jack
Jason Gunthorpe July 9, 2019, 1:19 p.m. UTC | #9
On Tue, Jul 09, 2019 at 03:15:46PM +0200, Jinpu Wang wrote:
> On Tue, Jul 9, 2019 at 2:06 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
> >
> > On Tue, Jul 09, 2019 at 01:37:39PM +0200, Jinpu Wang wrote:
> > > Leon Romanovsky <leon@kernel.org> wrote on Tue, Jul 9, 2019 at 1:00 PM:
> > > >
> > > > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > > > > [...]
> > > >
> > > > It is better to avoid "staging" tree, because it will lack attention of
> > > > relevant people and your efforts will be lost once you will try to move
> > > > out of staging. We are all remembering Lustre and don't want to see it
> > > > again.
> > > >
> > > > Back then, you was asked to provide support for performance superiority.
> > > > Can you please share any numbers with us?
> > > Hi Leon,
> > >
> > > Thanks for you feedback.
> > >
> > > For performance numbers,  Danil did intensive benchmark, and create
> > > some PDF with graphes here:
> > > https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3
> > >
> > > It includes both single path results also different multipath policy results.
> > >
> > > If you have any question regarding the results, please let us know.
> >
> > I kind of recall that last time the perf numbers were skewed toward
> > IBNBD because the invalidation model for MR was wrong - did this get
> > fixed?
> >
> > Jason
> 
> Thanks Jason for feedback.
> Can you be  more specific about  "the invalidation model for MR was wrong"

MRs must be invalidated before data is handed over to the block
layer. The driver can't leave MRs open for access and then touch the
memory the MR covers.

IMHO this is the most likely explanation for any performance difference
from nvme.

> I checked in the history of the email thread, only found
> "I think from the RDMA side, before we accept something like this, I'd
> like to hear from Christoph, Chuck or Sagi that the dataplane
> implementation of this is correct, eg it uses the MRs properly and
> invalidates at the right time, sequences with dma_ops as required,
> etc.
> "
> And no reply from any of you since then.

This task still needs to happen..

Jason
Leon Romanovsky July 9, 2019, 1:32 p.m. UTC | #10
On Tue, Jul 09, 2019 at 01:17:37PM +0200, Greg KH wrote:
> On Tue, Jul 09, 2019 at 02:00:36PM +0300, Leon Romanovsky wrote:
> > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > > [...]
> >
> > It is better to avoid "staging" tree, because it will lack attention of
> > relevant people and your efforts will be lost once you will try to move
> > out of staging. We are all remembering Lustre and don't want to see it
> > again.
>
> That's up to the developers, that had nothing to do with the fact that
> the code was in the staging tree.  If the Lustre developers had actually
> done the requested work, it would have moved out of the staging tree.
>
> So if these developers are willing to do the work to get something out
> of staging, and into the "real" part of the kernel, I will gladly take
> it.

Greg,

It is not a matter of how much *real* work the developers will do, but a
matter of guidance to do the *right* thing, which is hard to achieve if
the people mentioned at the beginning of this thread won't look at
staging code.

>
> But I will note that it is almost always easier to just do the work
> ahead of time, and merge it in "correctly" than to go from staging into
> the real part of the kernel.  But it's up to the developers what they
> want to do.
>
> thanks,
>
> greg k-h
Jinpu Wang July 9, 2019, 2:17 p.m. UTC | #11
On Tue, Jul 9, 2019 at 3:19 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
>
> On Tue, Jul 09, 2019 at 03:15:46PM +0200, Jinpu Wang wrote:
> > On Tue, Jul 9, 2019 at 2:06 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
> > >
> > > On Tue, Jul 09, 2019 at 01:37:39PM +0200, Jinpu Wang wrote:
> > > > Leon Romanovsky <leon@kernel.org> wrote on Tue, Jul 9, 2019 at 1:00 PM:
> > > > >
> > > > > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > > > > > [...]
> > > > >
> > > > > It is better to avoid "staging" tree, because it will lack attention of
> > > > > relevant people and your efforts will be lost once you will try to move
> > > > > out of staging. We are all remembering Lustre and don't want to see it
> > > > > again.
> > > > >
> > > > > Back then, you was asked to provide support for performance superiority.
> > > > > Can you please share any numbers with us?
> > > > Hi Leon,
> > > >
> > > > Thanks for you feedback.
> > > >
> > > > For performance numbers,  Danil did intensive benchmark, and create
> > > > some PDF with graphes here:
> > > > https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3
> > > >
> > > > It includes both single path results also different multipath policy results.
> > > >
> > > > If you have any question regarding the results, please let us know.
> > >
> > > I kind of recall that last time the perf numbers were skewed toward
> > > IBNBD because the invalidation model for MR was wrong - did this get
> > > fixed?
> > >
> > > Jason
> >
> > Thanks Jason for feedback.
> > Can you be  more specific about  "the invalidation model for MR was wrong"
>
> MR's must be invalidated before data is handed over to the block
> layer. It can't leave MRs open for access and then touch the memory
> the MR covers.
>
> IMHO this is the most likely explanation for any performance difference
> from nvme..
>
> > I checked in the history of the email thread, only found
> > "I think from the RDMA side, before we accept something like this, I'd
> > like to hear from Christoph, Chuck or Sagi that the dataplane
> > implementation of this is correct, eg it uses the MRs properly and
> > invalidates at the right time, sequences with dma_ops as required,
> > etc.
> > "
> > And no reply from any of you since then.
>
> This task still needs to happen..
>
> Jason

We did extensive testing and cross-checked how iSER and NVMEoF do MR
invalidation, and didn't find a problem.

+ Chuck
It would be appreciated if Christoph, Chuck, Sagi or Bart could take a
look; thank you in advance.

Thanks
Jack
Bart Van Assche July 9, 2019, 3:39 p.m. UTC | #12
On 7/9/19 4:17 AM, Greg KH wrote:
> So if these developers are willing to do the work to get something out
> of staging, and into the "real" part of the kernel, I will gladly take
> it.

Linus once famously said "given enough eyeballs, all bugs are shallow".
There are already two block-over-RDMA driver pairs upstream (NVMeOF and
SRP). Accepting the IBTRS and IBNBD drivers upstream would reduce the
number of users of the upstream block-over-RDMA drivers and hence would
fragment the block-over-RDMA driver user base further. Additionally, I'm
not yet convinced that the interesting parts of IBNBD cannot be
integrated into the existing upstream drivers. So it's not clear to me
whether taking the IBTRS and IBNBD drivers upstream would help the Linux
user community.

Bart.
Sagi Grimberg July 9, 2019, 7:45 p.m. UTC | #13
Hi Danil and Jack,

> Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> 
> Could you please provide some feedback to the IBNBD driver and the
> IBTRS library?
> So far we addressed all the requests provided by the community

That is not exactly correct AFAIR,

My main issues, which were raised before, are:
- IMO there isn't any justification for this ibtrs layering separation
   given that the only user of this is your ibnbd. Unless you are
   trying to submit another consumer, you should avoid adding another
   subsystem that is not really general purpose.

- ibtrs in general uses almost no infrastructure from the existing
   kernel subsystems. Examples are:
   - the tag allocation mechanism (and I'm not clear why it's needed)
   - an rdma rw abstraction similar to what we have in the core
   - list_next_or_null_rr_rcu ??
   - a few other examples sprinkled around...

Another question: from what I understand from the code, the client
always rdma_writes data on writes (with imm) into buffers from a remote
pool of server buffers dedicated to it. Essentially all writes are
immediate (no rdma reads ever). How is that different from using send
wrs to a set of pre-posted recv buffers (like everyone else is doing)?
Is it faster?

Also, given that the server pre-allocates a substantial amount of memory
for each connection, are the requirements on the server side documented?
Usually kernel implementations (especially upstream ones) will avoid
imposing such large long-standing memory requirements on the system by
default. I don't have a firm stance on this, but wanted to highlight
this as you are sending this for upstream inclusion.

  and
> continue to maintain our code up-to-date with the upstream kernel
> while having an extra compatibility layer for older kernels in our
> out-of-tree repository.

Overall, while I absolutely support your cause of lowering your
maintenance overhead by having this sit upstream, I don't see how this
can be helpful to anyone else in the rdma community. If instead you can
crystallize why/how ibnbd is faster than anything else, and perhaps
contribute a common infrastructure piece (or enhance an existing one)
that other existing ulps can leverage, it would be a lot more
compelling to include it upstream.

> I understand that SRP and NVMEoF which are in the kernel already do
> provide equivalent functionality for the majority of the use cases.
> IBNBD on the other hand is showing higher performance and more
> importantly includes the IBTRS - a general purpose library to
> establish connections and transport BIO-like read/write sg-lists over
> RDMA,

But who needs it? Can other ulps use it or pieces of it? I keep failing
to understand why this is a benefit if it's specific to your ibnbd.
Sagi Grimberg July 9, 2019, 9:27 p.m. UTC | #14
>> Thanks Jason for feedback.
>> Can you be  more specific about  "the invalidation model for MR was wrong"
> 
> MR's must be invalidated before data is handed over to the block
> layer. It can't leave MRs open for access and then touch the memory
> the MR covers.

Jason is referring to these fixes:
2f122e4f5107 ("nvme-rdma: wait for local invalidation before completing 
a request")
4af7f7ff92a4 ("nvme-rdma: don't complete requests before a send work 
request has completed")
b4b591c87f2b ("nvme-rdma: don't suppress send completions")
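
For context, the pattern those fixes enforce looks roughly like the
sketch below: post a *signaled* local invalidation for the MR and only
hand the request back to the block layer once both the response and the
invalidation have completed. The request structure and refcounting are
simplified illustrations, not the actual nvme-rdma code:

#include <rdma/ib_verbs.h>
#include <linux/atomic.h>
#include <linux/kernel.h>

struct ex_request {
        struct ib_mr *mr;
        struct ib_cqe inv_cqe;
        atomic_t pending;                       /* response + invalidation */
        void (*complete)(struct ex_request *req);
};

static void ex_maybe_complete(struct ex_request *req)
{
        /* Hand data to the block layer only once the MR is invalid. */
        if (atomic_dec_and_test(&req->pending))
                req->complete(req);
}

static void ex_inv_done(struct ib_cq *cq, struct ib_wc *wc)
{
        struct ex_request *req =
                container_of(wc->wr_cqe, struct ex_request, inv_cqe);

        ex_maybe_complete(req);
}

/* Post a signaled local invalidation of the request's MR. */
static int ex_invalidate_mr(struct ib_qp *qp, struct ex_request *req)
{
        struct ib_send_wr wr = {};

        req->inv_cqe.done = ex_inv_done;
        wr.wr_cqe = &req->inv_cqe;
        wr.opcode = IB_WR_LOCAL_INV;
        wr.send_flags = IB_SEND_SIGNALED;
        wr.ex.invalidate_rkey = req->mr->rkey;

        return ib_post_send(qp, &wr, NULL);
}

/* Called when the server's response for this request arrives. */
static void ex_response_done(struct ib_qp *qp, struct ex_request *req)
{
        atomic_set(&req->pending, 2);           /* invalidation + response */
        if (ex_invalidate_mr(qp, req))
                atomic_dec(&req->pending);      /* post failed: don't wait for it */
        ex_maybe_complete(req);                 /* drop the response reference */
}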
Jason Gunthorpe July 10, 2019, 1:55 p.m. UTC | #15
On Tue, Jul 09, 2019 at 12:45:57PM -0700, Sagi Grimberg wrote:

> Another question, from what I understand from the code, the client
> always rdma_writes data on writes (with imm) from a remote pool of
> server buffers dedicated to it. Essentially all writes are immediate (no
> rdma reads ever). How is that different than using send wrs to a set of
> pre-posted recv buffers (like all others are doing)? Is it faster?

RDMA WRITE only is generally a bit faster, and if you use a buffer
pool in a smart way it is possible to get very good data packing. With
SEND the number of recvq entries dictates how big the rx buffer can
be, or you waste even more memory by using partial send buffers..

A scheme like this seems like a high-performance idea, but on the
other hand, I have no idea how you could possibly manage invalidations
efficiently with a shared RX buffer pool...

The RXer has to push out an invalidation for the shared buffer pool
MR, but we don't have protocols for partial MR invalidation.

Which is back to my earlier thought that the main reason this performs
better is that it doesn't have synchronous MR invalidation.

Maybe this is fine, but it needs to be made very clear that it uses
this insecure operating model to get higher performance..

Jason
Danil Kipnis July 10, 2019, 2:55 p.m. UTC | #16
Hi Leon,

thanks for the feedback!

On Tue, Jul 9, 2019 at 1:00 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > [...]
>
> It is better to avoid "staging" tree, because it will lack attention of
> relevant people and your efforts will be lost once you will try to move
> out of staging. We are all remembering Lustre and don't want to see it
> again.
>
> Back then, you was asked to provide support for performance superiority.

I have only theories of why ibnbd is showing better numbers than nvmeof:
1. The way we utilize the MQ framework in IBNBD. We promise to have
queue_depth (say 512) requests on each of the num_cpus hardware queues
of each device, but in fact we have only queue_depth for the whole
"session" toward a given server. The moment we have queue_depth
inflights we need to stop the queue (on whichever device and cpu the
next request arrives on) and start it again after some requests have
completed. We maintain per-cpu lists of stopped HW queues and a bitmap
showing which lists are non-empty, and wake them up in a round-robin
fashion to avoid starving any device (a rough sketch of this
bookkeeping follows below, after point 2).
2. We only do rdma writes with imm. The server reserves queue_depth
buffers of max_io_size each for a given client; the client manages
those itself. The client uses the imm field to tell the server which
buffer has been written (and where), and the server uses the imm field
to send back the errno (see the encoding sketch below). If our
max_io_size is 64K, queue_depth is 512 and the client only issues 4K
IOs all the time, then 60K * 512 (about 30M) of memory is wasted. On
the other hand we do no buffer allocation/registration in the io path
on the server side. The server sends rdma addresses and keys for those
preregistered buffers on connection establishment and
deallocates/unregisters them when a session is closed. That's for
writes. For reads, the client registers the user buffers (with FR) and
sends the addresses and keys to the server (with an rdma write with
imm). The server rdma writes into those buffers. The client does the
unregistering/invalidation and completes the request.
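
To illustrate point 1, a rough sketch of the stopped-queue bookkeeping
(illustrative only, not the actual IBNBD code; all structure and
function names below are made up, and the enqueue/allocation side is
omitted):

#include <linux/blk-mq.h>
#include <linux/cpumask.h>
#include <linux/list.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>

/* one entry per HW queue that had to be stopped */
struct stopped_queue {
        struct list_head        entry;
        struct blk_mq_hw_ctx    *hctx;
};

struct session_rr {
        struct list_head __percpu *stopped; /* per-cpu lists of stopped queues */
        cpumask_t               pending;    /* cpus whose list is non-empty */
        spinlock_t              lock;
        int                     last_cpu;   /* round-robin cursor */
};

/* called once some requests completed and session budget is free again */
static void rerun_one_stopped_queue(struct session_rr *rr)
{
        struct stopped_queue *sq = NULL;
        struct list_head *head;
        int cpu;

        spin_lock(&rr->lock);
        cpu = cpumask_next(rr->last_cpu, &rr->pending);
        if (cpu >= nr_cpu_ids)
                cpu = cpumask_first(&rr->pending);      /* wrap around */
        if (cpu < nr_cpu_ids) {
                rr->last_cpu = cpu;
                head = per_cpu_ptr(rr->stopped, cpu);
                sq = list_first_entry(head, struct stopped_queue, entry);
                list_del_init(&sq->entry);
                if (list_empty(head))
                        cpumask_clear_cpu(cpu, &rr->pending);
        }
        spin_unlock(&rr->lock);

        if (sq)
                blk_mq_run_hw_queue(sq->hctx, true);    /* restart it */
}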
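
And for point 2, a sketch of how a 32-bit imm field can carry both the
buffer id and the errno (just to show the idea; the exact bit layout
used by IBTRS may differ):

#include <linux/types.h>

#define IMM_ID_BITS     16
#define IMM_ID_MASK     ((1U << IMM_ID_BITS) - 1)

/* server -> client response: which buffer (msg_id) and the result */
static inline u32 io_rsp_to_imm(u32 msg_id, int errno)
{
        return ((u32)(-errno) << IMM_ID_BITS) | (msg_id & IMM_ID_MASK);
}

static inline void imm_to_io_rsp(u32 imm, u32 *msg_id, int *errno)
{
        *msg_id = imm & IMM_ID_MASK;
        *errno  = -(int)(imm >> IMM_ID_BITS);
}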

> Can you please share any numbers with us?
Apart from github
(https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3)
the performance results for v5.2-rc3 on two different systems can be
accessed under dcd.ionos.com/ibnbd-performance-report. The page allows
you to filter the test scenarios of interest for comparison.

>
> Thanks
Sagi Grimberg July 10, 2019, 4:25 p.m. UTC | #17
>> Another question, from what I understand from the code, the client
>> always rdma_writes data on writes (with imm) from a remote pool of
>> server buffers dedicated to it. Essentially all writes are immediate (no
>> rdma reads ever). How is that different than using send wrs to a set of
>> pre-posted recv buffers (like all others are doing)? Is it faster?
> 
> RDMA WRITE only is generally a bit faster, and if you use a buffer
> pool in a smart way it is possible to get very good data packing.

There is no packing, it's used exactly like send/recv, but with a remote
buffer pool (pool of 512K buffers) and the client selects one and rdma
write with imm to it.

> With
> SEND the number of recvq entries dictates how big the rx buffer can
> be, or you waste even more memory by using partial send buffers..

This is exactly how it is used here.

> A scheme like this seems like a high performance idea, but on the
> other side, I have no idea how you could possibly manage invalidations
> efficiently with a shared RX buffer pool...

There are no invalidations, this remote server pool is registered once
and long lived with the session.

> The RXer has to push out an invalidation for the shared buffer pool
> MR, but we don't have protocols for partial MR invalidation.
> 
> Which is back to my earlier thought that the main reason this performs
> better is because it doesn't have synchronous MR invalidation.

This issue only exists on the client side. The server never
invalidates any of its buffers.

> Maybe this is fine, but it needs to be made very clear that it uses
> this insecure operating model to get higher performance..

I still do not understand why this should give any notice-able
performance advantage.
Jason Gunthorpe July 10, 2019, 5:25 p.m. UTC | #18
On Wed, Jul 10, 2019 at 09:25:05AM -0700, Sagi Grimberg wrote:
> 
> > > Another question, from what I understand from the code, the client
> > > always rdma_writes data on writes (with imm) from a remote pool of
> > > server buffers dedicated to it. Essentially all writes are immediate (no
> > > rdma reads ever). How is that different than using send wrs to a set of
> > > pre-posted recv buffers (like all others are doing)? Is it faster?
> > 
> > RDMA WRITE only is generally a bit faster, and if you use a buffer
> > pool in a smart way it is possible to get very good data packing.
> 
> There is no packing, its used exactly as send/recv, but with a remote
> buffer pool (pool of 512K buffers) and the client selects one and rdma
> write with imm to it.

Well that makes little sense then:)

> > Maybe this is fine, but it needs to be made very clear that it uses
> > this insecure operating model to get higher performance..
> 
> I still do not understand why this should give any notice-able
> performance advantage.

Usually omitting invalidations gives a healthy bump.

Also, RDMA WRITE is generally faster than READ at the HW level in
various ways.

Jason
Sagi Grimberg July 10, 2019, 7:11 p.m. UTC | #19
>> I still do not understand why this should give any notice-able
>> performance advantage.
> 
> Usually omitting invalidations gives a healthy bump.
> 
> Also, RDMA WRITE is generally faster than READ at the HW level in
> various ways.

Yes, but this should be essentially identical to running nvme-rdma
with 512KB of immediate-data (the nvme term is in-capsule data).

In the upstream nvme target we have inline_data_size port attribute
that is tunable for that (defaults to PAGE_SIZE).
Danil Kipnis July 11, 2019, 7:27 a.m. UTC | #20
Hi Sagi,

thanks a lot for the analysis. I didn't know about the
inline_data_size parameter in nvmet. It is at PAGE_SIZE on our
systems.
We will rerun our benchmarks with
echo 2097152 > /sys/kernel/config/nvmet/ports/1/param_inline_data_size
echo 2097152 > /sys/kernel/config/nvmet/ports/2/param_inline_data_size
before enabling the port.
Best
Danil.

On Wed, Jul 10, 2019 at 9:11 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> >> I still do not understand why this should give any notice-able
> >> performance advantage.
> >
> > Usually omitting invalidations gives a healthy bump.
> >
> > Also, RDMA WRITE is generally faster than READ at the HW level in
> > various ways.
>
> Yes, but this should be essentially identical to running nvme-rdma
> with 512KB of immediate-data (the nvme term is in-capsule data).
>
> In the upstream nvme target we have inline_data_size port attribute
> that is tunable for that (defaults to PAGE_SIZE).
Danil Kipnis July 11, 2019, 8:54 a.m. UTC | #21
Hi Sagi,

thanks a lot for the detailed reply. Answers inline below:

On Tue, Jul 9, 2019 at 9:46 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> Hi Danil and Jack,
>
> > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> >
> > Could you please provide some feedback to the IBNBD driver and the
> > IBTRS library?
> > So far we addressed all the requests provided by the community
>
> That is not exactly correct AFAIR,
>
> My main issues which were raised before are:
> - IMO there isn't any justification to this ibtrs layering separation
>    given that the only user of this is your ibnbd. Unless you are
>    trying to submit another consumer, you should avoid adding another
>    subsystem that is not really general purpose.
We designed ibtrs not only with the IBNBD in mind but also as the
transport layer for a distributed SDS. We'd like to be able to do what
ceph is capable of (automatic up/down scaling of the storage cluster,
automatic recovery) but using in-kernel rdma-based IO transport
drivers, thin-provisioned volume managers, etc. to keep the highest
possible performance. That modest plan of ours should, among other
things, cover the following:
When using IBNBD/SRP/NVMEoF to export devices (say, thin-provisioned
volumes) from server to client and building an (md-)raid on top of the
imported devices on client side in order to provide for redundancy
across different machines, one gets very decent throughput and low
latency, since the IOs are sent in parallel to the storage machines.
One downside of this setup is that the resync traffic has to flow
over the client, where the md-raid is sitting. Ideally the resync
traffic should flow directly between the two "legs" (storage machines)
of the raid. The server side of such a "distributed raid" capable of
this direct syncing between the array members would necessarily
require some logic on the server side and hence could also sit on
top of ibtrs. (To continue the analogy, the "server" side of an
md-raid built on top of, say, two NVMEoF devices is just two block
devices, which cannot communicate with each other.)
All in all, ibtrs is a library to establish a "fat", multipath,
auto-reconnectable connection between two hosts on top of rdma,
optimized for the transport of IO traffic.

> - ibtrs in general is using almost no infrastructure from the existing
>    kernel subsystems. Examples are:
>    - tag allocation mechanism (which I'm not clear why its needed)
As you correctly noticed our client manages the buffers allocated and
registered by the server on the connection establishment. Our tags are
just a mechanism to take and release those buffers for incoming
requests on client side. Since the buffers allocated by the server are
to be shared between all the devices mapped from that server and all
their HW queues (each having num_cpus of them) the mechanism behind
get_tag/put_tag also takes care of the fairness.

>    - rdma rw abstraction similar to what we have in the core
On the one hand we have only single IO related function:
ibtrs_clt_request(READ/WRITE, session,...), which executes rdma write
with imm, or requests an rdma write with imm to be executed by the
server. On the other hand we provide an abstraction to establish and
manage what we call "session", which consist of multiple paths (to do
failover and multipath with different policies), where each path
consists of num_cpu rdma connections. Once you established a session
you can add or remove paths from it on the fly. In case the connection
to server is lost, the client does periodic attempts to reconnect
automatically. On the server side you get just sg-lists with a
direction READ or WRITE as requested by the client. We designed this
interface not only as the minimum required to build a block device on
top of rdma but also with a distributed raid in mind.

>    - list_next_or_null_rr_rcu ??
We use that for multipath. The macro (and more importantly the way we
use it) has been reviewed by Linus and quite closely by Paul E.
McKenney. AFAIR the conclusion was that Roman's implementation is
correct, but too tricky to use correctly to be included in the
kernel as a public interface. See https://lkml.org/lkml/2018/5/18/659

>    - few other examples sprinkled around..
To the best of my knowledge we addressed everything we got comments on
and will definitely continue to do so in the future.

> Another question, from what I understand from the code, the client
> always rdma_writes data on writes (with imm) from a remote pool of
> server buffers dedicated to it. Essentially all writes are immediate (no
> rdma reads ever). How is that different than using send wrs to a set of
> pre-posted recv buffers (like all others are doing)? Is it faster?
At the very beginning of the project we did some measurements and saw,
that it is faster. I'm not sure if this is still true, since the
hardware, the drivers and the rdma subsystem have changed since then.
It also seemed to make the code simpler.

> Also, given that the server pre-allocate a substantial amount of memory
> for each connection, is it documented the requirements from the server
> side? Usually kernel implementations (especially upstream ones) will
> avoid imposing such large longstanding memory requirements on the system
> by default. I don't have a firm stand on this, but wanted to highlight
> this as you are sending this for upstream inclusion.
We definitely need to stress that somewhere. We will include it in the
readme and add it to the cover letter next time. Our memory management
is indeed basically absent in favor of performance: the server reserves
queue_depth buffers of, say, 512K each. Each buffer is used by the
client for a single IO only, no matter how big the request is. So if
the client only issues 4K IOs, we waste 508K * queue_depth of memory.
We were aiming for the lowest possible latency from the beginning. It
is probably possible to implement some clever allocator on the server
side which wouldn't affect the performance a lot.
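
To put concrete numbers on that example: 512 (queue_depth) * 512K means
256M pinned on the server per client session; if the client only ever
issues 4K IOs, just 512 * 4K = 2M of that carries data at any point in
time, i.e. roughly 254M (508K * 512) sit idle.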

>
>   and
> > continue to maintain our code up-to-date with the upstream kernel
> > while having an extra compatibility layer for older kernels in our
> > out-of-tree repository.
>
> Overall, while I absolutely support your cause to lower your maintenance
> overhead by having this sit upstream, I don't see why this can be
> helpful to anyone else in the rdma community. If instead you can
> crystallize why/how ibnbd is faster than anything else, and perhaps
> contribute a common infrastructure piece (or enhance an existing one)
> such that other existing ulps can leverage, it will be a lot more
> compelling to include it upstream.
>
> > I understand that SRP and NVMEoF which are in the kernel already do
> > provide equivalent functionality for the majority of the use cases.
> > IBNBD on the other hand is showing higher performance and more
> > importantly includes the IBTRS - a general purpose library to
> > establish connections and transport BIO-like read/write sg-lists over
> > RDMA,
>
> But who needs it? Can other ulps use it or pieces of it? I keep failing
> to understand why is this a benefit if its specific to your ibnbd?
See above, and please ask if you have more questions on this.

Thank you,
Danil.
Sagi Grimberg July 12, 2019, 12:22 a.m. UTC | #22
>> My main issues which were raised before are:
>> - IMO there isn't any justification to this ibtrs layering separation
>>     given that the only user of this is your ibnbd. Unless you are
>>     trying to submit another consumer, you should avoid adding another
>>     subsystem that is not really general purpose.
> We designed ibtrs not only with the IBNBD in mind but also as the
> transport layer for a distributed SDS. We'd like to be able to do what
> ceph is capable of (automatic up/down scaling of the storage cluster,
> automatic recovery) but using in-kernel rdma-based IO transport
> drivers, thin-provisioned volume managers, etc. to keep the highest
> possible performance.

Sounds lovely, but still very much bound to your ibnbd. And that part
is not included in the patch set, so I still don't see why this should
be considered a "generic" transport subsystem (it clearly isn't).

> All in all itbrs is a library to establish a "fat", multipath,
> autoreconnectable connection between two hosts on top of rdma,
> optimized for transport of IO traffic.

That is also dictating a wire-protocol which makes it useless to pretty
much any other consumer. Personally, I don't see how this library
would ever be used outside of your ibnbd.

>> - ibtrs in general is using almost no infrastructure from the existing
>>     kernel subsystems. Examples are:
>>     - tag allocation mechanism (which I'm not clear why its needed)
> As you correctly noticed our client manages the buffers allocated and
> registered by the server on the connection establishment. Our tags are
> just a mechanism to take and release those buffers for incoming
> requests on client side. Since the buffers allocated by the server are
> to be shared between all the devices mapped from that server and all
> their HW queues (each having num_cpus of them) the mechanism behind
> get_tag/put_tag also takes care of the fairness.

We have infrastructure for this, sbitmaps.
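
For reference, a minimal sketch of what the get_tag/put_tag slot
accounting could look like on top of sbitmap_queue (illustrative only,
not a drop-in replacement):

#include <linux/gfp.h>
#include <linux/numa.h>
#include <linux/sbitmap.h>
#include <linux/smp.h>

struct buf_pool {
        struct sbitmap_queue tags;      /* one bit per shared server buffer */
};

static int buf_pool_init(struct buf_pool *p, unsigned int queue_depth)
{
        /* round_robin spreads slots across users, similar to the fairness goal */
        return sbitmap_queue_init_node(&p->tags, queue_depth, -1, true,
                                       GFP_KERNEL, NUMA_NO_NODE);
}

static int buf_pool_get(struct buf_pool *p)
{
        return __sbitmap_queue_get(&p->tags);   /* returns -1 if exhausted */
}

static void buf_pool_put(struct buf_pool *p, unsigned int tag)
{
        sbitmap_queue_clear(&p->tags, tag, raw_smp_processor_id());
}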

>>     - rdma rw abstraction similar to what we have in the core
> On the one hand we have only single IO related function:
> ibtrs_clt_request(READ/WRITE, session,...), which executes rdma write
> with imm, or requests an rdma write with imm to be executed by the
> server.

For sure you can enhance the rw API to have imm support?

> On the other hand we provide an abstraction to establish and
> manage what we call "session", which consist of multiple paths (to do
> failover and multipath with different policies), where each path
> consists of num_cpu rdma connections.

That's fine, but it doesn't mean that it also needs to re-write
infrastructure that we already have.

> Once you established a session
> you can add or remove paths from it on the fly. In case the connection
> to server is lost, the client does periodic attempts to reconnect
> automatically. On the server side you get just sg-lists with a
> direction READ or WRITE as requested by the client. We designed this
> interface not only as the minimum required to build a block device on
> top of rdma but also with a distributed raid in mind.

I suggest you take a look at the rw API and use that in your transport.

>> Another question, from what I understand from the code, the client
>> always rdma_writes data on writes (with imm) from a remote pool of
>> server buffers dedicated to it. Essentially all writes are immediate (no
>> rdma reads ever). How is that different than using send wrs to a set of
>> pre-posted recv buffers (like all others are doing)? Is it faster?
> At the very beginning of the project we did some measurements and saw,
> that it is faster. I'm not sure if this is still true

Its not significantly faster (can't imagine why it would be).
What could make a difference is probably the fact that you never
do rdma reads for I/O writes which might be better. Also perhaps the
fact that you normally don't wait for send completions before completing
I/O (which is broken), and the fact that you batch recv operations.

I would be interested to understand what indeed makes ibnbd run faster
though.

>> Also, given that the server pre-allocate a substantial amount of memory
>> for each connection, is it documented the requirements from the server
>> side? Usually kernel implementations (especially upstream ones) will
>> avoid imposing such large longstanding memory requirements on the system
>> by default. I don't have a firm stand on this, but wanted to highlight
>> this as you are sending this for upstream inclusion.
> We definitely need to stress that somewhere. Will include into readme
> and add to the cover letter next time. Our memory management is indeed
> basically absent in favor of performance: The server reserves
> queue_depth of say 512K buffers. Each buffer is used by client for
> single IO only, no matter how big the request is. So if client only
> issues 4K IOs, we do waste 508*queue_depth K of memory. We were aiming
> for lowest possible latency from the beginning. It is probably
> possible to implement some clever allocator on the server side which
> wouldn't affect the performance a lot.

Or you can fallback to rdma_read like the rest of the ulps.
Jinpu Wang July 12, 2019, 7:57 a.m. UTC | #23
Hi Sagi,

> >> Another question, from what I understand from the code, the client
> >> always rdma_writes data on writes (with imm) from a remote pool of
> >> server buffers dedicated to it. Essentially all writes are immediate (no
> >> rdma reads ever). How is that different than using send wrs to a set of
> >> pre-posted recv buffers (like all others are doing)? Is it faster?
> > At the very beginning of the project we did some measurements and saw,
> > that it is faster. I'm not sure if this is still true
>
> Its not significantly faster (can't imagine why it would be).
> What could make a difference is probably the fact that you never
> do rdma reads for I/O writes which might be better. Also perhaps the
> fact that you normally don't wait for send completions before completing
> I/O (which is broken), and the fact that you batch recv operations.

I don't know how you came to the conclusion that we don't wait for send
completion before completing IO.

We do chain WRs on a successful read request from the server, see the
function rdma_write_sg:

 318 static int rdma_write_sg(struct ibtrs_srv_op *id)
 319 {
 320         struct ibtrs_srv_sess *sess = to_srv_sess(id->con->c.sess);
 321         dma_addr_t dma_addr = sess->dma_addr[id->msg_id];
 322         struct ibtrs_srv *srv = sess->srv;
 323         struct ib_send_wr inv_wr, imm_wr;
 324         struct ib_rdma_wr *wr = NULL;
snip
 333         need_inval = le16_to_cpu(id->rd_msg->flags) & IBTRS_MSG_NEED_INVAL_F;
snip
 357                 wr->wr.wr_cqe   = &io_comp_cqe;
 358                 wr->wr.sg_list  = list;
 359                 wr->wr.num_sge  = 1;
 360                 wr->remote_addr = le64_to_cpu(id->rd_msg->desc[i].addr);
 361                 wr->rkey        = le32_to_cpu(id->rd_msg->desc[i].key);
 snip
368                 if (i < (sg_cnt - 1))
 369                         wr->wr.next = &id->tx_wr[i + 1].wr;
 370                 else if (need_inval)
 371                         wr->wr.next = &inv_wr;
 372                 else
 373                         wr->wr.next = &imm_wr;
 374
 375                 wr->wr.opcode = IB_WR_RDMA_WRITE;
 376                 wr->wr.ex.imm_data = 0;
 377                 wr->wr.send_flags  = 0;
snip
 386         if (need_inval) {
 387                 inv_wr.next = &imm_wr;
 388                 inv_wr.wr_cqe = &io_comp_cqe;
 389                 inv_wr.sg_list = NULL;
 390                 inv_wr.num_sge = 0;
 391                 inv_wr.opcode = IB_WR_SEND_WITH_INV;
 392                 inv_wr.send_flags = 0;
 393                 inv_wr.ex.invalidate_rkey = rkey;
 394         }
 395         imm_wr.next = NULL;
 396         imm_wr.wr_cqe = &io_comp_cqe;
 397         imm_wr.sg_list = NULL;
 398         imm_wr.num_sge = 0;
 399         imm_wr.opcode = IB_WR_RDMA_WRITE_WITH_IMM;
 400         imm_wr.send_flags = flags;
 401         imm_wr.ex.imm_data = cpu_to_be32(ibtrs_to_io_rsp_imm(id->msg_id,
 402                                                              0, need_inval));
 403


When we need to invalidate remote memory, the WRs are chained
together; the last two are inv_wr and imm_wr.
imm_wr being the last one is important: since RC QPs are ordered, we
know that when we receive
IB_WC_RECV_RDMA_WITH_IMM and w_inval is true, the hardware has already
finished its job of invalidating the MR.
If the server fails to invalidate, we do local invalidation and wait
for its completion.

On the client side:
284 static void complete_rdma_req(struct ibtrs_clt_io_req *req, int errno,
 285                               bool notify, bool can_wait)
 286 {
 287         struct ibtrs_clt_con *con = req->con;
 288         struct ibtrs_clt_sess *sess;
 289         struct ibtrs_clt *clt;
 290         int err;
 291
 292         if (WARN_ON(!req->in_use))
 293                 return;
 294         if (WARN_ON(!req->con))
 295                 return;
 296         sess = to_clt_sess(con->c.sess);
 297         clt = sess->clt;
 298
 299         if (req->sg_cnt) {
 300                 if (unlikely(req->dir == DMA_FROM_DEVICE && req->need_inv)) {
 301                         /*
 302                          * We are here to invalidate RDMA read requests
 303                          * ourselves.  In normal scenario server should
 304                          * send INV for all requested RDMA reads, but
 305                          * we are here, thus two things could happen:
 306                          *
 307                          *    1.  this is failover, when errno != 0
 308                          *        and can_wait == 1,
 309                          *
 310                          *    2.  something totally bad happened and
 311                          *        server forgot to send INV, so we
 312                          *        should do that ourselves.
 313                          */
 314
 315                         if (likely(can_wait)) {
 316                                 req->need_inv_comp = true;
 317                         } else {
 318                                 /* This should be IO path, so always notify */
 319                                 WARN_ON(!notify);
 320                                 /* Save errno for INV callback */
 321                                 req->inv_errno = errno;
 322                         }
 323
 324                         err = ibtrs_inv_rkey(req);
 325                         if (unlikely(err)) {
 326                                 ibtrs_err(sess, "Send INV WR key=%#x: %d\n",
 327                                           req->mr->rkey, err);
 328                         } else if (likely(can_wait)) {
 329                                 wait_for_completion(&req->inv_comp);
 330                         } else {
 331                                 /*
 332                                  * Something went wrong, so request will be
 333                                  * completed from INV callback.
 334                                  */
 335                                 WARN_ON_ONCE(1);
 336
 337                                 return;
 338                         }
 339                 }
 340                 ib_dma_unmap_sg(sess->s.dev->ib_dev, req->sglist,
 341                                 req->sg_cnt, req->dir);
 342         }
 343         if (sess->stats.enable_rdma_lat)
 344                 ibtrs_clt_update_rdma_lat(&sess->stats,
 345                                 req->dir == DMA_FROM_DEVICE,
 346                                 jiffies_to_msecs(jiffies - req->start_jiffies));
 347         ibtrs_clt_decrease_inflight(&sess->stats);
 348
 349         req->in_use = false;
 350         req->con = NULL;
 351
 352         if (notify)
 353                 req->conf(req->priv, errno);
 354 }

 356 static void process_io_rsp(struct ibtrs_clt_sess *sess, u32 msg_id,
 357                            s16 errno, bool w_inval)
 358 {
 359         struct ibtrs_clt_io_req *req;
 360
 361         if (WARN_ON(msg_id >= sess->queue_depth))
 362                 return;
 363
 364         req = &sess->reqs[msg_id];
 365         /* Drop need_inv if server responded with invalidation */
 366         req->need_inv &= !w_inval;
 367         complete_rdma_req(req, errno, true, false);
 368 }

Hope this clears things up.

Regards,
Jack
Danil Kipnis July 12, 2019, 10:58 a.m. UTC | #24
On Fri, Jul 12, 2019 at 2:22 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> >> My main issues which were raised before are:
> >> - IMO there isn't any justification to this ibtrs layering separation
> >>     given that the only user of this is your ibnbd. Unless you are
> >>     trying to submit another consumer, you should avoid adding another
> >>     subsystem that is not really general purpose.
> > We designed ibtrs not only with the IBNBD in mind but also as the
> > transport layer for a distributed SDS. We'd like to be able to do what
> > ceph is capable of (automatic up/down scaling of the storage cluster,
> > automatic recovery) but using in-kernel rdma-based IO transport
> > drivers, thin-provisioned volume managers, etc. to keep the highest
> > possible performance.
>
> Sounds lovely, but still very much bound to your ibnbd. And that part
> is not included in the patch set, so I still don't see why should this
> be considered as a "generic" transport subsystem (it clearly isn't).
Having IBTRS sit on a storage enables that storage to communicate with
other storages (forward requests, request reads from other storages,
e.g. for sync traffic). IBTRS is generic in the sense that it removes
the strict separation into initiator (converting BIOs into some
hardware-specific protocol messages) and target (which forwards those
messages to some local device supporting that protocol).
It appears less generic to me to talk SCSI or NVME between storages if
some storages have SCSI disks, others NVME disks or LVM volumes, or a
mixed setup. IBTRS allows one to just send, or request a read of, an
sg-list between machines over rdma - the very minimum required to
transport a BIO.
It would indeed support our case for the library if we could propose
at least two users of it. We currently only have a very early stage
prototype capable of organizing storages in pools, multiplexing io
between different storages, etc. sitting on top of ibtrs; it's not
functional yet. On the other hand, ibnbd together with ibtrs already
amounts to over 10000 lines.

> > All in all itbrs is a library to establish a "fat", multipath,
> > autoreconnectable connection between two hosts on top of rdma,
> > optimized for transport of IO traffic.
>
> That is also dictating a wire-protocol which makes it useless to pretty
> much any other consumer. Personally, I don't see how this library
> would ever be used outside of your ibnbd.
It's true, IBTRS also imposes a protocol for connection establishment
and the IO path. I think at least the IO part we did reduce to a bare
minimum:
350 * Write *
351
352 1. When processing a write request client selects one of the memory chunks
353 on the server side and rdma writes there the user data, user header and the
354 IBTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
355 contains size of the user header. The client tells the server which chunk has
356 been accessed and at what offset the IBTRS_MSG_RDMA_WRITE can be found by
357 using the IMM field.
358
359 2. When confirming a write request server sends an "empty" rdma message with
360 an immediate field. The 32 bit field is used to specify the outstanding
361 inflight IO and for the error code.
362
363 CLT                                                          SRV
364 usr_data + usr_hdr + ibtrs_msg_rdma_write -----------------> [IBTRS_IO_REQ_IMM]
365 [IBTRS_IO_RSP_IMM]                        <----------------- (id + errno)
366
367 * Read *
368
369 1. When processing a read request client selects one of the memory chunks
370 on the server side and rdma writes there the user header and the
371 IBTRS_MSG_RDMA_READ message. This message contains the type (read), size of
372 the user header, flags (specifying if memory invalidation is necessary) and the
373 list of addresses along with keys for the data to be read into.
374
375 2. When confirming a read request server transfers the requested data first,
376 attaches an invalidation message if requested and finally an "empty" rdma
377 message with an immediate field. The 32 bit field is used to specify the
378 outstanding inflight IO and the error code.
379
380 CLT                                           SRV
381 usr_hdr + ibtrs_msg_rdma_read --------------> [IBTRS_IO_REQ_IMM]
382 [IBTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
383 or in case client requested invalidation:
384 [IBTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

> >> - ibtrs in general is using almost no infrastructure from the existing
> >>     kernel subsystems. Examples are:
> >>     - tag allocation mechanism (which I'm not clear why its needed)
> > As you correctly noticed our client manages the buffers allocated and
> > registered by the server on the connection establishment. Our tags are
> > just a mechanism to take and release those buffers for incoming
> > requests on client side. Since the buffers allocated by the server are
> > to be shared between all the devices mapped from that server and all
> > their HW queues (each having num_cpus of them) the mechanism behind
> > get_tag/put_tag also takes care of the fairness.
>
> We have infrastructure for this, sbitmaps.
AFAIR Roman did try to use sbitmap but found no benefits in terms of
readability or number of lines:
" What is left unchanged on IBTRS side but was suggested to modify:
     - Bart suggested to use sbitmap instead of calling find_first_zero_bit()
  and friends.  I found calling pure bit API is more explicit in
  comparison to sbitmap - there is no need in using sbitmap_queue
  and all the power of wait queues, no benefits in terms of LoC
  as well." https://lwn.net/Articles/756994/

If sbitmap is a must for our use case from the infrastructure point of
view, we will revisit it.

>
> >>     - rdma rw abstraction similar to what we have in the core
> > On the one hand we have only single IO related function:
> > ibtrs_clt_request(READ/WRITE, session,...), which executes rdma write
> > with imm, or requests an rdma write with imm to be executed by the
> > server.
>
> For sure you can enhance the rw API to have imm support?
I'm not familiar with the architectural intention behind rw.c.
Extending the API with support for the imm field is (I guess) doable.
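
Something along these lines, perhaps (an untested sketch on top of the
existing rdma_rw_ctx calls; the zero-length RDMA_WRITE_WITH_IMM chained
at the end is what carries the imm, all other names are ours):

#include <rdma/rw.h>

static int post_rdma_write_imm(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
                               u8 port_num, struct scatterlist *sg, u32 sg_cnt,
                               u64 remote_addr, u32 rkey, u32 imm,
                               struct ib_cqe *cqe)
{
        struct ib_send_wr imm_wr = {
                .opcode         = IB_WR_RDMA_WRITE_WITH_IMM,
                .ex.imm_data    = cpu_to_be32(imm),
                .send_flags     = IB_SEND_SIGNALED,
                .wr_cqe         = cqe,
        };
        int ret;

        /* map the sg-list and build the RDMA WRITE chain */
        ret = rdma_rw_ctx_init(ctx, qp, port_num, sg, sg_cnt, 0,
                               remote_addr, rkey, DMA_TO_DEVICE);
        if (ret < 0)
                return ret;

        /* posts the rw WR chain with imm_wr appended at the end */
        return rdma_rw_ctx_post(ctx, qp, port_num, cqe, &imm_wr);
}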

> > On the other hand we provide an abstraction to establish and
> > manage what we call "session", which consist of multiple paths (to do
> > failover and multipath with different policies), where each path
> > consists of num_cpu rdma connections.
>
> That's fine, but it doesn't mean that it also needs to re-write
> infrastructure that we already have.
Do you refer to rw.c?

> > Once you established a session
> > you can add or remove paths from it on the fly. In case the connection
> > to server is lost, the client does periodic attempts to reconnect
> > automatically. On the server side you get just sg-lists with a
> > direction READ or WRITE as requested by the client. We designed this
> > interface not only as the minimum required to build a block device on
> > top of rdma but also with a distributed raid in mind.
>
> I suggest you take a look at the rw API and use that in your transport.
We will look into rw.c. Do you suggest we move the multipath and the
multiple QPs per path and connection establishment on *top* of it or
*into* it?

> >> Another question, from what I understand from the code, the client
> >> always rdma_writes data on writes (with imm) from a remote pool of
> >> server buffers dedicated to it. Essentially all writes are immediate (no
> >> rdma reads ever). How is that different than using send wrs to a set of
> >> pre-posted recv buffers (like all others are doing)? Is it faster?
> > At the very beginning of the project we did some measurements and saw,
> > that it is faster. I'm not sure if this is still true
>
> Its not significantly faster (can't imagine why it would be).
> What could make a difference is probably the fact that you never
> do rdma reads for I/O writes which might be better. Also perhaps the
> fact that you normally don't wait for send completions before completing
> I/O (which is broken), and the fact that you batch recv operations.
>
> I would be interested to understand what indeed makes ibnbd run faster
> though.
Yes, we would like to understand this too. I will try increasing the
inline_data_size on nvme in our benchmarks as the next step to check
if this influences the results.

> >> Also, given that the server pre-allocate a substantial amount of memory
> >> for each connection, is it documented the requirements from the server
> >> side? Usually kernel implementations (especially upstream ones) will
> >> avoid imposing such large longstanding memory requirements on the system
> >> by default. I don't have a firm stand on this, but wanted to highlight
> >> this as you are sending this for upstream inclusion.
> > We definitely need to stress that somewhere. Will include into readme
> > and add to the cover letter next time. Our memory management is indeed
> > basically absent in favor of performance: The server reserves
> > queue_depth of say 512K buffers. Each buffer is used by client for
> > single IO only, no matter how big the request is. So if client only
> > issues 4K IOs, we do waste 508*queue_depth K of memory. We were aiming
> > for lowest possible latency from the beginning. It is probably
> > possible to implement some clever allocator on the server side which
> > wouldn't affect the performance a lot.
>
> Or you can fallback to rdma_read like the rest of the ulps.
We currently have a single round trip for every write IO: write + ack.
Wouldn't switching to rdma_read make 2 round trips out of it: command
+ rdma_read + ack?
Sagi Grimberg July 12, 2019, 7:40 p.m. UTC | #25
> Hi Sagi,
> 
>>>> Another question, from what I understand from the code, the client
>>>> always rdma_writes data on writes (with imm) from a remote pool of
>>>> server buffers dedicated to it. Essentially all writes are immediate (no
>>>> rdma reads ever). How is that different than using send wrs to a set of
>>>> pre-posted recv buffers (like all others are doing)? Is it faster?
>>> At the very beginning of the project we did some measurements and saw,
>>> that it is faster. I'm not sure if this is still true
>>
>> Its not significantly faster (can't imagine why it would be).
>> What could make a difference is probably the fact that you never
>> do rdma reads for I/O writes which might be better. Also perhaps the
>> fact that you normally don't wait for send completions before completing
>> I/O (which is broken), and the fact that you batch recv operations.
> 
> I don't know how do you come to the conclusion we don't wait for send
> completion before completing IO.
> 
> We do chain wr on successfull read request from server, see funtion
> rdma_write_sg,

I was referring to the client side
Jinpu Wang July 15, 2019, 11:21 a.m. UTC | #26
Sagi Grimberg <sagi@grimberg.me> wrote on Fri, Jul 12, 2019 at 9:40 PM:
>
>
> > Hi Sagi,
> >
> >>>> Another question, from what I understand from the code, the client
> >>>> always rdma_writes data on writes (with imm) from a remote pool of
> >>>> server buffers dedicated to it. Essentially all writes are immediate (no
> >>>> rdma reads ever). How is that different than using send wrs to a set of
> >>>> pre-posted recv buffers (like all others are doing)? Is it faster?
> >>> At the very beginning of the project we did some measurements and saw,
> >>> that it is faster. I'm not sure if this is still true
> >>
> >> Its not significantly faster (can't imagine why it would be).
> >> What could make a difference is probably the fact that you never
> >> do rdma reads for I/O writes which might be better. Also perhaps the
> >> fact that you normally don't wait for send completions before completing
> >> I/O (which is broken), and the fact that you batch recv operations.
> >
> > I don't know how do you come to the conclusion we don't wait for send
> > completion before completing IO.
> >
> > We do chain wr on successfull read request from server, see funtion
> > rdma_write_sg,
>
> I was referring to the client side
Hi Sagi,

I checked the 3 commits you mentioned in the earlier thread again, and
I now get your point.
You mean the behavior the following commits try to fix:

4af7f7ff92a4 ("nvme-rdma: don't complete requests before a send work
request has completed")
b4b591c87f2b ("nvme-rdma: don't suppress send completions")

In this sense, the ibtrs client side is not waiting for the
completions of the RDMA WRITE WRs to finish.
But we did it right for local invalidation.
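
To make that concrete, this is roughly the pattern those commits
introduce (a simplified sketch from memory, not the actual nvme-rdma or
ibtrs code): keep the send signaled and complete the request only once
both the send completion and the response have arrived.

#include <linux/atomic.h>

struct io_req {
        atomic_t        ref;            /* send completion + response */
        int             status;
        void            (*done)(struct io_req *req);
};

static void io_req_start(struct io_req *req)
{
        atomic_set(&req->ref, 2);       /* one for send CQE, one for response */
}

static void io_req_put(struct io_req *req)
{
        if (atomic_dec_and_test(&req->ref))
                req->done(req);         /* complete only after both events */
}

/* called from the send CQE handler (the send must be IB_SEND_SIGNALED) */
static void io_send_done(struct io_req *req)
{
        io_req_put(req);
}

/* called when the server's response (RECV with imm) arrives */
static void io_resp_done(struct io_req *req, int status)
{
        req->status = status;
        io_req_put(req);
}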

I checked SRP/iser; they don't even wait for local invalidation, no
signal flag is set.

If it's a problem, we should fix them too, maybe more.

My question is: do you see the behavior (HCA retries the send due to a
dropped ack) in the field?
Is it possible to reproduce?

Thanks,
Jack
Danil Kipnis July 19, 2019, 1:12 p.m. UTC | #27
Hi Sagi,

thanks a lot for the information. We are doing the right thing
regarding the invalidation (your 2f122e4f5107), but we do use
unsignalled sends and need to fix that. Please correct me if I'm
wrong: the patches (4af7f7ff92a4, b4b591c87f2b) fix the problem that
if the ack from the target is lost for some reason, the initiator's HCA
will resend it even after the request has been completed.
But doesn't the same problem also persist the other way around: for
lost acks from the client? I mean, the target did a send for the "read"
IOs; the client completed the request (after invalidation, refcount
dropped to 0, etc.), but the ack is not delivered to the HCA of the
target, so the target will also resend it. This seems unfixable, since
the client can't possibly know whether the server received its ack or not.
Doesn't the problem go away if rdma_conn_param.retry_count is just set to 0?

Thanks for your help,
Best,
Danil.

On Tue, Jul 9, 2019 at 11:27 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> >> Thanks Jason for feedback.
> >> Can you be  more specific about  "the invalidation model for MR was wrong"
> >
> > MR's must be invalidated before data is handed over to the block
> > layer. It can't leave MRs open for access and then touch the memory
> > the MR covers.
>
> Jason is referring to these fixes:
> 2f122e4f5107 ("nvme-rdma: wait for local invalidation before completing
> a request")
> 4af7f7ff92a4 ("nvme-rdma: don't complete requests before a send work
> request has completed")
> b4b591c87f2b ("nvme-rdma: don't suppress send completions")