[rdma-next,0/8] Introduce mlx5 Memory Scheme ODP

Message ID: 20240904153038.23054-1-michaelgur@nvidia.com

Message

Michael Guralnik Sept. 4, 2024, 3:30 p.m. UTC
This series introduces a new ODP scheme in mlx5, in which the FW takes
responsibility for parsing the page fault and providing the fault data
to the driver to handle.
This is in contrast to the current ODP transport scheme, where the
driver is responsible for reading and parsing work queues and querying
mkeys to acquire the information needed to handle the page fault.

The new scheme allows the driver to support ODP over DEVX QPs, where
the driver cannot access the QP buffers, owned by the user application,
to read the work queue requests.
Furthermore, the new scheme allows ODP support for new indirect MKEY
types, as the driver does not need to query or parse indirect mkeys in
this scheme.
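
To make the contrast concrete, here is a hypothetical, heavily
simplified C sketch; the structures and field names are made up for
illustration and are not the actual mlx5 driver code or EQE layout. It
only shows the structural difference: in the transport scheme the
handler must locate the QP, parse the WQE and query the mkeys itself,
while in the memory scheme the FW event already carries everything
needed to fault in the pages.

/* Hypothetical sketch only -- not the mlx5 driver code or EQE layout. */
#include <stdint.h>
#include <stdio.h>

/* Memory scheme: the FW-provided event already identifies the memory. */
struct mem_scheme_fault {
	uint32_t mkey;      /* faulting memory key, resolved by FW */
	uint64_t address;   /* faulting virtual address */
	uint64_t length;    /* bytes to make present */
};

/* Transport scheme: the event only points at the queue that faulted. */
struct transport_fault {
	uint32_t qpn;       /* driver must locate this QP's buffer... */
	uint16_t wqe_index; /* ...parse the WQE here, then query mkeys */
};

static int fault_in_pages(uint32_t mkey, uint64_t addr, uint64_t len)
{
	printf("fault mkey 0x%x addr 0x%jx len %ju\n",
	       mkey, (uintmax_t)addr, (uintmax_t)len);
	return 0;
}

static int handle_transport_fault(const struct transport_fault *tf)
{
	/*
	 * 1. Find the QP buffer (not possible for DEVX QPs owned by the
	 *    application -- the limitation described above).
	 * 2. Parse the WQE at tf->wqe_index to find the data segments.
	 * 3. Query each referenced mkey for its address range.
	 * 4. Only then fault in the pages.
	 */
	(void)tf;
	return -1;
}

static int handle_memory_fault(const struct mem_scheme_fault *mf)
{
	/* FW already did the parsing; just bring the pages in. */
	return fault_in_pages(mf->mkey, mf->address, mf->length);
}

int main(void)
{
	struct mem_scheme_fault mf = {
		.mkey = 0x1234, .address = 0x7f0000000000ULL, .length = 4096,
	};
	struct transport_fault tf = { .qpn = 0x5678, .wqe_index = 0 };

	handle_transport_fault(&tf);
	return handle_memory_fault(&mf);
}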

The driver will enable the new scheme on devices that have the relevant
capabilities. Otherwise, transport scheme ODP will be the default.

The move to memory scheme ODP is transparent to existing ODP
applications and no change is needed.
New applications that want to take advantage of the new functionality
should query which scheme is active and its capabilities using DEVX.
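
As an illustration, here is a minimal user-space sketch that checks
general ODP support through the standard verbs API. Reading which
scheme is active and its scheme-specific capabilities requires a DEVX
QUERY_HCA_CAP command, which is not shown here.

/* Minimal sketch: check general ODP support via standard verbs.
 * Build with: cc odp_check.c -libverbs
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
	struct ibv_device **list = ibv_get_device_list(NULL);
	struct ibv_context *ctx;
	struct ibv_device_attr_ex attr;
	int ret = 1;

	if (!list || !list[0])
		return 1;

	ctx = ibv_open_device(list[0]);
	if (!ctx)
		goto out_free;

	if (!ibv_query_device_ex(ctx, NULL, &attr) &&
	    (attr.odp_caps.general_caps & IBV_ODP_SUPPORT)) {
		printf("ODP supported, RC transport caps: 0x%x\n",
		       attr.odp_caps.per_transport_caps.rc_odp_caps);
		ret = 0;
	} else {
		printf("ODP is not supported on this device\n");
	}

	ibv_close_device(ctx);
out_free:
	ibv_free_device_list(list);
	return ret;
}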

Michael Guralnik (8):
  net/mlx5: Expand mkey page size to support 6 bits
  net/mlx5: Expose HW bits for Memory scheme ODP
  RDMA/mlx5: Add new ODP memory scheme eqe format
  RDMA/mlx5: Enforce umem boundaries for explicit ODP page faults
  RDMA/mlx5: Split ODP mkey search logic
  RDMA/mlx5: Add handling for memory scheme page fault events
  RDMA/mlx5: Add implicit MR handling to ODP memory scheme
  net/mlx5: Handle memory scheme ODP capabilities

 drivers/infiniband/hw/mlx5/mlx5_ib.h          |  17 +-
 drivers/infiniband/hw/mlx5/mr.c               |  10 +-
 drivers/infiniband/hw/mlx5/odp.c              | 400 ++++++++++++++----
 .../net/ethernet/mellanox/mlx5/core/main.c    |  54 ++-
 include/linux/mlx5/device.h                   |  30 +-
 include/linux/mlx5/mlx5_ifc.h                 |  64 ++-
 6 files changed, 449 insertions(+), 126 deletions(-)

Comments

Zhu Yanjun Sept. 6, 2024, 5:35 a.m. UTC | #1
On 2024/9/4 23:30, Michael Guralnik wrote:
> This series introduces a new ODP scheme in mlx5, in which the FW takes
> responsibility for parsing the page fault and providing the fault data
> to the driver to handle.
> This is in contrast to the current ODP transport scheme, where the
> driver is responsible for reading and parsing work queues and querying
> mkeys to acquire the information needed to handle the page fault.
> 
> The new scheme allows the driver to support ODP over DEVX QPs, where
> the driver cannot access the QP buffers, owned by the user application,
> to read the work queue requests.
> Furthermore, the new scheme allows ODP support for new indirect MKEY
> types, as the driver does not need to query or parse indirect mkeys in
> this scheme.
> 
> The driver will enable the new scheme on devices that have the relevant
> capabilities. Otherwise, transport scheme ODP will be the default.
> 
> The move to memory scheme ODP is transparent to existing ODP
> applications and no change is needed.
> New applications that want to take advantage of the new functionality
> should query which scheme is active and its capabilities using DEVX.

On-Demand Paging (ODP) is a technique to alleviate many of the
shortcomings of memory registration. Applications no longer need to pin
down the underlying physical pages of the address space and track the
validity of the mappings. Rather, the HCA requests the latest
translations from the OS when pages are not present, and the OS
invalidates translations which are no longer valid due to either
non-present pages or mapping changes.

As such, it seems that ODP can save memory, since the underlying
physical pages of the address space are not pinned down and the
application does not have to track the validity of the mappings.
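
For reference, an ODP memory region is registered through the standard
verbs API by adding IBV_ACCESS_ON_DEMAND to the access flags. A minimal
sketch, assuming an existing protection domain pd and with error
handling trimmed:

/* Minimal sketch: registering an explicit ODP MR with libibverbs.
 * "pd" is assumed to be an existing struct ibv_pd *. With
 * IBV_ACCESS_ON_DEMAND the pages backing "buf" are not pinned at
 * registration time; they are faulted in on access.
 */
#include <stdlib.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_odp_mr(struct ibv_pd *pd, size_t len)
{
	void *buf = malloc(len);

	if (!buf)
		return NULL;

	return ibv_reg_mr(pd, buf, len,
			  IBV_ACCESS_LOCAL_WRITE |
			  IBV_ACCESS_REMOTE_READ |
			  IBV_ACCESS_REMOTE_WRITE |
			  IBV_ACCESS_ON_DEMAND);
}

Implicit ODP, which covers the whole address space, is requested by
passing addr = NULL and length = SIZE_MAX together with the same flag.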

What is the performance difference with and without ODP enabled? And
regarding memory usage, are there any test results?

Can ODP be used with mlx5 IB devices, or can it only be used with
mlx5 RoCEv2 devices?

Thanks,
Zhu Yanjun

> 
> Michael Guralnik (8):
>    net/mlx5: Expand mkey page size to support 6 bits
>    net/mlx5: Expose HW bits for Memory scheme ODP
>    RDMA/mlx5: Add new ODP memory scheme eqe format
>    RDMA/mlx5: Enforce umem boundaries for explicit ODP page faults
>    RDMA/mlx5: Split ODP mkey search logic
>    RDMA/mlx5: Add handling for memory scheme page fault events
>    RDMA/mlx5: Add implicit MR handling to ODP memory scheme
>    net/mlx5: Handle memory scheme ODP capabilities
> 
>   drivers/infiniband/hw/mlx5/mlx5_ib.h          |  17 +-
>   drivers/infiniband/hw/mlx5/mr.c               |  10 +-
>   drivers/infiniband/hw/mlx5/odp.c              | 400 ++++++++++++++----
>   .../net/ethernet/mellanox/mlx5/core/main.c    |  54 ++-
>   include/linux/mlx5/device.h                   |  30 +-
>   include/linux/mlx5/mlx5_ifc.h                 |  64 ++-
>   6 files changed, 449 insertions(+), 126 deletions(-)
>
Michael Guralnik Sept. 8, 2024, 6:18 a.m. UTC | #2
On 06/09/2024 08:35, Zhu Yanjun wrote:
>
> As such, it seems that ODP can save memory, since the underlying
> physical pages of the address space are not pinned down and the
> application does not have to track the validity of the mappings.
>
> What is the performance difference with and without ODP enabled? And
> regarding memory usage, are there any test results?
>
> Can ODP be used with mlx5 IB devices, or can it only be used with
> mlx5 RoCEv2 devices?
>
The performance while using ODP is highly dependent on many factors that 
dictate how many page faults the kernel will have to deal with.
Each page fault will introduce a latency hit.

Both the examples in rdma-core (e.g. ibv_rc_pingpong) and perftest
(e.g. ib_write_bw) support running with ODP to test this.

ODP can be used in both IB and RoCE.


Michael


> Thanks,
> Zhu Yanjun
>
Zhu Yanjun Sept. 9, 2024, 2:10 a.m. UTC | #3
On 2024/9/8 14:18, Michael Guralnik wrote:
> 
> On 06/09/2024 08:35, Zhu Yanjun wrote:
>>
>> As such, it seems that ODP can save memory, since the underlying
>> physical pages of the address space are not pinned down and the
>> application does not have to track the validity of the mappings.
>>
>> What is the performance difference with and without ODP enabled? And
>> regarding memory usage, are there any test results?
>>
>> Can ODP be used with mlx5 IB devices, or can it only be used with
>> mlx5 RoCEv2 devices?
>>
> The performance while using ODP is highly dependent on many factors that 
> dictate how many page faults the kernel will have to deal with.
> Each page fault will introduce a latency hit.
> 
> Both the examples in rdma-core (e.g. ibv_rc_pingpong) and perftest
> (e.g. ib_write_bw) support running with ODP to test this.

Thanks a lot. I have developed ODP for other RDMA devices. From my
tests, it seems that with ODP less system memory is needed than without
ODP.

From your description, it seems that RDMA latency will increase, if I
understand you correctly.

If other metrics (for example, bandwidth) remain unchanged, the
tradeoff should be between memory usage and latency.

Best Regards,
Zhu Yanjun

> 
> ODP can be used in both IB and RoCE.
> 
> 
> Michael
> 
> 
>> Thanks,
>> Zhu Yanjun
>>