[vhost,v2,0/7] vdpa/mlx5: Optimize MKEY operations

Message ID 20240830105838.2666587-2-dtatulea@nvidia.com (mailing list archive)

Message

Dragos Tatulea Aug. 30, 2024, 10:58 a.m. UTC
This series improves the time of .set_map() operations by parallelizing
the MKEY creation and deletion for direct MKEYs. Looking at the top
level MKEY creation/deletion functions, the following improvement can be
seen:

|-------------------+-------------|
| operation         | improvement |
|-------------------+-------------|
| create_user_mr()  | 3-5x        |
| destroy_user_mr() | 8x          |
|-------------------+-------------|

The last part of the series introduces lazy MKEY deletion, which
defers the actual MKEY deletion to a workqueue.
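
As a rough illustration of the deferred-deletion idea, here is a minimal
userspace sketch in C. It is not the driver's code: struct mr, struct
gc_queue, mr_defer_destroy() and gc_run() are hypothetical names, free()
stands in for the expensive MKEY teardown, and the worker is modelled as
a plain function call instead of a kernel workqueue item.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a memory region that owns MKEYs. */
struct mr {
	int id;
	struct mr *next;	/* link in the deferred-deletion list */
};

/* Hypothetical queue of regions waiting to be destroyed. */
struct gc_queue {
	struct mr *head;
	size_t pending;		/* entries queued but not yet destroyed */
	size_t destroyed;	/* entries already released by the worker */
};

/* Fast path: O(1), only links the entry. Nothing is destroyed here. */
void mr_defer_destroy(struct gc_queue *q, struct mr *mr)
{
	mr->next = q->head;
	q->head = mr;
	q->pending++;
}

/* Slow path: what a worker would run later, off the critical path. */
void gc_run(struct gc_queue *q)
{
	struct mr *mr = q->head;

	q->head = NULL;
	while (mr) {
		struct mr *next = mr->next;

		free(mr);	/* stands in for the costly MKEY teardown */
		q->pending--;
		q->destroyed++;
		mr = next;
	}
}
```

The caller only pays for mr_defer_destroy() on the time-critical path;
the cost of the teardown is paid whenever gc_run() is eventually
invoked. A real implementation would also need locking between the two
paths, which this single-threaded sketch leaves out.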

As this series and the previous ones were targeting live migration,
we can also observe improvements on this front:

|-------------------+------------------+------------------|
| Stage             | Downtime #1 (ms) | Downtime #2 (ms) |
|-------------------+------------------+------------------|
| Baseline          | 3140             | 3630             |
| Parallel MKEY ops | 1200             | 2000             |
| Deferred deletion | 1014             | 1253             |
|-------------------+------------------+------------------|

Test configuration: 256 GB VM, 32 CPUs x 2 threads per core, 4 x mlx5
vDPA devices x 32 VQs (16 VQPs)
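
For reference, the end-to-end gain implied by the downtime table can be
computed from the Baseline and Deferred deletion rows. The helpers below
are a small sketch with hypothetical names (speedup(), reduction_pct());
they only restate the arithmetic on the numbers quoted above, which
works out to roughly 3.1x (about 68% less downtime) for run #1 and
roughly 2.9x (about 65% less) for run #2.

```c
#include <assert.h>

/* Speedup factor: how many times shorter the optimized downtime is. */
double speedup(double baseline_ms, double optimized_ms)
{
	return baseline_ms / optimized_ms;
}

/* Downtime reduction relative to the baseline, in percent. */
double reduction_pct(double baseline_ms, double optimized_ms)
{
	return 100.0 * (baseline_ms - optimized_ms) / baseline_ms;
}
```

For example, speedup(3140, 1014) covers run #1 and speedup(3630, 1253)
covers run #2.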

This series must be applied on top of the parallel VQ suspend/resume
series [0].

[0] https://lore.kernel.org/all/20240816090159.1967650-1-dtatulea@nvidia.com/

---
v2:
- Swapped flex array usage for plain zero length array in first patch.
- Updated code to use Scope-Based Cleanup Helpers where appropriate
  (only second patch).
- Added macro define for MTT alignment in first patch.
- Improved commit messages/comments based on review comments.
- Removed extra newlines.
---

Dragos Tatulea (7):
  vdpa/mlx5: Create direct MKEYs in parallel
  vdpa/mlx5: Delete direct MKEYs in parallel
  vdpa/mlx5: Rename function
  vdpa/mlx5: Extract mr members in own resource struct
  vdpa/mlx5: Rename mr_mtx -> lock
  vdpa/mlx5: Introduce init/destroy for MR resources
  vdpa/mlx5: Postpone MR deletion

 drivers/vdpa/mlx5/core/mlx5_vdpa.h |  25 ++-
 drivers/vdpa/mlx5/core/mr.c        | 288 +++++++++++++++++++++++++----
 drivers/vdpa/mlx5/core/resources.c |   3 -
 drivers/vdpa/mlx5/net/mlx5_vnet.c  |  53 +++---
 4 files changed, 296 insertions(+), 73 deletions(-)

Comments

Dragos Tatulea Sept. 9, 2024, 9:30 a.m. UTC | #1
On 30.08.24 12:58, Dragos Tatulea wrote:
> [...]
Gentle ping for the remaining patches in v2.

Thanks,
Dragos

Eugenio Perez Martin Sept. 11, 2024, 8:02 a.m. UTC | #2
On Mon, Sep 9, 2024 at 11:30 AM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>
> On 30.08.24 12:58, Dragos Tatulea wrote:
> > [...]
> Gentle ping for the remaining patches in v2.
>

Same here, this series is already in MST's branch:
https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/commit/?h=vhost&id=d424b079e243128383e88bee79f143ff30b4ec62

Dragos Tatulea Sept. 11, 2024, 5:05 p.m. UTC | #3
On 11.09.24 10:02, Eugenio Perez Martin wrote:
> On Mon, Sep 9, 2024 at 11:30 AM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>>
>> On 30.08.24 12:58, Dragos Tatulea wrote:
>>> [...]
>> Gentle ping for the remaining patches in v2.
>>
> 
> Same here, this series is already in MST's branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/commit/?h=vhost&id=d424b079e243128383e88bee79f143ff30b4ec62
> 
Ack. Thanks!