
[net-next] net/mlx5e: Transmit small messages in linear skb

Message ID 20241204140230.23858-1-wintera@linux.ibm.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series [net-next] net/mlx5e: Transmit small messages in linear skb

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 3 this patch: 3
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 1 maintainers not CCed: linux-rdma@vger.kernel.org
netdev/build_clang success Errors and warnings before: 3 this patch: 3
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 304 this patch: 304
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 17 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-12-04--15-02 (tests: 760)

Commit Message

Alexandra Winter Dec. 4, 2024, 2:02 p.m. UTC
Linearize the skb if the device uses an IOMMU and the data buffer can fit
into one page, so that the message can be transferred to the card in a
single DMA transfer instead of two.

Performance issue:
------------------
Since commit 472c2e07eef0 ("tcp: add one skb cache for tx")
tcp skbs are always non-linear. Especially on platforms with IOMMU,
mapping and unmapping two pages instead of one per transfer can make a
noticeable difference. On s390 we saw a 13% degradation in throughput
when running uperf with a request-response pattern with 1k payload and
250 parallel connections. See [0] for a discussion.

This patch mitigates these effects using a work-around in the mlx5 driver.

Notes on implementation:
------------------------
TCP skbs never contain any tailroom, so skb_linearize() will allocate a
new data buffer.
There is no need to handle the return code of skb_linearize(); if it fails,
we continue with the unchanged skb.

As mentioned in the discussion, an alternative, but more invasive approach
would be: premapping a coherent piece of memory in which you can copy
small skbs.

Measurement results:
--------------------
We see an improvement in throughput of up to 16% compared to kernel v6.12.
We measured throughput and CPU consumption of uperf benchmarks with
ConnectX-6 cards on s390 architecture and compared results of kernel v6.12
with and without this patch.

+------------------------------------------+
| Transactions per Second - Deviation in % |
+-------------------+----------------------+
| Workload          |                      |
|  rr1c-1x1--50     |          4.75        |
|  rr1c-1x1-250     |         14.53        |
| rr1c-200x1000--50 |          2.22        |
| rr1c-200x1000-250 |         12.24        |
+-------------------+----------------------+
| Server CPU Consumption - Deviation in %  |
+-------------------+----------------------+
| Workload          |                      |
|  rr1c-1x1--50     |         -1.66        |
|  rr1c-1x1-250     |        -10.00        |
| rr1c-200x1000--50 |         -0.83        |
| rr1c-200x1000-250 |         -8.71        |
+-------------------+----------------------+

Note:
- CPU consumption: less is better
- Client CPU consumption is similar
- Workload:
  rr1c-<bytes sent>x<bytes received>-<parallel connections>

  Highly transactional small data sizes (rr1c-1x1)
    This is a Request & Response (RR) test that sends a 1-byte request
    from the client and receives a 1-byte response from the server. This
    is the smallest possible transactional workload test and is smaller
    than most customer workloads. This test represents the RR overhead
    costs.
  Highly transactional medium data sizes (rr1c-200x1000)
    Request & Response (RR) test that sends a 200-byte request from the
    client and receives a 1000-byte response from the server. This test
    should be representative of a typical user's interaction with a remote
    web site.

Link: https://lore.kernel.org/netdev/20220907122505.26953-1-wintera@linux.ibm.com/#t [0]
Suggested-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Co-developed-by: Nils Hoppmann <niho@linux.ibm.com>
Signed-off-by: Nils Hoppmann <niho@linux.ibm.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Eric Dumazet Dec. 4, 2024, 2:16 p.m. UTC | #1
On Wed, Dec 4, 2024 at 3:02 PM Alexandra Winter <wintera@linux.ibm.com> wrote:
>
> Linearize the skb if the device uses IOMMU and the data buffer can fit
> into one page. So messages can be transferred in one transfer to the card
> instead of two.
>
> Performance issue:
> ------------------
> Since commit 472c2e07eef0 ("tcp: add one skb cache for tx")
> tcp skbs are always non-linear. Especially on platforms with IOMMU,
> mapping and unmapping two pages instead of one per transfer can make a
> noticeable difference. On s390 we saw a 13% degradation in throughput,
> when running uperf with a request-response pattern with 1k payload and
> 250 connections parallel. See [0] for a discussion.
>
> This patch mitigates these effects using a work-around in the mlx5 driver.
>
> Notes on implementation:
> ------------------------
> TCP skbs never contain any tailroom, so skb_linearize() will allocate a
> new data buffer.
> No need to handle rc of skb_linearize(). If it fails, we continue with the
> unchanged skb.
>
> As mentioned in the discussion, an alternative, but more invasive approach
> would be: premapping a coherent piece of memory in which you can copy
> small skbs.
>
> Measurement results:
> --------------------
> We see an improvement in throughput of up to 16% compared to kernel v6.12.
> We measured throughput and CPU consumption of uperf benchmarks with
> ConnectX-6 cards on s390 architecture and compared results of kernel v6.12
> with and without this patch.
>
> +------------------------------------------+
> | Transactions per Second - Deviation in % |
> +-------------------+----------------------+
> | Workload          |                      |
> |  rr1c-1x1--50     |          4.75        |
> |  rr1c-1x1-250     |         14.53        |
> | rr1c-200x1000--50 |          2.22        |
> | rr1c-200x1000-250 |         12.24        |
> +-------------------+----------------------+
> | Server CPU Consumption - Deviation in %  |
> +-------------------+----------------------+
> | Workload          |                      |
> |  rr1c-1x1--50     |         -1.66        |
> |  rr1c-1x1-250     |        -10.00        |
> | rr1c-200x1000--50 |         -0.83        |
> | rr1c-200x1000-250 |         -8.71        |
> +-------------------+----------------------+
>
> Note:
> - CPU consumption: less is better
> - Client CPU consumption is similar
> - Workload:
>   rr1c-<bytes send>x<bytes received>-<parallel connections>
>
>   Highly transactional small data sizes (rr1c-1x1)
>     This is a Request & Response (RR) test that sends a 1-byte request
>     from the client and receives a 1-byte response from the server. This
>     is the smallest possible transactional workload test and is smaller
>     than most customer workloads. This test represents the RR overhead
>     costs.
>   Highly transactional medium data sizes (rr1c-200x1000)
>     Request & Response (RR) test that sends a 200-byte request from the
>     client and receives a 1000-byte response from the server. This test
>     should be representative of a typical user's interaction with a remote
>     web site.
>
> Link: https://lore.kernel.org/netdev/20220907122505.26953-1-wintera@linux.ibm.com/#t [0]
> Suggested-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
> Co-developed-by: Nils Hoppmann <niho@linux.ibm.com>
> Signed-off-by: Nils Hoppmann <niho@linux.ibm.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> index f8c7912abe0e..421ba6798ca7 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> @@ -32,6 +32,7 @@
>
>  #include <linux/tcp.h>
>  #include <linux/if_vlan.h>
> +#include <linux/iommu-dma.h>
>  #include <net/geneve.h>
>  #include <net/dsfield.h>
>  #include "en.h"
> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>  {
>         struct mlx5e_sq_stats *stats = sq->stats;
>
> +       /* Don't require 2 IOMMU TLB entries, if one is sufficient */
> +       if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
> +               skb_linearize(skb);
> +
>         if (skb_is_gso(skb)) {
>                 int hopbyhop;
>                 u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
> --
> 2.45.2


Was this tested on x86_64 or any other arch than s390, especially ones
with PAGE_SIZE = 65536 ?
Alexander Lobakin Dec. 4, 2024, 2:32 p.m. UTC | #2
From: Alexandra Winter <wintera@linux.ibm.com>
Date: Wed,  4 Dec 2024 15:02:30 +0100

> Linearize the skb if the device uses IOMMU and the data buffer can fit
> into one page. So messages can be transferred in one transfer to the card
> instead of two.

I'd expect this to be on the generic level, not copied over the drivers?
Not sure about PAGE_SIZE, but I never saw a NIC/driver/platform where
copying let's say 256 bytes would be slower than 2x dma_map (even with
direct DMA).

> 
> Performance issue:
> ------------------
> Since commit 472c2e07eef0 ("tcp: add one skb cache for tx")
> tcp skbs are always non-linear. Especially on platforms with IOMMU,
> mapping and unmapping two pages instead of one per transfer can make a
> noticeable difference. On s390 we saw a 13% degradation in throughput,
> when running uperf with a request-response pattern with 1k payload and
> 250 connections parallel. See [0] for a discussion.
> 
> This patch mitigates these effects using a work-around in the mlx5 driver.
> 
> Notes on implementation:
> ------------------------
> TCP skbs never contain any tailroom, so skb_linearize() will allocate a
> new data buffer.
> No need to handle rc of skb_linearize(). If it fails, we continue with the
> unchanged skb.
> 
> As mentioned in the discussion, an alternative, but more invasive approach
> would be: premapping a coherent piece of memory in which you can copy
> small skbs.

Yes, that one would be better.

[...]

> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>  {
>  	struct mlx5e_sq_stats *stats = sq->stats;
>  
> +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)

1. What's with the direct DMA? I believe it would benefit, too?
2. Why truesize, not something like

	if (skb->len <= some_sane_value_maybe_1k)

3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think
   it's a good idea to rely on this.
   Some test-based hardcode would be enough (i.e. threshold on which
   DMA mapping starts performing better).

> +		skb_linearize(skb);
> +
>  	if (skb_is_gso(skb)) {

BTW can't there be a case when the skb is GSO, but its truesize is
PAGE_SIZE and linearize will be way too slow (not sure it's possible,
just guessing)?

>  		int hopbyhop;
>  		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);

Thanks,
Olek
Alexandra Winter Dec. 4, 2024, 2:35 p.m. UTC | #3
On 04.12.24 15:16, Eric Dumazet wrote:
> On Wed, Dec 4, 2024 at 3:02 PM Alexandra Winter <wintera@linux.ibm.com> wrote:
>>
>> Linearize the skb if the device uses IOMMU and the data buffer can fit
>> into one page. So messages can be transferred in one transfer to the card
>> instead of two.
>>
>> Performance issue:
>> ------------------
>> Since commit 472c2e07eef0 ("tcp: add one skb cache for tx")
>> tcp skbs are always non-linear. Especially on platforms with IOMMU,
>> mapping and unmapping two pages instead of one per transfer can make a
>> noticeable difference. On s390 we saw a 13% degradation in throughput,
>> when running uperf with a request-response pattern with 1k payload and
>> 250 connections parallel. See [0] for a discussion.
>>
>> This patch mitigates these effects using a work-around in the mlx5 driver.
>>
>> Notes on implementation:
>> ------------------------
>> TCP skbs never contain any tailroom, so skb_linearize() will allocate a
>> new data buffer.
>> No need to handle rc of skb_linearize(). If it fails, we continue with the
>> unchanged skb.
>>
>> As mentioned in the discussion, an alternative, but more invasive approach
>> would be: premapping a coherent piece of memory in which you can copy
>> small skbs.
>>
>> Measurement results:
>> --------------------
>> We see an improvement in throughput of up to 16% compared to kernel v6.12.
>> We measured throughput and CPU consumption of uperf benchmarks with
>> ConnectX-6 cards on s390 architecture and compared results of kernel v6.12
>> with and without this patch.
>>
>> +------------------------------------------+
>> | Transactions per Second - Deviation in % |
>> +-------------------+----------------------+
>> | Workload          |                      |
>> |  rr1c-1x1--50     |          4.75        |
>> |  rr1c-1x1-250     |         14.53        |
>> | rr1c-200x1000--50 |          2.22        |
>> | rr1c-200x1000-250 |         12.24        |
>> +-------------------+----------------------+
>> | Server CPU Consumption - Deviation in %  |
>> +-------------------+----------------------+
>> | Workload          |                      |
>> |  rr1c-1x1--50     |         -1.66        |
>> |  rr1c-1x1-250     |        -10.00        |
>> | rr1c-200x1000--50 |         -0.83        |
>> | rr1c-200x1000-250 |         -8.71        |
>> +-------------------+----------------------+
>>
>> Note:
>> - CPU consumption: less is better
>> - Client CPU consumption is similar
>> - Workload:
>>   rr1c-<bytes send>x<bytes received>-<parallel connections>
>>
>>   Highly transactional small data sizes (rr1c-1x1)
>>     This is a Request & Response (RR) test that sends a 1-byte request
>>     from the client and receives a 1-byte response from the server. This
>>     is the smallest possible transactional workload test and is smaller
>>     than most customer workloads. This test represents the RR overhead
>>     costs.
>>   Highly transactional medium data sizes (rr1c-200x1000)
>>     Request & Response (RR) test that sends a 200-byte request from the
>>     client and receives a 1000-byte response from the server. This test
>>     should be representative of a typical user's interaction with a remote
>>     web site.
>>
>> Link: https://lore.kernel.org/netdev/20220907122505.26953-1-wintera@linux.ibm.com/#t [0]
>> Suggested-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
>> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
>> Co-developed-by: Nils Hoppmann <niho@linux.ibm.com>
>> Signed-off-by: Nils Hoppmann <niho@linux.ibm.com>
>> ---
>>  drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
>> index f8c7912abe0e..421ba6798ca7 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
>> @@ -32,6 +32,7 @@
>>
>>  #include <linux/tcp.h>
>>  #include <linux/if_vlan.h>
>> +#include <linux/iommu-dma.h>
>>  #include <net/geneve.h>
>>  #include <net/dsfield.h>
>>  #include "en.h"
>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>  {
>>         struct mlx5e_sq_stats *stats = sq->stats;
>>
>> +       /* Don't require 2 IOMMU TLB entries, if one is sufficient */
>> +       if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
>> +               skb_linearize(skb);
>> +
>>         if (skb_is_gso(skb)) {
>>                 int hopbyhop;
>>                 u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
>> --
>> 2.45.2
> 
> 
> Was this tested on x86_64 or any other arch than s390, especially ones
> with PAGE_SIZE = 65536 ?
> 

No, I don't have a mlx5 card in an x86_64.
Rahul, could you test this patch?
Eric Dumazet Dec. 4, 2024, 2:36 p.m. UTC | #4
On Wed, Dec 4, 2024 at 3:16 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Dec 4, 2024 at 3:02 PM Alexandra Winter <wintera@linux.ibm.com> wrote:
> >
> > Linearize the skb if the device uses IOMMU and the data buffer can fit
> > into one page. So messages can be transferred in one transfer to the card
> > instead of two.
> >
> > Performance issue:
> > ------------------
> > Since commit 472c2e07eef0 ("tcp: add one skb cache for tx")
> > tcp skbs are always non-linear. Especially on platforms with IOMMU,
> > mapping and unmapping two pages instead of one per transfer can make a
> > noticeable difference. On s390 we saw a 13% degradation in throughput,
> > when running uperf with a request-response pattern with 1k payload and
> > 250 connections parallel. See [0] for a discussion.
> >
> > This patch mitigates these effects using a work-around in the mlx5 driver.
> >
> > Notes on implementation:
> > ------------------------
> > TCP skbs never contain any tailroom, so skb_linearize() will allocate a
> > new data buffer.
> > No need to handle rc of skb_linearize(). If it fails, we continue with the
> > unchanged skb.
> >
> > As mentioned in the discussion, an alternative, but more invasive approach
> > would be: premapping a coherent piece of memory in which you can copy
> > small skbs.
> >
> > Measurement results:
> > --------------------
> > We see an improvement in throughput of up to 16% compared to kernel v6.12.
> > We measured throughput and CPU consumption of uperf benchmarks with
> > ConnectX-6 cards on s390 architecture and compared results of kernel v6.12
> > with and without this patch.
> >
> > +------------------------------------------+
> > | Transactions per Second - Deviation in % |
> > +-------------------+----------------------+
> > | Workload          |                      |
> > |  rr1c-1x1--50     |          4.75        |
> > |  rr1c-1x1-250     |         14.53        |
> > | rr1c-200x1000--50 |          2.22        |
> > | rr1c-200x1000-250 |         12.24        |
> > +-------------------+----------------------+
> > | Server CPU Consumption - Deviation in %  |
> > +-------------------+----------------------+
> > | Workload          |                      |
> > |  rr1c-1x1--50     |         -1.66        |
> > |  rr1c-1x1-250     |        -10.00        |
> > | rr1c-200x1000--50 |         -0.83        |
> > | rr1c-200x1000-250 |         -8.71        |
> > +-------------------+----------------------+
> >
> > Note:
> > - CPU consumption: less is better
> > - Client CPU consumption is similar
> > - Workload:
> >   rr1c-<bytes send>x<bytes received>-<parallel connections>
> >
> >   Highly transactional small data sizes (rr1c-1x1)
> >     This is a Request & Response (RR) test that sends a 1-byte request
> >     from the client and receives a 1-byte response from the server. This
> >     is the smallest possible transactional workload test and is smaller
> >     than most customer workloads. This test represents the RR overhead
> >     costs.
> >   Highly transactional medium data sizes (rr1c-200x1000)
> >     Request & Response (RR) test that sends a 200-byte request from the
> >     client and receives a 1000-byte response from the server. This test
> >     should be representative of a typical user's interaction with a remote
> >     web site.
> >
> > Link: https://lore.kernel.org/netdev/20220907122505.26953-1-wintera@linux.ibm.com/#t [0]
> > Suggested-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
> > Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
> > Co-developed-by: Nils Hoppmann <niho@linux.ibm.com>
> > Signed-off-by: Nils Hoppmann <niho@linux.ibm.com>
> > ---
> >  drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> > index f8c7912abe0e..421ba6798ca7 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> > @@ -32,6 +32,7 @@
> >
> >  #include <linux/tcp.h>
> >  #include <linux/if_vlan.h>
> > +#include <linux/iommu-dma.h>
> >  #include <net/geneve.h>
> >  #include <net/dsfield.h>
> >  #include "en.h"
> > @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> >  {
> >         struct mlx5e_sq_stats *stats = sq->stats;
> >
> > +       /* Don't require 2 IOMMU TLB entries, if one is sufficient */
> > +       if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
> > +               skb_linearize(skb);
> > +
> >         if (skb_is_gso(skb)) {
> >                 int hopbyhop;
> >                 u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
> > --
> > 2.45.2
>
>
> Was this tested on x86_64 or any other arch than s390, especially ones
> with PAGE_SIZE = 65536 ?

I would suggest the opposite : copy the headers (typically less than
128 bytes) on a piece of coherent memory.

As a bonus, if skb->len is smaller than 256 bytes, copy the whole skb.

include/net/tso.h and net/core/tso.c users do this.

Sure, patch is going to be more invasive, but all arches will win.
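
As a rough illustration of that direction, the sketch below premaps one coherent
slot per TX descriptor and, at xmit time, copies either the headers or, for small
packets, the whole skb into it. All names (tx_bounce, TX_BOUNCE_SLOT_SIZE, the
256-byte cut-off) are assumptions for illustration, not existing mlx5e or
net/core/tso.c code.

```c
#include <linux/dma-mapping.h>
#include <linux/minmax.h>
#include <linux/skbuff.h>

#define TX_BOUNCE_SLOT_SIZE	256	/* assumed copybreak, not a driver constant */

struct tx_bounce {
	void		*cpu_addr;	/* coherent memory, premapped once */
	dma_addr_t	dma_addr;
	u16		slots;		/* one slot per TX descriptor index */
};

/* Allocate the premapped region at ring setup time. */
static int tx_bounce_init(struct device *dev, struct tx_bounce *b, u16 slots)
{
	b->cpu_addr = dma_alloc_coherent(dev, (size_t)slots * TX_BOUNCE_SLOT_SIZE,
					 &b->dma_addr, GFP_KERNEL);
	if (!b->cpu_addr)
		return -ENOMEM;
	b->slots = slots;
	return 0;
}

/*
 * At xmit time: copy the whole skb if it is small, otherwise only its headers.
 * The returned DMA address points into the premapped region, so no per-packet
 * dma_map_single()/dma_unmap_single() (and no IOMMU map/unmap) is needed for
 * the copied part.
 */
static dma_addr_t tx_bounce_copy(struct tx_bounce *b, u16 slot,
				 struct sk_buff *skb, unsigned int *copied)
{
	void *dst = b->cpu_addr + (size_t)slot * TX_BOUNCE_SLOT_SIZE;
	unsigned int len = skb->len <= TX_BOUNCE_SLOT_SIZE ?
			   skb->len : skb_headlen(skb);

	len = min_t(unsigned int, len, TX_BOUNCE_SLOT_SIZE);
	skb_copy_bits(skb, 0, dst, len);	/* copies from frags as well */
	*copied = len;

	return b->dma_addr + (size_t)slot * TX_BOUNCE_SLOT_SIZE;
}
```

Anything not covered by *copied (the remaining payload of a larger skb) would
still be mapped and attached as regular data segments.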
David Laight Dec. 6, 2024, 2:47 p.m. UTC | #5
From: Eric Dumazet
> Sent: 04 December 2024 14:36
...
> I would suggest the opposite : copy the headers (typically less than
> 128 bytes) on a piece of coherent memory.

A long time ago a colleague tested the cutoff between copying to
a fixed buffer and dma access to the kernel memory buffer for
a sparc mbus/sbus system (which has an iommu).
While entirely different in all regards the cutoff was just over 1k.

The ethernet drivers I wrote did a data copy to/from a pre-mapped
area for both transmit and receive.
I suspect the simplicity of that also improved things.

These days you'd definitely want to map tso buffers.
But the 'copybreak' size for receive could be quite high.

On x86 just make sure the destination address for 'rep movsb'
is 64 byte aligned - it will double the copy speed.
The source alignment doesn't matter at all.
(AMD chips might be different, but an aligned copy of a whole
number of 'words' can always be done.)

I've also wondered whether the ethernet driver could 'hold' the
iommu page table entries after (eg) a receive frame is processed
and then drop the PA of the replacement buffer into the same slot.
That is likely to speed up iommu setup.

	David

Alexandra Winter Dec. 6, 2024, 3:20 p.m. UTC | #6
On 04.12.24 15:32, Alexander Lobakin wrote:
>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>  {
>>  	struct mlx5e_sq_stats *stats = sq->stats;
>>  
>> +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
>> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
   +		skb_linearize(skb);
> 1. What's with the direct DMA? I believe it would benefit, too?


Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact.
Any opinions from the NVidia people?


> 2. Why truesize, not something like
> 
> 	if (skb->len <= some_sane_value_maybe_1k)


With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page.
If we set the threshold to a smaller value, skb->len makes more sense.


> 
> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think
>    it's a good idea to rely on this.
>    Some test-based hardcode would be enough (i.e. threshold on which
>    DMA mapping starts performing better).


A threshold of 4k is absolutely fine with us (s390).
A threshold of 1k would definitely improve our situation and bring back the performance for some important scenarios.


NVidia people, do you have any opinion on a good threshold?
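
Purely as a strawman for this threshold discussion, a variant of the hunk keyed
off skb->len with a fixed, measurement-based cut-off could look like the sketch
below. The helper name, the constant and the value 256 are assumptions, not a
tested recommendation, and the explicit skb_is_gso() test also covers the GSO
concern raised earlier.

```c
#include <linux/skbuff.h>

#define MLX5E_TX_LINEARIZE_MAX	256	/* assumed copybreak-style threshold */

/* Hypothetical replacement for the check added in mlx5e_sq_xmit_prepare() */
static void mlx5e_maybe_linearize(struct sk_buff *skb)
{
	if (!skb_is_gso(skb) && skb->len <= MLX5E_TX_LINEARIZE_MAX)
		skb_linearize(skb);	/* on failure the skb is simply left as-is */
}
```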
Alexandra Winter Dec. 6, 2024, 3:25 p.m. UTC | #7
On 04.12.24 15:36, Eric Dumazet wrote:
> I would suggest the opposite : copy the headers (typically less than
> 128 bytes) on a piece of coherent memory.
> 
> As a bonus, if skb->len is smaller than 256 bytes, copy the whole skb.
> 
> include/net/tso.h and net/core/tso.c users do this.
> 
> Sure, patch is going to be more invasive, but all arches will win.


Thank you very much for the examples, I think I understand what you are proposing.
I am not sure whether I'm able to map it to the mlx5 driver, but I could
try to come up with an RFC. It may take some time though.

NVidia people, any suggestions? Do you want to handle that yourselves?
Eric Dumazet Dec. 6, 2024, 4:35 p.m. UTC | #8
On Fri, Dec 6, 2024 at 3:48 PM David Laight <David.Laight@aculab.com> wrote:
>
> I've also wondered whether the ethernet driver could 'hold' the
> iommu page table entries after (eg) a receive frame is processed
> and then drop the PA of the replacement buffer into the same slot.
> That is likely to speed up iommu setup.

This has been done a long time ago (Alexander Duyck page refcount
trick in Intel drivers)

Then put into generic page_pool

https://static.lwn.net/kerneldoc/networking/page_pool.html
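
For readers unfamiliar with it, a minimal page_pool setup with PP_FLAG_DMA_MAP
looks roughly like the sketch below: pages are DMA-mapped once when they enter
the pool and keep their mapping while they recycle, which is the generic form of
the "hold the IOMMU entry and reuse the slot" idea. Parameter values are
illustrative, and the header layout differs between kernel versions.

```c
#include <net/page_pool/helpers.h>	/* <net/page_pool.h> on older kernels */

static struct page_pool *rx_pool_create(struct device *dev)
{
	struct page_pool_params pp = {
		.order		= 0,
		.flags		= PP_FLAG_DMA_MAP,	/* pool owns the DMA mapping */
		.pool_size	= 1024,			/* illustrative ring size */
		.nid		= NUMA_NO_NODE,
		.dev		= dev,
		.dma_dir	= DMA_FROM_DEVICE,
	};

	return page_pool_create(&pp);
}

/*
 * Typical use in an RX path:
 *	page = page_pool_dev_alloc_pages(pool);
 *	dma  = page_pool_get_dma_addr(page);	// already mapped, no IOMMU work
 *	...post dma to the NIC, and on completion recycle with...
 *	page_pool_put_full_page(pool, page, false);
 */
```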
Tariq Toukan Dec. 9, 2024, 11:36 a.m. UTC | #9
On 06/12/2024 17:20, Alexandra Winter wrote:
> 
> 
> On 04.12.24 15:32, Alexander Lobakin wrote:
>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>>   {
>>>   	struct mlx5e_sq_stats *stats = sq->stats;
>>>   
>>> +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
>>> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
>     +		skb_linearize(skb);
>> 1. What's with the direct DMA? I believe it would benefit, too?
> 
> 
> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact.
> Any opinions from the NVidia people?
> 
> 
>> 2. Why truesize, not something like
>>
>> 	if (skb->len <= some_sane_value_maybe_1k)
> 
> 
> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page.
> When we set the threshhold at a smaller value, skb->len makes more sense
> 
> 
>>
>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think
>>     it's a good idea to rely on this.
>>     Some test-based hardcode would be enough (i.e. threshold on which
>>     DMA mapping starts performing better).
> 
> 
> A threshhold of 4k is absolutely fine with us (s390).
> A threshhold of 1k would definitvely improve our situation and bring back the performance for some important scenarios.
> 
> 
> NVidia people do you have any opinion on a good threshhold?
> 

Hi,

Many approaches in the past few years are going the opposite direction, 
trying to avoid copies ("zero-copy").

In many cases, copy up to PAGE_SIZE means copy everything.
For high NIC speeds this is not realistic.

Anyway, based on past experience, threshold should not exceed "max 
header size" (128/256b).
Dragos Tatulea Dec. 10, 2024, 11:44 a.m. UTC | #10
On 06.12.24 16:20, Alexandra Winter wrote:
> 
> 
> On 04.12.24 15:32, Alexander Lobakin wrote:
>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>>  {
>>>  	struct mlx5e_sq_stats *stats = sq->stats;
>>>  
>>> +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
>>> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
>    +		skb_linearize(skb);
>> 1. What's with the direct DMA? I believe it would benefit, too?
> 
> 
> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact.
> Any opinions from the NVidia people?
> 
Agreed.

> 
>> 2. Why truesize, not something like
>>
>> 	if (skb->len <= some_sane_value_maybe_1k)
> 
> 
> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page.
> When we set the threshhold at a smaller value, skb->len makes more sense
> 
> 
>>
>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think
>>    it's a good idea to rely on this.
>>    Some test-based hardcode would be enough (i.e. threshold on which
>>    DMA mapping starts performing better).
> 
> 
> A threshhold of 4k is absolutely fine with us (s390). 
> A threshhold of 1k would definitvely improve our situation and bring back the performance for some important scenarios.
>
> 
> NVidia people do you have any opinion on a good threshhold?
> 
1KB is still too large. As Tariq mentioned, the threshold should not
exceed 128/256B. I am currently testing this with 256B on x86. So far no
regressions but I need to play with it more.

Thanks,
Dragos
Dragos Tatulea Dec. 10, 2024, 11:49 a.m. UTC | #11
On 06.12.24 16:25, Alexandra Winter wrote:
> 
> 
> On 04.12.24 15:36, Eric Dumazet wrote:
>> I would suggest the opposite : copy the headers (typically less than
>> 128 bytes) on a piece of coherent memory.
>>
>> As a bonus, if skb->len is smaller than 256 bytes, copy the whole skb.
>>
>> include/net/tso.h and net/core/tso.c users do this.
>>
>> Sure, patch is going to be more invasive, but all arches will win.
> 
> 
> Thank you very much for the examples, I think I understand what you are proposing.
> I am not sure whether I'm able to map it to the mlx5 driver, but I could
> try to come up with a RFC. It may take some time though.
> 
> NVidia people, any suggesttions? Do you want to handle that yourselves?
> 
Discussed with Saeed and he proposed another approach that is better for
us: copy the whole skb payload inline into the WQE if its size is below a
threshold. This threshold can be configured through the
tx-copybreak mechanism.

Thanks,
Dragos
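
For context, tx-copybreak is the standard ethtool tunable (set with
"ethtool --set-tunable <dev> tx-copybreak N", ETHTOOL_TX_COPYBREAK on the kernel
side), wired up through the get_tunable/set_tunable ethtool_ops. The sketch below
shows that plumbing in a generic form; demo_priv, its tx_copybreak field and the
512-byte bound are assumptions, not existing mlx5e code.

```c
#include <linux/ethtool.h>
#include <linux/netdevice.h>

struct demo_priv {
	u32 tx_copybreak;	/* read by the TX path to decide on copying/inlining */
};

static int demo_get_tunable(struct net_device *dev,
			    const struct ethtool_tunable *tuna, void *data)
{
	struct demo_priv *priv = netdev_priv(dev);

	switch (tuna->id) {
	case ETHTOOL_TX_COPYBREAK:
		*(u32 *)data = priv->tx_copybreak;
		return 0;
	default:
		return -EOPNOTSUPP;
	}
}

static int demo_set_tunable(struct net_device *dev,
			    const struct ethtool_tunable *tuna, const void *data)
{
	struct demo_priv *priv = netdev_priv(dev);
	u32 val = *(const u32 *)data;

	if (tuna->id != ETHTOOL_TX_COPYBREAK)
		return -EOPNOTSUPP;
	if (val > 512)		/* assumed upper bound for the copy/inline path */
		return -EINVAL;

	priv->tx_copybreak = val;
	return 0;
}

static const struct ethtool_ops demo_ethtool_ops = {
	.get_tunable	= demo_get_tunable,
	.set_tunable	= demo_set_tunable,
};
```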
Alexander Lobakin Dec. 10, 2024, 1:54 p.m. UTC | #12
From: Dragos Tatulea <dtatulea@nvidia.com>
Date: Tue, 10 Dec 2024 12:44:04 +0100

> 
> 
> On 06.12.24 16:20, Alexandra Winter wrote:
>>
>>
>> On 04.12.24 15:32, Alexander Lobakin wrote:
>>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>>>  {
>>>>  	struct mlx5e_sq_stats *stats = sq->stats;
>>>>  
>>>> +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
>>>> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
>>    +		skb_linearize(skb);
>>> 1. What's with the direct DMA? I believe it would benefit, too?
>>
>>
>> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact.
>> Any opinions from the NVidia people?
>>
> Agreed.
> 
>>
>>> 2. Why truesize, not something like
>>>
>>> 	if (skb->len <= some_sane_value_maybe_1k)
>>
>>
>> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page.
>> When we set the threshhold at a smaller value, skb->len makes more sense
>>
>>
>>>
>>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think
>>>    it's a good idea to rely on this.
>>>    Some test-based hardcode would be enough (i.e. threshold on which
>>>    DMA mapping starts performing better).
>>
>>
>> A threshhold of 4k is absolutely fine with us (s390). 
>> A threshhold of 1k would definitvely improve our situation and bring back the performance for some important scenarios.
>>
>>
>> NVidia people do you have any opinion on a good threshhold?
>>
> 1KB is still to large. As Tariq mentioned, the threshold should not
> exceed 128/256B. I am currently testing this with 256B on x86. So far no
> regressions but I need to play with it more.

On different setups, usually the copybreak of 192 or 256 bytes was the
most efficient as well.

> 
> Thanks,
> Dragos

Thanks,
Olek
Joe Damato Dec. 10, 2024, 5:10 p.m. UTC | #13
On Tue, Dec 10, 2024 at 02:54:26PM +0100, Alexander Lobakin wrote:
> From: Dragos Tatulea <dtatulea@nvidia.com>
> Date: Tue, 10 Dec 2024 12:44:04 +0100
> 
> > 
> > 
> > On 06.12.24 16:20, Alexandra Winter wrote:
> >>
> >>
> >> On 04.12.24 15:32, Alexander Lobakin wrote:
> >>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> >>>>  {
> >>>>  	struct mlx5e_sq_stats *stats = sq->stats;
> >>>>  
> >>>> +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
> >>>> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
> >>    +		skb_linearize(skb);
> >>> 1. What's with the direct DMA? I believe it would benefit, too?
> >>
> >>
> >> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact.
> >> Any opinions from the NVidia people?
> >>
> > Agreed.
> > 
> >>
> >>> 2. Why truesize, not something like
> >>>
> >>> 	if (skb->len <= some_sane_value_maybe_1k)
> >>
> >>
> >> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page.
> >> When we set the threshhold at a smaller value, skb->len makes more sense
> >>
> >>
> >>>
> >>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think
> >>>    it's a good idea to rely on this.
> >>>    Some test-based hardcode would be enough (i.e. threshold on which
> >>>    DMA mapping starts performing better).
> >>
> >>
> >> A threshhold of 4k is absolutely fine with us (s390). 
> >> A threshhold of 1k would definitvely improve our situation and bring back the performance for some important scenarios.
> >>
> >>
> >> NVidia people do you have any opinion on a good threshhold?
> >>
> > 1KB is still to large. As Tariq mentioned, the threshold should not
> > exceed 128/256B. I am currently testing this with 256B on x86. So far no
> > regressions but I need to play with it more.
> 
> On different setups, usually the copybreak of 192 or 256 bytes was the
> most efficient as well.

A minor suggestion:

Would it be at all possible for the people who've run these
experiments to document their findings somewhere: what the different
test setups were, what the copybreak settings were, what the
results were, and how they were measured?

Some drivers have a few details documented in
Documentation/networking/device_drivers/ethernet/, but if others
could do this too, like mlx5, in detail so findings could be
reproduced by others, that would be amazing.
Alexandra Winter Dec. 11, 2024, 1:35 p.m. UTC | #14
On 10.12.24 14:54, Alexander Lobakin wrote:
> From: Dragos Tatulea <dtatulea@nvidia.com>
> Date: Tue, 10 Dec 2024 12:44:04 +0100
> 
>>
>>
>> On 06.12.24 16:20, Alexandra Winter wrote:
>>>
>>>
>>> On 04.12.24 15:32, Alexander Lobakin wrote:
>>>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>>>>  {
>>>>>  	struct mlx5e_sq_stats *stats = sq->stats;
>>>>>  
>>>>> +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
>>>>> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
>>>    +		skb_linearize(skb);
>>>> 1. What's with the direct DMA? I believe it would benefit, too?
>>>
>>>
>>> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact.
>>> Any opinions from the NVidia people?
>>>
>> Agreed.
>>
>>>
>>>> 2. Why truesize, not something like
>>>>
>>>> 	if (skb->len <= some_sane_value_maybe_1k)
>>>
>>>
>>> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page.
>>> When we set the threshhold at a smaller value, skb->len makes more sense
>>>
>>>
>>>>
>>>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think
>>>>    it's a good idea to rely on this.
>>>>    Some test-based hardcode would be enough (i.e. threshold on which
>>>>    DMA mapping starts performing better).
>>>
>>>
>>> A threshhold of 4k is absolutely fine with us (s390). 
>>> A threshhold of 1k would definitvely improve our situation and bring back the performance for some important scenarios.
>>>
>>>
>>> NVidia people do you have any opinion on a good threshhold?
>>>

On 09.12.24 12:36, Tariq Toukan wrote:
> Hi,
> 
> Many approaches in the past few years are going the opposite direction, trying to avoid copies ("zero-copy").
> 
> In many cases, copy up to PAGE_SIZE means copy everything.
> For high NIC speeds this is not realistic.
> 
> Anyway, based on past experience, threshold should not exceed "max header size" (128/256b).

>> 1KB is still to large. As Tariq mentioned, the threshold should not
>> exceed 128/256B. I am currently testing this with 256B on x86. So far no
>> regressions but I need to play with it more.
> 
> On different setups, usually the copybreak of 192 or 256 bytes was the
> most efficient as well.
> 
>>


Thank you very much, everybody for looking into this.

Unfortunately we are seeing these performance regressions with ConnectX-6 cards on s390
and with this patch we get up to 12% more transactions/sec even for 1k messages.

As we're always using an IOMMU and are operating with large multi socket systems,
DMA costs far outweigh the costs of small to medium memcpy()s on our system.
We realize that the recommendation is to just run without IOMMU when performance is important,
but this is not an option in our environment.

I understand that the simple patch of calling skb_linearize() in mlx5 for small messages is not a
strategic direction; it is more a simple mitigation of our problem.

Whether it should be restricted to use_dma_iommu() or not is up to you and your measurements.
We could even restrict it to S390 arch, if you want.

A threshold of 256b would cover some amount of our typical request-response workloads
(think database queries/updates), but we would prefer a higher number (e.g. 1k or 2k).
Could we maybe define an architecture-dependent threshold?
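
One possible shape for such an architecture-dependent default, purely
illustrative (the symbol name and both values are assumptions, and whether
something like this is acceptable upstream is exactly the open question):

```c
/* Hypothetical default for a tx-copybreak/linearize threshold */
#if IS_ENABLED(CONFIG_S390)
#define MLX5E_TX_COPYBREAK_DEFAULT	1024	/* IOMMU always in use, copies are cheap */
#else
#define MLX5E_TX_COPYBREAK_DEFAULT	256	/* matches common copybreak experience */
#endif
```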

My preferred scenario for the next steps would be the following:
1) It would be great if we could get a simple mitigation patch upstream that the distros could
easily backport. This would avoid our customers experiencing performance regressions when they
upgrade their distro versions (e.g. from RHEL8 to RHEL9 or RHEL10).

2) Then we could work on a more invasive solution. (see proposals by Eric, Dragos/Saeed).
This would then replace the simple mitigation patch upstream, and future releases would contain 
that solution. If somebody else wants to work on this improved version, fine with me, otherwise
I could give it a try.

What do you think of this approach?
Alexandra Winter Dec. 11, 2024, 4:19 p.m. UTC | #15
On 10.12.24 12:49, Dragos Tatulea wrote:
> 
> 
> On 06.12.24 16:25, Alexandra Winter wrote:
>>
>>
>> On 04.12.24 15:36, Eric Dumazet wrote:
>>> I would suggest the opposite : copy the headers (typically less than
>>> 128 bytes) on a piece of coherent memory.
>>>
>>> As a bonus, if skb->len is smaller than 256 bytes, copy the whole skb.
>>>
>>> include/net/tso.h and net/core/tso.c users do this.
>>>
>>> Sure, patch is going to be more invasive, but all arches will win.
>>
>>
>> Thank you very much for the examples, I think I understand what you are proposing.
>> I am not sure whether I'm able to map it to the mlx5 driver, but I could
>> try to come up with a RFC. It may take some time though.
>>
>> NVidia people, any suggesttions? Do you want to handle that yourselves?
>>
> Discussed with Saeed and he proposed another approach that is better for
> us: copy the whole skb payload inline into the WQE if it's size is below a
> threshold. This threshold can be configured through the
> tx-copybreak mechanism.
> 
> Thanks,
> Dragos


Thank you very much, Dragos and Saeed.
I am not sure I understand the details of "inline into the WQE". 
The idea seems to be to use a premapped coherent array per WQ 
that is indexed by queue element index and can be used to copy headers and
maybe small messages into.
I think I see something similar to your proposal in mlx4 (?).
To me the general concept seems to be similar to what Eric is proposing.
Did I get it right?

I really like the idea to use tx-copybreak for threshold configuration.

As Eric mentioned, that is not a very small patch and maybe not fit for backporting
to older distro versions.
What do you think of a two-step approach as described in the other sub-thread?
A simple patch for mitigation that can be backported, and then the improvement
as a replacement?

Thanks,
Alexandra
Dragos Tatulea Dec. 11, 2024, 5:28 p.m. UTC | #16
Hi Alexandra,

On 11.12.24 14:35, Alexandra Winter wrote:
> 
> 
> On 10.12.24 14:54, Alexander Lobakin wrote:
>> From: Dragos Tatulea <dtatulea@nvidia.com>
>> Date: Tue, 10 Dec 2024 12:44:04 +0100
>>
>>>
>>>
>>> On 06.12.24 16:20, Alexandra Winter wrote:
>>>>
>>>>
>>>> On 04.12.24 15:32, Alexander Lobakin wrote:
>>>>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>>>>>  {
>>>>>>  	struct mlx5e_sq_stats *stats = sq->stats;
>>>>>>  
>>>>>> +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
>>>>>> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
>>>>    +		skb_linearize(skb);
>>>>> 1. What's with the direct DMA? I believe it would benefit, too?
>>>>
>>>>
>>>> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact.
>>>> Any opinions from the NVidia people?
>>>>
>>> Agreed.
>>>
>>>>
>>>>> 2. Why truesize, not something like
>>>>>
>>>>> 	if (skb->len <= some_sane_value_maybe_1k)
>>>>
>>>>
>>>> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page.
>>>> When we set the threshhold at a smaller value, skb->len makes more sense
>>>>
>>>>
>>>>>
>>>>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think
>>>>>    it's a good idea to rely on this.
>>>>>    Some test-based hardcode would be enough (i.e. threshold on which
>>>>>    DMA mapping starts performing better).
>>>>
>>>>
>>>> A threshhold of 4k is absolutely fine with us (s390). 
>>>> A threshhold of 1k would definitvely improve our situation and bring back the performance for some important scenarios.
>>>>
>>>>
>>>> NVidia people do you have any opinion on a good threshhold?
>>>>
> 
> On 09.12.24 12:36, Tariq Toukan wrote:
>> Hi,
>>
>> Many approaches in the past few years are going the opposite direction, trying to avoid copies ("zero-copy").
>>
>> In many cases, copy up to PAGE_SIZE means copy everything.
>> For high NIC speeds this is not realistic.
>>
>> Anyway, based on past experience, threshold should not exceed "max header size" (128/256b).
> 
>>> 1KB is still to large. As Tariq mentioned, the threshold should not
>>> exceed 128/256B. I am currently testing this with 256B on x86. So far no
>>> regressions but I need to play with it more.
I checked on an x86 system with CX7 and we seem to get ~4% degradation
when using this approach. The patch was modified a bit according to
previous discussions (diff at end of mail).

Here's how I tested:
- uperf client side has many queues.
- uperf server side has single queue with interrupts pinned to a single
  CPU. This is to better isolate CPU behaviour. The idea is to have the
  CPU on the server side saturated or close to saturation.
- Used the uperf 1B read x 1B write scenario with 50 and 100 threads
  (profile attached).
  Both the on-cpu and off-cpu cases were checked.
- Code change was done only on server side.

The results:
```
Case:                                          Throughput   Operations
----------------------------------------------------------------------
app cpu == irq cpu, nthreads= 25, baseline     3.86Mb/s     30036552  
app cpu == irq cpu, nthreads= 25, skb_linear   3.52Mb/s     26969315  

app cpu == irq cpu, nthreads= 50, baseline     4.26Mb/s     33122458 
app cpu == irq cpu, nthreads= 50, skb_linear   4.06Mb/s     31606290 

app cpu == irq cpu, nthreads=100, baseline     4.08Mb/s     31775434  
app cpu == irq cpu, nthreads=100, skb_linear   3.86Mb/s     30105582

app cpu != irq cpu, nthreads= 25, baseline     3.57Mb/s     27785488
app cpu != irq cpu, nthreads= 25, skb_linear   3.56Mb/s     27730133

app cpu != irq cpu, nthreads= 50, baseline     3.97Mb/s     30876264
app cpu != irq cpu, nthreads= 50, skb_linear   3.82Mb/s     29737654

app cpu != irq cpu, nthreads=100, baseline     4.06Mb/s     31621140
app cpu != irq cpu, nthreads=100, skb_linear   3.90Mb/s     30364382
```

So not encouraging. I restricted the tests to 1B payloads only as I
expected worse perf for larger payloads.

>>
>> On different setups, usually the copybreak of 192 or 256 bytes was the
>> most efficient as well.
>>
>>>
> 
> 
> Thank you very much, everybody for looking into this.
> 
> Unfortunately we are seeing these performance regressions with ConnectX-6 cards on s390
> and with this patch we get up to 12% more transactions/sec even for 1k messages.
> 
> As we're always using an IOMMU and are operating with large multi socket systems,
> DMA costs far outweigh the costs of small to medium memcpy()s on our system.
> We realize that the recommendation is to just run without IOMMU when performance is important,
> but this is not an option in our environment.
> 
> I understand that the simple patch of calling skb_linearize() in mlx5 for small messages is not a
> strategic direction, it is more a simple mitigation of our problem.
> 
> Whether it should be restricted to use_dma_iommu() or not is up to you and your measurements.
> We could even restrict it to S390 arch, if you want.
>
Maybe Tariq can comment on this.

> A threshold of 256b would cover some amount of our typical request-response workloads
> (think database queries/updates), but we would prefer a higher number (e.g. 1k or 2k).
> Could we maybe define an architecture dependent threshold?
> 
> My preferred scenario for the next steps would be the following:
> 1) It would be great if we could get a simple mitigation patch upstream, that the distros could
> easily backport. This would avoid our customers experiencing performance regression when they
> upgrade their distro versions. (e.g. from RHEL8 to RHEL9 or RHEL10 just as an example)
>
Stupid question on my behalf: why can't this patch be taken as a distro
patch for s390 and carried over across releases? This way the kernel
upgrade pain would be avoided.

> 2) Then we could work on a more invasive solution. (see proposals by Eric, Dragos/Saeed).
> This would then replace the simple mitigation patch upstream, and future releases would contain 
> that solution. If somebody else wants to work on this improved version, fine with me, otherwise
> I could give it a try.
> 
For the inline WQE solution I don't think we have a large amount of space
to copy so much data into. For Eric's side buffer suggestion it will be
even more invasive and it will touch many more code paths...

> What do you think of this approach?
> 
>
Sounds tricky. Let's see what Tariq has to say.

Thanks,
Dragos

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index f8c7912abe0e..cc947daa538b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -269,6 +269,9 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 {
        struct mlx5e_sq_stats *stats = sq->stats;
 
+       if (skb->len < 256)
+               skb_linearize(skb);
+
        if (skb_is_gso(skb)) {
                int hopbyhop;
                u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
Dragos Tatulea Dec. 11, 2024, 5:36 p.m. UTC | #17
Hi Alexandra,

On 11.12.24 17:19, Alexandra Winter wrote:
> 
> 
> On 10.12.24 12:49, Dragos Tatulea wrote:
>>
>>
>> On 06.12.24 16:25, Alexandra Winter wrote:
>>>
>>>
>>> On 04.12.24 15:36, Eric Dumazet wrote:
>>>> I would suggest the opposite : copy the headers (typically less than
>>>> 128 bytes) on a piece of coherent memory.
>>>>
>>>> As a bonus, if skb->len is smaller than 256 bytes, copy the whole skb.
>>>>
>>>> include/net/tso.h and net/core/tso.c users do this.
>>>>
>>>> Sure, patch is going to be more invasive, but all arches will win.
>>>
>>>
>>> Thank you very much for the examples, I think I understand what you are proposing.
>>> I am not sure whether I'm able to map it to the mlx5 driver, but I could
>>> try to come up with a RFC. It may take some time though.
>>>
>>> NVidia people, any suggesttions? Do you want to handle that yourselves?
>>>
>> Discussed with Saeed and he proposed another approach that is better for
>> us: copy the whole skb payload inline into the WQE if it's size is below a
>> threshold. This threshold can be configured through the
>> tx-copybreak mechanism.
>>
>> Thanks,
>> Dragos
> 
> 
> Thank you very much Dargos and Saeed.
> I am not sure I understand the details of "inline into the WQE". 
> The idea seems to be to use a premapped coherent array per WQ 
> that is indexed by queue element index and can be used to copy headers and
> maybe small messages into.
> I think I see something similar to your proposal in mlx4 (?).
I think so, yes.

> To me the general concept seems to be similar to what Eric is proposing.
> Did I get it right?
>
AFAIU, it's not quite the same thing. With Eric's proposal we'd use an
additional premapped buffer, copy data there and submit WQEs with
pointers to this buffer. The inline proposal is to copy the data inline
in the WQE directly, without the need for an additional buffer.
To understand if this is feasible, I still need to find out what the
actual maximum inline space is.
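
As a very rough sketch of the inline idea: mlx5e already copies packet headers
into the WQE's Ethernet segment via inline_hdr (those field names exist in
struct mlx5_wqe_eth_seg), and the proposal would extend that copy to the whole
payload of small skbs. WQEBB accounting, padding and the actual space limit,
i.e. the open question above, are deliberately left out, so treat this as
conceptual only.

```c
#include <linux/mlx5/qp.h>
#include <linux/skbuff.h>

/* Conceptual only; not valid as-is without the surrounding WQE size logic. */
static void sketch_inline_small_skb(struct mlx5_wqe_eth_seg *eseg,
				    struct sk_buff *skb, u16 copybreak)
{
	if (skb_is_gso(skb) || skb->len > copybreak)
		return;

	eseg->inline_hdr.sz = cpu_to_be16(skb->len);
	/*
	 * The inline area starts at inline_hdr.start and intentionally spills
	 * over into the data segments that follow the eth segment in the WQE;
	 * the caller must have reserved enough WQEBBs for that.
	 */
	skb_copy_bits(skb, 0, eseg->inline_hdr.start, skb->len);
}
```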

> I really like the idea to use tx-copybreak for threshold configuration.
> 
That's good to know. 
> As Eric mentioned that is not a very small patch and maybe not fit for backporting
> to older distro versions.
> What do you think of a two-step approach as described in the other sub-thread?
> A simple patch for mitigation that can be backported, and then the improvement
> as a replacement?
As stated in the previous mail, let's see what Tariq has to say about
doing some arch specific fix. I am not very optimistic though...

Thanks,
Dragos
Niklas Schnelle Dec. 11, 2024, 5:50 p.m. UTC | #18
On Wed, 2024-12-11 at 18:28 +0100, Dragos Tatulea wrote:
> > > > > 

---8<---

> 
> > On 09.12.24 12:36, Tariq Toukan wrote:
> > > Hi,
> > > 
> > > Many approaches in the past few years are going the opposite direction, trying to avoid copies ("zero-copy").
> > > 
> > > In many cases, copy up to PAGE_SIZE means copy everything.
> > > For high NIC speeds this is not realistic.
> > > 
> > > Anyway, based on past experience, threshold should not exceed "max header size" (128/256b).
> > 
> > > > 1KB is still to large. As Tariq mentioned, the threshold should not
> > > > exceed 128/256B. I am currently testing this with 256B on x86. So far no
> > > > regressions but I need to play with it more.
> I checked on a x86 system with CX7 and we seem to get ~4% degradation
> when using this approach. The patch was modified a bit according to
> previous discussions (diff at end of mail).
> 
> Here's how I tested:
> - uperf client side has many queues.
> - uperf server side has single queue with interrupts pinned to a single
>   CPU. This is to better isolate CPU behaviour. The idea is to have the
>   CPU on the server side saturated or close to saturation.
> - Used the uperf 1B read x 1B write scenario with 50 and 100 threads
>   (profile attached).
>   Both the on-cpu and off-cpu cases were checked.
> - Code change was done only on server side.

I'm assuming this is with the x86 default IOMMU pass-through mode?
Would you be able and willing to try with iommu.passthrough=0
and amd_iommu=on or intel_iommu=on, respectively? Check
/sys/bus/pci/devices/<dev>/iommu_group/type for "DMA-FQ" to make sure
the dma-iommu code is used. This is obviously not a "tuned for all-out
perf at any cost" configuration, but it is recommended in hardening
guides, and I believe some ARM64 systems also default to using the IOMMU
for bare-metal DMA API use. So it's not an unexpected configuration
either.

Thanks,
Niklas
Christian Borntraeger Dec. 12, 2024, 10:36 a.m. UTC | #19
Am 11.12.24 um 18:28 schrieb Dragos Tatulea:

>> My preferred scenario for the next steps would be the following:
>> 1) It would be great if we could get a simple mitigation patch upstream, that the distros could
>> easily backport. This would avoid our customers experiencing performance regression when they
>> upgrade their distro versions. (e.g. from RHEL8 to RHEL9 or RHEL10 just as an example)
>>
> Stupid question on my behalf: why can't this patch be taken as a distro
> patch for s390 and carried over over releases? This way the kernel
> upgrade pain would be avoided.

This is not how distros work today. All code/patches must come from upstream
and all code/patches must be cross-architecture. There are exceptions, but
only for very critical things.

Furthermore, the right answer to avoid upgrade pain is to fix things upstream.
So let's try to find a solution that we can integrate.

Christian
Dragos Tatulea Dec. 13, 2024, 8:41 p.m. UTC | #20
On 11.12.24 18:50, Niklas Schnelle wrote:
> On Wed, 2024-12-11 at 18:28 +0100, Dragos Tatulea wrote:
>>>>>>
> 
> ---8<---
> 
>>
>>> On 09.12.24 12:36, Tariq Toukan wrote:
>>>> Hi,
>>>>
>>>> Many approaches in the past few years are going the opposite direction, trying to avoid copies ("zero-copy").
>>>>
>>>> In many cases, copy up to PAGE_SIZE means copy everything.
>>>> For high NIC speeds this is not realistic.
>>>>
>>>> Anyway, based on past experience, threshold should not exceed "max header size" (128/256b).
>>>
>>>>> 1KB is still to large. As Tariq mentioned, the threshold should not
>>>>> exceed 128/256B. I am currently testing this with 256B on x86. So far no
>>>>> regressions but I need to play with it more.
>> I checked on a x86 system with CX7 and we seem to get ~4% degradation
>> when using this approach. The patch was modified a bit according to
>> previous discussions (diff at end of mail).
>>
>> Here's how I tested:
>> - uperf client side has many queues.
>> - uperf server side has single queue with interrupts pinned to a single
>>   CPU. This is to better isolate CPU behaviour. The idea is to have the
>>   CPU on the server side saturated or close to saturation.
>> - Used the uperf 1B read x 1B write scenario with 50 and 100 threads
>>   (profile attached).
>>   Both the on-cpu and off-cpu cases were checked.
>> - Code change was done only on server side.
> 
> I'm assuming this is with the x86 default IOMMU pass-through mode?
It was in a VM with PCI passthrough for the device.

> Would you be able and willing to try with iommu.passthrough=0
> and amd_iommu=on respectively intel_iommu=on? Check
> /sys/bus/pci/devices/<dev>/iommu_group/type for "DMA-FQ" to make sure
> the dma-iommu code is used. This is obviously not a "tuned for all out
> perf at any cost" configuration but it is recommended in hardening
> guides and I believe some ARM64 systems also default to using the IOMMU
> for bare metal DMA API use. So it's not an unexpected configuration
> either.
> 
I got hold of a bare metal system where I could turn IOMMU passthrough
off and confirm iommu_group/type as being DMA-FQ. But the results are
inconclusive due to instability. I will look into this again after the
holidays.

Thanks,
Dragos

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index f8c7912abe0e..421ba6798ca7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -32,6 +32,7 @@ 
 
 #include <linux/tcp.h>
 #include <linux/if_vlan.h>
+#include <linux/iommu-dma.h>
 #include <net/geneve.h>
 #include <net/dsfield.h>
 #include "en.h"
@@ -269,6 +270,10 @@  static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 {
 	struct mlx5e_sq_stats *stats = sq->stats;
 
+	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
+	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
+		skb_linearize(skb);
+
 	if (skb_is_gso(skb)) {
 		int hopbyhop;
 		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);