Message ID | 20241204140230.23858-1-wintera@linux.ibm.com (mailing list archive)
---|---
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Series | [net-next] net/mlx5e: Transmit small messages in linear skb
On Wed, Dec 4, 2024 at 3:02 PM Alexandra Winter <wintera@linux.ibm.com> wrote: > > Linearize the skb if the device uses IOMMU and the data buffer can fit > into one page. So messages can be transferred in one transfer to the card > instead of two. > > Performance issue: > ------------------ > Since commit 472c2e07eef0 ("tcp: add one skb cache for tx") > tcp skbs are always non-linear. Especially on platforms with IOMMU, > mapping and unmapping two pages instead of one per transfer can make a > noticeable difference. On s390 we saw a 13% degradation in throughput, > when running uperf with a request-response pattern with 1k payload and > 250 connections parallel. See [0] for a discussion. > > This patch mitigates these effects using a work-around in the mlx5 driver. > > Notes on implementation: > ------------------------ > TCP skbs never contain any tailroom, so skb_linearize() will allocate a > new data buffer. > No need to handle rc of skb_linearize(). If it fails, we continue with the > unchanged skb. > > As mentioned in the discussion, an alternative, but more invasive approach > would be: premapping a coherent piece of memory in which you can copy > small skbs. > > Measurement results: > -------------------- > We see an improvement in throughput of up to 16% compared to kernel v6.12. > We measured throughput and CPU consumption of uperf benchmarks with > ConnectX-6 cards on s390 architecture and compared results of kernel v6.12 > with and without this patch. > > +------------------------------------------+ > | Transactions per Second - Deviation in % | > +-------------------+----------------------+ > | Workload | | > | rr1c-1x1--50 | 4.75 | > | rr1c-1x1-250 | 14.53 | > | rr1c-200x1000--50 | 2.22 | > | rr1c-200x1000-250 | 12.24 | > +-------------------+----------------------+ > | Server CPU Consumption - Deviation in % | > +-------------------+----------------------+ > | Workload | | > | rr1c-1x1--50 | -1.66 | > | rr1c-1x1-250 | -10.00 | > | rr1c-200x1000--50 | -0.83 | > | rr1c-200x1000-250 | -8.71 | > +-------------------+----------------------+ > > Note: > - CPU consumption: less is better > - Client CPU consumption is similar > - Workload: > rr1c-<bytes send>x<bytes received>-<parallel connections> > > Highly transactional small data sizes (rr1c-1x1) > This is a Request & Response (RR) test that sends a 1-byte request > from the client and receives a 1-byte response from the server. This > is the smallest possible transactional workload test and is smaller > than most customer workloads. This test represents the RR overhead > costs. > Highly transactional medium data sizes (rr1c-200x1000) > Request & Response (RR) test that sends a 200-byte request from the > client and receives a 1000-byte response from the server. This test > should be representative of a typical user's interaction with a remote > web site. 
> > Link: https://lore.kernel.org/netdev/20220907122505.26953-1-wintera@linux.ibm.com/#t [0] > Suggested-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> > Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> > Co-developed-by: Nils Hoppmann <niho@linux.ibm.com> > Signed-off-by: Nils Hoppmann <niho@linux.ibm.com> > --- > drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c > index f8c7912abe0e..421ba6798ca7 100644 > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c > @@ -32,6 +32,7 @@ > > #include <linux/tcp.h> > #include <linux/if_vlan.h> > +#include <linux/iommu-dma.h> > #include <net/geneve.h> > #include <net/dsfield.h> > #include "en.h" > @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb, > { > struct mlx5e_sq_stats *stats = sq->stats; > > + /* Don't require 2 IOMMU TLB entries, if one is sufficient */ > + if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE) > + skb_linearize(skb); > + > if (skb_is_gso(skb)) { > int hopbyhop; > u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop); > -- > 2.45.2 Was this tested on x86_64 or any other arch than s390, especially ones with PAGE_SIZE = 65536 ?
From: Alexandra Winter <wintera@linux.ibm.com> Date: Wed, 4 Dec 2024 15:02:30 +0100 > Linearize the skb if the device uses IOMMU and the data buffer can fit > into one page. So messages can be transferred in one transfer to the card > instead of two. I'd expect this to be on the generic level, not copied over the drivers? Not sure about PAGE_SIZE, but I never saw a NIC/driver/platform where copying let's say 256 bytes would be slower than 2x dma_map (even with direct DMA). > > Performance issue: > ------------------ > Since commit 472c2e07eef0 ("tcp: add one skb cache for tx") > tcp skbs are always non-linear. Especially on platforms with IOMMU, > mapping and unmapping two pages instead of one per transfer can make a > noticeable difference. On s390 we saw a 13% degradation in throughput, > when running uperf with a request-response pattern with 1k payload and > 250 connections parallel. See [0] for a discussion. > > This patch mitigates these effects using a work-around in the mlx5 driver. > > Notes on implementation: > ------------------------ > TCP skbs never contain any tailroom, so skb_linearize() will allocate a > new data buffer. > No need to handle rc of skb_linearize(). If it fails, we continue with the > unchanged skb. > > As mentioned in the discussion, an alternative, but more invasive approach > would be: premapping a coherent piece of memory in which you can copy > small skbs. Yes, that one would be better. [...] > @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb, > { > struct mlx5e_sq_stats *stats = sq->stats; > > + /* Don't require 2 IOMMU TLB entries, if one is sufficient */ > + if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE) 1. What's with the direct DMA? I believe it would benefit, too? 2. Why truesize, not something like if (skb->len <= some_sane_value_maybe_1k) 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think it's a good idea to rely on this. Some test-based hardcode would be enough (i.e. threshold on which DMA mapping starts performing better). > + skb_linearize(skb); > + > if (skb_is_gso(skb)) { BTW can't there be a case when the skb is GSO, but its truesize is PAGE_SIZE and linearize will be way too slow (not sure it's possible, just guessing)? > int hopbyhop; > u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop); Thanks, Olek
On 04.12.24 15:16, Eric Dumazet wrote: > On Wed, Dec 4, 2024 at 3:02 PM Alexandra Winter <wintera@linux.ibm.com> wrote: >> >> Linearize the skb if the device uses IOMMU and the data buffer can fit >> into one page. So messages can be transferred in one transfer to the card >> instead of two. >> >> Performance issue: >> ------------------ >> Since commit 472c2e07eef0 ("tcp: add one skb cache for tx") >> tcp skbs are always non-linear. Especially on platforms with IOMMU, >> mapping and unmapping two pages instead of one per transfer can make a >> noticeable difference. On s390 we saw a 13% degradation in throughput, >> when running uperf with a request-response pattern with 1k payload and >> 250 connections parallel. See [0] for a discussion. >> >> This patch mitigates these effects using a work-around in the mlx5 driver. >> >> Notes on implementation: >> ------------------------ >> TCP skbs never contain any tailroom, so skb_linearize() will allocate a >> new data buffer. >> No need to handle rc of skb_linearize(). If it fails, we continue with the >> unchanged skb. >> >> As mentioned in the discussion, an alternative, but more invasive approach >> would be: premapping a coherent piece of memory in which you can copy >> small skbs. >> >> Measurement results: >> -------------------- >> We see an improvement in throughput of up to 16% compared to kernel v6.12. >> We measured throughput and CPU consumption of uperf benchmarks with >> ConnectX-6 cards on s390 architecture and compared results of kernel v6.12 >> with and without this patch. >> >> +------------------------------------------+ >> | Transactions per Second - Deviation in % | >> +-------------------+----------------------+ >> | Workload | | >> | rr1c-1x1--50 | 4.75 | >> | rr1c-1x1-250 | 14.53 | >> | rr1c-200x1000--50 | 2.22 | >> | rr1c-200x1000-250 | 12.24 | >> +-------------------+----------------------+ >> | Server CPU Consumption - Deviation in % | >> +-------------------+----------------------+ >> | Workload | | >> | rr1c-1x1--50 | -1.66 | >> | rr1c-1x1-250 | -10.00 | >> | rr1c-200x1000--50 | -0.83 | >> | rr1c-200x1000-250 | -8.71 | >> +-------------------+----------------------+ >> >> Note: >> - CPU consumption: less is better >> - Client CPU consumption is similar >> - Workload: >> rr1c-<bytes send>x<bytes received>-<parallel connections> >> >> Highly transactional small data sizes (rr1c-1x1) >> This is a Request & Response (RR) test that sends a 1-byte request >> from the client and receives a 1-byte response from the server. This >> is the smallest possible transactional workload test and is smaller >> than most customer workloads. This test represents the RR overhead >> costs. >> Highly transactional medium data sizes (rr1c-200x1000) >> Request & Response (RR) test that sends a 200-byte request from the >> client and receives a 1000-byte response from the server. This test >> should be representative of a typical user's interaction with a remote >> web site. 
>> >> Link: https://lore.kernel.org/netdev/20220907122505.26953-1-wintera@linux.ibm.com/#t [0] >> Suggested-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> >> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> >> Co-developed-by: Nils Hoppmann <niho@linux.ibm.com> >> Signed-off-by: Nils Hoppmann <niho@linux.ibm.com> >> --- >> drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 5 +++++ >> 1 file changed, 5 insertions(+) >> >> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c >> index f8c7912abe0e..421ba6798ca7 100644 >> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c >> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c >> @@ -32,6 +32,7 @@ >> >> #include <linux/tcp.h> >> #include <linux/if_vlan.h> >> +#include <linux/iommu-dma.h> >> #include <net/geneve.h> >> #include <net/dsfield.h> >> #include "en.h" >> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb, >> { >> struct mlx5e_sq_stats *stats = sq->stats; >> >> + /* Don't require 2 IOMMU TLB entries, if one is sufficient */ >> + if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE) >> + skb_linearize(skb); >> + >> if (skb_is_gso(skb)) { >> int hopbyhop; >> u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop); >> -- >> 2.45.2 > > > Was this tested on x86_64 or any other arch than s390, especially ones > with PAGE_SIZE = 65536 ? > No, I don't have a mlx5 card in an x86_64. Rahul, could you test this patch?
On Wed, Dec 4, 2024 at 3:16 PM Eric Dumazet <edumazet@google.com> wrote: > > On Wed, Dec 4, 2024 at 3:02 PM Alexandra Winter <wintera@linux.ibm.com> wrote: > > > > Linearize the skb if the device uses IOMMU and the data buffer can fit > > into one page. So messages can be transferred in one transfer to the card > > instead of two. > > > > Performance issue: > > ------------------ > > Since commit 472c2e07eef0 ("tcp: add one skb cache for tx") > > tcp skbs are always non-linear. Especially on platforms with IOMMU, > > mapping and unmapping two pages instead of one per transfer can make a > > noticeable difference. On s390 we saw a 13% degradation in throughput, > > when running uperf with a request-response pattern with 1k payload and > > 250 connections parallel. See [0] for a discussion. > > > > This patch mitigates these effects using a work-around in the mlx5 driver. > > > > Notes on implementation: > > ------------------------ > > TCP skbs never contain any tailroom, so skb_linearize() will allocate a > > new data buffer. > > No need to handle rc of skb_linearize(). If it fails, we continue with the > > unchanged skb. > > > > As mentioned in the discussion, an alternative, but more invasive approach > > would be: premapping a coherent piece of memory in which you can copy > > small skbs. > > > > Measurement results: > > -------------------- > > We see an improvement in throughput of up to 16% compared to kernel v6.12. > > We measured throughput and CPU consumption of uperf benchmarks with > > ConnectX-6 cards on s390 architecture and compared results of kernel v6.12 > > with and without this patch. > > > > +------------------------------------------+ > > | Transactions per Second - Deviation in % | > > +-------------------+----------------------+ > > | Workload | | > > | rr1c-1x1--50 | 4.75 | > > | rr1c-1x1-250 | 14.53 | > > | rr1c-200x1000--50 | 2.22 | > > | rr1c-200x1000-250 | 12.24 | > > +-------------------+----------------------+ > > | Server CPU Consumption - Deviation in % | > > +-------------------+----------------------+ > > | Workload | | > > | rr1c-1x1--50 | -1.66 | > > | rr1c-1x1-250 | -10.00 | > > | rr1c-200x1000--50 | -0.83 | > > | rr1c-200x1000-250 | -8.71 | > > +-------------------+----------------------+ > > > > Note: > > - CPU consumption: less is better > > - Client CPU consumption is similar > > - Workload: > > rr1c-<bytes send>x<bytes received>-<parallel connections> > > > > Highly transactional small data sizes (rr1c-1x1) > > This is a Request & Response (RR) test that sends a 1-byte request > > from the client and receives a 1-byte response from the server. This > > is the smallest possible transactional workload test and is smaller > > than most customer workloads. This test represents the RR overhead > > costs. > > Highly transactional medium data sizes (rr1c-200x1000) > > Request & Response (RR) test that sends a 200-byte request from the > > client and receives a 1000-byte response from the server. This test > > should be representative of a typical user's interaction with a remote > > web site. 
> > > > Link: https://lore.kernel.org/netdev/20220907122505.26953-1-wintera@linux.ibm.com/#t [0] > > Suggested-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> > > Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> > > Co-developed-by: Nils Hoppmann <niho@linux.ibm.com> > > Signed-off-by: Nils Hoppmann <niho@linux.ibm.com> > > --- > > drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 5 +++++ > > 1 file changed, 5 insertions(+) > > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c > > index f8c7912abe0e..421ba6798ca7 100644 > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c > > @@ -32,6 +32,7 @@ > > > > #include <linux/tcp.h> > > #include <linux/if_vlan.h> > > +#include <linux/iommu-dma.h> > > #include <net/geneve.h> > > #include <net/dsfield.h> > > #include "en.h" > > @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb, > > { > > struct mlx5e_sq_stats *stats = sq->stats; > > > > + /* Don't require 2 IOMMU TLB entries, if one is sufficient */ > > + if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE) > > + skb_linearize(skb); > > + > > if (skb_is_gso(skb)) { > > int hopbyhop; > > u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop); > > -- > > 2.45.2 > > > Was this tested on x86_64 or any other arch than s390, especially ones > with PAGE_SIZE = 65536 ? I would suggest the opposite : copy the headers (typically less than 128 bytes) on a piece of coherent memory. As a bonus, if skb->len is smaller than 256 bytes, copy the whole skb. include/net/tso.h and net/core/tso.c users do this. Sure, patch is going to be more invasive, but all arches will win.
From: Eric Dumazet > Sent: 04 December 2024 14:36 ... > I would suggest the opposite : copy the headers (typically less than > 128 bytes) on a piece of coherent memory. A long time ago a colleague tested the cutoff between copying to a fixed buffer and dma access to the kernel memory buffer for a sparc mbus/sbus system (which has an iommu). While entirely different in all regards the cutoff was just over 1k. The ethernet drivers I wrote did a data copy to/from a pre-mapped area for both transmit and receive. I suspect the simplicity of that also improved things. These days you'd definitely want to map tso buffers. But the 'copybreak' size for receive could be quite high. On x86 just make sure the destination address for 'rep movsb' is 64 byte aligned - it will double the copy speed. The source alignment doesn't matter at all. (AMD chips might be different, but an aligned copy of a whole number of 'words' can always be done.) I've also wondered whether the ethernet driver could 'hold' the iommu page table entries after (eg) a receive frame is processed and then drop the PA of the replacement buffer into the same slot. That is likely to speed up iommu setup. David
On 04.12.24 15:32, Alexander Lobakin wrote: >> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb, >> { >> struct mlx5e_sq_stats *stats = sq->stats; >> >> + /* Don't require 2 IOMMU TLB entries, if one is sufficient */ >> + if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE) + skb_linearize(skb); > 1. What's with the direct DMA? I believe it would benefit, too? Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact. Any opinions from the NVidia people? > 2. Why truesize, not something like > > if (skb->len <= some_sane_value_maybe_1k) With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page. When we set the threshold at a smaller value, skb->len makes more sense. > > 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think > it's a good idea to rely on this. > Some test-based hardcode would be enough (i.e. threshold on which > DMA mapping starts performing better). A threshold of 4k is absolutely fine with us (s390). A threshold of 1k would definitely improve our situation and bring back the performance for some important scenarios. NVidia people, do you have any opinion on a good threshold?
On 04.12.24 15:36, Eric Dumazet wrote: > I would suggest the opposite : copy the headers (typically less than > 128 bytes) on a piece of coherent memory. > > As a bonus, if skb->len is smaller than 256 bytes, copy the whole skb. > > include/net/tso.h and net/core/tso.c users do this. > > Sure, patch is going to be more invasive, but all arches will win. Thank you very much for the examples; I think I understand what you are proposing. I am not sure whether I'm able to map it to the mlx5 driver, but I could try to come up with an RFC. It may take some time though. NVidia people, any suggestions? Do you want to handle that yourselves?
On Fri, Dec 6, 2024 at 3:48 PM David Laight <David.Laight@aculab.com> wrote: > > I've also wondered whether the ethernet driver could 'hold' the > iommu page table entries after (eg) a receive frame is processed > and then drop the PA of the replacement buffer into the same slot. > That is likely to speed up iommu setup. This was done a long time ago (Alexander Duyck's page refcount trick in Intel drivers) and was then put into the generic page_pool: https://static.lwn.net/kerneldoc/networking/page_pool.html
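For context, below is a minimal sketch of the page_pool pattern referenced above, where PP_FLAG_DMA_MAP keeps pages DMA-mapped across recycles instead of remapping per packet; the rx_pool_create()/rx_refill_one() helpers and the pool size are made up for illustration, and the include path has moved between kernel versions (<net/page_pool.h> vs. <net/page_pool/helpers.h>):

```c
/* Sketch: a pool that maps pages once and recycles them. Helper names and
 * sizes are assumptions; the header location depends on kernel version. */
#include <net/page_pool/helpers.h>

static struct page_pool *rx_pool_create(struct device *dev)
{
	struct page_pool_params pp = {
		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
		.order		= 0,
		.pool_size	= 1024,			/* assumed RX ring size */
		.nid		= NUMA_NO_NODE,
		.dev		= dev,			/* used for the one-time dma_map_page() */
		.dma_dir	= DMA_FROM_DEVICE,
		.max_len	= PAGE_SIZE,
		.offset		= 0,
	};

	return page_pool_create(&pp);	/* ERR_PTR() on failure */
}

/* Refill path: the page comes back already mapped, so only the stored
 * DMA address has to be written into the RX descriptor. */
static dma_addr_t rx_refill_one(struct page_pool *pool, struct page **pagep)
{
	struct page *page = page_pool_dev_alloc_pages(pool);

	if (!page)
		return 0;
	*pagep = page;
	return page_pool_get_dma_addr(page);
}
```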
On 06/12/2024 17:20, Alexandra Winter wrote: > > > On 04.12.24 15:32, Alexander Lobakin wrote: >>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb, >>> { >>> struct mlx5e_sq_stats *stats = sq->stats; >>> >>> + /* Don't require 2 IOMMU TLB entries, if one is sufficient */ >>> + if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE) > + skb_linearize(skb); >> 1. What's with the direct DMA? I believe it would benefit, too? > > > Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact. > Any opinions from the NVidia people? > > >> 2. Why truesize, not something like >> >> if (skb->len <= some_sane_value_maybe_1k) > > > With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page. > When we set the threshhold at a smaller value, skb->len makes more sense > > >> >> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think >> it's a good idea to rely on this. >> Some test-based hardcode would be enough (i.e. threshold on which >> DMA mapping starts performing better). > > > A threshhold of 4k is absolutely fine with us (s390). > A threshhold of 1k would definitvely improve our situation and bring back the performance for some important scenarios. > > > NVidia people do you have any opinion on a good threshhold? > Hi, Many approaches in the past few years are going the opposite direction, trying to avoid copies ("zero-copy"). In many cases, copy up to PAGE_SIZE means copy everything. For high NIC speeds this is not realistic. Anyway, based on past experience, threshold should not exceed "max header size" (128/256b).
On 06.12.24 16:20, Alexandra Winter wrote: > > > On 04.12.24 15:32, Alexander Lobakin wrote: >>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb, >>> { >>> struct mlx5e_sq_stats *stats = sq->stats; >>> >>> + /* Don't require 2 IOMMU TLB entries, if one is sufficient */ >>> + if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE) > + skb_linearize(skb); >> 1. What's with the direct DMA? I believe it would benefit, too? > > > Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact. > Any opinions from the NVidia people? > Agreed. > >> 2. Why truesize, not something like >> >> if (skb->len <= some_sane_value_maybe_1k) > > > With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page. > When we set the threshhold at a smaller value, skb->len makes more sense > > >> >> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think >> it's a good idea to rely on this. >> Some test-based hardcode would be enough (i.e. threshold on which >> DMA mapping starts performing better). > > > A threshhold of 4k is absolutely fine with us (s390). > A threshhold of 1k would definitvely improve our situation and bring back the performance for some important scenarios. > > > NVidia people do you have any opinion on a good threshhold? > 1KB is still too large. As Tariq mentioned, the threshold should not exceed 128/256B. I am currently testing this with 256B on x86. So far no regressions but I need to play with it more. Thanks, Dragos
On 06.12.24 16:25, Alexandra Winter wrote: > > > On 04.12.24 15:36, Eric Dumazet wrote: >> I would suggest the opposite : copy the headers (typically less than >> 128 bytes) on a piece of coherent memory. >> >> As a bonus, if skb->len is smaller than 256 bytes, copy the whole skb. >> >> include/net/tso.h and net/core/tso.c users do this. >> >> Sure, patch is going to be more invasive, but all arches will win. > > > Thank you very much for the examples, I think I understand what you are proposing. > I am not sure whether I'm able to map it to the mlx5 driver, but I could > try to come up with a RFC. It may take some time though. > > NVidia people, any suggesttions? Do you want to handle that yourselves? > Discussed with Saeed and he proposed another approach that is better for us: copy the whole skb payload inline into the WQE if its size is below a threshold. This threshold can be configured through the tx-copybreak mechanism. Thanks, Dragos
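For reference, tx-copybreak here is the standard ethtool tunable (ETHTOOL_TX_COPYBREAK), settable from user space with something like `ethtool --set-tunable <dev> tx-copybreak 256`. A minimal sketch of how a driver can expose it follows; the example_priv layout and example_* names are placeholders, not the mlx5 implementation:

```c
/* Sketch: wiring the ETHTOOL_TX_COPYBREAK tunable to a per-device field.
 * Struct layout and names are assumptions for illustration only. */
#include <linux/ethtool.h>
#include <linux/netdevice.h>

struct example_priv {
	u32 tx_copybreak;	/* bytes; 0 disables copying */
};

static int example_get_tunable(struct net_device *dev,
			       const struct ethtool_tunable *tuna, void *data)
{
	struct example_priv *priv = netdev_priv(dev);

	switch (tuna->id) {
	case ETHTOOL_TX_COPYBREAK:
		*(u32 *)data = priv->tx_copybreak;
		return 0;
	default:
		return -EOPNOTSUPP;
	}
}

static int example_set_tunable(struct net_device *dev,
			       const struct ethtool_tunable *tuna,
			       const void *data)
{
	struct example_priv *priv = netdev_priv(dev);

	switch (tuna->id) {
	case ETHTOOL_TX_COPYBREAK:
		priv->tx_copybreak = *(const u32 *)data;
		return 0;
	default:
		return -EOPNOTSUPP;
	}
}

static const struct ethtool_ops example_ethtool_ops = {
	.get_tunable	= example_get_tunable,
	.set_tunable	= example_set_tunable,
};
```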
From: Dragos Tatulea <dtatulea@nvidia.com> Date: Tue, 10 Dec 2024 12:44:04 +0100 > > > On 06.12.24 16:20, Alexandra Winter wrote: >> >> >> On 04.12.24 15:32, Alexander Lobakin wrote: >>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb, >>>> { >>>> struct mlx5e_sq_stats *stats = sq->stats; >>>> >>>> + /* Don't require 2 IOMMU TLB entries, if one is sufficient */ >>>> + if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE) >> + skb_linearize(skb); >>> 1. What's with the direct DMA? I believe it would benefit, too? >> >> >> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact. >> Any opinions from the NVidia people? >> > Agreed. > >> >>> 2. Why truesize, not something like >>> >>> if (skb->len <= some_sane_value_maybe_1k) >> >> >> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page. >> When we set the threshhold at a smaller value, skb->len makes more sense >> >> >>> >>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think >>> it's a good idea to rely on this. >>> Some test-based hardcode would be enough (i.e. threshold on which >>> DMA mapping starts performing better). >> >> >> A threshhold of 4k is absolutely fine with us (s390). >> A threshhold of 1k would definitvely improve our situation and bring back the performance for some important scenarios. >> >> >> NVidia people do you have any opinion on a good threshhold? >> > 1KB is still to large. As Tariq mentioned, the threshold should not > exceed 128/256B. I am currently testing this with 256B on x86. So far no > regressions but I need to play with it more. On different setups, usually the copybreak of 192 or 256 bytes was the most efficient as well. > > Thanks, > Dragos Thanks, Olek
On Tue, Dec 10, 2024 at 02:54:26PM +0100, Alexander Lobakin wrote: > From: Dragos Tatulea <dtatulea@nvidia.com> > Date: Tue, 10 Dec 2024 12:44:04 +0100 > > > > > > > On 06.12.24 16:20, Alexandra Winter wrote: > >> > >> > >> On 04.12.24 15:32, Alexander Lobakin wrote: > >>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb, > >>>> { > >>>> struct mlx5e_sq_stats *stats = sq->stats; > >>>> > >>>> + /* Don't require 2 IOMMU TLB entries, if one is sufficient */ > >>>> + if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE) > >> + skb_linearize(skb); > >>> 1. What's with the direct DMA? I believe it would benefit, too? > >> > >> > >> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact. > >> Any opinions from the NVidia people? > >> > > Agreed. > > > >> > >>> 2. Why truesize, not something like > >>> > >>> if (skb->len <= some_sane_value_maybe_1k) > >> > >> > >> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page. > >> When we set the threshhold at a smaller value, skb->len makes more sense > >> > >> > >>> > >>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think > >>> it's a good idea to rely on this. > >>> Some test-based hardcode would be enough (i.e. threshold on which > >>> DMA mapping starts performing better). > >> > >> > >> A threshhold of 4k is absolutely fine with us (s390). > >> A threshhold of 1k would definitvely improve our situation and bring back the performance for some important scenarios. > >> > >> > >> NVidia people do you have any opinion on a good threshhold? > >> > > 1KB is still to large. As Tariq mentioned, the threshold should not > > exceed 128/256B. I am currently testing this with 256B on x86. So far no > > regressions but I need to play with it more. > > On different setups, usually the copybreak of 192 or 256 bytes was the > most efficient as well. A minor suggestion: Would it be at all possible for the people who've run these experiments to document their findings somewhere: what the different test setups were, what the copybreak settings were, what the results were, and how they were measured? Some drivers have a few details documented in Documentation/networking/device_drivers/ethernet/, but if others could do this too, like mlx5, in detail so findings could be reproduced by others, that would be amazing.
On 10.12.24 14:54, Alexander Lobakin wrote: > From: Dragos Tatulea <dtatulea@nvidia.com> > Date: Tue, 10 Dec 2024 12:44:04 +0100 > >> >> >> On 06.12.24 16:20, Alexandra Winter wrote: >>> >>> >>> On 04.12.24 15:32, Alexander Lobakin wrote: >>>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb, >>>>> { >>>>> struct mlx5e_sq_stats *stats = sq->stats; >>>>> >>>>> + /* Don't require 2 IOMMU TLB entries, if one is sufficient */ >>>>> + if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE) >>> + skb_linearize(skb); >>>> 1. What's with the direct DMA? I believe it would benefit, too? >>> >>> >>> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact. >>> Any opinions from the NVidia people? >>> >> Agreed. >> >>> >>>> 2. Why truesize, not something like >>>> >>>> if (skb->len <= some_sane_value_maybe_1k) >>> >>> >>> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page. >>> When we set the threshhold at a smaller value, skb->len makes more sense >>> >>> >>>> >>>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think >>>> it's a good idea to rely on this. >>>> Some test-based hardcode would be enough (i.e. threshold on which >>>> DMA mapping starts performing better). >>> >>> >>> A threshhold of 4k is absolutely fine with us (s390). >>> A threshhold of 1k would definitvely improve our situation and bring back the performance for some important scenarios. >>> >>> >>> NVidia people do you have any opinion on a good threshhold? >>> On 09.12.24 12:36, Tariq Toukan wrote: > Hi, > > Many approaches in the past few years are going the opposite direction, trying to avoid copies ("zero-copy"). > > In many cases, copy up to PAGE_SIZE means copy everything. > For high NIC speeds this is not realistic. > > Anyway, based on past experience, threshold should not exceed "max header size" (128/256b). >> 1KB is still to large. As Tariq mentioned, the threshold should not >> exceed 128/256B. I am currently testing this with 256B on x86. So far no >> regressions but I need to play with it more. > > On different setups, usually the copybreak of 192 or 256 bytes was the > most efficient as well. > >> Thank you very much, everybody for looking into this. Unfortunately we are seeing these performance regressions with ConnectX-6 cards on s390 and with this patch we get up to 12% more transactions/sec even for 1k messages. As we're always using an IOMMU and are operating with large multi socket systems, DMA costs far outweigh the costs of small to medium memcpy()s on our system. We realize that the recommendation is to just run without IOMMU when performance is important, but this is not an option in our environment. I understand that the simple patch of calling skb_linearize() in mlx5 for small messages is not a strategic direction, it is more a simple mitigation of our problem. Whether it should be restricted to use_dma_iommu() or not is up to you and your measurements. We could even restrict it to S390 arch, if you want. A threshold of 256b would cover some amount of our typical request-response workloads (think database queries/updates), but we would prefer a higher number (e.g. 1k or 2k). Could we maybe define an architecture dependent threshold? My preferred scenario for the next steps would be the following: 1) It would be great if we could get a simple mitigation patch upstream, that the distros could easily backport. 
This would avoid our customers experiencing performance regressions when they upgrade their distro versions (e.g. from RHEL8 to RHEL9 or RHEL10, just as an example). 2) Then we could work on a more invasive solution (see the proposals by Eric and Dragos/Saeed). This would then replace the simple mitigation patch upstream, and future releases would contain that solution. If somebody else wants to work on this improved version, fine with me; otherwise I could give it a try. What do you think of this approach?
On 10.12.24 12:49, Dragos Tatulea wrote: > > > On 06.12.24 16:25, Alexandra Winter wrote: >> >> >> On 04.12.24 15:36, Eric Dumazet wrote: >>> I would suggest the opposite : copy the headers (typically less than >>> 128 bytes) on a piece of coherent memory. >>> >>> As a bonus, if skb->len is smaller than 256 bytes, copy the whole skb. >>> >>> include/net/tso.h and net/core/tso.c users do this. >>> >>> Sure, patch is going to be more invasive, but all arches will win. >> >> >> Thank you very much for the examples, I think I understand what you are proposing. >> I am not sure whether I'm able to map it to the mlx5 driver, but I could >> try to come up with a RFC. It may take some time though. >> >> NVidia people, any suggesttions? Do you want to handle that yourselves? >> > Discussed with Saeed and he proposed another approach that is better for > us: copy the whole skb payload inline into the WQE if it's size is below a > threshold. This threshold can be configured through the > tx-copybreak mechanism. > > Thanks, > Dragos Thank you very much Dragos and Saeed. I am not sure I understand the details of "inline into the WQE". The idea seems to be to use a premapped coherent array per WQ that is indexed by queue element index and can be used to copy headers and maybe small messages into. I think I see something similar to your proposal in mlx4 (?). To me the general concept seems to be similar to what Eric is proposing. Did I get it right? I really like the idea to use tx-copybreak for threshold configuration. As Eric mentioned that is not a very small patch and maybe not fit for backporting to older distro versions. What do you think of a two-step approach as described in the other sub-thread? A simple patch for mitigation that can be backported, and then the improvement as a replacement? Thanks, Alexandra
Hi Alexandra, On 11.12.24 14:35, Alexandra Winter wrote: > > > On 10.12.24 14:54, Alexander Lobakin wrote: >> From: Dragos Tatulea <dtatulea@nvidia.com> >> Date: Tue, 10 Dec 2024 12:44:04 +0100 >> >>> >>> >>> On 06.12.24 16:20, Alexandra Winter wrote: >>>> >>>> >>>> On 04.12.24 15:32, Alexander Lobakin wrote: >>>>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb, >>>>>> { >>>>>> struct mlx5e_sq_stats *stats = sq->stats; >>>>>> >>>>>> + /* Don't require 2 IOMMU TLB entries, if one is sufficient */ >>>>>> + if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE) >>>> + skb_linearize(skb); >>>>> 1. What's with the direct DMA? I believe it would benefit, too? >>>> >>>> >>>> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact. >>>> Any opinions from the NVidia people? >>>> >>> Agreed. >>> >>>> >>>>> 2. Why truesize, not something like >>>>> >>>>> if (skb->len <= some_sane_value_maybe_1k) >>>> >>>> >>>> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page. >>>> When we set the threshhold at a smaller value, skb->len makes more sense >>>> >>>> >>>>> >>>>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think >>>>> it's a good idea to rely on this. >>>>> Some test-based hardcode would be enough (i.e. threshold on which >>>>> DMA mapping starts performing better). >>>> >>>> >>>> A threshhold of 4k is absolutely fine with us (s390). >>>> A threshhold of 1k would definitvely improve our situation and bring back the performance for some important scenarios. >>>> >>>> >>>> NVidia people do you have any opinion on a good threshhold? >>>> > > On 09.12.24 12:36, Tariq Toukan wrote: >> Hi, >> >> Many approaches in the past few years are going the opposite direction, trying to avoid copies ("zero-copy"). >> >> In many cases, copy up to PAGE_SIZE means copy everything. >> For high NIC speeds this is not realistic. >> >> Anyway, based on past experience, threshold should not exceed "max header size" (128/256b). > >>> 1KB is still to large. As Tariq mentioned, the threshold should not >>> exceed 128/256B. I am currently testing this with 256B on x86. So far no >>> regressions but I need to play with it more. I checked on a x86 system with CX7 and we seem to get ~4% degradation when using this approach. The patch was modified a bit according to previous discussions (diff at end of mail). Here's how I tested: - uperf client side has many queues. - uperf server side has single queue with interrupts pinned to a single CPU. This is to better isolate CPU behaviour. The idea is to have the CPU on the server side saturated or close to saturation. - Used the uperf 1B read x 1B write scenario with 50 and 100 threads (profile attached). Both the on-cpu and off-cpu cases were checked. - Code change was done only on server side. 
The results:
```
Case                                          Throughput   Operations
----------------------------------------------------------------------
app cpu == irq cpu, nthreads= 25, baseline    3.86Mb/s     30036552
app cpu == irq cpu, nthreads= 25, skb_linear  3.52Mb/s     26969315
app cpu == irq cpu, nthreads= 50, baseline    4.26Mb/s     33122458
app cpu == irq cpu, nthreads= 50, skb_linear  4.06Mb/s     31606290
app cpu == irq cpu, nthreads=100, baseline    4.08Mb/s     31775434
app cpu == irq cpu, nthreads=100, skb_linear  3.86Mb/s     30105582
app cpu != irq cpu, nthreads= 25, baseline    3.57Mb/s     27785488
app cpu != irq cpu, nthreads= 25, skb_linear  3.56Mb/s     27730133
app cpu != irq cpu, nthreads= 50, baseline    3.97Mb/s     30876264
app cpu != irq cpu, nthreads= 50, skb_linear  3.82Mb/s     29737654
app cpu != irq cpu, nthreads=100, baseline    4.06Mb/s     31621140
app cpu != irq cpu, nthreads=100, skb_linear  3.90Mb/s     30364382
```
So not encouraging. I restricted the tests to 1B payloads only as I expected worse perf for larger payloads. >> >> On different setups, usually the copybreak of 192 or 256 bytes was the most efficient as well. >> >>> > > Thank you very much, everybody for looking into this. > Unfortunately we are seeing these performance regressions with ConnectX-6 cards on s390 > and with this patch we get up to 12% more transactions/sec even for 1k messages. > > As we're always using an IOMMU and are operating with large multi socket systems, > DMA costs far outweigh the costs of small to medium memcpy()s on our system. > We realize that the recommendation is to just run without IOMMU when performance is important, > but this is not an option in our environment. > > I understand that the simple patch of calling skb_linearize() in mlx5 for small messages is not a > strategic direction, it is more a simple mitigation of our problem. > > Whether it should be restricted to use_dma_iommu() or not is up to you and your measurements. > We could even restrict it to S390 arch, if you want. > Maybe Tariq can comment on this. > A threshold of 256b would cover some amount of our typical request-response workloads > (think database queries/updates), but we would prefer a higher number (e.g. 1k or 2k). > Could we maybe define an architecture dependent threshold? > > My preferred scenario for the next steps would be the following: > 1) It would be great if we could get a simple mitigation patch upstream, that the distros could > easily backport. This would avoid our customers experiencing performance regression when they > upgrade their distro versions. (e.g. from RHEL8 to RHEL9 or RHEL10 just as an example) > Stupid question on my behalf: why can't this patch be taken as a distro patch for s390 and carried over over releases? This way the kernel upgrade pain would be avoided. > 2) Then we could work on a more invasive solution. (see proposals by Eric, Dragos/Saeed). > This would then replace the simple mitigation patch upstream, and future releases would contain > that solution. If somebody else wants to work on this improved version, fine with me, otherwise > I could give it a try. > For the inline WQE solution I don't think we have a large amount of space to copy so much data into. For Eric's side buffer suggestion it will be even more invasive and it will touch many more code paths... > What do you think of this approach? > > Sounds tricky. Let's see what Tariq has to say.
Thanks, Dragos

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index f8c7912abe0e..cc947daa538b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -269,6 +269,9 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 {
 	struct mlx5e_sq_stats *stats = sq->stats;
 
+	if (skb->len < 256)
+		skb_linearize(skb);
+
 	if (skb_is_gso(skb)) {
 		int hopbyhop;
 		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
Hi Alexandra, On 11.12.24 17:19, Alexandra Winter wrote: > > > On 10.12.24 12:49, Dragos Tatulea wrote: >> >> >> On 06.12.24 16:25, Alexandra Winter wrote: >>> >>> >>> On 04.12.24 15:36, Eric Dumazet wrote: >>>> I would suggest the opposite : copy the headers (typically less than >>>> 128 bytes) on a piece of coherent memory. >>>> >>>> As a bonus, if skb->len is smaller than 256 bytes, copy the whole skb. >>>> >>>> include/net/tso.h and net/core/tso.c users do this. >>>> >>>> Sure, patch is going to be more invasive, but all arches will win. >>> >>> >>> Thank you very much for the examples, I think I understand what you are proposing. >>> I am not sure whether I'm able to map it to the mlx5 driver, but I could >>> try to come up with a RFC. It may take some time though. >>> >>> NVidia people, any suggesttions? Do you want to handle that yourselves? >>> >> Discussed with Saeed and he proposed another approach that is better for >> us: copy the whole skb payload inline into the WQE if it's size is below a >> threshold. This threshold can be configured through the >> tx-copybreak mechanism. >> >> Thanks, >> Dragos > > > Thank you very much Dargos and Saeed. > I am not sure I understand the details of "inline into the WQE". > The idea seems to be to use a premapped coherent array per WQ > that is indexed by queue element index and can be used to copy headers and > maybe small messages into. > I think I see something similar to your proposal in mlx4 (?). I think so, yes. > To me the general concept seems to be similar to what Eric is proposing. > Did I get it right? > AFAIU, it's not quite the same thing. With Eric's proposal we'd use an additional premapped buffer, copy data there and submit WQEs with pointers to this buffer. The inline proposal is to copy the data inline in the WQE directly without the need of an additional buffer. To understand if this is feasible, I still need to find out what the actual max inline space is. > I really like the idea to use tx-copybreak for threshold configuration. > That's good to know. > As Eric mentioned that is not a very small patch and maybe not fit for backporting > to older distro versions. > What do you think of a two-step approach as described in the other sub-thread? > A simple patch for mitigation that can be backported, and then the improvement > as a replacement? As stated in the previous mail, let's see what Tariq has to say about doing some arch-specific fix. I am not very optimistic though... Thanks, Dragos
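A very rough sketch of what copying the payload "inline in the WQE" could look like, reusing the inline-header fields of the TX descriptor's Ethernet segment; the try_inline_whole_skb() helper and the copybreak parameter are hypothetical, and the real driver would also have to reserve the extra 16-byte descriptor slots that the inlined bytes spill into (the max-inline-space question raised above):

```c
/* Sketch only: the inlined bytes intentionally continue past inline_hdr.start
 * into WQE space the caller must have reserved; descriptor-count accounting
 * is omitted. Not the actual mlx5 implementation. */
#include <linux/mlx5/qp.h>
#include <linux/skbuff.h>

static bool try_inline_whole_skb(struct mlx5_wqe_eth_seg *eseg,
				 const struct sk_buff *skb, u32 copybreak)
{
	if (skb->len > copybreak)
		return false;

	/* Copy the complete frame behind the inline header marker; no DMA
	 * address and no per-packet mapping is used for this packet. */
	skb_copy_bits(skb, 0, eseg->inline_hdr.start, skb->len);
	eseg->inline_hdr.sz = cpu_to_be16(skb->len);
	return true;
}
```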
On Wed, 2024-12-11 at 18:28 +0100, Dragos Tatulea wrote: > > > > > ---8<--- > > > On 09.12.24 12:36, Tariq Toukan wrote: > > > Hi, > > > > > > Many approaches in the past few years are going the opposite direction, trying to avoid copies ("zero-copy"). > > > > > > In many cases, copy up to PAGE_SIZE means copy everything. > > > For high NIC speeds this is not realistic. > > > > > > Anyway, based on past experience, threshold should not exceed "max header size" (128/256b). > > > > > > 1KB is still to large. As Tariq mentioned, the threshold should not > > > > exceed 128/256B. I am currently testing this with 256B on x86. So far no > > > > regressions but I need to play with it more. > I checked on a x86 system with CX7 and we seem to get ~4% degradation > when using this approach. The patch was modified a bit according to > previous discussions (diff at end of mail). > > Here's how I tested: > - uperf client side has many queues. > - uperf server side has single queue with interrupts pinned to a single > CPU. This is to better isolate CPU behaviour. The idea is to have the > CPU on the server side saturated or close to saturation. > - Used the uperf 1B read x 1B write scenario with 50 and 100 threads > (profile attached). > Both the on-cpu and off-cpu cases were checked. > - Code change was done only on server side. I'm assuming this is with the x86 default IOMMU pass-through mode? Would you be able and willing to try with iommu.passthrough=0 and amd_iommu=on or intel_iommu=on, respectively? Check /sys/bus/pci/devices/<dev>/iommu_group/type for "DMA-FQ" to make sure the dma-iommu code is used. This is obviously not a "tuned for all-out perf at any cost" configuration, but it is recommended in hardening guides and I believe some ARM64 systems also default to using the IOMMU for bare-metal DMA API use. So it's not an unexpected configuration either. Thanks, Niklas
On 11.12.24 at 18:28, Dragos Tatulea wrote: >> My preferred scenario for the next steps would be the following: >> 1) It would be great if we could get a simple mitigation patch upstream, that the distros could >> easily backport. This would avoid our customers experiencing performance regression when they >> upgrade their distro versions. (e.g. from RHEL8 to RHEL9 or RHEL10 just as an example) >> > Stupid question on my behalf: why can't this patch be taken as a distro > patch for s390 and carried over over releases? This way the kernel > upgrade pain would be avoided. This is not how distros work today. All code/patches must come from upstream and all code/patches must be cross-architecture. There are exceptions, but only for very critical things. Furthermore, the right answer to avoid upgrade pain is to fix things upstream. So let's try to find a solution that we can integrate. Christian
On 11.12.24 18:50, Niklas Schnelle wrote: > On Wed, 2024-12-11 at 18:28 +0100, Dragos Tatulea wrote: >>>>>> > > ---8<--- > >> >>> On 09.12.24 12:36, Tariq Toukan wrote: >>>> Hi, >>>> >>>> Many approaches in the past few years are going the opposite direction, trying to avoid copies ("zero-copy"). >>>> >>>> In many cases, copy up to PAGE_SIZE means copy everything. >>>> For high NIC speeds this is not realistic. >>>> >>>> Anyway, based on past experience, threshold should not exceed "max header size" (128/256b). >>> >>>>> 1KB is still to large. As Tariq mentioned, the threshold should not >>>>> exceed 128/256B. I am currently testing this with 256B on x86. So far no >>>>> regressions but I need to play with it more. >> I checked on a x86 system with CX7 and we seem to get ~4% degradation >> when using this approach. The patch was modified a bit according to >> previous discussions (diff at end of mail). >> >> Here's how I tested: >> - uperf client side has many queues. >> - uperf server side has single queue with interrupts pinned to a single >> CPU. This is to better isolate CPU behaviour. The idea is to have the >> CPU on the server side saturated or close to saturation. >> - Used the uperf 1B read x 1B write scenario with 50 and 100 threads >> (profile attached). >> Both the on-cpu and off-cpu cases were checked. >> - Code change was done only on server side. > > I'm assuming this is with the x86 default IOMMU pass-through mode? It was in a VM with PCI passthrough for the device. > Would you be able and willing to try with iommu.passthrough=0 > and amd_iommu=on respectively intel_iommu=on? Check > /sys/bus/pci/devices/<dev>/iommu_group/type for "DMA-FQ" to make sure > the dma-iommu code is used. This is obviously not a "tuned for all out > perf at any cost" configuration but it is recommended in hardening > guides and I believe some ARM64 systems also default to using the IOMMU > for bare metal DMA API use. So it's not an unexpected configuration > either. > I got hold of a bare metal system where I could turn iommu passthrough off and confirm iommu_group/type as being DMA_FQ. But the results are inconclusive due to instability. I will look into this again after the holidays. Thanks, Dragos
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index f8c7912abe0e..421ba6798ca7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -32,6 +32,7 @@
 
 #include <linux/tcp.h>
 #include <linux/if_vlan.h>
+#include <linux/iommu-dma.h>
 #include <net/geneve.h>
 #include <net/dsfield.h>
 #include "en.h"
@@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 {
 	struct mlx5e_sq_stats *stats = sq->stats;
 
+	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
+	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
+		skb_linearize(skb);
+
 	if (skb_is_gso(skb)) {
 		int hopbyhop;
 		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);