
[RFC,net] tcp: Fix performance regression for request-response workloads

Message ID 20220907122505.26953-1-wintera@linux.ibm.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Series [RFC,net] tcp: Fix performance regression for request-response workloads

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2 this patch: 2
netdev/cc_maintainers warning 2 maintainers not CCed: yoshfuji@linux-ipv6.org dsahern@kernel.org
netdev/build_clang success Errors and warnings before: 5 this patch: 5
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 2 this patch: 2
netdev/checkpatch warning WARNING: 'explicitely' may be misspelled - perhaps 'explicitly'?
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Alexandra Winter Sept. 7, 2022, 12:25 p.m. UTC
Since linear payload was removed even for single small messages,
an additional page is required and we are measuring performance impact.

3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
explicitely allowed "payload in skb->head for first skb put in the queue,
to not impact RPC workloads."
472c2e07eef0 ("tcp: add one skb cache for tx")
made that obsolete and removed it.
When 
d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache")
reverted it, this piece was not reverted and not added back in.

When running uperf with a request-response pattern with 1k payload
and 250 connections parallel, we measure 13% difference in throughput
for our PCI based network interfaces since 472c2e07eef0.
(our IO MMU is sensitive to the number of mapped pages)

Could you please consider allowing linear payload for the first
skb in queue again? A patch proposal is appended below.

Kind regards
Alexandra

---------------------------------------------------------------

tcp: allow linear skb payload for first in queue

Allow payload in skb->head for first skb in the queue,
RPC workloads will benefit.

Fixes: 472c2e07eef0 ("tcp: add one skb cache for tx")
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
---
 net/ipv4/tcp.c | 39 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

Comments

Eric Dumazet Sept. 7, 2022, 4:06 p.m. UTC | #1
On Wed, Sep 7, 2022 at 5:26 AM Alexandra Winter <wintera@linux.ibm.com> wrote:
>
> Since linear payload was removed even for single small messages,
> an additional page is required and we are measuring performance impact.
>
> 3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
> explicitely allowed "payload in skb->head for first skb put in the queue,
> to not impact RPC workloads."
> 472c2e07eef0 ("tcp: add one skb cache for tx")
> made that obsolete and removed it.
> When
> d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache")
> reverted it, this piece was not reverted and not added back in.
>
> When running uperf with a request-response pattern with 1k payload
> and 250 connections parallel, we measure 13% difference in throughput
> for our PCI based network interfaces since 472c2e07eef0.
> (our IO MMU is sensitive to the number of mapped pages)



>
> Could you please consider allowing linear payload for the first
> skb in queue again? A patch proposal is appended below.

No.

Please add a work around in your driver.

You can increase throughput by 20% by premapping a coherent piece of
memory in which
you can copy small skbs (skb->head included)

Something like 256 bytes per slot in the TX ring.
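
For illustration, a rough sketch of such a pre-mapped per-slot area at ring
setup time; every name here (toy_tx_ring, TX_INLINE_SIZE, ...) is hypothetical
and this is not code from any existing driver:

#include <linux/dma-mapping.h>
#include <linux/netdevice.h>

#define TX_INLINE_SIZE 256	/* bytes reserved per TX slot (assumption) */

struct toy_tx_ring {
	struct device	*dev;		/* device used for DMA mapping */
	unsigned int	size;		/* number of slots in the ring */
	unsigned int	next_to_use;	/* producer index */
	void		*inline_buf;	/* CPU address of the coherent area */
	dma_addr_t	inline_dma;	/* DMA address of the coherent area */
	/* descriptors, completion bookkeeping, etc. omitted */
};

/* Map one coherent area once, at ring creation, instead of calling
 * dma_map_single() for every small packet.
 */
static int toy_tx_ring_alloc_inline(struct toy_tx_ring *ring)
{
	ring->inline_buf = dma_alloc_coherent(ring->dev,
					      (size_t)ring->size * TX_INLINE_SIZE,
					      &ring->inline_dma, GFP_KERNEL);
	return ring->inline_buf ? 0 : -ENOMEM;
}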


>
> Kind regards
> Alexandra
>
> ---------------------------------------------------------------
>
> tcp: allow linear skb payload for first in queue
>
> Allow payload in skb->head for first skb in the queue,
> RPC workloads will benefit.
>
> Fixes: 472c2e07eef0 ("tcp: add one skb cache for tx")
> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
> ---
>  net/ipv4/tcp.c | 39 +++++++++++++++++++++++++++++++++++++--
>  1 file changed, 37 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e5011c136fdb..f7cbccd41d85 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1154,6 +1154,30 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
>  }
>  EXPORT_SYMBOL(tcp_sendpage);
>
> +/* Do not bother using a page frag for very small frames.
> + * But use this heuristic only for the first skb in write queue.
> + *
> + * Having no payload in skb->head allows better SACK shifting
> + * in tcp_shift_skb_data(), reducing sack/rack overhead, because
> + * write queue has less skbs.
> + * Each skb can hold up to MAX_SKB_FRAGS * 32Kbytes, or ~0.5 MB.
> + * This also speeds up tso_fragment(), since it won't fallback
> + * to tcp_fragment().
> + */
> +static int linear_payload_sz(bool first_skb)
> +{
> +	if (first_skb)
> +		return SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
> +	return 0;
> +}
> +
> +static int select_size(bool first_skb, bool zc)
> +{
> +	if (zc)
> +		return 0;
> +	return linear_payload_sz(first_skb);
> +}
> +
>  void tcp_free_fastopen_req(struct tcp_sock *tp)
>  {
>         if (tp->fastopen_req) {
> @@ -1311,6 +1335,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>
>                 if (copy <= 0 || !tcp_skb_can_collapse_to(skb)) {
>                         bool first_skb;
> +                       int linear;
>
>  new_segment:
>                         if (!sk_stream_memory_free(sk))
> @@ -1322,7 +1347,9 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>                                         goto restart;
>                         }
>                         first_skb = tcp_rtx_and_write_queues_empty(sk);
> -                       skb = tcp_stream_alloc_skb(sk, 0, sk->sk_allocation,
> +                       linear = select_size(first_skb, zc);
> +                       skb = tcp_stream_alloc_skb(sk, linear,
> +                                                  sk->sk_allocation,
>                                                    first_skb);
>                         if (!skb)
>                                 goto wait_for_space;
> @@ -1344,7 +1371,15 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>                 if (copy > msg_data_left(msg))
>                         copy = msg_data_left(msg);
>
> -               if (!zc) {
> +               /* Where to copy to? */
> +               if (skb_availroom(skb) > 0 && !zc) {
> +                       /* We have some space in skb head. Superb! */
> +                       copy = min_t(int, copy, skb_availroom(skb));
> +                       err = skb_add_data_nocache(sk, skb, &msg->msg_iter,
> +                                                  copy);
> +                       if (err)
> +                               goto do_error;
> +               } else if (!zc) {
>                         bool merge = true;
>                         int i = skb_shinfo(skb)->nr_frags;
>                         struct page_frag *pfrag = sk_page_frag(sk);
> --
> 2.24.3 (Apple Git-128)
>
Christian Borntraeger Sept. 8, 2022, 9:40 a.m. UTC | #2
Am 07.09.22 um 18:06 schrieb Eric Dumazet:
> On Wed, Sep 7, 2022 at 5:26 AM Alexandra Winter <wintera@linux.ibm.com> wrote:
>>
>> Since linear payload was removed even for single small messages,
>> an additional page is required and we are measuring performance impact.
>>
>> 3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
>> explicitely allowed "payload in skb->head for first skb put in the queue,
>> to not impact RPC workloads."
>> 472c2e07eef0 ("tcp: add one skb cache for tx")
>> made that obsolete and removed it.
>> When
>> d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache")
>> reverted it, this piece was not reverted and not added back in.
>>
>> When running uperf with a request-response pattern with 1k payload
>> and 250 connections parallel, we measure 13% difference in throughput
>> for our PCI based network interfaces since 472c2e07eef0.
>> (our IO MMU is sensitive to the number of mapped pages)
> 
> 
> 
>>
>> Could you please consider allowing linear payload for the first
>> skb in queue again? A patch proposal is appended below.
> 
> No.
> 
> Please add a work around in your driver.
> 
> You can increase throughput by 20% by premapping a coherent piece of
> memory in which
> you can copy small skbs (skb->head included)
> 
> Something like 256 bytes per slot in the TX ring.
> 

FWIW this regression was with the standard Mellanox driver (nothing s390 specific).
Eric Dumazet Sept. 8, 2022, 12:41 p.m. UTC | #3
On Thu, Sep 8, 2022 at 2:40 AM Christian Borntraeger
<borntraeger@linux.ibm.com> wrote:
>
> Am 07.09.22 um 18:06 schrieb Eric Dumazet:
> > On Wed, Sep 7, 2022 at 5:26 AM Alexandra Winter <wintera@linux.ibm.com> wrote:
> >>
> >> Since linear payload was removed even for single small messages,
> >> an additional page is required and we are measuring performance impact.
> >>
> >> 3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
> >> explicitely allowed "payload in skb->head for first skb put in the queue,
> >> to not impact RPC workloads."
> >> 472c2e07eef0 ("tcp: add one skb cache for tx")
> >> made that obsolete and removed it.
> >> When
> >> d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache")
> >> reverted it, this piece was not reverted and not added back in.
> >>
> >> When running uperf with a request-response pattern with 1k payload
> >> and 250 connections parallel, we measure 13% difference in throughput
> >> for our PCI based network interfaces since 472c2e07eef0.
> >> (our IO MMU is sensitive to the number of mapped pages)
> >
> >
> >
> >>
> >> Could you please consider allowing linear payload for the first
> >> skb in queue again? A patch proposal is appended below.
> >
> > No.
> >
> > Please add a work around in your driver.
> >
> > You can increase throughput by 20% by premapping a coherent piece of
> > memory in which
> > you can copy small skbs (skb->head included)
> >
> > Something like 256 bytes per slot in the TX ring.
> >
>
> FWIW this regression was with the standard Mellanox driver (nothing s390 specific).

I did not claim this was s390 specific.

Only IOMMU mode.

I would rather not add back something which makes TCP stack slower
(more tests in fast path)
for the majority of us _not_ using IOMMU.

In our own tests, this trick of using linear skbs was only helping
benchmarks, not real workloads.

Many drivers have to map skb->head a second time if they contain TCP payload,
thus adding yet another corner case in their fast path.

- Typical RPC workloads are playing with TCP_NODELAY
- Typical bulk flows never have empty write queues...

Really, I do not want this optimization back, this is not worth it.

Again, a driver knows better if it is using IOMMU and if pathological
layouts can be optimized
to non SG ones, and using a pre-dma-map zone will also benefit pure
TCP ACK packets (which do not have any payload)

Here is the changelog of a patch I did for our GQ NIC (not yet
upstreamed, but will be soon)

...
   The problem is coming from gq_tx_clean() calling
     dma_unmap_single(q->dev, p->addr, p->len, DMA_TO_DEVICE);

    This seems silly to perform possibly expensive IOMMU operations to
send small packets.
    (TCP pure acks are 86 bytes long in total for 99% of the cases)

    Idea of this patch is to pre-dma-map a memory zone to hold the
headers of the
    packet (if less than 128/256 bytes long)

    Then if the whole packet can be copied into this 128/256 bytes
zone, just copy it
    entirely.

    This permits to consume the small packets right away in ndo_start_xmit()
    while the skb (and associated socket sk_wmem_alloc) is hot, instead of later
    at TX completion time.
    This makes ACK packets cost much smaller, but also tiny TCP
packets (say, synthetic benchmarks)

    We enable this behavior only if IOMMU is used/forced on GQ,
    although we might use it regardless of IOMMU being used or not.
...
    To recap, there is a huge difference if we cross the 42 byte limit
    (for a 128 bytes zone per TX ring slot):

    iroa21:/home/edumazet# ./super_netperf 200 -H iroa23 -t TCP_RR -l
20 -- -r40,40
    2648141
    iroa21:/home/edumazet# ./super_netperf 200 -H iroa23 -t TCP_RR -l
20 -- -r44,44
     970691

    We might experiment with bigger GQ_TX_INLINE_HEADER_SIZE in the future ?
   ...
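
To make the changelog above concrete, here is a rough sketch of the xmit-side
decision, reusing the hypothetical toy_tx_ring from the earlier sketch;
fill_desc() and all other names are made up, and this is not the GQ driver:

/* Builds on the includes and struct toy_tx_ring sketched earlier. */

/* Hypothetical helper that would write one TX descriptor (address + length). */
static void fill_desc(struct toy_tx_ring *ring, unsigned int slot,
		      dma_addr_t addr, unsigned int len)
{
	/* hardware-specific descriptor write would go here */
}

static netdev_tx_t toy_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct toy_tx_ring *ring = netdev_priv(dev);	/* assumption */
	unsigned int slot = ring->next_to_use;	/* wrap/advance omitted */
	void *buf = ring->inline_buf + slot * TX_INLINE_SIZE;

	if (skb->len <= TX_INLINE_SIZE) {
		/* Copy the whole frame (skb->head and any frags) into the
		 * pre-mapped slot: no per-packet IOMMU map/unmap at all,
		 * and the skb is consumed while it is still cache-hot.
		 */
		skb_copy_bits(skb, 0, buf, skb->len);
		fill_desc(ring, slot,
			  ring->inline_dma + slot * TX_INLINE_SIZE, skb->len);
		dev_consume_skb_any(skb);
	} else {
		/* Larger packets keep the usual path: map skb->head (frags
		 * omitted here) and free the skb at TX completion time.
		 */
		dma_addr_t addr = dma_map_single(ring->dev, skb->data,
						 skb_headlen(skb),
						 DMA_TO_DEVICE);

		fill_desc(ring, slot, addr, skb_headlen(skb));
	}
	/* doorbell, ring-full handling, error paths, ... omitted */
	return NETDEV_TX_OK;
}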
Alexandra Winter Sept. 26, 2022, 10:06 a.m. UTC | #4
On 08.09.22 14:41, Eric Dumazet wrote:
> On Thu, Sep 8, 2022 at 2:40 AM Christian Borntraeger
> <borntraeger@linux.ibm.com> wrote:
>>
>> Am 07.09.22 um 18:06 schrieb Eric Dumazet:
>>> On Wed, Sep 7, 2022 at 5:26 AM Alexandra Winter <wintera@linux.ibm.com> wrote:
>>>>
>>>> Since linear payload was removed even for single small messages,
>>>> an additional page is required and we are measuring performance impact.
>>>>
>>>> 3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
>>>> explicitely allowed "payload in skb->head for first skb put in the queue,
>>>> to not impact RPC workloads."
>>>> 472c2e07eef0 ("tcp: add one skb cache for tx")
>>>> made that obsolete and removed it.
>>>> When
>>>> d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache")
>>>> reverted it, this piece was not reverted and not added back in.
>>>>
>>>> When running uperf with a request-response pattern with 1k payload
>>>> and 250 connections parallel, we measure 13% difference in throughput
>>>> for our PCI based network interfaces since 472c2e07eef0.
>>>> (our IO MMU is sensitive to the number of mapped pages)
>>>
>>>
>>>
>>>>
>>>> Could you please consider allowing linear payload for the first
>>>> skb in queue again? A patch proposal is appended below.
>>>
>>> No.
>>>
>>> Please add a work around in your driver.
>>>
>>> You can increase throughput by 20% by premapping a coherent piece of
>>> memory in which
>>> you can copy small skbs (skb->head included)
>>>
>>> Something like 256 bytes per slot in the TX ring.
>>>
>>
>> FWIW this regression was with the standard Mellanox driver (nothing s390 specific).
> 
> I did not claim this was s390 specific.
> 
> Only IOMMU mode.
> 
> I would rather not add back something which makes TCP stack slower
> (more tests in fast path)
> for the majority of us _not_ using IOMMU.
> 
> In our own tests, this trick of using linear skbs was only helping
> benchmarks, not real workloads.
> 
> Many drivers have to map skb->head a second time if they contain TCP payload,
> thus adding yet another corner case in their fast path.
> 
> - Typical RPC workloads are playing with TCP_NODELAY
> - Typical bulk flows never have empty write queues...
> 
> Really, I do not want this optimization back, this is not worth it.
> 
> Again, a driver knows better if it is using IOMMU and if pathological
> layouts can be optimized
> to non SG ones, and using a pre-dma-map zone will also benefit pure
> TCP ACK packets (which do not have any payload)
> 
> Here is the changelog of a patch I did for our GQ NIC (not yet
> upstreamed, but will be soon)
> 
[...]

Saeed,
As discussed at LPC, could you please consider adding a workaround to the
Mellanox driver, to use non-SG SKBs for small messages? As mentioned above
we are seeing 13% throughput degradation, if 2 pages need to be mapped
instead of 1.

While Eric's ideas sound very promising, just using non-SG in these cases
should be enough to mitigate the performance regression we see.
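
A minimal sketch of what "non-SG SKBs for small messages" could look like in a
driver's xmit path; the threshold is arbitrary and this is not mlx5e code:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

#define SMALL_PKT_LINEARIZE_LEN 256	/* arbitrary threshold, needs tuning */

static netdev_tx_t small_pkt_xmit(struct sk_buff *skb, struct net_device *dev)
{
	/* Collapse small scatter-gather skbs into one linear buffer so a
	 * single page is mapped instead of skb->head plus a page frag.
	 */
	if (skb_is_nonlinear(skb) && skb->len <= SMALL_PKT_LINEARIZE_LEN &&
	    unlikely(skb_linearize(skb))) {
		/* allocation failed: drop the packet */
		dev_kfree_skb_any(skb);
		dev->stats.tx_dropped++;
		return NETDEV_TX_OK;
	}

	/* ... continue with the driver's normal mapping/descriptor code ... */
	return NETDEV_TX_OK;
}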

Thank you in advance.
Alexandra
Saeed Mahameed Sept. 30, 2022, 11:37 p.m. UTC | #5
On 26 Sep 12:06, Alexandra Winter wrote:
>

[ ... ]

>[...]
>
>Saeed,
>As discussed at LPC, could you please consider adding a workaround to the
>Mellanox driver, to use non-SG SKBs for small messages? As mentioned above
>we are seeing 13% throughput degradation, if 2 pages need to be mapped
>instead of 1.
>
>While Eric's ideas sound very promising, just using non-SG in these cases
>should be enough to mitigate the performance regression we see.

Hi Alexandra, sorry for the late response.

Yes, linearizing small messages makes sense, but will require some careful
perf testing.

We will do our best to include this in the next kernel release cycle.
I will take it with the mlx5e team next week, everybody is on vacation this
time of year :).

Thanks,
Saeed.
Alexandra Winter Dec. 29, 2022, 8:27 a.m. UTC | #6
On 01.10.22 01:37, Saeed Mahameed wrote:
> On 26 Sep 12:06, Alexandra Winter wrote:
>>
> 
> [ ... ]
> 
>> [...]
>>
>> Saeed,
>> As discussed at LPC, could you please consider adding a workaround to the
>> Mellanox driver, to use non-SG SKBs for small messages? As mentioned above
>> we are seeing 13% throughput degradation, if 2 pages need to be mapped
>> instead of 1.
>>
>> While Eric's ideas sound very promising, just using non-SG in these cases
>> should be enough to mitigate the performance regression we see.
> 
> Hi Alexandra, sorry for the late response.
> 
> Yes, linearizing small messages makes sense, but will require some careful
> perf testing.
> 
> We will do our best to include this in the next kernel release cycle.
> I will take it with the mlx5e team next week, everybody is on vacation this
> time of year :).
> 
> Thanks,
> Saeed.

Hello Saeed,
may I ask whether you had a chance to include such a patch in the 6.2 kernel?
Or is this still on your ToDo list?
I haven't seen anything like this on the mailing list, but I may have overlooked it.
All the best for 2023
Alexandra
Alexandra Winter April 27, 2023, 9:44 a.m. UTC | #7
On 29.12.22 09:27, Alexandra Winter wrote:
> 
> 
> On 01.10.22 01:37, Saeed Mahameed wrote:
>> On 26 Sep 12:06, Alexandra Winter wrote:
>>
>> [ ... ]
>>>
>>> Saeed,
>>> As discussed at LPC, could you please consider adding a workaround to the
>>> Mellanox driver, to use non-SG SKBs for small messages? As mentioned above
>>> we are seeing 13% throughput degradation, if 2 pages need to be mapped
>>> instead of 1.
>>>
>>> While Eric's ideas sound very promising, just using non-SG in these cases
>>> should be enough to mitigate the performance regression we see.
>>
>> Hi Alexandra, sorry for the late response.
>>
>> Yes, linearizing small messages makes sense, but will require some careful
>> perf testing.
>>
>> We will do our best to include this in the next kernel release cycle.
>> I will take it with the mlx5e team next week, everybody is on vacation this
>> time of year :).
>>
>> Thanks,
>> Saeed.
> 
> Hello Saeed,
> may I ask whether you had a chance to include such a patch in the 6.2 kernel?
> Or is this still on your ToDo list?
> I haven't seen anything like this on the mailing list, but I may have overlooked it.
> All the best for 2023
> Alexandra


Hello Saeed,
any news about linearizing small messages? Is there any way we could be of help?

Kind regards
Alexandra

Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e5011c136fdb..f7cbccd41d85 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1154,6 +1154,30 @@  int tcp_sendpage(struct sock *sk, struct page *page, int offset,
 }
 EXPORT_SYMBOL(tcp_sendpage);
 
+/* Do not bother using a page frag for very small frames.
+ * But use this heuristic only for the first skb in write queue.
+ *
+ * Having no payload in skb->head allows better SACK shifting
+ * in tcp_shift_skb_data(), reducing sack/rack overhead, because
+ * write queue has less skbs.
+ * Each skb can hold up to MAX_SKB_FRAGS * 32Kbytes, or ~0.5 MB.
+ * This also speeds up tso_fragment(), since it won't fallback
+ * to tcp_fragment().
+ */
+static int linear_payload_sz(bool first_skb)
+{
+	if (first_skb)
+		return SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
+	return 0;
+}
+
+static int select_size(bool first_skb, bool zc)
+{
+	if (zc)
+		return 0;
+	return linear_payload_sz(first_skb);
+}
+
 void tcp_free_fastopen_req(struct tcp_sock *tp)
 {
 	if (tp->fastopen_req) {
@@ -1311,6 +1335,7 @@  int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 
 		if (copy <= 0 || !tcp_skb_can_collapse_to(skb)) {
 			bool first_skb;
+			int linear;
 
 new_segment:
 			if (!sk_stream_memory_free(sk))
@@ -1322,7 +1347,9 @@  int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 					goto restart;
 			}
 			first_skb = tcp_rtx_and_write_queues_empty(sk);
-			skb = tcp_stream_alloc_skb(sk, 0, sk->sk_allocation,
+			linear = select_size(first_skb, zc);
+			skb = tcp_stream_alloc_skb(sk, linear,
+						   sk->sk_allocation,
 						   first_skb);
 			if (!skb)
 				goto wait_for_space;
@@ -1344,7 +1371,15 @@  int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		if (copy > msg_data_left(msg))
 			copy = msg_data_left(msg);
 
-		if (!zc) {
+		/* Where to copy to? */
+		if (skb_availroom(skb) > 0 && !zc) {
+			/* We have some space in skb head. Superb! */
+			copy = min_t(int, copy, skb_availroom(skb));
+			err = skb_add_data_nocache(sk, skb, &msg->msg_iter,
+						   copy);
+			if (err)
+				goto do_error;
+		} else if (!zc) {
 			bool merge = true;
 			int i = skb_shinfo(skb)->nr_frags;
 			struct page_frag *pfrag = sk_page_frag(sk);