[mptcp-next,v2,0/7] mptcp: rx path refactor

Message ID: cover.1733486870.git.pabeni@redhat.com
Message

Paolo Abeni Dec. 6, 2024, 12:09 p.m. UTC
This is a batch of changes I had sitting in my local tree for a while.
Why another refactor, you may ask? Two main reasons:

- currently the mptcp RX path introduces quite a bit of 'exceptional'
 accounting/locking processing WRT plain TCP, adding to the
 implementation complexity in a measurable way
- the performance gap WRT plain TCP for single subflow connections is
 quite measurable.

The present refactor addresses both of the above items: most of the
additional complexity is dropped, and single-stream performance
increases measurably, from 55Gbps to 71Gbps in my loopback test. As a
reference, plain TCP is around 84Gbps on the same host.
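
For reference, a comparable loopback run can be put together along these
lines; this is only a sketch, assuming iperf3 plus the mptcpize wrapper
from mptcpd, which is not necessarily the exact tooling used for the
numbers above:

    $ mptcpize run iperf3 -s -D                # MPTCP server on loopback
    $ mptcpize run iperf3 -c 127.0.0.1 -t 30   # single-stream MPTCP client
    $ iperf3 -c 127.0.0.1 -t 30                # plain TCP baseline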

The above comes at a price: the patches are invasive, even in subtle
ways; the chance of destabilizing the implementation is real (hence the
additional, intentional '-next' in the subject).

In any case, keeping the patches hidden for longer was not going to do
any good, so here we are.

Changes from v1:
  - fixed several data stream corruption and wake-up misses due
    to multi subflows races
  - added patches 1-3 mainly to address the above
  - added an additional follow-up patch (patch 7) with more cleanup

Paolo Abeni (7):
  mptcp: prevent excessive coalescing on receive
  tcp: fix recvbuffer adjust on sleeping rcvmsg
  mptcp: don't always assume copied data in mptcp_cleanup_rbuf()
  mptcp: consolidate subflow cleanup
  mptcp: move the whole rx path under msk socket lock protection
  mptcp: cleanup mem accounting.
  net: dismiss sk_forward_alloc_get()

 include/net/sock.h   |  13 ---
 net/core/sock.c      |   2 +-
 net/ipv4/af_inet.c   |   2 +-
 net/ipv4/inet_diag.c |   2 +-
 net/mptcp/fastopen.c |   4 +-
 net/mptcp/protocol.c | 259 +++++++++++++------------------------------
 net/mptcp/protocol.h |   6 +-
 net/mptcp/subflow.c  |  33 +++---
 net/sched/em_meta.c  |   2 +-
 9 files changed, 104 insertions(+), 219 deletions(-)

Comments

MPTCP CI Dec. 6, 2024, 1:17 p.m. UTC | #1
Hi Paolo,

Thank you for your modifications, that's great!

Our CI did some validations and here is its report:

- KVM Validation: normal: Success! ✅
- KVM Validation: debug: Success! ✅
- KVM Validation: btf-normal (only bpftest_all): Success! ✅
- KVM Validation: btf-debug (only bpftest_all): Success! ✅
- Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/12198751082

Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/1c3b9857b5b4
Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=915337


If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-normal

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker


Please note that despite all the efforts already made to have a stable
test suite when executed on a public CI like this one, it is possible that
some reported issues are not due to your modifications. Still, do not
hesitate to help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)
Paolo Abeni Dec. 6, 2024, 4:41 p.m. UTC | #2
On 12/6/24 13:09, Paolo Abeni wrote:
> The present refactor addresses both of the above items: most of the
> additional complexity is dropped, and single-stream performance
> increases measurably, from 55Gbps to 71Gbps in my loopback test. As a
> reference, plain TCP is around 84Gbps on the same host.

I forgot to update this section.

I did some additional tests with multiple links; the differences vs the
existing codebase are within the noise range.

Disclaimer: tests over multiple veth pairs, with RPS to use multiple
cores for the forwarding. This is a somewhat decent approximation of a
real setup with multiple H/W links, but I still have to try the latter.
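
Roughly, each simulated link looks like the sketch below; device names,
addresses and the RPS CPU mask are just placeholders, and a real run
would typically move the peer end into a separate netns:

    $ ip link add veth0 type veth peer name veth1
    $ ip addr add 192.0.2.1/24 dev veth0
    $ ip addr add 192.0.2.2/24 dev veth1
    $ ip link set veth0 up && ip link set veth1 up
    # let CPUs 0-3 handle receive processing for veth1
    $ echo f > /sys/class/net/veth1/queues/rx-0/rps_cpus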

I think the issue reported by the CI in the previous iteration should be
addressed here - or at least this iteration addresses the closest
problem I was able to reproduce locally.

I'm not aware of any standing issue, but I would bet a few coins this is
not bug-free.

Paolo
Mat Martineau Dec. 21, 2024, 2:12 a.m. UTC | #3
On Fri, 6 Dec 2024, Paolo Abeni wrote:

> This is a batch of changes I had sitting in my local tree for a while.
> Why another refactor, you may ask? Two main reasons:
>
> - currently the mptcp RX path introduces quite a bit of 'exceptional'
> accounting/locking processing WRT plain TCP, adding to the
> implementation complexity in a measurable way
> - the performance gap WRT plain TCP for single subflow connections is
> quite measurable.
>
> The present refactor addresses both of the above items: most of the
> additional complexity is dropped, and single-stream performance
> increases measurably, from 55Gbps to 71Gbps in my loopback test. As a
> reference, plain TCP is around 84Gbps on the same host.
>

Hi Paolo -

Thanks for the v2 and for sharing the performance numbers - great to see 
such a dramatic improvement in throughput!

> The above comes at a price: the patches are invasive, even in subtle
> ways; the chance of destabilizing the implementation is real (hence the
> additional, intentional '-next' in the subject).
>

I think we will also need to be extra careful about monitoring stable tree 
backports that modify the affected functions. But to me the 
simplifications and performance fixes are worth it.

It will help to get this into the export branch for further testing!

Reviewed-by: Mat Martineau <martineau@kernel.org>


Matthieu Baerts Dec. 21, 2024, 11:17 a.m. UTC | #4
Hi Paolo, Mat,

On 06/12/2024 13:09, Paolo Abeni wrote:
> This is a batch of changes I had sitting in my local tree for a while.
> Why another refactor, you may ask? Two main reasons:
> 
> - currently the mptcp RX path introduces quite a bit of 'exceptional'
>  accounting/locking processing WRT plain TCP, adding to the
>  implementation complexity in a measurable way
> - the performance gap WRT plain TCP for single subflow connections is
>  quite measurable.
> 
> The present refactor addresses both of the above items: most of the
> additional complexity is dropped, and single-stream performance
> increases measurably, from 55Gbps to 71Gbps in my loopback test. As a
> reference, plain TCP is around 84Gbps on the same host.
> 
> The above comes at a price: the patches are invasive, even in subtle
> ways; the chance of destabilizing the implementation is real (hence the
> additional, intentional '-next' in the subject).
> 
> In any case, keeping the patches hidden for longer was not going to do
> any good, so here we are.

Thank you for the patches and the reviews!

Patches 2 and 3 are now in "fixes for net", while the others are in
"feat. for next". I think patch 1 can go to "fixes for net" as well; any
objections?

New patches for t/upstream-net and t/upstream:
- d443f83de23d: mptcp: fix recvbuffer adjust on sleeping rcvmsg
- 11d4943d66ee: mptcp: don't always assume copied data in mptcp_cleanup_rbuf()
- Results: 3a0f411ba1c5..2116e71814b7 (export-net)
- Results: df3193ae83eb..953b0d388720 (export)

Tests are now in progress:

- export-net:
https://github.com/multipath-tcp/mptcp_net-next/commit/047c2f99575350d9971b6257f96ce5f734f0186a/checks

New patches for t/upstream:
- 972d8a5b7f20: mptcp: prevent excessive coalescing on receive
- 0b0a25a24d21: mptcp: consolidate subflow cleanup
- 533bb69f1b0d: mptcp: move the whole rx path under msk socket lock protection
- c5d13b230497: mptcp: cleanup mem accounting
- a564b9923a94: net: dismiss sk_forward_alloc_get()
- Results: 953b0d388720..9213903d77ba (export)

Tests are now in progress:

- export:
https://github.com/multipath-tcp/mptcp_net-next/commit/6aae9c8bd8361dd90a7def87459ca0173915c56c/checks

Cheers,
Matt