From patchwork Tue May 25 17:37:17 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Paolo Abeni X-Patchwork-Id: 12279735 X-Patchwork-Delegate: mat@martineau.name Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [216.205.24.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8123370 for ; Tue, 25 May 2021 17:37:43 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1621964262; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=voo/hz+knWmOVyC8UbgUKx6hF7iLzVDYsrKElUerk9c=; b=E3OkWtKCESiVJEiuuyQ1M9U1gElW2VvD6myLSsJ8ZRByhSu2LZq1tWmtxs1f8MSUCCIWjx nfBlu52AMaaNsu8No20xwv312+A42FmS0akx2owxaL4DeVRyFf8huVOF48zVLZseqFQNLg ZRZfty6isCgjVcphXrjgdarQja0vB2g= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-313-DI5BoTUBMuONsl41bjRIZw-1; Tue, 25 May 2021 13:37:40 -0400 X-MC-Unique: DI5BoTUBMuONsl41bjRIZw-1 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.phx2.redhat.com [10.5.11.22]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 9B4C31020C36 for ; Tue, 25 May 2021 17:37:39 +0000 (UTC) Received: from gerbillo.redhat.com (ovpn-115-92.ams2.redhat.com [10.36.115.92]) by smtp.corp.redhat.com (Postfix) with ESMTP id E210C100E113; Tue, 25 May 2021 17:37:35 +0000 (UTC) From: Paolo Abeni To: mptcp@lists.linux.dev Cc: fwestpha@redhat.com Subject: [RFC PATCH 1/4] mptcp: wake-up readers only for in sequence data. 
Date: Tue, 25 May 2021 19:37:17 +0200 Message-Id: <1a99371fbd038cfcabac5fa4fe0f983c6ea96433.1621963632.git.pabeni@redhat.com> In-Reply-To: References: X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.84 on 10.5.11.22 Authentication-Results: relay.mimecast.com; auth=pass smtp.auth=CUSA124A263 smtp.mailfrom=pabeni@redhat.com X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Currently we rely on the subflow->data_avail field, which is subject to races: ssk1 skb len = 500 DSS(seq=1, len=1000, off=0) # data_avail == MPTCP_SUBFLOW_DATA_AVAIL ssk2 skb len = 500 DSS(seq = 501, len=1000) # data_avail == MPTCP_SUBFLOW_DATA_AVAIL ssk1 skb len = 500 DSS(seq = 1, len=1000, off =500) # still data_avail == MPTCP_SUBFLOW_DATA_AVAIL, # as the skb is covered by a pre-existing map, # which was in-sequence at reception time. Instead we can explicitly check if some data has been received in-sequence, propagating the info from __mptcp_move_skbs_from_subflow(). Signed-off-by: Paolo Abeni --- net/mptcp/protocol.c | 25 +++++++++---------------- net/mptcp/protocol.h | 1 - net/mptcp/subflow.c | 15 +++++---------- 3 files changed, 14 insertions(+), 27 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index bb029dd4ff5e..e85ec0e84e06 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -677,7 +677,7 @@ static bool __mptcp_ofo_queue(struct mptcp_sock *msk) /* In most cases we will be able to lock the mptcp socket. If its already * owned, we need to defer to the work queue to avoid ABBA deadlock.
*/ -static void move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk) +static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk) { struct sock *sk = (struct sock *)msk; unsigned int moved = 0; @@ -698,6 +698,8 @@ static void move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk) if (mptcp_pending_data_fin(sk, NULL)) mptcp_schedule_work(sk); mptcp_data_unlock(sk); + + return moved > 0; } void mptcp_data_ready(struct sock *sk, struct sock *ssk) @@ -705,7 +707,6 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk) struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk); struct mptcp_sock *msk = mptcp_sk(sk); int sk_rbuf, ssk_rbuf; - bool wake; /* The peer can send data while we are shutting down this * subflow at msk destruction time, but we must avoid enqueuing @@ -714,28 +715,20 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk) if (unlikely(subflow->disposable)) return; - /* move_skbs_to_msk below can legitly clear the data_avail flag, - * but we will need later to properly woke the reader, cache its - * value - */ - wake = subflow->data_avail == MPTCP_SUBFLOW_DATA_AVAIL; - if (wake) - set_bit(MPTCP_DATA_READY, &msk->flags); - ssk_rbuf = READ_ONCE(ssk->sk_rcvbuf); sk_rbuf = READ_ONCE(sk->sk_rcvbuf); if (unlikely(ssk_rbuf > sk_rbuf)) sk_rbuf = ssk_rbuf; - /* over limit? can't append more skbs to msk */ + /* over limit? 
can't append more skbs to msk, Also, no need to wake-up*/ if (atomic_read(&sk->sk_rmem_alloc) > sk_rbuf) - goto wake; - - move_skbs_to_msk(msk, ssk); + return; -wake: - if (wake) + /* Wake-up the reader only for in-sequence data */ + if (move_skbs_to_msk(msk, ssk)) { + set_bit(MPTCP_DATA_READY, &msk->flags); sk->sk_data_ready(sk); + } } static bool mptcp_do_flush_join_list(struct mptcp_sock *msk) diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index 520098188d80..2f22046a7565 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -374,7 +374,6 @@ mptcp_subflow_rsk(const struct request_sock *rsk) enum mptcp_data_avail { MPTCP_SUBFLOW_NODATA, MPTCP_SUBFLOW_DATA_AVAIL, - MPTCP_SUBFLOW_OOO_DATA }; struct mptcp_delegated_action { diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c index 6b1cd4257edf..f8323a759af1 100644 --- a/net/mptcp/subflow.c +++ b/net/mptcp/subflow.c @@ -1137,18 +1137,13 @@ static bool subflow_check_data_avail(struct sock *ssk) ack_seq = mptcp_subflow_get_mapped_dsn(subflow); pr_debug("msk ack_seq=%llx subflow ack_seq=%llx", old_ack, ack_seq); - if (ack_seq == old_ack) { - subflow->data_avail = MPTCP_SUBFLOW_DATA_AVAIL; - break; - } else if (after64(ack_seq, old_ack)) { - subflow->data_avail = MPTCP_SUBFLOW_OOO_DATA; - break; + if (unlikely(before64(ack_seq, old_ack))) { + mptcp_subflow_discard_data(ssk, skb, old_ack - ack_seq); + continue; } - /* only accept in-sequence mapping. 
Old values are spurious - * retransmission - */ - mptcp_subflow_discard_data(ssk, skb, old_ack - ack_seq); + subflow->data_avail = MPTCP_SUBFLOW_DATA_AVAIL; + break; } return true; From patchwork Tue May 25 17:37:18 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Paolo Abeni X-Patchwork-Id: 12279737 X-Patchwork-Delegate: mat@martineau.name Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [216.205.24.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 74F8C2FB1 for ; Tue, 25 May 2021 17:37:44 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1621964263; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=am/xBEvQZd7yTJypSRhSUnla6evNiXgF9ZJM7S6uRqw=; b=FsVBcs/2SwE62qSlt6a2W3PsWU0/IQoWSF5T+mGF/VuVzfJS80w2KhLU/ZfPo3f+F9VDzu 69tFcFnKYGjBiw/rY74FuFHmSVV+5JkefgRRHnb3wF6e6zhwF19hqDgH0qTWEw7hCidQtH ulalV64XReaDlyDdkGk7PqtHGCYUGJE= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-195-MB3JAqAqO5Kcfc7q8pr_ZQ-1; Tue, 25 May 2021 13:37:41 -0400 X-MC-Unique: MB3JAqAqO5Kcfc7q8pr_ZQ-1 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.phx2.redhat.com [10.5.11.22]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id B0160107ACE4 for ; Tue, 25 May 2021 17:37:40 +0000 (UTC) Received: from gerbillo.redhat.com (ovpn-115-92.ams2.redhat.com [10.36.115.92]) by smtp.corp.redhat.com (Postfix) with ESMTP id EEC24100E113; Tue, 25 May 2021 17:37:39 +0000 
(UTC) From: Paolo Abeni To: mptcp@lists.linux.dev Cc: fwestpha@redhat.com Subject: [RFC PATCH 2/4] mptcp: don't clear MPTCP_DATA_READY in sk_wait_event() Date: Tue, 25 May 2021 19:37:18 +0200 Message-Id: In-Reply-To: References: X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.84 on 10.5.11.22 Authentication-Results: relay.mimecast.com; auth=pass smtp.auth=CUSA124A263 smtp.mailfrom=pabeni@redhat.com X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com If we don't entirely flush the receive queue, we need to set that bit again later. We can simply avoid clearing it. Signed-off-by: Paolo Abeni --- net/mptcp/protocol.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index e85ec0e84e06..6050431f4c86 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -1707,7 +1707,7 @@ static void mptcp_wait_data(struct sock *sk, long *timeo) sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk); sk_wait_event(sk, timeo, - test_and_clear_bit(MPTCP_DATA_READY, &msk->flags), &wait); + test_bit(MPTCP_DATA_READY, &msk->flags), &wait); sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk); remove_wait_queue(sk_sleep(sk), &wait); @@ -2028,10 +2028,8 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, */ if (unlikely(__mptcp_move_skbs(msk))) set_bit(MPTCP_DATA_READY, &msk->flags); - } else if (unlikely(!test_bit(MPTCP_DATA_READY, &msk->flags))) { - /* data to read but mptcp_wait_data() cleared DATA_READY */ - set_bit(MPTCP_DATA_READY, &msk->flags); } + out_err: if (cmsg_flags && copied >= 0) { if (cmsg_flags & MPTCP_CMSG_TS) From patchwork Tue May 25 17:37:19 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Paolo Abeni X-Patchwork-Id: 12279739 X-Patchwork-Delegate: mat@martineau.name Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com
[216.205.24.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7FB6770 for ; Tue, 25 May 2021 17:37:45 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1621964264; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=qPOSvso8iGVDynvkkuV+bUivg123pbBrIAqQlnfEB7w=; b=KJQCOqAlH3I77NzhclSXFhq4H1M18TUcPlmKVi6/z81ZQot19YPekQ/imfLFEJ4u8AtWgR NzClwrdwQjkwAHdjA0cSNHaNMtazfRqVDJEsg2b4tGdOt2qqAl1bUT2GVoybSuXWbo+wNE cHATV6NmpdfOxA5/Jl/HKl7ATgqyPqU= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-569-tPrE_qCbP4-YtjfOPzpp7w-1; Tue, 25 May 2021 13:37:42 -0400 X-MC-Unique: tPrE_qCbP4-YtjfOPzpp7w-1 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.phx2.redhat.com [10.5.11.22]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id C68301020C38 for ; Tue, 25 May 2021 17:37:41 +0000 (UTC) Received: from gerbillo.redhat.com (ovpn-115-92.ams2.redhat.com [10.36.115.92]) by smtp.corp.redhat.com (Postfix) with ESMTP id 118EB100E113; Tue, 25 May 2021 17:37:40 +0000 (UTC) From: Paolo Abeni To: mptcp@lists.linux.dev Cc: fwestpha@redhat.com Subject: [RFC PATCH 3/4] mptcp: move the whole rx path under msk socket lock protection Date: Tue, 25 May 2021 19:37:19 +0200 Message-Id: <3fa2a7102e0ca7e89dac2e14e60469ac414bcef4.1621963632.git.pabeni@redhat.com> In-Reply-To: References: X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.84 on 10.5.11.22 Authentication-Results: relay.mimecast.com; 
auth=pass smtp.auth=CUSA124A263 smtp.mailfrom=pabeni@redhat.com X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com After commit c2e6048fa1cf ("mptcp: fix race in release_cb") it's pretty straightforward to move the whole MPTCP rx path under the socket lock, leveraging the release_cb. We can drop a bunch of spin_lock pairs in the receive functions, use a single receive queue and invoke __mptcp_move_skbs only when subflows ask for it. This will allow more cleanup in the next patch. Signed-off-by: Paolo Abeni --- net/mptcp/protocol.c | 94 ++++++++++++++++++-------------------- net/mptcp/protocol.h | 2 +- 2 files changed, 39 insertions(+), 57 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 6050431f4c86..57deea409d0c 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -682,11 +682,6 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk) struct sock *sk = (struct sock *)msk; unsigned int moved = 0; - if (inet_sk_state_load(sk) == TCP_CLOSE) - return; - - mptcp_data_lock(sk); - __mptcp_move_skbs_from_subflow(msk, ssk, &moved); __mptcp_ofo_queue(msk); @@ -697,12 +692,11 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk) */ if (mptcp_pending_data_fin(sk, NULL)) mptcp_schedule_work(sk); - mptcp_data_unlock(sk); return moved > 0; } -void mptcp_data_ready(struct sock *sk, struct sock *ssk) +static void __mptcp_data_ready(struct sock *sk, struct sock *ssk) { struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk); struct mptcp_sock *msk = mptcp_sk(sk); @@ -731,6 +725,16 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk) } } +void mptcp_data_ready(struct sock *sk, struct sock *ssk) +{ + mptcp_data_lock(sk); + if (!sock_owned_by_user(sk)) + __mptcp_data_ready(sk, ssk); + else + set_bit(MPTCP_DEQUEUE, &mptcp_sk(sk)->flags); + mptcp_data_unlock(sk); +} + static bool mptcp_do_flush_join_list(struct mptcp_sock *msk) { struct mptcp_subflow_context *subflow; @@ -967,7 +971,6 @@
static bool mptcp_wmem_alloc(struct sock *sk, int size) if (msk->wmem_reserved >= size) goto account; - mptcp_data_lock(sk); if (!sk_wmem_schedule(sk, size)) { mptcp_data_unlock(sk); return false; @@ -975,7 +978,6 @@ static bool mptcp_wmem_alloc(struct sock *sk, int size) sk->sk_forward_alloc -= size; msk->wmem_reserved += size; - mptcp_data_unlock(sk); account: msk->wmem_reserved -= size; @@ -1002,12 +1004,10 @@ static void mptcp_mem_reclaim_partial(struct sock *sk) if (msk->wmem_reserved < 0) return; - mptcp_data_lock(sk); sk->sk_forward_alloc += msk->wmem_reserved; sk_mem_reclaim_partial(sk); msk->wmem_reserved = sk->sk_forward_alloc; sk->sk_forward_alloc = 0; - mptcp_data_unlock(sk); } static void dfrag_uncharge(struct sock *sk, int len) @@ -1092,9 +1092,7 @@ static void __mptcp_clean_una_wakeup(struct sock *sk) static void mptcp_clean_una_wakeup(struct sock *sk) { - mptcp_data_lock(sk); __mptcp_clean_una_wakeup(sk); - mptcp_data_unlock(sk); } static void mptcp_enter_memory_pressure(struct sock *sk) @@ -1713,16 +1711,22 @@ static void mptcp_wait_data(struct sock *sk, long *timeo) remove_wait_queue(sk_sleep(sk), &wait); } -static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk, +static bool __mptcp_move_skbs(struct sock *sk); + +static int __mptcp_recvmsg_mskq(struct sock *sk, struct msghdr *msg, size_t len, int flags, struct scm_timestamping_internal *tss, int *cmsg_flags) { + struct mptcp_sock *msk = mptcp_sk(sk); struct sk_buff *skb, *tmp; int copied = 0; - skb_queue_walk_safe(&msk->receive_queue, skb, tmp) { + if (skb_queue_empty(&sk->sk_receive_queue) && !__mptcp_move_skbs(sk)) + return 0; + + skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) { u32 offset = MPTCP_SKB_CB(skb)->offset; u32 data_len = skb->len - offset; u32 count = min_t(size_t, len - copied, data_len); @@ -1754,7 +1758,7 @@ static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk, /* we will bulk release the skb memory later */ skb->destructor = NULL; msk->rmem_released += skb->truesize; - 
__skb_unlink(skb, &msk->receive_queue); + __skb_unlink(skb, &sk->sk_receive_queue); __kfree_skb(skb); } @@ -1875,16 +1879,9 @@ static void __mptcp_update_rmem(struct sock *sk) msk->rmem_released = 0; } -static void __mptcp_splice_receive_queue(struct sock *sk) +static bool __mptcp_move_skbs(struct sock *sk) { struct mptcp_sock *msk = mptcp_sk(sk); - - skb_queue_splice_tail_init(&sk->sk_receive_queue, &msk->receive_queue); -} - -static bool __mptcp_move_skbs(struct mptcp_sock *msk) -{ - struct sock *sk = (struct sock *)msk; unsigned int moved = 0; bool ret, done; @@ -1893,18 +1890,12 @@ static bool __mptcp_move_skbs(struct mptcp_sock *msk) struct sock *ssk = mptcp_subflow_recv_lookup(msk); bool slowpath; - /* we can have data pending in the subflows only if the msk - * receive buffer was full at subflow_data_ready() time, - * that is an unlikely slow path. - */ - if (likely(!ssk)) + if (unlikely(!ssk)) break; slowpath = lock_sock_fast(ssk); - mptcp_data_lock(sk); __mptcp_update_rmem(sk); done = __mptcp_move_skbs_from_subflow(msk, ssk, &moved); - mptcp_data_unlock(sk); tcp_cleanup_rbuf(ssk, moved); unlock_sock_fast(ssk, slowpath); } while (!done); @@ -1912,17 +1903,16 @@ static bool __mptcp_move_skbs(struct mptcp_sock *msk) /* acquire the data lock only if some input data is pending */ ret = moved > 0; if (!RB_EMPTY_ROOT(&msk->out_of_order_queue) || - !skb_queue_empty_lockless(&sk->sk_receive_queue)) { - mptcp_data_lock(sk); + !skb_queue_empty(&sk->sk_receive_queue)) { __mptcp_update_rmem(sk); ret |= __mptcp_ofo_queue(msk); - __mptcp_splice_receive_queue(sk); - mptcp_data_unlock(sk); mptcp_cleanup_rbuf(msk); } - if (ret) + if (ret) { + set_bit(MPTCP_DATA_READY, &msk->flags); mptcp_check_data_fin((struct sock *)msk); - return !skb_queue_empty(&msk->receive_queue); + } + return ret; } static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, @@ -1938,7 +1928,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, if 
(unlikely(flags & MSG_ERRQUEUE)) return inet_recv_error(sk, msg, len, addr_len); - mptcp_lock_sock(sk, __mptcp_splice_receive_queue(sk)); + lock_sock(sk); if (unlikely(sk->sk_state == TCP_LISTEN)) { copied = -ENOTCONN; goto out_err; @@ -1952,7 +1942,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, while (copied < len) { int bytes_read; - bytes_read = __mptcp_recvmsg_mskq(msk, msg, len - copied, flags, &tss, &cmsg_flags); + bytes_read = __mptcp_recvmsg_mskq(sk, msg, len - copied, flags, &tss, &cmsg_flags); if (unlikely(bytes_read < 0)) { if (!copied) copied = bytes_read; @@ -1964,9 +1954,6 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, /* be sure to advertise window change */ mptcp_cleanup_rbuf(msk); - if (skb_queue_empty(&msk->receive_queue) && __mptcp_move_skbs(msk)) - continue; - /* only the master socket status is relevant here. The exit * conditions mirror closely tcp_recvmsg() */ @@ -1993,7 +1980,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, /* race breaker: the shutdown could be after the * previous receive queue check */ - if (__mptcp_move_skbs(msk)) + if (__mptcp_move_skbs(sk)) continue; break; } @@ -2018,16 +2005,11 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, mptcp_wait_data(sk, &timeo); } - if (skb_queue_empty_lockless(&sk->sk_receive_queue) && - skb_queue_empty(&msk->receive_queue)) { - /* entire backlog drained, clear DATA_READY. */ - clear_bit(MPTCP_DATA_READY, &msk->flags); - - /* .. race-breaker: ssk might have gotten new data - * after last __mptcp_move_skbs() returned false. + if (skb_queue_empty(&sk->sk_receive_queue)) { + /* entire backlog drained, clear DATA_READY. 
+ * release_cb/__mptcp_move_skbs() will eventually set it again if needed */ - if (unlikely(__mptcp_move_skbs(msk))) - set_bit(MPTCP_DATA_READY, &msk->flags); + clear_bit(MPTCP_DATA_READY, &msk->flags); } out_err: @@ -2376,7 +2358,6 @@ static int __mptcp_init_sock(struct sock *sk) INIT_LIST_HEAD(&msk->join_list); INIT_LIST_HEAD(&msk->rtx_queue); INIT_WORK(&msk->work, mptcp_worker); - __skb_queue_head_init(&msk->receive_queue); msk->out_of_order_queue = RB_ROOT; msk->first_pending = NULL; msk->wmem_reserved = 0; @@ -2828,9 +2809,6 @@ void mptcp_destroy_common(struct mptcp_sock *msk) __mptcp_clear_xmit(sk); - /* move to sk_receive_queue, sk_stream_kill_queues will purge it */ - skb_queue_splice_tail_init(&msk->receive_queue, &sk->sk_receive_queue); - skb_rbtree_purge(&msk->out_of_order_queue); mptcp_token_destroy(msk); mptcp_pm_free_anno_list(msk); @@ -2882,6 +2860,8 @@ static void mptcp_release_cb(struct sock *sk) flags |= BIT(MPTCP_PUSH_PENDING); if (test_and_clear_bit(MPTCP_RETRANSMIT, &mptcp_sk(sk)->flags)) flags |= BIT(MPTCP_RETRANSMIT); + if (test_and_clear_bit(MPTCP_DEQUEUE, &mptcp_sk(sk)->flags)) + flags |= BIT(MPTCP_DEQUEUE); if (!flags) break; @@ -2898,6 +2878,8 @@ static void mptcp_release_cb(struct sock *sk) __mptcp_push_pending(sk, 0); if (flags & BIT(MPTCP_RETRANSMIT)) __mptcp_retrans(sk); + if ((flags & BIT(MPTCP_DEQUEUE)) && __mptcp_move_skbs(sk)) + sk->sk_data_ready(sk); cond_resched(); spin_lock_bh(&sk->sk_lock.slock); diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index 2f22046a7565..d392ee44deb3 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -111,6 +111,7 @@ #define MPTCP_ERROR_REPORT 8 #define MPTCP_RETRANSMIT 9 #define MPTCP_WORK_SYNC_SETSOCKOPT 10 +#define MPTCP_DEQUEUE 11 static inline bool before64(__u64 seq1, __u64 seq2) { @@ -244,7 +245,6 @@ struct mptcp_sock { struct work_struct work; struct sk_buff *ooo_last_skb; struct rb_root out_of_order_queue; - struct sk_buff_head receive_queue; int tx_pending_data; int 
size_goal_cache; struct list_head conn_list; From patchwork Tue May 25 17:37:20 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Paolo Abeni X-Patchwork-Id: 12279741 X-Patchwork-Delegate: mat@martineau.name Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C1BD42FB1 for ; Tue, 25 May 2021 17:37:46 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1621964266; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=SIfnSKNQzCirITF0K5U5KzE6KpntffSxsuPOXTP6N+w=; b=SDPcDLjPhpFuHJnnuRCb9kejpC8I/NkXLiCo++lN3Js9C0tJrzab46y1a7FVVd9y0Hjl17 apE+X1L1gEDIOlkt+hr0ad8RmAtK7UrDRXzOVibAKewqDcbgUuRNTIMdX/UGEBQWSO8CoA lBhfA2n9psdcFfxOqDePGDRjjTo1dys= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-368-mb4FdqqNOVuz3kwjJR7YpA-1; Tue, 25 May 2021 13:37:44 -0400 X-MC-Unique: mb4FdqqNOVuz3kwjJR7YpA-1 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.phx2.redhat.com [10.5.11.22]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 64F37C7406 for ; Tue, 25 May 2021 17:37:43 +0000 (UTC) Received: from gerbillo.redhat.com (ovpn-115-92.ams2.redhat.com [10.36.115.92]) by smtp.corp.redhat.com (Postfix) with ESMTP id A3CD6100EBAF; Tue, 25 May 2021 17:37:41 +0000 (UTC) From: Paolo Abeni To: mptcp@lists.linux.dev Cc: fwestpha@redhat.com Subject: [RFC PATCH 4/4] mptcp: cleanup mem accounting. 
Date: Tue, 25 May 2021 19:37:20 +0200 Message-Id: In-Reply-To: References: X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.84 on 10.5.11.22 Authentication-Results: relay.mimecast.com; auth=pass smtp.auth=CUSA124A263 smtp.mailfrom=pabeni@redhat.com X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com After the previous patch, updating sk_forward_memory is cheap and we can drop a lot of complexity from the MPTCP memory accounting, removing the bulk fwd mem allocations for wmem and rmem. Signed-off-by: Paolo Abeni --- net/mptcp/protocol.c | 175 ++++--------------------------- net/mptcp/protocol.h | 17 +---- 2 files changed, 14 insertions(+), 178 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 57deea409d0c..1a9ac2986581 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -900,116 +900,6 @@ static bool mptcp_frag_can_collapse_to(const struct mptcp_sock *msk, df->data_seq + df->data_len == msk->write_seq; } -static int mptcp_wmem_with_overhead(int size) -{ - return size + ((sizeof(struct mptcp_data_frag) * size) >> PAGE_SHIFT); -} - -static void __mptcp_wmem_reserve(struct sock *sk, int size) -{ - int amount = mptcp_wmem_with_overhead(size); - struct mptcp_sock *msk = mptcp_sk(sk); - - WARN_ON_ONCE(msk->wmem_reserved); - if (WARN_ON_ONCE(amount < 0)) - amount = 0; - - if (amount <= sk->sk_forward_alloc) - goto reserve; - - /* under memory pressure try to reserve at most a single page - * otherwise try to reserve the full estimate and fallback - * to a single page before entering the error path - */ - if ((tcp_under_memory_pressure(sk) && amount > PAGE_SIZE) || - !sk_wmem_schedule(sk, amount)) { - if (amount <= PAGE_SIZE) - goto nomem; - - amount = PAGE_SIZE; - if (!sk_wmem_schedule(sk, amount)) - goto nomem; - } - -reserve: - msk->wmem_reserved = amount; - sk->sk_forward_alloc -= amount; - return; - -nomem: - /* we will wait for memory on
next allocation */ - msk->wmem_reserved = -1; -} - -static void __mptcp_update_wmem(struct sock *sk) -{ - struct mptcp_sock *msk = mptcp_sk(sk); - -#ifdef CONFIG_LOCKDEP - WARN_ON_ONCE(!lockdep_is_held(&sk->sk_lock.slock)); -#endif - - if (!msk->wmem_reserved) - return; - - if (msk->wmem_reserved < 0) - msk->wmem_reserved = 0; - if (msk->wmem_reserved > 0) { - sk->sk_forward_alloc += msk->wmem_reserved; - msk->wmem_reserved = 0; - } -} - -static bool mptcp_wmem_alloc(struct sock *sk, int size) -{ - struct mptcp_sock *msk = mptcp_sk(sk); - - /* check for pre-existing error condition */ - if (msk->wmem_reserved < 0) - return false; - - if (msk->wmem_reserved >= size) - goto account; - - if (!sk_wmem_schedule(sk, size)) { - mptcp_data_unlock(sk); - return false; - } - - sk->sk_forward_alloc -= size; - msk->wmem_reserved += size; - -account: - msk->wmem_reserved -= size; - return true; -} - -static void mptcp_wmem_uncharge(struct sock *sk, int size) -{ - struct mptcp_sock *msk = mptcp_sk(sk); - - if (msk->wmem_reserved < 0) - msk->wmem_reserved = 0; - msk->wmem_reserved += size; -} - -static void mptcp_mem_reclaim_partial(struct sock *sk) -{ - struct mptcp_sock *msk = mptcp_sk(sk); - - /* if we are experiencing a transint allocation error, - * the forward allocation memory has been already - * released - */ - if (msk->wmem_reserved < 0) - return; - - sk->sk_forward_alloc += msk->wmem_reserved; - sk_mem_reclaim_partial(sk); - msk->wmem_reserved = sk->sk_forward_alloc; - sk->sk_forward_alloc = 0; -} - static void dfrag_uncharge(struct sock *sk, int len) { sk_mem_uncharge(sk, len); @@ -1066,12 +956,8 @@ static void __mptcp_clean_una(struct sock *sk) } out: - if (cleaned) { - if (tcp_under_memory_pressure(sk)) { - __mptcp_update_wmem(sk); - sk_mem_reclaim_partial(sk); - } - } + if (cleaned && tcp_under_memory_pressure(sk)) + sk_mem_reclaim_partial(sk); if (snd_una == READ_ONCE(msk->snd_nxt)) { if (msk->timer_ival && !mptcp_data_fin_enabled(msk)) @@ -1083,18 +969,10 @@ 
static void __mptcp_clean_una(struct sock *sk) static void __mptcp_clean_una_wakeup(struct sock *sk) { -#ifdef CONFIG_LOCKDEP - WARN_ON_ONCE(!lockdep_is_held(&sk->sk_lock.slock)); -#endif __mptcp_clean_una(sk); mptcp_write_space(sk); } -static void mptcp_clean_una_wakeup(struct sock *sk) -{ - __mptcp_clean_una_wakeup(sk); -} - static void mptcp_enter_memory_pressure(struct sock *sk) { struct mptcp_subflow_context *subflow; @@ -1229,7 +1107,7 @@ static bool mptcp_must_reclaim_memory(struct sock *sk, struct sock *ssk) static bool mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk) { if (unlikely(mptcp_must_reclaim_memory(sk, ssk))) - mptcp_mem_reclaim_partial(sk); + sk_mem_reclaim_partial(sk); return __mptcp_alloc_tx_skb(sk, ssk, sk->sk_allocation); } @@ -1533,10 +1411,8 @@ static void __mptcp_subflow_push_pending(struct sock *sk, struct sock *ssk) goto out; } - if (unlikely(mptcp_must_reclaim_memory(sk, ssk))) { - __mptcp_update_wmem(sk); + if (unlikely(mptcp_must_reclaim_memory(sk, ssk))) sk_mem_reclaim_partial(sk); - } if (!__mptcp_alloc_tx_skb(sk, ssk, GFP_ATOMIC)) goto out; @@ -1560,7 +1436,6 @@ static void __mptcp_subflow_push_pending(struct sock *sk, struct sock *ssk) /* __mptcp_alloc_tx_skb could have released some wmem and we are * not going to flush it via release_sock() */ - __mptcp_update_wmem(sk); if (copied) { mptcp_set_timeout(sk, ssk); tcp_push(ssk, 0, info.mss_now, tcp_sk(ssk)->nonagle, @@ -1598,7 +1473,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) /* silently ignore everything else */ msg->msg_flags &= MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL; - mptcp_lock_sock(sk, __mptcp_wmem_reserve(sk, min_t(size_t, 1 << 20, len))); + lock_sock(sk); timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT); @@ -1611,8 +1486,8 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) pfrag = sk_page_frag(sk); while (msg_data_left(msg)) { - int total_ts, frag_truesize = 0; struct mptcp_data_frag *dfrag; + int 
frag_truesize = 0; bool dfrag_collapsed; size_t psize, offset; @@ -1644,14 +1519,13 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) offset = dfrag->offset + dfrag->data_len; psize = pfrag->size - offset; psize = min_t(size_t, psize, msg_data_left(msg)); - total_ts = psize + frag_truesize; + frag_truesize += psize; - if (!mptcp_wmem_alloc(sk, total_ts)) + if (!sk_wmem_schedule(sk, frag_truesize)) goto wait_for_memory; if (copy_page_from_iter(dfrag->page, offset, psize, &msg->msg_iter) != psize) { - mptcp_wmem_uncharge(sk, psize + frag_truesize); ret = -EFAULT; goto out; } @@ -1659,7 +1533,6 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) /* data successfully copied into the write queue */ copied += psize; dfrag->data_len += psize; - frag_truesize += psize; pfrag->offset += frag_truesize; WRITE_ONCE(msk->write_seq, msk->write_seq + psize); msk->tx_pending_data += psize; @@ -1668,6 +1541,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) * Note: we charge such data both to sk and ssk */ sk_wmem_queued_add(sk, frag_truesize); + sk_mem_charge(sk, frag_truesize); if (!dfrag_collapsed) { get_page(dfrag->page); list_add_tail(&dfrag->list, &msk->rtx_queue); @@ -1719,7 +1593,6 @@ static int __mptcp_recvmsg_mskq(struct sock *sk, struct scm_timestamping_internal *tss, int *cmsg_flags) { - struct mptcp_sock *msk = mptcp_sk(sk); struct sk_buff *skb, *tmp; int copied = 0; @@ -1755,9 +1628,10 @@ static int __mptcp_recvmsg_mskq(struct sock *sk, } if (!(flags & MSG_PEEK)) { - /* we will bulk release the skb memory later */ + /* avoid the indirect call, we know the destructor is sock_wfree */ skb->destructor = NULL; - msk->rmem_released += skb->truesize; + atomic_sub(skb->truesize, &sk->sk_rmem_alloc); + sk_mem_uncharge(sk, skb->truesize); __skb_unlink(skb, &sk->sk_receive_queue); __kfree_skb(skb); } @@ -1867,17 +1741,6 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied) 
msk->rcvq_space.time = mstamp; } -static void __mptcp_update_rmem(struct sock *sk) -{ - struct mptcp_sock *msk = mptcp_sk(sk); - - if (!msk->rmem_released) - return; - - atomic_sub(msk->rmem_released, &sk->sk_rmem_alloc); - sk_mem_uncharge(sk, msk->rmem_released); - msk->rmem_released = 0; -} static bool __mptcp_move_skbs(struct sock *sk) { @@ -1894,7 +1757,6 @@ static bool __mptcp_move_skbs(struct sock *sk) break; slowpath = lock_sock_fast(ssk); - __mptcp_update_rmem(sk); done = __mptcp_move_skbs_from_subflow(msk, ssk, &moved); tcp_cleanup_rbuf(ssk, moved); unlock_sock_fast(ssk, slowpath); @@ -1904,7 +1766,6 @@ static bool __mptcp_move_skbs(struct sock *sk) ret = moved > 0; if (!RB_EMPTY_ROOT(&msk->out_of_order_queue) || !skb_queue_empty(&sk->sk_receive_queue)) { - __mptcp_update_rmem(sk); ret |= __mptcp_ofo_queue(msk); mptcp_cleanup_rbuf(msk); } @@ -2250,7 +2111,7 @@ static void __mptcp_retrans(struct sock *sk) struct sock *ssk; int ret; - mptcp_clean_una_wakeup(sk); + __mptcp_clean_una_wakeup(sk); dfrag = mptcp_rtx_head(sk); if (!dfrag) { if (mptcp_data_fin_enabled(msk)) { @@ -2360,8 +2221,6 @@ static int __mptcp_init_sock(struct sock *sk) INIT_WORK(&msk->work, mptcp_worker); msk->out_of_order_queue = RB_ROOT; msk->first_pending = NULL; - msk->wmem_reserved = 0; - msk->rmem_released = 0; msk->tx_pending_data = 0; msk->ack_hint = NULL; @@ -2576,8 +2435,6 @@ static void __mptcp_destroy_sock(struct sock *sk) sk->sk_prot->destroy(sk); - WARN_ON_ONCE(msk->wmem_reserved); - WARN_ON_ONCE(msk->rmem_released); sk_stream_kill_queues(sk); xfrm_sk_free_policy(sk); @@ -2889,12 +2746,6 @@ static void mptcp_release_cb(struct sock *sk) __mptcp_clean_una_wakeup(sk); if (test_and_clear_bit(MPTCP_ERROR_REPORT, &mptcp_sk(sk)->flags)) __mptcp_error_report(sk); - - /* push_pending may touch wmem_reserved, ensure we do the cleanup - * later - */ - __mptcp_update_wmem(sk); - __mptcp_update_rmem(sk); } void mptcp_subflow_process_delegated(struct sock *ssk) diff --git 
a/net/mptcp/protocol.h b/net/mptcp/protocol.h index d392ee44deb3..94ca8d6e2f97 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -223,7 +223,6 @@ struct mptcp_sock { u64 ack_seq; u64 rcv_wnd_sent; u64 rcv_data_fin_seq; - int wmem_reserved; struct sock *last_snd; int snd_burst; int old_wspace; @@ -231,7 +230,6 @@ struct mptcp_sock { u64 wnd_end; unsigned long timer_ival; u32 token; - int rmem_released; unsigned long flags; bool can_ack; bool fully_established; @@ -265,19 +263,6 @@ struct mptcp_sock { char ca_name[TCP_CA_NAME_MAX]; }; -#define mptcp_lock_sock(___sk, cb) do { \ - struct sock *__sk = (___sk); /* silence macro reuse warning */ \ - might_sleep(); \ - spin_lock_bh(&__sk->sk_lock.slock); \ - if (__sk->sk_lock.owned) \ - __lock_sock(__sk); \ - cb; \ - __sk->sk_lock.owned = 1; \ - spin_unlock(&__sk->sk_lock.slock); \ - mutex_acquire(&__sk->sk_lock.dep_map, 0, 0, _RET_IP_); \ - local_bh_enable(); \ -} while (0) - #define mptcp_data_lock(sk) spin_lock_bh(&(sk)->sk_lock.slock) #define mptcp_data_unlock(sk) spin_unlock_bh(&(sk)->sk_lock.slock) @@ -296,7 +281,7 @@ static inline struct mptcp_sock *mptcp_sk(const struct sock *sk) static inline int __mptcp_space(const struct sock *sk) { - return tcp_space(sk) + READ_ONCE(mptcp_sk(sk)->rmem_released); + return tcp_space(sk); } static inline struct mptcp_data_frag *mptcp_send_head(const struct sock *sk)