diff mbox series

[mptcp-net] mptcp: fix fwd memory accounting on coalesce

Message ID e07e982885b904607d42e1e45840c4eb3c124d4b.1661426770.git.pabeni@redhat.com (mailing list archive)
State Accepted, archived
Commit f5d40f685b8a519c4e4f4522f348947f3f8e7da3
Delegated to: Matthieu Baerts
Headers show
Series [mptcp-net] mptcp: fix fwd memory accounting on coalesce | expand

Checks

Context Check Description
matttbe/checkpatch success total: 0 errors, 0 warnings, 0 checks, 15 lines checked
matttbe/build success Build and static analysis OK
matttbe/KVM_Validation__normal success Success! ✅
matttbe/KVM_Validation__debug success Success! ✅

Commit Message

Paolo Abeni Aug. 25, 2022, 11:29 a.m. UTC
The intel bot reported a memory accounting related splat:

[  240.473094] ------------[ cut here ]------------
[  240.478507] page_counter underflow: -4294828518 nr_pages=4294967290
[  240.485500] WARNING: CPU: 2 PID: 14986 at mm/page_counter.c:56 page_counter_cancel+0x96/0xc0
[  240.570849] CPU: 2 PID: 14986 Comm: mptcp_connect Tainted: G S                5.19.0-rc4-00739-gd24141fe7b48 #1
[  240.581637] Hardware name: HP HP Z240 SFF Workstation/802E, BIOS N51 Ver. 01.63 10/05/2017
[  240.590600] RIP: 0010:page_counter_cancel+0x96/0xc0
[  240.596179] Code: 00 00 00 45 31 c0 48 89 ef 5d 4c 89 c6 41 5c e9 40 fd ff ff 4c 89 e2 48 c7 c7 20 73 39 84 c6 05 d5 b1 52 04 01 e8 e7 95 f3
01 <0f> 0b eb a9 48 89 ef e8 1e 25 fc ff eb c3 66 66 2e 0f 1f 84 00 00
[  240.615639] RSP: 0018:ffffc9000496f7c8 EFLAGS: 00010082
[  240.621569] RAX: 0000000000000000 RBX: ffff88819c9c0120 RCX: 0000000000000000
[  240.629404] RDX: 0000000000000027 RSI: 0000000000000004 RDI: fffff5200092deeb
[  240.637239] RBP: ffff88819c9c0120 R08: 0000000000000001 R09: ffff888366527a2b
[  240.645069] R10: ffffed106cca4f45 R11: 0000000000000001 R12: 00000000fffffffa
[  240.652903] R13: ffff888366536118 R14: 00000000fffffffa R15: ffff88819c9c0000
[  240.660738] FS:  00007f3786e72540(0000) GS:ffff888366500000(0000) knlGS:0000000000000000
[  240.669529] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  240.675974] CR2: 00007f966b346000 CR3: 0000000168cea002 CR4: 00000000003706e0
[  240.683807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  240.691641] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  240.699468] Call Trace:
[  240.702613]  <TASK>
[  240.705413]  page_counter_uncharge+0x29/0x80
[  240.710389]  drain_stock+0xd0/0x180
[  240.714585]  refill_stock+0x278/0x580
[  240.718951]  __sk_mem_reduce_allocated+0x222/0x5c0
[  240.729248]  __mptcp_update_rmem+0x235/0x2c0
[  240.734228]  __mptcp_move_skbs+0x194/0x6c0
[  240.749764]  mptcp_recvmsg+0xdfa/0x1340
[  240.763153]  inet_recvmsg+0x37f/0x500
[  240.782109]  sock_read_iter+0x24a/0x380
[  240.805353]  new_sync_read+0x420/0x540
[  240.838552]  vfs_read+0x37f/0x4c0
[  240.842582]  ksys_read+0x170/0x200
[  240.864039]  do_syscall_64+0x5c/0x80
[  240.872770]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  240.878526] RIP: 0033:0x7f3786d9ae8e
[  240.882805] Code: c0 e9 b6 fe ff ff 50 48 8d 3d 6e 18 0a 00 e8 89 e8 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
[  240.902259] RSP: 002b:00007fff7be81e08 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  240.910533] RAX: ffffffffffffffda RBX: 0000000000002000 RCX: 00007f3786d9ae8e
[  240.918368] RDX: 0000000000002000 RSI: 00007fff7be87ec0 RDI: 0000000000000005
[  240.926206] RBP: 0000000000000005 R08: 00007f3786e6a230 R09: 00007f3786e6a240
[  240.934046] R10: fffffffffffff288 R11: 0000000000000246 R12: 0000000000002000
[  240.941884] R13: 00007fff7be87ec0 R14: 00007fff7be87ec0 R15: 0000000000002000
[  240.949741]  </TASK>
[  240.952632] irq event stamp: 27367
[  240.956735] hardirqs last  enabled at (27366): [<ffffffff81ba50ea>] mem_cgroup_uncharge_skmem+0x6a/0x80
[  240.966848] hardirqs last disabled at (27367): [<ffffffff81b8fd42>] refill_stock+0x282/0x580
[  240.976017] softirqs last  enabled at (27360): [<ffffffff83a4d8ef>] mptcp_recvmsg+0xaf/0x1340
[  240.985273] softirqs last disabled at (27364): [<ffffffff83a4d30c>] __mptcp_move_skbs+0x18c/0x6c0
[  240.994872] ---[ end trace 0000000000000000 ]---

After commit d24141fe7b48 ("mptcp: drop SK_RECLAIM_* macros"),
if rmem_fwd_alloc become negative, mptcp_rmem_uncharge() can
try to reclaim a negative amount of pages, since the expression:

	reclaimable >= PAGE_SIZE

will evaluate to true for any negative value of the int
'reclaimable': 'PAGE_SIZE' is an unsigned long and
the negative integer will be promoted to a (very large)
unsigned long value.

Still after the mentioned commit, kfree_skb_partial()
in mptcp_try_coalesce() will reclaim most of just released fwd
memory, so that following charging of the skb delta size will
lead to negative fwd memory values.

At that point a racing recvmsg() can trigger the splat.

Address the issue switching the order of the memory accounting
operations. The fwd memory can still transiently reach negative
values, but that will happen in an atomic scope and no code
path could touch/use such value.

Reported-by: kernel test robot <oliver.sang@intel.com>
Fixes: d24141fe7b ("mptcp: drop SK_RECLAIM_* macros")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
- Given all the above, I think we should revert/drop the 
  rx path refactor patches: not needed to address the memory
  accounting issue, and with too much problems to be solved
  in a reasonable time-frame.
- AFAICS plain TCP does not have a similar issue, as at
coalesce time the memory account operation are always performed
in the safe order.
---
 net/mptcp/protocol.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

MPTCP CI Aug. 25, 2022, 1 p.m. UTC | #1
Hi Paolo,

Thank you for your modifications, that's great!

Our CI did some validations and here is its report:

- KVM Validation: normal:
  - Success! ✅:
  - Task: https://cirrus-ci.com/task/6639699153387520
  - Summary: https://api.cirrus-ci.com/v1/artifact/task/6639699153387520/summary/summary.txt

- KVM Validation: debug:
  - Success! ✅:
  - Task: https://cirrus-ci.com/task/4599005572235264
  - Summary: https://api.cirrus-ci.com/v1/artifact/task/4599005572235264/summary/summary.txt

Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/80065c141cc7


If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-debug

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker


Please note that despite all the efforts that have been already done to have a
stable tests suite when executed on a public CI like here, it is possible some
reported issues are not due to your modifications. Still, do not hesitate to
help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (Tessares)
Matthieu Baerts Aug. 25, 2022, 5:20 p.m. UTC | #2
Hi Paolo,

On 25/08/2022 13:29, Paolo Abeni wrote:
> The intel bot reported a memory accounting related splat:

(...)

> After commit d24141fe7b48 ("mptcp: drop SK_RECLAIM_* macros"),
> if rmem_fwd_alloc become negative, mptcp_rmem_uncharge() can
> try to reclaim a negative amount of pages, since the expression:
> 
> 	reclaimable >= PAGE_SIZE
> 
> will evaluate to true for any negative value of the int
> 'reclaimable': 'PAGE_SIZE' is an unsigned long and
> the negative integer will be promoted to a (very large)
> unsigned long value.
> 
> Still after the mentioned commit, kfree_skb_partial()
> in mptcp_try_coalesce() will reclaim most of just released fwd
> memory, so that following charging of the skb delta size will
> lead to negative fwd memory values.
> 
> At that point a racing recvmsg() can trigger the splat.
> 
> Address the issue switching the order of the memory accounting
> operations. The fwd memory can still transiently reach negative
> values, but that will happen in an atomic scope and no code
> path could touch/use such value.

Thank you for the patch and the explanation at the meeting we just had!

> Reported-by: kernel test robot <oliver.sang@intel.com>
> Fixes: d24141fe7b ("mptcp: drop SK_RECLAIM_* macros")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> - Given all the above, I think we should revert/drop the 
>   rx path refactor patches: not needed to address the memory
>   accounting issue, and with too much problems to be solved
>   in a reasonable time-frame.

As discussed at the last meeting, I first reverted these 4 patches from
our tree:

- mptcp: move RCVPRUNE event later
- mptcp: more accurate receive buffer updates
- mptcp: move msk input path under full msk socket lock
- mptcp: use common helper for rmem memory accounting

- Results: f937b3dbcdfa..de53498937c4 (export)


Then I applied the new patch:

New patches for t/upstream-net:
- f5d40f685b8a: mptcp: fix fwd memory accounting on coalesce
- Results: 1d1e5f83f7a4..0449ad887e88 (export-net)

New patches for t/upstream:
- f5d40f685b8a: mptcp: fix fwd memory accounting on coalesce
- Results: de53498937c4..aa5df4a5bcf9 (export)


As expected, I had a small conflict to resolve but while at it, I also
fixed a small typo (accunting) and added a new line before the "return".


Tests are now in progress:

https://cirrus-ci.com/github/multipath-tcp/mptcp_net-next/export-net/20220825T171734
https://cirrus-ci.com/github/multipath-tcp/mptcp_net-next/export/20220825T171734


Cheers,
Matt
diff mbox series

Patch

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 74699bd47edf..c03e3162d98d 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -145,9 +145,14 @@  static bool mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
 		 MPTCP_SKB_CB(from)->map_seq, MPTCP_SKB_CB(to)->map_seq,
 		 to->len, MPTCP_SKB_CB(from)->end_seq);
 	MPTCP_SKB_CB(to)->end_seq = MPTCP_SKB_CB(from)->end_seq;
-	kfree_skb_partial(from, fragstolen);
+
+	/* note the fwd memory can reach a negative value after accunting
+	 * for the delta, but the later skb free will restore a non
+	 * negative one
+	 */
 	atomic_add(delta, &sk->sk_rmem_alloc);
 	sk->sk_forward_alloc -= delta;
+	kfree_skb_partial(from, fragstolen);
 	return true;
 }