
[v4,2/2] net: memcg: late association of sock to memcg

Message ID 20200310051606.33121-2-shakeelb@google.com (mailing list archive)
State New, archived

Commit Message

Shakeel Butt March 10, 2020, 5:16 a.m. UTC
If a TCP socket is allocated in IRQ context, or cloned in IRQ context
from an unassociated socket (i.e. one not associated with a memcg), then
it will remain unassociated for its whole life. Almost half of the TCP
sockets created on the system are created in IRQ context, so memory used
by such sockets will not be accounted to the memcg.

This issue is more widespread in cgroup v1, where network memory
accounting is opt-in, but it can also happen in cgroup v2 if the source
socket for the cloning was created in the root memcg.

To fix the issue, perform the association of the socket at accept()
time, in process context, and then force-charge the memory buffers
already used and reserved by the socket.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
Changes since v3:
- Moved the memcg association entirely to accept() time.

Changes since v2:
- Additional check for charging.
- Release the sock after charging.

Changes since v1:
- added sk->sk_rmem_alloc to initial charging.
- added synchronization to get memory usage and set sk_memcg race-free.

 mm/memcontrol.c                 | 14 --------------
 net/core/sock.c                 |  5 ++++-
 net/ipv4/inet_connection_sock.c | 20 ++++++++++++++++++++
 3 files changed, 24 insertions(+), 15 deletions(-)
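
For illustration, the root cause described above comes down to the guards at
the top of mem_cgroup_sk_alloc(): it refuses to associate a socket while in
interrupt context (both checks are visible as unchanged context in the
mm/memcontrol.c hunk below). A minimal paraphrased sketch, not the full
in-tree function:

void mem_cgroup_sk_alloc(struct sock *sk)
{
        if (!mem_cgroup_sockets_enabled)
                return;

        /* Do not associate the sock with unrelated interrupted task's memcg. */
        if (in_interrupt())
                return;  /* IRQ-context allocations and clones stay unassociated */

        /* ... otherwise look up current's memcg and set sk->sk_memcg ... */
}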

Comments

Eric Dumazet March 10, 2020, 3:53 p.m. UTC | #1
On 3/9/20 10:16 PM, Shakeel Butt wrote:
> If a TCP socket is allocated in IRQ context or cloned from unassociated
> (i.e. not associated to a memcg) in IRQ context then it will remain
> unassociated for its whole life. Almost half of the TCPs created on the
> system are created in IRQ context, so, memory used by such sockets will
> not be accounted by the memcg.
> 
> This issue is more widespread in cgroup v1 where network memory
> accounting is opt-in but it can happen in cgroup v2 if the source socket
> for the cloning was created in root memcg.
> 
> To fix the issue, just do the association of the sockets at the accept()
> time in the process context and then force charge the memory buffer
> already used and reserved by the socket.
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>

Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks !
David Miller March 10, 2020, 10:34 p.m. UTC | #2
From: Shakeel Butt <shakeelb@google.com>
Date: Mon,  9 Mar 2020 22:16:06 -0700

> If a TCP socket is allocated in IRQ context or cloned from unassociated
> (i.e. not associated to a memcg) in IRQ context then it will remain
> unassociated for its whole life. Almost half of the TCPs created on the
> system are created in IRQ context, so, memory used by such sockets will
> not be accounted by the memcg.
> 
> This issue is more widespread in cgroup v1 where network memory
> accounting is opt-in but it can happen in cgroup v2 if the source socket
> for the cloning was created in root memcg.
> 
> To fix the issue, just do the association of the sockets at the accept()
> time in the process context and then force charge the memory buffer
> already used and reserved by the socket.
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>

Applied and queued up for -stable.
Roman Gushchin March 10, 2020, 10:38 p.m. UTC | #3
On Mon, Mar 09, 2020 at 10:16:06PM -0700, Shakeel Butt wrote:
> If a TCP socket is allocated in IRQ context or cloned from unassociated
> (i.e. not associated to a memcg) in IRQ context then it will remain
> unassociated for its whole life. Almost half of the TCPs created on the
> system are created in IRQ context, so, memory used by such sockets will
> not be accounted by the memcg.
> 
> This issue is more widespread in cgroup v1 where network memory
> accounting is opt-in but it can happen in cgroup v2 if the source socket
> for the cloning was created in root memcg.
> 
> To fix the issue, just do the association of the sockets at the accept()
> time in the process context and then force charge the memory buffer
> already used and reserved by the socket.
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>

Reviewed-by: Roman Gushchin <guro@fb.com>

Thank you!
Qian Cai March 12, 2020, 2:03 p.m. UTC | #4
On Mon, 2020-03-09 at 22:16 -0700, Shakeel Butt wrote:
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index a4db79b1b643..65a3b2565102 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -482,6 +482,26 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
>  		}
>  		spin_unlock_bh(&queue->fastopenq.lock);
>  	}
> +
> +	if (mem_cgroup_sockets_enabled) {
> +		int amt;
> +
> +		/* atomically get the memory usage, set and charge the
> +		 * sk->sk_memcg.
> +		 */
> +		lock_sock(newsk);

Here we have a deadlock,

[  362.620977][ T4106] WARNING: possible recursive locking detected
[  362.626983][ T4106] 5.6.0-rc5-next-20200312+ #5 Tainted: G             L   
[  362.633941][ T4106] --------------------------------------------
[  362.639944][ T4106] sshd/4106 is trying to acquire lock:
[  362.645251][ T4106] 7bff008a2eae6330 (sk_lock-AF_INET){+.+.}, at: inet_csk_accept+0x370/0x45c
inet_csk_accept at net/ipv4/inet_connection_sock.c:497
[  362.653791][ T4106] 
[  362.653791][ T4106] but task is already holding lock:
[  362.661007][ T4106] c0ff008a2eae9430 (sk_lock-AF_INET){+.+.}, at: inet_csk_accept+0x48/0x45c
inet_csk_accept at net/ipv4/inet_connection_sock.c:451
[  362.669452][ T4106] 
[  362.669452][ T4106] other info that might help us debug this:
[  362.677364][ T4106]  Possible unsafe locking scenario:
[  362.677364][ T4106] 
[  362.684666][ T4106]        CPU0
[  362.687801][ T4106]        ----
[  362.690937][ T4106]   lock(sk_lock-AF_INET);
[  362.695204][ T4106]   lock(sk_lock-AF_INET);
[  362.699472][ T4106] 
[  362.699472][ T4106]  *** DEADLOCK ***
[  362.699472][ T4106] 
[  362.707469][ T4106]  May be due to missing lock nesting notation
[  362.707469][ T4106] 
[  362.715643][ T4106] 1 lock held by sshd/4106:
[  362.719993][ T4106]  #0: c0ff008a2eae9430 (sk_lock-AF_INET){+.+.}, at: inet_csk_accept+0x48/0x45c
[  362.728874][ T4106] 
[  362.728874][ T4106] stack backtrace:
[  362.734622][ T4106] CPU: 22 PID: 4106 Comm: sshd Tainted: G             L    5.6.0-rc5-next-20200312+ #5
[  362.744096][ T4106] Hardware name: HPE Apollo 70             /C01_APACHE_MB         , BIOS L50_5.13_1.11 06/18/2019
[  362.754525][ T4106] Call trace:
[  362.757667][ T4106]  dump_backtrace+0x0/0x2c8
[  362.762022][ T4106]  show_stack+0x20/0x2c
[  362.766032][ T4106]  dump_stack+0xe8/0x150
[  362.770128][ T4106]  validate_chain+0x2f08/0x35e0
[  362.774830][ T4106]  __lock_acquire+0x868/0xc2c
[  362.779358][ T4106]  lock_acquire+0x320/0x360
[  362.783715][ T4106]  lock_sock_nested+0x9c/0xd8
[  362.788243][ T4106]  inet_csk_accept+0x370/0x45c
[  362.792861][ T4106]  inet_accept+0x80/0x1cc
[  362.797045][ T4106]  __sys_accept4_file+0x1b0/0x2bc
[  362.801921][ T4106]  __arm64_sys_accept+0x74/0xc8
[  362.806625][ T4106]  do_el0_svc+0x170/0x240
[  362.810807][ T4106]  el0_sync_handler+0x150/0x250
[  362.815509][ T4106]  el0_sync+0x164/0x180


> +
> +		/* The sk has not been accepted yet, no need to look at
> +		 * sk->sk_wmem_queued.
> +		 */
> +		amt = sk_mem_pages(newsk->sk_forward_alloc +
> +				   atomic_read(&sk->sk_rmem_alloc));
> +		mem_cgroup_sk_alloc(newsk);
> +		if (newsk->sk_memcg && amt)
> +			mem_cgroup_charge_skmem(newsk->sk_memcg, amt);
> +
> +		release_sock(newsk);
> +	}
>  out:
>  	release_sock(sk);
>  	if (req)
Shakeel Butt March 12, 2020, 2:05 p.m. UTC | #5
On Thu, Mar 12, 2020 at 7:03 AM Qian Cai <cai@lca.pw> wrote:
>
> On Mon, 2020-03-09 at 22:16 -0700, Shakeel Butt wrote:
> > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > index a4db79b1b643..65a3b2565102 100644
> > --- a/net/ipv4/inet_connection_sock.c
> > +++ b/net/ipv4/inet_connection_sock.c
> > @@ -482,6 +482,26 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
> >               }
> >               spin_unlock_bh(&queue->fastopenq.lock);
> >       }
> > +
> > +     if (mem_cgroup_sockets_enabled) {
> > +             int amt;
> > +
> > +             /* atomically get the memory usage, set and charge the
> > +              * sk->sk_memcg.
> > +              */
> > +             lock_sock(newsk);
>
> Here we have a deadlock,

It's a missing lockdep annotation. Eric already has a patch in
progress to fix this and another typo in the original patch.
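
For context on what such a fix typically looks like: the listener sk and the
child newsk share the sk_lock-AF_INET lockdep class, so taking newsk's lock
while the listener's lock is held looks like recursion to lockdep. The usual
remedy is a nesting annotation on the inner lock; a minimal sketch of that
shape (an assumption about the follow-up, not necessarily the patch Eric
posted):

        /* Tell lockdep the child's lock is a distinct, nested acquisition. */
        lock_sock_nested(newsk, SINGLE_DEPTH_NESTING);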

>
> [  362.620977][ T4106] WARNING: possible recursive locking detected
> [  362.626983][ T4106] 5.6.0-rc5-next-20200312+ #5 Tainted: G             L
> [  362.633941][ T4106] --------------------------------------------
> [  362.639944][ T4106] sshd/4106 is trying to acquire lock:
> [  362.645251][ T4106] 7bff008a2eae6330 (sk_lock-AF_INET){+.+.}, at:
> inet_csk_accept+0x370/0x45c
> inet_csk_accept at net/ipv4/inet_connection_sock.c:497
> [  362.653791][ T4106]
> [  362.653791][ T4106] but task is already holding lock:
> [  362.661007][ T4106] c0ff008a2eae9430 (sk_lock-AF_INET){+.+.}, at:
> inet_csk_accept+0x48/0x45c
> inet_csk_accept at net/ipv4/inet_connection_sock.c:451
> [  362.669452][ T4106]
> [  362.669452][ T4106] other info that might help us debug this:
> [  362.677364][ T4106]  Possible unsafe locking scenario:
> [  362.677364][ T4106]
> [  362.684666][ T4106]        CPU0
> [  362.687801][ T4106]        ----
> [  362.690937][ T4106]   lock(sk_lock-AF_INET);
> [  362.695204][ T4106]   lock(sk_lock-AF_INET);
> [  362.699472][ T4106]
> [  362.699472][ T4106]  *** DEADLOCK ***
> [  362.699472][ T4106]
> [  362.707469][ T4106]  May be due to missing lock nesting notation
> [  362.707469][ T4106]
> [  362.715643][ T4106] 1 lock held by sshd/4106:
> [  362.719993][ T4106]  #0: c0ff008a2eae9430 (sk_lock-AF_INET){+.+.}, at:
> inet_csk_accept+0x48/0x45c
> [  362.728874][ T4106]
> [  362.728874][ T4106] stack backtrace:
> [  362.734622][ T4106] CPU: 22 PID: 4106 Comm: sshd Tainted:
> G             L    5.6.0-rc5-next-20200312+ #5
> [  362.744096][ T4106] Hardware name: HPE Apollo
> 70             /C01_APACHE_MB         , BIOS L50_5.13_1.11 06/18/2019
> [  362.754525][ T4106] Call trace:
> [  362.757667][ T4106]  dump_backtrace+0x0/0x2c8
> [  362.762022][ T4106]  show_stack+0x20/0x2c
> [  362.766032][ T4106]  dump_stack+0xe8/0x150
> [  362.770128][ T4106]  validate_chain+0x2f08/0x35e0
> [  362.774830][ T4106]  __lock_acquire+0x868/0xc2c
> [  362.779358][ T4106]  lock_acquire+0x320/0x360
> [  362.783715][ T4106]  lock_sock_nested+0x9c/0xd8
> [  362.788243][ T4106]  inet_csk_accept+0x370/0x45c
> [  362.792861][ T4106]  inet_accept+0x80/0x1cc
> [  362.797045][ T4106]  __sys_accept4_file+0x1b0/0x2bc
> [  362.801921][ T4106]  __arm64_sys_accept+0x74/0xc8
> [  362.806625][ T4106]  do_el0_svc+0x170/0x240
> [  362.810807][ T4106]  el0_sync_handler+0x150/0x250
> [  362.815509][ T4106]  el0_sync+0x164/0x180
>
>
> > +
> > +             /* The sk has not been accepted yet, no need to look at
> > +              * sk->sk_wmem_queued.
> > +              */
> > +             amt = sk_mem_pages(newsk->sk_forward_alloc +
> > +                                atomic_read(&sk->sk_rmem_alloc));
> > +             mem_cgroup_sk_alloc(newsk);
> > +             if (newsk->sk_memcg && amt)
> > +                     mem_cgroup_charge_skmem(newsk->sk_memcg, amt);
> > +
> > +             release_sock(newsk);
> > +     }
> >  out:
> >       release_sock(sk);
> >       if (req)

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 06a889b0538b..351603c6c1c9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6737,20 +6737,6 @@  void mem_cgroup_sk_alloc(struct sock *sk)
 	if (!mem_cgroup_sockets_enabled)
 		return;
 
-	/*
-	 * Socket cloning can throw us here with sk_memcg already
-	 * filled. It won't however, necessarily happen from
-	 * process context. So the test for root memcg given
-	 * the current task's memcg won't help us in this case.
-	 *
-	 * Respecting the original socket's memcg is a better
-	 * decision in this case.
-	 */
-	if (sk->sk_memcg) {
-		css_get(&sk->sk_memcg->css);
-		return;
-	}
-
 	/* Do not associate the sock with unrelated interrupted task's memcg. */
 	if (in_interrupt())
 		return;
diff --git a/net/core/sock.c b/net/core/sock.c
index e4af4dbc1c9e..0fc8937a7ff4 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1832,7 +1832,10 @@  struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 		atomic_set(&newsk->sk_zckey, 0);
 
 		sock_reset_flag(newsk, SOCK_DONE);
-		mem_cgroup_sk_alloc(newsk);
+
+		/* sk->sk_memcg will be populated at accept() time */
+		newsk->sk_memcg = NULL;
+
 		cgroup_sk_alloc(&newsk->sk_cgrp_data);
 
 		rcu_read_lock();
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index a4db79b1b643..65a3b2565102 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -482,6 +482,26 @@  struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
 		}
 		spin_unlock_bh(&queue->fastopenq.lock);
 	}
+
+	if (mem_cgroup_sockets_enabled) {
+		int amt;
+
+		/* atomically get the memory usage, set and charge the
+		 * sk->sk_memcg.
+		 */
+		lock_sock(newsk);
+
+		/* The sk has not been accepted yet, no need to look at
+		 * sk->sk_wmem_queued.
+		 */
+		amt = sk_mem_pages(newsk->sk_forward_alloc +
+				   atomic_read(&sk->sk_rmem_alloc));
+		mem_cgroup_sk_alloc(newsk);
+		if (newsk->sk_memcg && amt)
+			mem_cgroup_charge_skmem(newsk->sk_memcg, amt);
+
+		release_sock(newsk);
+	}
 out:
 	release_sock(sk);
 	if (req)
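
As a reading aid for the charge above: sk_mem_pages() converts a byte count
into whole accounting pages, rounding up, so the child's forward-allocated
and receive-queued bytes get charged to the memcg via
mem_cgroup_charge_skmem() in page units. Roughly (paraphrased from the
5.6-era helper in include/net/sock.h, where SK_MEM_QUANTUM equals PAGE_SIZE):

static inline int sk_mem_pages(int amt)
{
        /* Round the byte count up to whole SK_MEM_QUANTUM-sized pages. */
        return (amt + SK_MEM_QUANTUM - 1) >> SK_MEM_QUANTUM_SHIFT;
}

Note that the hunk reads atomic_read(&sk->sk_rmem_alloc) on the listener; the
"typo" Shakeel mentions earlier in the thread plausibly refers to this, since
newsk's receive queue is the intended input.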