[v2] net: memcg: late association of sock to memcg

Message ID 20200304233856.257891-1-shakeelb@google.com
State New
Series
  • [v2] net: memcg: late association of sock to memcg

Commit Message

Shakeel Butt March 4, 2020, 11:38 p.m. UTC
If a TCP socket is allocated in IRQ context, or cloned in IRQ context from a
socket that is unassociated (i.e. not associated with a memcg), then it will
remain unassociated for its whole life. Almost half of the TCP sockets created
on the system are created in IRQ context, so the memory used by such sockets
will not be accounted for by the memcg.

This issue is more widespread on cgroup v1, where network memory accounting is
opt-in, but it can also happen on cgroup v2 if the source socket for the
cloning was created in the root memcg.

To fix the issue, do the late association of the unassociated sockets at
accept() time in process context, and then force-charge the memory buffers
already reserved by the socket.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
Changes since v1:
- added sk->sk_rmem_alloc to initial charging.
- added synchronization to get memory usage and set sk_memcg race-free.

 net/ipv4/inet_connection_sock.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

Comments

Eric Dumazet March 5, 2020, 1:36 a.m. UTC | #1
On Wed, Mar 4, 2020 at 3:39 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> If a TCP socket is allocated in IRQ context or cloned from unassociated
> (i.e. not associated to a memcg) in IRQ context then it will remain
> unassociated for its whole life. Almost half of the TCPs created on the
> system are created in IRQ context, so, memory used by such sockets will
> not be accounted by the memcg.
>
> This issue is more widespread in cgroup v1 where network memory
> accounting is opt-in but it can happen in cgroup v2 if the source socket
> for the cloning was created in root memcg.
>
> To fix the issue, just do the late association of the unassociated
> sockets at accept() time in the process context and then force charge
> the memory buffer already reserved by the socket.
>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
> Changes since v1:
> - added sk->sk_rmem_alloc to initial charging.
> - added synchronization to get memory usage and set sk_memcg race-free.
>
>  net/ipv4/inet_connection_sock.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
>
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index a4db79b1b643..7bcd657cd45e 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -482,6 +482,25 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
>                 }
>                 spin_unlock_bh(&queue->fastopenq.lock);
>         }
> +
> +       if (mem_cgroup_sockets_enabled && !newsk->sk_memcg) {
> +               int amt;
> +
> +               /* atomically get the memory usage and set sk->sk_memcg. */
> +               lock_sock(newsk);
> +
> +               /* The sk has not been accepted yet, no need to look at
> +                * sk->sk_wmem_queued.
> +                */
> +               amt = sk_mem_pages(newsk->sk_forward_alloc +
> +                                  atomic_read(&sk->sk_rmem_alloc));
> +               mem_cgroup_sk_alloc(newsk);
> +
> +               release_sock(newsk);
> +
> +               if (newsk->sk_memcg)

Most sockets in the accept queue should have amt == 0, so maybe skip the
call when amt == 0 ?

Also, I would release_sock(newsk) after this, otherwise incoming
packets could mess with newsk->sk_forward_alloc :

if (amt && newsk->sk_memcg)
      mem_cgroup_charge_skmem(newsk->sk_memcg, amt);
release_sock(newsk);

Also, I wonder if mem_cgroup_charge_skmem() has been used at all these
last four years on arches with PAGE_SIZE != 4096.

(SK_MEM_QUANTUM is no longer PAGE_SIZE, but a fixed 4096.)



> +                       mem_cgroup_charge_skmem(newsk->sk_memcg, amt);
> +       }
>  out:
>         release_sock(sk);
>         if (req)
> --
> 2.25.0.265.gbab2e86ba0-goog
>
Shakeel Butt March 5, 2020, 2:18 a.m. UTC | #2
On Wed, Mar 4, 2020 at 5:36 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Mar 4, 2020 at 3:39 PM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > If a TCP socket is allocated in IRQ context or cloned from unassociated
> > (i.e. not associated to a memcg) in IRQ context then it will remain
> > unassociated for its whole life. Almost half of the TCPs created on the
> > system are created in IRQ context, so, memory used by such sockets will
> > not be accounted by the memcg.
> >
> > This issue is more widespread in cgroup v1 where network memory
> > accounting is opt-in but it can happen in cgroup v2 if the source socket
> > for the cloning was created in root memcg.
> >
> > To fix the issue, just do the late association of the unassociated
> > sockets at accept() time in the process context and then force charge
> > the memory buffer already reserved by the socket.
> >
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > ---
> > Changes since v1:
> > - added sk->sk_rmem_alloc to initial charging.
> > - added synchronization to get memory usage and set sk_memcg race-free.
> >
> >  net/ipv4/inet_connection_sock.c | 19 +++++++++++++++++++
> >  1 file changed, 19 insertions(+)
> >
> > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > index a4db79b1b643..7bcd657cd45e 100644
> > --- a/net/ipv4/inet_connection_sock.c
> > +++ b/net/ipv4/inet_connection_sock.c
> > @@ -482,6 +482,25 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
> >                 }
> >                 spin_unlock_bh(&queue->fastopenq.lock);
> >         }
> > +
> > +       if (mem_cgroup_sockets_enabled && !newsk->sk_memcg) {
> > +               int amt;
> > +
> > +               /* atomically get the memory usage and set sk->sk_memcg. */
> > +               lock_sock(newsk);
> > +
> > +               /* The sk has not been accepted yet, no need to look at
> > +                * sk->sk_wmem_queued.
> > +                */
> > +               amt = sk_mem_pages(newsk->sk_forward_alloc +
> > +                                  atomic_read(&sk->sk_rmem_alloc));
> > +               mem_cgroup_sk_alloc(newsk);
> > +
> > +               release_sock(newsk);
> > +
> > +               if (newsk->sk_memcg)
>
> Most sockets in accept queue should have amt == 0, so maybe avoid
> calling this thing only when amt == 0 ?
>

Thanks, will do in the next version. BTW, I have tested by adding an
mdelay() here and running iperf3, and I did see a non-zero amt.

> Also  I would release_sock(newsk) after this, otherwise incoming
> packets could mess with newsk->sk_forward_alloc
>

I think that is fine. Once sk->sk_memcg is set, mem_cgroup_charge_skmem()
will be called for new incoming packets. Here we just need to call
mem_cgroup_charge_skmem() with the amt that was reserved before
sk->sk_memcg was set.

> if (amt && newsk->sk_memcg)
>       mem_cgroup_charge_skmem(newsk->sk_memcg, amt);
> release_sock(newsk);
>
> Also, I wonder if     mem_cgroup_charge_skmem() has been used at all
> these last four years
> on arches with PAGE_SIZE != 4096
>
> ( SK_MEM_QUANTUM is not anymore PAGE_SIZE, but 4096)
>

Oh, so sk_mem_pages() does not really give the number of pages. Yeah,
this needs a fix for non-4096 page size architectures. Though I can
understand why this has not been caught yet. Network memory accounting
is opt-in in cgroup v1, and most users still use v1. In cgroup v2 it is
enabled and there is no way to opt out. Facebook is a well-known v2
user, and it seems they don't have systems with non-4096 page size
arches.
Eric Dumazet March 5, 2020, 4:37 a.m. UTC | #3
On Wed, Mar 4, 2020 at 6:19 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Wed, Mar 4, 2020 at 5:36 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Mar 4, 2020 at 3:39 PM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > If a TCP socket is allocated in IRQ context or cloned from unassociated
> > > (i.e. not associated to a memcg) in IRQ context then it will remain
> > > unassociated for its whole life. Almost half of the TCPs created on the
> > > system are created in IRQ context, so, memory used by such sockets will
> > > not be accounted by the memcg.
> > >
> > > This issue is more widespread in cgroup v1 where network memory
> > > accounting is opt-in but it can happen in cgroup v2 if the source socket
> > > for the cloning was created in root memcg.
> > >
> > > To fix the issue, just do the late association of the unassociated
> > > sockets at accept() time in the process context and then force charge
> > > the memory buffer already reserved by the socket.
> > >
> > > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > > ---
> > > Changes since v1:
> > > - added sk->sk_rmem_alloc to initial charging.
> > > - added synchronization to get memory usage and set sk_memcg race-free.
> > >
> > >  net/ipv4/inet_connection_sock.c | 19 +++++++++++++++++++
> > >  1 file changed, 19 insertions(+)
> > >
> > > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > > index a4db79b1b643..7bcd657cd45e 100644
> > > --- a/net/ipv4/inet_connection_sock.c
> > > +++ b/net/ipv4/inet_connection_sock.c
> > > @@ -482,6 +482,25 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
> > >                 }
> > >                 spin_unlock_bh(&queue->fastopenq.lock);
> > >         }
> > > +
> > > +       if (mem_cgroup_sockets_enabled && !newsk->sk_memcg) {
> > > +               int amt;
> > > +
> > > +               /* atomically get the memory usage and set sk->sk_memcg. */
> > > +               lock_sock(newsk);
> > > +
> > > +               /* The sk has not been accepted yet, no need to look at
> > > +                * sk->sk_wmem_queued.
> > > +                */
> > > +               amt = sk_mem_pages(newsk->sk_forward_alloc +
> > > +                                  atomic_read(&sk->sk_rmem_alloc));
> > > +               mem_cgroup_sk_alloc(newsk);
> > > +
> > > +               release_sock(newsk);
> > > +
> > > +               if (newsk->sk_memcg)
> >
> > Most sockets in accept queue should have amt == 0, so maybe avoid
> > calling this thing only when amt == 0 ?
> >
>
> Thanks, will do in the next version. BTW I have tested with adding
> mdelay() here and running iperf3 and I did see non-zero amt.
>
> > Also  I would release_sock(newsk) after this, otherwise incoming
> > packets could mess with newsk->sk_forward_alloc
> >
>
> I think that is fine. Once sk->sk_memcg is set then
> mem_cgroup_charge_skmem() will be called for new incoming packets.
> Here we just need to call mem_cgroup_charge_skmem() with amt before
> sk->sk_memcg was set.


Unfortunately, as soon as release_sock(newsk) is done, incoming packets
can be fed to the socket and completely change its memory usage.

For example, the whole queue might have been zapped, or collapsed, if we
receive a RST packet, or if memory pressure asks us to prune the
out-of-order queue.

So you might charge something, then never uncharge it, since at close()
time the socket will have zero bytes to uncharge.



>
> > if (amt && newsk->sk_memcg)
> >       mem_cgroup_charge_skmem(newsk->sk_memcg, amt);
> > release_sock(newsk);
> >
> > Also, I wonder if     mem_cgroup_charge_skmem() has been used at all
> > these last four years
> > on arches with PAGE_SIZE != 4096
> >
> > ( SK_MEM_QUANTUM is not anymore PAGE_SIZE, but 4096)
> >
>
> Oh so sk_mem_pages() does not really give the number of pages. Yeah
> this needs a fix for non-4096 page size architectures. Though I can
> understand why this has not been caught yet. Network memory accounting
> is opt-in in cgroup v1 and most of the users still use v1. In cgroup
> v2, it is enabled and there is no way to opt-out. Facebook is a
> well-known v2 user and it seems like they don't have non-4096 page
> size arch systems.
Shakeel Butt March 5, 2020, 4:54 a.m. UTC | #4
On Wed, Mar 4, 2020 at 8:38 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Mar 4, 2020 at 6:19 PM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Wed, Mar 4, 2020 at 5:36 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Wed, Mar 4, 2020 at 3:39 PM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > If a TCP socket is allocated in IRQ context or cloned from unassociated
> > > > (i.e. not associated to a memcg) in IRQ context then it will remain
> > > > unassociated for its whole life. Almost half of the TCPs created on the
> > > > system are created in IRQ context, so, memory used by such sockets will
> > > > not be accounted by the memcg.
> > > >
> > > > This issue is more widespread in cgroup v1 where network memory
> > > > accounting is opt-in but it can happen in cgroup v2 if the source socket
> > > > for the cloning was created in root memcg.
> > > >
> > > > To fix the issue, just do the late association of the unassociated
> > > > sockets at accept() time in the process context and then force charge
> > > > the memory buffer already reserved by the socket.
> > > >
> > > > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > > > ---
> > > > Changes since v1:
> > > > - added sk->sk_rmem_alloc to initial charging.
> > > > - added synchronization to get memory usage and set sk_memcg race-free.
> > > >
> > > >  net/ipv4/inet_connection_sock.c | 19 +++++++++++++++++++
> > > >  1 file changed, 19 insertions(+)
> > > >
> > > > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > > > index a4db79b1b643..7bcd657cd45e 100644
> > > > --- a/net/ipv4/inet_connection_sock.c
> > > > +++ b/net/ipv4/inet_connection_sock.c
> > > > @@ -482,6 +482,25 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
> > > >                 }
> > > >                 spin_unlock_bh(&queue->fastopenq.lock);
> > > >         }
> > > > +
> > > > +       if (mem_cgroup_sockets_enabled && !newsk->sk_memcg) {
> > > > +               int amt;
> > > > +
> > > > +               /* atomically get the memory usage and set sk->sk_memcg. */
> > > > +               lock_sock(newsk);
> > > > +
> > > > +               /* The sk has not been accepted yet, no need to look at
> > > > +                * sk->sk_wmem_queued.
> > > > +                */
> > > > +               amt = sk_mem_pages(newsk->sk_forward_alloc +
> > > > +                                  atomic_read(&sk->sk_rmem_alloc));
> > > > +               mem_cgroup_sk_alloc(newsk);
> > > > +
> > > > +               release_sock(newsk);
> > > > +
> > > > +               if (newsk->sk_memcg)
> > >
> > > Most sockets in accept queue should have amt == 0, so maybe avoid
> > > calling this thing only when amt == 0 ?
> > >
> >
> > Thanks, will do in the next version. BTW I have tested with adding
> > mdelay() here and running iperf3 and I did see non-zero amt.
> >
> > > Also  I would release_sock(newsk) after this, otherwise incoming
> > > packets could mess with newsk->sk_forward_alloc
> > >
> >
> > I think that is fine. Once sk->sk_memcg is set then
> > mem_cgroup_charge_skmem() will be called for new incoming packets.
> > Here we just need to call mem_cgroup_charge_skmem() with amt before
> > sk->sk_memcg was set.
>
>
> Unfortunately, as soon as release_sock(newsk) is done, incoming
> packets can be fed to the socket,
> and completely change memory usage of the socket.
>
> For example, the whole queue might have been zapped, or collapsed, if
> we receive a RST packet,
> or if memory pressure asks us to prune the out of order queue.
>
> So you might charge something, then never uncharge it, since at
> close() time the socket will have zero bytes to uncharge.
>

Ok, thanks for the explanation. I will fix this in the next version.

Patch

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index a4db79b1b643..7bcd657cd45e 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -482,6 +482,25 @@  struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
 		}
 		spin_unlock_bh(&queue->fastopenq.lock);
 	}
+
+	if (mem_cgroup_sockets_enabled && !newsk->sk_memcg) {
+		int amt;
+
+		/* atomically get the memory usage and set sk->sk_memcg. */
+		lock_sock(newsk);
+
+		/* The sk has not been accepted yet, no need to look at
+		 * sk->sk_wmem_queued.
+		 */
+		amt = sk_mem_pages(newsk->sk_forward_alloc +
+				   atomic_read(&sk->sk_rmem_alloc));
+		mem_cgroup_sk_alloc(newsk);
+
+		release_sock(newsk);
+
+		if (newsk->sk_memcg)
+			mem_cgroup_charge_skmem(newsk->sk_memcg, amt);
+	}
 out:
 	release_sock(sk);
 	if (req)