[net-next] net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()

Message ID: 20210817194003.2102381-1-weiwan@google.com (mailing list archive)
State: New
Series: [net-next] net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()

Commit Message

Wei Wang Aug. 17, 2021, 7:40 p.m. UTC
Add gfp_t mask as an input parameter to mem_cgroup_charge_skmem(),
to give more control to the networking stack and enable it to change
memcg charging behavior. In the future, the networking stack may decide
to avoid oom-kills when fallbacks are more appropriate.

One behavior change in mem_cgroup_charge_skmem() by this patch is to
avoid force charging by default and let the caller decide when and if
force charging is needed through the presence or absence of
__GFP_NOFAIL.

Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
---
 include/linux/memcontrol.h      |  3 ++-
 include/net/sock.h              |  5 +++++
 mm/memcontrol.c                 | 24 +++++++++++-------------
 net/core/sock.c                 | 16 ++++++++++++----
 net/ipv4/inet_connection_sock.c |  3 ++-
 net/ipv4/tcp_output.c           |  3 ++-
 6 files changed, 34 insertions(+), 20 deletions(-)
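
For illustration, the call pattern after this change, as it appears in the
diff at the bottom of this page: normal-path callers pass gfp_memcg_charge()
(GFP_NOWAIT in softirq, GFP_KERNEL otherwise), and callers that still want
the old force-charging behavior OR in __GFP_NOFAIL explicitly:

	/* Normal (failable) charge in __sk_mem_raise_allocated(): */
	charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt, gfp_memcg_charge());

	/* Explicit force charge, e.g. in sk_forced_mem_schedule(): */
	mem_cgroup_charge_skmem(sk->sk_memcg, amt,
				gfp_memcg_charge() | __GFP_NOFAIL);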

Comments

patchwork-bot+netdevbpf@kernel.org Aug. 18, 2021, 10:50 a.m. UTC | #1
Hello:

This patch was applied to netdev/net-next.git (refs/heads/master):

On Tue, 17 Aug 2021 12:40:03 -0700 you wrote:
> Add gfp_t mask as an input parameter to mem_cgroup_charge_skmem(),
> to give more control to the networking stack and enable it to change
> memcg charging behavior. In the future, the networking stack may decide
> to avoid oom-kills when fallbacks are more appropriate.
> 
> One behavior change in mem_cgroup_charge_skmem() by this patch is to
> avoid force charging by default and let the caller decide when and if
> force charging is needed through the presence or absence of
> __GFP_NOFAIL.
> 
> [...]

Here is the summary with links:
  - [net-next] net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()
    https://git.kernel.org/netdev/net-next/c/4b1327be9fe5

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
Jakub Kicinski Oct. 12, 2022, 11:33 p.m. UTC | #2
On Tue, 17 Aug 2021 12:40:03 -0700 Wei Wang wrote:
> Add gfp_t mask as an input parameter to mem_cgroup_charge_skmem(),
> to give more control to the networking stack and enable it to change
> memcg charging behavior. In the future, the networking stack may decide
> to avoid oom-kills when fallbacks are more appropriate.
> 
> One behavior change in mem_cgroup_charge_skmem() by this patch is to
> avoid force charging by default and let the caller decide when and if
> force charging is needed through the presence or absence of
> __GFP_NOFAIL.

This patch is causing a little bit of pain to us, to workloads running
with just memory.max set. After this change the TCP rx path no longer
forces the charging.

Any recommendation for the fix? Setting memory.high a few MB under
memory.max seems to remove the failures.
Shakeel Butt Oct. 13, 2022, 12:17 a.m. UTC | #3
On Wed, Oct 12, 2022 at 4:33 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 17 Aug 2021 12:40:03 -0700 Wei Wang wrote:
> > Add gfp_t mask as an input parameter to mem_cgroup_charge_skmem(),
> > to give more control to the networking stack and enable it to change
> > memcg charging behavior. In the future, the networking stack may decide
> > to avoid oom-kills when fallbacks are more appropriate.
> >
> > One behavior change in mem_cgroup_charge_skmem() by this patch is to
> > avoid force charging by default and let the caller decide when and if
> > force charging is needed through the presence or absence of
> > __GFP_NOFAIL.
>
> This patch is causing a little bit of pain to us, to workloads running
> with just memory.max set. After this change the TCP rx path no longer
> forces the charging.
>
> Any recommendation for the fix? Setting memory.high a few MB under
> memory.max seems to remove the failures.

Did the revert of this patch fix the issue you are seeing? The reason
I am asking is that this patch should not change the behavior.
Actually, someone else reported a similar issue for UDP RX at [1] and
they tested the revert as well. The revert did not fix the issue for
them.

Wei has a better explanation at [2] why this patch is not the cause
for these issues.

[1] https://lore.kernel.org/all/CALvZod5_LVkOkF+gmefnctmx+bRjykSARm2JA9eqKJx85NYBGQ@mail.gmail.com/
[2] https://lore.kernel.org/all/CAEA6p_BhAh6f_kAHEoEJ38nunY=c=4WqxhJQUjT+dCSAr_rm8g@mail.gmail.com/
Jakub Kicinski Oct. 13, 2022, 12:38 a.m. UTC | #4
On Wed, 12 Oct 2022 17:17:38 -0700 Shakeel Butt wrote:
> Did the revert of this patch fix the issue you are seeing? The reason
> I am asking is because this patch should not change the behavior.
> Actually someone else reported the similar issue for UDP RX at [1] and
> they tested the revert as well. The revert did not fix the issue for
> them.
> 
> Wei has a better explanation at [2] why this patch is not the cause
> for these issues.

We're talking TCP here, to be clear. I haven't tested a revert, yet (not
that easy to test with a real workload) but I'm relatively confident the
change did introduce an "unforced" call, specifically this bit:

@@ -2728,10 +2728,12 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 {
 	struct proto *prot = sk->sk_prot;
 	long allocated = sk_memory_allocated_add(sk, amt);
+	bool memcg_charge = mem_cgroup_sockets_enabled && sk->sk_memcg;
 	bool charged = true;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
-	    !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt)))
+	if (memcg_charge &&
+	    !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
+						gfp_memcg_charge())))

where gfp_memcg_charge() is GFP_NOWAIT in softirq.

The above gets called from (inverted stack):
 tcp_data_queue()
 tcp_try_rmem_schedule(sk, skb, skb->truesize)
 tcp_try_rmem_schedule()
 sk_rmem_schedule()
 __sk_mem_schedule()
 __sk_mem_raise_allocated()

Is my confidence unjustified? :)
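
For reference, the gfp_memcg_charge() helper referred to above, as added by
the patch (see the full diff at the bottom of this page):

	static inline gfp_t gfp_memcg_charge(void)
	{
		return in_softirq() ? GFP_NOWAIT : GFP_KERNEL;
	}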
Shakeel Butt Oct. 13, 2022, 12:54 a.m. UTC | #5
On Wed, Oct 12, 2022 at 05:38:25PM -0700, Jakub Kicinski wrote:
> On Wed, 12 Oct 2022 17:17:38 -0700 Shakeel Butt wrote:
> > Did the revert of this patch fix the issue you are seeing? The reason
> > I am asking is because this patch should not change the behavior.
> > Actually someone else reported the similar issue for UDP RX at [1] and
> > they tested the revert as well. The revert did not fix the issue for
> > them.
> > 
> > Wei has a better explanation at [2] why this patch is not the cause
> > for these issues.
> 
> We're talking TCP here, to be clear. I haven't tested a revert, yet (not
> that easy to test with a real workload) but I'm relatively confident the
> change did introduce an "unforced" call, specifically this bit:
> 
> @@ -2728,10 +2728,12 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
>  {
>  	struct proto *prot = sk->sk_prot;
>  	long allocated = sk_memory_allocated_add(sk, amt);
> +	bool memcg_charge = mem_cgroup_sockets_enabled && sk->sk_memcg;
>  	bool charged = true;
>  
> -	if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
> -	    !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt)))
> +	if (memcg_charge &&
> +	    !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
> +						gfp_memcg_charge())))
> 
> where gfp_memcg_charge() is GFP_NOWAIT in softirq.
> 
> The above gets called from (inverted stack):
>  tcp_data_queue()
>  tcp_try_rmem_schedule(sk, skb, skb->truesize)
>  tcp_try_rmem_schedule()
>  sk_rmem_schedule()
>  __sk_mem_schedule()
>  __sk_mem_raise_allocated()
> 
> Is my confidence unjustified? :)
> 

Let me add Wei's explanation inline, which is protocol-independent:

	In __sk_mem_raise_allocated() BEFORE the above patch:
	- mem_cgroup_charge_skmem() gets called:
	    - try_charge() with GFP_NOWAIT gets called and fails
	    - try_charge() with __GFP_NOFAIL force-charges
	    - return false
	- goto suppress_allocation:
	    - mem_cgroup_uncharge_skmem() gets called
	- return 0 (which means failure)

	AFTER the above patch, what happens in __sk_mem_raise_allocated() is:
	- mem_cgroup_charge_skmem() gets called:
	    - try_charge() with GFP_NOWAIT gets called and fails
	    - return false
	- goto suppress_allocation:
	    - mem_cgroup_uncharge_skmem() is no longer called
	- return 0

So, before the patch, the memcg code may force the charge, but it still
returns false and makes the networking code uncharge the memcg for
SK_MEM_RECV.
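
For reference, the pre-patch tail of mem_cgroup_charge_skmem() that the first
list above describes (reconstructed from the removed lines in the diff at the
bottom of this page), with annotations:

	/* Don't block in the packet receive path */
	if (in_softirq())
		gfp_mask = GFP_NOWAIT;

	mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);

	if (try_charge(memcg, gfp_mask, nr_pages) == 0)
		return true;

	/* Force the charge past the limit anyway, but still report failure,
	 * so __sk_mem_raise_allocated() takes the suppress_allocation path
	 * and uncharges; the net effect is a charge/uncharge cycle.
	 */
	try_charge(memcg, gfp_mask|__GFP_NOFAIL, nr_pages);
	return false;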
Jakub Kicinski Oct. 13, 2022, 1:40 a.m. UTC | #6
On Thu, 13 Oct 2022 00:54:31 +0000 Shakeel Butt wrote:
> So, before the patch, the memcg code may force charges but it will
> return false and make the networking code to uncharge memcg for
> SK_MEM_RECV.

Ah, right, I see it now :(

I guess I'll have to try to test (some approximation of) a revert 
after all.

Did the fact that we used to force charge not potentially cause
reclaim, tho?  Letting TCP accept the next packet even if it had
to drop the current one?
Jakub Kicinski Oct. 13, 2022, 3:16 a.m. UTC | #7
On Wed, 12 Oct 2022 18:40:50 -0700 Jakub Kicinski wrote:
> Did the fact that we used to force charge not potentially cause
> reclaim, tho?  Letting TCP accept the next packet even if it had
> to drop the current one?

I pushed this little nugget to one affected machine via KLP:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 03ffbb255e60..c1ca369a1b77 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7121,6 +7121,10 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
                return true;
        }
 
+       if (gfp_mask == GFP_NOWAIT) {
+               try_charge(memcg, gfp_mask|__GFP_NOFAIL, nr_pages);
+               refill_stock(memcg, nr_pages);
+       }
        return false;
 }

The problem normally reproduces reliably within 10min; it's been 30min and
counting, and the application-level latency has not spiked.
Wei Wang Oct. 13, 2022, 3:34 a.m. UTC | #8
On Wed, Oct 12, 2022 at 8:16 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 12 Oct 2022 18:40:50 -0700 Jakub Kicinski wrote:
> > Did the fact that we used to force charge not potentially cause
> > reclaim, tho?  Letting TCP accept the next packet even if it had
> > to drop the current one?
>
> I pushed this little nugget to one affected machine via KLP:
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 03ffbb255e60..c1ca369a1b77 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7121,6 +7121,10 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
>                 return true;
>         }
>
> +       if (gfp_mask == GFP_NOWAIT) {
> +               try_charge(memcg, gfp_mask|__GFP_NOFAIL, nr_pages);
> +               refill_stock(memcg, nr_pages);
> +       }
>         return false;
>  }
>
AFAICT, if you force charge by passing __GFP_NOFAIL to try_charge(),
you should return true to tell the caller that the nr_pages is
actually being charged. Although I am not very sure what
refill_stock() does. Does that "uncharge" those pages?


> The problem normally reproes reliably within 10min -- 30min and counting
> and the application-level latency has not spiked.
Jakub Kicinski Oct. 13, 2022, 3:49 a.m. UTC | #9
On Wed, 12 Oct 2022 20:34:00 -0700 Wei Wang wrote:
> > I pushed this little nugget to one affected machine via KLP:
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 03ffbb255e60..c1ca369a1b77 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -7121,6 +7121,10 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
> >                 return true;
> >         }
> >
> > +       if (gfp_mask == GFP_NOWAIT) {
> > +               try_charge(memcg, gfp_mask|__GFP_NOFAIL, nr_pages);
> > +               refill_stock(memcg, nr_pages);
> > +       }
> >         return false;
> >  }
> >  
> AFAICT, if you force charge by passing __GFP_NOFAIL to try_charge(),
> you should return true to tell the caller that the nr_pages is
> actually being charged.

Ack - not sure what the best thing to do is, tho. Always pass NOFAIL 
in softirq?

It's not clear to me yet why doing the charge/uncharge actually helps,
perhaps try_to_free_mem_cgroup_pages() does more when NOFAIL is passed?

I'll do more digging tomorrow.

> Although I am not very sure what refill_stock() does. Does that
> "uncharge" those pages?

I think so, I copied it from mem_cgroup_uncharge_skmem().
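
For context, mem_cgroup_uncharge_skmem() in mm/memcontrol.c of that era looks
roughly like this; on the cgroup v2 path the "uncharge" is a stat update plus
refilling the per-cpu stock:

	void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg,
				       unsigned int nr_pages)
	{
		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
			page_counter_uncharge(&memcg->tcpmem, nr_pages);
			return;
		}

		mod_memcg_state(memcg, MEMCG_SOCK, -nr_pages);

		refill_stock(memcg, nr_pages);
	}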
Wei Wang Oct. 13, 2022, 4:04 a.m. UTC | #10
On Wed, Oct 12, 2022 at 8:49 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 12 Oct 2022 20:34:00 -0700 Wei Wang wrote:
> > > I pushed this little nugget to one affected machine via KLP:
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 03ffbb255e60..c1ca369a1b77 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -7121,6 +7121,10 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
> > >                 return true;
> > >         }
> > >
> > > +       if (gfp_mask == GFP_NOWAIT) {
> > > +               try_charge(memcg, gfp_mask|__GFP_NOFAIL, nr_pages);
> > > +               refill_stock(memcg, nr_pages);
> > > +       }
> > >         return false;
> > >  }
> > >
> > AFAICT, if you force charge by passing __GFP_NOFAIL to try_charge(),
> > you should return true to tell the caller that the nr_pages is
> > actually being charged.
>
> Ack - not sure what the best thing to do is, tho. Always pass NOFAIL
> in softirq?
>
> It's not clear to me yet why doing the charge/uncharge actually helps,
> perhaps try_to_free_mem_cgroup_pages() does more when NOFAIL is passed?
>
I am curious to know as well.

> I'll do more digging tomorrow.
>
> > Although I am not very sure what refill_stock() does. Does that
> > "uncharge" those pages?
>
> I think so, I copied it from mem_cgroup_uncharge_skmem().
Shakeel Butt Oct. 13, 2022, 4:18 a.m. UTC | #11
On Wed, Oct 12, 2022 at 09:04:59PM -0700, Wei Wang wrote:
> On Wed, Oct 12, 2022 at 8:49 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Wed, 12 Oct 2022 20:34:00 -0700 Wei Wang wrote:
> > > > I pushed this little nugget to one affected machine via KLP:
> > > >
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index 03ffbb255e60..c1ca369a1b77 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -7121,6 +7121,10 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
> > > >                 return true;
> > > >         }
> > > >
> > > > +       if (gfp_mask == GFP_NOWAIT) {
> > > > +               try_charge(memcg, gfp_mask|__GFP_NOFAIL, nr_pages);
> > > > +               refill_stock(memcg, nr_pages);
> > > > +       }
> > > >         return false;
> > > >  }
> > > >
> > > AFAICT, if you force charge by passing __GFP_NOFAIL to try_charge(),
> > > you should return true to tell the caller that the nr_pages is
> > > actually being charged.
> >
> > Ack - not sure what the best thing to do is, tho. Always pass NOFAIL
> > in softirq?
> >
> > It's not clear to me yet why doing the charge/uncharge actually helps,
> > perhaps try_to_free_mem_cgroup_pages() does more when NOFAIL is passed?
> >
> I am curious to know as well.
> 
> > I'll do more digging tomorrow.
> >
> > > Although I am not very sure what refill_stock() does. Does that
> > > "uncharge" those pages?
> >
> > I think so, I copied it from mem_cgroup_uncharge_skmem().

I think I understand why this issue started happening after this patch.
The memcg charging happens in batches of 32 (64 nowadays) pages even
if the charge request is for less. The remaining pre-charge is cached
in the per-cpu cache (or stock).

With (GFP_NOWAIT | __GFP_NOFAIL), you let the memcg go over the limit
without triggering oom-kill, and then refill_stock just puts the
pre-charge in the per-cpu cache. So, the later allocations/charges
succeed from the per-cpu cache even though the memcg is over the limit.

With this patch we no longer force charge and then uncharge on failure,
so the later allocations/charges fail similarly.

Regarding the right thing to do: IMHO, it is to use GFP_ATOMIC instead
of GFP_NOWAIT. As you can see from the following comment in
try_charge_memcg(), we added this exception particularly for these
kinds of situations.

...
	/*
	 * Memcg doesn't have a dedicated reserve for atomic
	 * allocations. But like the global atomic pool, we need to
	 * put the burden of reclaim on regular allocation requests
	 * and let these go through as privileged allocations.
	 */
	if (!(gfp_mask & (__GFP_NOFAIL | __GFP_HIGH)))
		return -ENOMEM;
...

Shakeel
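
A minimal sketch of that suggestion (hypothetical, not a posted patch) would
be to make the softirq case of gfp_memcg_charge() use GFP_ATOMIC, which
carries __GFP_HIGH and therefore hits the exception quoted above:

	/* Hypothetical change: let softirq-path socket charges use the
	 * __GFP_HIGH exception in try_charge_memcg() instead of failing
	 * outright when the memcg is at its limit.
	 */
	static inline gfp_t gfp_memcg_charge(void)
	{
		return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
	}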
Jakub Kicinski Oct. 13, 2022, 9:49 p.m. UTC | #12
On Wed, 12 Oct 2022 16:33:00 -0700 Jakub Kicinski wrote:
> This patch is causing a little bit of pain to us, to workloads running
> with just memory.max set. After this change the TCP rx path no longer
> forces the charging.
> 
> Any recommendation for the fix? Setting memory.high a few MB under
> memory.max seems to remove the failures.

Eric, is there anything that would make TCP perform particularly poorly
under memory pressure?

Dropping and pruning happen a lot here:

# nstat -a | grep -i -E 'Prune|Drop'
TcpExtPruneCalled               1202577            0.0
TcpExtOfoPruned                 734606             0.0
TcpExtTCPOFODrop                64191              0.0
TcpExtTCPRcvQDrop               384305             0.0

Same workload on 5.6 kernel:

TcpExtPruneCalled               1223043            0.0
TcpExtOfoPruned                 3377               0.0
TcpExtListenDrops               10596              0.0
TcpExtTCPOFODrop                22                 0.0
TcpExtTCPRcvQDrop               734                0.0

From a quick look at the code and with what Shakeel explained in mind -
previously we would have "loaded up the cache" after the first failed
try, so we never got into the loop inside tcp_try_rmem_schedule() which
most likely nukes the entire OFO queue:

static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
				 unsigned int size)
{
	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
	    !sk_rmem_schedule(sk, skb, size)) {
	    /* ^ would fail but "load up the cache" ^ */

		if (tcp_prune_queue(sk) < 0)
			return -1;

		/* v this one would not fail due to the cache v */
		while (!sk_rmem_schedule(sk, skb, size)) {
			if (!tcp_prune_ofo_queue(sk))
				return -1;

Neil mentioned that he's seen multi-second stalls when SACKed segments
get dropped from the OFO queue. The sender waits for a very long time
before retrying something that was already SACKed if the receiver keeps
SACKing new, later segments, even when the ACK reaches the
previously-SACKed block, which should prove to the sender that something
is very wrong.

I tried to repro this with a packetdrill script and it's not exactly
what I see; I need to keep shortening the RTT, otherwise the retx comes
out before the next SACK arrives.

I'll try to read the code, and maybe I'll get lucky and manage to
capture the exact impacted flows :S But does anything of this nature
ring a bell?

`../common/defaults.sh`

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 8>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
  +.1 < . 1:1(0) ack 1 win 2048
   +0 accept(3, ..., ...) = 4

   +0 write(4, ..., 60000) = 60000
   +0 > P. 1:10001(10000) ack 1

// Do some SACK-ing
  +.1 < . 1:1(0) ack 1 win 513 <sack 1001:2001,nop,nop>
+.001 < . 1:1(0) ack 1 win 513 <sack 1001:2001 3001:4001 5001:6001,nop,nop>
// ..and we pretend we lost 1001:2001
+.001 < . 1:1(0) ack 1 win 513 <sack 2001:10001,nop,nop>

// re-xmit holes and send more
   +0 > . 10001:11001(1000) ack 1
   +0 > . 1:1001(1000) ack 1
   +0 > . 2001:3001(1000) ack 1 win 256
   +0 > P. 11001:13001(2000) ack 1 win 256
   +0 > P. 13001:15001(2000) ack 1 win 256

  +.1 < . 1:1(0) ack 1001 win 513 <sack 2001:15001,nop,nop>

   +0 > P. 15001:18001(3000) ack 1 win 256
   +0 > P. 18001:20001(2000) ack 1 win 256
   +0 > P. 20001:22001(2000) ack 1 win 256

  +.1 < . 1:1(0) ack 1001 win 513 <sack 2001:22001,nop,nop>

   +0 > P. 22001:24001(2000) ack 1 win 256
   +0 > P. 24001:26001(2000) ack 1 win 256
   +0 > P. 26001:28001(2000) ack 1 win 256
   +0 > .  28001:29001(1000) ack 1 win 256

+0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:29001,nop,nop>

   +0 > P. 29001:31001(2000) ack 1 win 256
   +0 > P. 31001:33001(2000) ack 1 win 256
   +0 > P. 33001:35001(2000) ack 1 win 256
   +0 > . 35001:36001(1000) ack 1 win 256

+0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:36001,nop,nop>

   +0 > P. 36001:38001(2000) ack 1 win 256
   +0 > P. 38001:40001(2000) ack 1 win 256
   +0 > P. 40001:42001(2000) ack 1 win 256
   +0 > .  42001:43001(1000) ack 1 win 256

+0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:43001,nop,nop>

   +0 > P. 43001:45001(2000) ack 1 win 256
   +0 > P. 45001:47001(2000) ack 1 win 256
   +0 > P. 47001:49001(2000) ack 1 win 256
   +0 > .  49001:50001(1000) ack 1 win 256

+0.04 < . 1:1(0) ack 1001 win 257 <sack 2001:50001,nop,nop>

   +0 > P. 50001:52001(2000) ack 1 win 256
   +0 > P. 52001:54001(2000) ack 1 win 256
   +0 > P. 54001:56001(2000) ack 1 win 256
   +0 > .  56001:57001(1000) ack 1 win 256

+0.04 > . 1001:2001(1000) ack 1 win 256


  +.1 < . 1:1(0) ack 1001 win 257 <sack 2001:29001,nop,nop>
Eric Dumazet Oct. 13, 2022, 10:02 p.m. UTC | #13
On Thu, Oct 13, 2022 at 2:49 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 12 Oct 2022 16:33:00 -0700 Jakub Kicinski wrote:
> > This patch is causing a little bit of pain to us, to workloads running
> > with just memory.max set. After this change the TCP rx path no longer
> > forces the charging.
> >
> > Any recommendation for the fix? Setting memory.high a few MB under
> > memory.max seems to remove the failures.
>
> Eric, is there anything that would make the TCP perform particularly
> poorly under mem pressure?
>
> Dropping and pruning happens a lot here:
>
> # nstat -a | grep -i -E 'Prune|Drop'
> TcpExtPruneCalled               1202577            0.0
> TcpExtOfoPruned                 734606             0.0
> TcpExtTCPOFODrop                64191              0.0
> TcpExtTCPRcvQDrop               384305             0.0
>
> Same workload on 5.6 kernel:
>
> TcpExtPruneCalled               1223043            0.0
> TcpExtOfoPruned                 3377               0.0
> TcpExtListenDrops               10596              0.0
> TcpExtTCPOFODrop                22                 0.0
> TcpExtTCPRcvQDrop               734                0.0
>
> From a quick look at the code and with what Shakeel explained in mind -
> previously we would have "loaded up the cache" after the first failed
> try, so we never got into the loop inside tcp_try_rmem_schedule() which
> most likely nukes the entire OFO queue:
>
> static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
>                                  unsigned int size)
> {
>         if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
>             !sk_rmem_schedule(sk, skb, size)) {
>             /* ^ would fail but "load up the cache" ^ */
>
>                 if (tcp_prune_queue(sk) < 0)
>                         return -1;
>
>                 /* v this one would not fail due to the cache v */
>                 while (!sk_rmem_schedule(sk, skb, size)) {
>                         if (!tcp_prune_ofo_queue(sk))
>                                 return -1;
>
> Neil mentioned that he's seen multi-second stalls when SACKed segments
> get dropped from the OFO queue. Sender waits for a very long time before
> retrying something that was already SACKed if the receiver keeps
> sacking new, later segments. Even when ACK reaches the previously-SACKed
> block which should prove to the sender that something is very wrong.
>
> I tried to repro this with a packet drill and it's not what I see
> exactly, I need to keep shortening the RTT otherwise the retx comes
> out before the next SACK arrives.
>
> I'll try to read the code, and maybe I'll get lucky and manage capture
> the exact impacted flows :S But does anything of this nature ring the
> bell?
>
> `../common/defaults.sh`
>
>     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>    +0 bind(3, ..., ...) = 0
>    +0 listen(3, 1) = 0
>
>    +0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 8>
>    +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
>   +.1 < . 1:1(0) ack 1 win 2048
>    +0 accept(3, ..., ...) = 4
>
>    +0 write(4, ..., 60000) = 60000
>    +0 > P. 1:10001(10000) ack 1
>
> // Do some SACK-ing
>   +.1 < . 1:1(0) ack 1 win 513 <sack 1001:2001,nop,nop>
> +.001 < . 1:1(0) ack 1 win 513 <sack 1001:2001 3001:4001 5001:6001,nop,nop>
> // ..and we pretend we lost 1001:2001
> +.001 < . 1:1(0) ack 1 win 513 <sack 2001:10001,nop,nop>
>
> // re-xmit holes and send more
>    +0 > . 10001:11001(1000) ack 1
>    +0 > . 1:1001(1000) ack 1
>    +0 > . 2001:3001(1000) ack 1 win 256
>    +0 > P. 11001:13001(2000) ack 1 win 256
>    +0 > P. 13001:15001(2000) ack 1 win 256
>
>   +.1 < . 1:1(0) ack 1001 win 513 <sack 2001:15001,nop,nop>
>
>    +0 > P. 15001:18001(3000) ack 1 win 256
>    +0 > P. 18001:20001(2000) ack 1 win 256
>    +0 > P. 20001:22001(2000) ack 1 win 256
>
>   +.1 < . 1:1(0) ack 1001 win 513 <sack 2001:22001,nop,nop>
>
>    +0 > P. 22001:24001(2000) ack 1 win 256
>    +0 > P. 24001:26001(2000) ack 1 win 256
>    +0 > P. 26001:28001(2000) ack 1 win 256
>    +0 > .  28001:29001(1000) ack 1 win 256
>
> +0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:29001,nop,nop>
>
>    +0 > P. 29001:31001(2000) ack 1 win 256
>    +0 > P. 31001:33001(2000) ack 1 win 256
>    +0 > P. 33001:35001(2000) ack 1 win 256
>    +0 > . 35001:36001(1000) ack 1 win 256
>
> +0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:36001,nop,nop>
>
>    +0 > P. 36001:38001(2000) ack 1 win 256
>    +0 > P. 38001:40001(2000) ack 1 win 256
>    +0 > P. 40001:42001(2000) ack 1 win 256
>    +0 > .  42001:43001(1000) ack 1 win 256
>
> +0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:43001,nop,nop>
>
>    +0 > P. 43001:45001(2000) ack 1 win 256
>    +0 > P. 45001:47001(2000) ack 1 win 256
>    +0 > P. 47001:49001(2000) ack 1 win 256
>    +0 > .  49001:50001(1000) ack 1 win 256
>
> +0.04 < . 1:1(0) ack 1001 win 257 <sack 2001:50001,nop,nop>
>
>    +0 > P. 50001:52001(2000) ack 1 win 256
>    +0 > P. 52001:54001(2000) ack 1 win 256
>    +0 > P. 54001:56001(2000) ack 1 win 256
>    +0 > .  56001:57001(1000) ack 1 win 256
>
> +0.04 > . 1001:2001(1000) ack 1 win 256
>
>

This is SACK reneging. I would have to double-check Linux behavior, but
reverting to a timeout could very well happen.

>   +.1 < . 1:1(0) ack 1001 win 257 <sack 2001:29001,nop,nop>
>

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bfe5c486f4ad..f0ee30881ca9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1581,7 +1581,8 @@  static inline void mem_cgroup_flush_foreign(struct bdi_writeback *wb)
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 struct sock;
-bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
+bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
+			     gfp_t gfp_mask);
 void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
 #ifdef CONFIG_MEMCG
 extern struct static_key_false memcg_sockets_enabled_key;
diff --git a/include/net/sock.h b/include/net/sock.h
index 6e761451c927..95b25777b53e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2400,6 +2400,11 @@  static inline gfp_t gfp_any(void)
 	return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
 }
 
+static inline gfp_t gfp_memcg_charge(void)
+{
+	return in_softirq() ? GFP_NOWAIT : GFP_KERNEL;
+}
+
 static inline long sock_rcvtimeo(const struct sock *sk, bool noblock)
 {
 	return noblock ? 0 : sk->sk_rcvtimeo;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8ef06f9e0db1..be585ceaba98 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7048,14 +7048,14 @@  void mem_cgroup_sk_free(struct sock *sk)
  * mem_cgroup_charge_skmem - charge socket memory
  * @memcg: memcg to charge
  * @nr_pages: number of pages to charge
+ * @gfp_mask: reclaim mode
  *
  * Charges @nr_pages to @memcg. Returns %true if the charge fit within
- * @memcg's configured limit, %false if the charge had to be forced.
+ * @memcg's configured limit, %false if it doesn't.
  */
-bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
+bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
+			     gfp_t gfp_mask)
 {
-	gfp_t gfp_mask = GFP_KERNEL;
-
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
 		struct page_counter *fail;
 
@@ -7063,21 +7063,19 @@  bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 			memcg->tcpmem_pressure = 0;
 			return true;
 		}
-		page_counter_charge(&memcg->tcpmem, nr_pages);
 		memcg->tcpmem_pressure = 1;
+		if (gfp_mask & __GFP_NOFAIL) {
+			page_counter_charge(&memcg->tcpmem, nr_pages);
+			return true;
+		}
 		return false;
 	}
 
-	/* Don't block in the packet receive path */
-	if (in_softirq())
-		gfp_mask = GFP_NOWAIT;
-
-	mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
-
-	if (try_charge(memcg, gfp_mask, nr_pages) == 0)
+	if (try_charge(memcg, gfp_mask, nr_pages) == 0) {
+		mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
 		return true;
+	}
 
-	try_charge(memcg, gfp_mask|__GFP_NOFAIL, nr_pages);
 	return false;
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index aada649e07e8..950f1e70dbf5 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2728,10 +2728,12 @@  int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 {
 	struct proto *prot = sk->sk_prot;
 	long allocated = sk_memory_allocated_add(sk, amt);
+	bool memcg_charge = mem_cgroup_sockets_enabled && sk->sk_memcg;
 	bool charged = true;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
-	    !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt)))
+	if (memcg_charge &&
+	    !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
+						gfp_memcg_charge())))
 		goto suppress_allocation;
 
 	/* Under limit. */
@@ -2785,8 +2787,14 @@  int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 		/* Fail only if socket is _under_ its sndbuf.
 		 * In this case we cannot block, so that we have to fail.
 		 */
-		if (sk->sk_wmem_queued + size >= sk->sk_sndbuf)
+		if (sk->sk_wmem_queued + size >= sk->sk_sndbuf) {
+			/* Force charge with __GFP_NOFAIL */
+			if (memcg_charge && !charged) {
+				mem_cgroup_charge_skmem(sk->sk_memcg, amt,
+					gfp_memcg_charge() | __GFP_NOFAIL);
+			}
 			return 1;
+		}
 	}
 
 	if (kind == SK_MEM_SEND || (kind == SK_MEM_RECV && charged))
@@ -2794,7 +2802,7 @@  int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 
 	sk_memory_allocated_sub(sk, amt);
 
-	if (mem_cgroup_sockets_enabled && sk->sk_memcg)
+	if (memcg_charge && charged)
 		mem_cgroup_uncharge_skmem(sk->sk_memcg, amt);
 
 	return 0;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 754013fa393b..f25d02ad4a8a 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -534,7 +534,8 @@  struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
 				   atomic_read(&newsk->sk_rmem_alloc));
 		mem_cgroup_sk_alloc(newsk);
 		if (newsk->sk_memcg && amt)
-			mem_cgroup_charge_skmem(newsk->sk_memcg, amt);
+			mem_cgroup_charge_skmem(newsk->sk_memcg, amt,
+						GFP_KERNEL | __GFP_NOFAIL);
 
 		release_sock(newsk);
 	}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 29553fce8502..6d72f3ea48c4 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3373,7 +3373,8 @@  void sk_forced_mem_schedule(struct sock *sk, int size)
 	sk_memory_allocated_add(sk, amt);
 
 	if (mem_cgroup_sockets_enabled && sk->sk_memcg)
-		mem_cgroup_charge_skmem(sk->sk_memcg, amt);
+		mem_cgroup_charge_skmem(sk->sk_memcg, amt,
+					gfp_memcg_charge() | __GFP_NOFAIL);
 }
 
 /* Send a FIN. The caller locks the socket for us.