
[RFC,net-next,0/3] sock: Be aware of memcg pressure on alloc

Message ID: 20230901062141.51972-1-wuyun.abel@bytedance.com
Series: sock: Be aware of memcg pressure on alloc

Message

Abel Wu Sept. 1, 2023, 6:21 a.m. UTC
As a cloud service provider, we encountered a problem in our production
environment during the transition from cgroup v1 to v2 (partly due to the
heavy cost of accounting socket memory in v1). Say a workload behaves
fine in cgroup v1 with its memcg limit configured to 10GB of memory plus
another 1GB of tcpmem, but suffers (or is even OOM-killed) in v2 with an
11GB memory limit, due to bursts of socket memory usage, since cgroup v2
has no separate limit for socket memory and relies largely on the
workloads doing traffic control themselves.

It's rational for the workloads to implement some traffic control to
better utilize the resources they bought, but from the kernel's point of
view it's also reasonable to suppress the allocation of socket memory
once there is a shortage of free memory, given that performance
degradation is usually better than failure.

This patchset aims to be more conservative on alloc for pressure-aware
sockets under global and/or memcg pressure, to avoid further memstall or
possible OOM in such cases. The patchset includes:

  1/3: simple code cleanup, no functional change intended.
  2/3: record memcg pressure level to enable fine-grained control.
  3/3: throttle alloc for pressure-aware sockets under pressure.

The whole patchset focuses on pressure-aware protocols, and should have
little or no impact on pressure-unaware protocols like UDP.
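
For a rough idea of the intended behavior, here is a minimal illustrative
sketch (not the actual diff; the helper name is made up, and the real
changes are folded into __sk_mem_raise_allocated()):

#include <net/sock.h>

/*
 * Illustrative sketch only. Pressure-aware protocols (those that set
 * ->memory_pressure, e.g. TCP) back off once either their global
 * accounting or the owning memcg reports pressure; pressure-unaware
 * protocols such as UDP are left untouched.
 */
static bool sk_alloc_should_back_off(const struct sock *sk)
{
	/* Not pressure-aware (e.g. UDP): never throttle here. */
	if (!sk_has_memory_pressure(sk))
		return false;

	/*
	 * sk_under_memory_pressure() covers both the per-protocol global
	 * state and, when memcg socket accounting is enabled, the
	 * socket's memcg via mem_cgroup_under_socket_pressure().
	 */
	return sk_under_memory_pressure(sk);
}

Only protocols that opt in to memory-pressure accounting can be
throttled, which is why UDP is expected to be unaffected.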

Tested on an Intel Xeon(R) Platinum 8260, a dual-socket machine with 2
NUMA nodes, each of which has 24C/48T. All benchmarks were run inside a
separate memcg on a clean host.

  baseline:	net-next c639a708a0b8
  compare:	baseline + patchset

case            	load    	baseline(std%)	compare%( std%)
tbench-loopback        	thread-24	 1.00 (  0.50)	 -0.98 (  0.87)
tbench-loopback        	thread-48	 1.00 (  0.76)	 -0.29 (  0.92)
tbench-loopback        	thread-72	 1.00 (  0.75)	 +1.51 (  0.14)
tbench-loopback        	thread-96	 1.00 (  4.11)	 +1.29 (  3.73)
tbench-loopback        	thread-192	 1.00 (  3.52)	 +1.44 (  3.30)
TCP_RR          	thread-24	 1.00 (  1.87)	 -0.87 (  2.40)
TCP_RR          	thread-48	 1.00 (  0.92)	 -0.22 (  1.61)
TCP_RR          	thread-72	 1.00 (  2.35)	 +2.42 (  2.27)
TCP_RR          	thread-96	 1.00 (  2.66)	 -1.37 (  3.02)
TCP_RR          	thread-192	 1.00 ( 13.25)	 +0.29 ( 11.80)
TCP_STREAM      	thread-24	 1.00 (  1.26)	 -0.75 (  0.87)
TCP_STREAM      	thread-48	 1.00 (  0.29)	 -1.55 (  0.14)
TCP_STREAM      	thread-72	 1.00 (  0.05)	 -1.59 (  0.05)
TCP_STREAM      	thread-96	 1.00 (  0.19)	 -0.06 (  0.29)
TCP_STREAM      	thread-192	 1.00 (  0.23)	 -0.01 (  0.28)
UDP_RR          	thread-24	 1.00 (  2.27)	 +0.33 (  2.82)
UDP_RR          	thread-48	 1.00 (  1.25)	 -0.30 (  1.21)
UDP_RR          	thread-72	 1.00 (  2.54)	 +2.99 (  2.34)
UDP_RR          	thread-96	 1.00 (  4.76)	 +2.49 (  2.19)
UDP_RR          	thread-192	 1.00 ( 14.43)	 -0.02 ( 12.98)
UDP_STREAM      	thread-24	 1.00 (107.41)	 -0.48 (106.93)
UDP_STREAM      	thread-48	 1.00 (100.85)	 +1.38 (100.59)
UDP_STREAM      	thread-72	 1.00 (103.43)	 +1.40 (103.48)
UDP_STREAM      	thread-96	 1.00 ( 99.91)	 -0.25 (100.06)
UDP_STREAM      	thread-192	 1.00 (109.83)	 -3.67 (104.12)

Since patch 3 moves the traversal of the cgroup hierarchy earlier for
pressure-aware protocols, which could turn a conditional overhead into a
constant one (the walk in question is sketched after the table below),
tests running inside 5-level-deep cgroups were also performed.

case            	load    	baseline(std%)	compare%( std%)
tbench-loopback        	thread-24	 1.00 (  0.59)	 +0.68 (  0.09)
tbench-loopback        	thread-48	 1.00 (  0.16)	 +0.01 (  0.26)
tbench-loopback        	thread-72	 1.00 (  0.34)	 -0.67 (  0.48)
tbench-loopback        	thread-96	 1.00 (  4.40)	 -3.27 (  4.84)
tbench-loopback        	thread-192	 1.00 (  0.49)	 -1.07 (  1.18)
TCP_RR          	thread-24	 1.00 (  2.40)	 -0.34 (  2.49)
TCP_RR          	thread-48	 1.00 (  1.62)	 -0.48 (  1.35)
TCP_RR          	thread-72	 1.00 (  1.26)	 +0.46 (  0.95)
TCP_RR          	thread-96	 1.00 (  2.98)	 +0.13 (  2.64)
TCP_RR          	thread-192	 1.00 ( 13.75)	 -0.20 ( 15.42)
TCP_STREAM      	thread-24	 1.00 (  0.21)	 +0.68 (  1.02)
TCP_STREAM      	thread-48	 1.00 (  0.20)	 -1.41 (  0.01)
TCP_STREAM      	thread-72	 1.00 (  0.09)	 -1.23 (  0.19)
TCP_STREAM      	thread-96	 1.00 (  0.01)	 +0.01 (  0.01)
TCP_STREAM      	thread-192	 1.00 (  0.20)	 -0.02 (  0.25)
UDP_RR          	thread-24	 1.00 (  2.20)	 +0.84 ( 17.45)
UDP_RR          	thread-48	 1.00 (  1.34)	 -0.73 (  1.12)
UDP_RR          	thread-72	 1.00 (  2.32)	 +0.49 (  2.11)
UDP_RR          	thread-96	 1.00 (  2.36)	 +0.53 (  2.42)
UDP_RR          	thread-192	 1.00 ( 16.34)	 -0.67 ( 14.06)
UDP_STREAM      	thread-24	 1.00 (106.55)	 -0.70 (107.13)
UDP_STREAM      	thread-48	 1.00 (105.11)	 +1.60 (103.48)
UDP_STREAM      	thread-72	 1.00 (100.60)	 +1.98 (101.13)
UDP_STREAM      	thread-96	 1.00 ( 99.91)	 +2.59 (101.04)
UDP_STREAM      	thread-192	 1.00 (135.39)	 -2.51 (108.00)
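
For reference, the hierarchy walk referred to above is roughly the one
performed by mem_cgroup_under_socket_pressure(); a paraphrased sketch of
its cgroup v2 path (not part of this series, and possibly differing in
detail from current upstream) looks like:

#include <linux/memcontrol.h>
#include <linux/jiffies.h>

/* Paraphrase of the mem_cgroup_under_socket_pressure() cgroup v2 path. */
static bool memcg_hierarchy_under_socket_pressure(struct mem_cgroup *memcg)
{
	/* Climb from the socket's memcg toward the root. */
	do {
		/* socket_pressure holds a jiffies deadline armed under reclaim. */
		if (time_before(jiffies, READ_ONCE(memcg->socket_pressure)))
			return true;
	} while ((memcg = parent_mem_cgroup(memcg)));

	return false;
}

Performing this walk unconditionally on the allocation path is what makes
the depth of the cgroup hierarchy matter, hence the 5-level-deep runs.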

As expected, no obvious performance gain or loss was observed. As for the
issue we encountered, this patchset provides better worst-case behavior
in that such OOM cases are reduced to some extent, while further fine-
grained traffic control is something the workloads themselves still need
to consider.

Comments are welcome! Thanks!

Abel Wu (3):
  sock: Code cleanup on __sk_mem_raise_allocated()
  net-memcg: Record pressure level when under pressure
  sock: Throttle pressure-aware sockets under pressure

 include/linux/memcontrol.h | 39 +++++++++++++++++++++++++----
 include/net/sock.h         |  2 +-
 include/net/tcp.h          |  2 +-
 mm/vmpressure.c            |  9 ++++++-
 net/core/sock.c            | 51 +++++++++++++++++++++++++++++---------
 5 files changed, 83 insertions(+), 20 deletions(-)

Comments

Abel Wu Sept. 8, 2023, 7:55 a.m. UTC | #1
Friendly ping :)

On 9/1/23 2:21 PM, Abel Wu wrote:
> [...]
Shakeel Butt Sept. 8, 2023, 3:42 p.m. UTC | #2
On Fri, Sep 8, 2023 at 12:55 AM Abel Wu <wuyun.abel@bytedance.com> wrote:
>
> Friendly ping :)
>

Sorry for the delay, I will get to this over this weekend.
Abel Wu Sept. 10, 2023, 5:09 a.m. UTC | #3
On 9/8/23 11:42 PM, Shakeel Butt wrote:
> On Fri, Sep 8, 2023 at 12:55 AM Abel Wu <wuyun.abel@bytedance.com> wrote:
>>
>> Friendly ping :)
>>
> 
> Sorry for the delay, I will get to this over this weekend.

Hi Shakeel, I really appreciate your time! Thanks a lot!
Shakeel Butt Sept. 14, 2023, 9:20 p.m. UTC | #4
On Fri, Sep 01, 2023 at 02:21:25PM +0800, Abel Wu wrote:
> 
[...]
> As expected, no obvious performance gain or loss was observed. As for the
> issue we encountered, this patchset provides better worst-case behavior
> in that such OOM cases are reduced to some extent, while further fine-
> grained traffic control is something the workloads themselves still need
> to consider.
> 

I agree with the motivation but I don't agree with the solution (patches
2 and 3). This adds one more heuristic to the code, which you yourself
described as helping only to some extent. It also adds more dependency on
the vmpressure subsystem, which is in a weird state: vmpressure is a
cgroup v1 feature that the networking subsystem somehow relies on for
cgroup v2 deployments. In addition, vmpressure acts differently for
workloads with different memory types (mapped, mlocked, kernel memory).

Anyway, have you explored a BPF-based approach? You can induce socket
pressure at the points you care about and define memory pressure however
your use case requires, for example using PSI, vmpressure, or MEMCG_HIGH
events. What do you think?

thanks,
Shakeel
Abel Wu Sept. 15, 2023, 8:47 a.m. UTC | #5
On 9/15/23 5:20 AM, Shakeel Butt wrote:
> On Fri, Sep 01, 2023 at 02:21:25PM +0800, Abel Wu wrote:
>>
> [...]
>> As expected, no obvious performance gain or loss was observed. As for the
>> issue we encountered, this patchset provides better worst-case behavior
>> in that such OOM cases are reduced to some extent, while further fine-
>> grained traffic control is something the workloads themselves still need
>> to consider.
>>
> 
> I agree with the motivation but I don't agree with the solution (patches
> 2 and 3). This adds one more heuristic to the code, which you yourself
> described as helping only to some extent. It also adds more dependency on
> the vmpressure subsystem, which is in a weird state: vmpressure is a
> cgroup v1 feature that the networking subsystem somehow relies on for
> cgroup v2 deployments. In addition, vmpressure acts differently for
> workloads with different memory types (mapped, mlocked, kernel memory).

Indeed.

> 
> Anyway, have you explored a BPF-based approach? You can induce socket
> pressure at the points you care about and define memory pressure however
> your use case requires, for example using PSI, vmpressure, or MEMCG_HIGH
> events. What do you think?

Yeah, this sounds much better. I will re-implement this patchset based
on your suggestion. Thank you for the helpful comments!
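
For the detection side, a PSI trigger on the memcg's memory.pressure file
seems like a natural fit, roughly along the lines of the userspace sketch
below (the cgroup path and thresholds are just examples, and how the
event would then be fed back to induce socket pressure, e.g. via a BPF
hook, is the part still to be worked out):

/* Arm a PSI trigger on a cgroup and wait for memory pressure events. */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Fire when "some" memory stall exceeds 150ms within a 1s window. */
	const char trig[] = "some 150000 1000000";
	struct pollfd pfd;
	int fd;

	fd = open("/sys/fs/cgroup/app/memory.pressure", O_RDWR | O_NONBLOCK);
	if (fd < 0 || write(fd, trig, sizeof(trig)) < 0) {
		perror("psi trigger");
		return 1;
	}

	pfd.fd = fd;
	pfd.events = POLLPRI;
	while (poll(&pfd, 1, -1) > 0) {
		if (pfd.revents & POLLERR)
			break;				/* cgroup went away */
		if (pfd.revents & POLLPRI)
			puts("memory pressure event");	/* react here */
	}

	close(fd);
	return 0;
}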

Best,
	Abel