From patchwork Thu Nov 12 22:15:09 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roman Gushchin X-Patchwork-Id: 11901875 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A876616C0 for ; Thu, 12 Nov 2020 22:16:07 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 32B0722201 for ; Thu, 12 Nov 2020 22:16:06 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=fb.com header.i=@fb.com header.b="cerfqOGy" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 32B0722201 Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=fb.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 0C9616B005C; Thu, 12 Nov 2020 17:16:06 -0500 (EST) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 07B6D6B005D; Thu, 12 Nov 2020 17:16:06 -0500 (EST) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id EAA2A6B0068; Thu, 12 Nov 2020 17:16:05 -0500 (EST) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0185.hostedemail.com [216.40.44.185]) by kanga.kvack.org (Postfix) with ESMTP id B387D6B005C for ; Thu, 12 Nov 2020 17:16:05 -0500 (EST) Received: from smtpin04.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id 5961F3623 for ; Thu, 12 Nov 2020 22:16:05 +0000 (UTC) X-FDA: 77477175090.04.jeans36_0406d3f27309 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin04.hostedemail.com (Postfix) with ESMTP id 3AEA18003400 for ; Thu, 12 Nov 2020 22:16:05 +0000 (UTC) X-Spam-Summary: 1,0,0,,d41d8cd98f00b204,prvs=85855ab1a6=guro@fb.com,,RULES_HIT:30001:30005:30012:30036:30054:30056:30064:30075,0,RBL:67.231.145.42:@fb.com:.lbl8.mailshell.net-64.10.201.10 62.18.0.100;04ygnzspywe84dme6cq7joa91dffyyp61fadngzpzpr865r8ze5pxpqixgpcgqz.tcp3fhfrniyn394zbqfz3n53dythsinhanrnoupmostm9za3ziamr987n8g78y8.k-lbl8.mailshell.net-223.238.255.100,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fp,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:70,LUA_SUMMARY:none X-HE-Tag: jeans36_0406d3f27309 X-Filterd-Recvd-Size: 12553 Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com [67.231.145.42]) by imf09.hostedemail.com (Postfix) with ESMTP for ; Thu, 12 Nov 2020 22:16:04 +0000 (UTC) Received: from pps.filterd (m0044010.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id 0ACMBJiM013518 for ; Thu, 12 Nov 2020 14:16:03 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : mime-version : content-transfer-encoding : content-type; s=facebook; bh=fGpL7GC95liGhV6v9nkNzQUubkLvGqael5gOPZw7RA0=; b=cerfqOGy3icrvdCESWpDNBTdChI42KyyQbq5YnG5+wUXRUh7X8BZUoNv83df4J8z+/iC QQc9nn+bNh7eoxlsjZz3PwRjTGlY3kKnhDYaRlMJa6AqU5vkyHmEitp6CvvF39GF2RKv wOqUtbBOrJ0KboLwFeSs/on9T4J9zEXV4L0= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com with ESMTP id 34s7dstj6v-4 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Thu, 12 Nov 2020 14:16:03 -0800 Received: from intmgw002.06.prn3.facebook.com (2620:10d:c085:208::11) by mail.thefacebook.com (2620:10d:c085:21d::7) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.1979.3; Thu, 12 Nov 2020 14:16:01 -0800 Received: by devvm3388.prn0.facebook.com (Postfix, from userid 111017) id B7808A7D1C8; Thu, 12 Nov 2020 14:16:00 -0800 (PST) From: Roman Gushchin To: CC: Alexei Starovoitov , Daniel Borkmann , , Andrii Nakryiko , Shakeel Butt , , , , Roman Gushchin Subject: [PATCH bpf-next v5 00/34] bpf: switch to memcg-based memory accounting Date: Thu, 12 Nov 2020 14:15:09 -0800 Message-ID: <20201112221543.3621014-1-guro@fb.com> X-Mailer: git-send-email 2.24.1 MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.312,18.0.737 definitions=2020-11-12_13:2020-11-12,2020-11-12 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 adultscore=0 malwarescore=0 priorityscore=1501 lowpriorityscore=0 mlxlogscore=999 bulkscore=0 phishscore=0 suspectscore=38 spamscore=0 clxscore=1015 mlxscore=0 impostorscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2009150000 definitions=main-2011120126 X-FB-Internal: deliver X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Currently bpf is using the memlock rlimit for the memory accounting. This approach has its downsides and over time has created a significant amount of problems: 1) The limit is per-user, but because most bpf operations are performed as root, the limit has a little value. 2) It's hard to come up with a specific maximum value. Especially because the counter is shared with non-bpf users (e.g. memlock() users). Any specific value is either too low and creates false failures or too high and useless. 3) Charging is not connected to the actual memory allocation. Bpf code should manually calculate the estimated cost and precharge the counter, and then take care of uncharging, including all fail paths. It adds to the code complexity and makes it easy to leak a charge. 4) There is no simple way of getting the current value of the counter. We've used drgn for it, but it's far from being convenient. 5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had a function to "explain" this case for users. In order to overcome these problems let's switch to the memcg-based memory accounting of bpf objects. With the recent addition of the percpu memory accounting, now it's possible to provide a comprehensive accounting of the memory used by bpf programs and maps. This approach has the following advantages: 1) The limit is per-cgroup and hierarchical. It's way more flexible and allows a better control over memory usage by different workloads. Of course, it requires enabled cgroups and kernel memory accounting and properly configured cgroup tree, but it's a default configuration for a modern Linux system. 2) The actual memory consumption is taken into account. It happens automatically on the allocation time if __GFP_ACCOUNT flags is passed. Uncharging is also performed automatically on releasing the memory. So the code on the bpf side becomes simpler and safer. 3) There is a simple way to get the current value and statistics. In general, if a process performs a bpf operation (e.g. creates or updates a map), it's memory cgroup is charged. However map updates performed from an interrupt context are charged to the memory cgroup which contained the process, which created the map. Providing a 1:1 replacement for the rlimit-based memory accounting is a non-goal of this patchset. Users and memory cgroups are completely orthogonal, so it's not possible even in theory. Memcg-based memory accounting requires a properly configured cgroup tree to be actually useful. However, it's the way how the memory is managed on a modern Linux system. The patchset consists of the following parts: 1) 4 mm patches, which are already in the mm tree, but are required to avoid a regression (otherwise vmallocs cannot be mapped to userspace). 2) memcg-based accounting for various bpf objects: progs and maps 3) removal of the rlimit-based accounting 4) removal of rlimit adjustments in userspace samples First 4 patches are not supposed to be merged via the bpf tree. I'm including them to make sure bpf tests will pass. v5: - rebased to the latest version of the remote charging API - implemented kmem accounting from an interrupt context, by Shakeel - rebased to latest changes in mm allowed to map vmallocs to userspace - fixed a build issue in kselftests, by Alexei - fixed a use-after-free bug in bpf_map_free_deferred() - added bpf line info coverage, by Shakeel - split bpf map charging preparations into a separate patch v4: - covered allocations made from an interrupt context, by Daniel - added some clarifications to the cover letter v3: - droped the userspace part for further discussions/refinements, by Andrii and Song v2: - fixed build issue, caused by the remaining rlimit-based accounting for sockhash maps Roman Gushchin (34): mm: memcontrol: use helpers to read page's memcg data mm: memcontrol/slab: use helpers to access slab page's memcg_data mm: introduce page memcg flags mm: convert page kmemcg type to a page memcg flag bpf: memcg-based memory accounting for bpf progs bpf: prepare for memcg-based memory accounting for bpf maps bpf: memcg-based memory accounting for bpf maps bpf: refine memcg-based memory accounting for arraymap maps bpf: refine memcg-based memory accounting for cpumap maps bpf: memcg-based memory accounting for cgroup storage maps bpf: refine memcg-based memory accounting for devmap maps bpf: refine memcg-based memory accounting for hashtab maps bpf: memcg-based memory accounting for lpm_trie maps bpf: memcg-based memory accounting for bpf ringbuffer bpf: memcg-based memory accounting for bpf local storage maps bpf: refine memcg-based memory accounting for sockmap and sockhash maps bpf: refine memcg-based memory accounting for xskmap maps bpf: eliminate rlimit-based memory accounting for arraymap maps bpf: eliminate rlimit-based memory accounting for bpf_struct_ops maps bpf: eliminate rlimit-based memory accounting for cpumap maps bpf: eliminate rlimit-based memory accounting for cgroup storage maps bpf: eliminate rlimit-based memory accounting for devmap maps bpf: eliminate rlimit-based memory accounting for hashtab maps bpf: eliminate rlimit-based memory accounting for lpm_trie maps bpf: eliminate rlimit-based memory accounting for queue_stack_maps maps bpf: eliminate rlimit-based memory accounting for reuseport_array maps bpf: eliminate rlimit-based memory accounting for bpf ringbuffer bpf: eliminate rlimit-based memory accounting for sockmap and sockhash maps bpf: eliminate rlimit-based memory accounting for stackmap maps bpf: eliminate rlimit-based memory accounting for xskmap maps bpf: eliminate rlimit-based memory accounting for bpf local storage maps bpf: eliminate rlimit-based memory accounting infra for bpf maps bpf: eliminate rlimit-based memory accounting for bpf progs bpf: samples: do not touch RLIMIT_MEMLOCK fs/buffer.c | 2 +- fs/iomap/buffered-io.c | 2 +- include/linux/bpf.h | 27 +-- include/linux/memcontrol.h | 215 +++++++++++++++++- include/linux/mm.h | 22 -- include/linux/mm_types.h | 5 +- include/linux/page-flags.h | 11 +- include/trace/events/writeback.h | 2 +- kernel/bpf/arraymap.c | 30 +-- kernel/bpf/bpf_local_storage.c | 18 +- kernel/bpf/bpf_struct_ops.c | 19 +- kernel/bpf/core.c | 22 +- kernel/bpf/cpumap.c | 20 +- kernel/bpf/devmap.c | 23 +- kernel/bpf/hashtab.c | 33 +-- kernel/bpf/helpers.c | 37 ++- kernel/bpf/local_storage.c | 38 +--- kernel/bpf/lpm_trie.c | 17 +- kernel/bpf/queue_stack_maps.c | 16 +- kernel/bpf/reuseport_array.c | 12 +- kernel/bpf/ringbuf.c | 33 +-- kernel/bpf/stackmap.c | 16 +- kernel/bpf/syscall.c | 177 ++++---------- kernel/fork.c | 7 +- mm/debug.c | 4 +- mm/huge_memory.c | 4 +- mm/memcontrol.c | 139 +++++------ mm/page_alloc.c | 8 +- mm/page_io.c | 6 +- mm/slab.h | 38 +--- mm/workingset.c | 2 +- net/core/bpf_sk_storage.c | 2 +- net/core/sock_map.c | 40 +--- net/xdp/xskmap.c | 15 +- samples/bpf/map_perf_test_user.c | 6 - samples/bpf/offwaketime_user.c | 6 - samples/bpf/sockex2_user.c | 2 - samples/bpf/sockex3_user.c | 2 - samples/bpf/spintest_user.c | 6 - samples/bpf/syscall_tp_user.c | 2 - samples/bpf/task_fd_query_user.c | 5 - samples/bpf/test_lru_dist.c | 3 - samples/bpf/test_map_in_map_user.c | 6 - samples/bpf/test_overhead_user.c | 2 - samples/bpf/trace_event_user.c | 2 - samples/bpf/tracex2_user.c | 6 - samples/bpf/tracex3_user.c | 6 - samples/bpf/tracex4_user.c | 6 - samples/bpf/tracex5_user.c | 3 - samples/bpf/tracex6_user.c | 3 - samples/bpf/xdp1_user.c | 6 - samples/bpf/xdp_adjust_tail_user.c | 6 - samples/bpf/xdp_monitor_user.c | 5 - samples/bpf/xdp_redirect_cpu_user.c | 6 - samples/bpf/xdp_redirect_map_user.c | 6 - samples/bpf/xdp_redirect_user.c | 6 - samples/bpf/xdp_router_ipv4_user.c | 6 - samples/bpf/xdp_rxq_info_user.c | 6 - samples/bpf/xdp_sample_pkts_user.c | 6 - samples/bpf/xdp_tx_iptunnel_user.c | 6 - samples/bpf/xdpsock_user.c | 7 - .../selftests/bpf/progs/bpf_iter_bpf_map.c | 2 +- .../selftests/bpf/progs/map_ptr_kern.c | 7 - 63 files changed, 460 insertions(+), 743 deletions(-)