From: Yafang Shao
To: ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, kafai@fb.com,
    songliubraving@fb.com, yhs@fb.com, john.fastabend@gmail.com,
    kpsingh@kernel.org, sdf@google.com, haoluo@google.com, jolsa@kernel.org,
    hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev,
    shakeelb@google.com, songmuchun@bytedance.com, akpm@linux-foundation.org
Cc: netdev@vger.kernel.org, bpf@vger.kernel.org, linux-mm@kvack.org,
    Yafang Shao
Subject: [RFC PATCH bpf-next 00/15] bpf: Introduce selectable memcg for bpf map
Date: Fri, 29 Jul 2022 15:23:01 +0000
Message-Id: <20220729152316.58205-1-laoar.shao@gmail.com>

On our production environment, we may load, run and pin bpf programs and maps in containers.
For example, some of our networking bpf programs and maps are loaded and
pinned by a process running in a container in our k8s environment. Other
user applications also run in this container; they watch the networking
configurations from remote servers and update them on the local host,
log error events, monitor the traffic, and so on. Sometimes we need to
update these user applications to a new release, and in this update
process we destroy the old container and then start a new generation.
In order not to interrupt the bpf programs during the update, we pin the
bpf programs and maps in bpffs. That is the background and use case in
our production environment.

After switching to memcg-based bpf memory accounting to limit the bpf
memory, some unexpected issues jumped out at us.
1. The memory usage is not consistent between the first generation and
   new generations.
2. After the first generation is destroyed, the bpf memory can't be
   limited if the bpf maps are not preallocated, because they will be
   reparented.

This patchset tries to resolve these issues by introducing an
independent memcg to limit the bpf memory.

In the bpf map creation, we can assign a specific memcg instead of using
the current memcg. That makes it flexible in containerized environments.
For example, if we want to limit the pinned bpf maps, we can use the
hierarchy below,

    Shared resources              Private resources

     bpf-memcg                      k8s-memcg
     /        \                     /
bpf-bar-memcg bpf-foo-memcg   srv-foo-memcg
                  |               /        \
               (charged)     (not charged) (charged)
                  |           /              \
                  |          /                \
          bpf-foo-{progs, maps}              srv-foo

srv-foo loads and pins bpf-foo-{progs, maps}, but they are charged to an
independent memcg (bpf-foo-memcg) instead of srv-foo's memcg
(srv-foo-memcg).

Please note that there may be no process in bpf-foo-memcg, which means it
can currently be rmdir-ed by the root user. Meanwhile we don't forcefully
destroy a memcg if it doesn't have any residents. So this hierarchy is
acceptable.
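
To make the hierarchy above a bit more concrete: assuming the target
memcg is identified by a file descriptor of its cgroup directory (this
cover letter mentions cgroup_get_from_fd()), a loader such as srv-foo
would first have to open that directory. The following is only a minimal
userspace sketch of that step, using open(2). The helper name
open_memcg_dir() and the example path are hypothetical, and how the
descriptor is handed to the map creation request is defined by the
patches in this series, so it is not shown here.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of this series): open a cgroup
 * directory, for example "/sys/fs/cgroup/bpf-memcg/bpf-foo-memcg",
 * and return a file descriptor referring to it.
 *
 * O_DIRECTORY makes open(2) fail if the path is not a directory.
 * Returns -1 on failure, with errno set by open(2).
 */
static int open_memcg_dir(const char *path)
{
	return open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}
```

In the example hierarchy, srv-foo would call this on the bpf-foo-memcg
directory before creating its maps and close(2) the descriptor once the
maps exist. The path above assumes cgroup2 is mounted at /sys/fs/cgroup.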

In order to make the memcg of bpf maps selectable, this patchset
introduces some memory allocation wrappers to allocate map related
memory. These wrappers get the memcg from the map and then charge the
allocated pages or objs.

Currently only bpf maps are supported, and we can extend this to bpf
progs as well. Only cgroup2 is supported for now, but we can make an
additional change in cgroup_get_from_fd() to support cgroup1 too.

Observability can also be supported in the next step, for example by
showing the bpf map's memcg in 'bpftool map show', or even showing which
maps are charged to a specific memcg in 'bpftool cgroup show'.
Furthermore, in the future we may also show an accurate memory size of a
bpf map instead of an estimated memory size in 'bpftool map show'.

Yafang Shao (15):
  bpf: Remove unneeded memset in queue_stack_map creation
  bpf: Use bpf_map_area_free instread of kvfree
  bpf: Make __GFP_NOWARN consistent in bpf map creation
  bpf: Use bpf_map_area_alloc consistently on bpf map creation
  bpf: Introduce helpers for container of struct bpf_map
  bpf: Use bpf_map_container_alloc helpers in various bpf maps
  bpf: Define bpf_map_get_memcg for !CONFIG_MEMCG_KMEM
  bpf: Use scope-based charge for bpf_map_area_alloc
  bpf: Use bpf_map_kzalloc in arraymap
  bpf: Use bpf_map_pages_alloc in ringbuf
  bpf: Use bpf_map_kvcalloc in bpf_local_storage
  mm, memcg: Add new helper get_obj_cgroup_from_cgroup
  bpf: Add new parameter into bpf_map_container_alloc
  bpf: Add new map flag BPF_F_SELECTABLE_MEMCG
  bpf: Introduce selectable memcg for bpf map

 include/linux/bpf.h            |  19 +++-
 include/linux/memcontrol.h     |  11 ++
 include/uapi/linux/bpf.h       |   5 +
 kernel/bpf/arraymap.c          |  46 ++++----
 kernel/bpf/bloom_filter.c      |  13 ++-
 kernel/bpf/bpf_local_storage.c |  17 +--
 kernel/bpf/bpf_struct_ops.c    |  17 +--
 kernel/bpf/cpumap.c            |  12 +-
 kernel/bpf/devmap.c            |  26 +++--
 kernel/bpf/hashtab.c           |  17 +--
 kernel/bpf/local_storage.c     |  11 +-
 kernel/bpf/lpm_trie.c          |   8 +-
 kernel/bpf/offload.c           |   6 +-
 kernel/bpf/queue_stack_maps.c  |  12 +-
 kernel/bpf/reuseport_array.c   |   9 +-
 kernel/bpf/ringbuf.c           |  57 +++++-----
 kernel/bpf/stackmap.c          |  15 +--
 kernel/bpf/syscall.c           | 197 ++++++++++++++++++++++++++++----
 mm/memcontrol.c                |  41 +++++++
 net/core/sock_map.c            |  31 +++---
 net/xdp/xskmap.c               |   8 +-
 tools/include/uapi/linux/bpf.h |   5 +
 tools/lib/bpf/bpf.c            |   1 +
 tools/lib/bpf/bpf.h            |   3 +-
 tools/lib/bpf/libbpf.c         |   2 +
 25 files changed, 411 insertions(+), 178 deletions(-)