X-Patchwork-Submitter: Yafang Shao
X-Patchwork-Id: 13464936
From: Yafang Shao
To: akpm@linux-foundation.org, paul@paul-moore.com, jmorris@namei.org,
    serge@hallyn.com, omosnace@redhat.com, mhocko@suse.com
Cc: linux-mm@kvack.org, linux-security-module@vger.kernel.org,
    bpf@vger.kernel.org, ligang.bdlg@bytedance.com, Yafang Shao
Subject: [RFC PATCH v2 0/6] mm, security, bpf: Fine-grained control over
    memory policy adjustments with lsm bpf
Date: Wed, 22 Nov 2023 14:15:53 +0000
Message-Id: <20231122141559.4228-1-laoar.shao@gmail.com>

Background
==========

In our containerized environment, we've identified
unexpected OOM events in which the OOM-killer terminates tasks even though
the system has ample free memory. We traced this anomaly to tasks within a
container using mbind(2) to bind memory to a specific NUMA node. When the
memory on this node is exhausted, the OOM-killer, which prioritizes tasks
based on oom_score, kills tasks indiscriminately.

The Challenge
=============

In a containerized environment, independent memory binding by a user can
lead to unexpected system issues or disrupt tasks being run by other users
on the same server. If a user genuinely requires memory binding, we will
allocate dedicated servers to them by leveraging kubelet deployment.

Currently, users can bind their memory to specific nodes on their own,
without explicit agreement or authorization from our end. It's imperative
that we establish a method to prevent this behavior.

Proposed Solutions
==================

- Introduce Capability to Disable MPOL_BIND

  Currently, any task can perform MPOL_BIND without specific capabilities.
  Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
  may have unintended consequences. Capabilities, being broad, might grant
  unnecessary privileges. We should explore alternatives to prevent
  unexpected side effects.

- Use LSM BPF to Disable MPOL_BIND

  Introduce LSM hooks for syscalls such as mbind(2), set_mempolicy(2), and
  set_mempolicy_home_node(2) to disable MPOL_BIND. This approach is more
  flexible and allows for fine-grained control without unintended
  consequences. A sample LSM BPF program is included, demonstrating
  practical implementation in a production environment.

- seccomp

  seccomp is relatively heavyweight, making it less suitable for enabling
  in our production environment:
  - Both kubelet and containers need adaptation to support it.
  - Dynamically altering security policies for individual containers
    without interrupting their operations isn't straightforward.
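To make the LSM BPF option above more concrete, here is a minimal sketch of
the kind of decision such a program could make, written as plain user-space
C so that it builds without kernel or libbpf headers. It is not the sample
program from this series. The function name sk_check_mempolicy() and its
allow_bind parameter are hypothetical, and the sketch assumes the caller
passes the mode word as user space supplies it, with optional mode flags in
the upper bits. The SK_-prefixed constants are local copies of the values in
include/uapi/linux/mempolicy.h.

```c
#include <errno.h>

/* Local copies of values from include/uapi/linux/mempolicy.h, renamed
 * with an SK_ prefix so this sketch does not need kernel headers. */
enum {
	SK_MPOL_DEFAULT    = 0,
	SK_MPOL_PREFERRED  = 1,
	SK_MPOL_BIND       = 2,
	SK_MPOL_INTERLEAVE = 3,
};

#define SK_MPOL_F_STATIC_NODES   (1 << 15)
#define SK_MPOL_F_RELATIVE_NODES (1 << 14)
#define SK_MPOL_F_NUMA_BALANCING (1 << 13)
#define SK_MPOL_MODE_FLAGS \
	(SK_MPOL_F_STATIC_NODES | SK_MPOL_F_RELATIVE_NODES | \
	 SK_MPOL_F_NUMA_BALANCING)

/*
 * Hypothetical policy check: return -EPERM when the requested policy is
 * MPOL_BIND and the caller is not allowed to bind, 0 otherwise.
 *
 * mode:       policy mode as passed from user space; optional mode flags
 *             may be OR-ed into the upper bits, so they are masked off
 *             before the comparison.
 * allow_bind: assumed per-container setting (non-zero permits MPOL_BIND).
 */
int sk_check_mempolicy(unsigned long mode, int allow_bind)
{
	unsigned long base = mode & ~(unsigned long)SK_MPOL_MODE_FLAGS;

	if (base == SK_MPOL_BIND && !allow_bind)
		return -EPERM;

	return 0;
}
```

In an actual LSM BPF program the same comparison would live in the function
attached to the new hook, and the per-container setting would more likely
come from a BPF map or a cgroup lookup than from a function argument. That
is what allows the policy to be changed at runtime without restarting the
container, which is the main advantage claimed over seccomp.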
Future Considerations
=====================

In addition, there's room for enhancement in the OOM-killer for cases
involving CONSTRAINT_MEMORY_POLICY. It would be more beneficial to
prioritize selecting a victim that has allocated memory on the same NUMA
node. Searching lore led me to a proposal[0] on this matter, although no
consensus appears to have been reached so far. In any case, that topic is
beyond the scope of the current patchset.

[0] https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@bytedance.com/

Changes:
- RFC v1 -> RFC v2:
  - Refine the commit log to avoid misleading readers
  - Use one common lsm hook instead and add comment for it
  - Add selinux implementation
  - Other improvements in mempolicy
- RFC v1: https://lwn.net/Articles/951188/

Yafang Shao (6):
  mm, doc: Add doc for MPOL_F_NUMA_BALANCING
  mm: mempolicy: Revise comment regarding mempolicy mode flags
  mm, security: Fix missed security_task_movememory() in mbind(2)
  mm, security: Add lsm hook for memory policy adjustment
  security: selinux: Implement set_mempolicy hook
  selftests/bpf: Add selftests for set_mempolicy with a lsm prog

 .../admin-guide/mm/numa_memory_policy.rst     | 27 +++++++
 include/linux/lsm_hook_defs.h                 |  3 +
 include/linux/security.h                      |  9 +++
 include/uapi/linux/mempolicy.h                |  2 +-
 mm/mempolicy.c                                | 17 +++-
 security/security.c                           | 13 +++
 security/selinux/hooks.c                      |  8 ++
 security/selinux/include/classmap.h           |  2 +-
 tools/testing/selftests/bpf/Makefile          |  2 +-
 .../selftests/bpf/prog_tests/set_mempolicy.c  | 79 +++++++++++++++++++
 .../selftests/bpf/progs/test_set_mempolicy.c  | 29 +++++++
 11 files changed, 187 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/set_mempolicy.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_set_mempolicy.c