From patchwork Sat Nov 20 04:50:07 2021
X-Patchwork-Submitter: Mina Almasry
X-Patchwork-Id: 12630059
Date: Fri, 19 Nov 2021 20:50:07 -0800
In-Reply-To: <20211120045011.3074840-1-almasrymina@google.com>
Message-Id: <20211120045011.3074840-2-almasrymina@google.com>
References: <20211120045011.3074840-1-almasrymina@google.com>
Subject: [PATCH v4 1/4] mm: support deterministic memory charging of filesystems
From: Mina Almasry
To: Alexander Viro , Andrew Morton , Johannes Weiner , Michal Hocko ,
 Vladimir Davydov , Hugh Dickins
Cc: Mina Almasry , Jonathan Corbet , Shuah Khan , Shakeel Butt ,
 Greg Thelen , Dave Chinner , Matthew Wilcox , Roman Gushchin ,
 "Theodore Ts'o" , linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
 cgroups@vger.kernel.org

Users can specify a memcg= mount option at mount time, and all data page
charges for that filesystem will be charged to the supplied memcg. This is
useful for deterministically charging the memory of a filesystem, for
example memory shared via tmpfs.

Implementation notes:
- Add memcg= option parsing to the common fs code.
- Attach the memcg to charge for this filesystem's data pages to the
  struct super_block. The memcg can be changed via a remount operation,
  after which all future charges in this filesystem go to the new memcg.
- Create a new interface, mem_cgroup_charge_mapping(), which checks whether
  the super_block of the mapping has a memcg to charge. If so, it charges
  that memcg; otherwise it falls back to the mm passed in.
- On filesystem data memory allocation paths, call the new interface
  mem_cgroup_charge_mapping().

Caveats:
- Processes are only allowed to direct filesystem charges to a cgroup that
  they themselves can enter and allocate memory in. This is so that we do
  not introduce an attack vector where processes can DoS any cgroup in the
  system that they are not normally allowed to enter and allocate memory in.
- In mem_cgroup_charge_mapping() we pay the cost of checking whether the
  super_block has a memcg to charge, regardless of whether the mount point
  was mounted with memcg=. This can be alleviated by storing the memcg to
  charge in struct address_space, but that increases the size of that
  struct and makes it difficult to support remounting with a new memcg=
  option, although remounting is of dubious value.
- mem_cgroup_charge_mapping() simply returns any error received from the
  underlying charge_memcg() or mem_cgroup_charge() calls. A follow-up patch
  in this series examines and handles the behavior when the remote memcg
  hits its limit.

Signed-off-by: Mina Almasry
---

Changes in v4:
- Added cover letter and moved list of Cc's there.
- Made memcg= option generic to all file systems.
- Reverted to calling mem_cgroup_charge_mapping() for generic file system
  allocation paths, since this feature is not implemented for all
  filesystems.
- Refactored some memcontrol interfaces slightly to reduce the number of
  "#ifdef CONFIG_MEMCG" needed in other files.

Changes in v3:
- Fixed build failures/warnings
Reported-by: kernel test robot

Changes in v2:
- Fixed Roman's email.
- Added a new wrapper around charge_memcg() instead of __mem_cgroup_charge()
- Merged the permission check into this patch as Roman suggested.
- Instead of checking for a s_memcg_to_charge off the superblock in the
  filemap code, I set_active_memcg() before calling into the fs generic
  code as Dave suggests.
- I have kept the s_memcg_to_charge in the superblock to keep the
  struct address_space pointer small and preserve the remount use case.
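The charge-target selection described in the implementation notes can be modeled in plain userspace C. This is a hypothetical simplification, not kernel code: struct model_memcg, struct model_super_block and the model_* helpers are invented for illustration, and reclaim, OOM handling, RCU and css reference counting are all omitted. It only shows the behaviors the notes describe: charges go to the superblock's memcg when memcg= is set, fall back to the allocating task's memcg otherwise, and a remount redirects only future charges.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical, simplified stand-ins for kernel objects. These are NOT the
 * kernel's struct mem_cgroup or struct super_block; they only model the
 * state that the charging decision depends on.
 */
struct model_memcg {
	long usage; /* pages currently charged */
	long limit; /* maximum pages that may be charged */
};

struct model_super_block {
	/* Models s_memcg_to_charge: NULL means no memcg= option was given. */
	struct model_memcg *memcg_to_charge;
};

/* Charge nr_pages to one memcg; fail with -ENOMEM past the limit. */
static int model_charge_memcg(struct model_memcg *memcg, long nr_pages)
{
	assert(memcg != NULL && nr_pages > 0);
	if (memcg->usage + nr_pages > memcg->limit)
		return -ENOMEM; /* no reclaim or OOM kill is modeled */
	memcg->usage += nr_pages;
	return 0;
}

/*
 * Models the decision made by mem_cgroup_charge_mapping(): prefer the
 * superblock's memcg, otherwise fall back to the allocating task's memcg.
 * A failed charge against the superblock's memcg is returned as-is.
 */
static int model_charge_mapping(const struct model_super_block *sb,
				struct model_memcg *task_memcg, long nr_pages)
{
	if (sb != NULL && sb->memcg_to_charge != NULL)
		return model_charge_memcg(sb->memcg_to_charge, nr_pages);
	return model_charge_memcg(task_memcg, nr_pages);
}

/*
 * Models mount/remount with memcg=: only future charges are redirected.
 * (The patch swaps the pointer with xchg() and drops the old css
 * reference; neither is modeled here.)
 */
static void model_set_charge_target(struct model_super_block *sb,
				    struct model_memcg *memcg)
{
	sb->memcg_to_charge = memcg;
}
```

Note that the model, like mem_cgroup_charge_mapping() in this patch, does not retry a failed remote charge against the task's memcg; the error goes straight back to the caller.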
---
 fs/fs_context.c            |  27 +++++++
 fs/proc_namespace.c        |   4 ++
 fs/super.c                 |   9 +++
 include/linux/fs.h         |   5 ++
 include/linux/fs_context.h |   2 +
 include/linux/memcontrol.h |  32 +++++++++
 mm/filemap.c               |   2 +-
 mm/khugepaged.c            |   3 +-
 mm/memcontrol.c            | 142 +++++++++++++++++++++++++++++++++++++
 mm/shmem.c                 |   3 +-
 10 files changed, 226 insertions(+), 3 deletions(-)

-- 
2.34.0.rc2.393.gf8c9666880-goog

diff --git a/fs/fs_context.c b/fs/fs_context.c
index b7e43a780a625..fe2449d5f1fbf 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -23,6 +23,7 @@
 #include 
 #include "mount.h"
 #include "internal.h"
+#include 
 
 enum legacy_fs_param {
 	LEGACY_FS_UNSET_PARAMS,
@@ -108,6 +109,28 @@ int vfs_parse_fs_param_source(struct fs_context *fc, struct fs_parameter *param)
 }
 EXPORT_SYMBOL(vfs_parse_fs_param_source);
 
+static int parse_param_memcg(struct fs_context *fc, struct fs_parameter *param)
+{
+	struct mem_cgroup *memcg;
+
+	if (strcmp(param->key, "memcg") != 0)
+		return -ENOPARAM;
+
+	if (param->type != fs_value_is_string)
+		return invalf(fc, "Non-string source");
+
+	if (fc->memcg)
+		return invalf(fc, "Multiple memcgs specified");
+
+	memcg = mem_cgroup_get_from_path(param->string);
+	if (IS_ERR(memcg))
+		return invalf(fc, "Bad value for memcg");
+
+	fc->memcg = memcg;
+	param->string = NULL;
+	return 0;
+}
+
 /**
  * vfs_parse_fs_param - Add a single parameter to a superblock config
  * @fc: The filesystem context to modify
@@ -148,6 +171,10 @@ int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param)
 		return ret;
 	}
 
+	ret = parse_param_memcg(fc, param);
+	if (ret != -ENOPARAM)
+		return ret;
+
 	/* If the filesystem doesn't take any arguments, give it the
 	 * default handling of source.
 	 */
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 392ef5162655b..32e1647dcef43 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "proc/internal.h" /* only for get_proc_task() in ->open() */
 
@@ -125,6 +126,9 @@ static int show_vfsmnt(struct seq_file *m, struct vfsmount *mnt)
 	if (err)
 		goto out;
 	show_mnt_opts(m, mnt);
+
+	mem_cgroup_put_name_in_seq(m, sb);
+
 	if (sb->s_op->show_options)
 		err = sb->s_op->show_options(m, mnt_path.dentry);
 	seq_puts(m, " 0 0\n");
diff --git a/fs/super.c b/fs/super.c
index 3bfc0f8fbd5bc..06c972f80c529 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include /* for the emergency remount stuff */
@@ -180,6 +181,7 @@ static void destroy_unused_super(struct super_block *s)
 	up_write(&s->s_umount);
 	list_lru_destroy(&s->s_dentry_lru);
 	list_lru_destroy(&s->s_inode_lru);
+	mem_cgroup_set_charge_target(s, NULL);
 	security_sb_free(s);
 	put_user_ns(s->s_user_ns);
 	kfree(s->s_subtype);
@@ -292,6 +294,7 @@ static void __put_super(struct super_block *s)
 		WARN_ON(s->s_dentry_lru.node);
 		WARN_ON(s->s_inode_lru.node);
 		WARN_ON(!list_empty(&s->s_mounts));
+		mem_cgroup_set_charge_target(s, NULL);
 		security_sb_free(s);
 		fscrypt_sb_free(s);
 		put_user_ns(s->s_user_ns);
@@ -904,6 +907,9 @@ int reconfigure_super(struct fs_context *fc)
 		}
 	}
 
+	if (fc->memcg)
+		mem_cgroup_set_charge_target(sb, fc->memcg);
+
 	if (fc->ops->reconfigure) {
 		retval = fc->ops->reconfigure(fc);
 		if (retval) {
@@ -1528,6 +1534,9 @@ int vfs_get_tree(struct fs_context *fc)
 		return error;
 	}
 
+	if (fc->memcg)
+		mem_cgroup_set_charge_target(sb, fc->memcg);
+
 	/*
 	 * filesystems should never set s_maxbytes larger than MAX_LFS_FILESIZE
 	 * but s_maxbytes was an unsigned long long for many releases. Throw
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3afca821df32e..59407b3e7aee3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1567,6 +1567,11 @@ struct super_block {
 	struct workqueue_struct *s_dio_done_wq;
 	struct hlist_head s_pins;
 
+#ifdef CONFIG_MEMCG
+	/* memcg to charge for pages allocated to this filesystem */
+	struct mem_cgroup *s_memcg_to_charge;
+#endif
+
 	/*
 	 * Owning user namespace and default context in which to
 	 * interpret filesystem uids, gids, quotas, device nodes,
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 6b54982fc5f37..8e2cc1e554fa1 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -25,6 +25,7 @@ struct super_block;
 struct user_namespace;
 struct vfsmount;
 struct path;
+struct mem_cgroup;
 
 enum fs_context_purpose {
 	FS_CONTEXT_FOR_MOUNT, /* New superblock for explicit mount */
@@ -110,6 +111,7 @@ struct fs_context {
 	bool need_free:1; /* Need to call ops->free() */
 	bool global:1; /* Goes into &init_user_ns */
 	bool oldapi:1; /* Coming from mount(2) */
+	struct mem_cgroup *memcg; /* memcg to charge */
 };
 
 struct fs_context_operations {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c5c403f4be6b..0a9b0bba5f3c8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -27,6 +27,7 @@ struct obj_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct super_block;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -923,6 +924,15 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
 	return !!(memcg->css.flags & CSS_ONLINE);
 }
 
+void mem_cgroup_set_charge_target(struct super_block *sb,
+				  struct mem_cgroup *memcg);
+
+int mem_cgroup_charge_mapping(struct folio *folio, struct mm_struct *mm,
+			      gfp_t gfp, struct address_space *mapping);
+
+struct mem_cgroup *mem_cgroup_get_from_path(const char *path);
+void mem_cgroup_put_name_in_seq(struct seq_file *seq, struct super_block *sb);
+
 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 		int zid, int nr_pages);
 
@@ -1223,6 +1233,28 @@ static inline int mem_cgroup_charge(struct folio *folio,
 	return 0;
 }
 
+static inline void mem_cgroup_set_charge_target(struct super_block *sb,
+						struct mem_cgroup *memcg)
+{
+}
+
+static inline int mem_cgroup_charge_mapping(struct folio *folio,
+					    struct mm_struct *mm, gfp_t gfp,
+					    struct address_space *mapping)
+{
+	return 0;
+}
+
+static inline struct mem_cgroup *mem_cgroup_get_from_path(const char *path)
+{
+	return NULL;
+}
+
+static inline void mem_cgroup_put_name_in_seq(struct seq_file *seq,
+					      struct super_block *sb)
+{
+}
+
 static inline int mem_cgroup_swapin_charge_page(struct page *page,
 			struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 6844c9816a864..3825cf12bc345 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -903,7 +903,7 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 	folio->index = index;
 
 	if (!huge) {
-		error = mem_cgroup_charge(folio, NULL, gfp);
+		error = mem_cgroup_charge_mapping(folio, NULL, gfp, mapping);
 		VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
 		if (error)
 			goto error;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e99101162f1ab..8468a3ad446b9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1661,7 +1661,8 @@ static void collapse_file(struct mm_struct *mm,
 		goto out;
 	}
 
-	if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
+	if (unlikely(mem_cgroup_charge_mapping(page_folio(new_page), mm, gfp,
+					       mapping))) {
 		result = SCAN_CGROUP_CHARGE_FAIL;
 		goto out;
 	}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 781605e920153..c4ba7f364c214 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -62,6 +62,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 #include 
 #include 
@@ -2580,6 +2581,129 @@ void mem_cgroup_handle_over_high(void)
 	css_put(&memcg->css);
 }
 
+/*
+ * Non error return value must eventually be released with css_put().
+ */
+struct mem_cgroup *mem_cgroup_get_from_path(const char *path)
+{
+	static const char procs_filename[] = "/cgroup.procs";
+	struct file *file, *procs;
+	struct cgroup_subsys_state *css;
+	struct mem_cgroup *memcg;
+	char *procs_path =
+		kmalloc(strlen(path) + sizeof(procs_filename), GFP_KERNEL);
+
+	if (procs_path == NULL)
+		return ERR_PTR(-ENOMEM);
+	strcpy(procs_path, path);
+	strcat(procs_path, procs_filename);
+
+	procs = filp_open(procs_path, O_WRONLY, 0);
+	kfree(procs_path);
+
+	/*
+	 * Restrict the capability for tasks to mount with memcg charging to the
+	 * cgroup they could not join. For example, disallow:
+	 *
+	 * mount -t tmpfs -o memcg=root-cgroup nodev
+	 *
+	 * if it is a non-root task.
+	 */
+	if (IS_ERR(procs))
+		return (struct mem_cgroup *)procs;
+	fput(procs);
+
+	file = filp_open(path, O_DIRECTORY | O_RDONLY, 0);
+	if (IS_ERR(file))
+		return (struct mem_cgroup *)file;
+
+	css = css_tryget_online_from_dir(file->f_path.dentry,
+					 &memory_cgrp_subsys);
+	if (IS_ERR(css))
+		memcg = (struct mem_cgroup *)css;
+	else
+		memcg = container_of(css, struct mem_cgroup, css);
+
+	fput(file);
+	return memcg;
+}
+
+void mem_cgroup_put_name_in_seq(struct seq_file *m, struct super_block *sb)
+{
+	struct mem_cgroup *memcg;
+	int ret = 0;
+	char *buf = __getname();
+	int len = PATH_MAX;
+
+	if (!buf)
+		return;
+
+	buf[0] = '\0';
+
+	rcu_read_lock();
+	memcg = rcu_dereference(sb->s_memcg_to_charge);
+	if (memcg && !css_tryget_online(&memcg->css))
+		memcg = NULL;
+	rcu_read_unlock();
+
+	if (!memcg)
+		return;
+
+	ret = cgroup_path(memcg->css.cgroup, buf + len / 2, len / 2);
+	if (ret >= len / 2)
+		strcpy(buf, "?");
+	else {
+		char *p = mangle_path(buf, buf + len / 2, " \t\n\\");
+
+		if (p)
+			*p = '\0';
+		else
+			strcpy(buf, "?");
+	}
+
+	css_put(&memcg->css);
+	if (buf[0] != '\0')
+		seq_printf(m, ",memcg=%s", buf);
+
+	__putname(buf);
+}
+
+/*
+ * Set or clear (if @memcg is NULL) charge association from file system to
+ * memcg. If @memcg != NULL, then a css reference must be held by the caller to
+ * ensure that the cgroup is not deleted during this operation, this reference
+ * is dropped after this operation.
+ */
+void mem_cgroup_set_charge_target(struct super_block *sb,
+				  struct mem_cgroup *memcg)
+{
+	memcg = xchg(&sb->s_memcg_to_charge, memcg);
+	if (memcg)
+		css_put(&memcg->css);
+}
+
+/*
+ * Returns the memcg to charge for inode pages. If non-NULL is returned, caller
+ * must drop reference with css_put(). NULL indicates that the inode does not
+ * have a memcg to charge, so the default process based policy should be used.
+ */
+static struct mem_cgroup *
+mem_cgroup_mapping_get_charge_target(struct address_space *mapping)
+{
+	struct mem_cgroup *memcg;
+
+	if (!mapping)
+		return NULL;
+
+	rcu_read_lock();
+	memcg = rcu_dereference(mapping->host->i_sb->s_memcg_to_charge);
+	if (memcg && !css_tryget_online(&memcg->css))
+		memcg = NULL;
+	rcu_read_unlock();
+
+	return memcg;
+}
+
 static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 			    unsigned int nr_pages)
 {
@@ -6678,6 +6802,24 @@ static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
 	return ret;
 }
 
+int mem_cgroup_charge_mapping(struct folio *folio, struct mm_struct *mm,
+			      gfp_t gfp, struct address_space *mapping)
+{
+	struct mem_cgroup *mapping_memcg;
+	int ret = 0;
+	if (mem_cgroup_disabled())
+		return 0;
+
+	mapping_memcg = mem_cgroup_mapping_get_charge_target(mapping);
+	if (mapping_memcg) {
+		ret = charge_memcg(folio, mapping_memcg, gfp);
+		css_put(&mapping_memcg->css);
+		return ret;
+	}
+
+	return mem_cgroup_charge(folio, mm, gfp);
+}
+
 int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp)
 {
 	struct mem_cgroup *memcg;
diff --git a/mm/shmem.c b/mm/shmem.c
index 23c91a8beb781..e469da13a1b8a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -709,7 +709,8 @@ static int shmem_add_to_page_cache(struct page *page,
 	page->index = index;
 
 	if (!PageSwapCache(page)) {
-		error = mem_cgroup_charge(page_folio(page), charge_mm, gfp);
+		error = mem_cgroup_charge_mapping(page_folio(page), charge_mm,
+						  gfp, mapping);
 		if (error) {
 			if (PageTransHuge(page)) {
 				count_vm_event(THP_FILE_FALLBACK);

From patchwork Sat Nov 20 04:50:08 2021
X-Patchwork-Submitter: Mina Almasry
X-Patchwork-Id: 12630061
Date: Fri, 19 Nov 2021 20:50:08 -0800
In-Reply-To: <20211120045011.3074840-1-almasrymina@google.com>
Message-Id: <20211120045011.3074840-3-almasrymina@google.com>
References: <20211120045011.3074840-1-almasrymina@google.com>
Subject: [PATCH v4 2/4] mm/oom: handle remote ooms
From: Mina Almasry
To: Johannes Weiner , Michal Hocko , Vladimir Davydov , Andrew Morton
Cc: Mina Almasry , Jonathan Corbet , Alexander Viro , Hugh Dickins ,
 Shuah Khan , Shakeel Butt , Greg Thelen , Dave Chinner ,
 Matthew Wilcox , Roman Gushchin , "Theodore Ts'o" ,
 linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, cgroups@vger.kernel.org

On remote OOMs (OOMs caused by remote charging), the oom-killer attempts to
find a task to kill in the memcg under OOM. It may be unable to find one if
there are no killable processes in the remote memcg. In that case,
out_of_memory() returns false and, depending on the gfp flags, this will
generally bubble up to mem_cgroup_charge_mapping() as ENOMEM.

A few considerations on how to handle this edge case:

1. memcg= is an opt-in feature, so we have some flexibility in the behavior
   we expose to userspace that uses it to carry out remote charges, which
   may result in remote OOMs. The critical thing is to document this
   behavior so that userspace knows what to expect and can handle the edge
   cases.

2. It is generally not desirable to kill the allocating process: it is not
   a member of the remote memcg that is under OOM, so killing it will
   almost certainly not free any memory in that memcg.

3. Some allocations happen in pagefault paths and others in non-pagefault
   paths. The caller handles the error returned from
   mem_cgroup_charge_mapping(), so userspace sees different behavior in the
   two cases. For example, if mem_cgroup_charge_mapping() currently returns
   ENOMEM, the caller will generally get ENOMEM on non-pagefault paths, but
   will be stuck looping on the pagefault forever on the pagefault path.

4. In general, it's desirable to give userspace the option to gracefully
   handle and recover from a failed remote charge rather than kill the
   process or put it into a situation that's hard to recover from.

With these considerations, the approach that makes the most sense is to
handle this edge case the same way we handle ENOSPC, and to return ENOSPC
from mem_cgroup_charge_mapping() when the remote charge fails. This has the
following desirable properties:

1. On pagefault allocations, userspace gets a SIGBUS if the remote charge
   fails, and can catch this signal and handle it to recover gracefully as
   desired.

2. On non-pagefault paths, userspace gets an ENOSPC error, which it can
   also handle gracefully if desired.

3. The remote-charging process is neither left looping on the pagefault (a
   state that is somewhat hard to recover from) nor killed.

Implementation notes:

1. To get the ENOSPC behavior we allegedly want, mem_cgroup_charge_mapping()
   detects whether charge_memcg() has failed with ENOMEM and returns ENOSPC
   instead.

2. If the oom-killer is invoked and finds nothing to kill, it prints the
   "Out of memory and no killable processes..." message. This can be spammy
   if the system is executing many remote charges, and it will likely cause
   worry as a scary-looking kernel warning, even though this is a somewhat
   expected edge case that we handle adequately. Therefore, out_of_memory()
   returns early for remote OOMs and does not print this warning. This is
   not necessary for the functionality of remote charges.

Signed-off-by: Mina Almasry
---

Changes in v4:
- Greatly expanded on the commit message to include all my current thinking.
- Converted the patch to handle remote ooms similarly to ENOSPC, rather than
  ENOMEM.

Changes in v3:
- Fixed build failures/warnings
Reported-by: kernel test robot

Changes in v2:
- Moved the remote oom handling as Roman requested.
- Used mem_cgroup_from_task(current) instead of grabbing the memcg from
  current->mm

---
 include/linux/memcontrol.h |  6 ++++++
 mm/memcontrol.c            | 31 ++++++++++++++++++++++++++++++-
 mm/oom_kill.c              |  9 +++++++++
 3 files changed, 45 insertions(+), 1 deletion(-)

-- 
2.34.0.rc2.393.gf8c9666880-goog

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0a9b0bba5f3c8..451feebabf160 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -932,6 +932,7 @@ int mem_cgroup_charge_mapping(struct folio *folio, struct mm_struct *mm,
 
 struct mem_cgroup *mem_cgroup_get_from_path(const char *path);
 void mem_cgroup_put_name_in_seq(struct seq_file *seq, struct super_block *sb);
+bool is_remote_oom(struct mem_cgroup *memcg_under_oom);
 
 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 		int zid, int nr_pages);
@@ -1255,6 +1256,11 @@ static inline void mem_cgroup_put_name_in_seq(struct seq_file *seq,
 {
 }
 
+static inline bool is_remote_oom(struct mem_cgroup *memcg_under_oom)
+{
+	return false;
+}
+
 static inline int mem_cgroup_swapin_charge_page(struct page *page,
 			struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c4ba7f364c214..3e5bc2c32c9b7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2668,6 +2668,35 @@ void mem_cgroup_put_name_in_seq(struct seq_file *m, struct super_block *sb)
 	__putname(buf);
 }
 
+/*
+ * Returns true if current's memcg is neither memcg_under_oom nor one of its
+ * descendants. False otherwise. This is used by the oom-killer to detect
+ * ooms due to remote charging.
+ */
+bool is_remote_oom(struct mem_cgroup *memcg_under_oom)
+{
+	struct mem_cgroup *current_memcg;
+	bool is_remote_oom;
+
+	if (!memcg_under_oom)
+		return false;
+
+	rcu_read_lock();
+	current_memcg = mem_cgroup_from_task(current);
+	if (current_memcg && !css_tryget_online(&current_memcg->css))
+		current_memcg = NULL;
+	rcu_read_unlock();
+
+	if (!current_memcg)
+		return false;
+
+	is_remote_oom =
+		!mem_cgroup_is_descendant(current_memcg, memcg_under_oom);
+	css_put(&current_memcg->css);
+
+	return is_remote_oom;
+}
+
 /*
  * Set or clear (if @memcg is NULL) charge association from file system to
  * memcg. If @memcg != NULL, then a css reference must be held by the caller to
@@ -6814,7 +6843,7 @@ int mem_cgroup_charge_mapping(struct folio *folio, struct mm_struct *mm,
 	if (mapping_memcg) {
 		ret = charge_memcg(folio, mapping_memcg, gfp);
 		css_put(&mapping_memcg->css);
-		return ret;
+		return ret == -ENOMEM ? -ENOSPC : ret;
 	}
 
 	return mem_cgroup_charge(folio, mm, gfp);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0a7e16b16b8c3..8db500b337415 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1108,6 +1108,15 @@ bool out_of_memory(struct oom_control *oc)
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
+		if (is_remote_oom(oc->memcg)) {
+			/*
+			 * For remote ooms with no killable processes, return
+			 * false here without logging the warning below as we
+			 * expect the caller to handle this as they please.
+			 */
+			return false;
+		}
+
 		dump_header(oc, NULL);
 		pr_warn("Out of memory and no killable processes...\n");
 		/*

From patchwork Sat Nov 20 04:50:09 2021
X-Patchwork-Submitter: Mina Almasry
X-Patchwork-Id: 12630063
rTpfhIB3TB148ACfiYmshuzDlk8La2eFK38A6waYF1P8TDVR9KDtqC5wgu6d4MhZGzn9 AU/xppkCdNdEA6B7QqNnDCas+iwtGzetJMukBnFtJdRdTqKvTWX7FAcUwxCo6r8I32sf bBCWBPZBvHHjBAZX3j8SS2dTcb80lzU4+UCk6gX7wrhiM6xAKdh05qROmWp4Y+SiE12o V0NAZSvNAXayIFigemF54IXXzW9KOuEVLOEBm4hrqi2dGz99WYHyQdUR37wEDeW9fchV D+/g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=Iq24JeebEl5J+lwUGxaMYnGaB9XaVCl6Xy4iwtIyjvA=; b=GeLGFtrF5t8NQ2m7Zc4B8hMis2CE++d/2o1/A5cOy882SdUotOXOH9DTKWq9Nlamaz BCo5lSi6h7cqV5tcq3yg5OjhmA/7rwTjqhpAdU5J3S6CdRLv9Khc2hLvaxT1rnG3IGiG 7XdNRqUnNHCZfdDdAN1G9vWRGi2WVTEL4rhRdkV85dY0zkpbPOaU0eIpUZHrYU0ab5vS jRO6Amreps5fmia/ADg8Qxoras4qx1IGMU9qNC96JCkXSV/agGmpwHSTgUibqFwTyY0r p7jAKGmKVC/MJiyCsW2cUjIJx2vX6Nd7ZYc3jWc4S31bS6WPOXRKpq5UqMRBTS8QP4M9 6w4A== X-Gm-Message-State: AOAM530ms3C1QqrggLhZxoSgUqRiu2y1afSqQCm83yB15g3su5G8XTdv SFZums1DI0JtWKahs/3iHF5jc8lh+W1OnHIsXA== X-Google-Smtp-Source: ABdhPJw2LtgTcmkkzLu7IfsO7oStiwjz2hQIMktMWyDyw5d5CngJQrpTot1D+094Y6eOR3p65b1F4UncxKHZugK7JQ== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2cd:202:fa91:560a:d7b4:93]) (user=almasrymina job=sendgmr) by 2002:a17:90b:4c8b:: with SMTP id my11mr6839565pjb.96.1637383825076; Fri, 19 Nov 2021 20:50:25 -0800 (PST) Date: Fri, 19 Nov 2021 20:50:09 -0800 In-Reply-To: <20211120045011.3074840-1-almasrymina@google.com> Message-Id: <20211120045011.3074840-4-almasrymina@google.com> Mime-Version: 1.0 References: <20211120045011.3074840-1-almasrymina@google.com> X-Mailer: git-send-email 2.34.0.rc2.393.gf8c9666880-goog Subject: [PATCH v4 3/4] mm, shmem: add filesystem memcg= option documentation From: Mina Almasry To: Jonathan Corbet Cc: Mina Almasry , Alexander Viro , Andrew Morton , Johannes Weiner , Michal Hocko , Vladimir Davydov , Hugh Dickins , Shuah Khan , Shakeel Butt , Greg Thelen , Dave Chinner , Matthew Wilcox , Roman Gushchin , "Theodore Ts'o" , 
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-doc@vger.kernel.org X-Stat-Signature: 7uf8xa43j5x9wk653nuo8o94qzfs3h1c X-Rspamd-Queue-Id: 943FC192D X-Rspamd-Server: rspam07 Authentication-Results: imf22.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=chHgb4pu; spf=pass (imf22.hostedemail.com: domain of 3kX6YYQsKCAQepqewv2qmreksskpi.gsqpmry1-qqozego.svk@flex--almasrymina.bounces.google.com designates 209.85.215.201 as permitted sender) smtp.mailfrom=3kX6YYQsKCAQepqewv2qmreksskpi.gsqpmry1-qqozego.svk@flex--almasrymina.bounces.google.com; dmarc=pass (policy=reject) header.from=google.com X-HE-Tag: 1637383825-996809 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Document the usage of the memcg= mount option, as well as permission restrictions of its use and caveats with remote charging. Signed-off-by: Mina Almasry --- Changes in v4: - Added more info about the permissions to mount with memcg=, and the importance of restricting write access to the mount point. - Changed documentation to describe the ENOSPC/SIGBUS behavior rather than the ENOMEM behavior implemented in earlier patches. - I did not find a good place to put this documentation after making the mount option generic. Please let me know if there is a good place to add this, and if not I can add a new file. Thanks! --- Documentation/filesystems/tmpfs.rst | 28 ++++++++++++++++++++++++++++ 1 file changed, 28 insertions(+) -- 2.34.0.rc2.393.gf8c9666880-goog diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst index 0408c245785e3..dc1f46e16eaf4 100644 --- a/Documentation/filesystems/tmpfs.rst +++ b/Documentation/filesystems/tmpfs.rst @@ -137,6 +137,34 @@ mount options. It can be added later, when the tmpfs is already mounted on MountPoint, by 'mount -o remount,mpol=Policy:NodeList MountPoint'. 
+If CONFIG_MEMCG is enabled, filesystems (including tmpfs) support a mount option +that specifies the memory cgroup to be charged for page allocations. + +memcg=/sys/fs/cgroup/unified/test/: data page allocations are charged to +cgroup /sys/fs/cgroup/unified/test/. + +Only processes that have write access to +/sys/fs/cgroup/unified/test/cgroup.procs can mount a tmpfs with +memcg=/sys/fs/cgroup/unified/test. Thus, a process can charge memory to a +cgroup only if it could itself enter that cgroup and allocate memory there. +This prevents arbitrary processes from mounting filesystems in user +namespaces and deliberately exhausting the memory of unrelated cgroups. + +Once a mount point is created with memcg=, any process that has write access to +the mount point can use it and thereby direct charges to the given cgroup. +It is therefore important to restrict write access to the mount point to the +intended users if untrusted code runs on the machine. This generally applies +regardless of whether the filesystem is mounted with memcg= or not. + +When a remote charge (a charge to the memcg specified with memcg=) hits that +memcg's limit, the oom-killer is invoked (if enabled) and attempts to kill a +process in the remote memcg. If no killable process is found, the charging +process receives an ENOSPC error; if the charge occurs in the page fault path, +the charging process receives a SIGBUS signal instead. Processes that +perform remote charges should therefore be prepared to handle an ENOSPC +error or a SIGBUS signal arising from those charges.
+ + To specify the initial root directory you can use the following mount options: From patchwork Sat Nov 20 04:50:10 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 12630065 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 783CFC433EF for ; Sat, 20 Nov 2021 04:52:40 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 618A16B0075; Fri, 19 Nov 2021 23:50:40 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 5A0CB6B0078; Fri, 19 Nov 2021 23:50:40 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 41B476B007B; Fri, 19 Nov 2021 23:50:40 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0192.hostedemail.com [216.40.44.192]) by kanga.kvack.org (Postfix) with ESMTP id 2FE2F6B0075 for ; Fri, 19 Nov 2021 23:50:40 -0500 (EST) Received: from smtpin08.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id EC02F87701 for ; Sat, 20 Nov 2021 04:50:29 +0000 (UTC) X-FDA: 78828082578.08.EA18030 Received: from mail-pg1-f201.google.com (mail-pg1-f201.google.com [209.85.215.201]) by imf17.hostedemail.com (Postfix) with ESMTP id 90DBDF0001D3 for ; Sat, 20 Nov 2021 04:50:29 +0000 (UTC) Received: by mail-pg1-f201.google.com with SMTP id u6-20020a63f646000000b002dbccd46e61so5040783pgj.18 for ; Fri, 19 Nov 2021 20:50:29 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=uvNIfQ2xPxMosCuaZIYF62oO8SwPg1ICTimCdNTm9Vg=; b=kkr7HoN4DfwCs24jlS1zFyDomRuE632ceq1FIOuLPKkP5rbmQpiVF2QlY4VWdBv3f1 
gmqzsT2J6RO1wFTt8ELRkyFxE7vOyk00zgiJmcI++BCKPDYfxgEBhj05WQ7P+HSAClIG uf73/EDXzDcXIsjYc7JgHddA0schC1QFC2zw8I9Et5IeYM5W24UkLIAMwTAZzYEdWFht 2+fTz1sSwLrbVTLMUTAkfO39C3FXugos8dhvkeoTBA7tZIIqyRvWvhn/IjKZR6i5U+/j KHdrJPbo07RKeWEdjCMuyNp0MrsGP8F03eJKk+Km8ZFWfPBXNC9PKlY4qEgXEmdbL8PO gdcQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=uvNIfQ2xPxMosCuaZIYF62oO8SwPg1ICTimCdNTm9Vg=; b=aOsqRT32gd/f0fHtQQT0WYHtsqbd+lCWFVmUdEmu80qyLz04iU3kpOLiG9HzmOaxeX RD6AyKQ8hVLZjR9e0IncTZ8ZD7/+wdUkSnwXX2cS/76yKUJwnbNBjXQn7G7JG5ziiP7O 1S/Sp1qu+o0LJ2g6tMm/teMUPmW+gSjvgjBDFCnvHsCTjwVjrZ/M/gTLJ86XtL7FKHIM TsSd6POZLvO1/8ed7P2fdCEM+DwYZgxyPuamR4xiw4LhhsU5sSKWmQn4CaWE4ThS2/Pi lJ5BCX5EG4oSq55ZixJs3xXv5jOc/vs6OqVWdwAa2vJc3/5AGod4CNVZ7MIerSh+F0J5 Fijg== X-Gm-Message-State: AOAM533UW6yBugFC5YtpqHnfkhRLS3HmFP4ZQEwQwH7twywPuH1F3NIy P0E1W7uY4ta2odjop+FI4209cGjYy/V3ZjQn8w== X-Google-Smtp-Source: ABdhPJxznMEO7QjA++HmqkjBv2dw1jatuQ8Xo0MBwxe+EG5+/zYigHZk1KISXWqJLlEm4pkIXaY6ZL25E8kDU4UWTA== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2cd:202:fa91:560a:d7b4:93]) (user=almasrymina job=sendgmr) by 2002:a17:90a:c78f:: with SMTP id gn15mr6699081pjb.54.1637383828674; Fri, 19 Nov 2021 20:50:28 -0800 (PST) Date: Fri, 19 Nov 2021 20:50:10 -0800 In-Reply-To: <20211120045011.3074840-1-almasrymina@google.com> Message-Id: <20211120045011.3074840-5-almasrymina@google.com> Mime-Version: 1.0 References: <20211120045011.3074840-1-almasrymina@google.com> X-Mailer: git-send-email 2.34.0.rc2.393.gf8c9666880-goog Subject: [PATCH v4 4/4] mm, shmem, selftests: add tmpfs memcg= mount option tests From: Mina Almasry To: Andrew Morton , Shuah Khan Cc: Mina Almasry , Jonathan Corbet , Alexander Viro , Johannes Weiner , Michal Hocko , Vladimir Davydov , Hugh Dickins , Shakeel Butt , Greg Thelen , Dave Chinner , Matthew Wilcox , Roman Gushchin , "Theodore Ts'o" , 
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org X-Rspamd-Server: rspam05 X-Rspamd-Queue-Id: 90DBDF0001D3 X-Stat-Signature: trn318j64qoi6hrx8dkjuum4oxoj3t6b Authentication-Results: imf17.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=kkr7HoN4; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf17.hostedemail.com: domain of 3lH6YYQsKCAchsthzy5tpuhnvvnsl.jvtspu14-ttr2hjr.vyn@flex--almasrymina.bounces.google.com designates 209.85.215.201 as permitted sender) smtp.mailfrom=3lH6YYQsKCAchsthzy5tpuhnvvnsl.jvtspu14-ttr2hjr.vyn@flex--almasrymina.bounces.google.com X-HE-Tag: 1637383829-129296 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: - Test that mounting and remounting with memcg= succeed. - Test that simple writes in this file system are charged to the correct memcg. - Test that on non-pagefault paths the calling process gets an ENOSPC. - Test that in pagefault paths the calling process gets a SIGBUS. Signed-off-by: Mina Almasry --- Changes in v4: - Convert tests to expect ENOSPC/SIGBUS rather than ENOMEM oom behavior. - Added remount test.
--- tools/testing/selftests/vm/.gitignore | 1 + tools/testing/selftests/vm/mmap_write.c | 103 +++++++++++++++++++ tools/testing/selftests/vm/tmpfs-memcg.sh | 116 ++++++++++++++++++++++ 3 files changed, 220 insertions(+) create mode 100644 tools/testing/selftests/vm/mmap_write.c create mode 100755 tools/testing/selftests/vm/tmpfs-memcg.sh -- 2.34.0.rc2.393.gf8c9666880-goog diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore index 2e7e86e852828..cb229974c5f15 100644 --- a/tools/testing/selftests/vm/.gitignore +++ b/tools/testing/selftests/vm/.gitignore @@ -19,6 +19,7 @@ madv_populate userfaultfd mlock-intersect-test mlock-random-test +mmap_write virtual_address_range gup_test va_128TBswitch diff --git a/tools/testing/selftests/vm/mmap_write.c b/tools/testing/selftests/vm/mmap_write.c new file mode 100644 index 0000000000000..88a8468f2128c --- /dev/null +++ b/tools/testing/selftests/vm/mmap_write.c @@ -0,0 +1,103 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * This program faults memory in tmpfs + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* Global definitions. */ + +/* Global variables. */ +static const char *self; +static char *shmaddr; +static int shmid; + +/* + * Show usage and exit. + */ +static void exit_usage(void) +{ + printf("Usage: %s -p -s \n", self); + exit(EXIT_FAILURE); +} + +int main(int argc, char **argv) +{ + int fd = 0; + int key = 0; + int *ptr = NULL; + int c = 0; + int size = 0; + char path[256] = ""; + int want_sleep = 0, private = 0; + int populate = 0; + int write = 0; + int reserve = 1; + + /* Parse command-line arguments. 
*/ + setvbuf(stdout, NULL, _IONBF, 0); + self = argv[0]; + + while ((c = getopt(argc, argv, ":s:p:")) != -1) { + switch (c) { + case 's': + size = atoi(optarg); + break; + case 'p': + strncpy(path, optarg, sizeof(path) - 1); + break; + default: + errno = EINVAL; + perror("Invalid arg"); + exit_usage(); + } + } + + printf("%s\n", path); + if (strncmp(path, "", sizeof(path)) != 0) { + printf("Writing to this path: %s\n", path); + } else { + errno = EINVAL; + perror("path not found"); + exit_usage(); + } + + if (size != 0) { + printf("Writing this size: %d\n", size); + } else { + errno = EINVAL; + perror("size not found"); + exit_usage(); + } + + fd = open(path, O_CREAT | O_RDWR, 0777); + if (fd == -1) + err(1, "Failed to open file."); + + if (ftruncate(fd, size)) + err(1, "failed to ftruncate %s", path); + + ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); + if (ptr == MAP_FAILED) { + close(fd); + err(1, "Error mapping the file"); + } + + printf("Writing to memory.\n"); + memset(ptr, 1, size); + printf("Done writing to memory.\n"); + close(fd); + + return 0; +} diff --git a/tools/testing/selftests/vm/tmpfs-memcg.sh b/tools/testing/selftests/vm/tmpfs-memcg.sh new file mode 100755 index 0000000000000..50876992107fd --- /dev/null +++ b/tools/testing/selftests/vm/tmpfs-memcg.sh @@ -0,0 +1,116 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 + +CGROUP_PATH=/dev/cgroup/memory/tmpfs-memcg-test +REMOUNT_CGROUP_PATH=/dev/cgroup/memory/remount-memcg-test + +function cleanup() { + rm -rf /mnt/tmpfs/* + umount /mnt/tmpfs + rm -rf /mnt/tmpfs + + rmdir $CGROUP_PATH + rmdir $REMOUNT_CGROUP_PATH + + echo CLEANUP DONE +} + +function setup() { + mkdir -p $CGROUP_PATH + mkdir -p $REMOUNT_CGROUP_PATH + echo $((10 * 1024 * 1024)) > $CGROUP_PATH/memory.limit_in_bytes + echo 0 > $CGROUP_PATH/cpuset.cpus + echo 0 > $CGROUP_PATH/cpuset.mems + + mkdir -p /mnt/tmpfs + + echo SETUP DONE +} + +function expect_equal() { + local expected="$1" + local actual="$2" + local error="$3" +
+ if [[ "$actual" != "$expected" ]]; then + echo "expected ($expected) != actual ($actual): $error" >&2 + cleanup + exit 1 + fi +} + +function expect_ge() { + local expected="$1" + local actual="$2" + local error="$3" + + if [[ "$actual" -lt "$expected" ]]; then + echo "actual ($actual) < expected ($expected): $error" >&2 + cleanup + exit 1 + fi +} + +cleanup +setup + +mount -t tmpfs -o memcg=$REMOUNT_CGROUP_PATH tmpfs /mnt/tmpfs +check=$(cat /proc/mounts | grep -i remount-memcg-test) +if [ -z "$check" ]; then + echo "tmpfs memcg= was not mounted correctly:" + echo $check + echo "FAILED" + cleanup + exit 1 +fi + +mount -t tmpfs -o remount,memcg=$CGROUP_PATH tmpfs /mnt/tmpfs +check=$(cat /proc/mounts | grep -i tmpfs-memcg-test) +if [ -z "$check" ]; then + echo "tmpfs memcg= was not remounted correctly:" + echo $check + echo "FAILED" + cleanup + exit 1 +fi + +TARGET_MEMCG_USAGE=$(cat $CGROUP_PATH/memory.usage_in_bytes) +expect_equal 0 "$TARGET_MEMCG_USAGE" "Before echo, memcg usage should be 0" + +# Echo to allocate a page in the tmpfs +echo +echo +echo hello > /mnt/tmpfs/test +TARGET_MEMCG_USAGE=$(cat $CGROUP_PATH/memory.usage_in_bytes) +expect_ge 4096 "$TARGET_MEMCG_USAGE" "After echo, memcg usage should be at least 4096" +echo "Echo test succeeded" + +echo +echo +tools/testing/selftests/vm/mmap_write -p /mnt/tmpfs/test -s $((1 * 1024 * 1024)) +TARGET_MEMCG_USAGE=$(cat $CGROUP_PATH/memory.usage_in_bytes) +expect_ge $((1 * 1024 * 1024)) "$TARGET_MEMCG_USAGE" "After mmap_write, memcg usage should be at least 1MB" +echo "WRITE TEST SUCCEEDED" + +# SIGBUS the remote container on pagefault. +echo +echo +echo "SIGBUS the process doing the remote charge on hitting the limit of the remote cgroup." +echo "This will take a long time because the kernel goes through reclaim retries," +echo "but the write process should eventually receive a SIGBUS" +set +e +tools/testing/selftests/vm/mmap_write -p /mnt/tmpfs/test -s $((11 * 1024 * 1024)) & +wait $! +expect_equal "$?" 
"135" "mmap_write should have exited with SIGBUS" +set -e + +# ENOSPC the remote container on the non-pagefault path. +echo +echo +echo "OOMing the remote container using cat (non-pagefault)" +echo "This will take a long time because the kernel goes through reclaim retries," +echo "but the cat command should eventually receive an ENOSPC" +cat /dev/random > /mnt/tmpfs/random || true + +cleanup +echo TEST PASSED