From patchwork Tue Dec 6 02:34:05 2022
X-Patchwork-Submitter: Mina Almasry
X-Patchwork-Id: 13065334
Date: Mon, 5 Dec 2022 18:34:05 -0800
Message-ID: <20221206023406.3182800-1-almasrymina@google.com>
Subject: [PATCH v3] [mm-unstable] mm: Fix memcg reclaim on memory tiered systems
From: Mina Almasry <almasrymina@google.com>
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Huang Ying, Yang Shi, Yosry Ahmed, weixugc@google.com,
	fvdl@google.com, Mina Almasry, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org

commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim"")
enabled demotion in memcg reclaim, which is the right thing to do. However,
it introduced a regression in the behavior of
try_to_free_mem_cgroup_pages().

The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
reclaim - not demote - nr_pages from the cgroup, i.e. the memory usage of
the cgroup should drop by nr_pages. The callers also expect
try_to_free_mem_cgroup_pages() to return the number of pages reclaimed, not
demoted.

However, try_to_free_mem_cgroup_pages() unconditionally counts demoted
pages as reclaimed pages. So in practice it will often demote nr_pages and
return the number of demoted pages to the caller. Demoted pages don't lower
the memcg usage, so try_to_free_mem_cgroup_pages() is not actually doing
what its callers want. Due to this, various things work suboptimally on
memory tiered systems, or don't work at all:

- memory.high enforcement likely doesn't work (it just demotes nr_pages
  instead of lowering the memcg usage by nr_pages).
- try_charge_memcg() will keep retrying the charge while
  try_to_free_mem_cgroup_pages() is just demoting pages and not actually
  making any room for the charge.
- memory.reclaim has a wonky interface: it advertises to the user that it
  reclaims the provided amount, but it will often demote that amount
  instead.

There may be more side effects of this issue.

To fix these issues, I propose changing shrink_folio_list() to only count
pages demoted from inside sc->nodemask to outside sc->nodemask as
'reclaimed' (this rule is restated compactly in the sketch just before the
diff below). For callers such as reclaim_high() or try_charge_memcg() that
set sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to
actually reclaim nr_pages and return the number of pages reclaimed; no
demoted pages count towards the nr_pages requirement. For callers such as
memory_reclaim() that do set sc->nodemask, try_to_free_mem_cgroup_pages()
will free nr_pages from that nodemask with either reclaim or demotion.

I tested this change using the memory.reclaim interface. I set up a test
case that allocates 500m in a cgroup (a minimal allocator for this test is
sketched just before the diff below), and then does:

  echo "50m" > memory.reclaim

Without this fix, my kernel demotes 70mb and reclaims 4mb (memory.current
is reduced by about 4mb). With this fix, my kernel demotes all memory
possible and reclaims 60mb (memory.current is reduced by about 60mb).

Signed-off-by: Mina Almasry <almasrymina@google.com>

---

Changes in v3:
- Reverted back to the v1 implementation ("try to demote but don't count
  demoted pages unless they are demoted to outside the nodemask") as Ying
  suggested.
- Made sure demotions that fall back to a node other than
  next_demotion_target() are not counted, as Wei suggested.
- Updated the comment in shrink_folio_list() as Ying suggested.
- Added before/after numbers for the test case to the commit message, as
  Ying asked.
- Fixed call sites that don't provide sc->nodemask but expect demotion
  from a specific node, as Ying pointed out.

Cc: weixugc@google.com
Cc: ying.huang@intel.com

This is developed on top of mm-unstable, largely because I want the
solution to be compatible with the recently added nodes= arg on
mm-unstable.
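For readers who want the accounting rule without digging through the diff,
here is a compact restatement as an illustrative userspace sketch. The
function name and the plain uint64_t standing in for nodemask_t are mine,
not kernel code. In the kernel, the source node is inside the mask by
construction (demotion candidates come from the pgdat being reclaimed), so
only the target node needs checking:

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: a demoted page counts toward nr_reclaimed iff the
 * caller supplied a nodemask and the demotion target lies outside it.
 */
static bool demotion_counts_as_reclaim(int target_nid,
				       const uint64_t *nodemask)
{
	/*
	 * No nodemask (e.g. reclaim_high() or try_charge_memcg()):
	 * demotion never satisfies the caller's nr_pages target.
	 */
	if (!nodemask)
		return false;

	/*
	 * Demoted to a node outside the mask: usage on the caller's
	 * nodes actually dropped, so count the page as reclaimed.
	 */
	return !((*nodemask >> target_nid) & 1);
}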
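And here is a minimal allocator for the 500m test case above, as a sketch
rather than the exact program I used: the cgroup path in the comment and
the use of pause() to keep the memory resident are assumptions for
illustration.

/*
 * Run from inside the test cgroup, e.g.:
 *   mkdir /sys/fs/cgroup/test
 *   echo $$ > /sys/fs/cgroup/test/cgroup.procs
 *   ./alloc500m &
 *   echo "50m" > /sys/fs/cgroup/test/memory.reclaim
 *   cat /sys/fs/cgroup/test/memory.current
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	size_t sz = 500UL << 20;	/* 500m, as in the test above */
	unsigned char *buf = malloc(sz);

	if (!buf) {
		perror("malloc");
		return 1;
	}

	/* Touch every byte so all 500m is resident, not just mapped. */
	memset(buf, 1, sz);

	/* Park here while memory.reclaim is exercised on the cgroup. */
	pause();
	free(buf);
	return 0;
}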
---
 mm/vmscan.c | 91 +++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 78 insertions(+), 13 deletions(-)

--
2.39.0.rc0.267.gcb52ba06e7-goog

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2b42ac9ad755..f324e80395c3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1555,13 +1555,18 @@ static void folio_check_dirty_writeback(struct folio *folio,
 		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
 }
 
+struct demotion_control {
+	struct migration_target_control *mtc;
+	nodemask_t *demote_from_nodemask;
+	unsigned long nr_demoted_outside_nodemask;
+};
+
 static struct page *alloc_demote_page(struct page *page, unsigned long private)
 {
 	struct page *target_page;
 	nodemask_t *allowed_mask;
-	struct migration_target_control *mtc;
-
-	mtc = (struct migration_target_control *)private;
+	struct demotion_control *dc = (struct demotion_control *)private;
+	struct migration_target_control *mtc = dc->mtc;
 
 	allowed_mask = mtc->nmask;
 	/*
@@ -1576,13 +1581,31 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
 	mtc->nmask = NULL;
 	mtc->gfp_mask |= __GFP_THISNODE;
 	target_page = alloc_migration_target(page, (unsigned long)mtc);
-	if (target_page)
+	if (!target_page) {
+		mtc->gfp_mask &= ~__GFP_THISNODE;
+		mtc->nmask = allowed_mask;
+		target_page = alloc_migration_target(page, (unsigned long)mtc);
+	}
+
+	if (!target_page)
 		return target_page;
 
-	mtc->gfp_mask &= ~__GFP_THISNODE;
-	mtc->nmask = allowed_mask;
+	if (dc->demote_from_nodemask &&
+	    !node_isset(page_to_nid(target_page), *dc->demote_from_nodemask))
+		dc->nr_demoted_outside_nodemask++;
 
-	return alloc_migration_target(page, (unsigned long)mtc);
+	return target_page;
+}
+
+void free_demote_page(struct page *page, unsigned long private)
+{
+	struct demotion_control *dc = (struct demotion_control *)private;
+
+	if (dc->demote_from_nodemask &&
+	    !node_isset(page_to_nid(page), *dc->demote_from_nodemask))
+		dc->nr_demoted_outside_nodemask--;
+
+	folio_put(page_folio(page));
 }
 
 /*
@@ -1590,7 +1613,8 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
  * Folios which are not demoted are left on @demote_folios.
  */
 static unsigned int demote_folio_list(struct list_head *demote_folios,
-				      struct pglist_data *pgdat)
+				      struct pglist_data *pgdat,
+				      nodemask_t *nodemask)
 {
 	int target_nid = next_demotion_node(pgdat->node_id);
 	unsigned int nr_succeeded;
@@ -1608,6 +1632,12 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 		.nmask = &allowed_mask
 	};
 
+	struct demotion_control dc = {
+		.mtc = &mtc,
+		.demote_from_nodemask = nodemask,
+		.nr_demoted_outside_nodemask = 0,
+	};
+
 	if (list_empty(demote_folios))
 		return 0;
 
@@ -1617,13 +1647,13 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 	node_get_allowed_targets(pgdat, &allowed_mask);
 
 	/* Demotion ignores all cpuset and mempolicy settings */
-	migrate_pages(demote_folios, alloc_demote_page, NULL,
-		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
+	migrate_pages(demote_folios, alloc_demote_page, free_demote_page,
+		      (unsigned long)&dc, MIGRATE_ASYNC, MR_DEMOTION,
 		      &nr_succeeded);
 
 	__count_vm_events(PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
 
-	return nr_succeeded;
+	return dc.nr_demoted_outside_nodemask;
 }
 
 static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask)
@@ -1643,7 +1673,12 @@ static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask)
 }
 
 /*
- * shrink_folio_list() returns the number of reclaimed pages
+ * shrink_folio_list() returns the number of reclaimed pages.
+ *
+ * Demoted pages are counted as reclaimed iff:
+ * (a) sc->nodemask arg is provided.
+ * (b) page has been demoted from a node inside sc->nodemask to a node
+ * outside sc->nodemask.
 */
 static unsigned int shrink_folio_list(struct list_head *folio_list,
 		struct pglist_data *pgdat, struct scan_control *sc,
@@ -1653,6 +1688,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	LIST_HEAD(free_folios);
 	LIST_HEAD(demote_folios);
 	unsigned int nr_reclaimed = 0;
+	unsigned int nr_demoted_outside_nodemask = 0;
 	unsigned int pgactivate = 0;
 	bool do_demote_pass;
 	struct swap_iocb *plug = NULL;
@@ -2085,7 +2121,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	/* 'folio_list' is always empty here */
 
 	/* Migrate folios selected for demotion */
-	nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
+	nr_demoted_outside_nodemask =
+		demote_folio_list(&demote_folios, pgdat, sc->nodemask);
+
+	if (sc->nodemask)
+		nr_reclaimed += nr_demoted_outside_nodemask;
+
 	/* Folios that could not be demoted are still in @demote_folios */
 	if (!list_empty(&demote_folios)) {
 		/* Folios which weren't demoted go back on @folio_list */
@@ -2130,9 +2171,11 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 					   struct list_head *folio_list)
 {
+	nodemask_t nodemask = NODE_MASK_NONE;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.may_unmap = 1,
+		.nodemask = &nodemask
 	};
 	struct reclaim_stat stat;
 	unsigned int nr_reclaimed;
@@ -2140,6 +2183,12 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 	LIST_HEAD(clean_folios);
 	unsigned int noreclaim_flag;
 
+	/*
+	 * Set the nodemask in sc to indicate to shrink_folio_list() that we're
+	 * looking for reclaim from this node.
+	 */
+	node_set(zone->zone_pgdat->node_id, nodemask);
+
 	list_for_each_entry_safe(folio, next, folio_list, lru) {
 		if (!folio_test_hugetlb(folio) && folio_is_file_lru(folio) &&
 		    !folio_test_dirty(folio) && !__folio_test_movable(folio) &&
@@ -7031,12 +7080,20 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
 	unsigned long zone_boosts[MAX_NR_ZONES] = { 0, };
 	bool boosted;
 	struct zone *zone;
+	nodemask_t nodemask = NODE_MASK_NONE;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.order = order,
 		.may_unmap = 1,
+		.nodemask = &nodemask,
 	};
 
+	/*
+	 * Set the nodemask in sc to indicate to kswapd_shrink_node() that we're
+	 * looking for reclaim from this node.
+	 */
+	node_set(pgdat->node_id, nodemask);
+
 	set_task_reclaim_state(current, &sc.reclaim_state);
 	psi_memstall_enter(&pflags);
 	__fs_reclaim_acquire(_THIS_IP_);
@@ -7642,6 +7699,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	const unsigned long nr_pages = 1 << order;
 	struct task_struct *p = current;
 	unsigned int noreclaim_flag;
+	nodemask_t nodemask = NODE_MASK_NONE;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
 		.gfp_mask = current_gfp_context(gfp_mask),
@@ -7651,9 +7709,16 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
 		.may_swap = 1,
 		.reclaim_idx = gfp_zone(gfp_mask),
+		.nodemask = &nodemask,
 	};
 	unsigned long pflags;
 
+	/*
+	 * Set the nodemask in sc to indicate to shrink_node() that we're
+	 * looking for reclaim from this node.
+	 */
+	node_set(pgdat->node_id, nodemask);
+
 	trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order,
 					   sc.gfp_mask);