From patchwork Fri Aug 9 21:21:15 2024
X-Patchwork-Submitter: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
X-Patchwork-Id: 13759271
From: kaiyang2@cs.cmu.edu
To: linux-mm@kvack.org, cgroups@vger.kernel.org
Cc: roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev,
    akpm@linux-foundation.org, mhocko@kernel.org, nehagholkar@meta.com,
    abhishekd@meta.com, hannes@cmpxchg.org, Kaiyang Zhao <kaiyang2@cs.cmu.edu>
Subject: [PATCH] mm,memcg: provide per-cgroup counters for NUMA balancing operations
Date: Fri, 9 Aug 2024 21:21:15 +0000
Message-ID: <20240809212115.59291-1-kaiyang2@cs.cmu.edu>
X-Mailer: git-send-email 2.43.0

From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>

The ability to observe the demotion and promotion decisions made by the
kernel on a per-cgroup basis is important for monitoring and tuning
containerized workloads on NUMA machines or machines equipped with
tiered memory. Different containers in the system may experience
drastically different memory tiering actions that cannot be
distinguished from the global counters alone.

For example, a container running a workload with much hotter memory
accesses will likely see more promotions and fewer demotions,
potentially depriving a colocated container of top tier memory to such
an extent that its performance degrades unacceptably. For another
example, some containers may exhibit longer periods between data reuse,
causing many more numa_hint_faults than numa_pages_migrated. In this
case, tuning hot_threshold_ms may be appropriate, but the signal can
easily be lost if only global counters are available.

This patch adds six counters to memory.stat in a cgroup:
numa_pages_migrated, numa_pte_updates, numa_hint_faults,
pgdemote_kswapd, pgdemote_direct and pgdemote_khugepaged.

count_memcg_events_mm() is added to count multiple event occurrences at
once, and get_mem_cgroup_from_folio() is added because we need to get a
reference to the memcg of a folio before it's migrated to track
numa_pages_migrated. The accounting of PGDEMOTE_* is moved from
demote_folio_list() to shrink_inactive_list(), where it is converted to
per-cgroup (lruvec) accounting.
Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
---
 include/linux/memcontrol.h | 24 +++++++++++++++++++++---
 include/linux/vmstat.h     |  1 +
 mm/memcontrol.c            | 32 ++++++++++++++++++++++++++++++++
 mm/memory.c                |  1 +
 mm/mempolicy.c             |  4 +++-
 mm/migrate.c               |  3 +++
 mm/vmscan.c                |  8 ++++----
 7 files changed, 65 insertions(+), 8 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 44f7fb7dc0c8..90ecd2dbca06 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -768,6 +768,8 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
 
 struct mem_cgroup *get_mem_cgroup_from_current(void);
 
+struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio);
+
 struct lruvec *folio_lruvec_lock(struct folio *folio);
 struct lruvec *folio_lruvec_lock_irq(struct folio *folio);
 struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
@@ -1012,8 +1014,8 @@ static inline void count_memcg_folio_events(struct folio *folio,
 	count_memcg_events(memcg, idx, nr);
 }
 
-static inline void count_memcg_event_mm(struct mm_struct *mm,
-					enum vm_event_item idx)
+static inline void count_memcg_events_mm(struct mm_struct *mm,
+					 enum vm_event_item idx, unsigned long count)
 {
 	struct mem_cgroup *memcg;
 
@@ -1023,10 +1025,16 @@ static inline void count_memcg_event_mm(struct mm_struct *mm,
 	rcu_read_lock();
 	memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
 	if (likely(memcg))
-		count_memcg_events(memcg, idx, 1);
+		count_memcg_events(memcg, idx, count);
 	rcu_read_unlock();
 }
 
+static inline void count_memcg_event_mm(struct mm_struct *mm,
+					enum vm_event_item idx)
+{
+	count_memcg_events_mm(mm, idx, 1);
+}
+
 static inline void memcg_memory_event(struct mem_cgroup *memcg,
 				      enum memcg_memory_event event)
 {
@@ -1246,6 +1254,11 @@ static inline struct mem_cgroup *get_mem_cgroup_from_current(void)
 	return NULL;
 }
 
+static inline struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
+{
+	return NULL;
+}
+
 static inline struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
 {
@@ -1468,6 +1481,11 @@ static inline void count_memcg_folio_events(struct folio *folio,
 {
 }
 
+static inline void count_memcg_events_mm(struct mm_struct *mm,
+					 enum vm_event_item idx, unsigned long count)
+{
+}
+
 static inline void count_memcg_event_mm(struct mm_struct *mm,
 					enum vm_event_item idx)
 {
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 596c050ed492..ff0b49f76ca4 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -32,6 +32,7 @@ struct reclaim_stat {
 	unsigned nr_ref_keep;
 	unsigned nr_unmap_fail;
 	unsigned nr_lazyfree_fail;
+	unsigned nr_demoted;
 };
 
 /* Stat data for system wide items */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e1ffd2950393..e8e59d3729c0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -307,6 +307,9 @@ static const unsigned int memcg_node_stat_items[] = {
 #ifdef CONFIG_SWAP
 	NR_SWAPCACHE,
 #endif
+	PGDEMOTE_KSWAPD,
+	PGDEMOTE_DIRECT,
+	PGDEMOTE_KHUGEPAGED,
 };
 
 static const unsigned int memcg_stat_items[] = {
@@ -437,6 +440,11 @@ static const unsigned int memcg_vm_event_stat[] = {
 	THP_SWPOUT,
 	THP_SWPOUT_FALLBACK,
 #endif
+#ifdef CONFIG_NUMA_BALANCING
+	NUMA_PAGE_MIGRATE,
+	NUMA_PTE_UPDATES,
+	NUMA_HINT_FAULTS,
+#endif
 };
 
 #define NR_MEMCG_EVENTS ARRAY_SIZE(memcg_vm_event_stat)
@@ -978,6 +986,23 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
 	return memcg;
 }
 
+/**
+ * get_mem_cgroup_from_folio - Obtain a reference on a given folio's memcg.
+ */
+struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
+{
+	struct mem_cgroup *memcg = folio_memcg(folio);
+
+	if (mem_cgroup_disabled())
+		return NULL;
+
+	rcu_read_lock();
+	if (!memcg || WARN_ON_ONCE(!css_tryget(&memcg->css)))
+		memcg = root_mem_cgroup;
+	rcu_read_unlock();
+	return memcg;
+}
+
 /**
  * mem_cgroup_iter - iterate over memory cgroup hierarchy
  * @root: hierarchy root
@@ -1383,6 +1408,10 @@ static const struct memory_stat memory_stats[] = {
 	{ "workingset_restore_anon",	WORKINGSET_RESTORE_ANON },
 	{ "workingset_restore_file",	WORKINGSET_RESTORE_FILE },
 	{ "workingset_nodereclaim",	WORKINGSET_NODERECLAIM },
+
+	{ "pgdemote_kswapd",		PGDEMOTE_KSWAPD },
+	{ "pgdemote_direct",		PGDEMOTE_DIRECT },
+	{ "pgdemote_khugepaged",	PGDEMOTE_KHUGEPAGED },
 };
 
 /* The actual unit of the state item, not the same as the output unit */
@@ -1416,6 +1445,9 @@ static int memcg_page_state_output_unit(int item)
 	case WORKINGSET_RESTORE_ANON:
 	case WORKINGSET_RESTORE_FILE:
 	case WORKINGSET_NODERECLAIM:
+	case PGDEMOTE_KSWAPD:
+	case PGDEMOTE_DIRECT:
+	case PGDEMOTE_KHUGEPAGED:
 		return 1;
 	default:
 		return memcg_page_state_unit(item);
diff --git a/mm/memory.c b/mm/memory.c
index d6af095d255b..7d69c0287b24 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5373,6 +5373,7 @@ int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
 		vma_set_access_pid_bit(vma);
 
 	count_vm_numa_event(NUMA_HINT_FAULTS);
+	count_memcg_folio_events(folio, NUMA_HINT_FAULTS, 1);
 	if (page_nid == numa_node_id()) {
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 		*flags |= TNF_FAULT_LOCAL;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b3b5f376471f..b646fab3e45e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -676,8 +676,10 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 	tlb_gather_mmu(&tlb, vma->vm_mm);
 
 	nr_updated = change_protection(&tlb, vma, addr, end, MM_CP_PROT_NUMA);
-	if (nr_updated > 0)
+	if (nr_updated > 0) {
 		count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
+		count_memcg_events_mm(vma->vm_mm, NUMA_PTE_UPDATES, nr_updated);
+	}
 
 	tlb_finish_mmu(&tlb);
diff --git a/mm/migrate.c b/mm/migrate.c
index 66a5f73ebfdf..7e1267042a56 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2614,6 +2614,7 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
 	int nr_remaining;
 	unsigned int nr_succeeded;
 	LIST_HEAD(migratepages);
+	struct mem_cgroup *memcg = get_mem_cgroup_from_folio(folio);
 
 	list_add(&folio->lru, &migratepages);
 	nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_folio,
@@ -2623,12 +2624,14 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
 		putback_movable_pages(&migratepages);
 	if (nr_succeeded) {
 		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
+		count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
 		if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
 		    !node_is_toptier(folio_nid(folio)) && node_is_toptier(node))
 			mod_node_page_state(pgdat, PGPROMOTE_SUCCESS,
 					    nr_succeeded);
 	}
+	mem_cgroup_put(memcg);
 	BUG_ON(!list_empty(&migratepages));
 	return nr_remaining ? -EAGAIN : 0;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 25e43bb3b574..fd66789a413b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1008,9 +1008,6 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
 		      &nr_succeeded);
 
-	mod_node_page_state(pgdat, PGDEMOTE_KSWAPD + reclaimer_offset(),
-			    nr_succeeded);
-
 	return nr_succeeded;
 }
 
@@ -1518,7 +1515,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	/* 'folio_list' is always empty here */
 
 	/* Migrate folios selected for demotion */
-	nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
+	stat->nr_demoted = demote_folio_list(&demote_folios, pgdat);
+	nr_reclaimed += stat->nr_demoted;
 	/* Folios that could not be demoted are still in @demote_folios */
 	if (!list_empty(&demote_folios)) {
 		/* Folios which weren't demoted go back on @folio_list */
@@ -1984,6 +1982,8 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	spin_lock_irq(&lruvec->lru_lock);
 	move_folios_to_lru(lruvec, &folio_list);
 
+	__mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(),
+			   stat.nr_demoted);
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 	item = PGSTEAL_KSWAPD + reclaimer_offset();
 	if (!cgroup_reclaim(sc))
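
With the patch applied, the new events appear as ordinary key/value
lines in each cgroup's memory.stat. The following stand-alone user-space
sketch (not part of the patch) reads them back; it assumes a cgroup v2
hierarchy mounted at /sys/fs/cgroup and a hypothetical "workload"
cgroup:

	/*
	 * Sketch: print the per-cgroup NUMA balancing and demotion
	 * counters exposed by this patch. The mount point and group
	 * path are assumptions for illustration.
	 */
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		static const char *keys[] = {
			"numa_pages_migrated", "numa_pte_updates",
			"numa_hint_faults", "pgdemote_kswapd",
			"pgdemote_direct", "pgdemote_khugepaged",
		};
		char line[256];
		FILE *f = fopen("/sys/fs/cgroup/workload/memory.stat", "r");

		if (!f) {
			perror("memory.stat");
			return 1;
		}
		/* memory.stat is flat-keyed: one "<name> <value>" per line */
		while (fgets(line, sizeof(line), f)) {
			for (size_t i = 0; i < sizeof(keys) / sizeof(keys[0]); i++) {
				size_t n = strlen(keys[i]);

				if (!strncmp(line, keys[i], n) && line[n] == ' ')
					fputs(line, stdout);
			}
		}
		fclose(f);
		return 0;
	}

Sampling these periodically per cgroup would surface the situations the
commit message describes, e.g. numa_hint_faults growing much faster
than numa_pages_migrated in one group is the per-cgroup
hot_threshold_ms tuning signal that global counters can hide.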