Message ID | 20240809212115.59291-1-kaiyang2@cs.cmu.edu (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm,memcg: provide per-cgroup counters for NUMA balancing operations | expand |
On Fri, 9 Aug 2024 21:21:15 +0000 kaiyang2@cs.cmu.edu wrote: > From: Kaiyang Zhao <kaiyang2@cs.cmu.edu> > > The ability to observe the demotion and promotion decisions made by the > kernel on a per-cgroup basis is important for monitoring and tuning > containerized workloads on either NUMA machines or machines > equipped with tiered memory. > > Different containers in the system may experience drastically different > memory tiering actions that cannot be distinguished from the global > counters alone. > > For example, a container running a workload that has a much hotter > memory accesses will likely see more promotions and fewer demotions, > potentially depriving a colocated container of top tier memory to such > an extent that its performance degrades unacceptably. > > For another example, some containers may exhibit longer periods between > data reuse, causing much more numa_hint_faults than numa_pages_migrated. > In this case, tuning hot_threshold_ms may be appropriate, but the signal > can easily be lost if only global counters are available. > > This patch set adds five counters to > memory.stat in a cgroup: numa_pages_migrated, numa_pte_updates, > numa_hint_faults, pgdemote_kswapd and pgdemote_direct. > > count_memcg_events_mm() is added to count multiple event occurrences at > once, and get_mem_cgroup_from_folio() is added because we need to get a > reference to the memcg of a folio before it's migrated to track > numa_pages_migrated. The accounting of PGDEMOTE_* is moved to > shrink_inactive_list() before being changed to per-cgroup. > Thanks. I lack the operational experience to be able to judge the usefulness of this - hopefully others can weigh in. Meanwhile, the patch is simple enough - I'll queue it up for testing.
On Fri, 9 Aug 2024, kaiyang2@cs.cmu.edu wrote: > From: Kaiyang Zhao <kaiyang2@cs.cmu.edu> > > The ability to observe the demotion and promotion decisions made by the > kernel on a per-cgroup basis is important for monitoring and tuning > containerized workloads on either NUMA machines or machines > equipped with tiered memory. > > Different containers in the system may experience drastically different > memory tiering actions that cannot be distinguished from the global > counters alone. > > For example, a container running a workload that has a much hotter > memory accesses will likely see more promotions and fewer demotions, > potentially depriving a colocated container of top tier memory to such > an extent that its performance degrades unacceptably. > > For another example, some containers may exhibit longer periods between > data reuse, causing much more numa_hint_faults than numa_pages_migrated. > In this case, tuning hot_threshold_ms may be appropriate, but the signal > can easily be lost if only global counters are available. > > This patch set adds five counters to > memory.stat in a cgroup: numa_pages_migrated, numa_pte_updates, > numa_hint_faults, pgdemote_kswapd and pgdemote_direct. > > count_memcg_events_mm() is added to count multiple event occurrences at > once, and get_mem_cgroup_from_folio() is added because we need to get a > reference to the memcg of a folio before it's migrated to track > numa_pages_migrated. The accounting of PGDEMOTE_* is moved to > shrink_inactive_list() before being changed to per-cgroup. > > Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu> Hi Kaiyang, have you considered per-memcg control over NUMA balancing operations as well? Wondering if that's the direction that you're heading in, because it would be very useful to be able to control NUMA balancing at memcg granularity on multi-tenant systems. I mentioned this at LSF/MM/BPF this year. If people believe this is out of scope for memcg, that would be good feedback as well.
On Sun, Aug 11, 2024 at 01:16:53PM -0700, David Rientjes wrote: > Hi Kaiyang, have you considered per-memcg control over NUMA balancing > operations as well? > > Wondering if that's the direction that you're heading in, because it would > be very useful to be able to control NUMA balancing at memcg granularity > on multi-tenant systems. > > I mentioned this at LSF/MM/BPF this year. If people believe this is out > of scope for memcg, that would be good feedback as well. Yes that's exactly where we are heading -- per-cgroup control of NUMA balancing operations in the context of memory tiering with CXL memory, by extending the concept of memory.low and memory.high. The use case is enabling a fair share of top tier memory across containers. I'm collaborating with Meta on this, and we already have an implementation and some experiments done. The patches will go out soon. If others have thoughts on this, please chime in. Best, Kaiyang Zhao
On Mon, 12 Aug 2024, Kaiyang Zhao wrote: > On Sun, Aug 11, 2024 at 01:16:53PM -0700, David Rientjes wrote: > > Hi Kaiyang, have you considered per-memcg control over NUMA balancing > > operations as well? > > > > Wondering if that's the direction that you're heading in, because it would > > be very useful to be able to control NUMA balancing at memcg granularity > > on multi-tenant systems. > > > > I mentioned this at LSF/MM/BPF this year. If people believe this is out > > of scope for memcg, that would be good feedback as well. > > Yes that's exactly where we are heading -- per-cgroup control of NUMA > balancing operations in the context of memory tiering with CXL memory, > by extending the concept of memory.low and memory.high. The use case is > enabling a fair share of top tier memory across containers. > Thanks Kaiyang, that will be very useful to test out, looking forward to seeing the patches! Does this include top-tier specific memory limits as well? And is your primary motivation the promotion path through NUMA Balancing or are you also looking at demotion to develop a comprehensive policy for memory placement using these limits? > I'm collaborating with Meta on this, and we already have an implementation > and some experiments done. The patches will go out soon. If others have > thoughts on this, please chime in. > I have lots of thoughts, but not sure if the primary motivation is around promotion only here :)
On Mon, Aug 12, 2024 at 10:09:55PM -0700, David Rientjes wrote: > Does this include top-tier specific memory limits as well? Yes, we plan to have top-tier specific memory protection (like memory.low) and memory limit (like memory.high). Exactly what the interface will look like will probably depends on the community feedback. Maybe it's just adding a memory_top_tier.low/high knob, maybe something else. > And is your primary motivation the promotion path through NUMA Balancing > or are you also looking at demotion to develop a comprehensive policy for > memory placement using these limits? Both promotion and demotion/reclaim path. If we want top-tier memory protection and limit, moderating both promotion and demotion is I think a must for converging on a stable distribution of memory usage on top-tier and slower tier. Kaiyang
On Mon, 12 Aug 2024 22:49:02 +0000 Kaiyang Zhao <kaiyang2@cs.cmu.edu> wrote: > On Sun, Aug 11, 2024 at 01:16:53PM -0700, David Rientjes wrote: > > Hi Kaiyang, have you considered per-memcg control over NUMA balancing > > operations as well? > > > > Wondering if that's the direction that you're heading in, because it would > > be very useful to be able to control NUMA balancing at memcg granularity > > on multi-tenant systems. > > > > I mentioned this at LSF/MM/BPF this year. If people believe this is out > > of scope for memcg, that would be good feedback as well. > > Yes that's exactly where we are heading -- per-cgroup control of NUMA > balancing operations in the context of memory tiering with CXL memory, > by extending the concept of memory.low and memory.high. The use case is > enabling a fair share of top tier memory across containers. > > I'm collaborating with Meta on this, and we already have an implementation > and some experiments done. The patches will go out soon. If others have > thoughts on this, please chime in. I think it's helpful to capture such broad-brush where-this-is-heading info in the changelogs. For reviewers and for posterity. In fact, any reviewer question should be treated as a bug report against the changelog! Questions mean that information was missing. I'll grab v3 now, but please feel free to send along any changelog additions which come to mind.
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 44f7fb7dc0c8..90ecd2dbca06 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -768,6 +768,8 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm); struct mem_cgroup *get_mem_cgroup_from_current(void); +struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio); + struct lruvec *folio_lruvec_lock(struct folio *folio); struct lruvec *folio_lruvec_lock_irq(struct folio *folio); struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio, @@ -1012,8 +1014,8 @@ static inline void count_memcg_folio_events(struct folio *folio, count_memcg_events(memcg, idx, nr); } -static inline void count_memcg_event_mm(struct mm_struct *mm, - enum vm_event_item idx) +static inline void count_memcg_events_mm(struct mm_struct *mm, + enum vm_event_item idx, unsigned long count) { struct mem_cgroup *memcg; @@ -1023,10 +1025,16 @@ static inline void count_memcg_event_mm(struct mm_struct *mm, rcu_read_lock(); memcg = mem_cgroup_from_task(rcu_dereference(mm->owner)); if (likely(memcg)) - count_memcg_events(memcg, idx, 1); + count_memcg_events(memcg, idx, count); rcu_read_unlock(); } +static inline void count_memcg_event_mm(struct mm_struct *mm, + enum vm_event_item idx) +{ + count_memcg_events_mm(mm, idx, 1); +} + static inline void memcg_memory_event(struct mem_cgroup *memcg, enum memcg_memory_event event) { @@ -1246,6 +1254,11 @@ static inline struct mem_cgroup *get_mem_cgroup_from_current(void) return NULL; } +static inline struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio) +{ + return NULL; +} + static inline struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css) { @@ -1468,6 +1481,11 @@ static inline void count_memcg_folio_events(struct folio *folio, { } +static inline void count_memcg_events_mm(struct mm_struct *mm, + enum vm_event_item idx, unsigned long count) +{ +} + static inline void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) { diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 596c050ed492..ff0b49f76ca4 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -32,6 +32,7 @@ struct reclaim_stat { unsigned nr_ref_keep; unsigned nr_unmap_fail; unsigned nr_lazyfree_fail; + unsigned nr_demoted; }; /* Stat data for system wide items */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e1ffd2950393..e8e59d3729c0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -307,6 +307,9 @@ static const unsigned int memcg_node_stat_items[] = { #ifdef CONFIG_SWAP NR_SWAPCACHE, #endif + PGDEMOTE_KSWAPD, + PGDEMOTE_DIRECT, + PGDEMOTE_KHUGEPAGED, }; static const unsigned int memcg_stat_items[] = { @@ -437,6 +440,11 @@ static const unsigned int memcg_vm_event_stat[] = { THP_SWPOUT, THP_SWPOUT_FALLBACK, #endif +#ifdef CONFIG_NUMA_BALANCING + NUMA_PAGE_MIGRATE, + NUMA_PTE_UPDATES, + NUMA_HINT_FAULTS, +#endif }; #define NR_MEMCG_EVENTS ARRAY_SIZE(memcg_vm_event_stat) @@ -978,6 +986,23 @@ struct mem_cgroup *get_mem_cgroup_from_current(void) return memcg; } +/** + * get_mem_cgroup_from_folio - Obtain a reference on a given folio's memcg. + */ +struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio) +{ + struct mem_cgroup *memcg = folio_memcg(folio); + + if (mem_cgroup_disabled()) + return NULL; + + rcu_read_lock(); + if (!memcg || WARN_ON_ONCE(!css_tryget(&memcg->css))) + memcg = root_mem_cgroup; + rcu_read_unlock(); + return memcg; +} + /** * mem_cgroup_iter - iterate over memory cgroup hierarchy * @root: hierarchy root @@ -1383,6 +1408,10 @@ static const struct memory_stat memory_stats[] = { { "workingset_restore_anon", WORKINGSET_RESTORE_ANON }, { "workingset_restore_file", WORKINGSET_RESTORE_FILE }, { "workingset_nodereclaim", WORKINGSET_NODERECLAIM }, + + { "pgdemote_kswapd", PGDEMOTE_KSWAPD }, + { "pgdemote_direct", PGDEMOTE_DIRECT }, + { "pgdemote_khugepaged", PGDEMOTE_KHUGEPAGED }, }; /* The actual unit of the state item, not the same as the output unit */ @@ -1416,6 +1445,9 @@ static int memcg_page_state_output_unit(int item) case WORKINGSET_RESTORE_ANON: case WORKINGSET_RESTORE_FILE: case WORKINGSET_NODERECLAIM: + case PGDEMOTE_KSWAPD: + case PGDEMOTE_DIRECT: + case PGDEMOTE_KHUGEPAGED: return 1; default: return memcg_page_state_unit(item); diff --git a/mm/memory.c b/mm/memory.c index d6af095d255b..7d69c0287b24 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5373,6 +5373,7 @@ int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf, vma_set_access_pid_bit(vma); count_vm_numa_event(NUMA_HINT_FAULTS); + count_memcg_folio_events(folio, NUMA_HINT_FAULTS, 1); if (page_nid == numa_node_id()) { count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL); *flags |= TNF_FAULT_LOCAL; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index b3b5f376471f..b646fab3e45e 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -676,8 +676,10 @@ unsigned long change_prot_numa(struct vm_area_struct *vma, tlb_gather_mmu(&tlb, vma->vm_mm); nr_updated = change_protection(&tlb, vma, addr, end, MM_CP_PROT_NUMA); - if (nr_updated > 0) + if (nr_updated > 0) { count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated); + count_memcg_events_mm(vma->vm_mm, NUMA_PTE_UPDATES, nr_updated); + } tlb_finish_mmu(&tlb); diff --git a/mm/migrate.c b/mm/migrate.c index 66a5f73ebfdf..7e1267042a56 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2614,6 +2614,7 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma, int nr_remaining; unsigned int nr_succeeded; LIST_HEAD(migratepages); + struct mem_cgroup *memcg = get_mem_cgroup_from_folio(folio); list_add(&folio->lru, &migratepages); nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_folio, @@ -2623,12 +2624,14 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma, putback_movable_pages(&migratepages); if (nr_succeeded) { count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded); + count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded); if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) && !node_is_toptier(folio_nid(folio)) && node_is_toptier(node)) mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, nr_succeeded); } + mem_cgroup_put(memcg); BUG_ON(!list_empty(&migratepages)); return nr_remaining ? -EAGAIN : 0; } diff --git a/mm/vmscan.c b/mm/vmscan.c index 25e43bb3b574..fd66789a413b 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1008,9 +1008,6 @@ static unsigned int demote_folio_list(struct list_head *demote_folios, (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, &nr_succeeded); - mod_node_page_state(pgdat, PGDEMOTE_KSWAPD + reclaimer_offset(), - nr_succeeded); - return nr_succeeded; } @@ -1518,7 +1515,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, /* 'folio_list' is always empty here */ /* Migrate folios selected for demotion */ - nr_reclaimed += demote_folio_list(&demote_folios, pgdat); + stat->nr_demoted = demote_folio_list(&demote_folios, pgdat); + nr_reclaimed += stat->nr_demoted; /* Folios that could not be demoted are still in @demote_folios */ if (!list_empty(&demote_folios)) { /* Folios which weren't demoted go back on @folio_list */ @@ -1984,6 +1982,8 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, spin_lock_irq(&lruvec->lru_lock); move_folios_to_lru(lruvec, &folio_list); + __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(), + stat.nr_demoted); __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); item = PGSTEAL_KSWAPD + reclaimer_offset(); if (!cgroup_reclaim(sc))