Message ID | 20221206023406.3182800-1-almasrymina@google.com (mailing list archive)
---|---
State | New
Series | [v3,mm-unstable] mm: Fix memcg reclaim on memory tiered systems
Mina Almasry <almasrymina@google.com> writes: > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > reclaim"") enabled demotion in memcg reclaim, which is the right thing > to do, however, it introduced a regression in the behavior of > try_to_free_mem_cgroup_pages(). > > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage > of the cgroup should reduce by nr_pages. The callers expect > try_to_free_mem_cgroup_pages() to also return the number of pages > reclaimed, not demoted. > > However, what try_to_free_mem_cgroup_pages() actually does is it > unconditionally counts demoted pages as reclaimed pages. So in practice > when it is called it will often demote nr_pages and return the number of > demoted pages to the caller. Demoted pages don't lower the memcg usage, > and so try_to_free_mem_cgroup_pages() is not actually doing what the > callers want it to do. > > Various things work suboptimally on memory tiered systems or don't work > at all due to this: > > - memory.high enforcement likely doesn't work (it just demotes nr_pages > instead of lowering the memcg usage by nr_pages). > - try_charge_memcg() will keep retrying the charge while > try_to_free_mem_cgroup_pages() is just demoting pages and not actually > making any room for the charge. > - memory.reclaim has a wonky interface. It advertises to the user it > reclaims the provided amount but it will actually often demote that > amount. > > There may be more effects to this issue. > > To fix these issues I propose that shrink_folio_list() only count pages > demoted from inside of sc->nodemask to outside of sc->nodemask as > 'reclaimed'. > > For callers such as reclaim_high() or try_charge_memcg() that set > sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to > actually reclaim nr_pages and return the number of pages reclaimed. No > demoted pages would count towards the nr_pages requirement. > > For callers such as memory_reclaim() that set sc->nodemask, > try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask > with either reclaim or demotion. > > Tested this change using the memory.reclaim interface. I set up a test case where > I allocate 500m in a cgroup, and then do: > > echo "50m" > memory.reclaim > > Without this fix, my kernel demotes 70mb and reclaims 4mb > (memory.current is reduced by about 4mb). > > With this fix, my kernel demotes all memory possible and reclaims 60mb > (memory.current is reduced by about 60mb). > > Signed-off-by: Mina Almasry <almasrymina@google.com> > > --- > > Changes in v3: > - Reverted back to v1 implementation: "try to demote but don't count > demoted pages unless they are demoted to outside the nodemask" as Ying > suggested. > - Made sure demotions that fall back to non next_demotion_target() are > not counted as Wei suggested. > - Updated comment in shrink_folio_list() as Ying suggested. > - Added before/after for the test case in commit message since Ying > asked. > - Fixed call sites that don't provide sc->nodemask but expect demotion > from a specific node as Ying pointed out. > > Cc: weixugc@google.com > Cc: ying.huang@intel.com > > This is developed on top of mm-unstable largely because I want the > solution to be compatible with the recently added nodes= arg on > mm-unstable.
> --- > mm/vmscan.c | 91 +++++++++++++++++++++++++++++++++++++++++++++-------- > 1 file changed, 78 insertions(+), 13 deletions(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 2b42ac9ad755..f324e80395c3 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1555,13 +1555,18 @@ static void folio_check_dirty_writeback(struct folio *folio, > mapping->a_ops->is_dirty_writeback(folio, dirty, writeback); > } > > +struct demotion_control { > + struct migration_target_control *mtc; > + nodemask_t *demote_from_nodemask; > + unsigned long nr_demoted_outside_nodemask; > +}; > + > static struct page *alloc_demote_page(struct page *page, unsigned long private) > { > struct page *target_page; > nodemask_t *allowed_mask; > - struct migration_target_control *mtc; > - > - mtc = (struct migration_target_control *)private; > + struct demotion_control *dc = (struct demotion_control *)private; > + struct migration_target_control *mtc = dc->mtc; > > allowed_mask = mtc->nmask; > /* > @@ -1576,13 +1581,31 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private) > mtc->nmask = NULL; > mtc->gfp_mask |= __GFP_THISNODE; > target_page = alloc_migration_target(page, (unsigned long)mtc); > - if (target_page) > + if (!target_page) { > + mtc->gfp_mask &= ~__GFP_THISNODE; > + mtc->nmask = allowed_mask; > + target_page = alloc_migration_target(page, (unsigned long)mtc); > + } > + > + if (!target_page) > return target_page; > > - mtc->gfp_mask &= ~__GFP_THISNODE; > - mtc->nmask = allowed_mask; > + if (dc->demote_from_nodemask && > + !node_isset(page_to_nid(target_page), *dc->demote_from_nodemask)) Use mtc->nid directly? > + dc->nr_demoted_outside_nodemask++; > > - return alloc_migration_target(page, (unsigned long)mtc); > + return target_page; > +} > + > +void free_demote_page(struct page *page, unsigned long private) > +{ > + struct demotion_control *dc = (struct demotion_control *)private; > + > + if (dc->demote_from_nodemask && > + !node_isset(page_to_nid(page), *dc->demote_from_nodemask)) ditto > + dc->nr_demoted_outside_nodemask--; > + > + folio_put(page_folio(page)); > } > > /* > @@ -1590,7 +1613,8 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private) > * Folios which are not demoted are left on @demote_folios. 
> */ > static unsigned int demote_folio_list(struct list_head *demote_folios, > - struct pglist_data *pgdat) > + struct pglist_data *pgdat, > + nodemask_t *nodemask) > { > int target_nid = next_demotion_node(pgdat->node_id); > unsigned int nr_succeeded; > @@ -1608,6 +1632,12 @@ static unsigned int demote_folio_list(struct list_head *demote_folios, > .nmask = &allowed_mask > }; > > + struct demotion_control dc = { > + .mtc = &mtc, > + .demote_from_nodemask = nodemask, > + .nr_demoted_outside_nodemask = 0, > + }; > + > if (list_empty(demote_folios)) > return 0; > > @@ -1617,13 +1647,13 @@ static unsigned int demote_folio_list(struct list_head *demote_folios, > node_get_allowed_targets(pgdat, &allowed_mask); > > /* Demotion ignores all cpuset and mempolicy settings */ > - migrate_pages(demote_folios, alloc_demote_page, NULL, > - (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, > + migrate_pages(demote_folios, alloc_demote_page, free_demote_page, > + (unsigned long)&dc, MIGRATE_ASYNC, MR_DEMOTION, > &nr_succeeded); > > __count_vm_events(PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded); > > - return nr_succeeded; > + return dc.nr_demoted_outside_nodemask; > } > > static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask) > @@ -1643,7 +1673,12 @@ static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask) > } > > /* > - * shrink_folio_list() returns the number of reclaimed pages > + * shrink_folio_list() returns the number of reclaimed pages. > + * > + * Demoted pages are counted as reclaimed iff: > + * (a) sc->nodemask arg is provided. > + * (b) page has been demoted from a node inside sc->nodemask to a node > + * outside sc->nodemask. > */ > static unsigned int shrink_folio_list(struct list_head *folio_list, > struct pglist_data *pgdat, struct scan_control *sc, > @@ -1653,6 +1688,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > LIST_HEAD(free_folios); > LIST_HEAD(demote_folios); > unsigned int nr_reclaimed = 0; > + unsigned int nr_demoted_outside_nodemask = 0; > unsigned int pgactivate = 0; > bool do_demote_pass; > struct swap_iocb *plug = NULL; > @@ -2085,7 +2121,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > /* 'folio_list' is always empty here */ > > /* Migrate folios selected for demotion */ > - nr_reclaimed += demote_folio_list(&demote_folios, pgdat); > + nr_demoted_outside_nodemask = > + demote_folio_list(&demote_folios, pgdat, sc->nodemask); > + > + if (sc->nodemask) > + nr_reclaimed += nr_demoted_outside_nodemask; > + > /* Folios that could not be demoted are still in @demote_folios */ > if (!list_empty(&demote_folios)) { > /* Folios which weren't demoted go back on @folio_list */ > @@ -2130,9 +2171,11 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > unsigned int reclaim_clean_pages_from_list(struct zone *zone, > struct list_head *folio_list) > { > + nodemask_t nodemask = NODE_MASK_NONE; Is it necessary for us to use NODEMASK_ALLOC/NODEMASK_FREE to save stack space? Best Regards, Huang, Ying > struct scan_control sc = { > .gfp_mask = GFP_KERNEL, > .may_unmap = 1, > + .nodemask = &nodemask > }; > struct reclaim_stat stat; > unsigned int nr_reclaimed; > @@ -2140,6 +2183,12 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone, > LIST_HEAD(clean_folios); > unsigned int noreclaim_flag; > > + /* > + * Set the nodemask in sc to indicate to shrink_folio_list() that we're > + * looking for reclaim from this node. 
> + */ > + node_set(zone->zone_pgdat->node_id, nodemask); > + > list_for_each_entry_safe(folio, next, folio_list, lru) { > if (!folio_test_hugetlb(folio) && folio_is_file_lru(folio) && > !folio_test_dirty(folio) && !__folio_test_movable(folio) && > @@ -7031,12 +7080,20 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) > unsigned long zone_boosts[MAX_NR_ZONES] = { 0, }; > bool boosted; > struct zone *zone; > + nodemask_t nodemask = NODE_MASK_NONE; > struct scan_control sc = { > .gfp_mask = GFP_KERNEL, > .order = order, > .may_unmap = 1, > + .nodemask = &nodemask, > }; > > + /* > + * Set the nodemask in sc to indicate to kswapd_shrink_node() that we're > + * looking for reclaim from this node. > + */ > + node_set(pgdat->node_id, nodemask); > + > set_task_reclaim_state(current, &sc.reclaim_state); > psi_memstall_enter(&pflags); > __fs_reclaim_acquire(_THIS_IP_); > @@ -7642,6 +7699,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in > const unsigned long nr_pages = 1 << order; > struct task_struct *p = current; > unsigned int noreclaim_flag; > + nodemask_t nodemask = NODE_MASK_NONE; > struct scan_control sc = { > .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX), > .gfp_mask = current_gfp_context(gfp_mask), > @@ -7651,9 +7709,16 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in > .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP), > .may_swap = 1, > .reclaim_idx = gfp_zone(gfp_mask), > + .nodemask = &nodemask, > }; > unsigned long pflags; > > + /* > + * Set the nodemask in sc to indicate to shrink_node() that we're > + * looking for reclaim from this node. > + */ > + node_set(pgdat->node_id, nodemask); > + > trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order, > sc.gfp_mask); > > -- > 2.39.0.rc0.267.gcb52ba06e7-goog
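For illustration, the accounting rule the patch introduces reduces to a small userspace sketch (this is not kernel code; `toy_nodemask` and `demotion_counts_as_reclaim()` are made-up stand-ins for `nodemask_t` and the check added in `alloc_demote_page()`): a demoted page is counted as reclaimed only when the caller supplied a nodemask and the page's new node lies outside it.

```c
#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for nodemask_t: one bit per NUMA node. */
typedef unsigned long toy_nodemask;

static bool toy_node_isset(int nid, toy_nodemask mask)
{
	return mask & (1UL << nid);
}

/*
 * The rule from the patch in miniature: a demotion counts as reclaim only if
 * (a) the caller provided a nodemask and (b) the page left that nodemask.
 */
static unsigned long demotion_counts_as_reclaim(const toy_nodemask *reclaim_mask,
						int target_nid)
{
	if (reclaim_mask && !toy_node_isset(target_nid, *reclaim_mask))
		return 1;
	return 0;
}

int main(void)
{
	toy_nodemask top_tier = 1UL << 0;	/* reclaim requested from node 0 */

	/* Node-constrained caller (e.g. kswapd): demotion 0 -> 1 counts. */
	printf("mask={0}, demoted to node 1: %lu\n",
	       demotion_counts_as_reclaim(&top_tier, 1));

	/* Caller with no nodemask (e.g. memcg charge path): never counts. */
	printf("mask=NULL, demoted to node 1: %lu\n",
	       demotion_counts_as_reclaim(NULL, 1));
	return 0;
}
```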
On Mon, Dec 5, 2022 at 7:14 PM Huang, Ying <ying.huang@intel.com> wrote: > > Mina Almasry <almasrymina@google.com> writes: > > > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > > reclaim"") enabled demotion in memcg reclaim, which is the right thing > > to do, however, it introduced a regression in the behavior of > > try_to_free_mem_cgroup_pages(). > > > > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to > > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage > > of the cgroup should reduce by nr_pages. The callers expect > > try_to_free_mem_cgroup_pages() to also return the number of pages > > reclaimed, not demoted. > > > > However, what try_to_free_mem_cgroup_pages() actually does is it > > unconditionally counts demoted pages as reclaimed pages. So in practnice > > when it is called it will often demote nr_pages and return the number of > > demoted pages to the caller. Demoted pages don't lower the memcg usage, > > and so try_to_free_mem_cgroup_pages() is not actually doing what the > > callers want it to do. > > > > Various things work suboptimally on memory tiered systems or don't work > > at all due to this: > > > > - memory.high enforcement likely doesn't work (it just demotes nr_pages > > instead of lowering the memcg usage by nr_pages). > > - try_charge_memcg() will keep retrying the charge while > > try_to_free_mem_cgroup_pages() is just demoting pages and not actually > > making any room for the charge. > > - memory.reclaim has a wonky interface. It advertises to the user it > > reclaims the provided amount but it will actually often demote that > > amount. > > > > There may be more effects to this issue. > > > > To fix these issues I propose shrink_folio_list() to only count pages > > demoted from inside of sc->nodemask to outside of sc->nodemask as > > 'reclaimed'. > > > > For callers such as reclaim_high() or try_charge_memcg() that set > > sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to > > actually reclaim nr_pages and return the number of pages reclaimed. No > > demoted pages would count towards the nr_pages requirement. > > > > For callers such as memory_reclaim() that set sc->nodemask, > > try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask > > with either reclaim or demotion. > > > > Tested this change using memory.reclaim interface. I set up a test case where > > I allocate 500m in a cgroup, and then do: > > > > echo "50m" > memory.reclaim > > > > Without this fix, my kernel demotes 70mb and reclaims 4 mb > > (memory.current is reduced by about 4mb). > > > > With this fix, my kernel demotes all memory possible and reclaims 60mb > > (memory.current is reduced by about 60mb). > > > > Signed-off-by: Mina Almasry <almasrymina@google.com> > > > > --- > > > > Changes in v3: > > - Reverted back to v1 implementation: "try to demote but don't count > > demoted pages unless they are demoted to outside the nodemask" as Ying > > suggested. > > - Made sure demotions that fall back to non next_demotion_target() are > > not counted as Wei suggested. > > - Updated comment in shrink_folio_list() as Ying suggested. > > - Added before/after for the test case in commit message since Ying > > asked. > > - Fixed call sites that don't provide sc->nodemask but expect demotion > > from a specific node as Ying pointed out. 
> > > > Cc: weixugc@google.com > > Cc: ying.huang@intel.com > > > > This is developed on top of mm-unstable largely because I want the > > solution to be compatible with the recently added nodes= arg on > > mm-unstable. > > --- > > mm/vmscan.c | 91 +++++++++++++++++++++++++++++++++++++++++++++-------- > > 1 file changed, 78 insertions(+), 13 deletions(-) > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 2b42ac9ad755..f324e80395c3 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -1555,13 +1555,18 @@ static void folio_check_dirty_writeback(struct folio *folio, > > mapping->a_ops->is_dirty_writeback(folio, dirty, writeback); > > } > > > > +struct demotion_control { > > + struct migration_target_control *mtc; > > + nodemask_t *demote_from_nodemask; > > + unsigned long nr_demoted_outside_nodemask; > > +}; > > + > > static struct page *alloc_demote_page(struct page *page, unsigned long private) > > { > > struct page *target_page; > > nodemask_t *allowed_mask; > > - struct migration_target_control *mtc; > > - > > - mtc = (struct migration_target_control *)private; > > + struct demotion_control *dc = (struct demotion_control *)private; > > + struct migration_target_control *mtc = dc->mtc; > > > > allowed_mask = mtc->nmask; > > /* > > @@ -1576,13 +1581,31 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private) > > mtc->nmask = NULL; > > mtc->gfp_mask |= __GFP_THISNODE; > > target_page = alloc_migration_target(page, (unsigned long)mtc); > > - if (target_page) > > + if (!target_page) { > > + mtc->gfp_mask &= ~__GFP_THISNODE; > > + mtc->nmask = allowed_mask; > > + target_page = alloc_migration_target(page, (unsigned long)mtc); > > + } > > + > > + if (!target_page) > > return target_page; > > > > - mtc->gfp_mask &= ~__GFP_THISNODE; > > - mtc->nmask = allowed_mask; > > + if (dc->demote_from_nodemask && > > + !node_isset(page_to_nid(target_page), *dc->demote_from_nodemask)) > > Use mtc->nid directly? > mtc->nid is the next_demotion_node(). Wei's earlier comment is that the page may be allocated anywhere on the get_allowed_targets(), not necessarily the next_demotion_node(), so I don't think I can use mtc->nid. I think I have to check on which node the page was allocated as I'm doing here. Let me know if I missed something. > > + dc->nr_demoted_outside_nodemask++; > > > > - return alloc_migration_target(page, (unsigned long)mtc); > > + return target_page; > > +} > > + > > +void free_demote_page(struct page *page, unsigned long private) > > +{ > > + struct demotion_control *dc = (struct demotion_control *)private; > > + > > + if (dc->demote_from_nodemask && > > + !node_isset(page_to_nid(page), *dc->demote_from_nodemask)) > > ditto > > > + dc->nr_demoted_outside_nodemask--; > > + > > + folio_put(page_folio(page)); > > } > > > > /* > > @@ -1590,7 +1613,8 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private) > > * Folios which are not demoted are left on @demote_folios. 
> > */ > > static unsigned int demote_folio_list(struct list_head *demote_folios, > > - struct pglist_data *pgdat) > > + struct pglist_data *pgdat, > > + nodemask_t *nodemask) > > { > > int target_nid = next_demotion_node(pgdat->node_id); > > unsigned int nr_succeeded; > > @@ -1608,6 +1632,12 @@ static unsigned int demote_folio_list(struct list_head *demote_folios, > > .nmask = &allowed_mask > > }; > > > > + struct demotion_control dc = { > > + .mtc = &mtc, > > + .demote_from_nodemask = nodemask, > > + .nr_demoted_outside_nodemask = 0, > > + }; > > + > > if (list_empty(demote_folios)) > > return 0; > > > > @@ -1617,13 +1647,13 @@ static unsigned int demote_folio_list(struct list_head *demote_folios, > > node_get_allowed_targets(pgdat, &allowed_mask); > > > > /* Demotion ignores all cpuset and mempolicy settings */ > > - migrate_pages(demote_folios, alloc_demote_page, NULL, > > - (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, > > + migrate_pages(demote_folios, alloc_demote_page, free_demote_page, > > + (unsigned long)&dc, MIGRATE_ASYNC, MR_DEMOTION, > > &nr_succeeded); > > > > __count_vm_events(PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded); > > > > - return nr_succeeded; > > + return dc.nr_demoted_outside_nodemask; > > } > > > > static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask) > > @@ -1643,7 +1673,12 @@ static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask) > > } > > > > /* > > - * shrink_folio_list() returns the number of reclaimed pages > > + * shrink_folio_list() returns the number of reclaimed pages. > > + * > > + * Demoted pages are counted as reclaimed iff: > > + * (a) sc->nodemask arg is provided. > > + * (b) page has been demoted from a node inside sc->nodemask to a node > > + * outside sc->nodemask. > > */ > > static unsigned int shrink_folio_list(struct list_head *folio_list, > > struct pglist_data *pgdat, struct scan_control *sc, > > @@ -1653,6 +1688,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > LIST_HEAD(free_folios); > > LIST_HEAD(demote_folios); > > unsigned int nr_reclaimed = 0; > > + unsigned int nr_demoted_outside_nodemask = 0; > > unsigned int pgactivate = 0; > > bool do_demote_pass; > > struct swap_iocb *plug = NULL; > > @@ -2085,7 +2121,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > /* 'folio_list' is always empty here */ > > > > /* Migrate folios selected for demotion */ > > - nr_reclaimed += demote_folio_list(&demote_folios, pgdat); > > + nr_demoted_outside_nodemask = > > + demote_folio_list(&demote_folios, pgdat, sc->nodemask); > > + > > + if (sc->nodemask) > > + nr_reclaimed += nr_demoted_outside_nodemask; > > + > > /* Folios that could not be demoted are still in @demote_folios */ > > if (!list_empty(&demote_folios)) { > > /* Folios which weren't demoted go back on @folio_list */ > > @@ -2130,9 +2171,11 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > unsigned int reclaim_clean_pages_from_list(struct zone *zone, > > struct list_head *folio_list) > > { > > + nodemask_t nodemask = NODE_MASK_NONE; > > Is it necessary for us to use NODEMASK_ALLOC/NODEMASK_FREE to save stack space? > I can certainly use NODEMASK_ALLOC/NODEMASK_FREE if you'd like. I think there are a few places that stack allocate nodemask_t already, including one place I recently added in memory_reclaim(), so it doesn't seem _necessary_ per say. If you're asking my opinion, AFAICT it's not an issue. 
I think you need > 32 numa nodes before nodemask_t becomes an array of size 2 longs on a 32-bit machine, and even then I think it's not a huge deal. Up to you; I have no issue with converting to NODEMASK_ALLOC/NODEMASK_FREE in v4. > Best Regards, > Huang, Ying > > > struct scan_control sc = { > > .gfp_mask = GFP_KERNEL, > > .may_unmap = 1, > > + .nodemask = &nodemask > > }; > > struct reclaim_stat stat; > > unsigned int nr_reclaimed; > > @@ -2140,6 +2183,12 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone, > > LIST_HEAD(clean_folios); > > unsigned int noreclaim_flag; > > > > + /* > > + * Set the nodemask in sc to indicate to shrink_folio_list() that we're > > + * looking for reclaim from this node. > > + */ > > + node_set(zone->zone_pgdat->node_id, nodemask); > > + > > list_for_each_entry_safe(folio, next, folio_list, lru) { > > if (!folio_test_hugetlb(folio) && folio_is_file_lru(folio) && > > !folio_test_dirty(folio) && !__folio_test_movable(folio) && > > @@ -7031,12 +7080,20 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) > > unsigned long zone_boosts[MAX_NR_ZONES] = { 0, }; > > bool boosted; > > struct zone *zone; > > + nodemask_t nodemask = NODE_MASK_NONE; > > struct scan_control sc = { > > .gfp_mask = GFP_KERNEL, > > .order = order, > > .may_unmap = 1, > > + .nodemask = &nodemask, > > }; > > > > + /* > > + * Set the nodemask in sc to indicate to kswapd_shrink_node() that we're > > + * looking for reclaim from this node. > > + */ > > + node_set(pgdat->node_id, nodemask); > > + > > set_task_reclaim_state(current, &sc.reclaim_state); > > psi_memstall_enter(&pflags); > > __fs_reclaim_acquire(_THIS_IP_); > > @@ -7642,6 +7699,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in > > const unsigned long nr_pages = 1 << order; > > struct task_struct *p = current; > > unsigned int noreclaim_flag; > > + nodemask_t nodemask = NODE_MASK_NONE; > > struct scan_control sc = { > > .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX), > > .gfp_mask = current_gfp_context(gfp_mask), > > @@ -7651,9 +7709,16 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in > > .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP), > > .may_swap = 1, > > .reclaim_idx = gfp_zone(gfp_mask), > > + .nodemask = &nodemask, > > }; > > unsigned long pflags; > > > > + /* > > + * Set the nodemask in sc to indicate to shrink_node() that we're > > + * looking for reclaim from this node. > > + */ > > + node_set(pgdat->node_id, nodemask); > > + > > trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order, > > sc.gfp_mask); > > > > -- > > 2.39.0.rc0.267.gcb52ba06e7-goog >
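On the stack-space question discussed above, a rough userspace calculation (illustrative only; in the kernel the size follows from CONFIG_NODES_SHIFT, and NODEMASK_ALLOC()/NODEMASK_FREE() exist to move large masks off the stack) of what an on-stack node bitmap costs as the configured node count grows:

```c
#include <limits.h>
#include <stdio.h>

/* Bytes needed for a bitmap with one bit per possible node. */
static size_t node_bitmap_bytes(unsigned int max_numnodes)
{
	unsigned int bits_per_long = sizeof(long) * CHAR_BIT;
	unsigned int nr_longs = (max_numnodes + bits_per_long - 1) / bits_per_long;

	return (size_t)nr_longs * sizeof(long);
}

int main(void)
{
	unsigned int sizes[] = { 16, 64, 1024 };

	for (unsigned int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("MAX_NUMNODES=%4u -> %zu bytes per on-stack nodemask\n",
		       sizes[i], node_bitmap_bytes(sizes[i]));
	return 0;
}
```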
Mina Almasry <almasrymina@google.com> writes: > On Mon, Dec 5, 2022 at 7:14 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Mina Almasry <almasrymina@google.com> writes: >> >> > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg >> > reclaim"") enabled demotion in memcg reclaim, which is the right thing >> > to do, however, it introduced a regression in the behavior of >> > try_to_free_mem_cgroup_pages(). >> > >> > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to >> > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage >> > of the cgroup should reduce by nr_pages. The callers expect >> > try_to_free_mem_cgroup_pages() to also return the number of pages >> > reclaimed, not demoted. >> > >> > However, what try_to_free_mem_cgroup_pages() actually does is it >> > unconditionally counts demoted pages as reclaimed pages. So in practnice >> > when it is called it will often demote nr_pages and return the number of >> > demoted pages to the caller. Demoted pages don't lower the memcg usage, >> > and so try_to_free_mem_cgroup_pages() is not actually doing what the >> > callers want it to do. >> > >> > Various things work suboptimally on memory tiered systems or don't work >> > at all due to this: >> > >> > - memory.high enforcement likely doesn't work (it just demotes nr_pages >> > instead of lowering the memcg usage by nr_pages). >> > - try_charge_memcg() will keep retrying the charge while >> > try_to_free_mem_cgroup_pages() is just demoting pages and not actually >> > making any room for the charge. >> > - memory.reclaim has a wonky interface. It advertises to the user it >> > reclaims the provided amount but it will actually often demote that >> > amount. >> > >> > There may be more effects to this issue. >> > >> > To fix these issues I propose shrink_folio_list() to only count pages >> > demoted from inside of sc->nodemask to outside of sc->nodemask as >> > 'reclaimed'. >> > >> > For callers such as reclaim_high() or try_charge_memcg() that set >> > sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to >> > actually reclaim nr_pages and return the number of pages reclaimed. No >> > demoted pages would count towards the nr_pages requirement. >> > >> > For callers such as memory_reclaim() that set sc->nodemask, >> > try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask >> > with either reclaim or demotion. >> > >> > Tested this change using memory.reclaim interface. I set up a test case where >> > I allocate 500m in a cgroup, and then do: >> > >> > echo "50m" > memory.reclaim >> > >> > Without this fix, my kernel demotes 70mb and reclaims 4 mb >> > (memory.current is reduced by about 4mb). >> > >> > With this fix, my kernel demotes all memory possible and reclaims 60mb >> > (memory.current is reduced by about 60mb). >> > >> > Signed-off-by: Mina Almasry <almasrymina@google.com> >> > >> > --- >> > >> > Changes in v3: >> > - Reverted back to v1 implementation: "try to demote but don't count >> > demoted pages unless they are demoted to outside the nodemask" as Ying >> > suggested. >> > - Made sure demotions that fall back to non next_demotion_target() are >> > not counted as Wei suggested. >> > - Updated comment in shrink_folio_list() as Ying suggested. >> > - Added before/after for the test case in commit message since Ying >> > asked. >> > - Fixed call sites that don't provide sc->nodemask but expect demotion >> > from a specific node as Ying pointed out. 
>> > >> > Cc: weixugc@google.com >> > Cc: ying.huang@intel.com >> > >> > This is developed on top of mm-unstable largely because I want the >> > solution to be compatible with the recently added nodes= arg on >> > mm-unstable. >> > --- >> > mm/vmscan.c | 91 +++++++++++++++++++++++++++++++++++++++++++++-------- >> > 1 file changed, 78 insertions(+), 13 deletions(-) >> > >> > diff --git a/mm/vmscan.c b/mm/vmscan.c >> > index 2b42ac9ad755..f324e80395c3 100644 >> > --- a/mm/vmscan.c >> > +++ b/mm/vmscan.c >> > @@ -1555,13 +1555,18 @@ static void folio_check_dirty_writeback(struct folio *folio, >> > mapping->a_ops->is_dirty_writeback(folio, dirty, writeback); >> > } >> > >> > +struct demotion_control { >> > + struct migration_target_control *mtc; >> > + nodemask_t *demote_from_nodemask; >> > + unsigned long nr_demoted_outside_nodemask; >> > +}; >> > + >> > static struct page *alloc_demote_page(struct page *page, unsigned long private) >> > { >> > struct page *target_page; >> > nodemask_t *allowed_mask; >> > - struct migration_target_control *mtc; >> > - >> > - mtc = (struct migration_target_control *)private; >> > + struct demotion_control *dc = (struct demotion_control *)private; >> > + struct migration_target_control *mtc = dc->mtc; >> > >> > allowed_mask = mtc->nmask; >> > /* >> > @@ -1576,13 +1581,31 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private) >> > mtc->nmask = NULL; >> > mtc->gfp_mask |= __GFP_THISNODE; >> > target_page = alloc_migration_target(page, (unsigned long)mtc); >> > - if (target_page) >> > + if (!target_page) { >> > + mtc->gfp_mask &= ~__GFP_THISNODE; >> > + mtc->nmask = allowed_mask; >> > + target_page = alloc_migration_target(page, (unsigned long)mtc); >> > + } >> > + >> > + if (!target_page) >> > return target_page; >> > >> > - mtc->gfp_mask &= ~__GFP_THISNODE; >> > - mtc->nmask = allowed_mask; >> > + if (dc->demote_from_nodemask && >> > + !node_isset(page_to_nid(target_page), *dc->demote_from_nodemask)) >> >> Use mtc->nid directly? >> > > mtc->nid is the next_demotion_node(). Wei's earlier comment is that > the page may be allocated anywhere on the get_allowed_targets(), not > necessarily the next_demotion_node(), so I don't think I can use > mtc->nid. I think I have to check on which node the page was allocated > as I'm doing here. Let me know if I missed something. Yes. You are right. >> > + dc->nr_demoted_outside_nodemask++; >> > >> > - return alloc_migration_target(page, (unsigned long)mtc); >> > + return target_page; >> > +} >> > + >> > +void free_demote_page(struct page *page, unsigned long private) >> > +{ >> > + struct demotion_control *dc = (struct demotion_control *)private; >> > + >> > + if (dc->demote_from_nodemask && >> > + !node_isset(page_to_nid(page), *dc->demote_from_nodemask)) >> >> ditto >> >> > + dc->nr_demoted_outside_nodemask--; >> > + >> > + folio_put(page_folio(page)); >> > } >> > >> > /* >> > @@ -1590,7 +1613,8 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private) >> > * Folios which are not demoted are left on @demote_folios. 
>> > */ >> > static unsigned int demote_folio_list(struct list_head *demote_folios, >> > - struct pglist_data *pgdat) >> > + struct pglist_data *pgdat, >> > + nodemask_t *nodemask) >> > { >> > int target_nid = next_demotion_node(pgdat->node_id); >> > unsigned int nr_succeeded; >> > @@ -1608,6 +1632,12 @@ static unsigned int demote_folio_list(struct list_head *demote_folios, >> > .nmask = &allowed_mask >> > }; >> > >> > + struct demotion_control dc = { >> > + .mtc = &mtc, >> > + .demote_from_nodemask = nodemask, >> > + .nr_demoted_outside_nodemask = 0, >> > + }; >> > + >> > if (list_empty(demote_folios)) >> > return 0; >> > >> > @@ -1617,13 +1647,13 @@ static unsigned int demote_folio_list(struct list_head *demote_folios, >> > node_get_allowed_targets(pgdat, &allowed_mask); >> > >> > /* Demotion ignores all cpuset and mempolicy settings */ >> > - migrate_pages(demote_folios, alloc_demote_page, NULL, >> > - (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, >> > + migrate_pages(demote_folios, alloc_demote_page, free_demote_page, >> > + (unsigned long)&dc, MIGRATE_ASYNC, MR_DEMOTION, >> > &nr_succeeded); >> > >> > __count_vm_events(PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded); >> > >> > - return nr_succeeded; >> > + return dc.nr_demoted_outside_nodemask; >> > } >> > >> > static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask) >> > @@ -1643,7 +1673,12 @@ static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask) >> > } >> > >> > /* >> > - * shrink_folio_list() returns the number of reclaimed pages >> > + * shrink_folio_list() returns the number of reclaimed pages. >> > + * >> > + * Demoted pages are counted as reclaimed iff: >> > + * (a) sc->nodemask arg is provided. >> > + * (b) page has been demoted from a node inside sc->nodemask to a node >> > + * outside sc->nodemask. >> > */ >> > static unsigned int shrink_folio_list(struct list_head *folio_list, >> > struct pglist_data *pgdat, struct scan_control *sc, >> > @@ -1653,6 +1688,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, >> > LIST_HEAD(free_folios); >> > LIST_HEAD(demote_folios); >> > unsigned int nr_reclaimed = 0; >> > + unsigned int nr_demoted_outside_nodemask = 0; >> > unsigned int pgactivate = 0; >> > bool do_demote_pass; >> > struct swap_iocb *plug = NULL; >> > @@ -2085,7 +2121,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, >> > /* 'folio_list' is always empty here */ >> > >> > /* Migrate folios selected for demotion */ >> > - nr_reclaimed += demote_folio_list(&demote_folios, pgdat); >> > + nr_demoted_outside_nodemask = >> > + demote_folio_list(&demote_folios, pgdat, sc->nodemask); >> > + >> > + if (sc->nodemask) >> > + nr_reclaimed += nr_demoted_outside_nodemask; >> > + >> > /* Folios that could not be demoted are still in @demote_folios */ >> > if (!list_empty(&demote_folios)) { >> > /* Folios which weren't demoted go back on @folio_list */ >> > @@ -2130,9 +2171,11 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, >> > unsigned int reclaim_clean_pages_from_list(struct zone *zone, >> > struct list_head *folio_list) >> > { >> > + nodemask_t nodemask = NODE_MASK_NONE; >> >> Is it necessary for us to use NODEMASK_ALLOC/NODEMASK_FREE to save stack space? >> > > I can certainly use NODEMASK_ALLOC/NODEMASK_FREE if you'd like. I > think there are a few places that stack allocate nodemask_t already, > including one place I recently added in memory_reclaim(), so it > doesn't seem _necessary_ per say. 
> > If you're asking my opinion, AFAICT it's not an issue. I think you > need > 32 numa nodes before nodemask_t becomes an array of size 2 > longs on a 32-bit machine, and even then I think it's not a huge deal. > Up to you; I have no issue with converting to > NODEMASK_ALLOC/NODEMASK_FREE in v4. Both are OK for me. Want to know David Rientjes's idea here. Best Regards, Huang, Ying >> >> > struct scan_control sc = { >> > .gfp_mask = GFP_KERNEL, >> > .may_unmap = 1, >> > + .nodemask = &nodemask >> > }; >> > struct reclaim_stat stat; >> > unsigned int nr_reclaimed; >> > @@ -2140,6 +2183,12 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone, >> > LIST_HEAD(clean_folios); >> > unsigned int noreclaim_flag; >> > >> > + /* >> > + * Set the nodemask in sc to indicate to shrink_folio_list() that we're >> > + * looking for reclaim from this node. >> > + */ >> > + node_set(zone->zone_pgdat->node_id, nodemask); >> > + >> > list_for_each_entry_safe(folio, next, folio_list, lru) { >> > if (!folio_test_hugetlb(folio) && folio_is_file_lru(folio) && >> > !folio_test_dirty(folio) && !__folio_test_movable(folio) && >> > @@ -7031,12 +7080,20 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) >> > unsigned long zone_boosts[MAX_NR_ZONES] = { 0, }; >> > bool boosted; >> > struct zone *zone; >> > + nodemask_t nodemask = NODE_MASK_NONE; >> > struct scan_control sc = { >> > .gfp_mask = GFP_KERNEL, >> > .order = order, >> > .may_unmap = 1, >> > + .nodemask = &nodemask, >> > }; >> > >> > + /* >> > + * Set the nodemask in sc to indicate to kswapd_shrink_node() that we're >> > + * looking for reclaim from this node. >> > + */ >> > + node_set(pgdat->node_id, nodemask); >> > + >> > set_task_reclaim_state(current, &sc.reclaim_state); >> > psi_memstall_enter(&pflags); >> > __fs_reclaim_acquire(_THIS_IP_); >> > @@ -7642,6 +7699,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in >> > const unsigned long nr_pages = 1 << order; >> > struct task_struct *p = current; >> > unsigned int noreclaim_flag; >> > + nodemask_t nodemask = NODE_MASK_NONE; >> > struct scan_control sc = { >> > .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX), >> > .gfp_mask = current_gfp_context(gfp_mask), >> > @@ -7651,9 +7709,16 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in >> > .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP), >> > .may_swap = 1, >> > .reclaim_idx = gfp_zone(gfp_mask), >> > + .nodemask = &nodemask, >> > }; >> > unsigned long pflags; >> > >> > + /* >> > + * Set the nodemask in sc to indicate to shrink_node() that we're >> > + * looking for reclaim from this node. >> > + */ >> > + node_set(pgdat->node_id, nodemask); >> > + >> > trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order, >> > sc.gfp_mask); >> > >> > -- >> > 2.39.0.rc0.267.gcb52ba06e7-goog >>
On Mon 05-12-22 18:34:05, Mina Almasry wrote: > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > reclaim"") enabled demotion in memcg reclaim, which is the right thing > to do, however, it introduced a regression in the behavior of > try_to_free_mem_cgroup_pages(). > > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage > of the cgroup should reduce by nr_pages. The callers expect > try_to_free_mem_cgroup_pages() to also return the number of pages > reclaimed, not demoted. > > However, what try_to_free_mem_cgroup_pages() actually does is it > unconditionally counts demoted pages as reclaimed pages. So in practice > when it is called it will often demote nr_pages and return the number of > demoted pages to the caller. Demoted pages don't lower the memcg usage, > and so try_to_free_mem_cgroup_pages() is not actually doing what the > callers want it to do. > > Various things work suboptimally on memory tiered systems or don't work > at all due to this: > > - memory.high enforcement likely doesn't work (it just demotes nr_pages > instead of lowering the memcg usage by nr_pages). > - try_charge_memcg() will keep retrying the charge while > try_to_free_mem_cgroup_pages() is just demoting pages and not actually > making any room for the charge. This has been brought up during the review https://lore.kernel.org/all/YoYTEDD+c4GT0xYY@dhcp22.suse.cz/ > - memory.reclaim has a wonky interface. It advertises to the user it > reclaims the provided amount but it will actually often demote that > amount. > > There may be more effects to this issue. > > To fix these issues I propose shrink_folio_list() to only count pages > demoted from inside of sc->nodemask to outside of sc->nodemask as > 'reclaimed'. Could you expand on why the node mask matters? From the charge point of view it should be completely uninteresting as the charge remains. I suspect we really need to change to reclaim metrics for memcg reclaim. In the memory balancing reclaim we can indeed consider demotions as a reclaim because the memory is freed in the end but for the memcg reclaim we really should be counting discharges instead. No demotion/migration will free up charges.
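The distinction drawn here, that demotion frees memory on a node but does not uncharge the cgroup, can be seen in a toy model (purely illustrative, with no memcg internals): moving pages between nodes leaves the cgroup's usage untouched, while reclaiming them lowers it.

```c
#include <stdio.h>

#define NR_NODES 2

/* Toy cgroup: pages charged to it, broken out per node. */
struct toy_memcg {
	unsigned long pages_on_node[NR_NODES];
};

static unsigned long usage(const struct toy_memcg *cg)
{
	unsigned long total = 0;

	for (int nid = 0; nid < NR_NODES; nid++)
		total += cg->pages_on_node[nid];
	return total;
}

/* Demotion: pages change nodes but stay charged to the cgroup. */
static void demote(struct toy_memcg *cg, int from, int to, unsigned long n)
{
	cg->pages_on_node[from] -= n;
	cg->pages_on_node[to] += n;
}

/* Reclaim: pages are freed, so the charge against the cgroup goes away. */
static void reclaim(struct toy_memcg *cg, int nid, unsigned long n)
{
	cg->pages_on_node[nid] -= n;
}

int main(void)
{
	struct toy_memcg cg = { .pages_on_node = { 500, 0 } };

	demote(&cg, 0, 1, 70);
	printf("after demoting 70 pages:   usage = %lu\n", usage(&cg));

	reclaim(&cg, 1, 60);
	printf("after reclaiming 60 pages: usage = %lu\n", usage(&cg));
	return 0;
}
```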
Hi Mina, Thank you for the patch! Perhaps something to improve: [auto build test WARNING on akpm-mm/mm-everything] url: https://github.com/intel-lab-lkp/linux/commits/Mina-Almasry/mm-Fix-memcg-reclaim-on-memory-tiered-systems/20221206-103512 base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything patch link: https://lore.kernel.org/r/20221206023406.3182800-1-almasrymina%40google.com patch subject: [PATCH v3] [mm-unstable] mm: Fix memcg reclaim on memory tiered systems config: openrisc-randconfig-r012-20221206 compiler: or1k-linux-gcc (GCC) 12.1.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/intel-lab-lkp/linux/commit/d326a3b857c743d1e64b73a752505a65732b4e51 git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review Mina-Almasry/mm-Fix-memcg-reclaim-on-memory-tiered-systems/20221206-103512 git checkout d326a3b857c743d1e64b73a752505a65732b4e51 # save the config file mkdir build_dir && cp config build_dir/.config COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=openrisc SHELL=/bin/bash If you fix the issue, kindly add following tag where applicable | Reported-by: kernel test robot <lkp@intel.com> All warnings (new ones prefixed by >>): >> mm/vmscan.c:1599:6: warning: no previous prototype for 'free_demote_page' [-Wmissing-prototypes] 1599 | void free_demote_page(struct page *page, unsigned long private) | ^~~~~~~~~~~~~~~~ vim +/free_demote_page +1599 mm/vmscan.c 1598 > 1599 void free_demote_page(struct page *page, unsigned long private) 1600 { 1601 struct demotion_control *dc = (struct demotion_control *)private; 1602 1603 if (dc->demote_from_nodemask && 1604 !node_isset(page_to_nid(page), *dc->demote_from_nodemask)) 1605 dc->nr_demoted_outside_nodemask--; 1606 1607 folio_put(page_folio(page)); 1608 } 1609
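The warning fires because free_demote_page() has external linkage but no prior declaration. A standalone illustration of this warning class follows; making the helper `static` is the usual remedy for a file-local function, stated here as an assumption rather than something decided in this thread.

```c
/* Build with: gcc -Wall -Wmissing-prototypes -o proto_demo proto_demo.c */
#include <stdio.h>

/*
 * External linkage, no previous declaration: -Wmissing-prototypes reports
 * "no previous prototype for 'unscoped_helper'", the same class of warning
 * the robot reported for free_demote_page().
 */
void unscoped_helper(void)
{
	puts("external linkage, no prototype");
}

/* Internal linkage: no separate prototype is expected, so no warning. */
static void file_local_helper(void)
{
	puts("internal linkage");
}

int main(void)
{
	unscoped_helper();
	file_local_helper();
	return 0;
}
```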
On Tue, Dec 6, 2022 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 05-12-22 18:34:05, Mina Almasry wrote: > > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > > reclaim"") enabled demotion in memcg reclaim, which is the right thing > > to do, however, it introduced a regression in the behavior of > > try_to_free_mem_cgroup_pages(). > > > > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to > > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage > > of the cgroup should reduce by nr_pages. The callers expect > > try_to_free_mem_cgroup_pages() to also return the number of pages > > reclaimed, not demoted. > > > > However, what try_to_free_mem_cgroup_pages() actually does is it > > unconditionally counts demoted pages as reclaimed pages. So in practice > > when it is called it will often demote nr_pages and return the number of > > demoted pages to the caller. Demoted pages don't lower the memcg usage, > > and so try_to_free_mem_cgroup_pages() is not actually doing what the > > callers want it to do. > > > > Various things work suboptimally on memory tiered systems or don't work > > at all due to this: > > > > - memory.high enforcement likely doesn't work (it just demotes nr_pages > > instead of lowering the memcg usage by nr_pages). > > - try_charge_memcg() will keep retrying the charge while > > try_to_free_mem_cgroup_pages() is just demoting pages and not actually > > making any room for the charge. > > This has been brought up during the review https://lore.kernel.org/all/YoYTEDD+c4GT0xYY@dhcp22.suse.cz/ > Ah, I did indeed miss this. Thanks for the pointer. However I don't understand this bit from your email (sorry I'm probably missing something): "I suspect this is rather unlikely situation, though. The last tear (without any fallback) should have some memory to reclaim most of the time." Reading the code in try_charge_memcg(), I don't see the last retry for try_to_free_mem_cgroup_pages() do anything special. My concern here is that try_charge_memcg() calls try_to_free_mem_cgroup_pages() MAX_RECLAIM_RETRIES times. Each time that call may demote pages and report back that it was able to 'reclaim' memory, but the charge keeps failing because the memcg reclaim didn't actually make room for the charge. What happens in this case? My understanding is that the memcg oom-killer gets wrongly invoked. > > - memory.reclaim has a wonky interface. It advertises to the user it > > reclaims the provided amount but it will actually often demote that > > amount. > > > > There may be more effects to this issue. > > > > To fix these issues I propose shrink_folio_list() to only count pages > > demoted from inside of sc->nodemask to outside of sc->nodemask as > > 'reclaimed'. > > Could you expand on why the node mask matters? From the charge point of > view it should be completely uninteresting as the charge remains. > > I suspect we really need to change to reclaim metrics for memcg reclaim. > In the memory balancing reclaim we can indeed consider demotions as a > reclaim because the memory is freed in the end but for the memcg reclaim > we really should be counting discharges instead. No demotion/migration will > free up charges. I think what you're describing is exactly what this patch aims to do. I'm proposing an interface change to shrink_folio_list() such that it only counts demoted pages as reclaimed iff sc->nodemask is provided by the caller and the demotion removed pages from inside sc->nodemask to outside sc->nodemask. 
In this case: 1. memory balancing reclaim would pass sc->nodemask=nid to shrink_folio_list() indicating that it should count pages demoted from sc->nodemask as reclaimed. 2. memcg reclaim would pass sc->nodemask=NULL to shrink_folio_list() indicating that it is looking for reclaim across all nodes and no demoted pages should count as reclaimed. Sorry if the commit message was not clear. I can try making it clearer in the next version but it's already very long. > -- > Michal Hocko > SUSE Labs
Hi Mina, Thank you for the patch! Perhaps something to improve: [auto build test WARNING on akpm-mm/mm-everything] url: https://github.com/intel-lab-lkp/linux/commits/Mina-Almasry/mm-Fix-memcg-reclaim-on-memory-tiered-systems/20221206-103512 base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything patch link: https://lore.kernel.org/r/20221206023406.3182800-1-almasrymina%40google.com patch subject: [PATCH v3] [mm-unstable] mm: Fix memcg reclaim on memory tiered systems config: arm-randconfig-r036-20221206 compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 6e4cea55f0d1104408b26ac574566a0e4de48036) reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # install arm cross compiling tool for clang build # apt-get install binutils-arm-linux-gnueabi # https://github.com/intel-lab-lkp/linux/commit/d326a3b857c743d1e64b73a752505a65732b4e51 git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review Mina-Almasry/mm-Fix-memcg-reclaim-on-memory-tiered-systems/20221206-103512 git checkout d326a3b857c743d1e64b73a752505a65732b4e51 # save the config file mkdir build_dir && cp config build_dir/.config COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash If you fix the issue, kindly add following tag where applicable | Reported-by: kernel test robot <lkp@intel.com> All warnings (new ones prefixed by >>): >> mm/vmscan.c:1599:6: warning: no previous prototype for function 'free_demote_page' [-Wmissing-prototypes] void free_demote_page(struct page *page, unsigned long private) ^ mm/vmscan.c:1599:1: note: declare 'static' if the function is not intended to be used outside of this translation unit void free_demote_page(struct page *page, unsigned long private) ^ static 1 warning generated. vim +/free_demote_page +1599 mm/vmscan.c 1598 > 1599 void free_demote_page(struct page *page, unsigned long private) 1600 { 1601 struct demotion_control *dc = (struct demotion_control *)private; 1602 1603 if (dc->demote_from_nodemask && 1604 !node_isset(page_to_nid(page), *dc->demote_from_nodemask)) 1605 dc->nr_demoted_outside_nodemask--; 1606 1607 folio_put(page_folio(page)); 1608 } 1609
On Tue 06-12-22 08:06:51, Mina Almasry wrote: > On Tue, Dec 6, 2022 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Mon 05-12-22 18:34:05, Mina Almasry wrote: > > > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > > > reclaim"") enabled demotion in memcg reclaim, which is the right thing > > > to do, however, it introduced a regression in the behavior of > > > try_to_free_mem_cgroup_pages(). > > > > > > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to > > > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage > > > of the cgroup should reduce by nr_pages. The callers expect > > > try_to_free_mem_cgroup_pages() to also return the number of pages > > > reclaimed, not demoted. > > > > > > However, what try_to_free_mem_cgroup_pages() actually does is it > > > unconditionally counts demoted pages as reclaimed pages. So in practice > > > when it is called it will often demote nr_pages and return the number of > > > demoted pages to the caller. Demoted pages don't lower the memcg usage, > > > and so try_to_free_mem_cgroup_pages() is not actually doing what the > > > callers want it to do. > > > > > > Various things work suboptimally on memory tiered systems or don't work > > > at all due to this: > > > > > > - memory.high enforcement likely doesn't work (it just demotes nr_pages > > > instead of lowering the memcg usage by nr_pages). > > > - try_charge_memcg() will keep retrying the charge while > > > try_to_free_mem_cgroup_pages() is just demoting pages and not actually > > > making any room for the charge. > > > > This has been brought up during the review https://lore.kernel.org/all/YoYTEDD+c4GT0xYY@dhcp22.suse.cz/ > > > > Ah, I did indeed miss this. Thanks for the pointer. However I don't > understand this bit from your email (sorry I'm probably missing > something): > > "I suspect this is rather unlikely situation, though. The last tear > (without any fallback) should have some memory to reclaim most of > the time." > > Reading the code in try_charge_memcg(), I don't see the last retry for > try_to_free_mem_cgroup_pages() do anything special. My concern here is > that try_charge_memcg() calls try_to_free_mem_cgroup_pages() > MAX_RECLAIM_RETRIES times. Each time that call may demote pages and > report back that it was able to 'reclaim' memory, but the charge keeps > failing because the memcg reclaim didn't actually make room for the > charge. What happens in this case? My understanding is that the memcg > oom-killer gets wrongly invoked. The memcg reclaim shrinks from all zones in the allowed zonelist. In general from all nodes. So unless the lower tier is outside of this zonelist then there is a zone to reclaim from which cannot demote. Correct? > > > - memory.reclaim has a wonky interface. It advertises to the user it > > > reclaims the provided amount but it will actually often demote that > > > amount. > > > > > > There may be more effects to this issue. > > > > > > To fix these issues I propose shrink_folio_list() to only count pages > > > demoted from inside of sc->nodemask to outside of sc->nodemask as > > > 'reclaimed'. > > > > Could you expand on why the node mask matters? From the charge point of > > view it should be completely uninteresting as the charge remains. > > > > I suspect we really need to change to reclaim metrics for memcg reclaim. 
> > In the memory balancing reclaim we can indeed consider demotions as a > > reclaim because the memory is freed in the end but for the memcg reclaim > > we really should be counting discharges instead. No demotion/migration will > > free up charges. > > I think what you're describing is exactly what this patch aims to do. > I'm proposing an interface change to shrink_folio_list() such that it > only counts demoted pages as reclaimed iff sc->nodemask is provided by > the caller and the demotion removed pages from inside sc->nodemask to > outside sc->nodemask. In this case: > > 1. memory balancing reclaim would pass sc->nodemask=nid to > shrink_folio_list() indicating that it should count pages demoted from > sc->nodemask as reclaimed. > > 2. memcg reclaim would pass sc->nodemask=NULL to shrink_folio_list() > indicating that it is looking for reclaim across all nodes and no > demoted pages should count as reclaimed. > > Sorry if the commit message was not clear. I can try making it clearer > in the next version but it's already very long. Either I am missing something or I simply do not understand why you are hooked into nodemask so much. Why cannot we have a simple rule that only global reclaim considers demotions as nr_reclaimed?
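The simpler rule suggested here can be written down directly. In this sketch the `is_memcg_reclaim` flag stands in for whatever the kernel side would use (presumably something like `cgroup_reclaim(sc)`, which is an assumption about the wiring, not a quote from the patch):

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * The simpler rule: demotions count toward nr_reclaimed only for global
 * (memory-balancing) reclaim, never for memcg reclaim, regardless of any
 * nodemask the caller passed.
 */
static unsigned long demoted_as_reclaimed(bool is_memcg_reclaim,
					  unsigned long nr_demoted)
{
	return is_memcg_reclaim ? 0 : nr_demoted;
}

int main(void)
{
	printf("kswapd demotes 70 pages  -> %lu counted as reclaimed\n",
	       demoted_as_reclaimed(false, 70));
	printf("memcg reclaim demotes 70 -> %lu counted as reclaimed\n",
	       demoted_as_reclaimed(true, 70));
	return 0;
}
```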
Michal Hocko <mhocko@suse.com> writes: > On Tue 06-12-22 08:06:51, Mina Almasry wrote: >> On Tue, Dec 6, 2022 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote: >> > >> > On Mon 05-12-22 18:34:05, Mina Almasry wrote: >> > > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg >> > > reclaim"") enabled demotion in memcg reclaim, which is the right thing >> > > to do, however, it introduced a regression in the behavior of >> > > try_to_free_mem_cgroup_pages(). >> > > >> > > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to >> > > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage >> > > of the cgroup should reduce by nr_pages. The callers expect >> > > try_to_free_mem_cgroup_pages() to also return the number of pages >> > > reclaimed, not demoted. >> > > >> > > However, what try_to_free_mem_cgroup_pages() actually does is it >> > > unconditionally counts demoted pages as reclaimed pages. So in practice >> > > when it is called it will often demote nr_pages and return the number of >> > > demoted pages to the caller. Demoted pages don't lower the memcg usage, >> > > and so try_to_free_mem_cgroup_pages() is not actually doing what the >> > > callers want it to do. >> > > >> > > Various things work suboptimally on memory tiered systems or don't work >> > > at all due to this: >> > > >> > > - memory.high enforcement likely doesn't work (it just demotes nr_pages >> > > instead of lowering the memcg usage by nr_pages). >> > > - try_charge_memcg() will keep retrying the charge while >> > > try_to_free_mem_cgroup_pages() is just demoting pages and not actually >> > > making any room for the charge. >> > >> > This has been brought up during the review https://lore.kernel.org/all/YoYTEDD+c4GT0xYY@dhcp22.suse.cz/ >> > >> >> Ah, I did indeed miss this. Thanks for the pointer. However I don't >> understand this bit from your email (sorry I'm probably missing >> something): >> >> "I suspect this is rather unlikely situation, though. The last tear >> (without any fallback) should have some memory to reclaim most of >> the time." >> >> Reading the code in try_charge_memcg(), I don't see the last retry for >> try_to_free_mem_cgroup_pages() do anything special. My concern here is >> that try_charge_memcg() calls try_to_free_mem_cgroup_pages() >> MAX_RECLAIM_RETRIES times. Each time that call may demote pages and >> report back that it was able to 'reclaim' memory, but the charge keeps >> failing because the memcg reclaim didn't actually make room for the >> charge. What happens in this case? My understanding is that the memcg >> oom-killer gets wrongly invoked. > > The memcg reclaim shrinks from all zones in the allowed zonelist. In > general from all nodes. So unless the lower tier is outside of this > zonelist then there is a zone to reclaim from which cannot demote. > Correct? > >> > > - memory.reclaim has a wonky interface. It advertises to the user it >> > > reclaims the provided amount but it will actually often demote that >> > > amount. >> > > >> > > There may be more effects to this issue. >> > > >> > > To fix these issues I propose shrink_folio_list() to only count pages >> > > demoted from inside of sc->nodemask to outside of sc->nodemask as >> > > 'reclaimed'. >> > >> > Could you expand on why the node mask matters? From the charge point of >> > view it should be completely uninteresting as the charge remains. >> > >> > I suspect we really need to change to reclaim metrics for memcg reclaim. 
>> > In the memory balancing reclaim we can indeed consider demotions as a >> > reclaim because the memory is freed in the end but for the memcg reclaim >> > we really should be counting discharges instead. No demotion/migration will >> > free up charges. >> >> I think what you're describing is exactly what this patch aims to do. >> I'm proposing an interface change to shrink_folio_list() such that it >> only counts demoted pages as reclaimed iff sc->nodemask is provided by >> the caller and the demotion removed pages from inside sc->nodemask to >> outside sc->nodemask. In this case: >> >> 1. memory balancing reclaim would pass sc->nodemask=nid to >> shrink_folio_list() indicating that it should count pages demoted from >> sc->nodemask as reclaimed. >> >> 2. memcg reclaim would pass sc->nodemask=NULL to shrink_folio_list() >> indicating that it is looking for reclaim across all nodes and no >> demoted pages should count as reclaimed. >> >> Sorry if the commit message was not clear. I can try making it clearer >> in the next version but it's already very long. > > Either I am missing something or I simply do not understand why you are > hooked into nodemask so much. Why cannot we have a simple rule that > only global reclaim considers demotions as nr_reclaimed? Yes. This sounds reasonable to me and this simplify the logic greatly! Best Regards, Huang, Ying
On Tue, Dec 6, 2022 at 11:55 AM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 06-12-22 08:06:51, Mina Almasry wrote: > > On Tue, Dec 6, 2022 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Mon 05-12-22 18:34:05, Mina Almasry wrote: > > > > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > > > > reclaim"") enabled demotion in memcg reclaim, which is the right thing > > > > to do, however, it introduced a regression in the behavior of > > > > try_to_free_mem_cgroup_pages(). > > > > > > > > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to > > > > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage > > > > of the cgroup should reduce by nr_pages. The callers expect > > > > try_to_free_mem_cgroup_pages() to also return the number of pages > > > > reclaimed, not demoted. > > > > > > > > However, what try_to_free_mem_cgroup_pages() actually does is it > > > > unconditionally counts demoted pages as reclaimed pages. So in practice > > > > when it is called it will often demote nr_pages and return the number of > > > > demoted pages to the caller. Demoted pages don't lower the memcg usage, > > > > and so try_to_free_mem_cgroup_pages() is not actually doing what the > > > > callers want it to do. > > > > > > > > Various things work suboptimally on memory tiered systems or don't work > > > > at all due to this: > > > > > > > > - memory.high enforcement likely doesn't work (it just demotes nr_pages > > > > instead of lowering the memcg usage by nr_pages). > > > > - try_charge_memcg() will keep retrying the charge while > > > > try_to_free_mem_cgroup_pages() is just demoting pages and not actually > > > > making any room for the charge. > > > > > > This has been brought up during the review https://lore.kernel.org/all/YoYTEDD+c4GT0xYY@dhcp22.suse.cz/ > > > > > > > Ah, I did indeed miss this. Thanks for the pointer. However I don't > > understand this bit from your email (sorry I'm probably missing > > something): > > > > "I suspect this is rather unlikely situation, though. The last tear > > (without any fallback) should have some memory to reclaim most of > > the time." > > > > Reading the code in try_charge_memcg(), I don't see the last retry for > > try_to_free_mem_cgroup_pages() do anything special. My concern here is > > that try_charge_memcg() calls try_to_free_mem_cgroup_pages() > > MAX_RECLAIM_RETRIES times. Each time that call may demote pages and > > report back that it was able to 'reclaim' memory, but the charge keeps > > failing because the memcg reclaim didn't actually make room for the > > charge. What happens in this case? My understanding is that the memcg > > oom-killer gets wrongly invoked. > > The memcg reclaim shrinks from all zones in the allowed zonelist. In > general from all nodes. So unless the lower tier is outside of this > zonelist then there is a zone to reclaim from which cannot demote. > Correct? > Ah, thanks for pointing this out. I did indeed miss that the memcg reclaim tries to apply pressure equally to all the nodes. With some additional testing I'm able to see what you said: there could be no premature oom kill invocation because generally the memcg reclaim will find some pages to reclaim from lower tier nodes. I do find that the first call to try_to_free_mem_cgroup_pages() sometimes will mostly demote and not do much reclaim. I haven't been able to fully track the cause of that down but I suspect that the first call in my test will find most of the cgroup's memory on top tier nodes. 
However we do retry a bunch of times before we invoke oom, and in my testing subsequent calls will find plenty of memory in the lower tier nodes that it can reclaim. I'll update the commit message in the next version. > > > > - memory.reclaim has a wonky interface. It advertises to the user it > > > > reclaims the provided amount but it will actually often demote that > > > > amount. > > > > > > > > There may be more effects to this issue. > > > > > > > > To fix these issues I propose shrink_folio_list() to only count pages > > > > demoted from inside of sc->nodemask to outside of sc->nodemask as > > > > 'reclaimed'. > > > > > > Could you expand on why the node mask matters? From the charge point of > > > view it should be completely uninteresting as the charge remains. > > > > > > I suspect we really need to change to reclaim metrics for memcg reclaim. > > > In the memory balancing reclaim we can indeed consider demotions as a > > > reclaim because the memory is freed in the end but for the memcg reclaim > > > we really should be counting discharges instead. No demotion/migration will > > > free up charges. > > > > I think what you're describing is exactly what this patch aims to do. > > I'm proposing an interface change to shrink_folio_list() such that it > > only counts demoted pages as reclaimed iff sc->nodemask is provided by > > the caller and the demotion removed pages from inside sc->nodemask to > > outside sc->nodemask. In this case: > > > > 1. memory balancing reclaim would pass sc->nodemask=nid to > > shrink_folio_list() indicating that it should count pages demoted from > > sc->nodemask as reclaimed. > > > > 2. memcg reclaim would pass sc->nodemask=NULL to shrink_folio_list() > > indicating that it is looking for reclaim across all nodes and no > > demoted pages should count as reclaimed. > > > > Sorry if the commit message was not clear. I can try making it clearer > > in the next version but it's already very long. > > Either I am missing something or I simply do not understand why you are > hooked into nodemask so much. Why cannot we have a simple rule that > only global reclaim considers demotions as nr_reclaimed? > Thanks. I think this approach would work for most callers. My issue here is properly supporting the recently added nodes= arg[1] to memory.reclaim. If the user specifies all nodes or provides no arg, I'd like to treat it as memcg reclaim which doesn't count demotions. If the user provides the top tier nodes, I would like to count demotions as this interface is the way to trigger proactive demotion from top tier nodes. I guess I can check which args the user is passing and decide whether or not to count demotions. But the user right now can specify any combination of nodes, some of them top tier, some lower tier, some in the middle. I can return -EINVAL for that, but that seems like a shame. I thought a generic way to address this was what I'm doing here, i.e. counting pages demoted from the nodemask as reclaimed. Is that not acceptable? Is -EINVAL preferred here? [1] https://lore.kernel.org/linux-mm/87tu2a1v3y.fsf@yhuang6-desk2.ccr.corp.intel.com/ > -- > Michal Hocko > SUSE Labs
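As a rough sketch of the two cases Mina distinguishes above (assuming the two-tier layout used later in the thread, with top tier nodes 0-2 and lower tier node 3, and the mm-unstable nodes= syntax):

  # no nodes= arg: plain memcg reclaim; demoted pages would not count,
  # so memory.current should drop by roughly the requested amount
  echo "10m" > memory.reclaim

  # nodes= restricted to the top tier: proactive demotion; pages
  # demoted out of nodes 0-2 would count towards the request
  echo "10m nodes=0-2" > memory.reclaim

Under the patch being discussed, only the second form would count demoted pages towards the requested amount.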
On Tue 06-12-22 17:55:58, Mina Almasry wrote: > On Tue, Dec 6, 2022 at 11:55 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Tue 06-12-22 08:06:51, Mina Almasry wrote: > > > On Tue, Dec 6, 2022 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Mon 05-12-22 18:34:05, Mina Almasry wrote: > > > > > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > > > > > reclaim"") enabled demotion in memcg reclaim, which is the right thing > > > > > to do, however, it introduced a regression in the behavior of > > > > > try_to_free_mem_cgroup_pages(). > > > > > > > > > > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to > > > > > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage > > > > > of the cgroup should reduce by nr_pages. The callers expect > > > > > try_to_free_mem_cgroup_pages() to also return the number of pages > > > > > reclaimed, not demoted. > > > > > > > > > > However, what try_to_free_mem_cgroup_pages() actually does is it > > > > > unconditionally counts demoted pages as reclaimed pages. So in practice > > > > > when it is called it will often demote nr_pages and return the number of > > > > > demoted pages to the caller. Demoted pages don't lower the memcg usage, > > > > > and so try_to_free_mem_cgroup_pages() is not actually doing what the > > > > > callers want it to do. > > > > > > > > > > Various things work suboptimally on memory tiered systems or don't work > > > > > at all due to this: > > > > > > > > > > - memory.high enforcement likely doesn't work (it just demotes nr_pages > > > > > instead of lowering the memcg usage by nr_pages). > > > > > - try_charge_memcg() will keep retrying the charge while > > > > > try_to_free_mem_cgroup_pages() is just demoting pages and not actually > > > > > making any room for the charge. > > > > > > > > This has been brought up during the review https://lore.kernel.org/all/YoYTEDD+c4GT0xYY@dhcp22.suse.cz/ > > > > > > > > > > Ah, I did indeed miss this. Thanks for the pointer. However I don't > > > understand this bit from your email (sorry I'm probably missing > > > something): > > > > > > "I suspect this is rather unlikely situation, though. The last tear > > > (without any fallback) should have some memory to reclaim most of > > > the time." > > > > > > Reading the code in try_charge_memcg(), I don't see the last retry for > > > try_to_free_mem_cgroup_pages() do anything special. My concern here is > > > that try_charge_memcg() calls try_to_free_mem_cgroup_pages() > > > MAX_RECLAIM_RETRIES times. Each time that call may demote pages and > > > report back that it was able to 'reclaim' memory, but the charge keeps > > > failing because the memcg reclaim didn't actually make room for the > > > charge. What happens in this case? My understanding is that the memcg > > > oom-killer gets wrongly invoked. > > > > The memcg reclaim shrinks from all zones in the allowed zonelist. In > > general from all nodes. So unless the lower tier is outside of this > > zonelist then there is a zone to reclaim from which cannot demote. > > Correct? > > > > Ah, thanks for pointing this out. I did indeed miss that the memcg > reclaim tries to apply pressure equally to all the nodes. With some > additional testing I'm able to see what you said: there could be no > premature oom kill invocation because generally the memcg reclaim will > find some pages to reclaim from lower tier nodes. 
>
> I do find that the first call to try_to_free_mem_cgroup_pages()
> sometimes will mostly demote and not do much reclaim. I haven't been
> able to fully track the cause of that down but I suspect that the
> first call in my test will find most of the cgroup's memory on top
> tier nodes. However we do retry a bunch of times before we invoke oom,
> and in my testing subsequent calls will find plenty of memory in the
> lower tier nodes that it can reclaim. I'll update the commit message
> in the next version.

In the past we used to break out early from the memcg reclaim if the
reclaim target has been completed - see 1ba6fc9af35b ("mm: vmscan: do
not share cgroup iteration between reclaimers"). But I do not see an
early break from the reclaim anywhere anymore so the precise
nr_reclaimed tracking shouldn't make a lot of difference. There are
cases where we really have a hard time finding proper candidates and
need to dip into higher reclaim priorities though.

Anyway, a proper nr_reclaimed tracking should be rather straightforward
but I do not expect it to make a big difference in practice:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 026199c047e0..1b7f2d8cb128 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1633,7 +1633,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	LIST_HEAD(ret_folios);
 	LIST_HEAD(free_folios);
 	LIST_HEAD(demote_folios);
-	unsigned int nr_reclaimed = 0;
+	unsigned int nr_reclaimed = 0, nr_demoted = 0;
 	unsigned int pgactivate = 0;
 	bool do_demote_pass;
 	struct swap_iocb *plug = NULL;
@@ -2065,8 +2065,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	}
 	/* 'folio_list' is always empty here */
 
-	/* Migrate folios selected for demotion */
-	nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
+	/*
+	 * Migrate folios selected for demotion.
+	 * Do not consider demoted pages to be reclaimed for the memcg reclaim
+	 * because no charges are really freed during the migration. Global
+	 * reclaim aims at releasing memory from nodes/zones so consider
+	 * demotion to reclaim memory.
+	 */
+	nr_demoted += demote_folio_list(&demote_folios, pgdat);
+	if (!cgroup_reclaim(sc))
+		nr_reclaimed += nr_demoted;
+
 	/* Folios that could not be demoted are still in @demote_folios */
 	if (!list_empty(&demote_folios)) {
 		/* Folios which weren't demoted go back on @folio_list for retry: */

[...]
> > Either I am missing something or I simply do not understand why you are
> > hooked into nodemask so much. Why cannot we have a simple rule that
> > only global reclaim considers demotions as nr_reclaimed?
> >
>
> Thanks. I think this approach would work for most callers. My issue
> here is properly supporting the recently added nodes= arg[1] to
> memory.reclaim. If the user specifies all nodes or provides no arg,
> I'd like to treat it as memcg reclaim which doesn't count demotions.
> If the user provides the top tier nodes, I would like to count
> demotions as this interface is the way to trigger proactive demotion
> from top tier nodes.

Sorry to repeat myself but nodemask doesn't really make any difference.
The reclaim should only count pages which really free memory in the
domain. Even if you demote to a node outside of the given nodemask then
the charge stays with the existing memcg so it is rather dubious to
count it as a reclaimed page. Or do I still miss what you are trying to
achieve?
On Wed, Dec 7, 2022 at 3:12 AM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 06-12-22 17:55:58, Mina Almasry wrote: > > On Tue, Dec 6, 2022 at 11:55 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Tue 06-12-22 08:06:51, Mina Almasry wrote: > > > > On Tue, Dec 6, 2022 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > On Mon 05-12-22 18:34:05, Mina Almasry wrote: > > > > > > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > > > > > > reclaim"") enabled demotion in memcg reclaim, which is the right thing > > > > > > to do, however, it introduced a regression in the behavior of > > > > > > try_to_free_mem_cgroup_pages(). > > > > > > > > > > > > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to > > > > > > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage > > > > > > of the cgroup should reduce by nr_pages. The callers expect > > > > > > try_to_free_mem_cgroup_pages() to also return the number of pages > > > > > > reclaimed, not demoted. > > > > > > > > > > > > However, what try_to_free_mem_cgroup_pages() actually does is it > > > > > > unconditionally counts demoted pages as reclaimed pages. So in practice > > > > > > when it is called it will often demote nr_pages and return the number of > > > > > > demoted pages to the caller. Demoted pages don't lower the memcg usage, > > > > > > and so try_to_free_mem_cgroup_pages() is not actually doing what the > > > > > > callers want it to do. > > > > > > > > > > > > Various things work suboptimally on memory tiered systems or don't work > > > > > > at all due to this: > > > > > > > > > > > > - memory.high enforcement likely doesn't work (it just demotes nr_pages > > > > > > instead of lowering the memcg usage by nr_pages). > > > > > > - try_charge_memcg() will keep retrying the charge while > > > > > > try_to_free_mem_cgroup_pages() is just demoting pages and not actually > > > > > > making any room for the charge. > > > > > > > > > > This has been brought up during the review https://lore.kernel.org/all/YoYTEDD+c4GT0xYY@dhcp22.suse.cz/ > > > > > > > > > > > > > Ah, I did indeed miss this. Thanks for the pointer. However I don't > > > > understand this bit from your email (sorry I'm probably missing > > > > something): > > > > > > > > "I suspect this is rather unlikely situation, though. The last tear > > > > (without any fallback) should have some memory to reclaim most of > > > > the time." > > > > > > > > Reading the code in try_charge_memcg(), I don't see the last retry for > > > > try_to_free_mem_cgroup_pages() do anything special. My concern here is > > > > that try_charge_memcg() calls try_to_free_mem_cgroup_pages() > > > > MAX_RECLAIM_RETRIES times. Each time that call may demote pages and > > > > report back that it was able to 'reclaim' memory, but the charge keeps > > > > failing because the memcg reclaim didn't actually make room for the > > > > charge. What happens in this case? My understanding is that the memcg > > > > oom-killer gets wrongly invoked. > > > > > > The memcg reclaim shrinks from all zones in the allowed zonelist. In > > > general from all nodes. So unless the lower tier is outside of this > > > zonelist then there is a zone to reclaim from which cannot demote. > > > Correct? > > > > > > > Ah, thanks for pointing this out. I did indeed miss that the memcg > > reclaim tries to apply pressure equally to all the nodes. 
With some > > additional testing I'm able to see what you said: there could be no > > premature oom kill invocation because generally the memcg reclaim will > > find some pages to reclaim from lower tier nodes. > > > > I do find that the first call to try_to_free_mem_cgroup_pages() > > sometimes will mostly demote and not do much reclaim. I haven't been > > able to fully track the cause of that down but I suspect that the > > first call in my test will find most of the cgroup's memory on top > > tier nodes. However we do retry a bunch of times before we invoke oom, > > and in my testing subsequent calls will find plenty of memory in the > > lower tier nodes that it can reclaim. I'll update the commit message > > in the next version. > > In the past we used to break out early from the memcg reclaim if the > reclaim target has been completed - see 1ba6fc9af35b ("mm: vmscan: do > not share cgroup iteration between reclaimers"). But I do not see early > break from the reclaim anywhere anymore so the precise nr_reclaimed > tracking shouldn't make a lot of difference. There are cases where we > really have hard time to find proper candidates and need to dip into > hogher reclaim priorities though. > Yes, I agree. Thank you for your patient explanation. I now see why precise accounting of nr_reclaimed is not an issue (or at least as big an issue as I once thought). > Anyway a proper nr_reclaimed tracking should be rather straightforward > but I do not expect to make a big difference in practice > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 026199c047e0..1b7f2d8cb128 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1633,7 +1633,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > LIST_HEAD(ret_folios); > LIST_HEAD(free_folios); > LIST_HEAD(demote_folios); > - unsigned int nr_reclaimed = 0; > + unsigned int nr_reclaimed = 0, nr_demoted = 0; > unsigned int pgactivate = 0; > bool do_demote_pass; > struct swap_iocb *plug = NULL; > @@ -2065,8 +2065,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > } > /* 'folio_list' is always empty here */ > > - /* Migrate folios selected for demotion */ > - nr_reclaimed += demote_folio_list(&demote_folios, pgdat); > + /* > + * Migrate folios selected for demotion. > + * Do not consider demoted pages to be reclaimed for the memcg reclaim > + * because no charges are really freed during the migration. Global > + * reclaim aims at releasing memory from nodes/zones so consider > + * demotion to reclaim memory. > + */ > + nr_demoted += demote_folio_list(&demote_folios, pgdat); > + if (!cgroup_reclaim(sc)) > + nr_reclaimed += nr_demoted; > + > /* Folios that could not be demoted are still in @demote_folios */ > if (!list_empty(&demote_folios)) { > /* Folios which weren't demoted go back on @folio_list for retry: */ > > [...] Thank you again, but this patch breaks the memory.reclaim nodes arg for me. This is my test case. I run it on a machine with 2 memory tiers. Memory tier 1= nodes 0-2 Memory tier 2= node 3 mkdir -p /sys/fs/cgroup/unified/test cd /sys/fs/cgroup/unified/test echo $$ > cgroup.procs head -c 500m /dev/random > /tmp/testfile echo $$ > /sys/fs/cgroup/unified/cgroup.procs echo "1m nodes=0-2" > memory.reclaim In my opinion the expected behavior is for the kernel to demote 1mb of memory from nodes 0-2 to node 3. Actual behavior on the tip of mm-unstable is as expected. 
Actual behavior with your patch cherry-picked to mm-unstable is that the kernel demotes all 500mb of memory from nodes 0-2 to node 3, and returns -EAGAIN to the user. This may be the correct behavior you're intending, but it completely breaks the use case I implemented the nodes= arg for and listed on the commit message of that change. > > > Either I am missing something or I simply do not understand why you are > > > hooked into nodemask so much. Why cannot we have a simple rule that > > > only global reclaim considers demotions as nr_reclaimed? > > > > > > > Thanks. I think this approach would work for most callers. My issue > > here is properly supporting the recently added nodes= arg[1] to > > memory.reclaim. If the user specifies all nodes or provides no arg, > > I'd like to treat it as memcg reclaim which doesn't count demotions. > > If the user provides the top tier nodes, I would like to count > > demotions as this interface is the way to trigger proactive demotion > > from top tier nodes. > > Sorry to repeat myself but nodemask doesn't really make any difference. > The reclaim should only count pages which really free memory in the > domain. Even if you demote to a node outside of the given nodemask then > the charge stays with the existing memcg so it is rather dubious to > count it as a reclaimed page. Or do I still miss what you are trying to > achieve? > I think you've convinced me that the issue is not as serious as I first thought, thanks. We can maybe abandon this fix? If we do want to fix this now or in the future, if possible let the fix be compatible with memory.reclaim and its nodes= arg. I'm happy to provide patches for whatever fix you find acceptable. > -- > Michal Hocko > SUSE Labs
On Wed 07-12-22 13:43:55, Mina Almasry wrote: > On Wed, Dec 7, 2022 at 3:12 AM Michal Hocko <mhocko@suse.com> wrote: [...] > > Anyway a proper nr_reclaimed tracking should be rather straightforward > > but I do not expect to make a big difference in practice > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 026199c047e0..1b7f2d8cb128 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -1633,7 +1633,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > LIST_HEAD(ret_folios); > > LIST_HEAD(free_folios); > > LIST_HEAD(demote_folios); > > - unsigned int nr_reclaimed = 0; > > + unsigned int nr_reclaimed = 0, nr_demoted = 0; > > unsigned int pgactivate = 0; > > bool do_demote_pass; > > struct swap_iocb *plug = NULL; > > @@ -2065,8 +2065,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > } > > /* 'folio_list' is always empty here */ > > > > - /* Migrate folios selected for demotion */ > > - nr_reclaimed += demote_folio_list(&demote_folios, pgdat); > > + /* > > + * Migrate folios selected for demotion. > > + * Do not consider demoted pages to be reclaimed for the memcg reclaim > > + * because no charges are really freed during the migration. Global > > + * reclaim aims at releasing memory from nodes/zones so consider > > + * demotion to reclaim memory. > > + */ > > + nr_demoted += demote_folio_list(&demote_folios, pgdat); > > + if (!cgroup_reclaim(sc)) > > + nr_reclaimed += nr_demoted; > > + > > /* Folios that could not be demoted are still in @demote_folios */ > > if (!list_empty(&demote_folios)) { > > /* Folios which weren't demoted go back on @folio_list for retry: */ > > > > [...] > > Thank you again, but this patch breaks the memory.reclaim nodes arg > for me. This is my test case. I run it on a machine with 2 memory > tiers. > > Memory tier 1= nodes 0-2 > Memory tier 2= node 3 > > mkdir -p /sys/fs/cgroup/unified/test > cd /sys/fs/cgroup/unified/test > echo $$ > cgroup.procs > head -c 500m /dev/random > /tmp/testfile > echo $$ > /sys/fs/cgroup/unified/cgroup.procs > echo "1m nodes=0-2" > memory.reclaim > > In my opinion the expected behavior is for the kernel to demote 1mb of > memory from nodes 0-2 to node 3. > > Actual behavior on the tip of mm-unstable is as expected. > > Actual behavior with your patch cherry-picked to mm-unstable is that > the kernel demotes all 500mb of memory from nodes 0-2 to node 3, and > returns -EAGAIN to the user. This may be the correct behavior you're > intending, but it completely breaks the use case I implemented the > nodes= arg for and listed on the commit message of that change. Yes, strictly speaking the behavior is correct albeit unexpected. You have told the kernel to _reclaim_ that much memory but demotion are simply aging handling rather than a reclaim if the demotion target has a lot of memory free. This would be the case without any nodemask as well btw. I am worried this will popping up again and again. I thought your nodes subset approach could deal with this but I have overlooked one important thing in your patch. The user provided nodemask controls where to reclaim from but it doesn't constrain demotion targets. Is this intentional? Would it actually make more sense to control demotion by addint demotion nodes into the nodemask?
On Thu, Dec 8, 2022 at 12:09 AM Michal Hocko <mhocko@suse.com> wrote: > > On Wed 07-12-22 13:43:55, Mina Almasry wrote: > > On Wed, Dec 7, 2022 at 3:12 AM Michal Hocko <mhocko@suse.com> wrote: > [...] > > > Anyway a proper nr_reclaimed tracking should be rather straightforward > > > but I do not expect to make a big difference in practice > > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > index 026199c047e0..1b7f2d8cb128 100644 > > > --- a/mm/vmscan.c > > > +++ b/mm/vmscan.c > > > @@ -1633,7 +1633,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > > LIST_HEAD(ret_folios); > > > LIST_HEAD(free_folios); > > > LIST_HEAD(demote_folios); > > > - unsigned int nr_reclaimed = 0; > > > + unsigned int nr_reclaimed = 0, nr_demoted = 0; > > > unsigned int pgactivate = 0; > > > bool do_demote_pass; > > > struct swap_iocb *plug = NULL; > > > @@ -2065,8 +2065,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > > } > > > /* 'folio_list' is always empty here */ > > > > > > - /* Migrate folios selected for demotion */ > > > - nr_reclaimed += demote_folio_list(&demote_folios, pgdat); > > > + /* > > > + * Migrate folios selected for demotion. > > > + * Do not consider demoted pages to be reclaimed for the memcg reclaim > > > + * because no charges are really freed during the migration. Global > > > + * reclaim aims at releasing memory from nodes/zones so consider > > > + * demotion to reclaim memory. > > > + */ > > > + nr_demoted += demote_folio_list(&demote_folios, pgdat); > > > + if (!cgroup_reclaim(sc)) > > > + nr_reclaimed += nr_demoted; > > > + > > > /* Folios that could not be demoted are still in @demote_folios */ > > > if (!list_empty(&demote_folios)) { > > > /* Folios which weren't demoted go back on @folio_list for retry: */ > > > > > > [...] > > > > Thank you again, but this patch breaks the memory.reclaim nodes arg > > for me. This is my test case. I run it on a machine with 2 memory > > tiers. > > > > Memory tier 1= nodes 0-2 > > Memory tier 2= node 3 > > > > mkdir -p /sys/fs/cgroup/unified/test > > cd /sys/fs/cgroup/unified/test > > echo $$ > cgroup.procs > > head -c 500m /dev/random > /tmp/testfile > > echo $$ > /sys/fs/cgroup/unified/cgroup.procs > > echo "1m nodes=0-2" > memory.reclaim > > > > In my opinion the expected behavior is for the kernel to demote 1mb of > > memory from nodes 0-2 to node 3. > > > > Actual behavior on the tip of mm-unstable is as expected. > > > > Actual behavior with your patch cherry-picked to mm-unstable is that > > the kernel demotes all 500mb of memory from nodes 0-2 to node 3, and > > returns -EAGAIN to the user. This may be the correct behavior you're > > intending, but it completely breaks the use case I implemented the > > nodes= arg for and listed on the commit message of that change. > > Yes, strictly speaking the behavior is correct albeit unexpected. You > have told the kernel to _reclaim_ that much memory but demotion are > simply aging handling rather than a reclaim if the demotion target has a > lot of memory free. Yes, by the strict definition of reclaim, you're completely correct. But in reality earlier I proposed a patch to the kernel that disables demotion in proactive reclaim. That would have been a correct change by the strict definition of reclaim, but Johannes informed me that meta already has a dependency on proactive reclaim triggering demotion and directed me to add a nodes= arg to memory.reclaim to trigger demotion instead, to satisfy both use cases. 
Seems both us and meta are using this interface to trigger both reclaim and demotion, despite the strict definition of the word? > This would be the case without any nodemask as well > btw. > The difference here is that without any nodemask, the kernel will also reclaim from bottom tier nodes, which will count towards the user's request so the behavior makes sense in my opinion. If nodemask=NULL or nodemask=ALL_NODES, then the caller is asking for reclaim across all nodes which should not count demoted pages IMO. The kernel can still look for demotion opportunities, but the demoted pages would not count towards the user's request. > I am worried this will popping up again and again. I thought your nodes > subset approach could deal with this but I have overlooked one important > thing in your patch. The user provided nodemask controls where to > reclaim from but it doesn't constrain demotion targets. Is this > intentional? Would it actually make more sense to control demotion by > addint demotion nodes into the nodemask? > IMO, yes it is intentional, and no I would not recommend adding demotion nodes (I assume you mean adding both demote_from_nodes and demote_to_nodes as arg). My opinion is based on 2 reasons: 1. We control proactive reclaim by looking for nodes/tiers approaching pressure and triggering reclaim/demotion from said nodes/tiers. So we know the node/tier we would like to reclaim from, but not necessarily have a requirement on where the memory should go. I think it should be up to the kernel. 2. Currently I think most tiered machines will have 2 memory tiers, but I think the code is designed to support N memory tiers. What happens if N=3 and the user asks you to demote from the top tier nodes to the bottom tier nodes (skipping the middle tier)? The user in this case is explicitly asking to break the aging pipeline. From my short while on the mailing list I see great interest in respecting the aging pipeline, so it seems to me the demotion target should be decided by the kernel based on the aging pipeline, and the user should not be able to override it? I don't know. Maybe there is a valid use case for that somewhere. > -- > Michal Hocko > SUSE Labs
On Thu 08-12-22 01:00:40, Mina Almasry wrote: > On Thu, Dec 8, 2022 at 12:09 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Wed 07-12-22 13:43:55, Mina Almasry wrote: > > > On Wed, Dec 7, 2022 at 3:12 AM Michal Hocko <mhocko@suse.com> wrote: > > [...] > > > > Anyway a proper nr_reclaimed tracking should be rather straightforward > > > > but I do not expect to make a big difference in practice > > > > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > > index 026199c047e0..1b7f2d8cb128 100644 > > > > --- a/mm/vmscan.c > > > > +++ b/mm/vmscan.c > > > > @@ -1633,7 +1633,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > > > LIST_HEAD(ret_folios); > > > > LIST_HEAD(free_folios); > > > > LIST_HEAD(demote_folios); > > > > - unsigned int nr_reclaimed = 0; > > > > + unsigned int nr_reclaimed = 0, nr_demoted = 0; > > > > unsigned int pgactivate = 0; > > > > bool do_demote_pass; > > > > struct swap_iocb *plug = NULL; > > > > @@ -2065,8 +2065,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > > > } > > > > /* 'folio_list' is always empty here */ > > > > > > > > - /* Migrate folios selected for demotion */ > > > > - nr_reclaimed += demote_folio_list(&demote_folios, pgdat); > > > > + /* > > > > + * Migrate folios selected for demotion. > > > > + * Do not consider demoted pages to be reclaimed for the memcg reclaim > > > > + * because no charges are really freed during the migration. Global > > > > + * reclaim aims at releasing memory from nodes/zones so consider > > > > + * demotion to reclaim memory. > > > > + */ > > > > + nr_demoted += demote_folio_list(&demote_folios, pgdat); > > > > + if (!cgroup_reclaim(sc)) > > > > + nr_reclaimed += nr_demoted; > > > > + > > > > /* Folios that could not be demoted are still in @demote_folios */ > > > > if (!list_empty(&demote_folios)) { > > > > /* Folios which weren't demoted go back on @folio_list for retry: */ > > > > > > > > [...] > > > > > > Thank you again, but this patch breaks the memory.reclaim nodes arg > > > for me. This is my test case. I run it on a machine with 2 memory > > > tiers. > > > > > > Memory tier 1= nodes 0-2 > > > Memory tier 2= node 3 > > > > > > mkdir -p /sys/fs/cgroup/unified/test > > > cd /sys/fs/cgroup/unified/test > > > echo $$ > cgroup.procs > > > head -c 500m /dev/random > /tmp/testfile > > > echo $$ > /sys/fs/cgroup/unified/cgroup.procs > > > echo "1m nodes=0-2" > memory.reclaim > > > > > > In my opinion the expected behavior is for the kernel to demote 1mb of > > > memory from nodes 0-2 to node 3. > > > > > > Actual behavior on the tip of mm-unstable is as expected. > > > > > > Actual behavior with your patch cherry-picked to mm-unstable is that > > > the kernel demotes all 500mb of memory from nodes 0-2 to node 3, and > > > returns -EAGAIN to the user. This may be the correct behavior you're > > > intending, but it completely breaks the use case I implemented the > > > nodes= arg for and listed on the commit message of that change. > > > > Yes, strictly speaking the behavior is correct albeit unexpected. You > > have told the kernel to _reclaim_ that much memory but demotion are > > simply aging handling rather than a reclaim if the demotion target has a > > lot of memory free. > > Yes, by the strict definition of reclaim, you're completely correct. > But in reality earlier I proposed a patch to the kernel that disables > demotion in proactive reclaim. 
That would have been a correct change > by the strict definition of reclaim, but Johannes informed me that > meta already has a dependency on proactive reclaim triggering demotion > and directed me to add a nodes= arg to memory.reclaim to trigger > demotion instead, to satisfy both use cases. Seems both us and meta > are using this interface to trigger both reclaim and demotion, despite > the strict definition of the word? Well, demotion is a part of aging and that is a part of the reclaim so I believe we want both and demotion mostly an implementation detail. If you want to have a very precise control then the nodemask should drive you there. [...] > > I am worried this will popping up again and again. I thought your nodes > > subset approach could deal with this but I have overlooked one important > > thing in your patch. The user provided nodemask controls where to > > reclaim from but it doesn't constrain demotion targets. Is this > > intentional? Would it actually make more sense to control demotion by > > addint demotion nodes into the nodemask? > > > > IMO, yes it is intentional, and no I would not recommend adding > demotion nodes (I assume you mean adding both demote_from_nodes and > demote_to_nodes as arg). What I really mean is to add demotion nodes to the nodemask along with the set of nodes you want to reclaim from. To me that sounds like a more natural interface allowing for all sorts of usecases: - free up demotion targets (only specify demotion nodes in the mask) - control where to demote (e.g. select specific demotion target(s)) - do not demote at all (skip demotion nodes from the node mask) > My opinion is based on 2 reasons: > > 1. We control proactive reclaim by looking for nodes/tiers approaching > pressure and triggering reclaim/demotion from said nodes/tiers. So we > know the node/tier we would like to reclaim from, but not necessarily > have a requirement on where the memory should go. I think it should be > up to the kernel. > 2. Currently I think most tiered machines will have 2 memory tiers, > but I think the code is designed to support N memory tiers. What > happens if N=3 and the user asks you to demote from the top tier nodes > to the bottom tier nodes (skipping the middle tier)? The user in this > case is explicitly asking to break the aging pipeline. From my short > while on the mailing list I see great interest in respecting the aging > pipeline, so it seems to me the demotion target should be decided by > the kernel based on the aging pipeline, and the user should not be > able to override it? I don't know. Maybe there is a valid use case for > that somewhere. I agree that the agining should be preserved as much as possible unless there is an explicit requirement to do otherwise which might be something application specific. It is really hard to assume all the usecases at this stage but we should keep in mind that the semantic of the interface will get cast into stone once it is released. As of now I do see a great confusion point in the nodemask semantic which pretends to allow some fine control while it is largerly hard to predict because it makes some assumptions about the reclaim while it has a very limited control of the demotion.
On Thu, Dec 8, 2022 at 3:54 AM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 08-12-22 01:00:40, Mina Almasry wrote: > > On Thu, Dec 8, 2022 at 12:09 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Wed 07-12-22 13:43:55, Mina Almasry wrote: > > > > On Wed, Dec 7, 2022 at 3:12 AM Michal Hocko <mhocko@suse.com> wrote: > > > [...] > > > > > Anyway a proper nr_reclaimed tracking should be rather straightforward > > > > > but I do not expect to make a big difference in practice > > > > > > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > > > index 026199c047e0..1b7f2d8cb128 100644 > > > > > --- a/mm/vmscan.c > > > > > +++ b/mm/vmscan.c > > > > > @@ -1633,7 +1633,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > > > > LIST_HEAD(ret_folios); > > > > > LIST_HEAD(free_folios); > > > > > LIST_HEAD(demote_folios); > > > > > - unsigned int nr_reclaimed = 0; > > > > > + unsigned int nr_reclaimed = 0, nr_demoted = 0; > > > > > unsigned int pgactivate = 0; > > > > > bool do_demote_pass; > > > > > struct swap_iocb *plug = NULL; > > > > > @@ -2065,8 +2065,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > > > > } > > > > > /* 'folio_list' is always empty here */ > > > > > > > > > > - /* Migrate folios selected for demotion */ > > > > > - nr_reclaimed += demote_folio_list(&demote_folios, pgdat); > > > > > + /* > > > > > + * Migrate folios selected for demotion. > > > > > + * Do not consider demoted pages to be reclaimed for the memcg reclaim > > > > > + * because no charges are really freed during the migration. Global > > > > > + * reclaim aims at releasing memory from nodes/zones so consider > > > > > + * demotion to reclaim memory. > > > > > + */ > > > > > + nr_demoted += demote_folio_list(&demote_folios, pgdat); > > > > > + if (!cgroup_reclaim(sc)) > > > > > + nr_reclaimed += nr_demoted; > > > > > + > > > > > /* Folios that could not be demoted are still in @demote_folios */ > > > > > if (!list_empty(&demote_folios)) { > > > > > /* Folios which weren't demoted go back on @folio_list for retry: */ > > > > > > > > > > [...] > > > > > > > > Thank you again, but this patch breaks the memory.reclaim nodes arg > > > > for me. This is my test case. I run it on a machine with 2 memory > > > > tiers. > > > > > > > > Memory tier 1= nodes 0-2 > > > > Memory tier 2= node 3 > > > > > > > > mkdir -p /sys/fs/cgroup/unified/test > > > > cd /sys/fs/cgroup/unified/test > > > > echo $$ > cgroup.procs > > > > head -c 500m /dev/random > /tmp/testfile > > > > echo $$ > /sys/fs/cgroup/unified/cgroup.procs > > > > echo "1m nodes=0-2" > memory.reclaim > > > > > > > > In my opinion the expected behavior is for the kernel to demote 1mb of > > > > memory from nodes 0-2 to node 3. > > > > > > > > Actual behavior on the tip of mm-unstable is as expected. > > > > > > > > Actual behavior with your patch cherry-picked to mm-unstable is that > > > > the kernel demotes all 500mb of memory from nodes 0-2 to node 3, and > > > > returns -EAGAIN to the user. This may be the correct behavior you're > > > > intending, but it completely breaks the use case I implemented the > > > > nodes= arg for and listed on the commit message of that change. > > > > > > Yes, strictly speaking the behavior is correct albeit unexpected. You > > > have told the kernel to _reclaim_ that much memory but demotion are > > > simply aging handling rather than a reclaim if the demotion target has a > > > lot of memory free. > > > > Yes, by the strict definition of reclaim, you're completely correct. 
> > But in reality earlier I proposed a patch to the kernel that disables > > demotion in proactive reclaim. That would have been a correct change > > by the strict definition of reclaim, but Johannes informed me that > > meta already has a dependency on proactive reclaim triggering demotion > > and directed me to add a nodes= arg to memory.reclaim to trigger > > demotion instead, to satisfy both use cases. Seems both us and meta > > are using this interface to trigger both reclaim and demotion, despite > > the strict definition of the word? > > Well, demotion is a part of aging and that is a part of the reclaim so I > believe we want both and demotion mostly an implementation detail. If > you want to have a very precise control then the nodemask should drive > you there. > > [...] > > > I am worried this will popping up again and again. I thought your nodes > > > subset approach could deal with this but I have overlooked one important > > > thing in your patch. The user provided nodemask controls where to > > > reclaim from but it doesn't constrain demotion targets. Is this > > > intentional? Would it actually make more sense to control demotion by > > > addint demotion nodes into the nodemask? > > > > > > > IMO, yes it is intentional, and no I would not recommend adding > > demotion nodes (I assume you mean adding both demote_from_nodes and > > demote_to_nodes as arg). > > What I really mean is to add demotion nodes to the nodemask along with > the set of nodes you want to reclaim from. To me that sounds like a > more natural interface allowing for all sorts of usecases: > - free up demotion targets (only specify demotion nodes in the mask) > - control where to demote (e.g. select specific demotion target(s)) > - do not demote at all (skip demotion nodes from the node mask) For clarification, do you mean to add another argument (e.g. demotion_nodes) in addition to the "nodes" argument? Also, which meaning do these arguments have? - Option 1: The "nodes" argument specifies the source nodes where pages should be reclaimed (but not demoted) from and the "demotion_nodes" argument specifies the source nodes where pages should be demoted (but not reclaimed) from. - Option 2: The "nodes" argument specifies the source nodes where pages should be reclaimed or demoted from and the "demotion_nodes" argument specifies the allowed demotion target nodes. Option 1 is more flexible in allowing various use cases. Option 2 can also support the use cases to reclaim only without demotion and demotion with fallback to reclaim, but cannot support demotion only without reclaim. > > My opinion is based on 2 reasons: > > > > 1. We control proactive reclaim by looking for nodes/tiers approaching > > pressure and triggering reclaim/demotion from said nodes/tiers. So we > > know the node/tier we would like to reclaim from, but not necessarily > > have a requirement on where the memory should go. I think it should be > > up to the kernel. > > 2. Currently I think most tiered machines will have 2 memory tiers, > > but I think the code is designed to support N memory tiers. What > > happens if N=3 and the user asks you to demote from the top tier nodes > > to the bottom tier nodes (skipping the middle tier)? The user in this > > case is explicitly asking to break the aging pipeline. 
From my short > > while on the mailing list I see great interest in respecting the aging > > pipeline, so it seems to me the demotion target should be decided by > > the kernel based on the aging pipeline, and the user should not be > > able to override it? I don't know. Maybe there is a valid use case for > > that somewhere. > > I agree that the agining should be preserved as much as possible unless > there is an explicit requirement to do otherwise which might be > something application specific. > > It is really hard to assume all the usecases at this stage but we should > keep in mind that the semantic of the interface will get cast into stone > once it is released. As of now I do see a great confusion point in the > nodemask semantic which pretends to allow some fine control while it is > largerly hard to predict because it makes some assumptions about the > reclaim while it has a very limited control of the demotion. > > -- > Michal Hocko > SUSE Labs
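To make the two options above concrete (purely illustrative syntax; the demotion_nodes argument is only a proposal in this thread and does not exist):

  # Option 1: "nodes" = reclaim-only source nodes, "demotion_nodes" =
  # demote-only source nodes
  echo "1m nodes=3 demotion_nodes=0-2" > memory.reclaim

  # Option 2: "nodes" = source nodes for reclaim or demotion,
  # "demotion_nodes" = allowed demotion target nodes
  echo "1m nodes=0-2 demotion_nodes=3" > memory.reclaim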
On Thu 08-12-22 16:59:36, Wei Xu wrote:
[...]
> > What I really mean is to add demotion nodes to the nodemask along with
> > the set of nodes you want to reclaim from. To me that sounds like a
> > more natural interface allowing for all sorts of usecases:
> > - free up demotion targets (only specify demotion nodes in the mask)
> > - control where to demote (e.g. select specific demotion target(s))
> > - do not demote at all (skip demotion nodes from the node mask)
>
> For clarification, do you mean to add another argument (e.g.
> demotion_nodes) in addition to the "nodes" argument?

No, the nodes=mask argument should control the domain where the memory
reclaim should happen. That includes both aging and the reclaim. If the
mask doesn't contain any lower tier node then no demotion will happen.
If only a subset of lower tiers is specified then only those could be
used for the demotion process. Or to put it otherwise, the nodemask is
not only used to filter out zonelists during reclaim, it also restricts
migration targets.

Is this more clear now?
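A sketch of how the semantics Michal describes would look on the two-tier machine from the earlier test case (this is the proposed behaviour, not what the current nodes= argument in mm-unstable does):

  # lower tier node 3 included in the mask: pages on 0-2 may be demoted
  # to node 3, pages on node 3 may be reclaimed
  echo "1m nodes=0-3" > memory.reclaim

  # no lower tier node in the mask: no demotion at all, reclaim only
  # from nodes 0-2
  echo "1m nodes=0-2" > memory.reclaim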
On Fri, Dec 9, 2022 at 12:08 AM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 08-12-22 16:59:36, Wei Xu wrote: > [...] > > > What I really mean is to add demotion nodes to the nodemask along with > > > the set of nodes you want to reclaim from. To me that sounds like a > > > more natural interface allowing for all sorts of usecases: > > > - free up demotion targets (only specify demotion nodes in the mask) > > > - control where to demote (e.g. select specific demotion target(s)) > > > - do not demote at all (skip demotion nodes from the node mask) > > > > For clarification, do you mean to add another argument (e.g. > > demotion_nodes) in addition to the "nodes" argument? > > No, nodes=mask argument should control the domain where the memory > reclaim should happen. That includes both aging and the reclaim. If the > mask doesn't contain any lower tier node then no demotion will happen. > If only a subset of lower tiers are specified then only those could be > used for the demotion process. Or put it otherwise, the nodemask is not > only used to filter out zonelists during reclaim it also restricts > migration targets. > > Is this more clear now? In that case, how can we request demotion only from toptier nodes (without counting any reclaimed bytes from other nodes), which is our memory tiering use case? Besides, when both toptier and demotion nodes are specified, the demoted pages should only be counted as aging and not be counted towards the requested bytes of try_to_free_mem_cgroup_pages(), which is what this patch tries to address. > -- > Michal Hocko > SUSE Labs
On Fri 09-12-22 08:41:47, Wei Xu wrote: > On Fri, Dec 9, 2022 at 12:08 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Thu 08-12-22 16:59:36, Wei Xu wrote: > > [...] > > > > What I really mean is to add demotion nodes to the nodemask along with > > > > the set of nodes you want to reclaim from. To me that sounds like a > > > > more natural interface allowing for all sorts of usecases: > > > > - free up demotion targets (only specify demotion nodes in the mask) > > > > - control where to demote (e.g. select specific demotion target(s)) > > > > - do not demote at all (skip demotion nodes from the node mask) > > > > > > For clarification, do you mean to add another argument (e.g. > > > demotion_nodes) in addition to the "nodes" argument? > > > > No, nodes=mask argument should control the domain where the memory > > reclaim should happen. That includes both aging and the reclaim. If the > > mask doesn't contain any lower tier node then no demotion will happen. > > If only a subset of lower tiers are specified then only those could be > > used for the demotion process. Or put it otherwise, the nodemask is not > > only used to filter out zonelists during reclaim it also restricts > > migration targets. > > > > Is this more clear now? > > In that case, how can we request demotion only from toptier nodes > (without counting any reclaimed bytes from other nodes), which is our > memory tiering use case? I am not sure I follow. Could you be more specific please? > Besides, when both toptier and demotion nodes are specified, the > demoted pages should only be counted as aging and not be counted > towards the requested bytes of try_to_free_mem_cgroup_pages(), which > is what this patch tries to address. This should be addressed by http://lkml.kernel.org/r/Y5B1K5zAE0PkjFZx@dhcp22.suse.cz, no?
On Fri, Dec 9, 2022 at 1:16 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 09-12-22 08:41:47, Wei Xu wrote:
> > On Fri, Dec 9, 2022 at 12:08 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 08-12-22 16:59:36, Wei Xu wrote:
> > > [...]
> > > > > What I really mean is to add demotion nodes to the nodemask along with
> > > > > the set of nodes you want to reclaim from. To me that sounds like a
> > > > > more natural interface allowing for all sorts of usecases:
> > > > > - free up demotion targets (only specify demotion nodes in the mask)
> > > > > - control where to demote (e.g. select specific demotion target(s))
> > > > > - do not demote at all (skip demotion nodes from the node mask)
> > > >
> > > > For clarification, do you mean to add another argument (e.g.
> > > > demotion_nodes) in addition to the "nodes" argument?
> > >
> > > No, nodes=mask argument should control the domain where the memory
> > > reclaim should happen. That includes both aging and the reclaim. If the
> > > mask doesn't contain any lower tier node then no demotion will happen.
> > > If only a subset of lower tiers are specified then only those could be
> > > used for the demotion process. Or put it otherwise, the nodemask is not
> > > only used to filter out zonelists during reclaim it also restricts
> > > migration targets.
> > >
> > > Is this more clear now?
> >

I think putting the demotion sources and demotion targets in the same
nodemask is a bit confusing, and prone to error. IIUC the user puts
both the demotion source and the demotion target in the nodemask, and
the kernel infers which is which depending on whether the node is a
top-tier node, or a bottom tier node. I think in the future this will
become ambiguous. What happens in the future when the machine has N
memory tiers and the user specifies a node in a middle tier in the
nodemask? Does that mean the user wants demotion from or to this node?
Middle memory tiers can act as both...

I think if your goal is to constrain demotion targets then a much
clearer and more future-proof way is to simply add a second arg to
memory.reclaim, "allowed_demotion_targets=".

> > In that case, how can we request demotion only from toptier nodes
> > (without counting any reclaimed bytes from other nodes), which is our
> > memory tiering use case?
>
> I am not sure I follow. Could you be more specific please?
>
> > Besides, when both toptier and demotion nodes are specified, the
> > demoted pages should only be counted as aging and not be counted
> > towards the requested bytes of try_to_free_mem_cgroup_pages(), which
> > is what this patch tries to address.
>
> This should be addressed by
> http://lkml.kernel.org/r/Y5B1K5zAE0PkjFZx@dhcp22.suse.cz, no?

I think I provided a test case in [1] showing very clearly that this
breaks one of our use cases, i.e. the use case where the user is
asking to demote X bytes from the top tier nodes to the lower tier
nodes. I would not like to proceed with a fix that breaks one of our
use cases. I believe I provided in this patch a fix that caters to all
existing users, and we should take the fix in this patch over a fix
that breaks use cases.

[1] https://lore.kernel.org/all/CAHS8izMKK107wVFSJvg36nQ=WzXd8_cjYBtR0p47L+XLYUSsqA@mail.gmail.com/

> --
> Michal Hocko
> SUSE Labs
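For concreteness, the second argument suggested above might look something like this (hypothetical syntax, not implemented anywhere):

  # demote from the top tier nodes 0-2, constraining the demotion
  # target to node 3
  echo "1m nodes=0-2 allowed_demotion_targets=3" > memory.reclaim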
On Fri, Dec 9, 2022 at 1:16 PM Michal Hocko <mhocko@suse.com> wrote: > > On Fri 09-12-22 08:41:47, Wei Xu wrote: > > On Fri, Dec 9, 2022 at 12:08 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Thu 08-12-22 16:59:36, Wei Xu wrote: > > > [...] > > > > > What I really mean is to add demotion nodes to the nodemask along with > > > > > the set of nodes you want to reclaim from. To me that sounds like a > > > > > more natural interface allowing for all sorts of usecases: > > > > > - free up demotion targets (only specify demotion nodes in the mask) > > > > > - control where to demote (e.g. select specific demotion target(s)) > > > > > - do not demote at all (skip demotion nodes from the node mask) > > > > > > > > For clarification, do you mean to add another argument (e.g. > > > > demotion_nodes) in addition to the "nodes" argument? > > > > > > No, nodes=mask argument should control the domain where the memory > > > reclaim should happen. That includes both aging and the reclaim. If the > > > mask doesn't contain any lower tier node then no demotion will happen. > > > If only a subset of lower tiers are specified then only those could be > > > used for the demotion process. Or put it otherwise, the nodemask is not > > > only used to filter out zonelists during reclaim it also restricts > > > migration targets. > > > > > > Is this more clear now? > > > > In that case, how can we request demotion only from toptier nodes > > (without counting any reclaimed bytes from other nodes), which is our > > memory tiering use case? > > I am not sure I follow. Could you be more specific please? In our memory tiering use case, we would like to proactively free up memory on top-tier nodes by demoting cold pages to lower-tier nodes. This is to create enough free top-tier memory for new allocations and promotions. How many pages and how often to demote from top-tier nodes can depend on a number of factors (e.g. the amount of free top-tier memory, the amount of cold pages, the bandwidth pressure on lower-tier, the task tolerance of slower memory on performance) and are controlled by the userspace policies. Because the purpose of such proactive demotions is to free up top-tier memory, not to lower the amount of memory charged to the memcg, we'd like that memory.reclaim can demote the specified amount of bytes from the given top-tier nodes. If we have to also provide the lower-tier nodes to memory.reclaim to allow demotions, the kernel can reclaim from the lower-tier nodes in the same memory.reclaim request. We then won't be able to control the amount of bytes to be demoted from top-tier nodes. > > Besides, when both toptier and demotion nodes are specified, the > > demoted pages should only be counted as aging and not be counted > > towards the requested bytes of try_to_free_mem_cgroup_pages(), which > > is what this patch tries to address. > > This should be addressed by > http://lkml.kernel.org/r/Y5B1K5zAE0PkjFZx@dhcp22.suse.cz, no? > -- > Michal Hocko > SUSE Labs
On Fri 09-12-22 13:39:44, Mina Almasry wrote: > On Fri, Dec 9, 2022 at 1:16 PM Michal Hocko <mhocko@suse.com> wrote: > > > > On Fri 09-12-22 08:41:47, Wei Xu wrote: > > > On Fri, Dec 9, 2022 at 12:08 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Thu 08-12-22 16:59:36, Wei Xu wrote: > > > > [...] > > > > > > What I really mean is to add demotion nodes to the nodemask along with > > > > > > the set of nodes you want to reclaim from. To me that sounds like a > > > > > > more natural interface allowing for all sorts of usecases: > > > > > > - free up demotion targets (only specify demotion nodes in the mask) > > > > > > - control where to demote (e.g. select specific demotion target(s)) > > > > > > - do not demote at all (skip demotion nodes from the node mask) > > > > > > > > > > For clarification, do you mean to add another argument (e.g. > > > > > demotion_nodes) in addition to the "nodes" argument? > > > > > > > > No, nodes=mask argument should control the domain where the memory > > > > reclaim should happen. That includes both aging and the reclaim. If the > > > > mask doesn't contain any lower tier node then no demotion will happen. > > > > If only a subset of lower tiers are specified then only those could be > > > > used for the demotion process. Or put it otherwise, the nodemask is not > > > > only used to filter out zonelists during reclaim it also restricts > > > > migration targets. > > > > > > > > Is this more clear now? > > > > > I think putting the demotion sources and demotion targets in the same > nodemask is a bit confusing, and prone to error. IIUC the user puts > both the demotion source and the demotion target in the nodemaks, and > the kernel infers which is which depending on whether the node is a > top-tier node, or a bottom tier node. The level of the tier is not really all that important. All you request from the userspace is to free up some charges based on aging (it is a reclaim interface). The specified nodemask doesn't and cannot really tell anything about how that is achieved because that is an internal implementation detail. The nodemask extension defines a domain where that aging and reclaim should happen. Treating demotion special and assuming a certain behavior is a free ticket to future headaches. > I think in the future this will > become ambiguous. What happens in the future when the user when the > machine has N memory tiers and the user specifies a node in a middle > tier in the nodemask? Does that mean the user wants demotion from or > to this node? Middle memory tiers can act as both... It's really very simple. The memory reclaim (both aging and actual eviction) is bound to the given nodemask. Full stop. See how that doesn't say anything about demotion? It simply is not important. Say you have a multi-layer system with nodes 0-3 with RAM, 4, 5 with tier 1 and nodes 6, 7 as a second tier with the following topology. RAM nodes can demote to any lower tiers. Node 4 can demote to 6 and node 5 to node 7. Now say you want to release some memory from node 2 and 3 to release some memory from there also globally yet node 5 should be avoided for whatever reason. If you reclaim with nodemask=2,3,4,6 will achieve exactly that without any ambiguity. Each node in the node mask will get aged and demotion targets will be filtered based on the nodemask as well so you cannot really end up aging (and demoting from) nodes 5, 7 even if they are part of the demotion path. 
This is simply impossible if the demotion target is implicit and it can live outside of the provided nodemask. Are you on the same page that this would be problematic? > I think if your goal is to constrain demotion targets then a much more > clear and future proof way is to simply add a second arg to > memory.reclaim "allowed_demotion_targets=".\ That would be an option as well but what is the exact semantic? Do you control the demotion path as well? > > > In that case, how can we request demotion only from toptier nodes > > > (without counting any reclaimed bytes from other nodes), which is our > > > memory tiering use case? > > > > I am not sure I follow. Could you be more specific please? > > > > > Besides, when both toptier and demotion nodes are specified, the > > > demoted pages should only be counted as aging and not be counted > > > towards the requested bytes of try_to_free_mem_cgroup_pages(), which > > > is what this patch tries to address. > > > > This should be addressed by > > http://lkml.kernel.org/r/Y5B1K5zAE0PkjFZx@dhcp22.suse.cz, no? > > I think I provided a test case in [1] showing very clearly that this > breaks one of our use cases, i.e. the use case where the user is > asking to demote X bytes from the top tier nodes to the lower tier > nodes. Are you referring to EAGAIN situation as a result of the nr_reclaimed patch? Please note that the patch was untested and likely incomplete. Also really slightly independent on the actual nodemask reclaim extension. Btw. based on these discussions I have to retract my ack and reopen the discussion (will follow up on the original email thread). The current semantic is really concerning.
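Putting Michal's eight-node example into a concrete invocation (again under his proposed semantics, where the nodemask also filters demotion targets; the current nodes= argument does not behave this way):

  # RAM nodes 0-3, tier-1 nodes 4-5, tier-2 nodes 6-7; node 4 demotes
  # to 6 and node 5 to 7. Age/reclaim nodes 2, 3, 4 and 6 while leaving
  # nodes 5 and 7 untouched:
  echo "100m nodes=2,3,4,6" > memory.reclaim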
On Sat 10-12-22 00:01:28, Wei Xu wrote:
> On Fri, Dec 9, 2022 at 1:16 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 09-12-22 08:41:47, Wei Xu wrote:
> > > On Fri, Dec 9, 2022 at 12:08 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 08-12-22 16:59:36, Wei Xu wrote:
> > > > [...]
> > > > > > What I really mean is to add demotion nodes to the nodemask along with
> > > > > > the set of nodes you want to reclaim from. To me that sounds like a
> > > > > > more natural interface allowing for all sorts of usecases:
> > > > > > - free up demotion targets (only specify demotion nodes in the mask)
> > > > > > - control where to demote (e.g. select specific demotion target(s))
> > > > > > - do not demote at all (skip demotion nodes from the node mask)
> > > > > >
> > > > > For clarification, do you mean to add another argument (e.g.
> > > > > demotion_nodes) in addition to the "nodes" argument?
> > > >
> > > > No, nodes=mask argument should control the domain where the memory
> > > > reclaim should happen. That includes both aging and the reclaim. If the
> > > > mask doesn't contain any lower tier node then no demotion will happen.
> > > > If only a subset of lower tiers are specified then only those could be
> > > > used for the demotion process. Or put it otherwise, the nodemask is not
> > > > only used to filter out zonelists during reclaim, it also restricts
> > > > migration targets.
> > > >
> > > > Is this more clear now?
> > >
> > > In that case, how can we request demotion only from toptier nodes
> > > (without counting any reclaimed bytes from other nodes), which is our
> > > memory tiering use case?
> >
> > I am not sure I follow. Could you be more specific please?
>
> In our memory tiering use case, we would like to proactively free up
> memory on top-tier nodes by demoting cold pages to lower-tier nodes.
> This is to create enough free top-tier memory for new allocations and
> promotions. How many pages and how often to demote from top-tier
> nodes can depend on a number of factors (e.g. the amount of free
> top-tier memory, the amount of cold pages, the bandwidth pressure on
> the lower tier, the task tolerance of slower memory on performance) and
> are controlled by the userspace policies.
>
> Because the purpose of such proactive demotions is to free up top-tier
> memory, not to lower the amount of memory charged to the memcg, we'd
> like that memory.reclaim can demote the specified amount of bytes from
> the given top-tier nodes. If we have to also provide the lower-tier
> nodes to memory.reclaim to allow demotions, the kernel can reclaim
> from the lower-tier nodes in the same memory.reclaim request. We then
> won't be able to control the amount of bytes to be demoted from
> top-tier nodes.

I am not sure this is something to be handled by the reclaim interface,
because now you are creating an ambiguity about what the interface
should do and starting to depend on it. Consider that we change the
reclaim algorithm in the future and pages on the node you request to
demote from are simply reclaimed rather than demoted. That would break
your use case, right?
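Wei's proactive-demotion flow, spelled out under the same assumed
nodes= syntax (the 512M amount, the node numbers and the cgroup path
are hypothetical):

  # Intent: demote roughly 512M of cold pages off the top-tier nodes
  # 0-3. Whether demoted bytes count toward the requested 512M, and
  # whether demotion may happen at all when no lower-tier node is in
  # the mask, is exactly what this thread is debating.
  echo "512M nodes=0,1,2,3" > /sys/fs/cgroup/<group>/memory.reclaim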
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2b42ac9ad755..f324e80395c3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1555,13 +1555,18 @@ static void folio_check_dirty_writeback(struct folio *folio,
 		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
 }
 
+struct demotion_control {
+	struct migration_target_control *mtc;
+	nodemask_t *demote_from_nodemask;
+	unsigned long nr_demoted_outside_nodemask;
+};
+
 static struct page *alloc_demote_page(struct page *page, unsigned long private)
 {
 	struct page *target_page;
 	nodemask_t *allowed_mask;
-	struct migration_target_control *mtc;
-
-	mtc = (struct migration_target_control *)private;
+	struct demotion_control *dc = (struct demotion_control *)private;
+	struct migration_target_control *mtc = dc->mtc;
 
 	allowed_mask = mtc->nmask;
 	/*
@@ -1576,13 +1581,31 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
 	mtc->nmask = NULL;
 	mtc->gfp_mask |= __GFP_THISNODE;
 	target_page = alloc_migration_target(page, (unsigned long)mtc);
-	if (target_page)
+	if (!target_page) {
+		mtc->gfp_mask &= ~__GFP_THISNODE;
+		mtc->nmask = allowed_mask;
+		target_page = alloc_migration_target(page, (unsigned long)mtc);
+	}
+
+	if (!target_page)
 		return target_page;
 
-	mtc->gfp_mask &= ~__GFP_THISNODE;
-	mtc->nmask = allowed_mask;
+	if (dc->demote_from_nodemask &&
+	    !node_isset(page_to_nid(target_page), *dc->demote_from_nodemask))
+		dc->nr_demoted_outside_nodemask++;
 
-	return alloc_migration_target(page, (unsigned long)mtc);
+	return target_page;
+}
+
+void free_demote_page(struct page *page, unsigned long private)
+{
+	struct demotion_control *dc = (struct demotion_control *)private;
+
+	if (dc->demote_from_nodemask &&
+	    !node_isset(page_to_nid(page), *dc->demote_from_nodemask))
+		dc->nr_demoted_outside_nodemask--;
+
+	folio_put(page_folio(page));
 }
 
 /*
@@ -1590,7 +1613,8 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
  * Folios which are not demoted are left on @demote_folios.
  */
 static unsigned int demote_folio_list(struct list_head *demote_folios,
-				      struct pglist_data *pgdat)
+				      struct pglist_data *pgdat,
+				      nodemask_t *nodemask)
 {
 	int target_nid = next_demotion_node(pgdat->node_id);
 	unsigned int nr_succeeded;
@@ -1608,6 +1632,12 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 		.nmask = &allowed_mask
 	};
 
+	struct demotion_control dc = {
+		.mtc = &mtc,
+		.demote_from_nodemask = nodemask,
+		.nr_demoted_outside_nodemask = 0,
+	};
+
 	if (list_empty(demote_folios))
 		return 0;
 
@@ -1617,13 +1647,13 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 	node_get_allowed_targets(pgdat, &allowed_mask);
 
 	/* Demotion ignores all cpuset and mempolicy settings */
-	migrate_pages(demote_folios, alloc_demote_page, NULL,
-		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
+	migrate_pages(demote_folios, alloc_demote_page, free_demote_page,
+		      (unsigned long)&dc, MIGRATE_ASYNC, MR_DEMOTION,
 		      &nr_succeeded);
 
 	__count_vm_events(PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
 
-	return nr_succeeded;
+	return dc.nr_demoted_outside_nodemask;
 }
 
 static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask)
@@ -1643,7 +1673,12 @@ static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask)
 }
 
 /*
- * shrink_folio_list() returns the number of reclaimed pages
+ * shrink_folio_list() returns the number of reclaimed pages.
+ *
+ * Demoted pages are counted as reclaimed iff:
+ *   (a) sc->nodemask arg is provided.
+ *   (b) page has been demoted from a node inside sc->nodemask to a node
+ *       outside sc->nodemask.
  */
 static unsigned int shrink_folio_list(struct list_head *folio_list,
 		struct pglist_data *pgdat, struct scan_control *sc,
@@ -1653,6 +1688,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	LIST_HEAD(free_folios);
 	LIST_HEAD(demote_folios);
 	unsigned int nr_reclaimed = 0;
+	unsigned int nr_demoted_outside_nodemask = 0;
 	unsigned int pgactivate = 0;
 	bool do_demote_pass;
 	struct swap_iocb *plug = NULL;
@@ -2085,7 +2121,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	/* 'folio_list' is always empty here */
 
 	/* Migrate folios selected for demotion */
-	nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
+	nr_demoted_outside_nodemask =
+		demote_folio_list(&demote_folios, pgdat, sc->nodemask);
+
+	if (sc->nodemask)
+		nr_reclaimed += nr_demoted_outside_nodemask;
+
 	/* Folios that could not be demoted are still in @demote_folios */
 	if (!list_empty(&demote_folios)) {
 		/* Folios which weren't demoted go back on @folio_list */
@@ -2130,9 +2171,11 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 					   struct list_head *folio_list)
 {
+	nodemask_t nodemask = NODE_MASK_NONE;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.may_unmap = 1,
+		.nodemask = &nodemask
 	};
 	struct reclaim_stat stat;
 	unsigned int nr_reclaimed;
@@ -2140,6 +2183,12 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 	LIST_HEAD(clean_folios);
 	unsigned int noreclaim_flag;
 
+	/*
+	 * Set the nodemask in sc to indicate to shrink_folio_list() that we're
+	 * looking for reclaim from this node.
+	 */
+	node_set(zone->zone_pgdat->node_id, nodemask);
+
 	list_for_each_entry_safe(folio, next, folio_list, lru) {
 		if (!folio_test_hugetlb(folio) && folio_is_file_lru(folio) &&
 		    !folio_test_dirty(folio) && !__folio_test_movable(folio) &&
@@ -7031,12 +7080,20 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
 	unsigned long zone_boosts[MAX_NR_ZONES] = { 0, };
 	bool boosted;
 	struct zone *zone;
+	nodemask_t nodemask = NODE_MASK_NONE;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.order = order,
 		.may_unmap = 1,
+		.nodemask = &nodemask,
 	};
 
+	/*
+	 * Set the nodemask in sc to indicate to kswapd_shrink_node() that we're
+	 * looking for reclaim from this node.
	 */
+	node_set(pgdat->node_id, nodemask);
+
 	set_task_reclaim_state(current, &sc.reclaim_state);
 	psi_memstall_enter(&pflags);
 	__fs_reclaim_acquire(_THIS_IP_);
@@ -7642,6 +7699,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	const unsigned long nr_pages = 1 << order;
 	struct task_struct *p = current;
 	unsigned int noreclaim_flag;
+	nodemask_t nodemask = NODE_MASK_NONE;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
 		.gfp_mask = current_gfp_context(gfp_mask),
@@ -7651,9 +7709,16 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
 		.may_swap = 1,
 		.reclaim_idx = gfp_zone(gfp_mask),
+		.nodemask = &nodemask,
 	};
 	unsigned long pflags;
 
+	/*
+	 * Set the nodemask in sc to indicate to shrink_node() that we're
+	 * looking for reclaim from this node.
+	 */
+	node_set(pgdat->node_id, nodemask);
+
 	trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order, sc.gfp_mask);
commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
reclaim"") enabled demotion in memcg reclaim, which is the right thing
to do; however, it introduced a regression in the behavior of
try_to_free_mem_cgroup_pages().

The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
reclaim - not demote - nr_pages from the cgroup, i.e. the memory usage
of the cgroup should reduce by nr_pages. The callers also expect
try_to_free_mem_cgroup_pages() to return the number of pages reclaimed,
not demoted.

However, try_to_free_mem_cgroup_pages() actually unconditionally counts
demoted pages as reclaimed pages. So in practice, when it is called it
will often demote nr_pages and return the number of demoted pages to
the caller. Demoted pages don't lower the memcg usage, so
try_to_free_mem_cgroup_pages() is not actually doing what the callers
want it to do.

Various things work suboptimally on memory tiered systems, or don't
work at all, due to this:

- memory.high enforcement likely doesn't work (it just demotes nr_pages
  instead of lowering the memcg usage by nr_pages).
- try_charge_memcg() will keep retrying the charge while
  try_to_free_mem_cgroup_pages() is just demoting pages and not actually
  making any room for the charge.
- memory.reclaim has a wonky interface. It advertises to the user that
  it reclaims the provided amount, but it will actually often demote
  that amount.

There may be more effects to this issue.

To fix these issues I propose that shrink_folio_list() only count pages
demoted from inside sc->nodemask to outside sc->nodemask as
'reclaimed'.

For callers such as reclaim_high() or try_charge_memcg() that set
sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to
actually reclaim nr_pages and return the number of pages reclaimed. No
demoted pages would count towards the nr_pages requirement.

For callers such as memory_reclaim() that set sc->nodemask,
try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask
with either reclaim or demotion.

Tested this change using the memory.reclaim interface. I set up a test
case where I allocate 500m in a cgroup, and then do:

echo "50m" > memory.reclaim

Without this fix, my kernel demotes 70mb and reclaims 4mb
(memory.current is reduced by about 4mb).

With this fix, my kernel demotes all memory possible and reclaims 60mb
(memory.current is reduced by about 60mb).

Signed-off-by: Mina Almasry <almasrymina@google.com>

---

Changes in v3:
- Reverted back to the v1 implementation: "try to demote but don't count
  demoted pages unless they are demoted to outside the nodemask", as
  Ying suggested.
- Made sure demotions that fall back to a node other than the
  next_demotion_target() are not counted, as Wei suggested.
- Updated the comment in shrink_folio_list() as Ying suggested.
- Added before/after for the test case in the commit message, since Ying
  asked.
- Fixed call sites that don't provide sc->nodemask but expect demotion
  from a specific node, as Ying pointed out.

Cc: weixugc@google.com
Cc: ying.huang@intel.com

This is developed on top of mm-unstable, largely because I want the
solution to be compatible with the recently added nodes= arg on
mm-unstable.

---
 mm/vmscan.c | 91 +++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 78 insertions(+), 13 deletions(-)

--
2.39.0.rc0.267.gcb52ba06e7-goog