Message ID: 20240830100438.3623486-2-usamaarif642@gmail.com (mailing list archive)
State: New
Series: mm: split underused THPs
On Fri, 30 Aug 2024, Usama Arif wrote:

> From: Yu Zhao <yuzhao@google.com>
>
> If a tail page has only two references left, one inherited from the
> isolation of its head and the other from lru_add_page_tail() which we
> are about to drop, it means this tail page was concurrently zapped.
> Then we can safely free it and save page reclaim or migration the
> trouble of trying it.
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Tested-by: Shuang Zhai <zhais@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>

I'm sorry, but I think this patch (just this 1/6) needs to be dropped:
it is only an optimization, and unless a persuasive performance case
can be made to extend it, it ought to go (perhaps revisited later).

The problem I kept hitting was that all my work, requiring compaction and
reclaim, got (killably) stuck in or repeatedly calling reclaim_throttle():
because nr_isolated_anon had grown high - and remained high even when the
load had all been killed.

Bisection led to the 2/6 (remap to shared zeropage), but I'd say this 1/6
is the one to blame. I was intending to send this patch to "fix" it:

--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3295,6 +3295,8 @@ static void __split_huge_page(struct pag
 			folio_clear_active(new_folio);
 			folio_clear_unevictable(new_folio);
 			list_del(&new_folio->lru);
+			node_stat_sub_folio(folio, NR_ISOLATED_ANON +
+					folio_is_file_lru(folio));
 			if (!folio_batch_add(&free_folios, new_folio)) {
 				mem_cgroup_uncharge_folios(&free_folios);
 				free_unref_folios(&free_folios);

And that ran nicely, until I terminated the run and did
	grep nr_isolated /proc/sys/vm/stat_refresh /proc/vmstat
at the end: stat_refresh kindly left a pr_warn in dmesg to say
	nr_isolated_anon -334013737

My patch is not good enough. IIUC, some split_huge_pagers (reclaim?)
know how many pages they isolated and decremented the stats by, and
increment by that same number at the end; whereas other split_huge_pagers
(migration?) decrement one by one as they go through the list afterwards.

I've run out of time (I'm about to take a break): I gave up researching
who needs what, and was already feeling this optimization does too much
second guessing of what's needed (and its array of VM_WARN_ON_ONCE_FOLIOs
rather admits to that).

And I don't think it's as simple as moving the node_stat_sub_folio()
into 2/6 where the zero pte is substituted: that would probably handle
the vast majority of cases, but aren't there others which pass the
folio_ref_freeze(new_folio, 2) test - the title's zapped tail pages,
or racily truncated now that the folio has been unlocked, for example?

Hugh
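To make the accounting mismatch described above concrete, here is a small userspace C model. It is emphatically not kernel code, and the two caller styles are only assumptions drawn from the reclaim/migration guess in the mail: bulk_caller() adjusts the isolated counter by a remembered batch size, list_caller() adjusts it once per entry still on its list, and a split that frees entries early can keep at most one of them balanced.

/*
 * Not kernel code: a minimal userspace model of the NR_ISOLATED accounting
 * mismatch.  bulk_caller() stands in for a reclaim-style path that remembers
 * how many pages it isolated and adjusts the stat by that same count when it
 * finishes; list_caller() stands in for a migration-style path that adjusts
 * the stat once per entry still on its list afterwards.  Whether or not the
 * split subtracts for the folios it frees early, one style ends up unbalanced.
 */
#include <stdio.h>

static long nr_isolated;

/* the split frees "freed" of the "nr" entries; returns how many remain listed */
static int split(int nr, int freed, int subtract_in_split)
{
	if (subtract_in_split)
		nr_isolated -= freed;	/* what a node_stat_sub_folio() in the split would do */
	return nr - freed;
}

static void bulk_caller(int nr, int freed, int subtract_in_split)
{
	nr_isolated += nr;		/* isolate: add the whole batch */
	split(nr, freed, subtract_in_split);
	nr_isolated -= nr;		/* done: subtract the same remembered count */
}

static void list_caller(int nr, int freed, int subtract_in_split)
{
	int remaining;

	nr_isolated += nr;		/* isolate: add the whole batch */
	remaining = split(nr, freed, subtract_in_split);
	while (remaining--)
		nr_isolated--;		/* putback: subtract per entry still on the list */
}

int main(void)
{
	nr_isolated = 0;
	bulk_caller(512, 100, 0);
	list_caller(512, 100, 0);
	printf("split leaves the stat alone:   nr_isolated = %+ld\n", nr_isolated);

	nr_isolated = 0;
	bulk_caller(512, 100, 1);
	list_caller(512, 100, 1);
	printf("split subtracts what it frees: nr_isolated = %+ld\n", nr_isolated);
	return 0;
}

Run as written, the first case drifts positive (the list-style caller never sees the freed entries) and the second drifts negative (the bulk-style caller subtracts them twice), mirroring the "grown high" counter and the -334013737 reported above.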
On 05/09/2024 09:46, Hugh Dickins wrote:
> On Fri, 30 Aug 2024, Usama Arif wrote:
>
>> From: Yu Zhao <yuzhao@google.com>
>>
>> If a tail page has only two references left, one inherited from the
>> isolation of its head and the other from lru_add_page_tail() which we
>> are about to drop, it means this tail page was concurrently zapped.
>> Then we can safely free it and save page reclaim or migration the
>> trouble of trying it.
>>
>> Signed-off-by: Yu Zhao <yuzhao@google.com>
>> Tested-by: Shuang Zhai <zhais@google.com>
>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>
> I'm sorry, but I think this patch (just this 1/6) needs to be dropped:
> it is only an optimization, and unless a persuasive performance case
> can be made to extend it, it ought to go (perhaps revisited later).
>

I am ok for patch 1 only to be dropped. Patches 2-6 are not dependent on it.

It's an optimization and the underused shrinker doesn't depend on it.
It's possible that the folio -> new_folio change below might fix it? But if
it doesn't, I can retry later to make this work and resend it only if it
alone shows a significant performance improvement.

Thanks a lot for debugging this, and sorry it caused an issue.


> The problem I kept hitting was that all my work, requiring compaction and
> reclaim, got (killably) stuck in or repeatedly calling reclaim_throttle():
> because nr_isolated_anon had grown high - and remained high even when the
> load had all been killed.
>
> Bisection led to the 2/6 (remap to shared zeropage), but I'd say this 1/6
> is the one to blame. I was intending to send this patch to "fix" it:
>
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3295,6 +3295,8 @@ static void __split_huge_page(struct pag
>  			folio_clear_active(new_folio);
>  			folio_clear_unevictable(new_folio);
>  			list_del(&new_folio->lru);
> +			node_stat_sub_folio(folio, NR_ISOLATED_ANON +
> +					folio_is_file_lru(folio));

Maybe this should have been below? (Notice the folio -> new_folio)

+			node_stat_sub_folio(new_folio, NR_ISOLATED_ANON +
+					folio_is_file_lru(new_folio));

>  			if (!folio_batch_add(&free_folios, new_folio)) {
>  				mem_cgroup_uncharge_folios(&free_folios);
>  				free_unref_folios(&free_folios);
>
> And that ran nicely, until I terminated the run and did
> 	grep nr_isolated /proc/sys/vm/stat_refresh /proc/vmstat
> at the end: stat_refresh kindly left a pr_warn in dmesg to say
> 	nr_isolated_anon -334013737
>
> My patch is not good enough. IIUC, some split_huge_pagers (reclaim?)
> know how many pages they isolated and decremented the stats by, and
> increment by that same number at the end; whereas other split_huge_pagers
> (migration?) decrement one by one as they go through the list afterwards.
>
> I've run out of time (I'm about to take a break): I gave up researching
> who needs what, and was already feeling this optimization does too much
> second guessing of what's needed (and its array of VM_WARN_ON_ONCE_FOLIOs
> rather admits to that).
>
> And I don't think it's as simple as moving the node_stat_sub_folio()
> into 2/6 where the zero pte is substituted: that would probably handle
> the vast majority of cases, but aren't there others which pass the
> folio_ref_freeze(new_folio, 2) test - the title's zapped tail pages,
> or racily truncated now that the folio has been unlocked, for example?
>
> Hugh
On Thu, 5 Sep 2024, Usama Arif wrote:
> On 05/09/2024 09:46, Hugh Dickins wrote:
> > On Fri, 30 Aug 2024, Usama Arif wrote:
> >
> >> From: Yu Zhao <yuzhao@google.com>
> >>
> >> If a tail page has only two references left, one inherited from the
> >> isolation of its head and the other from lru_add_page_tail() which we
> >> are about to drop, it means this tail page was concurrently zapped.
> >> Then we can safely free it and save page reclaim or migration the
> >> trouble of trying it.
> >>
> >> Signed-off-by: Yu Zhao <yuzhao@google.com>
> >> Tested-by: Shuang Zhai <zhais@google.com>
> >> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> >> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> >
> > I'm sorry, but I think this patch (just this 1/6) needs to be dropped:
> > it is only an optimization, and unless a persuasive performance case
> > can be made to extend it, it ought to go (perhaps revisited later).
> >
>
> I am ok for patch 1 only to be dropped. Patches 2-6 are not dependent on it.
>
> It's an optimization and the underused shrinker doesn't depend on it.
> It's possible that the folio -> new_folio change below might fix it? But if
> it doesn't, I can retry later to make this work and resend it only if it
> alone shows a significant performance improvement.
>
> Thanks a lot for debugging this, and sorry it caused an issue.
>
>
> > The problem I kept hitting was that all my work, requiring compaction and
> > reclaim, got (killably) stuck in or repeatedly calling reclaim_throttle():
> > because nr_isolated_anon had grown high - and remained high even when the
> > load had all been killed.
> >
> > Bisection led to the 2/6 (remap to shared zeropage), but I'd say this 1/6
> > is the one to blame. I was intending to send this patch to "fix" it:
> >
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3295,6 +3295,8 @@ static void __split_huge_page(struct pag
> >  			folio_clear_active(new_folio);
> >  			folio_clear_unevictable(new_folio);
> >  			list_del(&new_folio->lru);
> > +			node_stat_sub_folio(folio, NR_ISOLATED_ANON +
> > +					folio_is_file_lru(folio));
>
> Maybe this should have been below? (Notice the folio -> new_folio)
>
> +			node_stat_sub_folio(new_folio, NR_ISOLATED_ANON +
> +					folio_is_file_lru(new_folio));

Yes, how very stupid of me (I'm well aware of the earlier bugfixes here,
and ought not to have done a blind copy and paste of those lines): thanks.

However... it makes no difference. I gave yours a run, expecting a
much smaller negative number, but actually it works out much the same:
because, of course, by this point in the code "folio" is left pointing
to a small folio, and is almost equivalent to new_folio for the
node_stat calculations.

(I say "almost" because I guess there's a chance that the page at
folio got reused as part of a larger folio meanwhile, which would
throw the stats off (if they weren't off already).)

So, even with your fix to my fix, this code doesn't work.
Whereas a revert of this 1/6 does work: nr_isolated_anon and
nr_isolated_file come to 0 when the system is at rest, as expected
(and as silence from vmstat_refresh confirms - /proc/vmstat itself
presents negative stats as 0, in order to hide transient oddities).

Hugh

>
> >  			if (!folio_batch_add(&free_folios, new_folio)) {
> >  				mem_cgroup_uncharge_folios(&free_folios);
> >  				free_unref_folios(&free_folios);
> >
> > And that ran nicely, until I terminated the run and did
> > 	grep nr_isolated /proc/sys/vm/stat_refresh /proc/vmstat
> > at the end: stat_refresh kindly left a pr_warn in dmesg to say
> > 	nr_isolated_anon -334013737
> >
> > My patch is not good enough. IIUC, some split_huge_pagers (reclaim?)
> > know how many pages they isolated and decremented the stats by, and
> > increment by that same number at the end; whereas other split_huge_pagers
> > (migration?) decrement one by one as they go through the list afterwards.
> >
> > I've run out of time (I'm about to take a break): I gave up researching
> > who needs what, and was already feeling this optimization does too much
> > second guessing of what's needed (and its array of VM_WARN_ON_ONCE_FOLIOs
> > rather admits to that).
> >
> > And I don't think it's as simple as moving the node_stat_sub_folio()
> > into 2/6 where the zero pte is substituted: that would probably handle
> > the vast majority of cases, but aren't there others which pass the
> > folio_ref_freeze(new_folio, 2) test - the title's zapped tail pages,
> > or racily truncated now that the folio has been unlocked, for example?
> >
> > Hugh
On 05/09/2024 19:05, Hugh Dickins wrote:
> On Thu, 5 Sep 2024, Usama Arif wrote:
>> On 05/09/2024 09:46, Hugh Dickins wrote:
>>> On Fri, 30 Aug 2024, Usama Arif wrote:
>>>
>>>> From: Yu Zhao <yuzhao@google.com>
>>>>
>>>> If a tail page has only two references left, one inherited from the
>>>> isolation of its head and the other from lru_add_page_tail() which we
>>>> are about to drop, it means this tail page was concurrently zapped.
>>>> Then we can safely free it and save page reclaim or migration the
>>>> trouble of trying it.
>>>>
>>>> Signed-off-by: Yu Zhao <yuzhao@google.com>
>>>> Tested-by: Shuang Zhai <zhais@google.com>
>>>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>>
>>> I'm sorry, but I think this patch (just this 1/6) needs to be dropped:
>>> it is only an optimization, and unless a persuasive performance case
>>> can be made to extend it, it ought to go (perhaps revisited later).
>>>
>>
>> I am ok for patch 1 only to be dropped. Patches 2-6 are not dependent on it.
>>
>> It's an optimization and the underused shrinker doesn't depend on it.
>> It's possible that the folio -> new_folio change below might fix it? But if
>> it doesn't, I can retry later to make this work and resend it only if it
>> alone shows a significant performance improvement.
>>
>> Thanks a lot for debugging this, and sorry it caused an issue.
>>
>>
>>> The problem I kept hitting was that all my work, requiring compaction and
>>> reclaim, got (killably) stuck in or repeatedly calling reclaim_throttle():
>>> because nr_isolated_anon had grown high - and remained high even when the
>>> load had all been killed.
>>>
>>> Bisection led to the 2/6 (remap to shared zeropage), but I'd say this 1/6
>>> is the one to blame. I was intending to send this patch to "fix" it:
>>>
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3295,6 +3295,8 @@ static void __split_huge_page(struct pag
>>>  			folio_clear_active(new_folio);
>>>  			folio_clear_unevictable(new_folio);
>>>  			list_del(&new_folio->lru);
>>> +			node_stat_sub_folio(folio, NR_ISOLATED_ANON +
>>> +					folio_is_file_lru(folio));
>>
>> Maybe this should have been below? (Notice the folio -> new_folio)
>>
>> +			node_stat_sub_folio(new_folio, NR_ISOLATED_ANON +
>> +					folio_is_file_lru(new_folio));
>
> Yes, how very stupid of me (I'm well aware of the earlier bugfixes here,
> and ought not to have done a blind copy and paste of those lines): thanks.
>
> However... it makes no difference. I gave yours a run, expecting a
> much smaller negative number, but actually it works out much the same:
> because, of course, by this point in the code "folio" is left pointing
> to a small folio, and is almost equivalent to new_folio for the
> node_stat calculations.
>
> (I say "almost" because I guess there's a chance that the page at
> folio got reused as part of a larger folio meanwhile, which would
> throw the stats off (if they weren't off already).)
>
> So, even with your fix to my fix, this code doesn't work.
> Whereas a revert of this 1/6 does work: nr_isolated_anon and
> nr_isolated_file come to 0 when the system is at rest, as expected
> (and as silence from vmstat_refresh confirms - /proc/vmstat itself
> presents negative stats as 0, in order to hide transient oddities).
>
> Hugh

Thanks for trying. I was hoping you had mTHPs enabled, and that the
folio -> new_folio change would then have fixed it. Happy for patch 1
only to be dropped.

I can try to figure it out later and resend it if I can actually show
performance numbers for the fixed version on real-world cases.

Thanks,
Usama

>
>>
>>>  			if (!folio_batch_add(&free_folios, new_folio)) {
>>>  				mem_cgroup_uncharge_folios(&free_folios);
>>>  				free_unref_folios(&free_folios);
>>>
>>> And that ran nicely, until I terminated the run and did
>>> 	grep nr_isolated /proc/sys/vm/stat_refresh /proc/vmstat
>>> at the end: stat_refresh kindly left a pr_warn in dmesg to say
>>> 	nr_isolated_anon -334013737
>>>
>>> My patch is not good enough. IIUC, some split_huge_pagers (reclaim?)
>>> know how many pages they isolated and decremented the stats by, and
>>> increment by that same number at the end; whereas other split_huge_pagers
>>> (migration?) decrement one by one as they go through the list afterwards.
>>>
>>> I've run out of time (I'm about to take a break): I gave up researching
>>> who needs what, and was already feeling this optimization does too much
>>> second guessing of what's needed (and its array of VM_WARN_ON_ONCE_FOLIOs
>>> rather admits to that).
>>>
>>> And I don't think it's as simple as moving the node_stat_sub_folio()
>>> into 2/6 where the zero pte is substituted: that would probably handle
>>> the vast majority of cases, but aren't there others which pass the
>>> folio_ref_freeze(new_folio, 2) test - the title's zapped tail pages,
>>> or racily truncated now that the folio has been unlocked, for example?
>>>
>>> Hugh
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 15418ffdd377..0c48806ccb9a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3170,7 +3170,9 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	unsigned int new_nr = 1 << new_order;
 	int order = folio_order(folio);
 	unsigned int nr = 1 << order;
+	struct folio_batch free_folios;
 
+	folio_batch_init(&free_folios);
 	/* complete memcg works before add pages to LRU */
 	split_page_memcg(head, order, new_order);
 
@@ -3254,6 +3256,27 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		if (subpage == page)
 			continue;
 		folio_unlock(new_folio);
+		/*
+		 * If a folio has only two references left, one inherited
+		 * from the isolation of its head and the other from
+		 * lru_add_page_tail() which we are about to drop, it means this
+		 * folio was concurrently zapped. Then we can safely free it
+		 * and save page reclaim or migration the trouble of trying it.
+		 */
+		if (list && folio_ref_freeze(new_folio, 2)) {
+			VM_WARN_ON_ONCE_FOLIO(folio_test_lru(new_folio), new_folio);
+			VM_WARN_ON_ONCE_FOLIO(folio_test_large(new_folio), new_folio);
+			VM_WARN_ON_ONCE_FOLIO(folio_mapped(new_folio), new_folio);
+
+			folio_clear_active(new_folio);
+			folio_clear_unevictable(new_folio);
+			list_del(&new_folio->lru);
+			if (!folio_batch_add(&free_folios, new_folio)) {
+				mem_cgroup_uncharge_folios(&free_folios);
+				free_unref_folios(&free_folios);
+			}
+			continue;
+		}
 
 		/*
 		 * Subpages may be freed if there wasn't any mapping
@@ -3264,6 +3287,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		 */
 		free_page_and_swap_cache(subpage);
 	}
+
+	if (free_folios.nr) {
+		mem_cgroup_uncharge_folios(&free_folios);
+		free_unref_folios(&free_folios);
+	}
 }
 
 /* Racy check whether the huge page can be split */
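For readers less familiar with the refcount trick the "if (list && folio_ref_freeze(new_folio, 2))" hunk above depends on, here is a hedged userspace sketch. It is not kernel code; struct folio and ref_freeze() are toy stand-ins that only mimic the semantics of folio_ref_freeze(), and the baseline of two references comes from the patch's own comment.

/*
 * Not kernel code: a userspace sketch of the reference-count check.
 * ref_freeze() mimics folio_ref_freeze(): it succeeds only if the count
 * is exactly the expected value, atomically dropping it to zero so no
 * further reference can be taken.  Per the patch's comment, a freshly
 * split tail folio holds two baseline references (one from the isolation
 * of its head, one from lru_add_page_tail()) plus one per remaining
 * mapping, so freezing at 2 succeeds only if every mapping of the tail
 * has already been zapped.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct folio {
	atomic_int refcount;
};

static bool ref_freeze(struct folio *f, int expected)
{
	/* compare-and-swap expected -> 0; fails if any extra reference exists */
	return atomic_compare_exchange_strong(&f->refcount, &expected, 0);
}

int main(void)
{
	/* only the isolation ref and the about-to-be-dropped LRU ref remain */
	struct folio zapped_tail = { .refcount = 2 };
	/* one extra ref, e.g. the tail is still mapped by a page table */
	struct folio mapped_tail = { .refcount = 3 };

	printf("zapped tail: freeze(2) %s -> freed immediately\n",
	       ref_freeze(&zapped_tail, 2) ? "succeeds" : "fails");
	printf("mapped tail: freeze(2) %s -> left for reclaim/migration\n",
	       ref_freeze(&mapped_tail, 2) ? "succeeds" : "fails");
	return 0;
}

The freeze either claims the folio outright (so it can be freed on the spot, as the patch does via the free_folios batch) or fails and leaves it untouched for reclaim or migration to deal with; what the patch does not do, as the thread above establishes, is adjust NR_ISOLATED for the folios it claims.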