Message ID: 20240225123215.86503-1-ioworker0@gmail.com (mailing list archive)
State: New
Series: [1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free
On 2/25/24 20:32, Lance Yang wrote:
> diff --git a/mm/madvise.c b/mm/madvise.c
> index cfa5e7288261..bcbf56595a2e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -676,11 +676,43 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  		 */
>  		if (folio_test_large(folio)) {
>  			int err;
> +			unsigned long next_addr, align;
>
> -			if (folio_estimated_sharers(folio) != 1)
> -				break;
> -			if (!folio_trylock(folio))
> -				break;
> +			if (folio_estimated_sharers(folio) != 1 ||
> +			    !folio_trylock(folio))
> +				goto skip_large_folio;
> +
> +			align = folio_nr_pages(folio) * PAGE_SIZE;
> +			next_addr = ALIGN_DOWN(addr + align, align);

There is a possible corner case:
If there is a cow folio associated with this folio and the cow folio
has smaller size than this folio for whatever reason, this change can't
handle it correctly.


Regards
Yin, Fengwei
Hi Lance,

On Mon, Feb 26, 2024 at 1:33 AM Lance Yang <ioworker0@gmail.com> wrote:
>
> This patch improves madvise_free_pte_range() to correctly
> handle large folio that is smaller than PMD-size
> (for example, 16KiB to 1024KiB[1]). It's probably part of
> the preparation to support anonymous multi-size THP.
>
> Additionally, when the consecutive PTEs are mapped to
> consecutive pages of the same large folio (mTHP), if the
> folio is locked before madvise(MADV_FREE) or cannot be
> split, then all subsequent PTEs within the same PMD will
> be skipped. However, they should have been MADV_FREEed.
>
> Moreover, this patch also optimizes lazyfreeing with
> PTE-mapped mTHP (Inspired by David Hildenbrand[2]). We
> aim to avoid unnecessary folio splitting if the large
> folio is entirely within the given range.

We did something similar on MADV_PAGEOUT[1]

[1] https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/

> On an Intel I5 CPU, lazyfreeing a 1GiB VMA backed by
> PTE-mapped folios of the same size results in the following
> runtimes for madvise(MADV_FREE) in seconds (shorter is better):
>
> Folio Size |    Old     |    New     | Change
> ----------------------------------------------
>       4KiB |  0.590251  |  0.590264  |    0%
>      16KiB |  2.990447  |  0.182167  |  -94%
>      32KiB |  2.547831  |  0.101622  |  -96%
>      64KiB |  2.457796  |  0.049726  |  -98%
>     128KiB |  2.281034  |  0.030109  |  -99%
>     256KiB |  2.230387  |  0.015838  |  -99%
>     512KiB |  2.189106  |  0.009149  |  -99%
>    1024KiB |  2.183949  |  0.006620  |  -99%
>    2048KiB |  0.002799  |  0.002795  |    0%
>
> [1] https://lkml.kernel.org/r/20231207161211.2374093-5-ryan.roberts@arm.com
> [2] https://lore.kernel.org/linux-mm/20240214204435.167852-1-david@redhat.com/
>
> Signed-off-by: Lance Yang <ioworker0@gmail.com>
> ---
>  mm/madvise.c | 69 +++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 58 insertions(+), 11 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index cfa5e7288261..bcbf56595a2e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -676,11 +676,43 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  		 */
>  		if (folio_test_large(folio)) {
>  			int err;
> +			unsigned long next_addr, align;
>
> -			if (folio_estimated_sharers(folio) != 1)
> -				break;
> -			if (!folio_trylock(folio))
> -				break;
> +			if (folio_estimated_sharers(folio) != 1 ||
> +			    !folio_trylock(folio))
> +				goto skip_large_folio;
> +
> +			align = folio_nr_pages(folio) * PAGE_SIZE;
> +			next_addr = ALIGN_DOWN(addr + align, align);
> +
> +			/*
> +			 * If we mark only the subpages as lazyfree,
> +			 * split the large folio.
> +			 */
> +			if (next_addr > end || next_addr - addr != align)
> +				goto split_large_folio;
> +
> +			/*
> +			 * Avoid unnecessary folio splitting if the large
> +			 * folio is entirely within the given range.
> +			 */
> +			folio_test_clear_dirty(folio);
> +			folio_unlock(folio);
> +			for (; addr != next_addr; pte++, addr += PAGE_SIZE) {
> +				ptent = ptep_get(pte);
> +				if (pte_young(ptent) || pte_dirty(ptent)) {
> +					ptent = ptep_get_and_clear_full(
> +						mm, addr, pte, tlb->fullmm);
> +					ptent = pte_mkold(ptent);
> +					ptent = pte_mkclean(ptent);
> +					set_pte_at(mm, addr, pte, ptent);
> +					tlb_remove_tlb_entry(tlb, pte, addr);
> +				}

The code works under the assumption that the large folio is entirely mapped
by all PTEs in the range. This is not always true. This won't work in some
cases, as some PTEs might be mapping to the large folio while some others
might have been unmapped or mapped to different folios.

So in MADV_PAGEOUT, we have a function to check whether the folio is
really entirely mapped:

+static inline bool pte_range_cont_mapped(unsigned long start_pfn,
+			pte_t *start_pte, unsigned long start_addr, int nr)
+{
+	int i;
+	pte_t pte_val;
+
+	for (i = 0; i < nr; i++) {
+		pte_val = ptep_get(start_pte + i);
+
+		if (pte_none(pte_val))
+			return false;
+
+		if (pte_pfn(pte_val) != (start_pfn + i))
+			return false;
+	}
+
+	return true;
+}

> +			}
> +			folio_mark_lazyfree(folio);
> +			goto next_folio;
> +
> +split_large_folio:
>  			folio_get(folio);
>  			arch_leave_lazy_mmu_mode();
>  			pte_unmap_unlock(start_pte, ptl);
> @@ -688,13 +720,28 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  			err = split_folio(folio);
>  			folio_unlock(folio);
>  			folio_put(folio);
> -			if (err)
> -				break;
> -			start_pte = pte =
> -				pte_offset_map_lock(mm, pmd, addr, &ptl);
> -			if (!start_pte)
> -				break;
> -			arch_enter_lazy_mmu_mode();
> +
> +			/*
> +			 * If the large folio is locked before madvise(MADV_FREE)
> +			 * or cannot be split, we just skip it.
> +			 */
> +			if (err) {
> +skip_large_folio:
> +				if (next_addr >= end)
> +					break;
> +				pte += (next_addr - addr) / PAGE_SIZE;
> +				addr = next_addr;
> +			}
> +
> +			if (!start_pte) {
> +				start_pte = pte = pte_offset_map_lock(
> +					mm, pmd, addr, &ptl);
> +				if (!start_pte)
> +					break;
> +				arch_enter_lazy_mmu_mode();
> +			}
> +
> +next_folio:
>  			pte--;
>  			addr -= PAGE_SIZE;
>  			continue;
> --
> 2.33.1
>

Thanks
Barry
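For illustration only, a minimal sketch of how such a guard could gate the no-split path in the patch above. pte_range_cont_mapped() is the MADV_PAGEOUT helper quoted above; deriving start_pfn via folio_pfn() is an assumption, not something taken from either patch:

	unsigned long nr = folio_nr_pages(folio);

	/*
	 * Only take the no-split fast path when all nr PTEs at this
	 * position really map consecutive pages of this folio.
	 */
	if (next_addr > end || next_addr - addr != align ||
	    !pte_range_cont_mapped(folio_pfn(folio), pte, addr, nr))
		goto split_large_folio;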
Hey Fengwei, Thanks for taking time to review! > On Mon, Feb 26, 2024 at 10:38 AM Yin Fengwei <fengwei.yin@intel.com> wrote: > > On Sun, Feb 25, 2024 at 8:32 PM Lance Yang <ioworker0@gmail.com> wrote: [...] > > --- a/mm/madvise.c > > +++ b/mm/madvise.c > > @@ -676,11 +676,43 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, > > */ > > if (folio_test_large(folio)) { > > int err; > > + unsigned long next_addr, align; > > > > - if (folio_estimated_sharers(folio) != 1) > > - break; > > - if (!folio_trylock(folio)) > > - break; > > + if (folio_estimated_sharers(folio) != 1 || > > + !folio_trylock(folio)) > > + goto skip_large_folio; > > + > > + align = folio_nr_pages(folio) * PAGE_SIZE; > > + next_addr = ALIGN_DOWN(addr + align, align); > There is a possible corner case: > If there is a cow folio associated with this folio and the cow folio > has smaller size than this folio for whatever reason, this change can't > handle it correctly. Thanks for pointing that out; it's very helpful to me! I made some changes. Could you please check if this corner case is now resolved? As a diff against this patch. diff --git a/mm/madvise.c b/mm/madvise.c index bcbf56595a2e..c7aacc9f9536 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -686,10 +686,12 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, next_addr = ALIGN_DOWN(addr + align, align); /* - * If we mark only the subpages as lazyfree, - * split the large folio. + * If we mark only the subpages as lazyfree, or + * if there is a cow folio associated with this folio, + * then split the large folio. */ - if (next_addr > end || next_addr - addr != align) + if (next_addr > end || next_addr - addr != align || + folio_total_mapcount(folio) != folio_nr_pages(folio)) goto split_large_folio; /* --- Thanks again for your time! Best, Lance
Hey Barry, Thanks for taking time to review! On Mon, Feb 26, 2024 at 12:00 PM Barry Song <21cnbao@gmail.com> wrote: [...] > On Mon, Feb 26, 2024 at 1:33 AM Lance Yang <ioworker0@gmail.com> wrote: [...] > We did something similar on MADV_PAGEOUT[1] > [1] https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/ Thanks for providing the link above. [...] > > + * Avoid unnecessary folio splitting if the large > > + * folio is entirely within the given range. > > + */ > > + folio_test_clear_dirty(folio); > > + folio_unlock(folio); > > + for (; addr != next_addr; pte++, addr += PAGE_SIZE) { > > + ptent = ptep_get(pte); > > + if (pte_young(ptent) || pte_dirty(ptent)) { > > + ptent = ptep_get_and_clear_full( > > + mm, addr, pte, tlb->fullmm); > > + ptent = pte_mkold(ptent); > > + ptent = pte_mkclean(ptent); > > + set_pte_at(mm, addr, pte, ptent); > > + tlb_remove_tlb_entry(tlb, pte, addr); > > + } > > The code works under the assumption the large folio is entirely mapped > in all PTEs in the range. This is not always true. > > This won't work in some cases as some PTEs might be mapping to the > large folios. some others might have been unmapped or mapped > to different folios. > > so in MADV_PAGEOUT, we have a function to check the folio is > really entirely mapped: > > +static inline bool pte_range_cont_mapped(unsigned long start_pfn, > + pte_t *start_pte, unsigned long start_addr, int nr) > +{ > + int i; > + pte_t pte_val; > + > + for (i = 0; i < nr; i++) { > + pte_val = ptep_get(start_pte + i); > + > + if (pte_none(pte_val)) > + return false; > + > + if (pte_pfn(pte_val) != (start_pfn + i)) > + return false; > + } > + > + return true; > +} Thanks for providing the information; it's very helpful to me! I made some changes. Would you mind taking another look, please? As a diff against this patch. diff --git a/mm/madvise.c b/mm/madvise.c index bcbf56595a2e..255d2f329be4 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -616,6 +616,18 @@ static long madvise_pageout(struct vm_area_struct *vma, return 0; } +static inline bool pte_range_cont_mapped(pte_t *pte, unsigned long nr) +{ + pte_t pte_val; + unsigned long pfn = pte_pfn(pte); + for (int i = 0; i < nr; i++) { + pte_val = ptep_get(pte + i); + if (pte_none(pte_val) || pte_pfn(pte_val) != (pfn + i)) + return false; + } + return true; +} + static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) @@ -676,20 +688,25 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, */ if (folio_test_large(folio)) { int err; - unsigned long next_addr, align; + unsigned long nr, next_addr, align; if (folio_estimated_sharers(folio) != 1 || !folio_trylock(folio)) goto skip_large_folio; - align = folio_nr_pages(folio) * PAGE_SIZE; + nr = folio_nr_pages(folio); + align = nr * PAGE_SIZE; next_addr = ALIGN_DOWN(addr + align, align); /* - * If we mark only the subpages as lazyfree, - * split the large folio. + * If we mark only the subpages as lazyfree, or + * if there is a cow folio associated with this folio, + * or if this folio is not really entirely mapped, + * then split the large folio. */ - if (next_addr > end || next_addr - addr != align) + if (next_addr > end || next_addr - addr != align || + folio_total_mapcount(folio) != nr || + pte_range_cont_mapped(pte, nr)) goto split_large_folio; /* --- Thanks again for your time! Best, Lance
On 26.02.24 09:37, Lance Yang wrote:
> Hey Barry,
>
> Thanks for taking time to review!
>
> On Mon, Feb 26, 2024 at 12:00 PM Barry Song <21cnbao@gmail.com> wrote:
> [...]
>> So in MADV_PAGEOUT, we have a function to check whether the folio is
>> really entirely mapped:
[...]
> Thanks for providing the information; it's very helpful to me!
> I made some changes. Would you mind taking another look, please?
>
> As a diff against this patch.
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index bcbf56595a2e..255d2f329be4 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -616,6 +616,18 @@ static long madvise_pageout(struct vm_area_struct *vma,
>  	return 0;
>  }
>
> +static inline bool pte_range_cont_mapped(pte_t *pte, unsigned long nr)
> +{
> +	pte_t pte_val;
> +	unsigned long pfn = pte_pfn(pte);
> +	for (int i = 0; i < nr; i++) {
> +		pte_val = ptep_get(pte + i);
> +		if (pte_none(pte_val) || pte_pfn(pte_val) != (pfn + i))
> +			return false;
> +	}
> +	return true;
> +}

I dislike the "cont mapped" terminology.

Maybe folio_pte_batch() does what you want?
Hey David, Thanks for your suggestion! On Mon, Feb 26, 2024 at 4:41 PM David Hildenbrand <david@redhat.com> wrote: > [...] > > On Mon, Feb 26, 2024 at 12:00 PM Barry Song <21cnbao@gmail.com> wrote: > > [...] > >> On Mon, Feb 26, 2024 at 1:33 AM Lance Yang <ioworker0@gmail.com> wrote: > > [...] [...] > > +static inline bool pte_range_cont_mapped(pte_t *pte, unsigned long nr) > > +{ > > + pte_t pte_val; > > + unsigned long pfn = pte_pfn(pte); > > + for (int i = 0; i < nr; i++) { > > + pte_val = ptep_get(pte + i); > > + if (pte_none(pte_val) || pte_pfn(pte_val) != (pfn + i)) > > + return false; > > + } > > + return true; > > +} > > I dislike the "cont mapped" terminology. > > Maybe folio_pte_batch() does what you want? folio_pte_batch() is a good choice. Appreciate it! Best, Lance > > -- > Cheers, > > David / dhildenb >
On 26/02/2024 08:35, Lance Yang wrote: > Hey Fengwei, > > Thanks for taking time to review! > >> On Mon, Feb 26, 2024 at 10:38 AM Yin Fengwei <fengwei.yin@intel.com> wrote: >>> On Sun, Feb 25, 2024 at 8:32 PM Lance Yang <ioworker0@gmail.com> wrote: > [...] >>> --- a/mm/madvise.c >>> +++ b/mm/madvise.c >>> @@ -676,11 +676,43 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, >>> */ >>> if (folio_test_large(folio)) { >>> int err; >>> + unsigned long next_addr, align; >>> >>> - if (folio_estimated_sharers(folio) != 1) >>> - break; >>> - if (!folio_trylock(folio)) >>> - break; >>> + if (folio_estimated_sharers(folio) != 1 || >>> + !folio_trylock(folio)) >>> + goto skip_large_folio; >>> + >>> + align = folio_nr_pages(folio) * PAGE_SIZE; >>> + next_addr = ALIGN_DOWN(addr + align, align); >> There is a possible corner case: >> If there is a cow folio associated with this folio and the cow folio >> has smaller size than this folio for whatever reason, this change can't >> handle it correctly. > > Thanks for pointing that out; it's very helpful to me! > I made some changes. Could you please check if this corner case is now resolved? > > As a diff against this patch. > > diff --git a/mm/madvise.c b/mm/madvise.c > index bcbf56595a2e..c7aacc9f9536 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -686,10 +686,12 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, > next_addr = ALIGN_DOWN(addr + align, align); > > /* > - * If we mark only the subpages as lazyfree, > - * split the large folio. > + * If we mark only the subpages as lazyfree, or > + * if there is a cow folio associated with this folio, > + * then split the large folio. > */ > - if (next_addr > end || next_addr - addr != align) > + if (next_addr > end || next_addr - addr != align || > + folio_total_mapcount(folio) != folio_nr_pages(folio)) I still don't think this is correct. I think you were previously assuming that if you see a page from a large folio then the whole large folio should be contiguously mapped? This new check doesn't validate that assumption reliably; you need to iterate through every pte to generate a batch, like David does in folio_pte_batch() for this to be safe. An example of when this check is insufficient; let's say you have a 4 page anon folio mapped contiguously in a process (total_mapcount=4). The process is forked (total_mapcount=8). Then each process munmaps the second 2 pages (total_mapcount=4). In place of the munmapped 2 pages, 2 new pages are mapped. Then call madvise. It's probably even easier to trigger for file-backed memory (I think this code path is used for both file and anon?) Thanks, Ryan > goto split_large_folio; > > /* > --- > > Thanks again for your time! > > Best, > Lance
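To make Ryan's scenario concrete, a hypothetical userspace reproducer (not from the thread; it assumes the kernel happens to back the first mapping with a single 4-page folio):

	#include <sys/mman.h>
	#include <unistd.h>
	#include <string.h>

	int main(void)
	{
		size_t pg = sysconf(_SC_PAGESIZE);
		char *p = mmap(NULL, 4 * pg, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		memset(p, 1, 4 * pg);	/* fault in, ideally as one 4-page folio */

		if (fork() == 0) {			/* total_mapcount: 4 -> 8 */
			munmap(p + 2 * pg, 2 * pg);	/* child drops pages 2-3 */
			pause();			/* keep child's pages 0-1 mapped */
		}

		munmap(p + 2 * pg, 2 * pg);		/* parent drops pages 2-3: 4 again */
		mmap(p + 2 * pg, 2 * pg, PROT_READ | PROT_WRITE,
		     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
		memset(p + 2 * pg, 1, 2 * pg);		/* PTE2/3 now map different folio(s) */

		/*
		 * For the old folio, folio_total_mapcount() == folio_nr_pages() == 4,
		 * yet the four PTEs in this range do not all map it.
		 */
		madvise(p, 4 * pg, MADV_FREE);
		return 0;
	}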
Hi Lance,

Thanks for working on this!

On 25/02/2024 12:32, Lance Yang wrote:
> This patch improves madvise_free_pte_range() to correctly
> handle large folio that is smaller than PMD-size

When you say "correctly handle" are you implying there is a bug with the current
implementation or are you just saying you can optimize this to improve
performance? I'm not convinced there is a bug, but I agree there is certainly
room for performance improvement.

Thanks,
Ryan

> (for example, 16KiB to 1024KiB[1]). It's probably part of
> the preparation to support anonymous multi-size THP.
[...]
On 26.02.24 13:57, Ryan Roberts wrote:
> On 26/02/2024 08:35, Lance Yang wrote:
>> Hey Fengwei,
>>
>> Thanks for taking time to review!
>>
[...]
>>> There is a possible corner case:
>>> If there is a cow folio associated with this folio and the cow folio
>>> has smaller size than this folio for whatever reason, this change can't
>>> handle it correctly.
>>
>> Thanks for pointing that out; it's very helpful to me!
>> I made some changes. Could you please check if this corner case is now resolved?
>>
>> As a diff against this patch.
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index bcbf56595a2e..c7aacc9f9536 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -686,10 +686,12 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>>  			next_addr = ALIGN_DOWN(addr + align, align);
>>
>>  			/*
>> -			 * If we mark only the subpages as lazyfree,
>> -			 * split the large folio.
>> +			 * If we mark only the subpages as lazyfree, or
>> +			 * if there is a cow folio associated with this folio,
>> +			 * then split the large folio.
>>  			 */
>> -			if (next_addr > end || next_addr - addr != align)
>> +			if (next_addr > end || next_addr - addr != align ||
>> +			    folio_total_mapcount(folio) != folio_nr_pages(folio))
>
> I still don't think this is correct. I think you were previously assuming that
> if you see a page from a large folio then the whole large folio should be
> contiguously mapped? This new check doesn't validate that assumption reliably;
> you need to iterate through every pte to generate a batch, like David does in
> folio_pte_batch() for this to be safe.
>
> An example of when this check is insufficient; let's say you have a 4 page anon
> folio mapped contiguously in a process (total_mapcount=4). The process is forked
> (total_mapcount=8). Then each process munmaps the second 2 pages
> (total_mapcount=4). In place of the munmapped 2 pages, 2 new pages are mapped.
> Then call madvise. It's probably even easier to trigger for file-backed memory
> (I think this code path is used for both file and anon?)

What would work here is using folio_pte_batch() to get how many PTEs are
mapped *here*, then comparing the batch size to folio_nr_pages(). If
both match, we are mapping all subpages.
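A rough sketch of the check David describes, as it might sit in madvise_free_pte_range() (folio_pte_batch() and the FPB_* flags as defined in mm/memory.c; the surrounding variables follow the patch, the rest is assumed):

	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
	int max_nr = (end - addr) / PAGE_SIZE;
	int nr;

	/* How many consecutive PTEs starting here map this folio? */
	nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
			     fpb_flags, NULL);

	/* Only skip the split when the whole folio is mapped here. */
	if (nr != folio_nr_pages(folio))
		goto split_large_folio;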
On 26/02/2024 08:55, Lance Yang wrote: > Hey David, > > Thanks for your suggestion! > > On Mon, Feb 26, 2024 at 4:41 PM David Hildenbrand <david@redhat.com> wrote: >> > [...] >>> On Mon, Feb 26, 2024 at 12:00 PM Barry Song <21cnbao@gmail.com> wrote: >>> [...] >>>> On Mon, Feb 26, 2024 at 1:33 AM Lance Yang <ioworker0@gmail.com> wrote: >>> [...] > [...] >>> +static inline bool pte_range_cont_mapped(pte_t *pte, unsigned long nr) >>> +{ >>> + pte_t pte_val; >>> + unsigned long pfn = pte_pfn(pte); >>> + for (int i = 0; i < nr; i++) { >>> + pte_val = ptep_get(pte + i); >>> + if (pte_none(pte_val) || pte_pfn(pte_val) != (pfn + i)) >>> + return false; >>> + } >>> + return true; >>> +} >> >> I dislike the "cont mapped" terminology. >> >> Maybe folio_pte_batch() does what you want? > > folio_pte_batch() is a good choice. Appreciate it! Agreed, folio_pte_batch() is likely to be widely useful for this change and others, so suggest exporting it from memory.c and reusing as is if possible. > > Best, > Lance > >> >> -- >> Cheers, >> >> David / dhildenb >>
On Mon, Feb 26, 2024 at 9:04 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 26.02.24 13:57, Ryan Roberts wrote:
> > On 26/02/2024 08:35, Lance Yang wrote:
[...]
> > I still don't think this is correct. I think you were previously assuming that
> > if you see a page from a large folio then the whole large folio should be
> > contiguously mapped? This new check doesn't validate that assumption reliably;
> > you need to iterate through every pte to generate a batch, like David does in
> > folio_pte_batch() for this to be safe.
> >
> > An example of when this check is insufficient; let's say you have a 4 page anon
> > folio mapped contiguously in a process (total_mapcount=4). The process is forked
> > (total_mapcount=8). Then each process munmaps the second 2 pages
> > (total_mapcount=4). In place of the munmapped 2 pages, 2 new pages are mapped.
> > Then call madvise. It's probably even easier to trigger for file-backed memory
> > (I think this code path is used for both file and anon?)
>
> What would work here is using folio_pte_batch() to get how many PTEs are
> mapped *here*, then comparing the batch size to folio_nr_pages(). If
> both match, we are mapping all subpages.

Thanks! I'll use folio_pte_batch() here in v2.

Best,
Lance
On Mon, Feb 26, 2024 at 9:04 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 26/02/2024 08:55, Lance Yang wrote: > > Hey David, > > > > Thanks for your suggestion! > > > > On Mon, Feb 26, 2024 at 4:41 PM David Hildenbrand <david@redhat.com> wrote: > >> > > [...] > >>> On Mon, Feb 26, 2024 at 12:00 PM Barry Song <21cnbao@gmail.com> wrote: > >>> [...] > >>>> On Mon, Feb 26, 2024 at 1:33 AM Lance Yang <ioworker0@gmail.com> wrote: > >>> [...] > > [...] > >>> +static inline bool pte_range_cont_mapped(pte_t *pte, unsigned long nr) > >>> +{ > >>> + pte_t pte_val; > >>> + unsigned long pfn = pte_pfn(pte); > >>> + for (int i = 0; i < nr; i++) { > >>> + pte_val = ptep_get(pte + i); > >>> + if (pte_none(pte_val) || pte_pfn(pte_val) != (pfn + i)) > >>> + return false; > >>> + } > >>> + return true; > >>> +} > >> > >> I dislike the "cont mapped" terminology. > >> > >> Maybe folio_pte_batch() does what you want? > > > > folio_pte_batch() is a good choice. Appreciate it! > > Agreed, folio_pte_batch() is likely to be widely useful for this change and > others, so suggest exporting it from memory.c and reusing as is if possible. Thanks for your suggestion. I'll use folio_pte_batch() in v2. Best, Lance > > > > > Best, > > Lance > > > >> > >> -- > >> Cheers, > >> > >> David / dhildenb > >> >
Hey Ryan,

Thanks for taking time to review!

On Mon, Feb 26, 2024 at 9:00 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi Lance,
>
> Thanks for working on this!
>
> On 25/02/2024 12:32, Lance Yang wrote:
> > This patch improves madvise_free_pte_range() to correctly
> > handle large folio that is smaller than PMD-size
>
> When you say "correctly handle" are you implying there is a bug with the current
> implementation or are you just saying you can optimize this to improve
> performance? I'm not convinced there is a bug, but I agree there is certainly
> room for performance improvement.

I agree with your point, and will update the changelog in v2.

Thanks again for your time!

Best,
Lance

> Thanks,
> Ryan
[...]
On Tue, Feb 27, 2024 at 2:04 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 26/02/2024 08:55, Lance Yang wrote:
[...]
> Agreed, folio_pte_batch() is likely to be widely useful for this change and
> others, so suggest exporting it from memory.c and reusing as is if possible.

I actually missed folio_pte_batch() in the cont-pte series and re-invented
a function to check if a large folio is entirely mapped in MADV_PAGEOUT[1].
Exporting folio_pte_batch() will also benefit that case. The problem space
is the same.

[1] https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/

Thanks
Barry
> Thanks for your suggestion. I'll use folio_pte_batch() in v2.

Hi Lance,
Obviously, we both need this. While making large folio swap-in
v2, I am exporting folio_pte_batch() as below,

From: Barry Song <v-songbaohua@oppo.com>
Date: Tue, 27 Feb 2024 14:05:43 +1300
Subject: [PATCH] mm: export folio_pte_batch as a couple of modules need it

MADV_FREE, MADV_PAGEOUT and some other modules might need folio_pte_batch
to check if a range of PTEs are completely mapped to a large folio with
contiguous physical offset.

Cc: Lance Yang <ioworker0@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/internal.h | 13 +++++++++++++
 mm/memory.c   |  2 +-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index 36c11ea41f47..7e11aea3eda9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -83,6 +83,19 @@ static inline void *folio_raw_mapping(struct folio *folio)
 	return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
 }
 
+/* Flags for folio_pte_batch(). */
+typedef int __bitwise fpb_t;
+
+/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
+#define FPB_IGNORE_DIRTY		((__force fpb_t)BIT(0))
+
+/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
+#define FPB_IGNORE_SOFT_DIRTY		((__force fpb_t)BIT(1))
+
+extern int folio_pte_batch(struct folio *folio, unsigned long addr,
+		pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
+		bool *any_writable);
+
 void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
 						int nr_throttled);
 static inline void acct_reclaim_writeback(struct folio *folio)
diff --git a/mm/memory.c b/mm/memory.c
index 6378f6bc22c5..dd9bd67f037a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -989,7 +989,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
  * If "any_writable" is set, it will indicate if any other PTE besides the
  * first (given) PTE is writable.
  */
-static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
+int folio_pte_batch(struct folio *folio, unsigned long addr,
 		pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
 		bool *any_writable)
 {
--
2.34.1
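With the export in place, a caller outside mm/memory.c could wrap it along these lines (a sketch; the helper name and its use are illustrative, not part of Barry's patch):

	#include "internal.h"	/* folio_pte_batch() and the FPB_* flags */

	/*
	 * True if all folio_nr_pages(folio) pages are mapped by
	 * consecutive PTEs starting at @start_ptep / @addr.
	 */
	static bool folio_fully_mapped_here(struct folio *folio, unsigned long addr,
					    pte_t *start_ptep, pte_t pte, int max_nr)
	{
		int nr = folio_pte_batch(folio, addr, start_ptep, pte, max_nr,
					 FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY,
					 NULL);

		return nr == folio_nr_pages(folio);
	}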
On Tue, Feb 27, 2024 at 9:21 AM Barry Song <21cnbao@gmail.com> wrote:
>
> > Thanks for your suggestion. I'll use folio_pte_batch() in v2.
>
> Hi Lance,
> Obviously, we both need this. While making large folio swap-in
> v2, I am exporting folio_pte_batch() as below,

Thanks, Barry.

Could you separate the export of folio_pte_batch() from the large folio
swap-in v2? Prioritizing this specific change would aid the development
of the v2 based on it.

Best,
Lance

[...]
On 2/27/24 04:49, Barry Song wrote:
> On Tue, Feb 27, 2024 at 2:04 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
[...]
> I actually missed folio_pte_batch() in the cont-pte series and re-invented
> a function to check if a large folio is entirely mapped in MADV_PAGEOUT[1].
> Exporting folio_pte_batch() will also benefit that case. The problem space
> is the same.
>
> [1] https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/

I am wondering whether we can delay the large folio split till the page
reclaim phase for madvise cases.

Like if we hit a folio which is partially mapped to the range, don't split it but
just unmap the mapped part from the range. Let page reclaim decide whether
to split the large folio or not (if it's not mapped to any other range, it will be
freed as a whole large folio; if part of it is still mapped to another range, page
reclaim can decide whether to split it or ignore it for the current reclaim cycle).

Splitting does work here. But it just drops all the benefits of the large folio.


Regards
Yin, Fengwei
On Tue, Feb 27, 2024 at 2:48 PM Lance Yang <ioworker0@gmail.com> wrote:
>
> On Tue, Feb 27, 2024 at 9:21 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > > Thanks for your suggestion. I'll use folio_pte_batch() in v2.
> >
> > Hi Lance,
> > Obviously, we both need this. While making large folio swap-in
> > v2, I am exporting folio_pte_batch() as below,
>
> Thanks, Barry.
>
> Could you separate the export of folio_pte_batch() from the large folio
> swap-in v2? Prioritizing this specific change would aid the development
> of the v2 based on it.

I agree we should get this one pulled in by Andrew early to avoid potential
dependencies and conflicts between the two series.

[...]

Thanks
Barry
Thanks, Barry!

On Tue, Feb 27, 2024 at 10:12 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Feb 27, 2024 at 2:48 PM Lance Yang <ioworker0@gmail.com> wrote:
> >
> > Could you separate the export of folio_pte_batch() from the large folio
> > swap-in v2? Prioritizing this specific change would aid the development
> > of the v2 based on it.
>
> I agree we should get this one pulled in by Andrew early to avoid potential
> dependencies and conflicts between the two series.
[...]
On Tue, Feb 27, 2024 at 2:51 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>
> On 2/27/24 04:49, Barry Song wrote:
[...]
> > I actually missed folio_pte_batch() in the cont-pte series and re-invented
> > a function to check if a large folio is entirely mapped in MADV_PAGEOUT[1].
> > Exporting folio_pte_batch() will also benefit that case. The problem space
> > is the same.
> >
> > [1] https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/
>
> I am wondering whether we can delay the large folio split till the page
> reclaim phase for madvise cases.
>
> Like if we hit a folio which is partially mapped to the range, don't split it but
> just unmap the mapped part from the range. Let page reclaim decide whether
> to split the large folio or not (if it's not mapped to any other range, it will be
> freed as a whole large folio; if part of it is still mapped to another range, page
> reclaim can decide whether to split it or ignore it for the current reclaim cycle).

Yes, we can. But we still have to play the PTE-check game to avoid adding
folios multiple times to the reclaim list.

I don't see too much difference between splitting in madvise and splitting
in vmscan, as our real purpose is avoiding splitting entirely mapped
large folios. For partially mapped large folios, if we split in madvise, then
we don't need to play the game of skipping folios while iterating PTEs.
If we don't split in madvise, we have to make sure the large folio is only
added to the reclaim list one time, by checking whether PTEs belong to the
previously added folio.

> Splitting does work here. But it just drops all the benefits of the large folio.
>
> Regards
> Yin, Fengwei

Thanks
Barry
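A sketch of the dedup Barry mentions, i.e. remembering the folio just queued so later PTEs mapping it are skipped (prev_folio and folio_list are hypothetical names, not code from the thread):

	struct folio *prev_folio = NULL;

	for (; addr < end; pte++, addr += PAGE_SIZE) {
		pte_t ptent = ptep_get(pte);

		if (pte_none(ptent) || !pte_present(ptent))
			continue;

		folio = vm_normal_folio(vma, addr, ptent);
		if (!folio || folio == prev_folio)
			continue;	/* already queued via an earlier PTE */

		prev_folio = folio;
		list_add(&folio->lru, &folio_list);	/* queue at most once */
	}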
On 2/27/24 10:17, Barry Song wrote:
>> Like if we hit a folio which is partially mapped to the range, don't split it but
>> just unmap the mapped part from the range. Let page reclaim decide whether
>> to split the large folio or not (if it's not mapped to any other range, it will be
>> freed as a whole large folio; if part of it is still mapped to another range, page
>> reclaim can decide whether to split it or ignore it for the current reclaim cycle).
> Yes, we can. But we still have to play the PTE-check game to avoid adding
> folios multiple times to the reclaim list.
>
> I don't see too much difference between splitting in madvise and splitting
> in vmscan, as our real purpose is avoiding splitting entirely mapped
> large folios. For partially mapped large folios, if we split in madvise, then
> we don't need to play the game of skipping folios while iterating PTEs.
> If we don't split in madvise, we have to make sure the large folio is only
> added to the reclaim list one time, by checking whether PTEs belong to the
> previously added folio.

If the partially mapped large folio is unmapped from the range, the related
PTEs become none. How could the folio be added to the reclaim list multiple
times?

Regards
Yin, Fengwei
On Tue, Feb 27, 2024 at 7:14 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>
> On 2/27/24 10:17, Barry Song wrote:
[...]
> > If we don't split in madvise, we have to make sure the large folio is only
> > added to the reclaim list one time, by checking whether PTEs belong to the
> > previously added folio.
>
> If the partially mapped large folio is unmapped from the range, the related
> PTEs become none. How could the folio be added to the reclaim list multiple
> times?

In case we have 16 PTEs in a large folio:
PTE0 present
PTE1 present
PTE2 present
PTE3 none
PTE4 present
PTE5 none
PTE6 present
....
The current code is scanning PTEs one by one. While scanning PTE0, we have
added the folio; then PTE1, PTE2, PTE4, PTE6...

There are all kinds of possibilities for unmapping.

So what we can do is record that we have added the folio while scanning
PTE0, then skip this folio for all other PTEs.

Otherwise, we can split it while scanning PTE0; then we will meet different
folios afterwards.

> Regards
> Yin, Fengwei

Thanks
Barry
On Tue, Feb 27, 2024 at 7:40 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Feb 27, 2024 at 7:14 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
[...]
> In case we have 16 PTEs in a large folio:
> PTE0 present
> PTE1 present
> PTE2 present
> PTE3 none
> PTE4 present
> PTE5 none
> PTE6 present
> ....
> The current code is scanning PTEs one by one. While scanning PTE0, we have
> added the folio; then PTE1, PTE2, PTE4, PTE6...
>
> There are all kinds of possibilities for unmapping.

Not to mention we have all kinds of possibilities like
PTE0 present for large folio1
PTE1 present for large folio1
PTE2 present for another folio2
PTE3 present for another folio3
PTE4 present for large folio1
...

> So what we can do is record that we have added the folio while scanning
> PTE0, then skip this folio for all other PTEs.
>
> Otherwise, we can split it while scanning PTE0; then we will meet different
> folios afterwards.
>
> Thanks
> Barry
On 2/27/24 14:40, Barry Song wrote:
> On Tue, Feb 27, 2024 at 7:14 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>> [...]
>> If the partially mapped large folio is unmapped from the range, the
>> related PTEs become none. How could the folio be added to the reclaim
>> list multiple times?
>
> In case we have 16 PTEs in a large folio:
>
> PTE0 present
> PTE1 present
> PTE2 present
> PTE3 none
> PTE4 present
> PTE5 none
> PTE6 present
> ....
>
> the current code is scanning PTEs one by one. While scanning PTE0, we have
> added the folio; then come PTE1, PTE2, PTE4, PTE6...

No. Before detecting that the folio is fully mapped to the range, we can't
add the folio to the reclaim list, because a partially mapped folio
shouldn't be added. Only after scanning PTE15 do we know that it's fully
mapped.

So, when scanning PTE0, we will not add the folio. Then, when we hit PTE3,
we know this is a partially mapped large folio and we unmap it. Then all 16
PTEs become none.

If the large folio is fully mapped, the folio will be added to the reclaim
list after we scan PTE15 and know it's fully mapped.

Regards
Yin, Fengwei

> There are all kinds of possibilities for unmapping.
> [...]
On Tue, Feb 27, 2024 at 8:02 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> [...]
> No. Before detecting that the folio is fully mapped to the range, we can't
> add the folio to the reclaim list, because a partially mapped folio
> shouldn't be added. Only after scanning PTE15 do we know that it's fully
> mapped.

You never know that PTE15 is the last one mapping to the large folio; PTE15
can be mapping to a completely different folio from the one at PTE0.

> So, when scanning PTE0, we will not add the folio. Then, when we hit PTE3,
> we know this is a partially mapped large folio and we unmap it. Then all
> 16 PTEs become none.

I don't understand why all 16 PTEs become none, as we don't set PTEs to
none here; the PTEs are only replaced with swap entries once
try_to_unmap_one() is called by vmscan.

> If the large folio is fully mapped, the folio will be added to the reclaim
> list after we scan PTE15 and know it's fully mapped.

Our approach is calling the PTE batch helper when meeting the first PTE; if
the batch size is 16, then we add this folio to reclaim_list and skip the
remaining 15 PTEs.

> Regards
> Yin, Fengwei

Thanks
Barry
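A sketch of the batch check Barry outlines, assuming the folio_pte_batch()
helper from David Hildenbrand's rmap/PTE batching work referenced earlier in
the thread (the exact signature and FPB_* flags may differ between
versions, and LRU isolation is omitted); this is illustrative pseudocode
for the MADV_PAGEOUT scan loop, not a tested patch:

	if (folio_test_large(folio)) {
		/* How many consecutive PTEs, starting here, map
		 * consecutive pages of this folio? */
		int max_nr = (end - addr) / PAGE_SIZE;
		int nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
					 FPB_IGNORE_DIRTY |
					 FPB_IGNORE_SOFT_DIRTY, NULL);

		if (nr == folio_nr_pages(folio)) {
			/* Entirely mapped in this range: queue the folio
			 * once (isolation etc. omitted), then skip its
			 * remaining nr - 1 PTEs; the loop header advances
			 * past the last one. */
			list_add(&folio->lru, &folio_list);
			pte += nr - 1;
			addr += (nr - 1) * PAGE_SIZE;
			continue;
		}
		/* Partially mapped: fall through to the split path. */
	}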
On Tue, Feb 27, 2024 at 8:11 PM Barry Song <21cnbao@gmail.com> wrote:
> [...]
> Our approach is calling the PTE batch helper when meeting the first PTE;
> if the batch size is 16, then we add this folio to reclaim_list and skip
> the remaining 15 PTEs.

Let's compare the two different implementations, for a partially mapped
large folio with 8 PTEs as below:

PTE0 present for large folio1
PTE1 present for large folio1
PTE2 present for another folio2
PTE3 present for another folio3
PTE4 present for large folio1
PTE5 present for large folio1
PTE6 present for another folio4
PTE7 present for another folio5

If we don't split in madvise (depending on vmscan to split after adding
folio1), we will have to make sure folio1, folio2, folio3, folio4 and
folio5 are added to reclaim_list by playing a complex game while scanning
these 8 PTEs.

If we split in madvise, they become:

PTE0 present for folioA - split from folio1
PTE1 present for folioB - split from folio1
PTE2 present for another folio2
PTE3 present for another folio3
PTE4 present for folioC - split from folio1
PTE5 present for folioD - split from folio1
PTE6 present for another folio4
PTE7 present for another folio5

and we simply add the above 8 folios into reclaim_list one by one.

I would vote for splitting partially mapped large folios in madvise.

Thanks
Barry
On 2/27/24 15:21, Barry Song wrote:
> [...]
> Let's compare the two different implementations, for a partially mapped
> large folio with 8 PTEs as below:
>
> PTE0 present for large folio1
> PTE1 present for large folio1
> PTE2 present for another folio2
> PTE3 present for another folio3
> PTE4 present for large folio1
> PTE5 present for large folio1
> PTE6 present for another folio4
> PTE7 present for another folio5
>
> If we don't split in madvise (depending on vmscan to split after adding
> folio1), we will have

Let me clarify something here:

I prefer that we don't split the large folio here. Instead, we unmap the
large folio from this VMA range (I think you missed the unmap operation I
mentioned). The intention is to try our best to avoid splitting the large
folio. If the folio is only partially mapped to this VMA range, it's likely
it will be reclaimed as a whole large folio, which brings benefits for LRU
and zone lock contention compared to splitting the large folio.

The thing I am not sure about is that unmapping from a specific VMA range
is not available today, and whether it's worth adding.

> to make sure folio1, folio2, folio3, folio4 and folio5 are added to
> reclaim_list by playing a complex game while scanning these 8 PTEs.
>
> If we split in madvise, they become:
> [...]
>
> I would vote for splitting partially mapped large folios in madvise.
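If someone wanted to prototype the idea above, the closest existing
primitive is probably zap_page_range_single(), which the MADV_DONTNEED path
uses. A purely hypothetical sketch follows (folio_start_addr and
folio_end_addr are made-up names for the folio's mapped extent in this VMA),
with the caveat raised in the reply below that zapping discards the pages
rather than paging them out:

	/*
	 * Hypothetical: unmap just the part of a partially mapped large
	 * folio that lies inside [addr, end), and let vmscan decide later
	 * whether to split the folio. Not a real proposal; as noted in
	 * the follow-up, this changes the madvise semantics.
	 */
	unsigned long zap_start = max(addr, folio_start_addr);
	unsigned long zap_end = min(end, folio_end_addr);

	zap_page_range_single(vma, zap_start, zap_end - zap_start, NULL);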
On Tue, Feb 27, 2024 at 8:42 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> [...]
> Let me clarify something here:
>
> I prefer that we don't split the large folio here. Instead, we unmap the
> large folio from this VMA range (I think you missed the unmap operation I
> mentioned).

I don't understand why we unmap, as this is MADV_PAGEOUT, not an unmap;
unmapping totally changes the semantics. Would you like to show pseudo
code?

For MADV_PAGEOUT, the last step of swap-out is writing swap entries to
replace the PTEs which are present. I don't understand how an unmap can be
involved in this process.

> The intention is to try our best to avoid splitting the large folio. If
> the folio is only partially mapped to this VMA range, it's likely it will
> be reclaimed as a whole large folio, which brings benefits for LRU and
> zone lock contention compared to splitting the large folio.

Which also brings negative side effects such as redundant I/O. For example,
if you have only one subpage left in a large folio, pageout will still
write nr_pages subpages into swap, then immediately free them in swap.

> The thing I am not sure about is that unmapping from a specific VMA range
> is not available today, and whether it's worth adding.

I think we might have the possibility to have some complex code that adds
folio1, folio2, folio3, folio4 and folio5 in the above example into
reclaim_list while avoiding splitting folio1. But I really don't understand
how unmap will work.

> [...]

Thanks
Barry
On 2/27/24 15:54, Barry Song wrote:
> [...]
>> I prefer that we don't split the large folio here. Instead, we unmap the
>> large folio from this VMA range (I think you missed the unmap operation
>> I mentioned).
>
> I don't understand why we unmap, as this is MADV_PAGEOUT, not an unmap;
> unmapping totally changes the semantics. Would you like to show pseudo
> code?

Oh. Yes. MADV_PAGEOUT is not suitable.

What about MADV_FREE?

> For MADV_PAGEOUT, the last step of swap-out is writing swap entries to
> replace the PTEs which are present. I don't understand how an unmap can
> be involved in this process.
> [...]
On Tue, Feb 27, 2024 at 9:33 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> [...]
>> I don't understand why we unmap, as this is MADV_PAGEOUT, not an unmap;
>> unmapping totally changes the semantics. Would you like to show pseudo
>> code?
>
> Oh. Yes. MADV_PAGEOUT is not suitable.
>
> What about MADV_FREE?

We can't unmap either, as MADV_FREE applies to anon VMAs. When a folio is
marked lazyfree, we move the anon folio to the file LRU. If somebody writes
the folio afterwards, we take the folio back; if nobody writes it before
vmscan gets to it in the file LRU, we can reclaim it by setting the PTEs to
none.

So we can't immediately unmap a large folio at the time MADV_FREE is
called; immediate unmapping is the behavior of MADV_DONTNEED, not
MADV_FREE.

> [...]

Thanks
Barry
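As an aside, the lazyfree semantics described above are easy to observe
from userspace. A minimal sketch follows (the 64KiB size is an arbitrary
mTHP-like example; whether the range is backed by a large folio depends on
the system's mTHP configuration):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64 * 1024;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) { perror("mmap"); return 1; }

	memset(buf, 0xaa, len);

	/* Mark lazyfree: the pages stay mapped; under memory pressure the
	 * kernel may discard clean lazyfree pages instead of swapping. */
	if (madvise(buf, len, MADV_FREE)) { perror("madvise"); return 1; }

	/* Unlike MADV_DONTNEED, the old contents may still be visible here
	 * (until reclaim actually runs); writing takes the pages back. */
	printf("first byte after MADV_FREE: 0x%02x\n",
	       (unsigned char)buf[0]);

	munmap(buf, len);
	return 0;
}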
diff --git a/mm/madvise.c b/mm/madvise.c
index cfa5e7288261..bcbf56595a2e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -676,11 +676,43 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		 */
 		if (folio_test_large(folio)) {
 			int err;
+			unsigned long next_addr, align;
 
-			if (folio_estimated_sharers(folio) != 1)
-				break;
-			if (!folio_trylock(folio))
-				break;
+			if (folio_estimated_sharers(folio) != 1 ||
+			    !folio_trylock(folio))
+				goto skip_large_folio;
+
+			align = folio_nr_pages(folio) * PAGE_SIZE;
+			next_addr = ALIGN_DOWN(addr + align, align);
+
+			/*
+			 * If we mark only the subpages as lazyfree,
+			 * split the large folio.
+			 */
+			if (next_addr > end || next_addr - addr != align)
+				goto split_large_folio;
+
+			/*
+			 * Avoid unnecessary folio splitting if the large
+			 * folio is entirely within the given range.
+			 */
+			folio_test_clear_dirty(folio);
+			folio_unlock(folio);
+			for (; addr != next_addr; pte++, addr += PAGE_SIZE) {
+				ptent = ptep_get(pte);
+				if (pte_young(ptent) || pte_dirty(ptent)) {
+					ptent = ptep_get_and_clear_full(
+						mm, addr, pte, tlb->fullmm);
+					ptent = pte_mkold(ptent);
+					ptent = pte_mkclean(ptent);
+					set_pte_at(mm, addr, pte, ptent);
+					tlb_remove_tlb_entry(tlb, pte, addr);
+				}
+			}
+			folio_mark_lazyfree(folio);
+			goto next_folio;
+
+split_large_folio:
 			folio_get(folio);
 			arch_leave_lazy_mmu_mode();
 			pte_unmap_unlock(start_pte, ptl);
@@ -688,13 +720,28 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			err = split_folio(folio);
 			folio_unlock(folio);
 			folio_put(folio);
-			if (err)
-				break;
-			start_pte = pte =
-				pte_offset_map_lock(mm, pmd, addr, &ptl);
-			if (!start_pte)
-				break;
-			arch_enter_lazy_mmu_mode();
+
+			/*
+			 * If the large folio is locked before madvise(MADV_FREE)
+			 * or cannot be split, we just skip it.
+			 */
+			if (err) {
+skip_large_folio:
+				if (next_addr >= end)
+					break;
+				pte += (next_addr - addr) / PAGE_SIZE;
+				addr = next_addr;
+			}
+
+			if (!start_pte) {
+				start_pte = pte = pte_offset_map_lock(
+					mm, pmd, addr, &ptl);
+				if (!start_pte)
+					break;
+				arch_enter_lazy_mmu_mode();
+			}
+
+next_folio:
 			pte--;
 			addr -= PAGE_SIZE;
 			continue;
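To see what the patch's containment test accepts and rejects, here is a
small standalone C translation of just that check (names simplified, 4KiB
pages assumed; fully_inside() is a made-up helper name mirroring the
patch's next_addr/align logic):

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL
#define ALIGN_DOWN(x, a) ((x) & ~((a) - 1))

/* Mirrors the patch's test: the large folio is lazyfreed in place only
 * if the naturally aligned folio-sized block starting at addr lies
 * entirely inside [addr, end); otherwise the folio must be split. */
static bool fully_inside(unsigned long addr, unsigned long end,
			 unsigned long nr_pages)
{
	unsigned long align = nr_pages * PAGE_SIZE;
	unsigned long next_addr = ALIGN_DOWN(addr + align, align);

	return next_addr <= end && next_addr - addr == align;
}

int main(void)
{
	/* 16-page (64KiB) folio at an aligned address, range covers it */
	printf("%d\n", fully_inside(0x10000, 0x20000, 16));	/* 1: no split */
	/* same folio but the range ends halfway through it */
	printf("%d\n", fully_inside(0x10000, 0x18000, 16));	/* 0: split */
	/* unaligned start: only the tail subpages are in the range */
	printf("%d\n", fully_inside(0x12000, 0x30000, 16));	/* 0: split */
	return 0;
}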