Message ID | 20240321220802.679544-13-peterx@redhat.com (mailing list archive)
---|---
State | New
Series | mm/gup: Unify hugetlb, part 2
On Thu, Mar 21, 2024 at 06:08:02PM -0400, peterx@redhat.com wrote:

> A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
> a tight loop of slow gup after the path switched. That shouldn't be a
> problem because slow-gup should not be a hot path for GUP in general: when
> page is commonly present, fast-gup will already succeed, while when the
> page is indeed missing and require a follow up page fault, the slow gup
> degrade will probably buried in the fault paths anyway. It also explains
> why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
> accelerate thp gup even for "pages != NULL"") lands, the latter not part of
> a performance analysis but a side benefit. If the performance will be a
> concern, we can consider handle CONT_PTE in follow_page().

I think this is probably fine for the moment, at least for this
series, as CONT_PTE is still very new.

But it will need to be optimized. "Slow" GUP is the only GUP that is
used by FOLL_LONGTERM, and it still needs to be optimized because you
can't assume a FOLL_LONGTERM user will be hitting the really slow
fault path. There are enough important cases where it is just reading
already populated page tables, and these days, often with large folios.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason
Jason,

On Fri, Mar 22, 2024 at 10:30:12AM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 21, 2024 at 06:08:02PM -0400, peterx@redhat.com wrote:
> > [...]
>
> I think this is probably fine for the moment, at least for this
> series, as CONT_PTE is still very new.
>
> But it will need to be optimized. "Slow" GUP is the only GUP that is
> used by FOLL_LONGTERM, and it still needs to be optimized because you
> can't assume a FOLL_LONGTERM user will be hitting the really slow
> fault path. There are enough important cases where it is just reading
> already populated page tables, and these days, often with large folios.

Ah, I thought FOLL_LONGTERM should work in most cases for fast-gup,
especially for hugetlb, but maybe I missed something?

I do see that devmap skips fast-gup for LONGTERM, and we also have that
writeback issue, but none of those that I can find applies to hugetlb.
This might be a problem indeed if we have hugetlb cont_pte pages that
will constantly fall back to slow gup.

OTOH, I also agree with you that such batching would be nice to have for
slow-gup; likely devmap or many fs (excluding shmem/hugetlb) file
mappings can at least benefit from it due to the above. But then that'll
be a more generic issue to solve; IOW, we still don't do that for
!hugetlb cont_pte large folios, before or after this series.

> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Thanks!
On Fri, Mar 22, 2024 at 11:55:11AM -0400, Peter Xu wrote:
> Ah, I thought FOLL_LONGTERM should work in most cases for fast-gup,
> especially for hugetlb, but maybe I missed something?

Ah, no, this is my bad memory; there was a time where that was true, but
it is not the case now. It is a really bad memory, because it seems I
removed parts of it :)

> I do see that devmap skips fast-gup for LONGTERM, we also have that
> writeback issue but none of those that I can find applies to hugetlb.
> This might be a problem indeed if we have hugetlb cont_pte pages that
> will constantly fallback to slow gup.

Right, DAX would be the main use case I can think of. Today the
intersection of DAX and contig PTE is non-existent, so let's not worry.

> OTOH, I also agree with you that such batching would be nice to have for
> slow-gup, likely devmap or many fs (exclude shmem/hugetlb) file mappings
> can at least benefit from it due to above. But then that'll be a more
> generic issue to solve, IOW, we still don't do that for !hugetlb cont_pte
> large folios, before or after this series.

Right, improving contig pte is going to be a process, and eventually it
will make sense to optimize this regardless of hugetlbfs.

Jason
On Thu, 21 Mar 2024 18:08:02 -0400 peterx@redhat.com wrote:

> From: Peter Xu <peterx@redhat.com>
>
> Now follow_page() is ready to handle hugetlb pages in whatever form, and
> over all architectures. Switch to the generic code path.
>
> Time to retire hugetlb_follow_page_mask(), following the previous
> retirement of follow_hugetlb_page() in 4849807114b8.
>
> There may be a slight difference of how the loops run when processing slow
> GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
> loop of __get_user_pages() will resolve one pgtable entry with the patch
> applied, rather than relying on the size of hugetlb hstate, the latter may
> cover multiple entries in one loop.
>
> A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
> a tight loop of slow gup after the path switched. That shouldn't be a
> problem because slow-gup should not be a hot path for GUP in general: when
> page is commonly present, fast-gup will already succeed, while when the
> page is indeed missing and require a follow up page fault, the slow gup
> degrade will probably buried in the fault paths anyway. It also explains
> why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
> accelerate thp gup even for "pages != NULL"") lands, the latter not part of
> a performance analysis but a side benefit. If the performance will be a
> concern, we can consider handle CONT_PTE in follow_page().
>
> Before that is justified to be necessary, keep everything clean and simple.

mm/gup.c:33:21: warning: 'follow_hugepd' declared 'static' but never defined [-Wunused-function]
   33 | static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
      |                     ^~~~~~~~~~~~~

--- a/mm/gup.c~mm-gup-handle-hugepd-for-follow_page-fix
+++ a/mm/gup.c
@@ -30,10 +30,12 @@ struct follow_page_context {
 	unsigned int page_mask;
 };
 
+#ifdef CONFIG_HAVE_FAST_GUP
 static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
 				  unsigned long addr, unsigned int pdshift,
 				  unsigned int flags,
 				  struct follow_page_context *ctx);
+#endif
 
 static inline void sanity_check_pinned_pages(struct page **pages,
 					     unsigned long npages)
_

This looks inelegant.

That's two build issues so far. Please be more expansive in the
Kconfig variations when testing. Especially when mucking with pgtable
macros.
On Fri, Mar 22, 2024 at 01:48:18PM -0700, Andrew Morton wrote:
> On Thu, 21 Mar 2024 18:08:02 -0400 peterx@redhat.com wrote:
> > [...]
>
> mm/gup.c:33:21: warning: 'follow_hugepd' declared 'static' but never defined [-Wunused-function]
>    33 | static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
>       |                     ^~~~~~~~~~~~~
>
> [...]
>
> This looks inelegant.
>
> That's two build issues so far. Please be more expansive in the
> Kconfig variations when testing. Especially when mucking with pgtable
> macros.

Andrew,

Apologies for that, and also for a slightly late response. Yeah, it's
time I put together my own setup for serious build tests, and I'll at
least start to cover a few error-prone configs/archs with that.

I was trying to rely on the build bot in many previous such cases, as it
was quite useful for covering build issues without investing in my own
test setup, but for some reason it seems to have retired and stopped
working for a while. Maybe I shouldn't have relied on it at all.

For this specific issue, I'm not sure CONFIG_HAVE_FAST_GUP is the proper
guard, as follow_hugepd() is used in slow gup, not fast. So maybe we can
put that under CONFIG_MMU, below that code (and I think we can drop
"static" too, as I don't think it's anything useful). My version of the
fixup is attached at the end of this email, and I verified it on an m68k
build.

I do plan to post a small fixup series to fix these issues (so far it may
contain 1 formal patch to touch up vmstat_item_print_in_thp, and 2 fixups
where I'll mark the subject with "fixup!" properly). Either you can pick
up the below, or you can wait for my small patchset; it should be there
either today or tomorrow.

Thanks,

===8<===
diff --git a/mm/gup.c b/mm/gup.c
index 4cd349390477..a2ed8203495a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -30,11 +30,6 @@ struct follow_page_context {
 	unsigned int page_mask;
 };
 
-static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
-				  unsigned long addr, unsigned int pdshift,
-				  unsigned int flags,
-				  struct follow_page_context *ctx);
-
 static inline void sanity_check_pinned_pages(struct page **pages,
 					     unsigned long npages)
 {
@@ -505,6 +500,12 @@ static inline void mm_set_has_pinned_flag(unsigned long *mm_flags)
 }
 
 #ifdef CONFIG_MMU
+
+struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
+			   unsigned long addr, unsigned int pdshift,
+			   unsigned int flags,
+			   struct follow_page_context *ctx);
+
 static struct page *no_page_table(struct vm_area_struct *vma,
 				  unsigned int flags, unsigned long address)
 {
===8<===
On Fri, Mar 22, 2024 at 08:45:59PM -0400, Peter Xu wrote:
> [...]
>
> For this specific issue, I'm not sure if CONFIG_HAVE_FAST_GUP is proper?
> As follow_hugepd() is used in slow gup not fast. So maybe we can put that
> under CONFIG_MMU below that code (and I think we can drop "static" too as I
> don't think it's anything useful). My version of fixup attached at the end

The static is useful; the patch below did pass on m68k but won't on
x86.. please ignore that.

> of email, and I verified it on m68k build.
>
> I do plan to post a small fixup series to fix these issues (so far it may
> contain 1 formal patch to touch up vmstat_item_print_in_thp, and 2 fixups
> where I'll mark the subject with "fixup!" properly). Either you can pick
> up below or you can wait for my small patchset, should be there either
> today or tomorrow.

I changed plans here too; I found more users of HPAGE_PMD_NR assuming
it's defined even if !CONFIG_MMU. That's weird, as !CONFIG_MMU doesn't
even define PMD_SHIFT... To fix this I decided to use the old trick of
using BUILD_BUG(), like it used to work before; frankly I don't know how
that didn't throw warnings, but I'll make sure it passes all known
builds (ps: I still haven't got my build harness ready, so that will
still be limited, but it should solve the known issues).

In short: please wait for my fixup series. Thanks.

> [...]
>
> --
> Peter Xu
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 52d9efcf1edf..85e1c9931ae5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -328,13 +328,6 @@ static inline void hugetlb_zap_end(
 {
 }
 
-static inline struct page *hugetlb_follow_page_mask(
-	struct vm_area_struct *vma, unsigned long address, unsigned int flags,
-	unsigned int *page_mask)
-{
-	BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
-}
-
 static inline int copy_hugetlb_page_range(struct mm_struct *dst,
 			struct mm_struct *src,
 			struct vm_area_struct *dst_vma,
diff --git a/mm/gup.c b/mm/gup.c
index 43a2e0a203cd..2eb5911ba849 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -997,18 +997,11 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 {
 	pgd_t *pgd, pgdval;
 	struct mm_struct *mm = vma->vm_mm;
+	struct page *page;
 
-	ctx->page_mask = 0;
-
-	/*
-	 * Call hugetlb_follow_page_mask for hugetlb vmas as it will use
-	 * special hugetlb page table walking code.  This eliminates the
-	 * need to check for hugetlb entries in the general walking code.
-	 */
-	if (is_vm_hugetlb_page(vma))
-		return hugetlb_follow_page_mask(vma, address, flags,
-						&ctx->page_mask);
+	vma_pgtable_walk_begin(vma);
 
+	ctx->page_mask = 0;
 	pgd = pgd_offset(mm, address);
 	pgdval = *pgd;
@@ -1020,6 +1013,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 	else
 		page = follow_p4d_mask(vma, address, pgd, flags, ctx);
 
+	vma_pgtable_walk_end(vma);
+
 	return page;
 }
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index abec04575c89..2e320757501b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6883,77 +6883,6 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 }
 #endif /* CONFIG_USERFAULTFD */
 
-struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
-				      unsigned long address, unsigned int flags,
-				      unsigned int *page_mask)
-{
-	struct hstate *h = hstate_vma(vma);
-	struct mm_struct *mm = vma->vm_mm;
-	unsigned long haddr = address & huge_page_mask(h);
-	struct page *page = NULL;
-	spinlock_t *ptl;
-	pte_t *pte, entry;
-	int ret;
-
-	hugetlb_vma_lock_read(vma);
-	pte = hugetlb_walk(vma, haddr, huge_page_size(h));
-	if (!pte)
-		goto out_unlock;
-
-	ptl = huge_pte_lock(h, mm, pte);
-	entry = huge_ptep_get(pte);
-	if (pte_present(entry)) {
-		page = pte_page(entry);
-
-		if (!huge_pte_write(entry)) {
-			if (flags & FOLL_WRITE) {
-				page = NULL;
-				goto out;
-			}
-
-			if (gup_must_unshare(vma, flags, page)) {
-				/* Tell the caller to do unsharing */
-				page = ERR_PTR(-EMLINK);
-				goto out;
-			}
-		}
-
-		page = nth_page(page, ((address & ~huge_page_mask(h)) >> PAGE_SHIFT));
-
-		/*
-		 * Note that page may be a sub-page, and with vmemmap
-		 * optimizations the page struct may be read only.
-		 * try_grab_page() will increase the ref count on the
-		 * head page, so this will be OK.
-		 *
-		 * try_grab_page() should always be able to get the page here,
-		 * because we hold the ptl lock and have verified pte_present().
-		 */
-		ret = try_grab_page(page, flags);
-
-		if (WARN_ON_ONCE(ret)) {
-			page = ERR_PTR(ret);
-			goto out;
-		}
-
-		*page_mask = (1U << huge_page_order(h)) - 1;
-	}
-out:
-	spin_unlock(ptl);
-out_unlock:
-	hugetlb_vma_unlock_read(vma);
-
-	/*
-	 * Fixup retval for dump requests: if pagecache doesn't exist,
-	 * don't try to allocate a new page but just skip it.
-	 */
-	if (!page && (flags & FOLL_DUMP) &&
-	    !hugetlbfs_pagecache_present(h, vma, address))
-		page = ERR_PTR(-EFAULT);
-
-	return page;
-}
-
 long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot,
 		unsigned long cp_flags)