Message ID | 20220504214437.2850685-2-zokeefe@google.com (mailing list archive)
---|---
State | New
Series | mm: userspace hugepage collapse
On Wed, May 04, 2022 at 02:44:25PM -0700, Zach O'Keefe wrote:
> +static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> +				   unsigned long address,
> +				   pmd_t **pmd)
> +{
> +	pmd_t pmde;
> +
> +	*pmd = mm_find_pmd_raw(mm, address);
> +	if (!*pmd)
> +		return SCAN_PMD_NULL;
> +
> +	pmde = pmd_read_atomic(*pmd);

It seems to be correct on using the atomic fetcher here.  Though irrelevant
to this patchset.. does it also mean that we miss that on mm_find_pmd()?  I
meant a separate fix like this one:

---8<---
diff --git a/mm/rmap.c b/mm/rmap.c
index 69416072b1a6..61309718640f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -785,7 +785,7 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
 	 * without holding anon_vma lock for write.  So when looking for a
 	 * genuine pmde (in which to find pte), test present and !THP together.
 	 */
-	pmde = *pmd;
+	pmde = pmd_read_atomic(pmd);
 	barrier();
 	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
 		pmd = NULL;
---8<---

As otherwise it seems it's also prone to PAE race conditions when reading
pmd out, but I could be missing something.

> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	/* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> +	barrier();
> +#endif
> +	if (!pmd_present(pmde))
> +		return SCAN_PMD_NULL;
> +	if (pmd_trans_huge(pmde))
> +		return SCAN_PMD_MAPPED;

Would it be safer to check pmd_bad()?  I think not all mm pmd paths check
that, frankly I don't really know what's the major cause of a bad pmd
(either software bugs or corrupted mem), but just to check with you,
because potentially a bad pmd can be read as SCAN_SUCCEED and go through.

> +	return SCAN_SUCCEED;
> +}

The rest looks good to me, thanks.
Thanks again for the review, Peter.

On Wed, May 18, 2022 at 11:41 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, May 04, 2022 at 02:44:25PM -0700, Zach O'Keefe wrote:
> > +static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> > +				   unsigned long address,
> > +				   pmd_t **pmd)
> > +{
> > +	pmd_t pmde;
> > +
> > +	*pmd = mm_find_pmd_raw(mm, address);
> > +	if (!*pmd)
> > +		return SCAN_PMD_NULL;
> > +
> > +	pmde = pmd_read_atomic(*pmd);
>
> It seems to be correct on using the atomic fetcher here.  Though irrelevant
> to this patchset.. does it also mean that we miss that on mm_find_pmd()?  I
> meant a separate fix like this one:
>
> ---8<---
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 69416072b1a6..61309718640f 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -785,7 +785,7 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
>  	 * without holding anon_vma lock for write.  So when looking for a
>  	 * genuine pmde (in which to find pte), test present and !THP together.
>  	 */
> -	pmde = *pmd;
> +	pmde = pmd_read_atomic(pmd);
>  	barrier();
>  	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
>  		pmd = NULL;
> ---8<---
>
> As otherwise it seems it's also prone to PAE race conditions when reading
> pmd out, but I could be missing something.
>

This is a good question. I took some time to look into this, but it's
very complicated and unfortunately I couldn't reach a conclusion in
the time I allotted to myself. My working (unverified) assumption is
that mm_find_pmd() is called in places where it doesn't care if the
pmd isn't read atomically. If so, does that also mean MADV_COLLAPSE is
safe? I'm not sure. These i386 PAE + THP racing issues were most
recently discussed when considering if READ_ONCE() should be used
instead of pmd_read_atomic() [1].

[1] https://lore.kernel.org/linux-mm/594c1f0-d396-5346-1f36-606872cddb18@google.com/

> > +
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +	/* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> > +	barrier();
> > +#endif
> > +	if (!pmd_present(pmde))
> > +		return SCAN_PMD_NULL;
> > +	if (pmd_trans_huge(pmde))
> > +		return SCAN_PMD_MAPPED;
>
> Would it be safer to check pmd_bad()?  I think not all mm pmd paths check
> that, frankly I don't really know what's the major cause of a bad pmd
> (either software bugs or corrupted mem), but just to check with you,
> because potentially a bad pmd can be read as SCAN_SUCCEED and go through.
>

Likewise, I'm not sure what the cause of "bad pmds" is.

Do you mean to check pmd_bad() instead of pmd_trans_huge()? I.e. b/c a
pmd-mapped thp counts as "bad" (at least on x86 since PSE set) or do
you mean to additionally check pmd_bad() after the pmd_trans_huge()
check?

If it's the former, I'd say we can't claim !pmd_bad() == memory
already backed by thps / our job here is done.

If it's the latter, I don't see it hurting much (but I can't argue
intelligently about why it's needed) and can include the check in v6.

Thanks again,
Zach

> > +	return SCAN_SUCCEED;
> > +}
>
> The rest looks good to me, thanks.
>
> --
> Peter Xu
>
On Thu, May 19, 2022 at 02:06:25PM -0700, Zach O'Keefe wrote:
> Thanks again for the review, Peter.
>
> On Wed, May 18, 2022 at 11:41 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, May 04, 2022 at 02:44:25PM -0700, Zach O'Keefe wrote:
> > > +static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> > > +				   unsigned long address,
> > > +				   pmd_t **pmd)
> > > +{
> > > +	pmd_t pmde;
> > > +
> > > +	*pmd = mm_find_pmd_raw(mm, address);
> > > +	if (!*pmd)
> > > +		return SCAN_PMD_NULL;
> > > +
> > > +	pmde = pmd_read_atomic(*pmd);
> >
> > It seems to be correct on using the atomic fetcher here.  Though irrelevant
> > to this patchset.. does it also mean that we miss that on mm_find_pmd()?  I
> > meant a separate fix like this one:
> >
> > ---8<---
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 69416072b1a6..61309718640f 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -785,7 +785,7 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> >  	 * without holding anon_vma lock for write.  So when looking for a
> >  	 * genuine pmde (in which to find pte), test present and !THP together.
> >  	 */
> > -	pmde = *pmd;
> > +	pmde = pmd_read_atomic(pmd);
> >  	barrier();
> >  	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> >  		pmd = NULL;
> > ---8<---
> >
> > As otherwise it seems it's also prone to PAE race conditions when reading
> > pmd out, but I could be missing something.
> >
>
> This is a good question. I took some time to look into this, but it's
> very complicated and unfortunately I couldn't reach a conclusion in
> the time I allotted to myself. My working (unverified) assumption is
> that mm_find_pmd() is called in places where it doesn't care if the
> pmd isn't read atomically.

Frankly I am still not sure on that.

Say the immediate check in mm_find_pmd() is pmd_trans_huge() afterwards,
and AFAICT that is:

static inline int pmd_trans_huge(pmd_t pmd)
{
	return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
}

Where we have:

#define _PAGE_BIT_PSE		7	/* 4 MB (or 2MB) page */
#define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
#define _PAGE_BIT_SOFTW4	58	/* available for programmer */

I think it means we're checking both the lower (bit 7) and upper (bit 58)
of the 32bits values of the whole 64bit pmd on PAE, and I can't explain how
that'll not require atomicity..

pmd_read_atomic() plays with a magic that we'll either:

  - Get atomic results for none or pgtable pmds (which is stable), or
  - Can get not-atomic result for thps (unstable)

However here we're checking exactly for thps.. I think we don't need a
stable PFN but we'll need stable _PAGE_PSE|_PAGE_DEVMAP bits.  It should
not be like that when pmd_read_atomic() was introduced because that is 2012
(in 2012, 26c191788f18129) and AFAIU DEVMAP came in 2016+.  IOW I had a
feeling that pmd_read_atomic() used to be bug-free but after devmap it's
harder to be justified at least, it seems to me.

So far I don't see a clean way to fix it but use atomic64_read() because
pmd_read_atomic() isn't atomic for thp checkings using pmd_trans_huge()
afaict, however the last piece of puzzle is atomic64_read() is failing
somehow on Xen and that's where we come up with the pmd_read_atomic()
trick to read lower then upper 32bits:

commit e4eed03fd06578571c01d4f1478c874bb432c815
Author: Andrea Arcangeli <aarcange@redhat.com>
Date:   Wed Jun 20 12:52:57 2012 -0700

    thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE

    In the x86 32bit PAE CONFIG_TRANSPARENT_HUGEPAGE=y case while holding
    the mmap_sem for reading, cmpxchg8b cannot be used to read pmd contents
    under Xen.

I believe Andrea has a solid reason to do so at that time, but I'm still
not sure why Xen won't work with atomic64_read() since it's not mentioned
in the commit..

Please go ahead with your patchset without being blocked by this, because
AFAICT pmd_read_atomic() is already better than pmde=*pmdp anyway, and it
seems a more general problem even if it existed or I could also have missed
something.

> If so, does that also mean MADV_COLLAPSE is
> safe? I'm not sure. These i386 PAE + THP racing issues were most
> recently discussed when considering if READ_ONCE() should be used
> instead of pmd_read_atomic() [1].
>
> [1] https://lore.kernel.org/linux-mm/594c1f0-d396-5346-1f36-606872cddb18@google.com/
>
> > > +
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > +	/* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> > > +	barrier();
> > > +#endif
> > > +	if (!pmd_present(pmde))
> > > +		return SCAN_PMD_NULL;
> > > +	if (pmd_trans_huge(pmde))
> > > +		return SCAN_PMD_MAPPED;
> >
> > Would it be safer to check pmd_bad()?  I think not all mm pmd paths check
> > that, frankly I don't really know what's the major cause of a bad pmd
> > (either software bugs or corrupted mem), but just to check with you,
> > because potentially a bad pmd can be read as SCAN_SUCCEED and go through.
> >
>
> Likewise, I'm not sure what the cause of "bad pmds" is.
>
> Do you mean to check pmd_bad() instead of pmd_trans_huge()? I.e. b/c a
> pmd-mapped thp counts as "bad" (at least on x86 since PSE set) or do
> you mean to additionally check pmd_bad() after the pmd_trans_huge()
> check?
>
> If it's the former, I'd say we can't claim !pmd_bad() == memory
> already backed by thps / our job here is done.
>
> If it's the latter, I don't see it hurting much (but I can't argue
> intelligently about why it's needed) and can include the check in v6.

The latter.  I don't think it's strongly necessary, but looks good to have
if you also agree.

Thanks,
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index d651f3437367..55392bf30a03 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -11,6 +11,7 @@
 	EM( SCAN_FAIL,			"failed")			\
 	EM( SCAN_SUCCEED,		"succeeded")			\
 	EM( SCAN_PMD_NULL,		"pmd_null")			\
+	EM( SCAN_PMD_MAPPED,		"page_pmd_mapped")		\
 	EM( SCAN_EXCEED_NONE_PTE,	"exceed_none_pte")		\
 	EM( SCAN_EXCEED_SWAP_PTE,	"exceed_swap_pte")		\
 	EM( SCAN_EXCEED_SHARED_PTE,	"exceed_shared_pte")		\
diff --git a/mm/internal.h b/mm/internal.h
index 0667abd57634..51ae9f71a2a3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -172,6 +172,7 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason
 /*
  * in mm/rmap.c:
  */
+pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address);
 extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
 
 /*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index eb444fd45568..2c2ed6b4d96c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -28,6 +28,7 @@ enum scan_result {
 	SCAN_FAIL,
 	SCAN_SUCCEED,
 	SCAN_PMD_NULL,
+	SCAN_PMD_MAPPED,
 	SCAN_EXCEED_NONE_PTE,
 	SCAN_EXCEED_SWAP_PTE,
 	SCAN_EXCEED_SHARED_PTE,
@@ -977,6 +978,29 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	return 0;
 }
 
+static int find_pmd_or_thp_or_none(struct mm_struct *mm,
+				   unsigned long address,
+				   pmd_t **pmd)
+{
+	pmd_t pmde;
+
+	*pmd = mm_find_pmd_raw(mm, address);
+	if (!*pmd)
+		return SCAN_PMD_NULL;
+
+	pmde = pmd_read_atomic(*pmd);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	/* See comments in pmd_none_or_trans_huge_or_clear_bad() */
+	barrier();
+#endif
+	if (!pmd_present(pmde))
+		return SCAN_PMD_NULL;
+	if (pmd_trans_huge(pmde))
+		return SCAN_PMD_MAPPED;
+	return SCAN_SUCCEED;
+}
+
 /*
  * Bring missing pages in from swap, to complete THP collapse.
  * Only done if khugepaged_scan_pmd believes it is worthwhile.
@@ -1228,11 +1252,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
-	pmd = mm_find_pmd(mm, address);
-	if (!pmd) {
-		result = SCAN_PMD_NULL;
+	result = find_pmd_or_thp_or_none(mm, address, &pmd);
+	if (result != SCAN_SUCCEED)
 		goto out;
-	}
 
 	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
diff --git a/mm/rmap.c b/mm/rmap.c
index 94d6b24a1ac2..6980b4011bf8 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -759,13 +759,12 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
 	return vma_address(page, vma);
 }
 
-pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
+pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd = NULL;
-	pmd_t pmde;
 
 	pgd = pgd_offset(mm, address);
 	if (!pgd_present(*pgd))
@@ -780,6 +779,18 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
 		goto out;
 
 	pmd = pmd_offset(pud, address);
+out:
+	return pmd;
+}
+
+pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
+{
+	pmd_t pmde;
+	pmd_t *pmd;
+
+	pmd = mm_find_pmd_raw(mm, address);
+	if (!pmd)
+		goto out;
 	/*
 	 * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
 	 * without holding anon_vma lock for write.  So when looking for a
When scanning an anon pmd to see if it's eligible for collapse, return
SCAN_PMD_MAPPED if the pmd already maps a THP.  Note that
SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
file-collapse path, since the latter might identify pte-mapped compound
pages.  This is required by MADV_COLLAPSE which necessarily needs to
know what hugepage-aligned/sized regions are already pmd-mapped.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 include/trace/events/huge_memory.h |  1 +
 mm/internal.h                      |  1 +
 mm/khugepaged.c                    | 30 ++++++++++++++++++++++++++----
 mm/rmap.c                          | 15 +++++++++++++--
 4 files changed, 41 insertions(+), 6 deletions(-)