Message ID | 20230801124844.278698-2-david@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | smaps / mm/gup: fix gup_can_follow_protnone fallout | expand |
On Tue, Aug 01, 2023 at 02:48:37PM +0200, David Hildenbrand wrote: > @@ -2240,6 +2244,12 @@ static bool is_valid_gup_args(struct page **pages, int *locked, > gup_flags |= FOLL_UNLOCKABLE; > } > > + /* > + * For now, always trigger NUMA hinting faults. Some GUP users like > + * KVM really require it to benefit from autonuma. > + */ > + gup_flags |= FOLL_HONOR_NUMA_FAULT; Since at it, do we want to not set it for FOLL_REMOTE, which still sounds like a good thing to have? Other than that, looks good here. Side note: when I was looking at the flags again just to check the interactions over numa balancing, I found FOLL_NOFAULT and I highly suspect that's not needed, instead it just wants to use follow_page[_mask]() with some proper gup flags passed over.. but that's off topic. Thanks,
On 01.08.23 17:48, Peter Xu wrote: > On Tue, Aug 01, 2023 at 02:48:37PM +0200, David Hildenbrand wrote: >> @@ -2240,6 +2244,12 @@ static bool is_valid_gup_args(struct page **pages, int *locked, >> gup_flags |= FOLL_UNLOCKABLE; >> } >> >> + /* >> + * For now, always trigger NUMA hinting faults. Some GUP users like >> + * KVM really require it to benefit from autonuma. >> + */ >> + gup_flags |= FOLL_HONOR_NUMA_FAULT; > > Since at it, do we want to not set it for FOLL_REMOTE, which still sounds > like a good thing to have? I thought about that, but decided against making that patch here more complicated to eventually rip it again all out in #4. I fully agree that FOLL_REMOTE does not make too much sense, but let's rather keep it simple for this patch. Thanks! > > Other than that, looks good here. > > Side note: when I was looking at the flags again just to check the > interactions over numa balancing, I found FOLL_NOFAULT and I highly suspect > that's not needed, instead it just wants to use follow_page[_mask]() with > some proper gup flags passed over.. but that's off topic. Be prepared for my proposal of removing foll_flags from follow_page() ;) (accompanied by a proper documentation) Especially as we have FOLL_PIN users of FOLL_NOFAULT, follow_page() is a bad fit.
On Tue, Aug 01, 2023 at 06:15:48PM +0200, David Hildenbrand wrote: > On 01.08.23 17:48, Peter Xu wrote: > > On Tue, Aug 01, 2023 at 02:48:37PM +0200, David Hildenbrand wrote: > > > @@ -2240,6 +2244,12 @@ static bool is_valid_gup_args(struct page **pages, int *locked, > > > gup_flags |= FOLL_UNLOCKABLE; > > > } > > > + /* > > > + * For now, always trigger NUMA hinting faults. Some GUP users like > > > + * KVM really require it to benefit from autonuma. > > > + */ > > > + gup_flags |= FOLL_HONOR_NUMA_FAULT; > > > > Since at it, do we want to not set it for FOLL_REMOTE, which still sounds > > like a good thing to have? > > I thought about that, but decided against making that patch here more > complicated to eventually rip it again all out in #4. I thought that was the whole point of having patch 4 separate, because we should assume patch 4 may not exist in (at least) some trees, so I ignored patch 4 when commenting here, and we should not assume it's destined to be removed here. > > I fully agree that FOLL_REMOTE does not make too much sense, but let's > rather keep it simple for this patch. It's fine - I suppose this patch fixes whatever we're aware of that's broken with FOLL_NUMA's removal, so it can even be anything on top when needed. I assume I'm happy to ack this with/without that change, then: Acked-by: Peter Xu <peterx@redhat.com> PS: I still hope that the other oneliner can be squashed here directly; it literally changes exact the same line above so reading this patch alone can be affected. You said there you didn't want the commit message to be too long here, but this is definitely not long at all! I bet you have similar understanding to me on defining "long commit message". :) I'd never worry that. Your call. Thanks,
On 01.08.23 19:04, Peter Xu wrote: > On Tue, Aug 01, 2023 at 06:15:48PM +0200, David Hildenbrand wrote: >> On 01.08.23 17:48, Peter Xu wrote: >>> On Tue, Aug 01, 2023 at 02:48:37PM +0200, David Hildenbrand wrote: >>>> @@ -2240,6 +2244,12 @@ static bool is_valid_gup_args(struct page **pages, int *locked, >>>> gup_flags |= FOLL_UNLOCKABLE; >>>> } >>>> + /* >>>> + * For now, always trigger NUMA hinting faults. Some GUP users like >>>> + * KVM really require it to benefit from autonuma. >>>> + */ >>>> + gup_flags |= FOLL_HONOR_NUMA_FAULT; >>> >>> Since at it, do we want to not set it for FOLL_REMOTE, which still sounds >>> like a good thing to have? >> >> I thought about that, but decided against making that patch here more >> complicated to eventually rip it again all out in #4. > > I thought that was the whole point of having patch 4 separate, because we > should assume patch 4 may not exist in (at least) some trees, so I ignored > patch 4 when commenting here, and we should not assume it's destined to be > removed here. For me, the goal of this patch is to bring it *as close as possible* to the previous state as before, so we can backport it to stable without too many surprises (effectively, only a handful of FOLL_FORCE/ptrace user will get a different behavior). I could add a separate patch that does the FOLL_REMOTE thing, but then, maybe we should do that if patch #4 runs into real trouble :) But no strong opinion if this is what everybody wants in this patch. > >> >> I fully agree that FOLL_REMOTE does not make too much sense, but let's >> rather keep it simple for this patch. > > It's fine - I suppose this patch fixes whatever we're aware of that's > broken with FOLL_NUMA's removal, so it can even be anything on top when > needed. I assume I'm happy to ack this with/without that change, then: > > Acked-by: Peter Xu <peterx@redhat.com> Thanks! > > PS: I still hope that the other oneliner can be squashed here directly; it > literally changes exact the same line above so reading this patch alone can > be affected. You said there you didn't want the commit message to be too > long here, but this is definitely not long at all! I bet you have similar > understanding to me on defining "long commit message". :) I'd never worry > that. Your call. No strong opinion, it just felt cleaner to not start adding what I have in that separate patch commit message in here.
On Tue, Aug 01, 2023 at 02:48:37PM +0200, David Hildenbrand wrote: > Unfortunately commit 474098edac26 ("mm/gup: replace FOLL_NUMA by > gup_can_follow_protnone()") missed that follow_page() and > follow_trans_huge_pmd() never implicitly set FOLL_NUMA because they really > don't want to fail on PROT_NONE-mapped pages -- either due to NUMA hinting > or due to inaccessible (PROT_NONE) VMAs. > > As spelled out in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page > faults from gup/gup_fast"): "Other follow_page callers like KSM should not > use FOLL_NUMA, or they would fail to get the pages if they use follow_page > instead of get_user_pages." > > liubo reported [1] that smaps_rollup results are imprecise, because they > miss accounting of pages that are mapped PROT_NONE. Further, it's easy > to reproduce that KSM no longer works on inaccessible VMAs on x86-64, > because pte_protnone()/pmd_protnone() also indictaes "true" in > inaccessible VMAs, and follow_page() refuses to return such pages right > now. > > As KVM really depends on these NUMA hinting faults, removing the > pte_protnone()/pmd_protnone() handling in GUP code completely is not really > an option. > > To fix the issues at hand, let's revive FOLL_NUMA as FOLL_HONOR_NUMA_FAULT > to restore the original behavior for now and add better comments. > > Set FOLL_HONOR_NUMA_FAULT independent of FOLL_FORCE in > is_valid_gup_args(), to add that flag for all external GUP users. > > Note that there are three GUP-internal __get_user_pages() users that don't > end up calling is_valid_gup_args() and consequently won't get > FOLL_HONOR_NUMA_FAULT set. > > 1) get_dump_page(): we really don't want to handle NUMA hinting > faults. It specifies FOLL_FORCE and wouldn't have honored NUMA > hinting faults already. > 2) populate_vma_page_range(): we really don't want to handle NUMA hinting > faults. It specifies FOLL_FORCE on accessible VMAs, so it wouldn't have > honored NUMA hinting faults already. > 3) faultin_vma_page_range(): we similarly don't want to handle NUMA > hinting faults. > > To make the combination of FOLL_FORCE and FOLL_HONOR_NUMA_FAULT work in > inaccessible VMAs properly, we have to perform VMA accessibility checks in > gup_can_follow_protnone(). > > As GUP-fast should reject such pages either way in > pte_access_permitted()/pmd_access_permitted() -- for example on x86-64 and > arm64 that both implement pte_protnone() -- let's just always fallback > to ordinary GUP when stumbling over pte_protnone()/pmd_protnone(). > > As Linus notes [2], honoring NUMA faults might only make sense for > selected GUP users. > > So we should really see if we can instead let relevant GUP callers specify > it manually, and not trigger NUMA hinting faults from GUP as default. > Prepare for that by making FOLL_HONOR_NUMA_FAULT an external GUP flag > and adding appropriate documenation. > > [1] https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com > [2] https://lore.kernel.org/r/CAHk-=wgRiP_9X0rRdZKT8nhemZGNateMtb366t37d8-x7VRs=g@mail.gmail.com > > Reported-by: liubo <liubo254@huawei.com> > Closes: https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com > Reported-by: Peter Xu <peterx@redhat.com> > Closes: https://lore.kernel.org/all/ZMKJjDaqZ7FW0jfe@x1n/ > Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") > Cc: <stable@vger.kernel.org> > Signed-off-by: David Hildenbrand <david@redhat.com> I agree that FOLL_REMOTE probably needs separate treatment but also agree that it's outside the context of this patch, particularly as a -stable candidate so Acked-by: Mel Gorman <mgorman@techsingularity.net> I've a minor nit below that would be nice to get fixed up, but not mandatory. > --- > include/linux/mm.h | 21 +++++++++++++++------ > include/linux/mm_types.h | 9 +++++++++ > mm/gup.c | 29 +++++++++++++++++++++++------ > mm/huge_memory.c | 2 +- > 4 files changed, 48 insertions(+), 13 deletions(-) > > diff --git a/mm/gup.c b/mm/gup.c > index 2493ffa10f4b..f463d3004ddc 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2240,6 +2244,12 @@ static bool is_valid_gup_args(struct page **pages, int *locked, > gup_flags |= FOLL_UNLOCKABLE; > } > > + /* > + * For now, always trigger NUMA hinting faults. Some GUP users like > + * KVM really require it to benefit from autonuma. > + */ > + gup_flags |= FOLL_HONOR_NUMA_FAULT; > + > /* FOLL_GET and FOLL_PIN are mutually exclusive. */ > if (WARN_ON_ONCE((gup_flags & (FOLL_PIN | FOLL_GET)) == > (FOLL_PIN | FOLL_GET))) Expand on *why* KVM requires it even though I suspect this changes later in the series. Maybe "Some GUP users like KVM require the hint to be as the calling context of GUP is functionally similar to a memory reference from task context"? Also minor nit -- s/autonuma/NUMA Balancing/ or numab. autonuma refers to a specific implementation of automatic balancing that operated similar to khugepaged but not merged. If you grep for it, you'll find that autonuma is only referenced in powerpc-specific code. It's not important these days but very early on, it was very confusing if AutoNUMA was mentioned when NUMAB was intended.
>> Reported-by: liubo <liubo254@huawei.com> >> Closes: https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com >> Reported-by: Peter Xu <peterx@redhat.com> >> Closes: https://lore.kernel.org/all/ZMKJjDaqZ7FW0jfe@x1n/ >> Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") >> Cc: <stable@vger.kernel.org> >> Signed-off-by: David Hildenbrand <david@redhat.com> > > I agree that FOLL_REMOTE probably needs separate treatment but also agree > that it's outside the context of this patch, particularly as a -stable > candidate so > > Acked-by: Mel Gorman <mgorman@techsingularity.net> > > I've a minor nit below that would be nice to get fixed up, but not > mandatory. Thanks Mel for taking a look, so I don't mess up once more :) > >> --- >> include/linux/mm.h | 21 +++++++++++++++------ >> include/linux/mm_types.h | 9 +++++++++ >> mm/gup.c | 29 +++++++++++++++++++++++------ >> mm/huge_memory.c | 2 +- >> 4 files changed, 48 insertions(+), 13 deletions(-) >> >> diff --git a/mm/gup.c b/mm/gup.c >> index 2493ffa10f4b..f463d3004ddc 100644 >> --- a/mm/gup.c >> +++ b/mm/gup.c >> @@ -2240,6 +2244,12 @@ static bool is_valid_gup_args(struct page **pages, int *locked, >> gup_flags |= FOLL_UNLOCKABLE; >> } >> >> + /* >> + * For now, always trigger NUMA hinting faults. Some GUP users like >> + * KVM really require it to benefit from autonuma. >> + */ >> + gup_flags |= FOLL_HONOR_NUMA_FAULT; >> + >> /* FOLL_GET and FOLL_PIN are mutually exclusive. */ >> if (WARN_ON_ONCE((gup_flags & (FOLL_PIN | FOLL_GET)) == >> (FOLL_PIN | FOLL_GET))) > > Expand on *why* KVM requires it even though I suspect this changes later > in the series. Maybe "Some GUP users like KVM require the hint to be as > the calling context of GUP is functionally similar to a memory reference > from task context"? It's raised later in this series but it doesn't hurt to discuss it here in a bit more detail. Sounds good to me. > > Also minor nit -- s/autonuma/NUMA Balancing/ or numab. autonuma refers to > a specific implementation of automatic balancing that operated similar to > khugepaged but not merged. If you grep for it, you'll find that autonuma > is only referenced in powerpc-specific code. It's not important these > days but very early on, it was very confusing if AutoNUMA was mentioned > when NUMAB was intended. Ah, yes, thanks. That's the one of the only place where that terminology accidentally slipped in. I'll wait for more feedback and resend!
diff --git a/include/linux/mm.h b/include/linux/mm.h index 2fbc6c631764..165830a95641 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3455,15 +3455,24 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags) * Indicates whether GUP can follow a PROT_NONE mapped page, or whether * a (NUMA hinting) fault is required. */ -static inline bool gup_can_follow_protnone(unsigned int flags) +static inline bool gup_can_follow_protnone(struct vm_area_struct *vma, + unsigned int flags) { /* - * FOLL_FORCE has to be able to make progress even if the VMA is - * inaccessible. Further, FOLL_FORCE access usually does not represent - * application behaviour and we should avoid triggering NUMA hinting - * faults. + * If callers don't want to honor NUMA hinting faults, no need to + * determine if we would actually have to trigger a NUMA hinting fault. */ - return flags & FOLL_FORCE; + if (!(flags & FOLL_HONOR_NUMA_FAULT)) + return true; + + /* + * NUMA hinting faults don't apply in inaccessible (PROT_NONE) VMAs. + * + * Requiring a fault here even for inaccessible VMAs would mean that + * FOLL_FORCE cannot make any progress, because handle_mm_fault() + * refuses to process NUMA hinting faults in inaccessible VMAs. + */ + return !vma_is_accessible(vma); } typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index da538ff68953..18c8c3d793b0 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1296,6 +1296,15 @@ enum { FOLL_PCI_P2PDMA = 1 << 10, /* allow interrupts from generic signals */ FOLL_INTERRUPTIBLE = 1 << 11, + /* + * Always honor (trigger) NUMA hinting faults. + * + * FOLL_WRITE implicitly honors NUMA hinting faults because a + * PROT_NONE-mapped page is not writable (exceptions with FOLL_FORCE + * apply). get_user_pages_fast_only() always implicitly honors NUMA + * hinting faults. + */ + FOLL_HONOR_NUMA_FAULT = 1 << 12, /* See also internal only FOLL flags in mm/internal.h */ }; diff --git a/mm/gup.c b/mm/gup.c index 2493ffa10f4b..f463d3004ddc 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -597,7 +597,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, pte = ptep_get(ptep); if (!pte_present(pte)) goto no_page; - if (pte_protnone(pte) && !gup_can_follow_protnone(flags)) + if (pte_protnone(pte) && !gup_can_follow_protnone(vma, flags)) goto no_page; page = vm_normal_page(vma, address, pte); @@ -714,7 +714,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, if (likely(!pmd_trans_huge(pmdval))) return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap); - if (pmd_protnone(pmdval) && !gup_can_follow_protnone(flags)) + if (pmd_protnone(pmdval) && !gup_can_follow_protnone(vma, flags)) return no_page_table(vma, flags); ptl = pmd_lock(mm, pmd); @@ -844,6 +844,10 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, if (WARN_ON_ONCE(foll_flags & FOLL_PIN)) return NULL; + /* + * We never set FOLL_HONOR_NUMA_FAULT because callers don't expect + * to fail on PROT_NONE-mapped pages. + */ page = follow_page_mask(vma, address, foll_flags, &ctx); if (ctx.pgmap) put_dev_pagemap(ctx.pgmap); @@ -2240,6 +2244,12 @@ static bool is_valid_gup_args(struct page **pages, int *locked, gup_flags |= FOLL_UNLOCKABLE; } + /* + * For now, always trigger NUMA hinting faults. Some GUP users like + * KVM really require it to benefit from autonuma. + */ + gup_flags |= FOLL_HONOR_NUMA_FAULT; + /* FOLL_GET and FOLL_PIN are mutually exclusive. */ if (WARN_ON_ONCE((gup_flags & (FOLL_PIN | FOLL_GET)) == (FOLL_PIN | FOLL_GET))) @@ -2564,7 +2574,14 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, struct page *page; struct folio *folio; - if (pte_protnone(pte) && !gup_can_follow_protnone(flags)) + /* + * Always fallback to ordinary GUP on PROT_NONE-mapped pages: + * pte_access_permitted() better should reject these pages + * either way: otherwise, GUP-fast might succeed in + * cases where ordinary GUP would fail due to VMA access + * permissions. + */ + if (pte_protnone(pte)) goto pte_unmap; if (!pte_access_permitted(pte, flags & FOLL_WRITE)) @@ -2983,8 +3000,8 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd) || pmd_devmap(pmd))) { - if (pmd_protnone(pmd) && - !gup_can_follow_protnone(flags)) + /* See gup_pte_range() */ + if (pmd_protnone(pmd)) return 0; if (!gup_huge_pmd(pmd, pmdp, addr, next, flags, @@ -3164,7 +3181,7 @@ static int internal_get_user_pages_fast(unsigned long start, if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_FORCE | FOLL_PIN | FOLL_GET | FOLL_FAST_ONLY | FOLL_NOFAULT | - FOLL_PCI_P2PDMA))) + FOLL_PCI_P2PDMA | FOLL_HONOR_NUMA_FAULT))) return -EINVAL; if (gup_flags & FOLL_PIN) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 2e2e8a24cc71..2cd3e5502180 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1468,7 +1468,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, return ERR_PTR(-EFAULT); /* Full NUMA hinting faults to serialise migration in fault paths */ - if (pmd_protnone(*pmd) && !gup_can_follow_protnone(flags)) + if (pmd_protnone(*pmd) && !gup_can_follow_protnone(vma, flags)) return NULL; if (!pmd_write(*pmd) && gup_must_unshare(vma, flags, page))
Unfortunately commit 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") missed that follow_page() and follow_trans_huge_pmd() never implicitly set FOLL_NUMA because they really don't want to fail on PROT_NONE-mapped pages -- either due to NUMA hinting or due to inaccessible (PROT_NONE) VMAs. As spelled out in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page faults from gup/gup_fast"): "Other follow_page callers like KSM should not use FOLL_NUMA, or they would fail to get the pages if they use follow_page instead of get_user_pages." liubo reported [1] that smaps_rollup results are imprecise, because they miss accounting of pages that are mapped PROT_NONE. Further, it's easy to reproduce that KSM no longer works on inaccessible VMAs on x86-64, because pte_protnone()/pmd_protnone() also indictaes "true" in inaccessible VMAs, and follow_page() refuses to return such pages right now. As KVM really depends on these NUMA hinting faults, removing the pte_protnone()/pmd_protnone() handling in GUP code completely is not really an option. To fix the issues at hand, let's revive FOLL_NUMA as FOLL_HONOR_NUMA_FAULT to restore the original behavior for now and add better comments. Set FOLL_HONOR_NUMA_FAULT independent of FOLL_FORCE in is_valid_gup_args(), to add that flag for all external GUP users. Note that there are three GUP-internal __get_user_pages() users that don't end up calling is_valid_gup_args() and consequently won't get FOLL_HONOR_NUMA_FAULT set. 1) get_dump_page(): we really don't want to handle NUMA hinting faults. It specifies FOLL_FORCE and wouldn't have honored NUMA hinting faults already. 2) populate_vma_page_range(): we really don't want to handle NUMA hinting faults. It specifies FOLL_FORCE on accessible VMAs, so it wouldn't have honored NUMA hinting faults already. 3) faultin_vma_page_range(): we similarly don't want to handle NUMA hinting faults. To make the combination of FOLL_FORCE and FOLL_HONOR_NUMA_FAULT work in inaccessible VMAs properly, we have to perform VMA accessibility checks in gup_can_follow_protnone(). As GUP-fast should reject such pages either way in pte_access_permitted()/pmd_access_permitted() -- for example on x86-64 and arm64 that both implement pte_protnone() -- let's just always fallback to ordinary GUP when stumbling over pte_protnone()/pmd_protnone(). As Linus notes [2], honoring NUMA faults might only make sense for selected GUP users. So we should really see if we can instead let relevant GUP callers specify it manually, and not trigger NUMA hinting faults from GUP as default. Prepare for that by making FOLL_HONOR_NUMA_FAULT an external GUP flag and adding appropriate documenation. [1] https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com [2] https://lore.kernel.org/r/CAHk-=wgRiP_9X0rRdZKT8nhemZGNateMtb366t37d8-x7VRs=g@mail.gmail.com Reported-by: liubo <liubo254@huawei.com> Closes: https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com Reported-by: Peter Xu <peterx@redhat.com> Closes: https://lore.kernel.org/all/ZMKJjDaqZ7FW0jfe@x1n/ Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") Cc: <stable@vger.kernel.org> Signed-off-by: David Hildenbrand <david@redhat.com> --- include/linux/mm.h | 21 +++++++++++++++------ include/linux/mm_types.h | 9 +++++++++ mm/gup.c | 29 +++++++++++++++++++++++------ mm/huge_memory.c | 2 +- 4 files changed, 48 insertions(+), 13 deletions(-)