| Message ID | 20230727212845.135673-1-david@redhat.com (mailing list archive) |
|---|---|
| Series | smaps / mm/gup: fix gup_can_follow_protnone fallout |
On Thu, 27 Jul 2023 at 14:28, David Hildenbrand <david@redhat.com> wrote:
>
> This is my proposal on how to handle the fallout of 474098edac26
> ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") where I
> accidentally missed that follow_page() and smaps implicitly kept the
> FOLL_NUMA flag clear by *not* setting it if FOLL_FORCE is absent, to
> not trigger faults on PROT_NONE-mapped PTEs.

Ugh.

I hate how it uses FOLL_FORCE that is inherently scary.

Why do we have that "gup_can_follow_protnone()" logic AT ALL?

Couldn't we just get rid of that disgusting thing, and just say that GUP
(and follow_page()) always just ignores NUMA hinting, and always just
follows protnone?

We literally used to have this:

        if (!(gup_flags & FOLL_FORCE))
                gup_flags |= FOLL_NUMA;

ie we *always* set FOLL_NUMA for any sane situation. FOLL_FORCE should
be the rare crazy case.

The original reason for not setting FOLL_NUMA all the time is documented
in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page faults from
gup/gup_fast") from way back in 2012:

 * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
 * would be called on PROT_NONE ranges. We must never invoke
 * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
 * page faults would unprotect the PROT_NONE ranges if
 * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
 * bitflag. So to avoid that, don't set FOLL_NUMA if
 * FOLL_FORCE is set.

but I don't think the original reason for this is *true* any more.

Because then two years later in 2014, in commit c46a7c817e66 ("x86:
define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
Mel made the code able to distinguish between PROT_NONE and NUMA pages,
and he changed the comment above too.

But I get the very very strong feeling that instead of changing the
comment, he should have actually removed the comment and the code.

So I get the strong feeling that all these FOLL_NUMA games should just
go away.
You removed the FOLL_NUMA bit, but you replaced it with using FOLL_FORCE.

So rather than make this all even more opaque and make it even harder to
figure out why we have that code in the first place, I think it should
all just be removed.

The original reason for FOLL_NUMA simply does not exist any more. We
know exactly when a page is marked for NUMA faulting, and we should
simply *ignore* it for GUP and follow_page().

I think we should treat a NUMA-faulting page as just being present (and
not NUMA-fault it).

Am I missing something?

              Linus
On 28.07.23 18:18, Linus Torvalds wrote:
> On Thu, 27 Jul 2023 at 14:28, David Hildenbrand <david@redhat.com> wrote:
>>
>> This is my proposal on how to handle the fallout of 474098edac26
>> ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") where I
>> accidentally missed that follow_page() and smaps implicitly kept the
>> FOLL_NUMA flag clear by *not* setting it if FOLL_FORCE is absent, to
>> not trigger faults on PROT_NONE-mapped PTEs.
>
> Ugh.

I was hoping for that reaction, with the assumption that we would get
something cleaner :)

> I hate how it uses FOLL_FORCE that is inherently scary.

I hate FOLL_FORCE, but I hate FOLL_NUMA even more, because to me it is
FOLL_FORCE in disguise (currently and before 474098edac26, if FOLL_FORCE
is set, FOLL_NUMA won't be set, and the other way around).

> Why do we have that "gup_can_follow_protnone()" logic AT ALL?

That's what I was hoping for.

> Couldn't we just get rid of that disgusting thing, and just say that
> GUP (and follow_page()) always just ignores NUMA hinting, and always
> just follows protnone?
>
> We literally used to have this:
>
>         if (!(gup_flags & FOLL_FORCE))
>                 gup_flags |= FOLL_NUMA;
>
> ie we *always* set FOLL_NUMA for any sane situation. FOLL_FORCE should
> be the rare crazy case.

Yes, but my point would be that we now spell that "rare crazy case" out
for follow_page().

If you're talking about patch #1, I agree; hence patch #3, to avoid all
that nasty FOLL_FORCE handling in GUP callers.

But yeah, if we can avoid all that, great.

> The original reason for not setting FOLL_NUMA all the time is
> documented in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
> page faults from gup/gup_fast") from way back in 2012:
>
>  * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
>  * would be called on PROT_NONE ranges. We must never invoke
>  * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
>  * page faults would unprotect the PROT_NONE ranges if
>  * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
>  * bitflag. So to avoid that, don't set FOLL_NUMA if
>  * FOLL_FORCE is set.

In handle_mm_fault(), we never call do_numa_page() if
!vma_is_accessible(). Same for do_huge_pmd_numa_page().

So, if we would ever end up triggering a page fault on
mprotect(PROT_NONE) ranges (i.e., via FOLL_FORCE), we would simply do
nothing.

At least that's the hope; I'll take a closer look just to make sure
we're good on all call paths.

> but I don't think the original reason for this is *true* any more.
>
> Because then two years later in 2014, in commit c46a7c817e66 ("x86:
> define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
> Mel made the code able to distinguish between PROT_NONE and NUMA
> pages, and he changed the comment above too.

CCing Mel.

I remember that pte_protnone() can only distinguish between NUMA vs.
actual mprotect(PROT_NONE) by looking at the VMA -- vma_is_accessible().

Indeed, include/linux/pgtable.h:

/*
 * Technically a PTE can be PROTNONE even when not doing NUMA balancing but
 * the only case the kernel cares is for NUMA balancing and is only ever set
 * when the VMA is accessible. For PROT_NONE VMAs, the PTEs are not marked
 * _PAGE_PROTNONE so by default, implement the helper as "always no". It
 * is the responsibility of the caller to distinguish between PROT_NONE
 * protections and NUMA hinting fault protections.
 */

> But I get the very very strong feeling that instead of changing the
> comment, he should have actually removed the comment and the code.
>
> So I get the strong feeling that all these FOLL_NUMA games should just
> go away. You removed the FOLL_NUMA bit, but you replaced it with using
> FOLL_FORCE.
>
> So rather than make this all even more opaque and make it even harder
> to figure out why we have that code in the first place, I think it
> should all just be removed.

At least to me, spelling out FOLL_FORCE in follow_page() is now much
less opaque than having it implied by the lack of FOLL_NUMA.

> The original reason for FOLL_NUMA simply does not exist any more. We
> know exactly when a page is marked for NUMA faulting, and we should
> simply *ignore* it for GUP and follow_page().
>
> I think we should treat a NUMA-faulting page as just being present
> (and not NUMA-fault it).
>
> Am I missing something?

There was the case for "FOLL_PIN represents application behavior and
should trigger NUMA faults", but I guess that can be ignored.

But it would be much better to just remove all that if we can.

Let me look into some details. Thanks Linus!
[...]

> There was the case for "FOLL_PIN represents application behavior and
> should trigger NUMA faults", but I guess that can be ignored.
>
> But it would be much better to just remove all that if we can.
>
> Let me look into some details.

I just stumbled over the comment from Mel in follow_trans_huge_pmd():

	/* Full NUMA hinting faults to serialise migration in fault paths */

It dates back to

commit 2b4847e73004c10ae6666c2e27b5c5430aed8698
Author: Mel Gorman <mgorman@suse.de>
Date:   Wed Dec 18 17:08:32 2013 -0800

    mm: numa: serialise parallel get_user_page against THP migration

    Base pages are unmapped and flushed from cache and TLB during normal
    page migration and replaced with a migration entry that causes any
    parallel NUMA hinting fault or gup to block until migration completes.

    THP does not unmap pages due to a lack of support for migration
    entries at a PMD level. This allows races with get_user_pages and
    get_user_pages_fast which commit 3f926ab945b6 ("mm: Close races
    between THP migration and PMD numa clearing") made worse by
    introducing a pmd_clear_flush().

    This patch forces get_user_page (fast and normal) on a pmd_numa page to
    go through the slow get_user_page path where it will serialise against
    THP migration and properly account for the NUMA hinting fault. On the
    migration side the page table lock is taken for each PTE update.

We nowadays do have migration entries at the PMD level -- and setting
FOLL_FORCE could similarly trigger such a race. So I suspect we're good.
Hi, Linus,

On Fri, Jul 28, 2023 at 09:18:45AM -0700, Linus Torvalds wrote:
> The original reason for FOLL_NUMA simply does not exist any more. We
> know exactly when a page is marked for NUMA faulting, and we should
> simply *ignore* it for GUP and follow_page().
>
> I think we should treat a NUMA-faulting page as just being present
> (and not NUMA-fault it).

But then does it mean that any GUP-only user will have NUMA balancing
completely disabled? As long as the page will only be accessed by GUP,
NUMA balancing will never trigger anyway.. I think KVM is manipulating
guest pages just like that. Not sure whether it means it'll void the
whole NUMA effort there.

If we allow that GUP to happen (taking protnone as present), I assume
it'll also stop any further NUMA balancing on this very page, because
even if some page fault handler triggered on this protnone page later
(one that is not GUP anymore), when it wants to migrate the page to the
other NUMA node it'll see someone is already holding a reference on it,
and then we should give up the balancing.

So to me FOLL_NUMA (or any identifier like it, just to describe the
caller's need; some may really just want to fetch the pfn/page) still
makes sense. But maybe I totally misunderstood the above..

Thanks,
On 28.07.23 19:30, David Hildenbrand wrote:
> On 28.07.23 18:18, Linus Torvalds wrote:
>> On Thu, 27 Jul 2023 at 14:28, David Hildenbrand <david@redhat.com> wrote:
>>>
>>> This is my proposal on how to handle the fallout of 474098edac26
>>> ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") where I
>>> accidentally missed that follow_page() and smaps implicitly kept the
>>> FOLL_NUMA flag clear by *not* setting it if FOLL_FORCE is absent, to
>>> not trigger faults on PROT_NONE-mapped PTEs.
>>
>> Ugh.
>
> I was hoping for that reaction, with the assumption that we would get
> something cleaner :)
>
>> I hate how it uses FOLL_FORCE that is inherently scary.
>
> I hate FOLL_FORCE, but I hate FOLL_NUMA even more, because to me it
> is FOLL_FORCE in disguise (currently and before 474098edac26, if
> FOLL_FORCE is set, FOLL_NUMA won't be set and the other way around).
>
>> Why do we have that "gup_can_follow_protnone()" logic AT ALL?
>
> That's what I was hoping for.
>
>> Couldn't we just get rid of that disgusting thing, and just say that
>> GUP (and follow_page()) always just ignores NUMA hinting, and always
>> just follows protnone?
>>
>> We literally used to have this:
>>
>>         if (!(gup_flags & FOLL_FORCE))
>>                 gup_flags |= FOLL_NUMA;
>>
>> ie we *always* set FOLL_NUMA for any sane situation. FOLL_FORCE should
>> be the rare crazy case.
>
> Yes, but my point would be that we now spell that "rare crazy case"
> out for follow_page().
>
> If you're talking about patch #1, I agree; hence patch #3, to avoid
> all that nasty FOLL_FORCE handling in GUP callers.
>
> But yeah, if we can avoid all that, great.
>
>> The original reason for not setting FOLL_NUMA all the time is
>> documented in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
>> page faults from gup/gup_fast") from way back in 2012:
>>
>>  * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
>>  * would be called on PROT_NONE ranges. We must never invoke
>>  * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
>>  * page faults would unprotect the PROT_NONE ranges if
>>  * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
>>  * bitflag. So to avoid that, don't set FOLL_NUMA if
>>  * FOLL_FORCE is set.
>
> In handle_mm_fault(), we never call do_numa_page() if
> !vma_is_accessible(). Same for do_huge_pmd_numa_page().
>
> So, if we would ever end up triggering a page fault on
> mprotect(PROT_NONE) ranges (i.e., via FOLL_FORCE), we would simply
> do nothing.
>
> At least that's the hope; I'll take a closer look just to make sure
> we're good on all call paths.
>
>> but I don't think the original reason for this is *true* any more.
>>
>> Because then two years later in 2014, in commit c46a7c817e66 ("x86:
>> define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
>> Mel made the code able to distinguish between PROT_NONE and NUMA
>> pages, and he changed the comment above too.
>
> CCing Mel.
>
> I remember that pte_protnone() can only distinguish between NUMA vs.
> actual mprotect(PROT_NONE) by looking at the VMA --
> vma_is_accessible().
>
> Indeed, include/linux/pgtable.h:
>
> /*
>  * Technically a PTE can be PROTNONE even when not doing NUMA balancing but
>  * the only case the kernel cares is for NUMA balancing and is only ever set
>  * when the VMA is accessible. For PROT_NONE VMAs, the PTEs are not marked
>  * _PAGE_PROTNONE so by default, implement the helper as "always no". It
>  * is the responsibility of the caller to distinguish between PROT_NONE
>  * protections and NUMA hinting fault protections.
>  */
>
>> But I get the very very strong feeling that instead of changing the
>> comment, he should have actually removed the comment and the code.
>>
>> So I get the strong feeling that all these FOLL_NUMA games should just
>> go away. You removed the FOLL_NUMA bit, but you replaced it with using
>> FOLL_FORCE.
>>
>> So rather than make this all even more opaque and make it even harder
>> to figure out why we have that code in the first place, I think it
>> should all just be removed.
>
> At least to me, spelling out FOLL_FORCE in follow_page() is now much
> less opaque than having it implied by the lack of FOLL_NUMA.
>
>> The original reason for FOLL_NUMA simply does not exist any more. We
>> know exactly when a page is marked for NUMA faulting, and we should
>> simply *ignore* it for GUP and follow_page().
>>
>> I think we should treat a NUMA-faulting page as just being present
>> (and not NUMA-fault it).
>>
>> Am I missing something?
>
> There was the case for "FOLL_PIN represents application behavior and
> should trigger NUMA faults", but I guess that can be ignored.

Re-reading commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page
faults from gup/gup_fast"), it actually does spell out an important case
that we should handle:

"KVM secondary MMU page faults will trigger the NUMA hinting page faults
through gup_fast -> get_user_pages -> follow_page -> handle_mm_fault."

That is still the case today (and important). Not triggering NUMA
hinting faults would degrade KVM.

Hmm. So three alternatives I see:

1) Use FOLL_FORCE in follow_page() to unconditionally disable protnone
   checks. Alternatively, have an internal FOLL_NO_PROTNONE flag if we
   don't like that.

2) Revert the commit and reintroduce unconditional FOLL_NUMA without
   FOLL_FORCE.

3) Have a FOLL_NUMA that callers like KVM can pass.

Thoughts?
On Fri, Jul 28, 2023 at 09:40:54PM +0200, David Hildenbrand wrote:
> Hmm. So three alternatives I see:
>
> 1) Use FOLL_FORCE in follow_page() to unconditionally disable protnone
>    checks. Alternatively, have an internal FOLL_NO_PROTNONE flag if we
>    don't like that.
>
> 2) Revert the commit and reintroduce unconditional FOLL_NUMA without
>    FOLL_FORCE.
>
> 3) Have a FOLL_NUMA that callers like KVM can pass.

I'm afraid 3) means changing NUMA balancing to opt-in, probably a no-go
for any non-KVM GUP users, as that could start to break there, even if
it makes smaps/follow_page happy again.

I keep worrying about 1) and FOLL_FORCE abuse.

So I keep my vote on 2).

Thanks,
On 28.07.23 21:39, Peter Xu wrote:
> Hi, Linus,
>
> On Fri, Jul 28, 2023 at 09:18:45AM -0700, Linus Torvalds wrote:
>> The original reason for FOLL_NUMA simply does not exist any more. We
>> know exactly when a page is marked for NUMA faulting, and we should
>> simply *ignore* it for GUP and follow_page().
>>
>> I think we should treat a NUMA-faulting page as just being present
>> (and not NUMA-fault it).
>
> But then does it mean that any GUP-only user will have NUMA balancing
> completely disabled? As long as the page will only be accessed by GUP,
> NUMA balancing will never trigger anyway.. I think KVM is manipulating
> guest pages just like that. Not sure whether it means it'll void the
> whole NUMA effort there.
>
> If we allow that GUP to happen (taking protnone as present), I assume
> it'll also stop any further NUMA balancing on this very page, because
> even if some page fault handler triggered on this protnone page later
> (one that is not GUP anymore), when it wants to migrate the page to
> the other NUMA node it'll see someone is already holding a reference
> on it, and then we should give up the balancing.
>
> So to me FOLL_NUMA (or any identifier like it, just to describe the
> caller's need; some may really just want to fetch the pfn/page) still
> makes sense. But maybe I totally misunderstood the above..

Yes, I agree; it took me a bit longer to realize (being a KVM developer
:) ... I'm really ready for the weekend).

So if this series is not acceptable, then better revert that commit --
or let callers like KVM specify FOLL_NUMA.
On 28.07.23 21:50, Peter Xu wrote:
> On Fri, Jul 28, 2023 at 09:40:54PM +0200, David Hildenbrand wrote:
>> Hmm. So three alternatives I see:
>>
>> 1) Use FOLL_FORCE in follow_page() to unconditionally disable protnone
>>    checks. Alternatively, have an internal FOLL_NO_PROTNONE flag if we
>>    don't like that.
>>
>> 2) Revert the commit and reintroduce unconditional FOLL_NUMA without
>>    FOLL_FORCE.
>>
>> 3) Have a FOLL_NUMA that callers like KVM can pass.
>
> I'm afraid 3) means changing numa balancing to opt-in, probably no-go
> for any non-kvm gup users as that could start to break there, even if
> making smaps/follow_page happy again.
>
> I keep worrying 1) on FOLL_FORCE abuse.
>
> So I keep my vote on 2).

Linus had a point that we can nowadays always set FOLL_NUMA, even with
FOLL_FORCE.

... so maybe we want to let GUP always set FOLL_NUMA (even with
FOLL_FORCE) and follow_page() never set FOLL_NUMA. That would at least
decouple FOLL_NUMA from FOLL_FORCE.
On Fri, 28 Jul 2023 at 12:39, Peter Xu <peterx@redhat.com> wrote:
>
> But then does it mean that any GUP-only user will have NUMA balancing
> completely disabled?

Why would we ever care about a GUP-only user?

Who knows where the actual access is coming from? It might be some
device that is on a different node entirely.

And even if the access is local from the CPU, it

 (a) might have happened after we moved somewhere else

 (b) who cares about the extra possible NUMA overhead when we just
     wasted *thousands* of cycles on GUP?

So NUMA balancing really doesn't seem to make sense for GUP anyway as
far as I can see.

Now, the other side of the same thing is that (a) NUMA faulting should
be fairly rare and (b) once you do GUP, who cares anyway, so you can
also argue that "once you do GUP you might as well NUMA-fault, because
performance simply isn't an issue".

But I really think the real argument is "once you do GUP, NUMA faulting
is just crazy".

I think what happened is

 - the GUP code couldn't tell NUMA and actual PROTNONE apart

 - so the GUP code would punch through PROTNONE even when it shouldn't

 - so people added FOLL_NUMA to say "I don't want you to punch through,
   I want the NUMA fault"

 - but then FOLL_FORCE ends up meaning that you actually *do* want to
   punch through - regardless of NUMA or not

 - and now the two got tied together, and we end up with nonsensical
   garbage like

        if (!(gup_flags & FOLL_FORCE))
                gup_flags |= FOLL_NUMA;

   to say "oh, actually, to avoid punching through when we shouldn't,
   we should NUMA fault".

So we ended up with that case where even if YOU DIDN'T CARE AT ALL, you
got FOLL_NUMA just so that you wouldn't punch through.

And now we're in the situation that we've confused FOLL_FORCE and
FOLL_NUMA, even though they have absolutely *nothing* to do with each
other, except for a random implementation detail about punching through
incorrectly that isn't even relevant any more.

I really think FOLL_NUMA should just go away. And that FOLL_FORCE
replacement for it is just wrong. If you *don't* do something without
FOLL_FORCE, you damn well shouldn't do it just because FOLL_FORCE is
set.

The *only* semantic meaning FOLL_FORCE should have is that it overrides
the vma protections for debuggers (in a very limited manner). It should
*not* affect any NUMA faulting logic in any way, shape, or form.

              Linus
On 28.07.23 22:23, Linus Torvalds wrote:
> On Fri, 28 Jul 2023 at 12:39, Peter Xu <peterx@redhat.com> wrote:
>>
>> But then does it mean that any GUP-only user will have NUMA balancing
>> completely disabled?
>
> Why would we ever care about a GUP-only user?
>
> Who knows where the actual access is coming from? It might be some
> device that is on a different node entirely.
>
> And even if the access is local from the CPU, it
>
>  (a) might have happened after we moved somewhere else
>
>  (b) who cares about the extra possible NUMA overhead when we just
>      wasted *thousands* of cycles on GUP?
>
> So NUMA balancing really doesn't seem to make sense for GUP anyway as
> far as I can see.

I do agree regarding many GUP users. But at least for KVM (and probably
some others, but most probably KVM is the most important GUP user) it
does make sense, and we have to find a way to keep that working.

At least, removing it creates significantly more harm than having it,
guaranteed :)

So would you rather favor a FOLL_NUMA that has to be passed from the
outside by selected callers, or a FOLL_NUMA that is set on the GUP path
unconditionally (but left clear for follow_page())?
On Fri, 28 Jul 2023 at 13:33, David Hildenbrand <david@redhat.com> wrote:
>
> So would you rather favor a FOLL_NUMA that has to be passed from the
> outside by selected callers or a FOLL_NUMA that is set on the GUP path
> unconditionally (but left clear for follow_page())?

I'd rather see the FOLL_NUMA that has to be set by odd cases, and that
is never set by any sane user.

And it should not be called FOLL_NUMA. It should be called something
else. Because *not* having it doesn't disable following pages across
NUMA boundaries, and the name is actively misleading.

It sounds like what KVM actually wants is a "Do NOT follow NUMA pages,
I'll force a page fault".

And the fact that KVM wants a fault for NUMA pages shouldn't mean that
others - who clearly cannot care - get that insane behavior by default.

The name should reflect that, instead of being the misleading mess of
FOLL_FORCE and bad naming that it is now.

So maybe it can be called "FOLL_HONOR_NUMA_FAULT" or something, to make
it clear that it's the *opposite* of FOLL_FORCE, and that it honors the
NUMA faulting that nobody should care about.

Then the KVM code can have a big comment about *why* it sets that bit.

Hmm? Can we please aim for something that is understandable and
documented? No odd implicit rules. No "force NUMA fault even when it
makes no sense". No tie-in with FOLL_FORCE.

              Linus
On 28.07.23 22:50, Linus Torvalds wrote:
> On Fri, 28 Jul 2023 at 13:33, David Hildenbrand <david@redhat.com> wrote:
>>
>> So would you rather favor a FOLL_NUMA that has to be passed from the
>> outside by selected callers or a FOLL_NUMA that is set on the GUP path
>> unconditionally (but left clear for follow_page())?
>
> I'd rather see the FOLL_NUMA that has to be set by odd cases, and that
> is never set by any sane user.

Thanks!

> And it should not be called FOLL_NUMA. It should be called something
> else. Because *not* having it doesn't disable following pages across
> NUMA boundaries, and the name is actively misleading.
>
> It sounds like what KVM actually wants is a "Do NOT follow NUMA pages,
> I'll force a page fault".
>
> And the fact that KVM wants a fault for NUMA pages shouldn't mean that
> others - who clearly cannot care - get that insane behavior by
> default.

For KVM it represents an actual CPU access. To map these pages into the
VM MMU we have to look them up from the process -- in the context of the
faulting CPU. So it makes a lot of sense for KVM (which is also where
autonuma gets heavily used).

> The name should reflect that, instead of being the misleading mess of
> FOLL_FORCE and bad naming that it is now.
>
> So maybe it can be called "FOLL_HONOR_NUMA_FAULT" or something, to
> make it clear that it's the *opposite* of FOLL_FORCE, and that it
> honors the NUMA faulting that nobody should care about.

That naming sounds much better to me.

> Then the KVM code can have a big comment about *why* it sets that bit.

Yes.

> Hmm? Can we please aim for something that is understandable and
> documented? No odd implicit rules. No "force NUMA fault even when it
> makes no sense". No tie-in with FOLL_FORCE.

I mean, I messed all that FOLL_NUMA handling up because I was very
confused. So I'm all for better documentation.

Can we get a simple revert in first (without that FOLL_FORCE special
casing and ideally with a better name) to handle stable backports, and
I'll follow up with more documentation and letting GUP callers pass in
that flag instead?

That would help a lot. Then we also have more time to let that "move it
to GUP callers" approach mature a bit in -next, to see if we find any
surprises.
On Fri, Jul 28, 2023 at 11:02:46PM +0200, David Hildenbrand wrote:
> Can we get a simple revert in first (without that FOLL_FORCE special
> casing and ideally with a better name) to handle stable backports, and
> I'll follow-up with more documentation and letting GUP callers pass in
> that flag instead?
>
> That would help a lot. Then we also have more time to let that "move
> it to GUP callers" mature a bit in -next, to see if we find any
> surprises?

As I raised my concern in the other thread, I still worry that NUMA
users can be affected by this change. After all, NUMA isn't so uncommon
to me; at least Fedora / RHEL have
CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y, and I highly suspect that's
also true of other major distros. Meanwhile all kernel modules use GUP..

I'd say we can go ahead and try if we want, but I really don't know how
it helps in any form to move it to the callers.. with the risk of
breaking someone.

Logically it should also always be better to migrate earlier rather than
later, not only because the page will be local earlier, but also
because, as I discussed in the other thread, GUP can hold a reference to
the page, and that could potentially stop NUMA balancing from succeeding
later.

Thanks,
On 28.07.23 23:20, Peter Xu wrote:
> On Fri, Jul 28, 2023 at 11:02:46PM +0200, David Hildenbrand wrote:
>> Can we get a simple revert in first (without that FOLL_FORCE special
>> casing and ideally with a better name) to handle stable backports,
>> and I'll follow-up with more documentation and letting GUP callers
>> pass in that flag instead?
>>
>> That would help a lot. Then we also have more time to let that "move
>> it to GUP callers" mature a bit in -next, to see if we find any
>> surprises?
>
> As I raised my concern over the other thread, I still worry numa users
> can be affected by this change. After all, numa isn't so uncommon to
> me, at least fedora / rhel has CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y.
> I highly suspect that's also true of major distros. Meanwhile all
> kernel modules use gup..
>
> I'd say we can go ahead and try if we want, but I really don't know
> why that helps in any form to move it to the callers.. with the risk
> of breaking someone.

Indeed, that's why I suggest being a bit careful, especially with
stable.

> Logically it should also be always better to migrate earlier than
> later, not only because the page will be local earlier, but also as I
> discussed in the other thread (that the gup can hold a ref to the
> page, and it could potentially stop numa balancing to succeed later).

I get your point, but I also see the following cases (QEMU/KVM as an
example):

* User space triggers O_DIRECT. It will be short-lived. But is it really
  an access from that CPU (NUMA node) to that page? At least for KVM,
  you'd much rather let KVM trigger the NUMA fault on actual memory
  access from a guest VCPU, not from a QEMU iothread when pinning the
  page.

* vfio triggers FOLL_PIN|FOLL_LONGTERM from a random QEMU thread.
  Where should we migrate that page to? Would it actually be counter-
  productive to migrate it to the NUMA node of the setup thread? The
  longterm pin will turn the page unmovable, yes, but where to migrate
  it to?
On 7/28/23 14:20, Peter Xu wrote:
> On Fri, Jul 28, 2023 at 11:02:46PM +0200, David Hildenbrand wrote:
>> Can we get a simple revert in first (without that FOLL_FORCE special
>> casing and ideally with a better name) to handle stable backports,
>> and I'll follow-up with more documentation and letting GUP callers
>> pass in that flag instead?
>>
>> That would help a lot. Then we also have more time to let that "move
>> it to GUP callers" mature a bit in -next, to see if we find any
>> surprises?
>
> As I raised my concern over the other thread, I still worry numa users
> can be affected by this change. After all, numa isn't so uncommon to
> me, at least fedora / rhel as CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y.
> I highly suspect that's also true to major distros. Meanwhile all
> kernel modules use gup..
>
> I'd say we can go ahead and try if we want, but I really don't know
> why that helps in any form to move it to the callers.. with the risk
> of breaking someone.

It's worth the trouble, in order to clear up this historical mess. It's
helping *future* callers of the API, and future maintenance efforts.
Yes, there is some risk, but it seems very manageable.

The story of how FOLL_NUMA and FOLL_FORCE became entangled was
enlightening, by the way, and now that I've read it I don't want to go
back. :)

thanks,
Hello, John,

On Fri, Jul 28, 2023 at 02:32:12PM -0700, John Hubbard wrote:
> On 7/28/23 14:20, Peter Xu wrote:
>> I'd say we can go ahead and try if we want, but I really don't know
>> why that helps in any form to move it to the callers.. with the risk
>> of breaking someone.
>
> It's worth the trouble, in order to clear up this historical mess.
> It's helping *future* callers of the API, and future maintenance
> efforts. Yes there is some risk, but it seems very manageable.
>
> The story of how FOLL_NUMA and FOLL_FORCE became entangled was
> enlightening, by the way, and now that I've read it I don't want to go
> back. :)

Yeah, I fully agree we should remove the NUMA / FORCE entangling..
even if we want to revert back to the FOLL_NUMA flag, we may want to not
revive that specific part. I had a feeling that we're all on the same
page there.

It's more about the further step of making FOLL_NUMA opt-in for GUP.

Thanks,
On 7/28/23 14:49, Peter Xu wrote: >> The story of how FOLL_NUMA and FOLL_FORCE became entangled was enlightening, >> by the way, and now that I've read it I don't want to go back. :) > > Yeah I fully agree we should hopefully remove the NUMA / FORCE > tangling.. even if we want to revert back to the FOLL_NUMA flag we may want > to not revive that specific part. I had a feeling that we're all on the > same page there. > Yes, I think so. :) > It's more about the further step to make FOLL_NUMA opt-in for GUP. Let's say "FOLL_HONOR_NUMA_FAULT" for this next discussion, but yes. So given that our API allows passing in FOLL_ flags, I don't understand the objection to letting different callers pass in, or not pass in, that flag. It's the perfect way to clean up the whole thing. As Linus suggested slightly earlier here, there can be a comment at the call site, explaining why KVM needs FOLL_HONOR_NUMA_FAULT, and you're good, right? thanks,
On Fri, Jul 28, 2023 at 11:31:49PM +0200, David Hildenbrand wrote:
> * vfio triggers FOLL_PIN|FOLL_LONGTERM from a random QEMU thread.
>   Where should we migrate that page to? Would it actually be counter-
>   productive to migrate it to the NUMA node of the setup thread? The
>   longterm pin will turn the page unmovable, yes, but where to migrate
>   it to?

For VFIO & KVM you actively don't get any kind of numa balancing or
awareness. In this case qemu should probably strive to put the memory
on the numa node of the majority of CPUs early on, because it doesn't
get another shot at it.

In other cases it depends quite a lot. E.g., DPDK might want its VFIO
buffers to be NUMA'd to the node that is close to the device, not the
CPU. Or vice versa. There is a lot of micro sensitivity here at high
data rates. I think people today manually tune this by deliberately
allocating the memory to specific numas, and then GUP should just leave
it alone.

FWIW, I'm reading this thread and I have no idea what the special
semantic is that KVM needs from GUP, so I'm all for better
documentation on the GUP flag :)

Jason
On 29.07.23 00:35, David Hildenbrand wrote:
>> The original reason for not setting FOLL_NUMA all the time is
>> documented in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
>> page faults from gup/gup_fast") from way back in 2012:
>>
>>   * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
>>   * would be called on PROT_NONE ranges. We must never invoke
>>   * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
>>   * page faults would unprotect the PROT_NONE ranges if
>>   * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
>>   * bitflag. So to avoid that, don't set FOLL_NUMA if
>>   * FOLL_FORCE is set.
>>
>> but I don't think the original reason for this is *true* any more.
>>
>> Because then two years later in 2014, in commit c46a7c817e66 ("x86:
>> define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
>> Mel made the code able to distinguish between PROT_NONE and NUMA
>> pages, and he changed the comment above too.

Sleeping over it and looking into some nasty details, I realized the
following things:

(1) The pte_protnone() comment in include/linux/pgtable.h is either
wrong or misleading.

Especially the "For PROT_NONE VMAs, the PTEs are not marked
_PAGE_PROTNONE" is *wrong* nowadays on x86.

Doing an mprotect(PROT_NONE) will also result in pte_protnone()
succeeding, because the pages *are* marked _PAGE_PROTNONE.

The comment should be something like this

/*
 * In an inaccessible (PROT_NONE) VMA, pte_protnone() *may* indicate
 * "yes". It is perfectly valid to indicate "no" in that case,
 * which is why our default implementation defaults to "always no".
 *
 * In an accessible VMA, however, pte_protnone() *reliably*
 * indicates PROT_NONE page protection due to NUMA hinting. NUMA
 * hinting faults only apply in accessible VMAs.
 *
 * So, to reliably distinguish between PROT_NONE due to an
 * inaccessible VMA and NUMA hinting, looking at the VMA
 * accessibility is sufficient.
 */

I'll send that as a separate patch.
(2) Consequently, commit c46a7c817e66 from 2014 does not tell the whole
story.

commit 21d9ee3eda77 ("mm: remove remaining references to NUMA hinting
bits and helpers") from 2015 made the distinction again impossible.

Setting FOLL_FORCE | FOLL_HONOR_NUMA_HINT would end up never making
progress in GUP with an inaccessible (PROT_NONE) VMA.

(a) GUP sees the pte_protnone() and triggers a NUMA hinting fault,
although NUMA hinting does not apply.

(b) handle_mm_fault() refuses to do anything with pte_protnone() in
an inaccessible VMA. And even if it would do something, the new
PTE would end up as pte_protnone() again.

So, GUP will keep retrying. I have a reproducer that triggers that
using ptrace read in an inaccessible VMA.

It's easy to make that work in GUP, simply by looking at the VMA
accessibility.

See my patch proposal, which cleanly decouples FOLL_FORCE from
FOLL_HONOR_NUMA_HINT.

(3) follow_page() does not check VMA permissions and, therefore, my
"implied FOLL_FORCE" assumption is not actually completely wrong.

And especially callers that don't pass FOLL_WRITE really expect
follow_page() to work even if the VMA is inaccessible.

But the interaction with NUMA hinting is just nasty, absolutely
agreed.

As raised in another comment, I'm looking into removing the
"foll_flags" parameter from follow_page() completely and cleanly
documenting the semantics of follow_page().

IMHO, the less follow_page(), the better. Let's see what we can do
to improve that.


So this would be the patch I would suggest as the first fix we can also
backport to stable.

Gave it a quick test, also with my ptrace read reproducer (trigger
FOLL_FORCE on inaccessible VMA; make sure it works and that the pages
don't suddenly end up readable in the page table). Seems to work.

I'll follow up with cleanups and moving FOLL_HONOR_NUMA_HINT setting to
the relevant callers (especially KVM). Further, I'll add a selftest to
make sure that ptrace on inaccessible VMAs keeps working as expected.
From 36c1aeb9aa3e859762f671776601a71179247d17 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Fri, 28 Jul 2023 21:57:04 +0200
Subject: [PATCH] mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT

As it turns out, unfortunately commit 474098edac26 ("mm/gup: replace
FOLL_NUMA by gup_can_follow_protnone()") missed that follow_page() and
follow_trans_huge_pmd() never set FOLL_NUMA because they really don't
want NUMA hinting faults.

As spelled out in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
page faults from gup/gup_fast"): "Other follow_page callers like KSM
should not use FOLL_NUMA, or they would fail to get the pages if they
use follow_page instead of get_user_pages."

While we didn't get BUG reports on the changed follow_page() semantics
yet (and it's just a matter of time), liubo reported [1] that
smaps_rollup results are imprecise, because they miss accounting of
pages that are mapped PROT_NONE due to NUMA hinting.

As KVM really depends on these NUMA hinting faults, removing the
pte_protnone()/pmd_protnone() handling in GUP code completely is not
really an option.

To fix the issues at hand, let's revive FOLL_NUMA as
FOLL_HONOR_NUMA_FAULT to restore the original behavior and add better
comments.

Set FOLL_HONOR_NUMA_FAULT independent of FOLL_FORCE. To make that
combination work in inaccessible VMAs, we have to perform proper VMA
accessibility checks in gup_can_follow_protnone().

Move gup_can_follow_protnone() to internal.h which feels more
appropriate and is required as long as FOLL_HONOR_NUMA_FAULT is an
internal flag.

As Linus notes [2], this handling doesn't make sense for many GUP users.
So we should really see if we instead let relevant GUP callers specify
it manually, and not trigger NUMA hinting faults from GUP as default.

[1] https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com
[2] https://lore.kernel.org/r/CAHk-=wgRiP_9X0rRdZKT8nhemZGNateMtb366t37d8-x7VRs=g@mail.gmail.com

Reported-by: liubo <liubo254@huawei.com>
Reported-by: Peter Xu <peterx@redhat.com>
Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()")
Cc: <stable@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: liubo <liubo254@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h | 15 ---------------
 mm/gup.c           | 18 ++++++++++++++----
 mm/huge_memory.c   |  2 +-
 mm/internal.h      | 31 +++++++++++++++++++++++++++++++
 4 files changed, 46 insertions(+), 20 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2dd73e4f3d8e..f8d7fa3c01c1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3400,21 +3400,6 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
 	return 0;
 }
 
-/*
- * Indicates whether GUP can follow a PROT_NONE mapped page, or whether
- * a (NUMA hinting) fault is required.
- */
-static inline bool gup_can_follow_protnone(unsigned int flags)
-{
-	/*
-	 * FOLL_FORCE has to be able to make progress even if the VMA is
-	 * inaccessible. Further, FOLL_FORCE access usually does not represent
-	 * application behaviour and we should avoid triggering NUMA hinting
-	 * faults.
-	 */
-	return flags & FOLL_FORCE;
-}
-
 typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);
 extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
 			       unsigned long size, pte_fn_t fn, void *data);
diff --git a/mm/gup.c b/mm/gup.c
index 76d222ccc3ff..54b8d77f3a3d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -597,7 +597,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	pte = ptep_get(ptep);
 	if (!pte_present(pte))
 		goto no_page;
-	if (pte_protnone(pte) && !gup_can_follow_protnone(flags))
+	if (pte_protnone(pte) && !gup_can_follow_protnone(vma, flags))
 		goto no_page;
 
 	page = vm_normal_page(vma, address, pte);
@@ -714,7 +714,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 	if (likely(!pmd_trans_huge(pmdval)))
 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
 
-	if (pmd_protnone(pmdval) && !gup_can_follow_protnone(flags))
+	if (pmd_protnone(pmdval) && !gup_can_follow_protnone(vma, flags))
 		return no_page_table(vma, flags);
 
 	ptl = pmd_lock(mm, pmd);
@@ -851,6 +851,10 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	if (WARN_ON_ONCE(foll_flags & FOLL_PIN))
 		return NULL;
 
+	/*
+	 * We never set FOLL_HONOR_NUMA_FAULT because callers don't expect
+	 * to fail on PROT_NONE-mapped pages.
+	 */
 	page = follow_page_mask(vma, address, foll_flags, &ctx);
 	if (ctx.pgmap)
 		put_dev_pagemap(ctx.pgmap);
@@ -1200,6 +1204,12 @@ static long __get_user_pages(struct mm_struct *mm,
 
 	VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN)));
 
+	/*
+	 * For now, always trigger NUMA hinting faults. Some GUP users like
+	 * KVM really require it to benefit from autonuma.
+	 */
+	gup_flags |= FOLL_HONOR_NUMA_FAULT;
+
 	do {
 		struct page *page;
 		unsigned int foll_flags = gup_flags;
@@ -2551,7 +2561,7 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 		struct page *page;
 		struct folio *folio;
 
-		if (pte_protnone(pte) && !gup_can_follow_protnone(flags))
+		if (pte_protnone(pte) && !gup_can_follow_protnone(NULL, flags))
 			goto pte_unmap;
 
 		if (!pte_access_permitted(pte, flags & FOLL_WRITE))
@@ -2971,7 +2981,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
 		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd) ||
 			     pmd_devmap(pmd))) {
 			if (pmd_protnone(pmd) &&
-			    !gup_can_follow_protnone(flags))
+			    !gup_can_follow_protnone(NULL, flags))
 				return 0;
 
 			if (!gup_huge_pmd(pmd, pmdp, addr, next, flags,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index eb3678360b97..ef6bdc4a6fec 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1468,7 +1468,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 		return ERR_PTR(-EFAULT);
 
 	/* Full NUMA hinting faults to serialise migration in fault paths */
-	if (pmd_protnone(*pmd) && !gup_can_follow_protnone(flags))
+	if (pmd_protnone(*pmd) && !gup_can_follow_protnone(vma, flags))
 		return NULL;
 
 	if (!pmd_write(*pmd) && gup_must_unshare(vma, flags, page))
diff --git a/mm/internal.h b/mm/internal.h
index a7d9e980429a..7db17259c51a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -937,6 +937,8 @@ enum {
 	FOLL_FAST_ONLY = 1 << 20,
 	/* allow unlocking the mmap lock */
 	FOLL_UNLOCKABLE = 1 << 21,
+	/* Honor (trigger) NUMA hinting faults on PROT_NONE-mapped pages. */
+	FOLL_HONOR_NUMA_FAULT = 1 << 22,
 };
 
 /*
@@ -1004,6 +1006,35 @@ static inline bool gup_must_unshare(struct vm_area_struct *vma,
 	return !PageAnonExclusive(page);
 }
 
+/*
+ * Indicates whether GUP can follow a PROT_NONE mapped page, or whether
+ * a (NUMA hinting) fault is required.
+ */
+static inline bool gup_can_follow_protnone(struct vm_area_struct *vma,
+					   unsigned int flags)
+{
+	/*
+	 * If callers don't want to honor NUMA hinting faults, no need to
+	 * determine if we would actually have to trigger a NUMA hinting fault.
+	 */
+	if (!(flags & FOLL_HONOR_NUMA_FAULT))
+		return true;
+
+	/* We really need the VMA ... */
+	if (!vma)
+		return false;
+
+	/*
+	 * ... because NUMA hinting faults only apply in accessible VMAs. In
+	 * inaccessible (PROT_NONE) VMAs, NUMA hinting faults don't apply.
+	 *
+	 * Requiring a fault here even for inaccessible VMAs would mean that
+	 * FOLL_FORCE cannot make any progress, because handle_mm_fault()
+	 * refuses to process NUMA hinting faults in inaccessible VMAs.
+	 */
+	return !vma_is_accessible(vma);
+}
+
 extern bool mirrored_kernelcore;
 
 static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma)
On Fri, Jul 28, 2023 at 07:14:26PM -0300, Jason Gunthorpe wrote:
> On Fri, Jul 28, 2023 at 11:31:49PM +0200, David Hildenbrand wrote:
> > * vfio triggers FOLL_PIN|FOLL_LONGTERM from a random QEMU thread.
> > Where should we migrate that page to? Would it actually be counter-
> > productive to migrate it to the NUMA node of the setup thread? The
> > longterm pin will turn the page unmovable, yes, but where to migrate
> > it to?
>
> For VFIO & KVM you actively don't get any kind of numa balancing or
> awareness. In this case qemu should probably strive to put the memory
> on the numa node of the majority of CPUs early on, because it doesn't
> get another shot at it.
>
> In other cases it depends quite a lot. E.g., DPDK might want its VFIO
> buffers to be NUMA'd to the node that is close to the device, not the
> CPU. Or vice versa. There is a lot of micro sensitivity here at high
> data rates. I think people today manually tune this by deliberately
> allocating the memory to specific numas, and then GUP should just leave
> it alone.

Right. For the other O_DIRECT example - it seems to be a more generic
question of whether we should rely on the follow-up accessor to decide
the target node of numa balancing.

At least for KVM's use case, I'd still expect the major path that
triggers it to be a guest access of a page from the vcpu threads - and
that's still the GUP path.

Thanks,
On Fri, Jul 28, 2023 at 03:00:04PM -0700, John Hubbard wrote:
> On 7/28/23 14:49, Peter Xu wrote:
> > > The story of how FOLL_NUMA and FOLL_FORCE became entangled was enlightening,
> > > by the way, and now that I've read it I don't want to go back. :)
> >
> > Yeah I fully agree we should hopefully remove the NUMA / FORCE
> > tangling.. even if we want to revert back to the FOLL_NUMA flag we may want
> > to not revive that specific part. I had a feeling that we're all on the
> > same page there.
>
> Yes, I think so. :)
>
> > It's more about the further step to make FOLL_NUMA opt-in for GUP.
>
> Let's say "FOLL_HONOR_NUMA_FAULT" for this next discussion, but yes. So
> given that our API allows passing in FOLL_ flags, I don't understand the
> objection to letting different callers pass in, or not pass in, that
> flag.
>
> It's the perfect way to clean up the whole thing. As Linus suggested
> slightly earlier here, there can be a comment at the call site,
> explaining why KVM needs FOLL_HONOR_NUMA_FAULT, and you're good, right?

I'm fine with experimenting here: as long as (at least) kvm is fully
covered, I'm not against it - not yet strongly, anyway. But again, IMHO
this is not merely a "cleanup" of a flag - if it falls under "cleanup"
at all, which I'm not 100% sure of yet - we're talking about changing
the GUP ABI.

What I wanted to point out is that we need to be careful: we're
literally changing the GUP ABI, and its NUMA implications, for all
kernel modules.

Thanks,
On Sat, Jul 29, 2023 at 11:35:22AM +0200, David Hildenbrand wrote: > On 29.07.23 00:35, David Hildenbrand wrote: > > > The original reason for not setting FOLL_NUMA all the time is > > > documented in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting > > > page faults from gup/gup_fast") from way back in 2012: > > > > > > * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault > > > * would be called on PROT_NONE ranges. We must never invoke > > > * handle_mm_fault on PROT_NONE ranges or the NUMA hinting > > > * page faults would unprotect the PROT_NONE ranges if > > > * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd > > > * bitflag. So to avoid that, don't set FOLL_NUMA if > > > * FOLL_FORCE is set. > > > > > > but I don't think the original reason for this is *true* any more. > > > > > > Because then two years later in 2014, in commit c46a7c817e66 ("x86: > > > define _PAGE_NUMA by reusing software bits on the PMD and PTE levels") > > > Mel made the code able to distinguish between PROT_NONE and NUMA > > > pages, and he changed the comment above too. > > > > > > Sleeping over it and looking into some nasty details, I realized the following things: > > > (1) The pte_protnone() comment in include/linux/pgtable.h is > either wrong or misleading. > > Especially the "For PROT_NONE VMAs, the PTEs are not marked > _PAGE_PROTNONE" is *wrong* nowadays on x86. > > Doing an mprotect(PROT_NONE) will also result in pte_protnone() > succeeding, because the pages *are* marked _PAGE_PROTNONE. > > The comment should be something like this > > /* > * In an inaccessible (PROT_NONE) VMA, pte_protnone() *may* indicate > * "yes". It is perfectly valid to indicate "no" in that case, > * which is why our default implementation defaults to "always no". > * > * In an accessible VMA, however, pte_protnone() *reliably* > * indicates PROT_NONE page protection due to NUMA hinting. NUMA > * hinting faults only apply in accessible VMAs. 
> * > * So, to reliably distinguish between PROT_NONE due to an > * inaccessible VMA and NUMA hinting, looking at the VMA > * accessibility is sufficient. > */ > > I'll send that as a separate patch. > > > (2) Consequently, commit c46a7c817e66 from 2014 does not tell the whole > story. > > commit 21d9ee3eda77 ("mm: remove remaining references to NUMA > hinting bits and helpers") from 2015 made the distinction again > impossible. > > Setting FOLL_FORCE | FOLL_HONOR_NUMA_HINT would end up never making > progress in GUP with an inaccessible (PROT_NONE) VMA. If we also teach follow_page_mask() on vma_is_accessible(), we should still be good, am I right? Basically fast-gup will stop working on protnone, and it always fallbacks to slow-gup. Then it seems we're good decoupling FORCE with NUMA hint. I assume that that's what you did below in the patch too, which looks right to me. > > (a) GUP sees the pte_protnone() and triggers a NUMA hinting fault, > although NUMA hinting does not apply. > > (b) handle_mm_fault() refuses to do anything with pte_protnone() in > an inaccessible VMA. And even if it would do something, the new > PTE would end up as pte_protnone() again. > So, GUP will keep retrying. I have a reproducer that triggers that > using ptrace read in an inaccessible VMA. > > It's easy to make that work in GUP, simply by looking at the VMA > accessibility. > > See my patch proposal, that cleanly decouples FOLL_FORCE from > FOLL_HONOR_NUMA_HINT. > > > (3) follow_page() does not check VMA permissions and, therefore, my > "implied FOLL_FORCE" assumption is not actually completely wrong. > > And especially callers that dont't pass FOLL_WRITE really expect > follow_page() to work even if the VMA is inaccessible. > > But the interaction with NUMA hinting is just nasty, absolutely > agreed. > > As raised in another comment, I'm looking into removing the > "foll_flags" parameter from follow_page() completely and cleanly > documenting the semantics of follow_page(). 
> > IMHO, the less follow_page(), the better. Let's see what we can do > to improve that. > > > So this would be the patch I would suggest as the first fix we can also > backport to stable. > > Gave it a quick test, also with my ptrace read reproducer (trigger > FOLL_FORCE on inaccessible VMA; make sure it works and that the pages don't > suddenly end up readable in the page table). Seems to work. > > I'll follow up with cleanups and moving FOLL_HONOR_NUMA_HINT setting to the > relevant callers (especially KVM). Further, I'll add a selftest to make > sure that ptrace on inaccessible VMAs keeps working as expected. > > > > From 36c1aeb9aa3e859762f671776601a71179247d17 Mon Sep 17 00:00:00 2001 > From: David Hildenbrand <david@redhat.com> > Date: Fri, 28 Jul 2023 21:57:04 +0200 > Subject: [PATCH] mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT > > As it turns out, unfortunately commit 474098edac26 ("mm/gup: replace > FOLL_NUMA by gup_can_follow_protnone()") missed that follow_page() and > follow_trans_huge_pmd() never set FOLL_NUMA because they really don't want > NUMA hinting faults. > > As spelled out in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page > faults from gup/gup_fast"): "Other follow_page callers like KSM should not > use FOLL_NUMA, or they would fail to get the pages if they use follow_page > instead of get_user_pages." > > While we didn't get BUG reports on the changed follow_page() semantics yet > (and it's just a matter of time), liubo reported [1] that smaps_rollup > results are imprecise, because they miss accounting of pages that are > mapped PROT_NONE due to NUMA hinting. > > As KVM really depends on these NUMA hinting faults, removing the > pte_protnone()/pmd_protnone() handling in GUP code completely is not really > an option. > > To fix the issues at hand, let's revive FOLL_NUMA as FOLL_HONOR_NUMA_FAULT > to restore the original behavior and add better comments. > > Set FOLL_HONOR_NUMA_FAULT independent of FOLL_FORCE. 
To make that > combination work in inaccessible VMAs, we have to perform proper VMA > accessibility checks in gup_can_follow_protnone(). > > Move gup_can_follow_protnone() to internal.h which feels more > appropriate and is required as long as FOLL_HONOR_NUMA_FAULT is an > internal flag. > > As Linus notes [2], this handling doesn't make sense for many GUP users. > So we should really see if we instead let relevant GUP callers specify it > manually, and not trigger NUMA hinting faults from GUP as default. > > [1] https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com > [2] https://lore.kernel.org/r/CAHk-=wgRiP_9X0rRdZKT8nhemZGNateMtb366t37d8-x7VRs=g@mail.gmail.com > > Reported-by: liubo <liubo254@huawei.com> > Reported-by: Peter Xu <peterx@redhat.com> > Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") > Cc: <stable@vger.kernel.org> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Linus Torvalds <torvalds@linux-foundation.org> > Cc: Peter Xu <peterx@redhat.com> > Cc: liubo <liubo254@huawei.com> > Cc: Matthew Wilcox <willy@infradead.org> > Cc: Hugh Dickins <hughd@google.com> > Cc: Jason Gunthorpe <jgg@ziepe.ca> > Cc: John Hubbard <jhubbard@nvidia.com> > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > include/linux/mm.h | 15 --------------- > mm/gup.c | 18 ++++++++++++++---- > mm/huge_memory.c | 2 +- > mm/internal.h | 31 +++++++++++++++++++++++++++++++ > 4 files changed, 46 insertions(+), 20 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 2dd73e4f3d8e..f8d7fa3c01c1 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -3400,21 +3400,6 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags) > return 0; > } > -/* > - * Indicates whether GUP can follow a PROT_NONE mapped page, or whether > - * a (NUMA hinting) fault is required. 
> - */ > -static inline bool gup_can_follow_protnone(unsigned int flags) > -{ > - /* > - * FOLL_FORCE has to be able to make progress even if the VMA is > - * inaccessible. Further, FOLL_FORCE access usually does not represent > - * application behaviour and we should avoid triggering NUMA hinting > - * faults. > - */ > - return flags & FOLL_FORCE; > -} > - > typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data); > extern int apply_to_page_range(struct mm_struct *mm, unsigned long address, > unsigned long size, pte_fn_t fn, void *data); > diff --git a/mm/gup.c b/mm/gup.c > index 76d222ccc3ff..54b8d77f3a3d 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -597,7 +597,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, > pte = ptep_get(ptep); > if (!pte_present(pte)) > goto no_page; > - if (pte_protnone(pte) && !gup_can_follow_protnone(flags)) > + if (pte_protnone(pte) && !gup_can_follow_protnone(vma, flags)) > goto no_page; > page = vm_normal_page(vma, address, pte); > @@ -714,7 +714,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, > if (likely(!pmd_trans_huge(pmdval))) > return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap); > - if (pmd_protnone(pmdval) && !gup_can_follow_protnone(flags)) > + if (pmd_protnone(pmdval) && !gup_can_follow_protnone(vma, flags)) > return no_page_table(vma, flags); > ptl = pmd_lock(mm, pmd); > @@ -851,6 +851,10 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, > if (WARN_ON_ONCE(foll_flags & FOLL_PIN)) > return NULL; > + /* > + * We never set FOLL_HONOR_NUMA_FAULT because callers don't expect > + * to fail on PROT_NONE-mapped pages. > + */ > page = follow_page_mask(vma, address, foll_flags, &ctx); > if (ctx.pgmap) > put_dev_pagemap(ctx.pgmap); > @@ -1200,6 +1204,12 @@ static long __get_user_pages(struct mm_struct *mm, > VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN))); > + /* > + * For now, always trigger NUMA hinting faults. 
Some GUP users like > + * KVM really require it to benefit from autonuma. > + */ > + gup_flags |= FOLL_HONOR_NUMA_FAULT; > + > do { > struct page *page; > unsigned int foll_flags = gup_flags; > @@ -2551,7 +2561,7 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, > struct page *page; > struct folio *folio; > - if (pte_protnone(pte) && !gup_can_follow_protnone(flags)) > + if (pte_protnone(pte) && !gup_can_follow_protnone(NULL, flags)) > goto pte_unmap; > if (!pte_access_permitted(pte, flags & FOLL_WRITE)) > @@ -2971,7 +2981,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo > if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd) || > pmd_devmap(pmd))) { > if (pmd_protnone(pmd) && > - !gup_can_follow_protnone(flags)) > + !gup_can_follow_protnone(NULL, flags)) > return 0; > if (!gup_huge_pmd(pmd, pmdp, addr, next, flags, > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index eb3678360b97..ef6bdc4a6fec 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1468,7 +1468,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, > return ERR_PTR(-EFAULT); > /* Full NUMA hinting faults to serialise migration in fault paths */ > - if (pmd_protnone(*pmd) && !gup_can_follow_protnone(flags)) > + if (pmd_protnone(*pmd) && !gup_can_follow_protnone(vma, flags)) > return NULL; > if (!pmd_write(*pmd) && gup_must_unshare(vma, flags, page)) > diff --git a/mm/internal.h b/mm/internal.h > index a7d9e980429a..7db17259c51a 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -937,6 +937,8 @@ enum { > FOLL_FAST_ONLY = 1 << 20, > /* allow unlocking the mmap lock */ > FOLL_UNLOCKABLE = 1 << 21, > + /* Honor (trigger) NUMA hinting faults on PROT_NONE-mapped pages. 
*/ > + FOLL_HONOR_NUMA_FAULT = 1 << 22, > }; > /* > @@ -1004,6 +1006,35 @@ static inline bool gup_must_unshare(struct vm_area_struct *vma, > return !PageAnonExclusive(page); > } > +/* > + * Indicates whether GUP can follow a PROT_NONE mapped page, or whether > + * a (NUMA hinting) fault is required. > + */ > +static inline bool gup_can_follow_protnone(struct vm_area_struct *vma, > + unsigned int flags) > +{ > + /* > + * If callers don't want to honor NUMA hinting faults, no need to > + * determine if we would actually have to trigger a NUMA hinting fault. > + */ > + if (!(flags & FOLL_HONOR_NUMA_FAULT)) > + return true; > + > + /* We really need the VMA ... */ > + if (!vma) > + return false; I'm not sure whether the compiler will be smart enough to inline this for fast-gup on pmd/pte. One way to guarantee this is we simply always bail out for fast-gup on protnone (ignoring calling gup_can_follow_protnone() with a comment), as discussed above. Then WARN_ON_ONCE(!vma) for all the rest callers, assuming that's a required knowledge to know what the protnone means. Thanks, > + > + /* > + * ... because NUMA hinting faults only apply in accessible VMAs. In > + * inaccessible (PROT_NONE) VMAs, NUMA hinting faults don't apply. > + * > + * Requiring a fault here even for inaccessible VMAs would mean that > + * FOLL_FORCE cannot make any progress, because handle_mm_fault() > + * refuses to process NUMA hinting faults in inaccessible VMAs. > + */ > + return !vma_is_accessible(vma); > +} > + > extern bool mirrored_kernelcore; > static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma) > -- > 2.41.0 > > > > -- > Cheers, > > David / dhildenb >
>> (2) Consequently, commit c46a7c817e66 from 2014 does not tell the whole
>> story.
>>
>> commit 21d9ee3eda77 ("mm: remove remaining references to NUMA
>> hinting bits and helpers") from 2015 made the distinction again
>> impossible.
>>
>> Setting FOLL_FORCE | FOLL_HONOR_NUMA_HINT would end up never making
>> progress in GUP with an inaccessible (PROT_NONE) VMA.
>
> If we also teach follow_page_mask() on vma_is_accessible(), we should still
> be good, am I right?
>
> Basically fast-gup will stop working on protnone, and it always fallbacks
> to slow-gup. Then it seems we're good decoupling FORCE with NUMA hint.
>
> I assume that that's what you did below in the patch too, which looks right
> to me.

I modified it slightly: FOLL_HONOR_NUMA_FAULT is now set in
is_valid_gup_args(), such that it will always be set for any GUP users,
including GUP-fast.

[...]

>> +/*
>> + * Indicates whether GUP can follow a PROT_NONE mapped page, or whether
>> + * a (NUMA hinting) fault is required.
>> + */
>> +static inline bool gup_can_follow_protnone(struct vm_area_struct *vma,
>> +					   unsigned int flags)
>> +{
>> +	/*
>> +	 * If callers don't want to honor NUMA hinting faults, no need to
>> +	 * determine if we would actually have to trigger a NUMA hinting fault.
>> +	 */
>> +	if (!(flags & FOLL_HONOR_NUMA_FAULT))
>> +		return true;
>> +
>> +	/* We really need the VMA ... */
>> +	if (!vma)
>> +		return false;
>
> I'm not sure whether the compiler will be smart enough to inline this for
> fast-gup on pmd/pte.

Why shouldn't it? It's placed in a header file and the vma == NULL check
is not obfuscated. :)

Anyhow, I'll take a look at the compiler output.

Thanks!
On Mon, 31 Jul 2023 at 09:20, David Hildenbrand <david@redhat.com> wrote: > > I modified it slightly: FOLL_HONOR_NUMA_FAULT is now set in > is_valid_gup_args(), such that it will always be set for any GUP users, > including GUP-fast. But do we actually want that? It is actively crazy to honor NUMA faulting at least for get_user_pages_remote(). So right now, GUP-fast requires us to honor NUMA faults, because GUP-fast doesn't have a vma (which in turn is because GUP-fast doesn't take any locks). So GUP-fast can only look at the page table data, and as such *has* to fail if the page table is inaccessible. But GUP in general? Why would it want to honor numa faulting? Particularly by default, and _particularly_ for things like FOLL_REMOTE. In fact, I feel like this is what the real rule should be: we simply define that get_user_pages_fast() is about looking up the page in the page tables. So if you want something that acts like a page table lookup, you use that "fast" thing. It's literally how it is designed. The whole - and pretty much only - point of it is that it can be used with no locking at all, because it basically acts like the hardware lookup does. So then if KVM wants to look up a page in the page table, that is what kvm should use, and it automatically gets the "honor numa faults" behavior, not because it sets a magic flag, but simply because that is how GUP-fast *works*. But if you use the "normal" get/pin_user_pages() function, which looks up the vma, at that point you are following things at a "software level", and it wouldn't do NUMA faulting, it would just get the page. (Ok, we have the whole "FAST_ONLY vs fall back" case, so "fast" can look up the vma too, but read the above argument as "fast *can* be done without vma, so fast must honor page table bits as per hardware"). Linus
On Mon, Jul 31, 2023 at 11:23:59AM -0700, Linus Torvalds wrote: > So GUP-fast can only look at the page table data, and as such *has* to > fail if the page table is inaccessible. > > But GUP in general? Why would it want to honor numa faulting? > Particularly by default, and _particularly_ for things like > FOLL_REMOTE. True. > > In fact, I feel like this is what the real rule should be: we simply > define that get_user_pages_fast() is about looking up the page in the > page tables. > > So if you want something that acts like a page table lookup, you use > that "fast" thing. It's literally how it is designed. The whole - and > pretty much only - point of it is that it can be used with no locking > at all, because it basically acts like the hardware lookup does. Unfortunately I think at least kvm (besides the rest..) relies not only on numa balancing but also on fast-gup.. :-( Please refer to hva_to_pfn(), which even supports fast-gup-only HVA translation when atomic==true is set. Thanks,
On 31.07.23 20:23, Linus Torvalds wrote: > On Mon, 31 Jul 2023 at 09:20, David Hildenbrand <david@redhat.com> wrote: >> Hi Linus, >> I modified it slightly: FOLL_HONOR_NUMA_FAULT is now set in >> is_valid_gup_args(), such that it will always be set for any GUP users, >> including GUP-fast. > > But do we actually want that? It is actively crazy to honor NUMA > faulting at least for get_user_pages_remote(). This would only be for the stable backport that would go in first and where I want to be a bit careful. Next step would be to let the callers (KVM) specify FOLL_HONOR_NUMA_FAULT, as suggested by you. > > So right now, GUP-fast requires us to honor NUMA faults, because > GUP-fast doesn't have a vma (which in turn is because GUP-fast doesn't > take any locks). With FOLL_HONOR_NUMA_FAULT moved to the GUP caller that would no longer be the case. Anybody who (1) doesn't specify FOLL_HONOR_NUMA_FAULT (which is the majority) and (2) doesn't specify FOLL_WRITE would get GUP-fast just grabbing these pte_protnone() pages. > > So GUP-fast can only look at the page table data, and as such *has* to > fail if the page table is inaccessible. gup_fast_only, yes, which is what KVM uses if a writable PFN is desired. > > But GUP in general? Why would it want to honor numa faulting? > Particularly by default, and _particularly_ for things like > FOLL_REMOTE. KVM currently does [virt/kvm/kvm_main.c]: (1) hva_to_pfn_fast(): call get_user_page_fast_only(FOLL_WRITE) if a writable PFN is desired (2) hva_to_pfn_slow(): call get_user_pages_unlocked() So in the "!writable" case, we would always call get_user_pages_unlocked() and never honor NUMA faults. Converting that to some other pattern might be possible (although KVM plays quite some tricks here!), but assuming we would always first do a get_user_page_fast_only(), then when not intending to write (!FOLL_WRITE) (1) get_user_page_fast_only() would honor NUMA faults and fail (2) get_user_pages() would not honor NUMA faults and succeed Hmmm ... 
so we would have to use get_user_pages_fast()? It might be possible, but I am not sure if we want get_user_pages_fast() to always honor NUMA faults, because ... > > In fact, I feel like this is what the real rule should be: we simply > define that get_user_pages_fast() is about looking up the page in the > page tables. > > So if you want something that acts like a page table lookup, you use > that "fast" thing. It's literally how it is designed. The whole - and > pretty much only - point of it is that it can be used with no locking > at all, because it basically acts like the hardware lookup does. > ... I see what you mean (HW would similarly refuse to use such a page), but I do wonder if that makes the API clearer and if this is what we actually want. We do have callers of pin_user_pages_fast() and friends that maybe *really* shouldn't care about NUMA hinting. iov_iter_extract_user_pages() is one example -- used for O_DIRECT nowadays. Their logic is "if it's directly in the page table, great, hand it over. If not, please take the slow path.". In many cases user space just touched these pages, so they are very likely in the page table. Converting them to pin_user_pages() would mean they will just run slower in the common case. Converting them to a manual pin_user_pages_fast_only() + pin_user_pages() doesn't seem very compelling. ... so we would need a new API? :/ > So then if KVM wants to look up a page in the page table, that is what > kvm should use, and it automatically gets the "honor numa faults" > behavior, not because it sets a magic flag, but simply because that is > how GUP-fast *works*. > > But if you use the "normal" get/pin_user_pages() function, which looks > up the vma, at that point you are following things at a "software > level", and it wouldn't do NUMA faulting, it would just get the page. 
My main problem with that is that pin_user_pages_fast() and friends are used all over the place for a "likely already in the page table case, so just make everything faster as default". Always honoring NUMA faults here does not sound like the improvement we wanted to have :) ... we actually *don't* want to honor NUMA faults here. We just have to find a way to make the KVM special case happy. Thanks!
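KVM's two-step lookup that David describes above — try `get_user_page_fast_only()` first, and fall back to a slower path that can fault pages in, unless the caller is in an atomic context — can be sketched as a toy model. The page-table structure and every function body below are illustrative stand-ins, not KVM's actual `hva_to_pfn()` code:

```c
#include <stdbool.h>

/* Toy page-table entry: present/writable/protnone bits only. */
struct pte {
    bool present, writable, protnone;
};

/* Stand-in for get_user_page_fast_only(): like a hardware walk, it
 * only trusts the page-table bits, so a protnone PTE must fail. */
static bool gup_fast_only(const struct pte *pte, bool write)
{
    if (!pte->present || pte->protnone)
        return false;
    if (write && !pte->writable)
        return false;
    return true;
}

/* Stand-in for the slow path (get_user_pages_unlocked()): it may take
 * mmap_lock and fault pages in; here we pretend the fault handler
 * resolved everything, so it always succeeds. */
static bool gup_slow(struct pte *pte, bool write)
{
    pte->present = true;
    pte->protnone = false;
    if (write)
        pte->writable = true;
    return true;
}

/* Shape of the lookup: try the lockless fast path for a writable
 * mapping first; fall back to the slow path unless atomic. */
static bool lookup_pfn(struct pte *pte, bool atomic)
{
    if (gup_fast_only(pte, /*write=*/true))
        return true;
    if (atomic)  /* atomic contexts cannot sleep on mmap_lock */
        return false;
    return gup_slow(pte, /*write=*/false);
}
```

This makes the tension concrete: a protnone (NUMA-hinted) PTE always pushes the lookup onto the slow path, which is exactly the "runs slower in the common case" cost David is worried about for callers like O_DIRECT that don't care about NUMA hinting.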
On Mon, 31 Jul 2023 at 12:00, David Hildenbrand <david@redhat.com> wrote: > > So in the "!writable" case, we would always call > get_user_pages_unlocked() and never honor NUMA faults. Ok, so kvm doesn't just use the fast version. Oh well. It was an idea.. Linus
On 31.07.23 21:07, Linus Torvalds wrote: > On Mon, 31 Jul 2023 at 12:00, David Hildenbrand <david@redhat.com> wrote: >> >> So in the "!writable" case, we would always call >> get_user_pages_unlocked() and never honor NUMA faults. > > Ok, so kvm doesn't just use the fast version. Oh well. It was an idea.. Certainly an interesting one, thanks for thinking about possible alternatives! Unfortunately, KVM is an advanced GUP user that manages secondary MMUs. I'll get the FOLL_HONOR_NUMA_FAULT patches ready tomorrow and we can discuss if that looks better. (btw, the whole reasoning about "HW would refuse to use these pages" was exactly the reason I went in the FOLL_FORCE direction ... but it's really better to make FOLL_FORCE deal with VMA protection only)
On Mon, Jul 31, 2023 at 09:00:06PM +0200, David Hildenbrand wrote: > Their logic is "if it's directly in the page table, create, hand it over. If > not, please go the slow path.". In many cases user space just touched these > pages so they are very likely in the page table. I think it has become pretty confusing, overall. In my mind 'pin_user_pages_fast()' should be functionally the same as 'pin_user_pages_unlocked()'. Places call fast if they have no idea about what is under memory, otherwise they call unlocked if you are pretty sure something is there that needs the mmap lock to resolve. If we need different behaviors a GUP flag makes the most sense. > Always honoring NUMA faults here does not sound like the improvement we > wanted to have :) ... we actually *don't* want to honor NUMA faults here. Yeah, I think that is right. We should not really use the CPU running PUP as any input to a NUMA algorithm.. If we want NUMA'ness then the PUP user should specify the affinity that makes sense. Jason
On Fri, Jul 28, 2023 at 07:30:33PM +0200, David Hildenbrand wrote: > On 28.07.23 18:18, Linus Torvalds wrote: > > On Thu, 27 Jul 2023 at 14:28, David Hildenbrand <david@redhat.com> wrote: > > > > > > This is my proposal on how to handle the fallout of 474098edac26 > > > ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") where I > > > accidentially missed that follow_page() and smaps implicitly kept the > > > FOLL_NUMA flag clear by *not* setting it if FOLL_FORCE is absent, to > > > not trigger faults on PROT_NONE-mapped PTEs. > > > > Ugh. > > I was hoping for that reaction, with the assumption that we would get > something cleaner :) > > > > > I hate how it uses FOLL_FORCE that is inherently scary. > > I hate FOLL_FORCE, but I hate FOLL_NUMA even more, because to me it > is FOLL_FORCE in disguise (currently and before 474098edac26, if > FOLL_FORCE is set, FOLL_NUMA won't be set and the other way around). > FOLL_NUMA being conflated with FOLL_FORCE is almost certainly a historical accident. > > > > Why do we have that "gup_can_follow_protnone()" logic AT ALL? > > That's what I was hoping for. > > > > > Couldn't we just get rid of that disgusting thing, and just say that > > GUP (and follow_page()) always just ignores NUMA hinting, and always > > just follows protnone? > > > > We literally used to have this: > > > > if (!(gup_flags & FOLL_FORCE)) > > gup_flags |= FOLL_NUMA; > > > > ie we *always* set FOLL_NUMA for any sane situation. FOLL_FORCE should > > be the rare crazy case. > > Yes, but my point would be that we now spell that "rare crazy case" > out for follow_page(). > > If you're talking about patch #1, I agree, therefore patch #3 to > avoid all that nasty FOLL_FORCE handling in GUP callers. > > But yeah, if we can avoid all that, great. 
> > > > > The original reason for not setting FOLL_NUMA all the time is > > documented in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting > > page faults from gup/gup_fast") from way back in 2012: > > > > * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault > > * would be called on PROT_NONE ranges. We must never invoke > > * handle_mm_fault on PROT_NONE ranges or the NUMA hinting > > * page faults would unprotect the PROT_NONE ranges if > > * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd > > * bitflag. So to avoid that, don't set FOLL_NUMA if > > * FOLL_FORCE is set. > > > In handle_mm_fault(), we never call do_numa_page() if > !vma_is_accessible(). Same for do_huge_pmd_numa_page(). > > So, if we would ever end up triggering a page fault on > mprotect(PROT_NONE) ranges (i.e., via FOLL_FORCE), we > would simply do nothing. > > At least that's the hope, I'll take a closer look just to make > sure we're good on all call paths. > > > > > but I don't think the original reason for this is *true* any more. > > > > Because then two years later in 2014, in commit c46a7c817e66 ("x86: > > define _PAGE_NUMA by reusing software bits on the PMD and PTE levels") > > Mel made the code able to distinguish between PROT_NONE and NUMA > > pages, and he changed the comment above too. > > CCing Mel. > > I remember that pte_protnone() can only distinguish between > NUMA vs. actual mprotect(PROT_NONE) by looking at the VMA -- vma_is_accessible(). > Ok, as usual, I'm far behind and this thread is massive, but I'll respond to this part before trying to digest the history of this and the current implementation. To the best of my recollection, FOLL_NUMA used to be a correctness issue, but that should no longer be true. Initially, it was to prevent mixing up "PROT_NONE" that was for NUMA hinting and "PROT_NONE" due to VMA protections. Now the bits are different, so this case should be avoidable. 
Later it was still a different correctness issue because PMD migration had a hacky implementation without migration entries and a GUP could find a page that was being collapsed and had to be serialised. That should also now be avoidable. At some point, FOLL_FORCE and FOLL_NUMA got conflated, but they really should not be related, even if they are by accident. FOLL_FORCE (e.g. ptrace) may have to process the fault and make the page resident and accessible regardless of any other consequences. FOLL_NUMA ideally should be much more specific. If the calling context only cares about the struct page (e.g. smaps) then it's ok to get a reference to the page. If necessary, it could clear the protection and lose the hinting fault, although that is less than ideal. Just needing the struct page for informational purposes should not be treated as a NUMA hinting fault, because it has nothing to do with the task's memory reference behaviour. A variant of FOLL_NUMA (FOLL_NUMA_HINT?) may still be required to indicate the calling context is accessing the page for reasons that are equivalent to a real memory access from a CPU related to the task mapping the page. I didn't check, but KVM may be an example of this when dealing with some MMU faults, as the page is being looked up on behalf of the task, and presumably from the same CPU the task was running on. Something like reading smaps only needs the struct page, but it should not be treated as a NUMA hinting fault, as the access has nothing to do with the task mapping the page. > > The original reason for FOLL_NUMA simply does not exist any more. We > > know exactly when a page is marked for NUMA faulting, and we should > > simply *ignore* it for GUP and follow_page(). > > I'd be wary of completely ignoring it if there is any known calling context that is equivalent to a memory access and the hinting fault should be processed -- KVM may be an example.
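Mel's point — that a protnone PTE is only a NUMA hint when the VMA itself is accessible, because x86 shares the same PTE bit between NUMA hinting and real mprotect(PROT_NONE) — can be sketched as a toy classifier. The types and names below are illustrative stand-ins, not the kernel's actual code (which checks `vma_is_accessible()` against the VMA's access flags):

```c
#include <stdbool.h>

enum protnone_kind {
    NOT_PROTNONE,    /* PTE doesn't have the protnone bit set */
    NUMA_HINT,       /* protnone PTE in an accessible VMA */
    REAL_PROT_NONE,  /* protnone PTE in a PROT_NONE VMA */
};

/* Stand-in VMA: we only model the vma_is_accessible() result. */
struct vm_area_struct {
    bool accessible;
};

/* The PTE bit alone is ambiguous on x86, where NUMA hinting reuses
 * the protnone encoding: only the VMA's access bits disambiguate a
 * hinting fault candidate from genuine mprotect(PROT_NONE). */
static enum protnone_kind classify_protnone(bool pte_protnone,
                                            const struct vm_area_struct *vma)
{
    if (!pte_protnone)
        return NOT_PROTNONE;
    return vma->accessible ? NUMA_HINT : REAL_PROT_NONE;
}
```

Under this model, a hypothetical FOLL_NUMA_HINT caller would trigger a fault only on the NUMA_HINT case, while an informational reader like smaps would take the page reference in all three cases.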