| Message ID | 20230727212845.135673-1-david@redhat.com (mailing list archive) |
|---|---|
| Series | smaps / mm/gup: fix gup_can_follow_protnone fallout |
On Thu, 27 Jul 2023 at 14:28, David Hildenbrand <david@redhat.com> wrote:
>
> This is my proposal on how to handle the fallout of 474098edac26
> ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") where I
> accidentally missed that follow_page() and smaps implicitly kept the
> FOLL_NUMA flag clear by *not* setting it if FOLL_FORCE is absent, to
> not trigger faults on PROT_NONE-mapped PTEs.

Ugh.

I hate how it uses FOLL_FORCE that is inherently scary.

Why do we have that "gup_can_follow_protnone()" logic AT ALL?

Couldn't we just get rid of that disgusting thing, and just say that GUP
(and follow_page()) always just ignores NUMA hinting, and always just
follows protnone?

We literally used to have this:

        if (!(gup_flags & FOLL_FORCE))
                gup_flags |= FOLL_NUMA;

ie we *always* set FOLL_NUMA for any sane situation. FOLL_FORCE should
be the rare crazy case.

The original reason for not setting FOLL_NUMA all the time is documented
in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page faults from
gup/gup_fast") from way back in 2012:

 * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
 * would be called on PROT_NONE ranges. We must never invoke
 * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
 * page faults would unprotect the PROT_NONE ranges if
 * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
 * bitflag. So to avoid that, don't set FOLL_NUMA if
 * FOLL_FORCE is set.

but I don't think the original reason for this is *true* any more.

Because then two years later in 2014, in commit c46a7c817e66 ("x86:
define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
Mel made the code able to distinguish between PROT_NONE and NUMA pages,
and he changed the comment above too.

But I get the very very strong feeling that instead of changing the
comment, he should have actually removed the comment and the code.

So I get the strong feeling that all these FOLL_NUMA games should just
go away.
You removed the FOLL_NUMA bit, but you replaced it with using FOLL_FORCE.

So rather than make this all even more opaque and make it even harder to
figure out why we have that code in the first place, I think it should
all just be removed.

The original reason for FOLL_NUMA simply does not exist any more. We
know exactly when a page is marked for NUMA faulting, and we should
simply *ignore* it for GUP and follow_page().

I think we should treat a NUMA-faulting page as just being present (and
not NUMA-fault it).

Am I missing something?

              Linus
On 28.07.23 18:18, Linus Torvalds wrote:
> On Thu, 27 Jul 2023 at 14:28, David Hildenbrand <david@redhat.com> wrote:
>>
>> This is my proposal on how to handle the fallout of 474098edac26
>> ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") where I
>> accidentally missed that follow_page() and smaps implicitly kept the
>> FOLL_NUMA flag clear by *not* setting it if FOLL_FORCE is absent, to
>> not trigger faults on PROT_NONE-mapped PTEs.
>
> Ugh.

I was hoping for that reaction, with the assumption that we would get
something cleaner :)

> I hate how it uses FOLL_FORCE that is inherently scary.

I hate FOLL_FORCE, but I hate FOLL_NUMA even more, because to me it is
FOLL_FORCE in disguise (currently and before 474098edac26, if FOLL_FORCE
is set, FOLL_NUMA won't be set, and the other way around).

> Why do we have that "gup_can_follow_protnone()" logic AT ALL?

That's what I was hoping for.

> Couldn't we just get rid of that disgusting thing, and just say that
> GUP (and follow_page()) always just ignores NUMA hinting, and always
> just follows protnone?
>
> We literally used to have this:
>
>         if (!(gup_flags & FOLL_FORCE))
>                 gup_flags |= FOLL_NUMA;
>
> ie we *always* set FOLL_NUMA for any sane situation. FOLL_FORCE should
> be the rare crazy case.

Yes, but my point would be that we now spell that "rare crazy case" out
for follow_page().

If you're talking about patch #1, I agree; hence patch #3, to avoid all
that nasty FOLL_FORCE handling in GUP callers.

But yeah, if we can avoid all that, great.

> The original reason for not setting FOLL_NUMA all the time is
> documented in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
> page faults from gup/gup_fast") from way back in 2012:
>
>  * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
>  * would be called on PROT_NONE ranges. We must never invoke
>  * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
>  * page faults would unprotect the PROT_NONE ranges if
>  * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
>  * bitflag. So to avoid that, don't set FOLL_NUMA if
>  * FOLL_FORCE is set.

In handle_mm_fault(), we never call do_numa_page() if
!vma_is_accessible(). Same for do_huge_pmd_numa_page().

So, if we would ever end up triggering a page fault on
mprotect(PROT_NONE) ranges (i.e., via FOLL_FORCE), we would simply do
nothing.

At least that's the hope; I'll take a closer look just to make sure
we're good on all call paths.

> but I don't think the original reason for this is *true* any more.
>
> Because then two years later in 2014, in commit c46a7c817e66 ("x86:
> define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
> Mel made the code able to distinguish between PROT_NONE and NUMA
> pages, and he changed the comment above too.

CCing Mel.

I remember that pte_protnone() can only distinguish between NUMA vs.
actual mprotect(PROT_NONE) by looking at the VMA -- vma_is_accessible().

Indeed, include/linux/pgtable.h:

/*
 * Technically a PTE can be PROTNONE even when not doing NUMA balancing but
 * the only case the kernel cares is for NUMA balancing and is only ever set
 * when the VMA is accessible. For PROT_NONE VMAs, the PTEs are not marked
 * _PAGE_PROTNONE so by default, implement the helper as "always no". It
 * is the responsibility of the caller to distinguish between PROT_NONE
 * protections and NUMA hinting fault protections.
 */

> But I get the very very strong feeling that instead of changing the
> comment, he should have actually removed the comment and the code.
>
> So I get the strong feeling that all these FOLL_NUMA games should just
> go away. You removed the FOLL_NUMA bit, but you replaced it with using
> FOLL_FORCE.
>
> So rather than make this all even more opaque and make it even harder
> to figure out why we have that code in the first place, I think it
> should all just be removed.

At least to me, spelling out FOLL_FORCE in follow_page() is now much
less opaque than having it implied by the lack of FOLL_NUMA.

> The original reason for FOLL_NUMA simply does not exist any more. We
> know exactly when a page is marked for NUMA faulting, and we should
> simply *ignore* it for GUP and follow_page().
>
> I think we should treat a NUMA-faulting page as just being present
> (and not NUMA-fault it).
>
> Am I missing something?

There was the case for "FOLL_PIN represents application behavior and
should trigger NUMA faults", but I guess that can be ignored.

But it would be much better to just remove all that if we can.

Let me look into some details. Thanks Linus!
[...]

> There was the case for "FOLL_PIN represents application behavior and
> should trigger NUMA faults", but I guess that can be ignored.
>
> But it would be much better to just remove all that if we can.
>
> Let me look into some details.

I just stumbled over the comment from Mel in follow_trans_huge_pmd():

	/* Full NUMA hinting faults to serialise migration in fault paths */

It dates back to

commit 2b4847e73004c10ae6666c2e27b5c5430aed8698
Author: Mel Gorman <mgorman@suse.de>
Date:   Wed Dec 18 17:08:32 2013 -0800

    mm: numa: serialise parallel get_user_page against THP migration

    Base pages are unmapped and flushed from cache and TLB during normal
    page migration and replaced with a migration entry that causes any
    parallel NUMA hinting fault or gup to block until migration completes.

    THP does not unmap pages due to a lack of support for migration
    entries at a PMD level. This allows races with get_user_pages and
    get_user_pages_fast which commit 3f926ab945b6 ("mm: Close races
    between THP migration and PMD numa clearing") made worse by
    introducing a pmd_clear_flush().

    This patch forces get_user_page (fast and normal) on a pmd_numa page to
    go through the slow get_user_page path where it will serialise against
    THP migration and properly account for the NUMA hinting fault. On the
    migration side the page table lock is taken for each PTE update.

We nowadays do have migration entries at the PMD level -- and setting
FOLL_FORCE could similarly trigger such a race. So I suspect we're good.
Hi, Linus,

On Fri, Jul 28, 2023 at 09:18:45AM -0700, Linus Torvalds wrote:
> The original reason for FOLL_NUMA simply does not exist any more. We
> know exactly when a page is marked for NUMA faulting, and we should
> simply *ignore* it for GUP and follow_page().
>
> I think we should treat a NUMA-faulting page as just being present
> (and not NUMA-fault it).

But then does it mean that any GUP-only user will have NUMA balancing
completely disabled? As long as the page will only be accessed by GUP,
NUMA balancing will never trigger anyway.. I think KVM is manipulating
guest pages just like that. Not sure whether it means it'll void the
whole NUMA effort there.

If we allow that GUP to happen (taking protnone as present), I assume
it'll also stop any further NUMA balancing on this very page, because
even if some page fault handler triggered on this protnone page later
(one that is not GUP anymore), when it wants to migrate the page to the
other NUMA node it'll see someone is already holding a reference on it,
and then we should give up the balancing.

So to me FOLL_NUMA (or any identifier like it, just to describe the
caller's need; some may really just want to fetch the pfn/page) still
makes sense. But maybe I totally misunderstood the above..

Thanks,
On 28.07.23 19:30, David Hildenbrand wrote:
> On 28.07.23 18:18, Linus Torvalds wrote:
>> On Thu, 27 Jul 2023 at 14:28, David Hildenbrand <david@redhat.com> wrote:
>>>
>>> This is my proposal on how to handle the fallout of 474098edac26
>>> ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") where I
>>> accidentally missed that follow_page() and smaps implicitly kept the
>>> FOLL_NUMA flag clear by *not* setting it if FOLL_FORCE is absent, to
>>> not trigger faults on PROT_NONE-mapped PTEs.
>>
>> Ugh.
>
> I was hoping for that reaction, with the assumption that we would get
> something cleaner :)
>
>> I hate how it uses FOLL_FORCE that is inherently scary.
>
> I hate FOLL_FORCE, but I hate FOLL_NUMA even more, because to me it
> is FOLL_FORCE in disguise (currently and before 474098edac26, if
> FOLL_FORCE is set, FOLL_NUMA won't be set and the other way around).
>
>> Why do we have that "gup_can_follow_protnone()" logic AT ALL?
>
> That's what I was hoping for.
>
>> Couldn't we just get rid of that disgusting thing, and just say that
>> GUP (and follow_page()) always just ignores NUMA hinting, and always
>> just follows protnone?
>>
>> We literally used to have this:
>>
>>         if (!(gup_flags & FOLL_FORCE))
>>                 gup_flags |= FOLL_NUMA;
>>
>> ie we *always* set FOLL_NUMA for any sane situation. FOLL_FORCE should
>> be the rare crazy case.
>
> Yes, but my point would be that we now spell that "rare crazy case"
> out for follow_page().
>
> If you're talking about patch #1, I agree; hence patch #3, to avoid
> all that nasty FOLL_FORCE handling in GUP callers.
>
> But yeah, if we can avoid all that, great.
>
>> The original reason for not setting FOLL_NUMA all the time is
>> documented in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
>> page faults from gup/gup_fast") from way back in 2012:
>>
>>  * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
>>  * would be called on PROT_NONE ranges. We must never invoke
>>  * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
>>  * page faults would unprotect the PROT_NONE ranges if
>>  * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
>>  * bitflag. So to avoid that, don't set FOLL_NUMA if
>>  * FOLL_FORCE is set.
>
> In handle_mm_fault(), we never call do_numa_page() if
> !vma_is_accessible(). Same for do_huge_pmd_numa_page().
>
> So, if we would ever end up triggering a page fault on
> mprotect(PROT_NONE) ranges (i.e., via FOLL_FORCE), we would simply
> do nothing.
>
> At least that's the hope; I'll take a closer look just to make sure
> we're good on all call paths.
>
>> but I don't think the original reason for this is *true* any more.
>>
>> Because then two years later in 2014, in commit c46a7c817e66 ("x86:
>> define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
>> Mel made the code able to distinguish between PROT_NONE and NUMA
>> pages, and he changed the comment above too.
>
> CCing Mel.
>
> I remember that pte_protnone() can only distinguish between NUMA vs.
> actual mprotect(PROT_NONE) by looking at the VMA --
> vma_is_accessible().
>
> Indeed, include/linux/pgtable.h:
>
> /*
>  * Technically a PTE can be PROTNONE even when not doing NUMA balancing but
>  * the only case the kernel cares is for NUMA balancing and is only ever set
>  * when the VMA is accessible. For PROT_NONE VMAs, the PTEs are not marked
>  * _PAGE_PROTNONE so by default, implement the helper as "always no". It
>  * is the responsibility of the caller to distinguish between PROT_NONE
>  * protections and NUMA hinting fault protections.
>  */
>
>> But I get the very very strong feeling that instead of changing the
>> comment, he should have actually removed the comment and the code.
>>
>> So I get the strong feeling that all these FOLL_NUMA games should just
>> go away. You removed the FOLL_NUMA bit, but you replaced it with using
>> FOLL_FORCE.
>>
>> So rather than make this all even more opaque and make it even harder
>> to figure out why we have that code in the first place, I think it
>> should all just be removed.
>
> At least to me, spelling out FOLL_FORCE in follow_page() is now much
> less opaque than having it implied by the lack of FOLL_NUMA.
>
>> The original reason for FOLL_NUMA simply does not exist any more. We
>> know exactly when a page is marked for NUMA faulting, and we should
>> simply *ignore* it for GUP and follow_page().
>>
>> I think we should treat a NUMA-faulting page as just being present
>> (and not NUMA-fault it).
>>
>> Am I missing something?
>
> There was the case for "FOLL_PIN represents application behavior and
> should trigger NUMA faults", but I guess that can be ignored.

Re-reading commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page
faults from gup/gup_fast"), it actually does spell out an important case
that we should handle:

"KVM secondary MMU page faults will trigger the NUMA hinting page faults
through gup_fast -> get_user_pages -> follow_page -> handle_mm_fault."

That is still the case today (and important). Not triggering NUMA
hinting faults would degrade KVM.

Hmm. So three alternatives I see:

1) Use FOLL_FORCE in follow_page() to unconditionally disable protnone
   checks. Alternatively, have an internal FOLL_NO_PROTNONE flag if we
   don't like that.

2) Revert the commit and reintroduce unconditional FOLL_NUMA without
   FOLL_FORCE.

3) Have a FOLL_NUMA that callers like KVM can pass.

Thoughts?
On Fri, Jul 28, 2023 at 09:40:54PM +0200, David Hildenbrand wrote:
> Hmm. So three alternatives I see:
>
> 1) Use FOLL_FORCE in follow_page() to unconditionally disable protnone
>    checks. Alternatively, have an internal FOLL_NO_PROTNONE flag if we
>    don't like that.
>
> 2) Revert the commit and reintroduce unconditional FOLL_NUMA without
>    FOLL_FORCE.
>
> 3) Have a FOLL_NUMA that callers like KVM can pass.

I'm afraid 3) means changing NUMA balancing to opt-in, probably a no-go
for any non-KVM GUP users, as that could start to break there, even if
it makes smaps/follow_page happy again.

I keep worrying about 1) and FOLL_FORCE abuse.

So I keep my vote on 2).

Thanks,
On 28.07.23 21:39, Peter Xu wrote:
> Hi, Linus,
>
> On Fri, Jul 28, 2023 at 09:18:45AM -0700, Linus Torvalds wrote:
>> The original reason for FOLL_NUMA simply does not exist any more. We
>> know exactly when a page is marked for NUMA faulting, and we should
>> simply *ignore* it for GUP and follow_page().
>>
>> I think we should treat a NUMA-faulting page as just being present
>> (and not NUMA-fault it).
>
> But then does it mean that any GUP-only user will have NUMA balancing
> completely disabled? As long as the page will only be accessed by GUP,
> NUMA balancing will never trigger anyway.. I think KVM is manipulating
> guest pages just like that. Not sure whether it means it'll void the
> whole NUMA effort there.
>
> If we allow that GUP to happen (taking protnone as present), I assume
> it'll also stop any further NUMA balancing on this very page, because
> even if some page fault handler triggered on this protnone page later
> (one that is not GUP anymore), when it wants to migrate the page to
> the other NUMA node it'll see someone is already holding a reference
> on it, and then we should give up the balancing.
>
> So to me FOLL_NUMA (or any identifier like it, just to describe the
> caller's need; some may really just want to fetch the pfn/page) still
> makes sense. But maybe I totally misunderstood the above..

Yes, I agree; it took me a bit longer to realize (being a KVM developer
:) ... I'm really ready for the weekend).

So if this series is not acceptable, then better revert that commit --
or let callers like KVM specify FOLL_NUMA.
On 28.07.23 21:50, Peter Xu wrote:
> On Fri, Jul 28, 2023 at 09:40:54PM +0200, David Hildenbrand wrote:
>> Hmm. So three alternatives I see:
>>
>> 1) Use FOLL_FORCE in follow_page() to unconditionally disable protnone
>>    checks. Alternatively, have an internal FOLL_NO_PROTNONE flag if we
>>    don't like that.
>>
>> 2) Revert the commit and reintroduce unconditional FOLL_NUMA without
>>    FOLL_FORCE.
>>
>> 3) Have a FOLL_NUMA that callers like KVM can pass.
>
> I'm afraid 3) means changing numa balancing to opt-in, probably no-go
> for any non-kvm gup users as that could start to break there, even if
> making smaps/follow_page happy again.
>
> I keep worrying 1) on FOLL_FORCE abuse.
>
> So I keep my vote on 2).

Linus had a point that we can nowadays always set FOLL_NUMA, even with
FOLL_FORCE.

... so maybe we want to let GUP always set FOLL_NUMA (even with
FOLL_FORCE) and follow_page() never set FOLL_NUMA. That would at least
decouple FOLL_NUMA from FOLL_FORCE.
On Fri, 28 Jul 2023 at 12:39, Peter Xu <peterx@redhat.com> wrote:
>
> But then does it mean that any GUP-only user will have NUMA balancing
> completely disabled?

Why would we ever care about a GUP-only user?

Who knows where the actual access is coming from? It might be some
device that is on a different node entirely.

And even if the access is local from the CPU, it

 (a) might have happened after we moved somewhere else

 (b) who cares about the extra possible NUMA overhead when we just
     wasted *thousands* of cycles on GUP?

So NUMA balancing really doesn't seem to make sense for GUP anyway as
far as I can see.

Now, the other side of the same thing is that (a) NUMA faulting should
be fairly rare and (b) once you do GUP, who cares anyway, so you can
also argue that "once you do GUP you might as well NUMA-fault, because
performance simply isn't an issue".

But I really think the real argument is "once you do GUP, NUMA faulting
is just crazy".

I think what happened is

 - the GUP code couldn't tell NUMA and actual PROTNONE apart

 - so the GUP code would punch through PROTNONE even when it shouldn't

 - so people added FOLL_NUMA to say "I don't want you to punch through,
   I want the NUMA fault"

 - but then FOLL_FORCE ends up meaning that you actually *do* want to
   punch through - regardless of NUMA or not

 - and now the two got tied together, and we end up with nonsensical
   garbage like

        if (!(gup_flags & FOLL_FORCE))
                gup_flags |= FOLL_NUMA;

   to say "oh, actually, to avoid punching through when we shouldn't,
   we should NUMA fault".

So we ended up with that case where even if YOU DIDN'T CARE AT ALL, you
got FOLL_NUMA just so that you wouldn't punch through.

And now we're in the situation that we've confused FOLL_FORCE and
FOLL_NUMA, even though they have absolutely *nothing* to do with each
other, except for a random implementation detail about punching through
incorrectly that isn't even relevant any more.

I really think FOLL_NUMA should just go away. And that FOLL_FORCE
replacement for it is just wrong. If you *don't* do something without
FOLL_FORCE, you damn well shouldn't do it just because FOLL_FORCE is
set.

The *only* semantic meaning FOLL_FORCE should have is that it overrides
the vma protections for debuggers (in a very limited manner). It should
*not* affect any NUMA faulting logic in any way, shape, or form.

              Linus
On 28.07.23 22:23, Linus Torvalds wrote:
> On Fri, 28 Jul 2023 at 12:39, Peter Xu <peterx@redhat.com> wrote:
>>
>> But then does it mean that any GUP-only user will have NUMA balancing
>> completely disabled?
>
> Why would we ever care about a GUP-only user?
>
> Who knows where the actual access is coming from? It might be some
> device that is on a different node entirely.
>
> And even if the access is local from the CPU, it
>
>  (a) might have happened after we moved somewhere else
>
>  (b) who cares about the extra possible NUMA overhead when we just
>      wasted *thousands* of cycles on GUP?
>
> So NUMA balancing really doesn't seem to make sense for GUP anyway as
> far as I can see.

I do agree regarding many GUP users. But at least for KVM (and probably
some others, but most probably KVM is the most important GUP user) it
does make sense, and we have to find a way to keep that working.

At least, removing it creates significantly more harm than having it,
guaranteed :)

So would you rather favor a FOLL_NUMA that has to be passed from the
outside by selected callers, or a FOLL_NUMA that is set on the GUP path
unconditionally (but left clear for follow_page())?
On Fri, 28 Jul 2023 at 13:33, David Hildenbrand <david@redhat.com> wrote:
>
> So would you rather favor a FOLL_NUMA that has to be passed from the
> outside by selected callers or a FOLL_NUMA that is set on the GUP path
> unconditionally (but left clear for follow_page())?

I'd rather see the FOLL_NUMA that has to be set by odd cases, and that
is never set by any sane user.

And it should not be called FOLL_NUMA. It should be called something
else. Because *not* having it doesn't disable following pages across
NUMA boundaries, and the name is actively misleading.

It sounds like what KVM actually wants is a "Do NOT follow NUMA pages,
I'll force a page fault".

And the fact that KVM wants a fault for NUMA pages shouldn't mean that
others - who clearly cannot care - get that insane behavior by default.

The name should reflect that, instead of being the misleading mess of
FOLL_FORCE and bad naming that it is now.

So maybe it can be called "FOLL_HONOR_NUMA_FAULT" or something, to make
it clear that it's the *opposite* of FOLL_FORCE, and that it honors the
NUMA faulting that nobody should care about.

Then the KVM code can have a big comment about *why* it sets that bit.

Hmm? Can we please aim for something that is understandable and
documented? No odd implicit rules. No "force NUMA fault even when it
makes no sense". No tie-in with FOLL_FORCE.

              Linus
On 28.07.23 22:50, Linus Torvalds wrote:
> On Fri, 28 Jul 2023 at 13:33, David Hildenbrand <david@redhat.com> wrote:
>>
>> So would you rather favor a FOLL_NUMA that has to be passed from the
>> outside by selected callers or a FOLL_NUMA that is set on the GUP path
>> unconditionally (but left clear for follow_page())?
>
> I'd rather see the FOLL_NUMA that has to be set by odd cases, and that
> is never set by any sane user.

Thanks!

> And it should not be called FOLL_NUMA. It should be called something
> else. Because *not* having it doesn't disable following pages across
> NUMA boundaries, and the name is actively misleading.
>
> It sounds like what KVM actually wants is a "Do NOT follow NUMA pages,
> I'll force a page fault".
>
> And the fact that KVM wants a fault for NUMA pages shouldn't mean that
> others - who clearly cannot care - get that insane behavior by
> default.

For KVM it represents an actual CPU access. To map these pages into the
VM MMU we have to look them up from the process -- in the context of the
faulting CPU. So it makes a lot of sense for KVM (which is also where
autonuma gets heavily used).

> The name should reflect that, instead of being the misleading mess of
> FOLL_FORCE and bad naming that it is now.
>
> So maybe it can be called "FOLL_HONOR_NUMA_FAULT" or something, to
> make it clear that it's the *opposite* of FOLL_FORCE, and that it
> honors the NUMA faulting that nobody should care about.

That naming sounds much better to me.

> Then the KVM code can have a big comment about *why* it sets that bit.

Yes.

> Hmm? Can we please aim for something that is understandable and
> documented? No odd implicit rules. No "force NUMA fault even when it
> makes no sense". No tie-in with FOLL_FORCE.

I mean, I messed all that FOLL_NUMA handling up because I was very
confused. So I'm all for better documentation.

Can we get a simple revert in first (without that FOLL_FORCE special
casing and ideally with a better name) to handle stable backports, and
I'll follow up with more documentation and letting GUP callers pass in
that flag instead?

That would help a lot. Then we also have more time to let that "move it
to GUP callers" approach mature a bit in -next, to see if we find any
surprises.
On Fri, Jul 28, 2023 at 11:02:46PM +0200, David Hildenbrand wrote:
> Can we get a simple revert in first (without that FOLL_FORCE special
> casing and ideally with a better name) to handle stable backports, and
> I'll follow-up with more documentation and letting GUP callers pass in
> that flag instead?
>
> That would help a lot. Then we also have more time to let that "move
> it to GUP callers" mature a bit in -next, to see if we find any
> surprises?

As I raised my concern in the other thread, I still worry that NUMA
users can be affected by this change. After all, NUMA isn't so uncommon
to me; at least Fedora / RHEL have
CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y, and I highly suspect that's
also true of other major distros. Meanwhile all kernel modules use GUP..

I'd say we can go ahead and try if we want, but I really don't know how
it helps in any form to move it to the callers.. with the risk of
breaking someone.

Logically it should also always be better to migrate earlier rather than
later, not only because the page will be local earlier, but also
because, as I discussed in the other thread, GUP can hold a reference to
the page, and that could potentially stop NUMA balancing from succeeding
later.

Thanks,
On 28.07.23 23:20, Peter Xu wrote:
> On Fri, Jul 28, 2023 at 11:02:46PM +0200, David Hildenbrand wrote:
>> Can we get a simple revert in first (without that FOLL_FORCE special
>> casing and ideally with a better name) to handle stable backports,
>> and I'll follow-up with more documentation and letting GUP callers
>> pass in that flag instead?
>>
>> That would help a lot. Then we also have more time to let that "move
>> it to GUP callers" mature a bit in -next, to see if we find any
>> surprises?
>
> As I raised my concern over the other thread, I still worry numa users
> can be affected by this change. After all, numa isn't so uncommon to
> me, at least fedora / rhel has CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y.
> I highly suspect that's also true of major distros. Meanwhile all
> kernel modules use gup..
>
> I'd say we can go ahead and try if we want, but I really don't know
> why that helps in any form to move it to the callers.. with the risk
> of breaking someone.

Indeed, that's why I suggest being a bit careful, especially with
stable.

> Logically it should also be always better to migrate earlier than
> later, not only because the page will be local earlier, but also as I
> discussed in the other thread (that the gup can hold a ref to the
> page, and it could potentially stop numa balancing to succeed later).

I get your point, but I also see the following cases (QEMU/KVM as an
example):

* User space triggers O_DIRECT. It will be short-lived. But is it really
  an access from that CPU (NUMA node) to that page? At least for KVM,
  you'd much rather let KVM trigger the NUMA fault on actual memory
  access from a guest VCPU, not from a QEMU iothread when pinning the
  page.

* vfio triggers FOLL_PIN|FOLL_LONGTERM from a random QEMU thread.
  Where should we migrate that page to? Would it actually be counter-
  productive to migrate it to the NUMA node of the setup thread? The
  longterm pin will turn the page unmovable, yes, but where to migrate
  it to?
On 7/28/23 14:20, Peter Xu wrote:
> On Fri, Jul 28, 2023 at 11:02:46PM +0200, David Hildenbrand wrote:
>> Can we get a simple revert in first (without that FOLL_FORCE special
>> casing and ideally with a better name) to handle stable backports,
>> and I'll follow-up with more documentation and letting GUP callers
>> pass in that flag instead?
>>
>> That would help a lot. Then we also have more time to let that "move
>> it to GUP callers" mature a bit in -next, to see if we find any
>> surprises?
>
> As I raised my concern over the other thread, I still worry numa users
> can be affected by this change. After all, numa isn't so uncommon to
> me, at least fedora / rhel as CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y.
> I highly suspect that's also true to major distros. Meanwhile all
> kernel modules use gup..
>
> I'd say we can go ahead and try if we want, but I really don't know
> why that helps in any form to move it to the callers.. with the risk
> of breaking someone.

It's worth the trouble, in order to clear up this historical mess. It's
helping *future* callers of the API, and future maintenance efforts.
Yes, there is some risk, but it seems very manageable.

The story of how FOLL_NUMA and FOLL_FORCE became entangled was
enlightening, by the way, and now that I've read it I don't want to go
back. :)

thanks,
Hello, John,

On Fri, Jul 28, 2023 at 02:32:12PM -0700, John Hubbard wrote:
> On 7/28/23 14:20, Peter Xu wrote:
>> I'd say we can go ahead and try if we want, but I really don't know
>> why that helps in any form to move it to the callers.. with the risk
>> of breaking someone.
>
> It's worth the trouble, in order to clear up this historical mess.
> It's helping *future* callers of the API, and future maintenance
> efforts. Yes there is some risk, but it seems very manageable.
>
> The story of how FOLL_NUMA and FOLL_FORCE became entangled was
> enlightening, by the way, and now that I've read it I don't want to go
> back. :)

Yeah, I fully agree we should remove the NUMA / FORCE entangling..
even if we want to revert back to the FOLL_NUMA flag, we may want to not
revive that specific part. I had a feeling that we're all on the same
page there.

It's more about the further step of making FOLL_NUMA opt-in for GUP.

Thanks,
On 7/28/23 14:49, Peter Xu wrote: >> The story of how FOLL_NUMA and FOLL_FORCE became entangled was enlightening, >> by the way, and now that I've read it I don't want to go back. :) > > Yeah I fully agree we should hopefully remove the NUMA / FORCE > tangling.. even if we want to revert back to the FOLL_NUMA flag we may want > to not revive that specific part. I had a feeling that we're all on the > same page there. > Yes, I think so. :) > It's more about the further step to make FOLL_NUMA opt-in for GUP. Let's say "FOLL_HONOR_NUMA_FAULT" for this next discussion, but yes. So given that our API allows passing in FOLL_ flags, I don't understand the objection to letting different callers pass in, or not pass in, that flag. It's the perfect way to clean up the whole thing. As Linus suggested slightly earlier here, there can be a comment at the call site, explaining why KVM needs FOLL_HONOR_NUMA_FAULT, and you're good, right? thanks,
On Fri, Jul 28, 2023 at 11:31:49PM +0200, David Hildenbrand wrote:
> * vfio triggers FOLL_PIN|FOLL_LONGTERM from a random QEMU thread.
>   Where should we migrate that page to? Would it actually be counter-
>   productive to migrate it to the NUMA node of the setup thread? The
>   longterm pin will turn the page unmovable, yes, but where to migrate
>   it to?

For VFIO & KVM you actively don't get any kind of numa balancing or
awareness. In this case qemu should probably strive to put the memory
on the numa node of the majority of CPUs early on, because it doesn't
get another shot at it.

In other cases it depends quite a lot. E.g., DPDK might want its VFIO
buffers to be NUMA'd to the node that is close to the device, not the
CPU. Or vice versa. There is a lot of micro sensitivity here at high
data rates. I think people today manually tune this by deliberately
allocating the memory to specific numas, and then GUP should just leave
it alone.

FWIW, I'm reading this thread and I have no idea what the special
semantic is that KVM needs from GUP, so I'm all for better
documentation on the GUP flag :)

Jason
On 29.07.23 00:35, David Hildenbrand wrote:
>> The original reason for not setting FOLL_NUMA all the time is
>> documented in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
>> page faults from gup/gup_fast") from way back in 2012:
>>
>>   * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
>>   * would be called on PROT_NONE ranges. We must never invoke
>>   * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
>>   * page faults would unprotect the PROT_NONE ranges if
>>   * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
>>   * bitflag. So to avoid that, don't set FOLL_NUMA if
>>   * FOLL_FORCE is set.
>>
>> but I don't think the original reason for this is *true* any more.
>>
>> Because then two years later in 2014, in commit c46a7c817e66 ("x86:
>> define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
>> Mel made the code able to distinguish between PROT_NONE and NUMA
>> pages, and he changed the comment above too.

Sleeping over it and looking into some nasty details, I realized the
following things:

(1) The pte_protnone() comment in include/linux/pgtable.h is either
wrong or misleading.

Especially the "For PROT_NONE VMAs, the PTEs are not marked
_PAGE_PROTNONE" is *wrong* nowadays on x86.

Doing an mprotect(PROT_NONE) will also result in pte_protnone()
succeeding, because the pages *are* marked _PAGE_PROTNONE.

The comment should be something like this

/*
 * In an inaccessible (PROT_NONE) VMA, pte_protnone() *may* indicate
 * "yes". It is perfectly valid to indicate "no" in that case,
 * which is why our default implementation defaults to "always no".
 *
 * In an accessible VMA, however, pte_protnone() *reliably*
 * indicates PROT_NONE page protection due to NUMA hinting. NUMA
 * hinting faults only apply in accessible VMAs.
 *
 * So, to reliably distinguish between PROT_NONE due to an
 * inaccessible VMA and NUMA hinting, looking at the VMA
 * accessibility is sufficient.
 */

I'll send that as a separate patch.
(2) Consequently, commit c46a7c817e66 from 2014 does not tell the whole
story.

commit 21d9ee3eda77 ("mm: remove remaining references to NUMA hinting
bits and helpers") from 2015 made the distinction again impossible.

Setting FOLL_FORCE | FOLL_HONOR_NUMA_HINT would end up never making
progress in GUP with an inaccessible (PROT_NONE) VMA.

(a) GUP sees the pte_protnone() and triggers a NUMA hinting fault,
although NUMA hinting does not apply.

(b) handle_mm_fault() refuses to do anything with pte_protnone() in
an inaccessible VMA. And even if it would do something, the new
PTE would end up as pte_protnone() again.

So, GUP will keep retrying. I have a reproducer that triggers that
using ptrace read in an inaccessible VMA.

It's easy to make that work in GUP, simply by looking at the VMA
accessibility.

See my patch proposal, which cleanly decouples FOLL_FORCE from
FOLL_HONOR_NUMA_HINT.

(3) follow_page() does not check VMA permissions and, therefore, my
"implied FOLL_FORCE" assumption is not actually completely wrong.

And especially callers that don't pass FOLL_WRITE really expect
follow_page() to work even if the VMA is inaccessible.

But the interaction with NUMA hinting is just nasty, absolutely
agreed.

As raised in another comment, I'm looking into removing the
"foll_flags" parameter from follow_page() completely and cleanly
documenting the semantics of follow_page().

IMHO, the less follow_page(), the better. Let's see what we can do
to improve that.


So this would be the patch I would suggest as the first fix we can also
backport to stable.

Gave it a quick test, also with my ptrace read reproducer (trigger
FOLL_FORCE on inaccessible VMA; make sure it works and that the pages
don't suddenly end up readable in the page table). Seems to work.

I'll follow up with cleanups and moving FOLL_HONOR_NUMA_HINT setting to
the relevant callers (especially KVM). Further, I'll add a selftest to
make sure that ptrace on inaccessible VMAs keeps working as expected.
From 36c1aeb9aa3e859762f671776601a71179247d17 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Fri, 28 Jul 2023 21:57:04 +0200
Subject: [PATCH] mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT

As it turns out, unfortunately commit 474098edac26 ("mm/gup: replace
FOLL_NUMA by gup_can_follow_protnone()") missed that follow_page() and
follow_trans_huge_pmd() never set FOLL_NUMA because they really don't
want NUMA hinting faults.

As spelled out in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
page faults from gup/gup_fast"): "Other follow_page callers like KSM
should not use FOLL_NUMA, or they would fail to get the pages if they
use follow_page instead of get_user_pages."

While we didn't get BUG reports on the changed follow_page() semantics
yet (and it's just a matter of time), liubo reported [1] that
smaps_rollup results are imprecise, because they miss accounting of
pages that are mapped PROT_NONE due to NUMA hinting.

As KVM really depends on these NUMA hinting faults, removing the
pte_protnone()/pmd_protnone() handling in GUP code completely is not
really an option.

To fix the issues at hand, let's revive FOLL_NUMA as
FOLL_HONOR_NUMA_FAULT to restore the original behavior and add better
comments.

Set FOLL_HONOR_NUMA_FAULT independent of FOLL_FORCE. To make that
combination work in inaccessible VMAs, we have to perform proper VMA
accessibility checks in gup_can_follow_protnone().

Move gup_can_follow_protnone() to internal.h which feels more
appropriate and is required as long as FOLL_HONOR_NUMA_FAULT is an
internal flag.

As Linus notes [2], this handling doesn't make sense for many GUP users.
So we should really see if we instead let relevant GUP callers specify
it manually, and not trigger NUMA hinting faults from GUP as default.

[1] https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com
[2] https://lore.kernel.org/r/CAHk-=wgRiP_9X0rRdZKT8nhemZGNateMtb366t37d8-x7VRs=g@mail.gmail.com

Reported-by: liubo <liubo254@huawei.com>
Reported-by: Peter Xu <peterx@redhat.com>
Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()")
Cc: <stable@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: liubo <liubo254@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h | 15 ---------------
 mm/gup.c           | 18 ++++++++++++++----
 mm/huge_memory.c   |  2 +-
 mm/internal.h      | 31 +++++++++++++++++++++++++++++++
 4 files changed, 46 insertions(+), 20 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2dd73e4f3d8e..f8d7fa3c01c1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3400,21 +3400,6 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
 	return 0;
 }
 
-/*
- * Indicates whether GUP can follow a PROT_NONE mapped page, or whether
- * a (NUMA hinting) fault is required.
- */
-static inline bool gup_can_follow_protnone(unsigned int flags)
-{
-	/*
-	 * FOLL_FORCE has to be able to make progress even if the VMA is
-	 * inaccessible. Further, FOLL_FORCE access usually does not represent
-	 * application behaviour and we should avoid triggering NUMA hinting
-	 * faults.
-	 */
-	return flags & FOLL_FORCE;
-}
-
 typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);
 extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
 			       unsigned long size, pte_fn_t fn, void *data);
diff --git a/mm/gup.c b/mm/gup.c
index 76d222ccc3ff..54b8d77f3a3d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -597,7 +597,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	pte = ptep_get(ptep);
 	if (!pte_present(pte))
 		goto no_page;
-	if (pte_protnone(pte) && !gup_can_follow_protnone(flags))
+	if (pte_protnone(pte) && !gup_can_follow_protnone(vma, flags))
 		goto no_page;
 
 	page = vm_normal_page(vma, address, pte);
@@ -714,7 +714,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 	if (likely(!pmd_trans_huge(pmdval)))
 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
 
-	if (pmd_protnone(pmdval) && !gup_can_follow_protnone(flags))
+	if (pmd_protnone(pmdval) && !gup_can_follow_protnone(vma, flags))
 		return no_page_table(vma, flags);
 
 	ptl = pmd_lock(mm, pmd);
@@ -851,6 +851,10 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	if (WARN_ON_ONCE(foll_flags & FOLL_PIN))
 		return NULL;
 
+	/*
+	 * We never set FOLL_HONOR_NUMA_FAULT because callers don't expect
+	 * to fail on PROT_NONE-mapped pages.
+	 */
 	page = follow_page_mask(vma, address, foll_flags, &ctx);
 	if (ctx.pgmap)
 		put_dev_pagemap(ctx.pgmap);
@@ -1200,6 +1204,12 @@ static long __get_user_pages(struct mm_struct *mm,
 
 	VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN)));
 
+	/*
+	 * For now, always trigger NUMA hinting faults. Some GUP users like
+	 * KVM really require it to benefit from autonuma.
+	 */
+	gup_flags |= FOLL_HONOR_NUMA_FAULT;
+
 	do {
 		struct page *page;
 		unsigned int foll_flags = gup_flags;
@@ -2551,7 +2561,7 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 		struct page *page;
 		struct folio *folio;
 
-		if (pte_protnone(pte) && !gup_can_follow_protnone(flags))
+		if (pte_protnone(pte) && !gup_can_follow_protnone(NULL, flags))
 			goto pte_unmap;
 
 		if (!pte_access_permitted(pte, flags & FOLL_WRITE))
@@ -2971,7 +2981,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
 		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd) ||
 			     pmd_devmap(pmd))) {
 			if (pmd_protnone(pmd) &&
-			    !gup_can_follow_protnone(flags))
+			    !gup_can_follow_protnone(NULL, flags))
 				return 0;
 
 			if (!gup_huge_pmd(pmd, pmdp, addr, next, flags,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index eb3678360b97..ef6bdc4a6fec 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1468,7 +1468,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 		return ERR_PTR(-EFAULT);
 
 	/* Full NUMA hinting faults to serialise migration in fault paths */
-	if (pmd_protnone(*pmd) && !gup_can_follow_protnone(flags))
+	if (pmd_protnone(*pmd) && !gup_can_follow_protnone(vma, flags))
 		return NULL;
 
 	if (!pmd_write(*pmd) && gup_must_unshare(vma, flags, page))
diff --git a/mm/internal.h b/mm/internal.h
index a7d9e980429a..7db17259c51a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -937,6 +937,8 @@ enum {
 	FOLL_FAST_ONLY = 1 << 20,
 	/* allow unlocking the mmap lock */
 	FOLL_UNLOCKABLE = 1 << 21,
+	/* Honor (trigger) NUMA hinting faults on PROT_NONE-mapped pages. */
+	FOLL_HONOR_NUMA_FAULT = 1 << 22,
 };
 
 /*
@@ -1004,6 +1006,35 @@ static inline bool gup_must_unshare(struct vm_area_struct *vma,
 	return !PageAnonExclusive(page);
 }
 
+/*
+ * Indicates whether GUP can follow a PROT_NONE mapped page, or whether
+ * a (NUMA hinting) fault is required.
+ */
+static inline bool gup_can_follow_protnone(struct vm_area_struct *vma,
+					   unsigned int flags)
+{
+	/*
+	 * If callers don't want to honor NUMA hinting faults, no need to
+	 * determine if we would actually have to trigger a NUMA hinting fault.
+	 */
+	if (!(flags & FOLL_HONOR_NUMA_FAULT))
+		return true;
+
+	/* We really need the VMA ... */
+	if (!vma)
+		return false;
+
+	/*
+	 * ... because NUMA hinting faults only apply in accessible VMAs. In
+	 * inaccessible (PROT_NONE) VMAs, NUMA hinting faults don't apply.
+	 *
+	 * Requiring a fault here even for inaccessible VMAs would mean that
+	 * FOLL_FORCE cannot make any progress, because handle_mm_fault()
+	 * refuses to process NUMA hinting faults in inaccessible VMAs.
+	 */
+	return !vma_is_accessible(vma);
+}
+
 extern bool mirrored_kernelcore;
 
 static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma)
On Fri, Jul 28, 2023 at 07:14:26PM -0300, Jason Gunthorpe wrote:
> On Fri, Jul 28, 2023 at 11:31:49PM +0200, David Hildenbrand wrote:
> > * vfio triggers FOLL_PIN|FOLL_LONGTERM from a random QEMU thread.
> > Where should we migrate that page to? Would it actually be counter-
> > productive to migrate it to the NUMA node of the setup thread? The
> > longterm pin will turn the page unmovable, yes, but where to migrate
> > it to?
>
> For VFIO & KVM you actively don't get any kind of numa balancing or
> awareness. In this case qemu should probably strive to put the memory
> on the numa node of the majority of CPUs early on, because it doesn't
> get another shot at it.
>
> In other cases it depends quite a lot. E.g., DPDK might want its VFIO
> buffers to be NUMA'd to the node that is close to the device, not the
> CPU. Or vice versa. There is a lot of micro sensitivity here at high
> data rates. I think people today manually tune this by deliberately
> allocating the memory to specific numas, and then GUP should just leave
> it alone.

Right. For the other O_DIRECT example - it seems to be a more generic
question of whether we should rely on the follow-up accessor to decide
the target node of numa balancing.

At least for KVM's use case, I'd still expect the major path that
triggers it to be a guest access of a page from the vcpu threads - and
that's still the GUP path.

Thanks,
On Fri, Jul 28, 2023 at 03:00:04PM -0700, John Hubbard wrote:
> On 7/28/23 14:49, Peter Xu wrote:
> > > The story of how FOLL_NUMA and FOLL_FORCE became entangled was enlightening,
> > > by the way, and now that I've read it I don't want to go back. :)
> >
> > Yeah I fully agree we should hopefully remove the NUMA / FORCE
> > tangling.. even if we want to revert back to the FOLL_NUMA flag we may want
> > to not revive that specific part. I had a feeling that we're all on the
> > same page there.
>
> Yes, I think so. :)
>
> > It's more about the further step to make FOLL_NUMA opt-in for GUP.
>
> Let's say "FOLL_HONOR_NUMA_FAULT" for this next discussion, but yes. So
> given that our API allows passing in FOLL_ flags, I don't understand the
> objection to letting different callers pass in, or not pass in, that
> flag.
>
> It's the perfect way to clean up the whole thing. As Linus suggested
> slightly earlier here, there can be a comment at the call site,
> explaining why KVM needs FOLL_HONOR_NUMA_FAULT, and you're good, right?

I'm fine with experimenting here: as long as (at least) kvm is fully
covered, I'm not against it - not yet strongly, anyway. But again, IMHO
this is not merely a "cleanup" of a flag - if it falls under "cleanup"
at all, which I'm not 100% sure of yet - we're talking about changing
the GUP ABI.

What I wanted to point out is that we need to be careful: we're
literally changing the GUP ABI, and its NUMA implications, for all
kernel modules.

Thanks,
On Sat, Jul 29, 2023 at 11:35:22AM +0200, David Hildenbrand wrote: > On 29.07.23 00:35, David Hildenbrand wrote: > > > The original reason for not setting FOLL_NUMA all the time is > > > documented in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting > > > page faults from gup/gup_fast") from way back in 2012: > > > > > > * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault > > > * would be called on PROT_NONE ranges. We must never invoke > > > * handle_mm_fault on PROT_NONE ranges or the NUMA hinting > > > * page faults would unprotect the PROT_NONE ranges if > > > * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd > > > * bitflag. So to avoid that, don't set FOLL_NUMA if > > > * FOLL_FORCE is set. > > > > > > but I don't think the original reason for this is *true* any more. > > > > > > Because then two years later in 2014, in commit c46a7c817e66 ("x86: > > > define _PAGE_NUMA by reusing software bits on the PMD and PTE levels") > > > Mel made the code able to distinguish between PROT_NONE and NUMA > > > pages, and he changed the comment above too. > > > > > > Sleeping over it and looking into some nasty details, I realized the following things: > > > (1) The pte_protnone() comment in include/linux/pgtable.h is > either wrong or misleading. > > Especially the "For PROT_NONE VMAs, the PTEs are not marked > _PAGE_PROTNONE" is *wrong* nowadays on x86. > > Doing an mprotect(PROT_NONE) will also result in pte_protnone() > succeeding, because the pages *are* marked _PAGE_PROTNONE. > > The comment should be something like this > > /* > * In an inaccessible (PROT_NONE) VMA, pte_protnone() *may* indicate > * "yes". It is perfectly valid to indicate "no" in that case, > * which is why our default implementation defaults to "always no". > * > * In an accessible VMA, however, pte_protnone() *reliably* > * indicates PROT_NONE page protection due to NUMA hinting. NUMA > * hinting faults only apply in accessible VMAs. 
> * > * So, to reliably distinguish between PROT_NONE due to an > * inaccessible VMA and NUMA hinting, looking at the VMA > * accessibility is sufficient. > */ > > I'll send that as a separate patch. > > > (2) Consequently, commit c46a7c817e66 from 2014 does not tell the whole > story. > > commit 21d9ee3eda77 ("mm: remove remaining references to NUMA > hinting bits and helpers") from 2015 made the distinction again > impossible. > > Setting FOLL_FORCE | FOLL_HONOR_NUMA_HINT would end up never making > progress in GUP with an inaccessible (PROT_NONE) VMA. If we also teach follow_page_mask() on vma_is_accessible(), we should still be good, am I right? Basically fast-gup will stop working on protnone, and it always fallbacks to slow-gup. Then it seems we're good decoupling FORCE with NUMA hint. I assume that that's what you did below in the patch too, which looks right to me. > > (a) GUP sees the pte_protnone() and triggers a NUMA hinting fault, > although NUMA hinting does not apply. > > (b) handle_mm_fault() refuses to do anything with pte_protnone() in > an inaccessible VMA. And even if it would do something, the new > PTE would end up as pte_protnone() again. > So, GUP will keep retrying. I have a reproducer that triggers that > using ptrace read in an inaccessible VMA. > > It's easy to make that work in GUP, simply by looking at the VMA > accessibility. > > See my patch proposal, that cleanly decouples FOLL_FORCE from > FOLL_HONOR_NUMA_HINT. > > > (3) follow_page() does not check VMA permissions and, therefore, my > "implied FOLL_FORCE" assumption is not actually completely wrong. > > And especially callers that dont't pass FOLL_WRITE really expect > follow_page() to work even if the VMA is inaccessible. > > But the interaction with NUMA hinting is just nasty, absolutely > agreed. > > As raised in another comment, I'm looking into removing the > "foll_flags" parameter from follow_page() completely and cleanly > documenting the semantics of follow_page(). 
> > IMHO, the less follow_page(), the better. Let's see what we can do > to improve that. > > > So this would be the patch I would suggest as the first fix we can also > backport to stable. > > Gave it a quick test, also with my ptrace read reproducer (trigger > FOLL_FORCE on inaccessible VMA; make sure it works and that the pages don't > suddenly end up readable in the page table). Seems to work. > > I'll follow up with cleanups and moving FOLL_HONOR_NUMA_HINT setting to the > relevant callers (especially KVM). Further, I'll add a selftest to make > sure that ptrace on inaccessible VMAs keeps working as expected. > > > > From 36c1aeb9aa3e859762f671776601a71179247d17 Mon Sep 17 00:00:00 2001 > From: David Hildenbrand <david@redhat.com> > Date: Fri, 28 Jul 2023 21:57:04 +0200 > Subject: [PATCH] mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT > > As it turns out, unfortunately commit 474098edac26 ("mm/gup: replace > FOLL_NUMA by gup_can_follow_protnone()") missed that follow_page() and > follow_trans_huge_pmd() never set FOLL_NUMA because they really don't want > NUMA hinting faults. > > As spelled out in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page > faults from gup/gup_fast"): "Other follow_page callers like KSM should not > use FOLL_NUMA, or they would fail to get the pages if they use follow_page > instead of get_user_pages." > > While we didn't get BUG reports on the changed follow_page() semantics yet > (and it's just a matter of time), liubo reported [1] that smaps_rollup > results are imprecise, because they miss accounting of pages that are > mapped PROT_NONE due to NUMA hinting. > > As KVM really depends on these NUMA hinting faults, removing the > pte_protnone()/pmd_protnone() handling in GUP code completely is not really > an option. > > To fix the issues at hand, let's revive FOLL_NUMA as FOLL_HONOR_NUMA_FAULT > to restore the original behavior and add better comments. > > Set FOLL_HONOR_NUMA_FAULT independent of FOLL_FORCE. 
To make that > combination work in inaccessible VMAs, we have to perform proper VMA > accessibility checks in gup_can_follow_protnone(). > > Move gup_can_follow_protnone() to internal.h which feels more > appropriate and is required as long as FOLL_HONOR_NUMA_FAULT is an > internal flag. > > As Linus notes [2], this handling doesn't make sense for many GUP users. > So we should really see if we instead let relevant GUP callers specify it > manually, and not trigger NUMA hinting faults from GUP as default. > > [1] https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com > [2] https://lore.kernel.org/r/CAHk-=wgRiP_9X0rRdZKT8nhemZGNateMtb366t37d8-x7VRs=g@mail.gmail.com > > Reported-by: liubo <liubo254@huawei.com> > Reported-by: Peter Xu <peterx@redhat.com> > Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") > Cc: <stable@vger.kernel.org> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Linus Torvalds <torvalds@linux-foundation.org> > Cc: Peter Xu <peterx@redhat.com> > Cc: liubo <liubo254@huawei.com> > Cc: Matthew Wilcox <willy@infradead.org> > Cc: Hugh Dickins <hughd@google.com> > Cc: Jason Gunthorpe <jgg@ziepe.ca> > Cc: John Hubbard <jhubbard@nvidia.com> > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > include/linux/mm.h | 15 --------------- > mm/gup.c | 18 ++++++++++++++---- > mm/huge_memory.c | 2 +- > mm/internal.h | 31 +++++++++++++++++++++++++++++++ > 4 files changed, 46 insertions(+), 20 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 2dd73e4f3d8e..f8d7fa3c01c1 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -3400,21 +3400,6 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags) > return 0; > } > -/* > - * Indicates whether GUP can follow a PROT_NONE mapped page, or whether > - * a (NUMA hinting) fault is required. 
> - */ > -static inline bool gup_can_follow_protnone(unsigned int flags) > -{ > - /* > - * FOLL_FORCE has to be able to make progress even if the VMA is > - * inaccessible. Further, FOLL_FORCE access usually does not represent > - * application behaviour and we should avoid triggering NUMA hinting > - * faults. > - */ > - return flags & FOLL_FORCE; > -} > - > typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data); > extern int apply_to_page_range(struct mm_struct *mm, unsigned long address, > unsigned long size, pte_fn_t fn, void *data); > diff --git a/mm/gup.c b/mm/gup.c > index 76d222ccc3ff..54b8d77f3a3d 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -597,7 +597,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, > pte = ptep_get(ptep); > if (!pte_present(pte)) > goto no_page; > - if (pte_protnone(pte) && !gup_can_follow_protnone(flags)) > + if (pte_protnone(pte) && !gup_can_follow_protnone(vma, flags)) > goto no_page; > page = vm_normal_page(vma, address, pte); > @@ -714,7 +714,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, > if (likely(!pmd_trans_huge(pmdval))) > return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap); > - if (pmd_protnone(pmdval) && !gup_can_follow_protnone(flags)) > + if (pmd_protnone(pmdval) && !gup_can_follow_protnone(vma, flags)) > return no_page_table(vma, flags); > ptl = pmd_lock(mm, pmd); > @@ -851,6 +851,10 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, > if (WARN_ON_ONCE(foll_flags & FOLL_PIN)) > return NULL; > + /* > + * We never set FOLL_HONOR_NUMA_FAULT because callers don't expect > + * to fail on PROT_NONE-mapped pages. > + */ > page = follow_page_mask(vma, address, foll_flags, &ctx); > if (ctx.pgmap) > put_dev_pagemap(ctx.pgmap); > @@ -1200,6 +1204,12 @@ static long __get_user_pages(struct mm_struct *mm, > VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN))); > + /* > + * For now, always trigger NUMA hinting faults. 
Some GUP users like > + * KVM really require it to benefit from autonuma. > + */ > + gup_flags |= FOLL_HONOR_NUMA_FAULT; > + > do { > struct page *page; > unsigned int foll_flags = gup_flags; > @@ -2551,7 +2561,7 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, > struct page *page; > struct folio *folio; > - if (pte_protnone(pte) && !gup_can_follow_protnone(flags)) > + if (pte_protnone(pte) && !gup_can_follow_protnone(NULL, flags)) > goto pte_unmap; > if (!pte_access_permitted(pte, flags & FOLL_WRITE)) > @@ -2971,7 +2981,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo > if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd) || > pmd_devmap(pmd))) { > if (pmd_protnone(pmd) && > - !gup_can_follow_protnone(flags)) > + !gup_can_follow_protnone(NULL, flags)) > return 0; > if (!gup_huge_pmd(pmd, pmdp, addr, next, flags, > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index eb3678360b97..ef6bdc4a6fec 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1468,7 +1468,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, > return ERR_PTR(-EFAULT); > /* Full NUMA hinting faults to serialise migration in fault paths */ > - if (pmd_protnone(*pmd) && !gup_can_follow_protnone(flags)) > + if (pmd_protnone(*pmd) && !gup_can_follow_protnone(vma, flags)) > return NULL; > if (!pmd_write(*pmd) && gup_must_unshare(vma, flags, page)) > diff --git a/mm/internal.h b/mm/internal.h > index a7d9e980429a..7db17259c51a 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -937,6 +937,8 @@ enum { > FOLL_FAST_ONLY = 1 << 20, > /* allow unlocking the mmap lock */ > FOLL_UNLOCKABLE = 1 << 21, > + /* Honor (trigger) NUMA hinting faults on PROT_NONE-mapped pages. 
*/ > + FOLL_HONOR_NUMA_FAULT = 1 << 22, > }; > /* > @@ -1004,6 +1006,35 @@ static inline bool gup_must_unshare(struct vm_area_struct *vma, > return !PageAnonExclusive(page); > } > +/* > + * Indicates whether GUP can follow a PROT_NONE mapped page, or whether > + * a (NUMA hinting) fault is required. > + */ > +static inline bool gup_can_follow_protnone(struct vm_area_struct *vma, > + unsigned int flags) > +{ > + /* > + * If callers don't want to honor NUMA hinting faults, no need to > + * determine if we would actually have to trigger a NUMA hinting fault. > + */ > + if (!(flags & FOLL_HONOR_NUMA_FAULT)) > + return true; > + > + /* We really need the VMA ... */ > + if (!vma) > + return false; I'm not sure whether the compiler will be smart enough to inline this for fast-gup on pmd/pte. One way to guarantee this is we simply always bail out for fast-gup on protnone (ignoring calling gup_can_follow_protnone() with a comment), as discussed above. Then WARN_ON_ONCE(!vma) for all the rest callers, assuming that's a required knowledge to know what the protnone means. Thanks, > + > + /* > + * ... because NUMA hinting faults only apply in accessible VMAs. In > + * inaccessible (PROT_NONE) VMAs, NUMA hinting faults don't apply. > + * > + * Requiring a fault here even for inaccessible VMAs would mean that > + * FOLL_FORCE cannot make any progress, because handle_mm_fault() > + * refuses to process NUMA hinting faults in inaccessible VMAs. > + */ > + return !vma_is_accessible(vma); > +} > + > extern bool mirrored_kernelcore; > static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma) > -- > 2.41.0 > > > > -- > Cheers, > > David / dhildenb >
>> (2) Consequently, commit c46a7c817e66 from 2014 does not tell the whole
>> story.
>>
>> commit 21d9ee3eda77 ("mm: remove remaining references to NUMA
>> hinting bits and helpers") from 2015 made the distinction again
>> impossible.
>>
>> Setting FOLL_FORCE | FOLL_HONOR_NUMA_HINT would end up never making
>> progress in GUP with an inaccessible (PROT_NONE) VMA.
>
> If we also teach follow_page_mask() on vma_is_accessible(), we should still
> be good, am I right?
>
> Basically fast-gup will stop working on protnone, and it always fallbacks
> to slow-gup. Then it seems we're good decoupling FORCE with NUMA hint.
>
> I assume that that's what you did below in the patch too, which looks right
> to me.

I modified it slightly: FOLL_HONOR_NUMA_FAULT is now set in
is_valid_gup_args(), such that it will always be set for any GUP users,
including GUP-fast.

[...]

>> +/*
>> + * Indicates whether GUP can follow a PROT_NONE mapped page, or whether
>> + * a (NUMA hinting) fault is required.
>> + */
>> +static inline bool gup_can_follow_protnone(struct vm_area_struct *vma,
>> +					   unsigned int flags)
>> +{
>> +	/*
>> +	 * If callers don't want to honor NUMA hinting faults, no need to
>> +	 * determine if we would actually have to trigger a NUMA hinting fault.
>> +	 */
>> +	if (!(flags & FOLL_HONOR_NUMA_FAULT))
>> +		return true;
>> +
>> +	/* We really need the VMA ... */
>> +	if (!vma)
>> +		return false;
>
> I'm not sure whether the compiler will be smart enough to inline this for
> fast-gup on pmd/pte.

Why shouldn't it? It's placed in a header file and the vma == NULL check
is not obfuscated. :)

Anyhow, I'll take a look at the compiler output.

Thanks!
On Mon, 31 Jul 2023 at 09:20, David Hildenbrand <david@redhat.com> wrote: > > I modified it slightly: FOLL_HONOR_NUMA_FAULT is now set in > is_valid_gup_args(), such that it will always be set for any GUP users, > including GUP-fast. But do we actually want that? It is actively crazy to honor NUMA faulting at least for get_user_pages_remote(). So right now, GUP-fast requires us to honor NUMA faults, because GUP-fast doesn't have a vma (which in turn is because GUP-fast doesn't take any locks). So GUP-fast can only look at the page table data, and as such *has* to fail if the page table is inaccessible. But GUP in general? Why would it want to honor numa faulting? Particularly by default, and _particularly_ for things like FOLL_REMOTE. In fact, I feel like this is what the real rule should be: we simply define that get_user_pages_fast() is about looking up the page in the page tables. So if you want something that acts like a page table lookup, you use that "fast" thing. It's literally how it is designed. The whole - and pretty much only - point of it is that it can be used with no locking at all, because it basically acts like the hardware lookup does. So then if KVM wants to look up a page in the page table, that is what kvm should use, and it automatically gets the "honor numa faults" behavior, not because it sets a magic flag, but simply because that is how GUP-fast *works*. But if you use the "normal" get/pin_user_pages() function, which looks up the vma, at that point you are following things at a "software level", and it wouldn't do NUMA faulting, it would just get the page. (Ok, we have the whole "FAST_ONLY vs fall back" case, so "fast" can look up the vma too, but read the above argument as "fast *can* be done without vma, so fast must honor page table bits as per hardware"). Linus
On Mon, Jul 31, 2023 at 11:23:59AM -0700, Linus Torvalds wrote: > So GUP-fast can only look at the page table data, and as such *has* to > fail if the page table is inaccessible. > > But GUP in general? Why would it want to honor numa faulting? > Particularly by default, and _particularly_ for things like > FOLL_REMOTE. True. > > In fact, I feel like this is what the real rule should be: we simply > define that get_user_pages_fast() is about looking up the page in the > page tables. > > So if you want something that acts like a page table lookup, you use > that "fast" thing. It's literally how it is designed. The whole - and > pretty much only - point of it is that it can be used with no locking > at all, because it basically acts like the hardware lookup does. Unfortunately I think at least kvm (besides the rest..) relies not only on numa balancing but also on fast-gup.. :-( Please refer to hva_to_pfn(), which even supports fast-gup-only HVA translation when atomic==true is set. Thanks,
On 31.07.23 20:23, Linus Torvalds wrote: > On Mon, 31 Jul 2023 at 09:20, David Hildenbrand <david@redhat.com> wrote: >> Hi Linus, >> I modified it slightly: FOLL_HONOR_NUMA_FAULT is now set in >> is_valid_gup_args(), such that it will always be set for any GUP users, >> including GUP-fast. > > But do we actually want that? It is actively crazy to honor NUMA > faulting at least for get_user_pages_remote(). This would only be for the stable backport that would go in first and where I want to be a bit careful. Next step would be to let the callers (KVM) specify FOLL_HONOR_NUMA_FAULT, as suggested by you. > > So right now, GUP-fast requires us to honor NUMA faults, because > GUP-fast doesn't have a vma (which in turn is because GUP-fast doesn't > take any locks). With FOLL_HONOR_NUMA_FAULT moved to the GUP caller that would no longer be the case. Anybody who (1) doesn't specify FOLL_HONOR_NUMA_FAULT (which is the majority) and (2) doesn't specify FOLL_WRITE would get GUP-fast just grabbing these pte_protnone() pages. > > So GUP-fast can only look at the page table data, and as such *has* to > fail if the page table is inaccessible. gup_fast_only, yes, which is what KVM uses if a writable PFN is desired. > > But GUP in general? Why would it want to honor numa faulting? > Particularly by default, and _particularly_ for things like > FOLL_REMOTE. KVM currently does [virt/kvm/kvm_main.c]: (1) hva_to_pfn_fast(): call get_user_page_fast_only(FOLL_WRITE) if a writable PFN is desired (2) hva_to_pfn_slow(): call get_user_pages_unlocked() So in the "!writable" case, we would always call get_user_pages_unlocked() and never honor NUMA faults. Converting that to some other pattern might be possible (although KVM plays quite some tricks here!), but assuming we would always first do a get_user_page_fast_only(), then when not intending to write (!FOLL_WRITE) (1) get_user_page_fast_only() would honor NUMA faults and fail (2) get_user_pages() would not honor NUMA faults and succeed Hmmm ... 
so we would have to use get_user_pages_fast()? It might be possible, but I am not sure if we want get_user_pages_fast() to always honor NUMA faults, because ... > > In fact, I feel like this is what the real rule should be: we simply > define that get_user_pages_fast() is about looking up the page in the > page tables. > > So if you want something that acts like a page table lookup, you use > that "fast" thing. It's literally how it is designed. The whole - and > pretty much only - point of it is that it can be used with no locking > at all, because it basically acts like the hardware lookup does. > ... I see what you mean (HW would similarly refuse to use such a page), but I do wonder if that makes the API clearer and if this is what we actually want. We do have callers of pin_user_pages_fast() and friends that maybe *really* shouldn't care about NUMA hinting. iov_iter_extract_user_pages() is one example -- used for O_DIRECT nowadays. Their logic is "if it's directly in the page table, great, hand it over. If not, please take the slow path.". In many cases user space just touched these pages, so they are very likely in the page table. Converting them to pin_user_pages() would mean they will just run slower in the common case. Converting them to a manual pin_user_pages_fast_only() + pin_user_pages() doesn't seem very compelling. ... so we would need a new API? :/ > So then if KVM wants to look up a page in the page table, that is what > kvm should use, and it automatically gets the "honor numa faults" > behavior, not because it sets a magic flag, but simply because that is > how GUP-fast *works*. > > But if you use the "normal" get/pin_user_pages() function, which looks > up the vma, at that point you are following things at a "software > level", and it wouldn't do NUMA faulting, it would just get the page. 
My main problem with that is that pin_user_pages_fast() and friends are used all over the place for a "likely already in the page table case, so just make everything faster as default". Always honoring NUMA faults here does not sound like the improvement we wanted to have :) ... we actually *don't* want to honor NUMA faults here. We just have to find a way to make the KVM special case happy. Thanks!
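KVM's two-step lookup that David describes above — try `get_user_page_fast_only()` first, and fall back to a slower path that can fault pages in, unless the caller is in an atomic context — can be sketched as a toy model. The page-table structure and every function body below are illustrative stand-ins, not KVM's actual `hva_to_pfn()` code:

```c
#include <stdbool.h>

/* Toy page-table entry: present/writable/protnone bits only. */
struct pte {
    bool present, writable, protnone;
};

/* Stand-in for get_user_page_fast_only(): like a hardware walk, it
 * only trusts the page-table bits, so a protnone PTE must fail. */
static bool gup_fast_only(const struct pte *pte, bool write)
{
    if (!pte->present || pte->protnone)
        return false;
    if (write && !pte->writable)
        return false;
    return true;
}

/* Stand-in for the slow path (get_user_pages_unlocked()): it may take
 * mmap_lock and fault pages in; here we pretend the fault handler
 * resolved everything, so it always succeeds. */
static bool gup_slow(struct pte *pte, bool write)
{
    pte->present = true;
    pte->protnone = false;
    if (write)
        pte->writable = true;
    return true;
}

/* Shape of the lookup: try the lockless fast path for a writable
 * mapping first; fall back to the slow path unless atomic. */
static bool lookup_pfn(struct pte *pte, bool atomic)
{
    if (gup_fast_only(pte, /*write=*/true))
        return true;
    if (atomic)  /* atomic contexts cannot sleep on mmap_lock */
        return false;
    return gup_slow(pte, /*write=*/false);
}
```

This makes the tension concrete: a protnone (NUMA-hinted) PTE always pushes the lookup onto the slow path, which is exactly the "runs slower in the common case" cost David is worried about for callers like O_DIRECT that don't care about NUMA hinting.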
On Mon, 31 Jul 2023 at 12:00, David Hildenbrand <david@redhat.com> wrote: > > So in the "!writable" case, we would always call > get_user_pages_unlocked() and never honor NUMA faults. Ok, so kvm doesn't just use the fast version. Oh well. It was an idea.. Linus
On 31.07.23 21:07, Linus Torvalds wrote: > On Mon, 31 Jul 2023 at 12:00, David Hildenbrand <david@redhat.com> wrote: >> >> So in the "!writable" case, we would always call >> get_user_pages_unlocked() and never honor NUMA faults. > > Ok, so kvm doesn't just use the fast version. Oh well. It was an idea.. Certainly an interesting one, thanks for thinking about possible alternatives! Unfortunately, KVM is an advanced GUP user that manages secondary MMUs. I'll get the FOLL_HONOR_NUMA_FAULT patches ready tomorrow and we can discuss if that looks better. (btw, the whole reasoning about "HW would refuse to use these pages" was exactly the reason I went in the FOLL_FORCE direction ... but it's really better to make FOLL_FORCE deal with VMA protection only)
On Mon, Jul 31, 2023 at 09:00:06PM +0200, David Hildenbrand wrote: > Their logic is "if it's directly in the page table, create, hand it over. If > not, please go the slow path.". In many cases user space just touched these > pages so they are very likely in the page table. I think it has become pretty confusing, overall. In my mind 'pin_user_pages_fast()' should be functionally the same as 'pin_user_pages_unlocked()'. Places call fast if they have no idea about what is under memory, otherwise they call unlocked if you are pretty sure something is there that needs the mmap lock to resolve. If we need different behaviors a GUP flag makes the most sense. > Always honoring NUMA faults here does not sound like the improvement we > wanted to have :) ... we actually *don't* want to honor NUMA faults here. Yeah, I think that is right. We should not really use the CPU running PUP as any input to a NUMA algorithm.. If we want NUMA'ness then the PUP user should specify the affinity that makes sense. Jason
On Fri, Jul 28, 2023 at 07:30:33PM +0200, David Hildenbrand wrote: > On 28.07.23 18:18, Linus Torvalds wrote: > > On Thu, 27 Jul 2023 at 14:28, David Hildenbrand <david@redhat.com> wrote: > > > > > > This is my proposal on how to handle the fallout of 474098edac26 > > > ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") where I > > > accidentially missed that follow_page() and smaps implicitly kept the > > > FOLL_NUMA flag clear by *not* setting it if FOLL_FORCE is absent, to > > > not trigger faults on PROT_NONE-mapped PTEs. > > > > Ugh. > > I was hoping for that reaction, with the assumption that we would get > something cleaner :) > > > > > I hate how it uses FOLL_FORCE that is inherently scary. > > I hate FOLL_FORCE, but I hate FOLL_NUMA even more, because to me it > is FOLL_FORCE in disguise (currently and before 474098edac26, if > FOLL_FORCE is set, FOLL_NUMA won't be set and the other way around). > FOLL_NUMA being conflated with FOLL_FORCE is almost certainly a historical accident. > > > > Why do we have that "gup_can_follow_protnone()" logic AT ALL? > > That's what I was hoping for. > > > > > Couldn't we just get rid of that disgusting thing, and just say that > > GUP (and follow_page()) always just ignores NUMA hinting, and always > > just follows protnone? > > > > We literally used to have this: > > > > if (!(gup_flags & FOLL_FORCE)) > > gup_flags |= FOLL_NUMA; > > > > ie we *always* set FOLL_NUMA for any sane situation. FOLL_FORCE should > > be the rare crazy case. > > Yes, but my point would be that we now spell that "rare crazy case" > out for follow_page(). > > If you're talking about patch #1, I agree, therefore patch #3 to > avoid all that nasty FOLL_FORCE handling in GUP callers. > > But yeah, if we can avoid all that, great. 
> > > > > The original reason for not setting FOLL_NUMA all the time is > > documented in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting > > page faults from gup/gup_fast") from way back in 2012: > > > > * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault > > * would be called on PROT_NONE ranges. We must never invoke > > * handle_mm_fault on PROT_NONE ranges or the NUMA hinting > > * page faults would unprotect the PROT_NONE ranges if > > * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd > > * bitflag. So to avoid that, don't set FOLL_NUMA if > > * FOLL_FORCE is set. > > > In handle_mm_fault(), we never call do_numa_page() if > !vma_is_accessible(). Same for do_huge_pmd_numa_page(). > > So, if we would ever end up triggering a page fault on > mprotect(PROT_NONE) ranges (i.e., via FOLL_FORCE), we > would simply do nothing. > > At least that's the hope, I'll take a closer look just to make > sure we're good on all call paths. > > > > > but I don't think the original reason for this is *true* any more. > > > > Because then two years later in 2014, in commit c46a7c817e66 ("x86: > > define _PAGE_NUMA by reusing software bits on the PMD and PTE levels") > > Mel made the code able to distinguish between PROT_NONE and NUMA > > pages, and he changed the comment above too. > > CCing Mel. > > I remember that pte_protnone() can only distinguish between > NUMA vs. actual mprotect(PROT_NONE) by looking at the VMA -- vma_is_accessible(). > Ok, as usual, I'm far behind and this thread is massive, but I'll respond to this part before trying to digest the history of this and the current implementation. To the best of my recollection, FOLL_NUMA used to be a correctness issue, but that should no longer be true. Initially, it was to prevent mixing up "PROT_NONE" that was for NUMA hinting and "PROT_NONE" due to VMA protections. Now the bits are different, so this case should be avoidable. 
Later it was still a different correctness issue because PMD migration had a hacky implementation without migration entries and a GUP could find a page that was being collapsed and had to be serialised. That should also now be avoidable. At some point, FOLL_FORCE and FOLL_NUMA got conflated, but they really should not be related, even if they are by accident. FOLL_FORCE (e.g. ptrace) may have to process the fault and make the page resident and accessible regardless of any other consequences. FOLL_NUMA ideally should be much more specific. If the calling context only cares about the struct page (e.g. smaps) then it's ok to get a reference to the page. If necessary, it could clear the protection and lose the hinting fault, although that is less than ideal. Just needing the struct page for informational purposes should not be treated as a NUMA hinting fault, because it has nothing to do with the task's memory reference behaviour. A variant of FOLL_NUMA (FOLL_NUMA_HINT?) may still be required to indicate the calling context is accessing the page for reasons that are equivalent to a real memory access from a CPU related to the task mapping the page. I didn't check, but KVM may be an example of this when dealing with some MMU faults, as the page is being looked up on behalf of the task, and presumably from the same CPU the task was running on. Something like reading smaps only needs the struct page, but it should not be treated as a NUMA hinting fault, as the access has nothing to do with the task mapping the page. > > The original reason for FOLL_NUMA simply does not exist any more. We > > know exactly when a page is marked for NUMA faulting, and we should > > simply *ignore* it for GUP and follow_page(). > > I'd be wary of completely ignoring it if there is any known calling context that is equivalent to a memory access and the hinting fault should be processed -- KVM may be an example.
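Mel's point — that a protnone PTE is only a NUMA hint when the VMA itself is accessible, because x86 shares the same PTE bit between NUMA hinting and real mprotect(PROT_NONE) — can be sketched as a toy classifier. The types and names below are illustrative stand-ins, not the kernel's actual code (which checks `vma_is_accessible()` against the VMA's access flags):

```c
#include <stdbool.h>

enum protnone_kind {
    NOT_PROTNONE,    /* PTE doesn't have the protnone bit set */
    NUMA_HINT,       /* protnone PTE in an accessible VMA */
    REAL_PROT_NONE,  /* protnone PTE in a PROT_NONE VMA */
};

/* Stand-in VMA: we only model the vma_is_accessible() result. */
struct vm_area_struct {
    bool accessible;
};

/* The PTE bit alone is ambiguous on x86, where NUMA hinting reuses
 * the protnone encoding: only the VMA's access bits disambiguate a
 * hinting fault candidate from genuine mprotect(PROT_NONE). */
static enum protnone_kind classify_protnone(bool pte_protnone,
                                            const struct vm_area_struct *vma)
{
    if (!pte_protnone)
        return NOT_PROTNONE;
    return vma->accessible ? NUMA_HINT : REAL_PROT_NONE;
}
```

Under this model, a hypothetical FOLL_NUMA_HINT caller would trigger a fault only on the NUMA_HINT case, while an informational reader like smaps would take the page reference in all three cases.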