diff mbox series

[v2,3/8] kvm: explicitly set FOLL_HONOR_NUMA_FAULT in hva_to_pfn_slow()

Message ID 20230801124844.278698-4-david@redhat.com (mailing list archive)
State New, archived
Series smaps / mm/gup: fix gup_can_follow_protnone fallout

Commit Message

David Hildenbrand Aug. 1, 2023, 12:48 p.m. UTC
KVM is *the* case we know of that really wants to honor NUMA hinting faults.
As we want to stop setting FOLL_HONOR_NUMA_FAULT implicitly, set
FOLL_HONOR_NUMA_FAULT whenever we might obtain pages on behalf of a VCPU
to map them into a secondary MMU, and add a comment why.

Do that unconditionally in hva_to_pfn_slow() when calling
get_user_pages_unlocked().

kvmppc_book3s_instantiate_page(), hva_to_pfn_fast() and
gfn_to_page_many_atomic() are similarly used to map pages into a
secondary MMU. However, FOLL_WRITE and get_user_page_fast_only() always
implicitly honor NUMA hinting faults -- as documented for
FOLL_HONOR_NUMA_FAULT -- so we can limit this change to a single location
for now.

Don't set it in check_user_page_hwpoison(), where we really only want to
check if the mapped page is HW-poisoned.

We won't set it for other KVM users of get_user_pages()/pin_user_pages():
* arch/powerpc/kvm/book3s_64_mmu_hv.c: not used to map pages into a
  secondary MMU.
* arch/powerpc/kvm/e500_mmu.c: only used on shared TLB pages with userspace
* arch/s390/kvm/*: s390x only supports a single NUMA node either way
* arch/x86/kvm/svm/sev.c: not used to map pages into a secondary MMU.

This is a preparation for no longer implicitly setting FOLL_HONOR_NUMA_FAULT
in get_user_pages() and friends.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 virt/kvm/kvm_main.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

Comments

Mel Gorman Aug. 2, 2023, 3:27 p.m. UTC | #1
On Tue, Aug 01, 2023 at 02:48:39PM +0200, David Hildenbrand wrote:
> KVM is *the* case we know of that really wants to honor NUMA hinting faults.
> As we want to stop setting FOLL_HONOR_NUMA_FAULT implicitly, set
> FOLL_HONOR_NUMA_FAULT whenever we might obtain pages on behalf of a VCPU
> to map them into a secondary MMU, and add a comment why.
> 
> Do that unconditionally in hva_to_pfn_slow() when calling
> get_user_pages_unlocked().
> 
> kvmppc_book3s_instantiate_page(), hva_to_pfn_fast() and
> gfn_to_page_many_atomic() are similarly used to map pages into a
> secondary MMU. However, FOLL_WRITE and get_user_page_fast_only() always
> implicitly honor NUMA hinting faults -- as documented for
> FOLL_HONOR_NUMA_FAULT -- so we can limit this change to a single location
> for now.
> 
> Don't set it in check_user_page_hwpoison(), where we really only want to
> check if the mapped page is HW-poisoned.
> 
> We won't set it for other KVM users of get_user_pages()/pin_user_pages()
> * arch/powerpc/kvm/book3s_64_mmu_hv.c: not used to map pages into a
>   secondary MMU.
> * arch/powerpc/kvm/e500_mmu.c: only used on shared TLB pages with userspace
> * arch/s390/kvm/*: s390x only supports a single NUMA node either way
> * arch/x86/kvm/svm/sev.c: not used to map pages into a secondary MMU.
> 
> This is a preparation for making FOLL_HONOR_NUMA_FAULT no longer
> implicitly be set by get_user_pages() and friends.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Seems sane but I don't know KVM well enough to know if this is the only
relevant case so didn't ack.
David Hildenbrand Aug. 2, 2023, 3:29 p.m. UTC | #2
On 02.08.23 17:27, Mel Gorman wrote:
> On Tue, Aug 01, 2023 at 02:48:39PM +0200, David Hildenbrand wrote:
>> KVM is *the* case we know of that really wants to honor NUMA hinting faults.
>> As we want to stop setting FOLL_HONOR_NUMA_FAULT implicitly, set
>> FOLL_HONOR_NUMA_FAULT whenever we might obtain pages on behalf of a VCPU
>> to map them into a secondary MMU, and add a comment why.
>>
>> Do that unconditionally in hva_to_pfn_slow() when calling
>> get_user_pages_unlocked().
>>
>> kvmppc_book3s_instantiate_page(), hva_to_pfn_fast() and
>> gfn_to_page_many_atomic() are similarly used to map pages into a
>> secondary MMU. However, FOLL_WRITE and get_user_page_fast_only() always
>> implicitly honor NUMA hinting faults -- as documented for
>> FOLL_HONOR_NUMA_FAULT -- so we can limit this change to a single location
>> for now.
>>
>> Don't set it in check_user_page_hwpoison(), where we really only want to
>> check if the mapped page is HW-poisoned.
>>
>> We won't set it for other KVM users of get_user_pages()/pin_user_pages()
>> * arch/powerpc/kvm/book3s_64_mmu_hv.c: not used to map pages into a
>>    secondary MMU.
>> * arch/powerpc/kvm/e500_mmu.c: only used on shared TLB pages with userspace
>> * arch/s390/kvm/*: s390x only supports a single NUMA node either way
>> * arch/x86/kvm/svm/sev.c: not used to map pages into a secondary MMU.
>>
>> This is a preparation for making FOLL_HONOR_NUMA_FAULT no longer
>> implicitly be set by get_user_pages() and friends.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Seems sane but I don't know KVM well enough to know if this is the only
> relevant case so didn't ack.

Makes sense, some careful eyes from KVM people would be appreciated.

At least from kvm_main.c POV, I'm pretty confident that that's it.

Patch

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dfbaafbe3a00..6e4f2b81541e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2517,7 +2517,18 @@  static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
 static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 			   bool interruptible, bool *writable, kvm_pfn_t *pfn)
 {
-	unsigned int flags = FOLL_HWPOISON;
+	/*
+	 * When a VCPU accesses a page that is not mapped into the secondary
+	 * MMU, we lookup the page using GUP to map it, so the guest VCPU can
+	 * make progress. We always want to honor NUMA hinting faults in that
+	 * case, because GUP usage corresponds to memory accesses from the VCPU.
+	 * Otherwise, we'd not trigger NUMA hinting faults once a page is
+	 * mapped into the secondary MMU and gets accessed by a VCPU.
+	 *
+	 * Note that get_user_page_fast_only() and FOLL_WRITE for now
+	 * implicitly honor NUMA hinting faults and don't need this flag.
+	 */
+	unsigned int flags = FOLL_HWPOISON | FOLL_HONOR_NUMA_FAULT;
 	struct page *page;
 	int npages;