
[RFC,v2,0/5] Reduce NUMA balance caused TLB-shootdowns in a VM

Message ID 20230810085636.25914-1-yan.y.zhao@intel.com

Message

Yan Zhao Aug. 10, 2023, 8:56 a.m. UTC
This is an RFC series trying to fix the issue of unnecessary NUMA
protection and TLB-shootdowns found in VMs with assigned devices or VFIO
mediated devices during NUMA balancing.

For VMs with assigned devices or VFIO mediated devices, all or part of
guest memory is pinned long-term.

Auto NUMA balancing periodically selects VMAs of a process and changes
their protection to PROT_NONE, even though some or all pages in the
selected ranges are long-term pinned for DMA, as is the case for VMs with
assigned devices or VFIO mediated devices.

Though this does not cause a functional problem, because NUMA migration
will ultimately reject migration of such pages and restore the PROT_NONE
PTEs, it causes KVM's secondary MMU to be zapped periodically, with
identical SPTEs eventually faulted back in, wasting CPU cycles and
generating unnecessary TLB-shootdowns.

This series first introduces a new flag, MMU_NOTIFIER_RANGE_NUMA, in patch
1 to accompany the mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so
that a subscriber of the mmu notifier (e.g. KVM) can know that an
invalidation event is sent specifically for NUMA migration.

Patch 2 skips setting PROT_NONE on long-term pinned pages in the primary
MMU, avoiding the page faults introduced by NUMA protection and the
subsequent restoration of the old huge PMDs/PTEs in the primary MMU.

Patch 3 introduces a new mmu notifier callback, .numa_protect(), which is
called (in patch 4) once a page is guaranteed to be PROT_NONE protected.

Then, in patch 5, KVM can recognize that an .invalidate_range_start()
notification is specifically for NUMA balancing and defer the page unmap
in the secondary MMU until .numa_protect() arrives.
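
For reference, the rough shape of the new interface (a sketch only; the
exact definitions and values live in the individual patches and may
differ):

	/* patch 1: new range flag, passed along with MMU_NOTIFY_PROTECTION_VMA */
	#define MMU_NOTIFIER_RANGE_NUMA	(1 << 1)	/* value illustrative */

	/* patch 3: new optional callback in struct mmu_notifier_ops
	 * (signature illustrative)
	 */
	void (*numa_protect)(struct mmu_notifier *subscription,
			     struct mm_struct *mm,
			     unsigned long start,
			     unsigned long end);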


Changelog:
RFC v1 --> v2:
1. Added patches 3-4 to introduce a new callback .numa_protect().
2. Rather than have KVM duplicate the logic to check whether a page is
long-term pinned, let KVM depend on the new callback .numa_protect() to do
the page unmap in the secondary MMU for NUMA migration.

RFC v1:
https://lore.kernel.org/all/20230808071329.19995-1-yan.y.zhao@intel.com/ 

Yan Zhao (5):
  mm/mmu_notifier: introduce a new mmu notifier flag
    MMU_NOTIFIER_RANGE_NUMA
  mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate
    purpose
  mm/mmu_notifier: introduce a new callback .numa_protect
  mm/autonuma: call .numa_protect() when page is protected for NUMA
    migrate
  KVM: Unmap pages only when it's indeed protected for NUMA migration

 include/linux/mmu_notifier.h | 16 ++++++++++++++++
 mm/huge_memory.c             |  6 ++++++
 mm/mmu_notifier.c            | 18 ++++++++++++++++++
 mm/mprotect.c                | 10 +++++++++-
 virt/kvm/kvm_main.c          | 25 ++++++++++++++++++++++---
 5 files changed, 71 insertions(+), 4 deletions(-)

Comments

David Hildenbrand Aug. 10, 2023, 9:34 a.m. UTC | #1
On 10.08.23 10:56, Yan Zhao wrote:
> This is an RFC series trying to fix the issue of unnecessary NUMA
> protection and TLB-shootdowns found in VMs with assigned devices or VFIO
> mediated devices during NUMA balance.
> 
> For VMs with assigned devices or VFIO mediated devices, all or part of
> guest memory are pinned for long-term.
> 
> Auto NUMA balancing will periodically selects VMAs of a process and change
> protections to PROT_NONE even though some or all pages in the selected
> ranges are long-term pinned for DMAs, which is true for VMs with assigned
> devices or VFIO mediated devices.
> 
> Though this will not cause real problem because NUMA migration will
> ultimately reject migration of those kind of pages and restore those
> PROT_NONE PTEs, it causes KVM's secondary MMU to be zapped periodically
> with equal SPTEs finally faulted back, wasting CPU cycles and generating
> unnecessary TLB-shootdowns.
> 
> This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1
> to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that
> the subscriber (e.g.KVM) of the mmu notifier can know that an invalidation
> event is sent for NUMA migration purpose in specific.
> 
> Patch 2 skips setting PROT_NONE to long-term pinned pages in the primary
> MMU to avoid NUMA protection introduced page faults and restoration of old
> huge PMDs/PTEs in primary MMU.
> 
> Patch 3 introduces a new mmu notifier callback .numa_protect(), which
> will be called in patch 4 when a page is ensured to be PROT_NONE protected.
> 
> Then in patch 5, KVM can recognize a .invalidate_range_start() notification
> is for NUMA balancing specific and do not do the page unmap in secondary
> MMU until .numa_protect() comes.
> 

Why do we need all that, when we should simply not be applying PROT_NONE 
to pinned pages?

In change_pte_range() we already have:

if (is_cow_mapping(vma->vm_flags) &&
     page_count(page) != 1)

Which includes both, shared and pinned pages.

Staring at page #2, are we still missing something similar for THPs?

Why is that MMU notifier thingy and touching KVM code required?
Yan Zhao Aug. 10, 2023, 9:50 a.m. UTC | #2
On Thu, Aug 10, 2023 at 11:34:07AM +0200, David Hildenbrand wrote:
> > This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1
> > to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that
> > the subscriber (e.g.KVM) of the mmu notifier can know that an invalidation
> > event is sent for NUMA migration purpose in specific.
> > 
> > Patch 2 skips setting PROT_NONE to long-term pinned pages in the primary
> > MMU to avoid NUMA protection introduced page faults and restoration of old
> > huge PMDs/PTEs in primary MMU.
> > 
> > Patch 3 introduces a new mmu notifier callback .numa_protect(), which
> > will be called in patch 4 when a page is ensured to be PROT_NONE protected.
> > 
> > Then in patch 5, KVM can recognize a .invalidate_range_start() notification
> > is for NUMA balancing specific and do not do the page unmap in secondary
> > MMU until .numa_protect() comes.
> > 
> 
> Why do we need all that, when we should simply not be applying PROT_NONE to
> pinned pages?
> 
> In change_pte_range() we already have:
> 
> if (is_cow_mapping(vma->vm_flags) &&
>     page_count(page) != 1)
> 
> Which includes both, shared and pinned pages.
Ah, right. Currently, on my side, I don't see any pinned pages outside of
this condition.
But I have a question regarding is_cow_mapping(vma->vm_flags): do we need
to allow for pinned pages in !is_cow_mapping(vma->vm_flags) as well?

> Staring at page #2, are we still missing something similar for THPs?
Yes.

> Why is that MMU notifier thingy and touching KVM code required?
Because the NUMA balancing code first sends .invalidate_range_start() with
event type MMU_NOTIFY_PROTECTION_VMA to KVM in change_pmd_range()
unconditionally, before it goes down into change_pte_range() and
change_huge_pmd() to check each page's refcount and apply PROT_NONE.

Current KVM then unmaps all notified pages from the secondary MMU in
.invalidate_range_start(), which can include pages that are ultimately not
set to PROT_NONE in the primary MMU.

For VMs with pass-through devices, even though all guest pages are pinned,
KVM still periodically unmaps pages in response to the
.invalidate_range_start() notifications from auto NUMA balancing, which is
a waste.

So, if a new callback is sent only when pages are set to PROT_NONE for
NUMA migration, KVM can unmap only those pages.
Since KVM still needs to unmap pages for other event types in its
.invalidate_range_start() handler (i.e.
kvm_mmu_notifier_invalidate_range_start()), and MMU_NOTIFY_PROTECTION_VMA
is also used for other purposes, patch 1 adds a range flag to help KVM
avoid a blind unmap in .invalidate_range_start() and instead do it in the
new .numa_protect() handler.
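
Roughly, the intended KVM-side change looks like this (a sketch only, using
the names from patches 1 and 5; the actual diff may differ):

	/* kvm_mmu_notifier_invalidate_range_start() -- sketch */
	if (range->event == MMU_NOTIFY_PROTECTION_VMA &&
	    (range->flags & MMU_NOTIFIER_RANGE_NUMA)) {
		/*
		 * The invalidation is only for NUMA protection; defer the
		 * unmap until .numa_protect() reports the pages that were
		 * actually made PROT_NONE in the primary MMU.
		 */
		return 0;
	}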

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 
>
Chao Gao Aug. 10, 2023, 1:58 p.m. UTC | #3
On Thu, Aug 10, 2023 at 04:56:36PM +0800, Yan Zhao wrote:
>This is an RFC series trying to fix the issue of unnecessary NUMA
>protection and TLB-shootdowns found in VMs with assigned devices or VFIO
>mediated devices during NUMA balance.
>
>For VMs with assigned devices or VFIO mediated devices, all or part of
>guest memory are pinned for long-term.
>
>Auto NUMA balancing will periodically selects VMAs of a process and change
>protections to PROT_NONE even though some or all pages in the selected
>ranges are long-term pinned for DMAs, which is true for VMs with assigned
>devices or VFIO mediated devices.
>
>Though this will not cause real problem because NUMA migration will
>ultimately reject migration of those kind of pages and restore those
>PROT_NONE PTEs, it causes KVM's secondary MMU to be zapped periodically
>with equal SPTEs finally faulted back, wasting CPU cycles and generating
>unnecessary TLB-shootdowns.

In my understanding, NUMA balancing also moves tasks closer to the memory
they are accessing. Can this still work with this series applied?

>
>This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1
>to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that
>the subscriber (e.g.KVM) of the mmu notifier can know that an invalidation
>event is sent for NUMA migration purpose in specific.
>
>Patch 2 skips setting PROT_NONE to long-term pinned pages in the primary
>MMU to avoid NUMA protection introduced page faults and restoration of old
>huge PMDs/PTEs in primary MMU.
>
>Patch 3 introduces a new mmu notifier callback .numa_protect(), which
>will be called in patch 4 when a page is ensured to be PROT_NONE protected.
>
>Then in patch 5, KVM can recognize a .invalidate_range_start() notification
>is for NUMA balancing specific and do not do the page unmap in secondary
>MMU until .numa_protect() comes.
>
>
>Changelog:
>RFC v1 --> v2:
>1. added patch 3-4 to introduce a new callback .numa_protect()
>2. Rather than have KVM duplicate logic to check if a page is pinned for
>long-term, let KVM depend on the new callback .numa_protect() to do the
>page unmap in secondary MMU for NUMA migration purpose.
>
>RFC v1:
>https://lore.kernel.org/all/20230808071329.19995-1-yan.y.zhao@intel.com/ 
>
>Yan Zhao (5):
>  mm/mmu_notifier: introduce a new mmu notifier flag
>    MMU_NOTIFIER_RANGE_NUMA
>  mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate
>    purpose
>  mm/mmu_notifier: introduce a new callback .numa_protect
>  mm/autonuma: call .numa_protect() when page is protected for NUMA
>    migrate
>  KVM: Unmap pages only when it's indeed protected for NUMA migration
>
> include/linux/mmu_notifier.h | 16 ++++++++++++++++
> mm/huge_memory.c             |  6 ++++++
> mm/mmu_notifier.c            | 18 ++++++++++++++++++
> mm/mprotect.c                | 10 +++++++++-
> virt/kvm/kvm_main.c          | 25 ++++++++++++++++++++++---
> 5 files changed, 71 insertions(+), 4 deletions(-)
>
>-- 
>2.17.1
>
Yan Zhao Aug. 11, 2023, 5:22 a.m. UTC | #4
On Thu, Aug 10, 2023 at 09:58:43PM +0800, Chao Gao wrote:
> On Thu, Aug 10, 2023 at 04:56:36PM +0800, Yan Zhao wrote:
> >This is an RFC series trying to fix the issue of unnecessary NUMA
> >protection and TLB-shootdowns found in VMs with assigned devices or VFIO
> >mediated devices during NUMA balance.
> >
> >For VMs with assigned devices or VFIO mediated devices, all or part of
> >guest memory are pinned for long-term.
> >
> >Auto NUMA balancing will periodically selects VMAs of a process and change
> >protections to PROT_NONE even though some or all pages in the selected
> >ranges are long-term pinned for DMAs, which is true for VMs with assigned
> >devices or VFIO mediated devices.
> >
> >Though this will not cause real problem because NUMA migration will
> >ultimately reject migration of those kind of pages and restore those
> >PROT_NONE PTEs, it causes KVM's secondary MMU to be zapped periodically
> >with equal SPTEs finally faulted back, wasting CPU cycles and generating
> >unnecessary TLB-shootdowns.
> 
> In my understanding, NUMA balancing also moves tasks closer to the memory
> they are accessing. Can this still work with this series applied?
> 
For pages protected with PROT_NONE in the primary MMU during the scanning
phase, yes; for pages not set to PROT_NONE, no.
It looks like task_numa_migrate() is only triggered by the next page
fault, when a PROT_NONE PTE is hit in an accessible VMA.
David Hildenbrand Aug. 11, 2023, 5:25 p.m. UTC | #5
On 10.08.23 11:50, Yan Zhao wrote:
> On Thu, Aug 10, 2023 at 11:34:07AM +0200, David Hildenbrand wrote:
>>> This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1
>>> to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that
>>> the subscriber (e.g.KVM) of the mmu notifier can know that an invalidation
>>> event is sent for NUMA migration purpose in specific.
>>>
>>> Patch 2 skips setting PROT_NONE to long-term pinned pages in the primary
>>> MMU to avoid NUMA protection introduced page faults and restoration of old
>>> huge PMDs/PTEs in primary MMU.
>>>
>>> Patch 3 introduces a new mmu notifier callback .numa_protect(), which
>>> will be called in patch 4 when a page is ensured to be PROT_NONE protected.
>>>
>>> Then in patch 5, KVM can recognize a .invalidate_range_start() notification
>>> is for NUMA balancing specific and do not do the page unmap in secondary
>>> MMU until .numa_protect() comes.
>>>
>>
>> Why do we need all that, when we should simply not be applying PROT_NONE to
>> pinned pages?
>>
>> In change_pte_range() we already have:
>>
>> if (is_cow_mapping(vma->vm_flags) &&
>>      page_count(page) != 1)
>>
>> Which includes both, shared and pinned pages.
> Ah, right, currently in my side, I don't see any pinned pages are
> outside of this condition.
> But I have a question regarding to is_cow_mapping(vma->vm_flags), do we
> need to allow pinned pages in !is_cow_mapping(vma->vm_flags)?

One issue is that folio_maybe_pinned...() ... is unreliable as soon as 
your page is mapped more than 1024 times.

One might argue that we also want to exclude pages that are mapped that 
often. That might possibly work.
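
(For context, the 1024 comes from GUP_PIN_COUNTING_BIAS. For small folios,
folio_maybe_dma_pinned() is roughly the following -- paraphrased, see
include/linux/mm.h for the exact definition:)

	static inline bool folio_maybe_dma_pinned(struct folio *folio)
	{
		if (folio_test_large(folio))
			return atomic_read(&folio->_pincount) > 0;
		/*
		 * Each FOLL_PIN pin adds GUP_PIN_COUNTING_BIAS (1024) to the
		 * refcount, so a small folio with >= 1024 references (e.g.
		 * mapped that many times) looks "maybe pinned" here even if
		 * it is not.
		 */
		return folio_ref_count(folio) >= GUP_PIN_COUNTING_BIAS;
	}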

> 
>> Staring at page #2, are we still missing something similar for THPs?
> Yes.
> 
>> Why is that MMU notifier thingy and touching KVM code required?
> Because NUMA balancing code will firstly send .invalidate_range_start() with
> event type MMU_NOTIFY_PROTECTION_VMA to KVM in change_pmd_range()
> unconditionally, before it goes down into change_pte_range() and
> change_huge_pmd() to check each page count and apply PROT_NONE.

Ah, okay I see, thanks. That's indeed unfortunate.

> 
> Then current KVM will unmap all notified pages from secondary MMU
> in .invalidate_range_start(), which could include pages that finally not
> set to PROT_NONE in primary MMU.
> 
> For VMs with pass-through devices, though all guest pages are pinned,
> KVM still periodically unmap pages in response to the
> .invalidate_range_start() notification from auto NUMA balancing, which
> is a waste.

Should we want to disable NUMA hinting for such VMAs instead (for 
example, by QEMU/hypervisor) that knows that any NUMA hinting activity 
on these ranges would be a complete waste of time? I recall that John H. 
once mentioned that there are similar issues with GPU memory:  NUMA 
hinting is actually counter-productive and they end up disabling it.
John Hubbard Aug. 11, 2023, 6:20 p.m. UTC | #6
On 8/11/23 10:25, David Hildenbrand wrote:
...
> One issue is that folio_maybe_pinned...() ... is unreliable as soon as your page is mapped more than 1024 times.
> 
> One might argue that we also want to exclude pages that are mapped that often. That might possibly work.

Yes.
  
>>
>>> Staring at page #2, are we still missing something similar for THPs?
>> Yes.
>>
>>> Why is that MMU notifier thingy and touching KVM code required?
>> Because NUMA balancing code will firstly send .invalidate_range_start() with
>> event type MMU_NOTIFY_PROTECTION_VMA to KVM in change_pmd_range()
>> unconditionally, before it goes down into change_pte_range() and
>> change_huge_pmd() to check each page count and apply PROT_NONE.
> 
> Ah, okay I see, thanks. That's indeed unfortunate.

Sigh. All this difficulty reminds me that this mechanism was created in
the early days of NUMA. I wonder sometimes lately whether the cost, in
complexity and CPU time, is still worth it on today's hardware.

But of course I am deeply biased, so don't take that too seriously.
See below. :)

> 
>>
>> Then current KVM will unmap all notified pages from secondary MMU
>> in .invalidate_range_start(), which could include pages that finally not
>> set to PROT_NONE in primary MMU.
>>
>> For VMs with pass-through devices, though all guest pages are pinned,
>> KVM still periodically unmap pages in response to the
>> .invalidate_range_start() notification from auto NUMA balancing, which
>> is a waste.
> 
> Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are 
> similar issues with GPU memory:  NUMA hinting is actually counter-productive and they end up disabling it.
> 

Yes, NUMA balancing is incredibly harmful to performance, for GPU and
accelerators that map memory...and VMs as well, it seems. Basically,
anything that has its own processors and page tables needs to be left
strictly alone by NUMA balancing. Because the kernel is (still, even
today) unaware of what those processors are doing, and so it has no way
to do productive NUMA balancing.


thanks,
David Hildenbrand Aug. 11, 2023, 6:39 p.m. UTC | #7
>> Ah, okay I see, thanks. That's indeed unfortunate.
> 
> Sigh. All this difficulty reminds me that this mechanism was created in
> the early days of NUMA. I wonder sometimes lately whether the cost, in
> complexity and CPU time, is still worth it on today's hardware.
> 
> But of course I am deeply biased, so don't take that too seriously.
> See below. :)

:)

>>
>>>
>>> Then current KVM will unmap all notified pages from secondary MMU
>>> in .invalidate_range_start(), which could include pages that finally not
>>> set to PROT_NONE in primary MMU.
>>>
>>> For VMs with pass-through devices, though all guest pages are pinned,
>>> KVM still periodically unmap pages in response to the
>>> .invalidate_range_start() notification from auto NUMA balancing, which
>>> is a waste.
>>
>> Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are
>> similar issues with GPU memory:  NUMA hinting is actually counter-productive and they end up disabling it.
>>
> 
> Yes, NUMA balancing is incredibly harmful to performance, for GPU and
> accelerators that map memory...and VMs as well, it seems. Basically,
> anything that has its own processors and page tables needs to be left
> strictly alone by NUMA balancing. Because the kernel is (still, even
> today) unaware of what those processors are doing, and so it has no way
> to do productive NUMA balancing.

Is there any existing way we could handle that better on a per-VMA 
level, or on the process level? Any magic toggles?

MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might 
be better, but with things like iouring still too restrictive eventually.

I recall that setting a mempolicy could prevent auto-numa from getting 
active, but that might be undesired.

CCing Mel.
John Hubbard Aug. 11, 2023, 7:35 p.m. UTC | #8
On 8/11/23 11:39, David Hildenbrand wrote:
...
>>> Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are
>>> similar issues with GPU memory:  NUMA hinting is actually counter-productive and they end up disabling it.
>>>
>>
>> Yes, NUMA balancing is incredibly harmful to performance, for GPU and
>> accelerators that map memory...and VMs as well, it seems. Basically,
>> anything that has its own processors and page tables needs to be left
>> strictly alone by NUMA balancing. Because the kernel is (still, even
>> today) unaware of what those processors are doing, and so it has no way
>> to do productive NUMA balancing.
> 
> Is there any existing way we could handle that better on a per-VMA level, or on the process level? Any magic toggles?
> 
> MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might be better, but with things like iouring still too restrictive eventually.
> 
> I recall that setting a mempolicy could prevent auto-numa from getting active, but that might be undesired.
> 
> CCing Mel.
> 

Let's discern between page pinning situations, and HMM-style situations.
Page pinning of CPU memory is unnecessary when setting up for using that
memory by modern GPUs or accelerators, because the latter can handle
replayable page faults. So for such cases, the pages are in use by a GPU
or accelerator, but unpinned.

The performance problem occurs because for those pages, the NUMA
balancing causes unmapping, which generates callbacks to the device
driver, which dutifully unmaps the pages from the GPU or accelerator,
even if the GPU might be busy using those pages. The device promptly
causes a device page fault, and the driver then re-establishes the
device page table mapping, which is good until the next round of
unmapping from the NUMA balancer.

hmm_range_fault()-based memory management in particular might benefit
from having NUMA balancing disabled entirely for the memremap_pages()
region, come to think of it. That seems relatively easy and clean at
first glance anyway.

For other regions (allocated by the device driver), a per-VMA flag
seems about right: VM_NO_NUMA_BALANCING ?
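
If such a flag existed, the scanner-side change could be as small as this
(hypothetical; VM_NO_NUMA_BALANCING is not an upstream flag):

	/* task_numa_work() VMA loop -- hypothetical sketch */
	if (vma->vm_flags & VM_NO_NUMA_BALANCING)
		continue;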

thanks,
Yan Zhao Aug. 14, 2023, 9:09 a.m. UTC | #9
On Fri, Aug 11, 2023 at 12:35:27PM -0700, John Hubbard wrote:
> On 8/11/23 11:39, David Hildenbrand wrote:
> ...
> > > > Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are
> > > > similar issues with GPU memory:  NUMA hinting is actually counter-productive and they end up disabling it.
> > > > 
> > > 
> > > Yes, NUMA balancing is incredibly harmful to performance, for GPU and
> > > accelerators that map memory...and VMs as well, it seems. Basically,
> > > anything that has its own processors and page tables needs to be left
> > > strictly alone by NUMA balancing. Because the kernel is (still, even
> > > today) unaware of what those processors are doing, and so it has no way
> > > to do productive NUMA balancing.
> > 
> > Is there any existing way we could handle that better on a per-VMA level, or on the process level? Any magic toggles?
> > 
> > MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might be better, but with things like iouring still too restrictive eventually.
> > 
> > I recall that setting a mempolicy could prevent auto-numa from getting active, but that might be undesired.
> > 
> > CCing Mel.
> > 
> 
> Let's discern between page pinning situations, and HMM-style situations.
> Page pinning of CPU memory is unnecessary when setting up for using that
> memory by modern GPUs or accelerators, because the latter can handle
> replayable page faults. So for such cases, the pages are in use by a GPU
> or accelerator, but unpinned.
> 
> The performance problem occurs because for those pages, the NUMA
> balancing causes unmapping, which generates callbacks to the device
> driver, which dutifully unmaps the pages from the GPU or accelerator,
> even if the GPU might be busy using those pages. The device promptly
> causes a device page fault, and the driver then re-establishes the
> device page table mapping, which is good until the next round of
> unmapping from the NUMA balancer.
> 
> hmm_range_fault()-based memory management in particular might benefit
> from having NUMA balancing disabled entirely for the memremap_pages()
> region, come to think of it. That seems relatively easy and clean at
> first glance anyway.
> 
> For other regions (allocated by the device driver), a per-VMA flag
> seems about right: VM_NO_NUMA_BALANCING ?
> 
Thanks a lot for those good suggestions!
For VMs, when could a per-VMA flag be set?
It might be hard to do in mmap() in QEMU, because a VMA may not be used
for DMA until after it's mapped into VFIO.
Then, should VFIO set this flag after it maps a range?
And could this flag be unset after device hot-unplug?
John Hubbard Aug. 15, 2023, 2:34 a.m. UTC | #10
On 8/14/23 02:09, Yan Zhao wrote:
...
>> hmm_range_fault()-based memory management in particular might benefit
>> from having NUMA balancing disabled entirely for the memremap_pages()
>> region, come to think of it. That seems relatively easy and clean at
>> first glance anyway.
>>
>> For other regions (allocated by the device driver), a per-VMA flag
>> seems about right: VM_NO_NUMA_BALANCING ?
>>
> Thanks a lot for those good suggestions!
> For VMs, when could a per-VMA flag be set?
> Might be hard in mmap() in QEMU because a VMA may not be used for DMA until
> after it's mapped into VFIO.
> Then, should VFIO set this flag on after it maps a range?
> Could this flag be unset after device hot-unplug?
> 

I'm hoping someone who thinks about VMs and VFIO often can chime in.


thanks,
Yuan Yao Aug. 15, 2023, 2:36 a.m. UTC | #11
On Mon, Aug 14, 2023 at 05:09:18PM +0800, Yan Zhao wrote:
> On Fri, Aug 11, 2023 at 12:35:27PM -0700, John Hubbard wrote:
> > On 8/11/23 11:39, David Hildenbrand wrote:
> > ...
> > > > > Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are
> > > > > similar issues with GPU memory:  NUMA hinting is actually counter-productive and they end up disabling it.
> > > > >
> > > >
> > > > Yes, NUMA balancing is incredibly harmful to performance, for GPU and
> > > > accelerators that map memory...and VMs as well, it seems. Basically,
> > > > anything that has its own processors and page tables needs to be left
> > > > strictly alone by NUMA balancing. Because the kernel is (still, even
> > > > today) unaware of what those processors are doing, and so it has no way
> > > > to do productive NUMA balancing.
> > >
> > > Is there any existing way we could handle that better on a per-VMA level, or on the process level? Any magic toggles?
> > >
> > > MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might be better, but with things like iouring still too restrictive eventually.
> > >
> > > I recall that setting a mempolicy could prevent auto-numa from getting active, but that might be undesired.
> > >
> > > CCing Mel.
> > >
> >
> > Let's discern between page pinning situations, and HMM-style situations.
> > Page pinning of CPU memory is unnecessary when setting up for using that
> > memory by modern GPUs or accelerators, because the latter can handle
> > replayable page faults. So for such cases, the pages are in use by a GPU
> > or accelerator, but unpinned.
> >
> > The performance problem occurs because for those pages, the NUMA
> > balancing causes unmapping, which generates callbacks to the device
> > driver, which dutifully unmaps the pages from the GPU or accelerator,
> > even if the GPU might be busy using those pages. The device promptly
> > causes a device page fault, and the driver then re-establishes the
> > device page table mapping, which is good until the next round of
> > unmapping from the NUMA balancer.
> >
> > hmm_range_fault()-based memory management in particular might benefit
> > from having NUMA balancing disabled entirely for the memremap_pages()
> > region, come to think of it. That seems relatively easy and clean at
> > first glance anyway.
> >
> > For other regions (allocated by the device driver), a per-VMA flag
> > seems about right: VM_NO_NUMA_BALANCING ?
> >
> Thanks a lot for those good suggestions!
> For VMs, when could a per-VMA flag be set?
> Might be hard in mmap() in QEMU because a VMA may not be used for DMA until
> after it's mapped into VFIO.
> Then, should VFIO set this flag on after it maps a range?
> Could this flag be unset after device hot-unplug?

Emm... the madvise() syscall comes to mind; it does things like changing
flags on a VMA, e.g. madvise(MADV_DONTFORK) adds VM_DONTCOPY to the VMA.
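
Something like the following from QEMU, for example after it maps guest
RAM into VFIO (hypothetical; no such MADV_* value exists today, and the
variable names are made up):

	if (madvise(guest_ram_addr, guest_ram_len, MADV_NO_NUMA_BALANCING))
		perror("madvise(MADV_NO_NUMA_BALANCING)");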

>
>
Yan Zhao Aug. 15, 2023, 2:37 a.m. UTC | #12
On Tue, Aug 15, 2023 at 10:36:18AM +0800, Yuan Yao wrote:
> On Mon, Aug 14, 2023 at 05:09:18PM +0800, Yan Zhao wrote:
> > On Fri, Aug 11, 2023 at 12:35:27PM -0700, John Hubbard wrote:
> > > On 8/11/23 11:39, David Hildenbrand wrote:
> > > ...
> > > > > > Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are
> > > > > > similar issues with GPU memory:  NUMA hinting is actually counter-productive and they end up disabling it.
> > > > > >
> > > > >
> > > > > Yes, NUMA balancing is incredibly harmful to performance, for GPU and
> > > > > accelerators that map memory...and VMs as well, it seems. Basically,
> > > > > anything that has its own processors and page tables needs to be left
> > > > > strictly alone by NUMA balancing. Because the kernel is (still, even
> > > > > today) unaware of what those processors are doing, and so it has no way
> > > > > to do productive NUMA balancing.
> > > >
> > > > Is there any existing way we could handle that better on a per-VMA level, or on the process level? Any magic toggles?
> > > >
> > > > MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might be better, but with things like iouring still too restrictive eventually.
> > > >
> > > > I recall that setting a mempolicy could prevent auto-numa from getting active, but that might be undesired.
> > > >
> > > > CCing Mel.
> > > >
> > >
> > > Let's discern between page pinning situations, and HMM-style situations.
> > > Page pinning of CPU memory is unnecessary when setting up for using that
> > > memory by modern GPUs or accelerators, because the latter can handle
> > > replayable page faults. So for such cases, the pages are in use by a GPU
> > > or accelerator, but unpinned.
> > >
> > > The performance problem occurs because for those pages, the NUMA
> > > balancing causes unmapping, which generates callbacks to the device
> > > driver, which dutifully unmaps the pages from the GPU or accelerator,
> > > even if the GPU might be busy using those pages. The device promptly
> > > causes a device page fault, and the driver then re-establishes the
> > > device page table mapping, which is good until the next round of
> > > unmapping from the NUMA balancer.
> > >
> > > hmm_range_fault()-based memory management in particular might benefit
> > > from having NUMA balancing disabled entirely for the memremap_pages()
> > > region, come to think of it. That seems relatively easy and clean at
> > > first glance anyway.
> > >
> > > For other regions (allocated by the device driver), a per-VMA flag
> > > seems about right: VM_NO_NUMA_BALANCING ?
> > >
> > Thanks a lot for those good suggestions!
> > For VMs, when could a per-VMA flag be set?
> > Might be hard in mmap() in QEMU because a VMA may not be used for DMA until
> > after it's mapped into VFIO.
> > Then, should VFIO set this flag on after it maps a range?
> > Could this flag be unset after device hot-unplug?
> 
> Emm... syscall madvise() in my mind, it does things like change flags
> on VMA, e.g madvise(MADV_DONTFORK) adds VM_DONTCOPY to the VMA.
Yes, madvise() might work.
Setting this flag might be an easy decision, while unsetting it might be
hard unless some counters are introduced.
David Hildenbrand Aug. 16, 2023, 7:43 a.m. UTC | #13
On 15.08.23 04:34, John Hubbard wrote:
> On 8/14/23 02:09, Yan Zhao wrote:
> ...
>>> hmm_range_fault()-based memory management in particular might benefit
>>> from having NUMA balancing disabled entirely for the memremap_pages()
>>> region, come to think of it. That seems relatively easy and clean at
>>> first glance anyway.
>>>
>>> For other regions (allocated by the device driver), a per-VMA flag
>>> seems about right: VM_NO_NUMA_BALANCING ?
>>>
>> Thanks a lot for those good suggestions!
>> For VMs, when could a per-VMA flag be set?
>> Might be hard in mmap() in QEMU because a VMA may not be used for DMA until
>> after it's mapped into VFIO.
>> Then, should VFIO set this flag on after it maps a range?
>> Could this flag be unset after device hot-unplug?
>>
> 
> I'm hoping someone who thinks about VMs and VFIO often can chime in.

At least QEMU could just set it on the applicable VMAs (as said by Yuan 
Yao, using madvise).

BUT, I do wonder what value there would be for autonuma to still be 
active for the remainder of the hypervisor. If there is none, a prctl() 
would be better.

We already do have a mechanism in QEMU to get notified when 
longterm-pinning in the kernel might happen (and, therefore, 
MADV_DONTNEED must not be used):
* ram_block_discard_disable()
* ram_block_uncoordinated_discard_disable()
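
If we went the prctl() route, the QEMU-side hook could be as simple as
(hypothetical sketch; PR_SET_NUMA_BALANCING is not an existing prctl()
option as far as I know):

	/* toggle NUMA hinting for the whole QEMU process */
	prctl(PR_SET_NUMA_BALANCING, 0 /* disable */, 0, 0, 0);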
Yan Zhao Aug. 16, 2023, 9:06 a.m. UTC | #14
On Wed, Aug 16, 2023 at 09:43:40AM +0200, David Hildenbrand wrote:
> On 15.08.23 04:34, John Hubbard wrote:
> > On 8/14/23 02:09, Yan Zhao wrote:
> > ...
> > > > hmm_range_fault()-based memory management in particular might benefit
> > > > from having NUMA balancing disabled entirely for the memremap_pages()
> > > > region, come to think of it. That seems relatively easy and clean at
> > > > first glance anyway.
> > > > 
> > > > For other regions (allocated by the device driver), a per-VMA flag
> > > > seems about right: VM_NO_NUMA_BALANCING ?
> > > > 
> > > Thanks a lot for those good suggestions!
> > > For VMs, when could a per-VMA flag be set?
> > > Might be hard in mmap() in QEMU because a VMA may not be used for DMA until
> > > after it's mapped into VFIO.
> > > Then, should VFIO set this flag on after it maps a range?
> > > Could this flag be unset after device hot-unplug?
> > > 
> > 
> > I'm hoping someone who thinks about VMs and VFIO often can chime in.
> 
> At least QEMU could just set it on the applicable VMAs (as said by Yuan Yao,
> using madvise).
> 
> BUT, I do wonder what value there would be for autonuma to still be active
Currently MADV_* values go up to 25
	#define MADV_COLLAPSE   25
and the madvise behavior argument is of type "int", so that side is fine.

But vma->vm_flags is an "unsigned long", so its flag bits are already
exhausted, at least on 32-bit platforms.

> for the remainder of the hypervisor. If there is none, a prctl() would be
> better.
Add a new field to "struct vma_numab_state" in the vma, and use prctl() to
update this field?

e.g.
struct vma_numab_state {
        unsigned long next_scan;
        unsigned long next_pid_reset;
        unsigned long access_pids[2];
	bool no_scan;
};

> 
> We already do have a mechanism in QEMU to get notified when longterm-pinning
> in the kernel might happen (and, therefore, MADV_DONTNEED must not be used):
> * ram_block_discard_disable()
> * ram_block_uncoordinated_discard_disable()
It looks like this ram_block_discard allow/disallow state is global rather
than per-VMA in QEMU.
So, do you mean that the kernel should provide a per-VMA allow/disallow
mechanism, and it's up to user space to choose between the per-VMA (more
complex) way and the global (simpler) way?
David Hildenbrand Aug. 16, 2023, 9:49 a.m. UTC | #15
On 16.08.23 11:06, Yan Zhao wrote:
> On Wed, Aug 16, 2023 at 09:43:40AM +0200, David Hildenbrand wrote:
>> On 15.08.23 04:34, John Hubbard wrote:
>>> On 8/14/23 02:09, Yan Zhao wrote:
>>> ...
>>>>> hmm_range_fault()-based memory management in particular might benefit
>>>>> from having NUMA balancing disabled entirely for the memremap_pages()
>>>>> region, come to think of it. That seems relatively easy and clean at
>>>>> first glance anyway.
>>>>>
>>>>> For other regions (allocated by the device driver), a per-VMA flag
>>>>> seems about right: VM_NO_NUMA_BALANCING ?
>>>>>
>>>> Thanks a lot for those good suggestions!
>>>> For VMs, when could a per-VMA flag be set?
>>>> Might be hard in mmap() in QEMU because a VMA may not be used for DMA until
>>>> after it's mapped into VFIO.
>>>> Then, should VFIO set this flag on after it maps a range?
>>>> Could this flag be unset after device hot-unplug?
>>>>
>>>
>>> I'm hoping someone who thinks about VMs and VFIO often can chime in.
>>
>> At least QEMU could just set it on the applicable VMAs (as said by Yuan Yao,
>> using madvise).
>>
>> BUT, I do wonder what value there would be for autonuma to still be active
> Currently MADV_* is up to 25
> 	#define MADV_COLLAPSE   25,
> while madvise behavior is of type "int". So it's ok.
> 
> But vma->vm_flags is of "unsigned long", so it's full at least on 32bit platform.

I remember there were discussions to increase it also for 32bit. If 
that's required, we might want to go down that path.

But do 32bit architectures even care about NUMA hinting? If not, just 
ignore them ...

> 
>> for the remainder of the hypervisor. If there is none, a prctl() would be
>> better.
> Add a new field in "struct vma_numab_state" in vma, and use prctl() to
> update this field?

Rather a global toggle per MM, no need to update individual VMAs -- if 
we go down that prctl() path.

No need to consume more memory for VMAs.

[...]

>> We already do have a mechanism in QEMU to get notified when longterm-pinning
>> in the kernel might happen (and, therefore, MADV_DONTNEED must not be used):
>> * ram_block_discard_disable()
>> * ram_block_uncoordinated_discard_disable()
> Looks this ram_block_discard allow/disallow state is global rather than per-VMA
> in QEMU.

Yes. Once you transition into "discard of any kind disabled", you can go 
over all guest memory VMAs (RAMBlock) and issue an madvise() for them. 
(or alternatively, do the prctl() once )

We'll also have to handle new guest memory being created afterwards, but 
that is easy.

Once we transition to "no discarding disabled", you can go over all 
guest memory VMAs (RAMBlock) and issue an madvise() for them again (or 
alternatively, do the prctl() once).

> So, do you mean that let kernel provide a per-VMA allow/disallow mechanism, and
> it's up to the user space to choose between per-VMA and complex way or
> global and simpler way?

QEMU could do either way. The question would be if a per-VMA setting
makes sense for NUMA hinting.
John Hubbard Aug. 16, 2023, 6 p.m. UTC | #16
On 8/16/23 02:49, David Hildenbrand wrote:
> But do 32bit architectures even care about NUMA hinting? If not, just 
> ignore them ...

Probably not!

...
>> So, do you mean that let kernel provide a per-VMA allow/disallow 
>> mechanism, and
>> it's up to the user space to choose between per-VMA and complex way or
>> global and simpler way?
> 
> QEMU could do either way. The question would be if a per-vma settings 
> makes sense for NUMA hinting.

From our experience with compute on GPUs, a per-mm setting would suffice.
No need to go all the way to VMA granularity.


thanks,
Yan Zhao Aug. 17, 2023, 5:05 a.m. UTC | #17
On Wed, Aug 16, 2023 at 11:00:36AM -0700, John Hubbard wrote:
> On 8/16/23 02:49, David Hildenbrand wrote:
> > But do 32bit architectures even care about NUMA hinting? If not, just
> > ignore them ...
> 
> Probably not!
> 
> ...
> > > So, do you mean that let kernel provide a per-VMA allow/disallow
> > > mechanism, and
> > > it's up to the user space to choose between per-VMA and complex way or
> > > global and simpler way?
> > 
> > QEMU could do either way. The question would be if a per-vma settings
> > makes sense for NUMA hinting.
> 
> From our experience with compute on GPUs, a per-mm setting would suffice.
> No need to go all the way to VMA granularity.
> 
After an offline internal discussion, we think a per-mm setting is also
enough for device passthrough in VMs.

BTW, if we want a per-VMA flag, do you think there is any value in
providing a flag like VM_MAYDMA instead of VM_NO_NUMA_BALANCING?
Auto NUMA balancing or other components could then decide how to use it
themselves.
David Hildenbrand Aug. 17, 2023, 7:38 a.m. UTC | #18
On 17.08.23 07:05, Yan Zhao wrote:
> On Wed, Aug 16, 2023 at 11:00:36AM -0700, John Hubbard wrote:
>> On 8/16/23 02:49, David Hildenbrand wrote:
>>> But do 32bit architectures even care about NUMA hinting? If not, just
>>> ignore them ...
>>
>> Probably not!
>>
>> ...
>>>> So, do you mean that let kernel provide a per-VMA allow/disallow
>>>> mechanism, and
>>>> it's up to the user space to choose between per-VMA and complex way or
>>>> global and simpler way?
>>>
>>> QEMU could do either way. The question would be if a per-vma settings
>>> makes sense for NUMA hinting.
>>
>>  From our experience with compute on GPUs, a per-mm setting would suffice.
>> No need to go all the way to VMA granularity.
>>
> After an offline internal discussion, we think a per-mm setting is also
> enough for device passthrough in VMs.
> 
> BTW, if we want a per-VMA flag, compared to VM_NO_NUMA_BALANCING, do you
> think it's of any value to providing a flag like VM_MAYDMA?
> Auto NUMA balancing or other components can decide how to use it by
> themselves.

Short-lived DMA is not really the problem. The problem is long-term pinning.

There was a discussion about letting user space similarly hint that 
long-term pinning might/will happen.

Because when long-term pinning a page we have to make sure to migrate it 
off of ZONE_MOVABLE / MIGRATE_CMA.

But the kernel prefers to place pages there.

So with vfio in QEMU, we might preallocate memory for the guest and 
place it on ZONE_MOVABLE/MIGRATE_CMA, just so long-term pinning has to 
migrate all these fresh pages out of these areas again.

So letting the kernel know about that in this context might also help.
Yan Zhao Aug. 18, 2023, 12:13 a.m. UTC | #19
On Thu, Aug 17, 2023 at 09:38:37AM +0200, David Hildenbrand wrote:
> On 17.08.23 07:05, Yan Zhao wrote:
> > On Wed, Aug 16, 2023 at 11:00:36AM -0700, John Hubbard wrote:
> > > On 8/16/23 02:49, David Hildenbrand wrote:
> > > > But do 32bit architectures even care about NUMA hinting? If not, just
> > > > ignore them ...
> > > 
> > > Probably not!
> > > 
> > > ...
> > > > > So, do you mean that let kernel provide a per-VMA allow/disallow
> > > > > mechanism, and
> > > > > it's up to the user space to choose between per-VMA and complex way or
> > > > > global and simpler way?
> > > > 
> > > > QEMU could do either way. The question would be if a per-vma settings
> > > > makes sense for NUMA hinting.
> > > 
> > >  From our experience with compute on GPUs, a per-mm setting would suffice.
> > > No need to go all the way to VMA granularity.
> > > 
> > After an offline internal discussion, we think a per-mm setting is also
> > enough for device passthrough in VMs.
> > 
> > BTW, if we want a per-VMA flag, compared to VM_NO_NUMA_BALANCING, do you
> > think it's of any value to providing a flag like VM_MAYDMA?
> > Auto NUMA balancing or other components can decide how to use it by
> > themselves.
> 
> Short-lived DMA is not really the problem. The problem is long-term pinning.
> 
> There was a discussion about letting user space similarly hint that
> long-term pinning might/will happen.
> 
> Because when long-term pinning a page we have to make sure to migrate it off
> of ZONE_MOVABLE / MIGRATE_CMA.
> 
> But the kernel prefers to place pages there.
> 
> So with vfio in QEMU, we might preallocate memory for the guest and place it
> on ZONE_MOVABLE/MIGRATE_CMA, just so long-term pinning has to migrate all
> these fresh pages out of these areas again.
> 
> So letting the kernel know about that in this context might also help.
>
Thanks! Glad to know that :)
But considering the GPU case John mentioned, where the memory is not even
pinned, maybe they would still need a flag like VM_NO_NUMA_BALANCING?
For VMs, we could hint VM_NO_NUMA_BALANCING for passthrough devices
supporting IO page faults (so no need to pin), and VM_MAYLONGTERMDMA to
avoid misplacement and migration.

Is that good?
Or do you think just a per-mm flag like MMF_NO_NUMA is good enough for
now?
John Hubbard Aug. 18, 2023, 2:29 a.m. UTC | #20
On 8/17/23 17:13, Yan Zhao wrote:
...
> But consider for GPUs case as what John mentioned, since the memory is
> not even pinned, maybe they still need flag VM_NO_NUMA_BALANCING ?
> For VMs, we hint VM_NO_NUMA_BALANCING for passthrough devices supporting
> IO page fault (so no need to pin), and VM_MAYLONGTERMDMA to avoid misplace
> and migration.
> 
> Is that good?
> Or do you think just a per-mm flag like MMF_NO_NUMA is good enough for
> now?
> 

So far, a per-mm setting seems like it would suffice. However, it is
also true that new hardware is getting really creative and large, to
the point that it's not inconceivable that a process might actually
want to let NUMA balancing run in part of its mm, but turn it off
to allow fault-able device access to another part of the mm.

We aren't seeing that yet, but on the other hand, that may be
simply because there is no practical way to set that up and see
how well it works.


thanks,
Yan Zhao Sept. 4, 2023, 9:18 a.m. UTC | #21
On Thu, Aug 17, 2023 at 07:29:12PM -0700, John Hubbard wrote:
> On 8/17/23 17:13, Yan Zhao wrote:
> ...
> > But consider for GPUs case as what John mentioned, since the memory is
> > not even pinned, maybe they still need flag VM_NO_NUMA_BALANCING ?
> > For VMs, we hint VM_NO_NUMA_BALANCING for passthrough devices supporting
> > IO page fault (so no need to pin), and VM_MAYLONGTERMDMA to avoid misplace
> > and migration.
> > 
> > Is that good?
> > Or do you think just a per-mm flag like MMF_NO_NUMA is good enough for
> > now?
> > 
> 
> So far, a per-mm setting seems like it would suffice. However, it is
> also true that new hardware is getting really creative and large, to
> the point that it's not inconceivable that a process might actually
> want to let NUMA balancing run in part of its mm, but turn it off
> to allow fault-able device access to another part of the mm.
> 
> We aren't seeing that yet, but on the other hand, that may be
> simply because there is no practical way to set that up and see
> how well it works.
> 
>
Hi guys,
Thanks a lot for your review and suggestions!
I'll first try to add a per-mm flag to fix this problem later
(but maybe not very soon).

Thanks
Yan