Message ID | 20230810085636.25914-1-yan.y.zhao@intel.com (mailing list archive)
---|---
Series | Reduce NUMA balance caused TLB-shootdowns in a VM
On 10.08.23 10:56, Yan Zhao wrote: > This is an RFC series trying to fix the issue of unnecessary NUMA > protection and TLB-shootdowns found in VMs with assigned devices or VFIO > mediated devices during NUMA balance. > > For VMs with assigned devices or VFIO mediated devices, all or part of > guest memory are pinned for long-term. > > Auto NUMA balancing will periodically selects VMAs of a process and change > protections to PROT_NONE even though some or all pages in the selected > ranges are long-term pinned for DMAs, which is true for VMs with assigned > devices or VFIO mediated devices. > > Though this will not cause real problem because NUMA migration will > ultimately reject migration of those kind of pages and restore those > PROT_NONE PTEs, it causes KVM's secondary MMU to be zapped periodically > with equal SPTEs finally faulted back, wasting CPU cycles and generating > unnecessary TLB-shootdowns. > > This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1 > to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that > the subscriber (e.g.KVM) of the mmu notifier can know that an invalidation > event is sent for NUMA migration purpose in specific. > > Patch 2 skips setting PROT_NONE to long-term pinned pages in the primary > MMU to avoid NUMA protection introduced page faults and restoration of old > huge PMDs/PTEs in primary MMU. > > Patch 3 introduces a new mmu notifier callback .numa_protect(), which > will be called in patch 4 when a page is ensured to be PROT_NONE protected. > > Then in patch 5, KVM can recognize a .invalidate_range_start() notification > is for NUMA balancing specific and do not do the page unmap in secondary > MMU until .numa_protect() comes. > Why do we need all that, when we should simply not be applying PROT_NONE to pinned pages? In change_pte_range() we already have: if (is_cow_mapping(vma->vm_flags) && page_count(page) != 1) Which includes both, shared and pinned pages. Staring at page #2, are we still missing something similar for THPs? Why is that MMU notifier thingy and touching KVM code required?
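For reference, a rough kernel-context sketch of the check being discussed. The is_cow_mapping()/refcount test mirrors the existing change_pte_range() logic quoted above; the folio_maybe_dma_pinned() check is only an assumption about the direction patch 2 takes, not mainline code.

    #include <linux/mm.h>

    /* Sketch: should the NUMA-hinting protection pass leave this page alone? */
    static bool prot_numa_should_skip(struct vm_area_struct *vma, struct page *page)
    {
            struct folio *folio;

            if (!page || is_zone_device_page(page))
                    return true;

            folio = page_folio(page);

            /* Existing upstream check: shared or pinned pages in COW mappings. */
            if (is_cow_mapping(vma->vm_flags) && folio_ref_count(folio) != 1)
                    return true;

            /*
             * Assumed direction of patch 2 (not mainline): also skip pages that
             * look long-term pinned (e.g. by VFIO), regardless of mapping type.
             */
            if (folio_maybe_dma_pinned(folio))
                    return true;

            return false;
    }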
On Thu, Aug 10, 2023 at 11:34:07AM +0200, David Hildenbrand wrote: > > This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1 > > to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that > > the subscriber (e.g.KVM) of the mmu notifier can know that an invalidation > > event is sent for NUMA migration purpose in specific. > > > > Patch 2 skips setting PROT_NONE to long-term pinned pages in the primary > > MMU to avoid NUMA protection introduced page faults and restoration of old > > huge PMDs/PTEs in primary MMU. > > > > Patch 3 introduces a new mmu notifier callback .numa_protect(), which > > will be called in patch 4 when a page is ensured to be PROT_NONE protected. > > > > Then in patch 5, KVM can recognize a .invalidate_range_start() notification > > is for NUMA balancing specific and do not do the page unmap in secondary > > MMU until .numa_protect() comes. > > > > Why do we need all that, when we should simply not be applying PROT_NONE to > pinned pages? > > In change_pte_range() we already have: > > if (is_cow_mapping(vma->vm_flags) && > page_count(page) != 1) > > Which includes both, shared and pinned pages. Ah, right. Currently, on my side, I don't see any pinned pages outside of this condition. But I have a question regarding is_cow_mapping(vma->vm_flags): do we need to allow pinned pages in !is_cow_mapping(vma->vm_flags)? > Staring at page #2, are we still missing something similar for THPs? Yes. > Why is that MMU notifier thingy and touching KVM code required? Because the NUMA balancing code will first send .invalidate_range_start() with event type MMU_NOTIFY_PROTECTION_VMA to KVM in change_pmd_range() unconditionally, before it goes down into change_pte_range() and change_huge_pmd() to check each page count and apply PROT_NONE. Then current KVM will unmap all notified pages from the secondary MMU in .invalidate_range_start(), which could include pages that are ultimately not set to PROT_NONE in the primary MMU. For VMs with pass-through devices, though all guest pages are pinned, KVM still periodically unmaps pages in response to the .invalidate_range_start() notification from auto NUMA balancing, which is a waste. So, if there's a new callback sent when a page is set to PROT_NONE for NUMA migration only, KVM can unmap only those pages. As KVM still needs to unmap pages for other types of events in its .invalidate_range_start() handler (i.e. kvm_mmu_notifier_invalidate_range_start()), and MMU_NOTIFY_PROTECTION_VMA also covers other reasons, patch 1 adds a range flag to help KVM avoid a blind unmap in .invalidate_range_start() and instead do it in the new .numa_protect() handler. > > -- > Cheers, > > David / dhildenb > >
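To make the KVM side concrete, a rough reconstruction of the idea in patches 1 and 5 (a sketch, not the actual patch code; MMU_NOTIFIER_RANGE_NUMA and the .numa_protect() deferral exist only within this RFC):

    /* Sketch of the deferral in KVM's invalidate_range_start handler. */
    static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
                                    const struct mmu_notifier_range *range)
    {
            /*
             * A NUMA-balancing protection pass: don't blindly unmap the whole
             * range; wait for .numa_protect() on the pages that really end up
             * PROT_NONE in the primary MMU.
             */
            if (range->event == MMU_NOTIFY_PROTECTION_VMA &&
                (range->flags & MMU_NOTIFIER_RANGE_NUMA))
                    return 0;

            /* ... existing zap of [range->start, range->end) from the secondary MMU ... */
            return 0;
    }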
On Thu, Aug 10, 2023 at 04:56:36PM +0800, Yan Zhao wrote: >This is an RFC series trying to fix the issue of unnecessary NUMA >protection and TLB-shootdowns found in VMs with assigned devices or VFIO >mediated devices during NUMA balance. > >For VMs with assigned devices or VFIO mediated devices, all or part of >guest memory are pinned for long-term. > >Auto NUMA balancing will periodically selects VMAs of a process and change >protections to PROT_NONE even though some or all pages in the selected >ranges are long-term pinned for DMAs, which is true for VMs with assigned >devices or VFIO mediated devices. > >Though this will not cause real problem because NUMA migration will >ultimately reject migration of those kind of pages and restore those >PROT_NONE PTEs, it causes KVM's secondary MMU to be zapped periodically >with equal SPTEs finally faulted back, wasting CPU cycles and generating >unnecessary TLB-shootdowns. In my understanding, NUMA balancing also moves tasks closer to the memory they are accessing. Can this still work with this series applied? > >This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1 >to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that >the subscriber (e.g.KVM) of the mmu notifier can know that an invalidation >event is sent for NUMA migration purpose in specific. > >Patch 2 skips setting PROT_NONE to long-term pinned pages in the primary >MMU to avoid NUMA protection introduced page faults and restoration of old >huge PMDs/PTEs in primary MMU. > >Patch 3 introduces a new mmu notifier callback .numa_protect(), which >will be called in patch 4 when a page is ensured to be PROT_NONE protected. > >Then in patch 5, KVM can recognize a .invalidate_range_start() notification >is for NUMA balancing specific and do not do the page unmap in secondary >MMU until .numa_protect() comes. > > >Changelog: >RFC v1 --> v2: >1. added patch 3-4 to introduce a new callback .numa_protect() >2. Rather than have KVM duplicate logic to check if a page is pinned for >long-term, let KVM depend on the new callback .numa_protect() to do the >page unmap in secondary MMU for NUMA migration purpose. > >RFC v1: >https://lore.kernel.org/all/20230808071329.19995-1-yan.y.zhao@intel.com/ > >Yan Zhao (5): > mm/mmu_notifier: introduce a new mmu notifier flag > MMU_NOTIFIER_RANGE_NUMA > mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate > purpose > mm/mmu_notifier: introduce a new callback .numa_protect > mm/autonuma: call .numa_protect() when page is protected for NUMA > migrate > KVM: Unmap pages only when it's indeed protected for NUMA migration > > include/linux/mmu_notifier.h | 16 ++++++++++++++++ > mm/huge_memory.c | 6 ++++++ > mm/mmu_notifier.c | 18 ++++++++++++++++++ > mm/mprotect.c | 10 +++++++++- > virt/kvm/kvm_main.c | 25 ++++++++++++++++++++++--- > 5 files changed, 71 insertions(+), 4 deletions(-) > >-- >2.17.1 >
On Thu, Aug 10, 2023 at 09:58:43PM +0800, Chao Gao wrote: > On Thu, Aug 10, 2023 at 04:56:36PM +0800, Yan Zhao wrote: > >This is an RFC series trying to fix the issue of unnecessary NUMA > >protection and TLB-shootdowns found in VMs with assigned devices or VFIO > >mediated devices during NUMA balance. > > > >For VMs with assigned devices or VFIO mediated devices, all or part of > >guest memory are pinned for long-term. > > > >Auto NUMA balancing will periodically selects VMAs of a process and change > >protections to PROT_NONE even though some or all pages in the selected > >ranges are long-term pinned for DMAs, which is true for VMs with assigned > >devices or VFIO mediated devices. > > > >Though this will not cause real problem because NUMA migration will > >ultimately reject migration of those kind of pages and restore those > >PROT_NONE PTEs, it causes KVM's secondary MMU to be zapped periodically > >with equal SPTEs finally faulted back, wasting CPU cycles and generating > >unnecessary TLB-shootdowns. > > In my understanding, NUMA balancing also moves tasks closer to the memory > they are accessing. Can this still work with this series applied? > For pages protected with PROT_NONE in the primary MMU in the scanning phase, yes; for pages not set to PROT_NONE, no. That is because task_numa_migrate() appears to be triggered only by the next page fault, when a PROT_NONE PTE in an accessible VMA is found.
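In other words, the task-placement side is driven by the hinting faults themselves; roughly (simplified call chain):

    /*
     * Simplified call chain: task placement is driven by the hinting fault,
     * so a page that never gets PROT_NONE never feeds these statistics.
     *
     *   handle_pte_fault()
     *     -> do_numa_page()             reached only via a PROT_NONE hinting fault
     *        -> task_numa_fault()       records locality stats for the scheduler
     *           -> numa_migrate_preferred()
     *              -> task_numa_migrate()   may move the task closer to its memory
     */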
On 10.08.23 11:50, Yan Zhao wrote: > On Thu, Aug 10, 2023 at 11:34:07AM +0200, David Hildenbrand wrote: >>> This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1 >>> to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that >>> the subscriber (e.g.KVM) of the mmu notifier can know that an invalidation >>> event is sent for NUMA migration purpose in specific. >>> >>> Patch 2 skips setting PROT_NONE to long-term pinned pages in the primary >>> MMU to avoid NUMA protection introduced page faults and restoration of old >>> huge PMDs/PTEs in primary MMU. >>> >>> Patch 3 introduces a new mmu notifier callback .numa_protect(), which >>> will be called in patch 4 when a page is ensured to be PROT_NONE protected. >>> >>> Then in patch 5, KVM can recognize a .invalidate_range_start() notification >>> is for NUMA balancing specific and do not do the page unmap in secondary >>> MMU until .numa_protect() comes. >>> >> >> Why do we need all that, when we should simply not be applying PROT_NONE to >> pinned pages? >> >> In change_pte_range() we already have: >> >> if (is_cow_mapping(vma->vm_flags) && >> page_count(page) != 1) >> >> Which includes both, shared and pinned pages. > Ah, right, currently in my side, I don't see any pinned pages are > outside of this condition. > But I have a question regarding to is_cow_mapping(vma->vm_flags), do we > need to allow pinned pages in !is_cow_mapping(vma->vm_flags)? One issue is that folio_maybe_pinned...() ... is unreliable as soon as your page is mapped more than 1024 times. One might argue that we also want to exclude pages that are mapped that often. That might possibly work. > >> Staring at page #2, are we still missing something similar for THPs? > Yes. > >> Why is that MMU notifier thingy and touching KVM code required? > Because NUMA balancing code will firstly send .invalidate_range_start() with > event type MMU_NOTIFY_PROTECTION_VMA to KVM in change_pmd_range() > unconditionally, before it goes down into change_pte_range() and > change_huge_pmd() to check each page count and apply PROT_NONE. Ah, okay I see, thanks. That's indeed unfortunate. > > Then current KVM will unmap all notified pages from secondary MMU > in .invalidate_range_start(), which could include pages that finally not > set to PROT_NONE in primary MMU. > > For VMs with pass-through devices, though all guest pages are pinned, > KVM still periodically unmap pages in response to the > .invalidate_range_start() notification from auto NUMA balancing, which > is a waste. Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are similar issues with GPU memory: NUMA hinting is actually counter-productive and they end up disabling it.
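For context, the heuristic behind that 1024 figure, simplified from the small-folio case of folio_maybe_dma_pinned() in include/linux/mm.h (the large-folio case tracks pins separately and is not shown):

    /*
     * pin_user_pages*() adds GUP_PIN_COUNTING_BIAS (1024) to the refcount of a
     * small folio, so "maybe pinned" is inferred from the refcount alone.  A
     * page mapped into ~1024 or more page tables therefore looks pinned even
     * when it is not -- the false positive mentioned above.
     */
    static inline bool folio_maybe_dma_pinned_small(struct folio *folio)
    {
            return folio_ref_count(folio) >= GUP_PIN_COUNTING_BIAS;
    }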
On 8/11/23 10:25, David Hildenbrand wrote: ... > One issue is that folio_maybe_pinned...() ... is unreliable as soon as your page is mapped more than 1024 times. > > One might argue that we also want to exclude pages that are mapped that often. That might possibly work. Yes. >> >>> Staring at page #2, are we still missing something similar for THPs? >> Yes. >> >>> Why is that MMU notifier thingy and touching KVM code required? >> Because NUMA balancing code will firstly send .invalidate_range_start() with >> event type MMU_NOTIFY_PROTECTION_VMA to KVM in change_pmd_range() >> unconditionally, before it goes down into change_pte_range() and >> change_huge_pmd() to check each page count and apply PROT_NONE. > > Ah, okay I see, thanks. That's indeed unfortunate. Sigh. All this difficulty reminds me that this mechanism was created in the early days of NUMA. I wonder sometimes lately whether the cost, in complexity and CPU time, is still worth it on today's hardware. But of course I am deeply biased, so don't take that too seriously. See below. :) > >> >> Then current KVM will unmap all notified pages from secondary MMU >> in .invalidate_range_start(), which could include pages that finally not >> set to PROT_NONE in primary MMU. >> >> For VMs with pass-through devices, though all guest pages are pinned, >> KVM still periodically unmap pages in response to the >> .invalidate_range_start() notification from auto NUMA balancing, which >> is a waste. > > Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are similar issues with GPU memory: NUMA hinting is actually counter-productive and they end up disabling it. > Yes, NUMA balancing is incredibly harmful to performance, for GPU and accelerators that map memory...and VMs as well, it seems. Basically, anything that has its own processors and page tables needs to be left strictly alone by NUMA balancing. Because the kernel is (still, even today) unaware of what those processors are doing, and so it has no way to do productive NUMA balancing. thanks,
>> Ah, okay I see, thanks. That's indeed unfortunate. > > Sigh. All this difficulty reminds me that this mechanism was created in > the early days of NUMA. I wonder sometimes lately whether the cost, in > complexity and CPU time, is still worth it on today's hardware. > > But of course I am deeply biased, so don't take that too seriously. > See below. :) :) >> >>> >>> Then current KVM will unmap all notified pages from secondary MMU >>> in .invalidate_range_start(), which could include pages that finally not >>> set to PROT_NONE in primary MMU. >>> >>> For VMs with pass-through devices, though all guest pages are pinned, >>> KVM still periodically unmap pages in response to the >>> .invalidate_range_start() notification from auto NUMA balancing, which >>> is a waste. >> >> Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are > similar issues with GPU memory: NUMA hinting is actually counter-productive and they end up disabling it. >> > > Yes, NUMA balancing is incredibly harmful to performance, for GPU and > accelerators that map memory...and VMs as well, it seems. Basically, > anything that has its own processors and page tables needs to be left > strictly alone by NUMA balancing. Because the kernel is (still, even > today) unaware of what those processors are doing, and so it has no way > to do productive NUMA balancing. Is there any existing way we could handle that better on a per-VMA level, or on the process level? Any magic toggles? MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might be better, but with things like iouring still too restrictive eventually. I recall that setting a mempolicy could prevent auto-numa from getting active, but that might be undesired. CCing Mel.
On 8/11/23 11:39, David Hildenbrand wrote: ... >>> Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are >> similar issues with GPU memory: NUMA hinting is actually counter-productive and they end up disabling it. >>> >> >> Yes, NUMA balancing is incredibly harmful to performance, for GPU and >> accelerators that map memory...and VMs as well, it seems. Basically, >> anything that has its own processors and page tables needs to be left >> strictly alone by NUMA balancing. Because the kernel is (still, even >> today) unaware of what those processors are doing, and so it has no way >> to do productive NUMA balancing. > > Is there any existing way we could handle that better on a per-VMA level, or on the process level? Any magic toggles? > > MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might be better, but with things like iouring still too restrictive eventually. > > I recall that setting a mempolicy could prevent auto-numa from getting active, but that might be undesired. > > CCing Mel. > Let's discern between page pinning situations, and HMM-style situations. Page pinning of CPU memory is unnecessary when setting up for using that memory by modern GPUs or accelerators, because the latter can handle replayable page faults. So for such cases, the pages are in use by a GPU or accelerator, but unpinned. The performance problem occurs because for those pages, the NUMA balancing causes unmapping, which generates callbacks to the device driver, which dutifully unmaps the pages from the GPU or accelerator, even if the GPU might be busy using those pages. The device promptly causes a device page fault, and the driver then re-establishes the device page table mapping, which is good until the next round of unmapping from the NUMA balancer. hmm_range_fault()-based memory management in particular might benefit from having NUMA balancing disabled entirely for the memremap_pages() region, come to think of it. That seems relatively easy and clean at first glance anyway. For other regions (allocated by the device driver), a per-VMA flag seems about right: VM_NO_NUMA_BALANCING ? thanks,
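To make the per-VMA variant concrete, a minimal sketch of where such a flag could be consulted; VM_NO_NUMA_BALANCING and this helper are hypothetical, and the real skip logic lives in task_numa_work():

    /* Hypothetical: where the NUMA scanner already skips unsuitable VMAs. */
    static bool vma_numa_scannable(struct vm_area_struct *vma)
    {
            if (!vma_is_accessible(vma))
                    return false;
            if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP))
                    return false;
            if (vma->vm_flags & VM_NO_NUMA_BALANCING)   /* proposed flag */
                    return false;
            return true;
    }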
On Fri, Aug 11, 2023 at 12:35:27PM -0700, John Hubbard wrote: > On 8/11/23 11:39, David Hildenbrand wrote: > ... > > > > Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are > > > similar issues with GPU memory: NUMA hinting is actually counter-productive and they end up disabling it. > > > > > > > > > > Yes, NUMA balancing is incredibly harmful to performance, for GPU and > > > accelerators that map memory...and VMs as well, it seems. Basically, > > > anything that has its own processors and page tables needs to be left > > > strictly alone by NUMA balancing. Because the kernel is (still, even > > > today) unaware of what those processors are doing, and so it has no way > > > to do productive NUMA balancing. > > > > Is there any existing way we could handle that better on a per-VMA level, or on the process level? Any magic toggles? > > > > MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might be better, but with things like iouring still too restrictive eventually. > > > > I recall that setting a mempolicy could prevent auto-numa from getting active, but that might be undesired. > > > > CCing Mel. > > > > Let's discern between page pinning situations, and HMM-style situations. > Page pinning of CPU memory is unnecessary when setting up for using that > memory by modern GPUs or accelerators, because the latter can handle > replayable page faults. So for such cases, the pages are in use by a GPU > or accelerator, but unpinned. > > The performance problem occurs because for those pages, the NUMA > balancing causes unmapping, which generates callbacks to the device > driver, which dutifully unmaps the pages from the GPU or accelerator, > even if the GPU might be busy using those pages. The device promptly > causes a device page fault, and the driver then re-establishes the > device page table mapping, which is good until the next round of > unmapping from the NUMA balancer. > > hmm_range_fault()-based memory management in particular might benefit > from having NUMA balancing disabled entirely for the memremap_pages() > region, come to think of it. That seems relatively easy and clean at > first glance anyway. > > For other regions (allocated by the device driver), a per-VMA flag > seems about right: VM_NO_NUMA_BALANCING ? > Thanks a lot for those good suggestions! For VMs, when could a per-VMA flag be set? Might be hard in mmap() in QEMU because a VMA may not be used for DMA until after it's mapped into VFIO. Then, should VFIO set this flag on after it maps a range? Could this flag be unset after device hot-unplug?
On 8/14/23 02:09, Yan Zhao wrote: ... >> hmm_range_fault()-based memory management in particular might benefit >> from having NUMA balancing disabled entirely for the memremap_pages() >> region, come to think of it. That seems relatively easy and clean at >> first glance anyway. >> >> For other regions (allocated by the device driver), a per-VMA flag >> seems about right: VM_NO_NUMA_BALANCING ? >> > Thanks a lot for those good suggestions! > For VMs, when could a per-VMA flag be set? > Might be hard in mmap() in QEMU because a VMA may not be used for DMA until > after it's mapped into VFIO. > Then, should VFIO set this flag on after it maps a range? > Could this flag be unset after device hot-unplug? > I'm hoping someone who thinks about VMs and VFIO often can chime in. thanks,
On Mon, Aug 14, 2023 at 05:09:18PM +0800, Yan Zhao wrote: > On Fri, Aug 11, 2023 at 12:35:27PM -0700, John Hubbard wrote: > > On 8/11/23 11:39, David Hildenbrand wrote: > > ... > > > > > Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are > > > > similar issues with GPU memory: NUMA hinting is actually counter-productive and they end up disabling it. > > > > > > > > > > > > > Yes, NUMA balancing is incredibly harmful to performance, for GPU and > > > > accelerators that map memory...and VMs as well, it seems. Basically, > > > > anything that has its own processors and page tables needs to be left > > > > strictly alone by NUMA balancing. Because the kernel is (still, even > > > > today) unaware of what those processors are doing, and so it has no way > > > > to do productive NUMA balancing. > > > > > > Is there any existing way we could handle that better on a per-VMA level, or on the process level? Any magic toggles? > > > > > > MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might be better, but with things like iouring still too restrictive eventually. > > > > > > I recall that setting a mempolicy could prevent auto-numa from getting active, but that might be undesired. > > > > > > CCing Mel. > > > > > > > Let's discern between page pinning situations, and HMM-style situations. > > Page pinning of CPU memory is unnecessary when setting up for using that > > memory by modern GPUs or accelerators, because the latter can handle > > replayable page faults. So for such cases, the pages are in use by a GPU > > or accelerator, but unpinned. > > > > The performance problem occurs because for those pages, the NUMA > > balancing causes unmapping, which generates callbacks to the device > > driver, which dutifully unmaps the pages from the GPU or accelerator, > > even if the GPU might be busy using those pages. The device promptly > > causes a device page fault, and the driver then re-establishes the > > device page table mapping, which is good until the next round of > > unmapping from the NUMA balancer. > > > > hmm_range_fault()-based memory management in particular might benefit > > from having NUMA balancing disabled entirely for the memremap_pages() > > region, come to think of it. That seems relatively easy and clean at > > first glance anyway. > > > > For other regions (allocated by the device driver), a per-VMA flag > > seems about right: VM_NO_NUMA_BALANCING ? > > > Thanks a lot for those good suggestions! > For VMs, when could a per-VMA flag be set? > Might be hard in mmap() in QEMU because a VMA may not be used for DMA until > after it's mapped into VFIO. > Then, should VFIO set this flag on after it maps a range? > Could this flag be unset after device hot-unplug? Emm... syscall madvise() in my mind, it does things like change flags on VMA, e.g madvise(MADV_DONTFORK) adds VM_DONTCOPY to the VMA. > >
On Tue, Aug 15, 2023 at 10:36:18AM +0800, Yuan Yao wrote: > On Mon, Aug 14, 2023 at 05:09:18PM +0800, Yan Zhao wrote: > > On Fri, Aug 11, 2023 at 12:35:27PM -0700, John Hubbard wrote: > > > On 8/11/23 11:39, David Hildenbrand wrote: > > > ... > > > > > > Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are > > > > > similar issues with GPU memory: NUMA hinting is actually counter-productive and they end up disabling it. > > > > > > > > > > > > > > > > Yes, NUMA balancing is incredibly harmful to performance, for GPU and > > > > > accelerators that map memory...and VMs as well, it seems. Basically, > > > > > anything that has its own processors and page tables needs to be left > > > > > strictly alone by NUMA balancing. Because the kernel is (still, even > > > > > today) unaware of what those processors are doing, and so it has no way > > > > > to do productive NUMA balancing. > > > > > > > > Is there any existing way we could handle that better on a per-VMA level, or on the process level? Any magic toggles? > > > > > > > > MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might be better, but with things like iouring still too restrictive eventually. > > > > > > > > I recall that setting a mempolicy could prevent auto-numa from getting active, but that might be undesired. > > > > > > > > CCing Mel. > > > > > > > > > > Let's discern between page pinning situations, and HMM-style situations. > > > Page pinning of CPU memory is unnecessary when setting up for using that > > > memory by modern GPUs or accelerators, because the latter can handle > > > replayable page faults. So for such cases, the pages are in use by a GPU > > > or accelerator, but unpinned. > > > > > > The performance problem occurs because for those pages, the NUMA > > > balancing causes unmapping, which generates callbacks to the device > > > driver, which dutifully unmaps the pages from the GPU or accelerator, > > > even if the GPU might be busy using those pages. The device promptly > > > causes a device page fault, and the driver then re-establishes the > > > device page table mapping, which is good until the next round of > > > unmapping from the NUMA balancer. > > > > > > hmm_range_fault()-based memory management in particular might benefit > > > from having NUMA balancing disabled entirely for the memremap_pages() > > > region, come to think of it. That seems relatively easy and clean at > > > first glance anyway. > > > > > > For other regions (allocated by the device driver), a per-VMA flag > > > seems about right: VM_NO_NUMA_BALANCING ? > > > > > Thanks a lot for those good suggestions! > > For VMs, when could a per-VMA flag be set? > > Might be hard in mmap() in QEMU because a VMA may not be used for DMA until > > after it's mapped into VFIO. > > Then, should VFIO set this flag on after it maps a range? > > Could this flag be unset after device hot-unplug? > > Emm... syscall madvise() in my mind, it does things like change flags > on VMA, e.g madvise(MADV_DONTFORK) adds VM_DONTCOPY to the VMA. Yes, madvise() might work. And setting this flag might be an easy decision, while unsetting it might be hard unless some counters introduced.
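As a sketch of what that could look like from QEMU/VFIO user space; both advice values below are hypothetical (only MADV_DONTFORK from the analogy above exists today):

    #include <sys/mman.h>
    #include <stddef.h>

    #define MADV_NO_NUMA_BALANCING  26      /* hypothetical advice value */
    #define MADV_NUMA_BALANCING     27      /* hypothetical, for device hot-unplug */

    /* Opt a guest-RAM mapping out of (or back into) NUMA hinting. */
    static int guest_ram_numa_hinting(void *addr, size_t len, int enable)
    {
            return madvise(addr, len,
                           enable ? MADV_NUMA_BALANCING : MADV_NO_NUMA_BALANCING);
    }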
On 15.08.23 04:34, John Hubbard wrote: > On 8/14/23 02:09, Yan Zhao wrote: > ... >>> hmm_range_fault()-based memory management in particular might benefit >>> from having NUMA balancing disabled entirely for the memremap_pages() >>> region, come to think of it. That seems relatively easy and clean at >>> first glance anyway. >>> >>> For other regions (allocated by the device driver), a per-VMA flag >>> seems about right: VM_NO_NUMA_BALANCING ? >>> >> Thanks a lot for those good suggestions! >> For VMs, when could a per-VMA flag be set? >> Might be hard in mmap() in QEMU because a VMA may not be used for DMA until >> after it's mapped into VFIO. >> Then, should VFIO set this flag on after it maps a range? >> Could this flag be unset after device hot-unplug? >> > > I'm hoping someone who thinks about VMs and VFIO often can chime in. At least QEMU could just set it on the applicable VMAs (as said by Yuan Yao, using madvise). BUT, I do wonder what value there would be for autonuma to still be active for the remainder of the hypervisor. If there is none, a prctl() would be better. We already do have a mechanism in QEMU to get notified when longterm-pinning in the kernel might happen (and, therefore, MADV_DONTNEED must not be used): * ram_block_discard_disable() * ram_block_uncoordinated_discard_disable()
On Wed, Aug 16, 2023 at 09:43:40AM +0200, David Hildenbrand wrote: > On 15.08.23 04:34, John Hubbard wrote: > > On 8/14/23 02:09, Yan Zhao wrote: > > ... > > > > hmm_range_fault()-based memory management in particular might benefit > > > > from having NUMA balancing disabled entirely for the memremap_pages() > > > > region, come to think of it. That seems relatively easy and clean at > > > > first glance anyway. > > > > > > > > For other regions (allocated by the device driver), a per-VMA flag > > > > seems about right: VM_NO_NUMA_BALANCING ? > > > > > > > Thanks a lot for those good suggestions! > > > For VMs, when could a per-VMA flag be set? > > > Might be hard in mmap() in QEMU because a VMA may not be used for DMA until > > > after it's mapped into VFIO. > > > Then, should VFIO set this flag on after it maps a range? > > > Could this flag be unset after device hot-unplug? > > > > > > > I'm hoping someone who thinks about VMs and VFIO often can chime in. > > At least QEMU could just set it on the applicable VMAs (as said by Yuan Yao, > using madvise). > > BUT, I do wonder what value there would be for autonuma to still be active Currently MADV_* goes up to 25 (#define MADV_COLLAPSE 25), and the madvise behavior argument is of type "int", so that's OK. But vma->vm_flags is an "unsigned long", so it's already full, at least on 32-bit platforms. > for the remainder of the hypervisor. If there is none, a prctl() would be > better. Add a new field in "struct vma_numab_state" in the vma, and use prctl() to update this field? e.g.

    struct vma_numab_state {
            unsigned long next_scan;
            unsigned long next_pid_reset;
            unsigned long access_pids[2];
            bool no_scan;
    };

> > We already do have a mechanism in QEMU to get notified when longterm-pinning > in the kernel might happen (and, therefore, MADV_DONTNEED must not be used): > * ram_block_discard_disable() > * ram_block_uncoordinated_discard_disable() It looks like this ram_block_discard allow/disallow state is global rather than per-VMA in QEMU. So, do you mean that the kernel should provide a per-VMA allow/disallow mechanism, and it's up to user space to choose between the per-VMA, more complex way and the global, simpler way?
On 16.08.23 11:06, Yan Zhao wrote: > On Wed, Aug 16, 2023 at 09:43:40AM +0200, David Hildenbrand wrote: >> On 15.08.23 04:34, John Hubbard wrote: >>> On 8/14/23 02:09, Yan Zhao wrote: >>> ... >>>>> hmm_range_fault()-based memory management in particular might benefit >>>>> from having NUMA balancing disabled entirely for the memremap_pages() >>>>> region, come to think of it. That seems relatively easy and clean at >>>>> first glance anyway. >>>>> >>>>> For other regions (allocated by the device driver), a per-VMA flag >>>>> seems about right: VM_NO_NUMA_BALANCING ? >>>>> >>>> Thanks a lot for those good suggestions! >>>> For VMs, when could a per-VMA flag be set? >>>> Might be hard in mmap() in QEMU because a VMA may not be used for DMA until >>>> after it's mapped into VFIO. >>>> Then, should VFIO set this flag on after it maps a range? >>>> Could this flag be unset after device hot-unplug? >>>> >>> >>> I'm hoping someone who thinks about VMs and VFIO often can chime in. >> >> At least QEMU could just set it on the applicable VMAs (as said by Yuan Yao, >> using madvise). >> >> BUT, I do wonder what value there would be for autonuma to still be active > Currently MADV_* is up to 25 > #define MADV_COLLAPSE 25, > while madvise behavior is of type "int". So it's ok. > > But vma->vm_flags is of "unsigned long", so it's full at least on 32bit platform. I remember there were discussions to increase it also for 32bit. If that's required, we might want to go down that path. But do 32bit architectures even care about NUMA hinting? If not, just ignore them ... > >> for the remainder of the hypervisor. If there is none, a prctl() would be >> better. > Add a new field in "struct vma_numab_state" in vma, and use prctl() to > update this field? Rather a global toggle per MM, no need to update individual VMAs -- if we go down that prctl() path. No need to consume more memory for VMAs. [...] >> We already do have a mechanism in QEMU to get notified when longterm-pinning >> in the kernel might happen (and, therefore, MADV_DONTNEED must not be used): >> * ram_block_discard_disable() >> * ram_block_uncoordinated_discard_disable() > Looks this ram_block_discard allow/disallow state is global rather than per-VMA > in QEMU. Yes. Once you transition into "discard of any kind disabled", you can go over all guest memory VMAs (RAMBlock) and issue an madvise() for them. (or alternatively, do the prctl() once ) We'll also have to handle new guest memory being created afterwards, but that is easy. Once we transition to "no discarding disabled", you can go over all guest memory VMAs (RAMBlock) and issue an madvise() for them again (or alternatively, do the prctl() once). > So, do you mean that let kernel provide a per-VMA allow/disallow mechanism, and > it's up to the user space to choose between per-VMA and complex way or > global and simpler way? QEMU could do either way. The question would be if a per-vma settings makes sense for NUMA hinting.
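If the per-MM prctl() route were taken, the QEMU side could reduce to a single call on the discard-disable/enable transitions mentioned above; the prctl command below is hypothetical (no such kernel interface exists at the time of this thread):

    #include <sys/prctl.h>

    #define PR_SET_NUMA_BALANCING   1000    /* hypothetical command, placeholder value */

    /* Called from the ram_block_discard_disable()-style transitions in QEMU. */
    static int mm_set_numa_balancing(int enable)
    {
            return prctl(PR_SET_NUMA_BALANCING, enable, 0, 0, 0);
    }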
On 8/16/23 02:49, David Hildenbrand wrote: > But do 32bit architectures even care about NUMA hinting? If not, just > ignore them ... Probably not! ... >> So, do you mean that let kernel provide a per-VMA allow/disallow >> mechanism, and >> it's up to the user space to choose between per-VMA and complex way or >> global and simpler way? > > QEMU could do either way. The question would be if a per-vma settings > makes sense for NUMA hinting. From our experience with compute on GPUs, a per-mm setting would suffice. No need to go all the way to VMA granularity. thanks,
On Wed, Aug 16, 2023 at 11:00:36AM -0700, John Hubbard wrote: > On 8/16/23 02:49, David Hildenbrand wrote: > > But do 32bit architectures even care about NUMA hinting? If not, just > > ignore them ... > > Probably not! > > ... > > > So, do you mean that let kernel provide a per-VMA allow/disallow > > > mechanism, and > > > it's up to the user space to choose between per-VMA and complex way or > > > global and simpler way? > > > > QEMU could do either way. The question would be if a per-vma settings > > makes sense for NUMA hinting. > > From our experience with compute on GPUs, a per-mm setting would suffice. > No need to go all the way to VMA granularity. > After an offline internal discussion, we think a per-mm setting is also enough for device passthrough in VMs. BTW, if we want a per-VMA flag, compared to VM_NO_NUMA_BALANCING, do you think it's of any value to provide a flag like VM_MAYDMA? Auto NUMA balancing or other components can then decide how to use it themselves.
On 17.08.23 07:05, Yan Zhao wrote: > On Wed, Aug 16, 2023 at 11:00:36AM -0700, John Hubbard wrote: >> On 8/16/23 02:49, David Hildenbrand wrote: >>> But do 32bit architectures even care about NUMA hinting? If not, just >>> ignore them ... >> >> Probably not! >> >> ... >>>> So, do you mean that let kernel provide a per-VMA allow/disallow >>>> mechanism, and >>>> it's up to the user space to choose between per-VMA and complex way or >>>> global and simpler way? >>> >>> QEMU could do either way. The question would be if a per-vma settings >>> makes sense for NUMA hinting. >> >> From our experience with compute on GPUs, a per-mm setting would suffice. >> No need to go all the way to VMA granularity. >> > After an offline internal discussion, we think a per-mm setting is also > enough for device passthrough in VMs. > > BTW, if we want a per-VMA flag, compared to VM_NO_NUMA_BALANCING, do you > think it's of any value to providing a flag like VM_MAYDMA? > Auto NUMA balancing or other components can decide how to use it by > themselves. Short-lived DMA is not really the problem. The problem is long-term pinning. There was a discussion about letting user space similarly hint that long-term pinning might/will happen. Because when long-term pinning a page we have to make sure to migrate it off of ZONE_MOVABLE / MIGRATE_CMA. But the kernel prefers to place pages there. So with vfio in QEMU, we might preallocate memory for the guest and place it on ZONE_MOVABLE/MIGRATE_CMA, just so long-term pinning has to migrate all these fresh pages out of these areas again. So letting the kernel know about that in this context might also help.
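For background, a simplified sketch of the constraint being described; the real check lives around folio_is_longterm_pinnable() and the FOLL_LONGTERM path in mm/gup.c:

    /*
     * Long-term pins must not keep pages in ZONE_MOVABLE or on CMA pageblocks,
     * so pin_user_pages(..., FOLL_LONGTERM, ...) migrates such pages first --
     * the churn described above when guest RAM was freshly allocated there.
     */
    static bool longterm_pin_needs_migration(struct folio *folio)
    {
            if (zone_idx(folio_zone(folio)) == ZONE_MOVABLE)
                    return true;
    #ifdef CONFIG_CMA
            if (get_pageblock_migratetype(&folio->page) == MIGRATE_CMA)
                    return true;
    #endif
            return false;
    }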
On Thu, Aug 17, 2023 at 09:38:37AM +0200, David Hildenbrand wrote: > On 17.08.23 07:05, Yan Zhao wrote: > > On Wed, Aug 16, 2023 at 11:00:36AM -0700, John Hubbard wrote: > > > On 8/16/23 02:49, David Hildenbrand wrote: > > > > But do 32bit architectures even care about NUMA hinting? If not, just > > > > ignore them ... > > > > > > Probably not! > > > > > > ... > > > > > So, do you mean that let kernel provide a per-VMA allow/disallow > > > > > mechanism, and > > > > > it's up to the user space to choose between per-VMA and complex way or > > > > > global and simpler way? > > > > > > > > QEMU could do either way. The question would be if a per-vma settings > > > > makes sense for NUMA hinting. > > > > > > From our experience with compute on GPUs, a per-mm setting would suffice. > > > No need to go all the way to VMA granularity. > > > > > After an offline internal discussion, we think a per-mm setting is also > > enough for device passthrough in VMs. > > > > BTW, if we want a per-VMA flag, compared to VM_NO_NUMA_BALANCING, do you > > think it's of any value to providing a flag like VM_MAYDMA? > > Auto NUMA balancing or other components can decide how to use it by > > themselves. > > Short-lived DMA is not really the problem. The problem is long-term pinning. > > There was a discussion about letting user space similarly hint that > long-term pinning might/will happen. > > Because when long-term pinning a page we have to make sure to migrate it off > of ZONE_MOVABLE / MIGRATE_CMA. > > But the kernel prefers to place pages there. > > So with vfio in QEMU, we might preallocate memory for the guest and place it > on ZONE_MOVABLE/MIGRATE_CMA, just so long-term pinning has to migrate all > these fresh pages out of these areas again. > > So letting the kernel know about that in this context might also help. > Thanks! Glad to know it :) But considering the GPU case John mentioned, since the memory is not even pinned, maybe they still need a flag like VM_NO_NUMA_BALANCING? For VMs, we would hint VM_NO_NUMA_BALANCING for passthrough devices supporting IO page faults (so no need to pin), and VM_MAYLONGTERMDMA to avoid misplacement and migration. Is that good? Or do you think just a per-mm flag like MMF_NO_NUMA is good enough for now?
On 8/17/23 17:13, Yan Zhao wrote: ... > But consider for GPUs case as what John mentioned, since the memory is > not even pinned, maybe they still need flag VM_NO_NUMA_BALANCING ? > For VMs, we hint VM_NO_NUMA_BALANCING for passthrough devices supporting > IO page fault (so no need to pin), and VM_MAYLONGTERMDMA to avoid misplace > and migration. > > Is that good? > Or do you think just a per-mm flag like MMF_NO_NUMA is good enough for > now? > So far, a per-mm setting seems like it would suffice. However, it is also true that new hardware is getting really creative and large, to the point that it's not inconceivable that a process might actually want to let NUMA balancing run in part of its mm, but turn it off to allow fault-able device access to another part of the mm. We aren't seeing that yet, but on the other hand, that may be simply because there is no practical way to set that up and see how well it works. thanks,
On Thu, Aug 17, 2023 at 07:29:12PM -0700, John Hubbard wrote: > On 8/17/23 17:13, Yan Zhao wrote: > ... > > But consider for GPUs case as what John mentioned, since the memory is > > not even pinned, maybe they still need flag VM_NO_NUMA_BALANCING ? > > For VMs, we hint VM_NO_NUMA_BALANCING for passthrough devices supporting > > IO page fault (so no need to pin), and VM_MAYLONGTERMDMA to avoid misplace > > and migration. > > > > Is that good? > > Or do you think just a per-mm flag like MMF_NO_NUMA is good enough for > > now? > > > > So far, a per-mm setting seems like it would suffice. However, it is > also true that new hardware is getting really creative and large, to > the point that it's not inconceivable that a process might actually > want to let NUMA balancing run in part of its mm, but turn it off > to allow fault-able device access to another part of the mm. > > We aren't seeing that yet, but on the other hand, that may be > simply because there is no practical way to set that up and see > how well it works. > > Hi guys, Thanks a lot for your review and suggestions! I'll first try to add a per-mm flag to fix this problem later (but maybe not very soon). Thanks, Yan