[v2,00/19] mm: Support huge pfnmaps

Message ID	20240826204353.2228736-1-peterx@redhat.com (mailing list archive)
From: Peter Xu <peterx@redhat.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Gavin Shan <gshan@redhat.com>, Catalin Marinas <catalin.marinas@arm.com>, x86@kernel.org, Ingo Molnar <mingo@redhat.com>, Andrew Morton <akpm@linux-foundation.org>, Paolo Bonzini <pbonzini@redhat.com>, Dave Hansen <dave.hansen@linux.intel.com>, Thomas Gleixner <tglx@linutronix.de>, Alistair Popple <apopple@nvidia.com>, kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org, Sean Christopherson <seanjc@google.com>, peterx@redhat.com, Oscar Salvador <osalvador@suse.de>, Jason Gunthorpe <jgg@nvidia.com>, Borislav Petkov <bp@alien8.de>, Zi Yan <ziy@nvidia.com>, Axel Rasmussen <axelrasmussen@google.com>, David Hildenbrand <david@redhat.com>, Yan Zhao <yan.y.zhao@intel.com>, Will Deacon <will@kernel.org>, Kefeng Wang <wangkefeng.wang@huawei.com>, Alex Williamson <alex.williamson@redhat.com>
Subject: [PATCH v2 00/19] mm: Support huge pfnmaps
Date: Mon, 26 Aug 2024 16:43:34 -0400

Peter Xu Aug. 26, 2024, 8:43 p.m. UTC

v2:
- Added tags
- Let folio_walk_start() scan special pmd/pud bits [DavidH]
- Switch copy_huge_pmd() COW+writable check into a VM_WARN_ON_ONCE()
- Update commit message to drop the mention of gup-fast, in patch "mm: Mark
  special bits for huge pfn mappings when inject" [JasonG]
- In gup-fast, reorder the _special check vs. the _devmap check, so as to
  make the pmd/pud path look the same as the pte path [DavidH, JasonG]
- Enrich comments for follow_pfnmap*() API, emphasize the risk when PFN is
  used after the end() is invoked, s/-ve/negative/ [JasonG, Sean]

Overview
========

This series is based on mm-unstable, commit b659edec079c (the latest as of
Aug 26th), with the patch "vma remove the unneeded avc bound with non-CoWed
folio" reverted, as it was reported broken [0].

This series implements huge pfnmap support for mm in general.  A huge
pfnmap allows e.g. VM_PFNMAP vmas to be mapped at either the PMD or the PUD
level, similar to what we already do with dax / thp / hugetlb, so that the
same range needs fewer TLB entries.  The idea is now extended to PFN
mappings, e.g. PCI MMIO BARs, which can grow as large as 8GB or even bigger.

Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.  The last
patch (from Alex Williamson) is the first user of huge pfnmaps; it enables
the vfio-pci driver to fault in huge pfn mappings.
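
To put rough numbers on the TLB argument, here is a back-of-the-envelope
sketch in userspace C.  It is not code from the series; the helper name is
made up, and it assumes 4K base pages with 2M / 1G huge sizes (the x86_64
case).  It only counts how many leaf entries are needed to cover an 8GB BAR
at each mapping size:

```c
#include <assert.h>
#include <stdint.h>

/* Mapping sizes, assuming 4K base pages (x86_64-style layout). */
#define SZ_4K (UINT64_C(1) << 12)
#define SZ_2M (UINT64_C(1) << 21)
#define SZ_1G (UINT64_C(1) << 30)

/*
 * Hypothetical helper: number of leaf entries needed to cover
 * region_size bytes with mappings of map_size bytes (rounded up).
 */
uint64_t entries_needed(uint64_t region_size, uint64_t map_size)
{
	return (region_size + map_size - 1) / map_size;
}

/* Usage example: an 8GB BAR at each of the three mapping sizes. */
void example_bar_entries(void)
{
	uint64_t bar = 8 * SZ_1G;

	assert(entries_needed(bar, SZ_4K) == 2097152);	/* PTEs */
	assert(entries_needed(bar, SZ_2M) == 4096);	/* PMDs */
	assert(entries_needed(bar, SZ_1G) == 8);	/* PUDs */
}
```

Fewer leaf entries for the same range means fewer TLB entries competing
for the same MMIO accesses.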

Implementation
==============

In reality, it's relatively simple to add such support compared to many
other types of mappings.  A PFNMAP is special in that there's no vmemmap
backing it, so most of the kernel routines that operate on huge mappings
should simply already fail for them, like GUP or the old-school
follow_page() (which was recently rewritten into the folio_walk* APIs by
David).

One tricky part is that PUD support is still immature here and there in
the generic paths, as DAX is so far the only user.  This patchset adds the
2nd user.  Hugetlb can be a 3rd user if the hugetlb unification work goes
smoothly, but that is to be discussed later.

The other tricky part is how to let gup-fast handle such huge mappings when
the entry itself gives no direct sign of whether it maps a normal page or
MMIO.  This series keeps the pte_special solution and reuses the same idea
by setting a special bit on pfnmap PMDs/PUDs, so that gup-fast is able to
identify them and fail properly.
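
As a toy illustration of that check (a userspace model only; the flag
layout and function names below are invented for the example, and the real
bits are arch-specific), gup-fast bailing out on special entries boils down
to something like:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for a toy pte/pmd/pud entry. */
#define ENT_PRESENT (UINT64_C(1) << 0)
#define ENT_HUGE    (UINT64_C(1) << 1)
#define ENT_SPECIAL (UINT64_C(1) << 2)

/* Toy stand-in for the pte_special()-style helpers at any level. */
bool ent_special(uint64_t ent)
{
	return (ent & ENT_SPECIAL) != 0;
}

/*
 * Model of the gup-fast decision: return true if the fast path may go
 * on to pin the page, false if it must fail (non-present entry, or a
 * special entry with no struct page behind it).
 */
bool toy_gup_fast_ok(uint64_t ent)
{
	if (!(ent & ENT_PRESENT))
		return false;
	if (ent_special(ent))
		return false;
	return true;
}

/* Usage example: a normal huge entry passes, a huge pfnmap does not. */
void example_gup_fast(void)
{
	assert(toy_gup_fast_ok(ENT_PRESENT | ENT_HUGE));
	assert(!toy_gup_fast_ok(ENT_PRESENT | ENT_HUGE | ENT_SPECIAL));
}
```

The point is that the same test works at every level once PMDs/PUDs carry
the bit too.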

Along the way, we'll also notice that the major pgtable pfn walker, namely
follow_pte(), needs to retire soon because it only works with ptes.  A new
set of simple APIs (the follow_pfnmap* API) is introduced that can do
whatever follow_pte() already does, and can also process huge pfnmaps.
Half of this series is about that, and about converting all existing pfnmap
walkers to the new API.  Hopefully the new API is also an improvement in
that it avoids exposing e.g. pgtable lock details to the callers, so that
it can be used in an even more straightforward way.
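
To illustrate why a pte-only walker is not enough, here is a small
userspace model of the PFN computation a level-aware walker has to do.
This is not the follow_pfnmap* implementation; the struct, enum and
function names are made up, and 4K / 2M / 1G sizes are assumed.  The
address may fall anywhere inside a PMD or PUD leaf, so the PFN is the
leaf's base PFN plus the page offset within that leaf:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Level at which a (toy) leaf mapping was installed. */
enum map_level { LEVEL_PTE, LEVEL_PMD, LEVEL_PUD };

/* Hypothetical description of one leaf mapping. */
struct toy_mapping {
	enum map_level level;
	uint64_t va_start;	/* aligned to the mapping size */
	uint64_t base_pfn;	/* PFN backing va_start */
};

/* Bytes covered by a leaf at each level (4K / 2M / 1G assumed). */
uint64_t level_size(enum map_level level)
{
	switch (level) {
	case LEVEL_PUD:
		return UINT64_C(1) << 30;
	case LEVEL_PMD:
		return UINT64_C(1) << 21;
	default:
		return UINT64_C(1) << PAGE_SHIFT;
	}
}

/* Return 0 and fill *pfn if addr is covered by the mapping, else -1. */
int toy_follow_pfn(const struct toy_mapping *m, uint64_t addr, uint64_t *pfn)
{
	if (addr < m->va_start || addr - m->va_start >= level_size(m->level))
		return -1;
	*pfn = m->base_pfn + ((addr - m->va_start) >> PAGE_SHIFT);
	return 0;
}

/* Usage example: look up the 6th page inside a 2M leaf. */
void example_follow_pfn(void)
{
	struct toy_mapping pmd = {
		LEVEL_PMD, UINT64_C(0x40000000), UINT64_C(0x100000)
	};
	uint64_t pfn = 0;

	assert(toy_follow_pfn(&pmd, UINT64_C(0x40005000), &pfn) == 0);
	assert(pfn == UINT64_C(0x100005));
}
```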

Three new config options are introduced for huge pfnmaps:

  - ARCH_SUPPORTS_HUGE_PFNMAP

    Arch developers need to select this option in the arch's Kconfig when
    huge pfnmaps are supported.  Once this patchset is applied, both x86_64
    and arm64 enable it by default.

  - ARCH_SUPPORTS_PMD_PFNMAP / ARCH_SUPPORTS_PUD_PFNMAP

    These options let driver developers identify whether the current
    arch / config supports huge pfnmaps, and so decide whether the huge
    pfnmap APIs can be used to inject them.  The last vfio-pci patch from
    Alex shows how to use them properly in a device driver.
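
A rough sketch of how a driver might gate on these options follows.  It is
a standalone userspace model, not the vfio-pci patch itself: the CONFIG_*
symbols are defined by hand here (in a kernel build they would come from
Kconfig), the shift values assume the x86_64 layout, and the function name
is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Defined by hand for the example, to model an x86_64 config. */
#define CONFIG_ARCH_SUPPORTS_PMD_PFNMAP 1
#define CONFIG_ARCH_SUPPORTS_PUD_PFNMAP 1

/* Assumed x86_64 values: 4K pages, 2M PMDs, 1G PUDs. */
#define PAGE_SHIFT 12
#define PMD_SHIFT  21
#define PUD_SHIFT  30
#define PMD_ORDER  (PMD_SHIFT - PAGE_SHIFT)
#define PUD_ORDER  (PUD_SHIFT - PAGE_SHIFT)

/*
 * Return true if a fault of the given order can be served on this
 * config.  Order 0 (a plain PTE) is always fine; the huge orders are
 * only accepted when the matching option is enabled.
 */
bool toy_order_supported(unsigned int order)
{
	switch (order) {
	case 0:
		return true;
#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
	case PMD_ORDER:
		return true;
#endif
#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
	case PUD_ORDER:
		return true;
#endif
	default:
		return false;
	}
}

/* Usage example: PTE, PMD and PUD orders pass; anything else fails. */
void example_order_supported(void)
{
	assert(toy_order_supported(0));
	assert(toy_order_supported(PMD_ORDER));
	assert(toy_order_supported(PUD_ORDER));
	assert(!toy_order_supported(4));
}
```

On a config without one of the options, the corresponding case simply
compiles out and that order is rejected, so the fault falls back to a
smaller size.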

So once the whole set is applied, if one enables some dynamic debug lines
in the vfio-pci core files, one should observe things like:

  vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100
  vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100
  vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100

In this specific case, it shows that vfio-pci properly faults in PMDs for a
few BAR0 offsets.
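
For anyone decoding those lines by hand, the arithmetic is sketched below
(plain userspace C with made-up helper names, assuming 4K base pages):
order 9 means 2^9 = 512 pages, i.e. 2M, and the page offsets 0x0, 0x200 and
0x400 are exactly 512 pages apart, so each fault covers one PMD-sized,
PMD-aligned chunk of the BAR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumes 4K base pages */

/* Number of base pages covered by a fault of the given order. */
uint64_t order_to_pages(unsigned int order)
{
	return UINT64_C(1) << order;
}

/* Number of bytes covered by a fault of the given order. */
uint64_t order_to_bytes(unsigned int order)
{
	return order_to_pages(order) << PAGE_SHIFT;
}

/* True if a page offset is naturally aligned for the given order. */
bool pgoff_aligned(uint64_t pgoff, unsigned int order)
{
	return (pgoff & (order_to_pages(order) - 1)) == 0;
}

/* Usage example: the values seen in the debug output above. */
void example_decode_order(void)
{
	assert(order_to_pages(9) == 0x200);
	assert(order_to_bytes(9) == UINT64_C(2) << 20);
	assert(pgoff_aligned(0x0, 9));
	assert(pgoff_aligned(0x200, 9));
	assert(pgoff_aligned(0x400, 9));
}
```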

Patch Layout
============

Patch 1:         Introduce the new options mentioned above for huge PFNMAPs
Patch 2:         A tiny cleanup
Patch 3-8:       Preparation patches for huge pfnmap (including the
                 introduction of the special bit for pmd/pud)
Patch 9-16:      Introduce follow_pfnmap*() API, use it everywhere, and
                 then drop follow_pte() API
Patch 17:        Add huge pfnmap support for x86_64
Patch 18:        Add huge pfnmap support for arm64
Patch 19:        Add vfio-pci support for all kinds of huge pfnmaps (Alex)

TODO
====

More architectures / More page sizes
------------------------------------

Currently only x86_64 (2M+1G) and arm64 (2M) are supported.  There seems to
be a plan to support arm64 1G later on top of this series [2].

Any arch will need to first support THP / THP_1G, then provide a special
bit in pmds/puds to support huge pfnmaps.

remap_pfn_range() support
-------------------------

Currently, remap_pfn_range() still only maps PTEs.  With the new options,
remap_pfn_range() can logically start to inject either PMDs or PUDs when
the VAs meet the alignment requirements.

Once that support is there, it should silently benefit all drivers that use
remap_pfn_range() in their mmap() handlers, with a better TLB hit rate and
overall faster MMIO accesses, similar to what the processor gets from
hugepages.
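
The alignment requirement could look roughly like the following.  This is
only a sketch of the idea in userspace C, not a proposed implementation of
remap_pfn_range(); the names are invented and 4K / 2M / 1G sizes are
assumed.  A huge entry is only possible when the virtual address and the
physical address are both aligned to the huge size and enough length
remains:

```c
#include <assert.h>
#include <stdint.h>

#define PMD_SIZE (UINT64_C(1) << 21)	/* 2M, assumed */
#define PUD_SIZE (UINT64_C(1) << 30)	/* 1G, assumed */

enum toy_level { MAP_PTE, MAP_PMD, MAP_PUD };

/*
 * Pick the largest mapping level usable at this point of the range:
 * va and pa must both be aligned to the level's size, and at least
 * that many bytes must remain to be mapped.
 */
enum toy_level pick_level(uint64_t va, uint64_t pa, uint64_t len)
{
	if (((va | pa) & (PUD_SIZE - 1)) == 0 && len >= PUD_SIZE)
		return MAP_PUD;
	if (((va | pa) & (PMD_SIZE - 1)) == 0 && len >= PMD_SIZE)
		return MAP_PMD;
	return MAP_PTE;
}

/* Usage example: alignment of both addresses decides the level. */
void example_pick_level(void)
{
	/* Everything 1G-aligned and 1G long: a PUD fits. */
	assert(pick_level(4 * PUD_SIZE, 2 * PUD_SIZE, PUD_SIZE) == MAP_PUD);
	/* PA only 2M-aligned: fall back to PMDs. */
	assert(pick_level(PUD_SIZE, PUD_SIZE + PMD_SIZE, PUD_SIZE) == MAP_PMD);
	/* VA only 4K-aligned: PTEs only. */
	assert(pick_level(PUD_SIZE + 0x1000, PUD_SIZE, PUD_SIZE) == MAP_PTE);
}
```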

More driver support
-------------------

VFIO is so far the only consumer of huge pfnmaps after this series is
applied.  Besides the generic remap_pfn_range() optimization above, a
device driver can also try to optimize its mmap() for a better VA alignment
for either PMD or PUD sizes.  This may, iiuc, normally require userspace
changes, as the driver doesn't normally decide the VA at which a BAR is
mapped.  But I don't know all the drivers well enough to have the full
picture.
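
On the userspace side, one common way to get an aligned VA is to
over-reserve address space and trim it.  The sketch below shows that
generic technique only; it is not taken from QEMU or from any driver, the
function name is made up, and it assumes that size is a multiple of the
page size and that align is a power of two no smaller than the page size:

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS under strict -std= */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Reserve size bytes of address space starting at an address aligned
 * to align.  Returns MAP_FAILED on error.  The region is PROT_NONE;
 * the caller is expected to map the real thing over it afterwards.
 */
void *reserve_aligned(size_t size, size_t align)
{
	size_t span = size + align;
	uint8_t *raw = mmap(NULL, span, PROT_NONE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	uintptr_t start;
	size_t head, tail;

	if (raw == MAP_FAILED)
		return MAP_FAILED;

	/* Round up to the next aligned address inside the reservation. */
	start = ((uintptr_t)raw + align - 1) & ~((uintptr_t)align - 1);
	head = start - (uintptr_t)raw;
	tail = span - head - size;

	/* Give back the unaligned head and the unused tail. */
	if (head)
		munmap(raw, head);
	if (tail)
		munmap((uint8_t *)start + size, tail);

	return (void *)start;
}

/* Usage example: reserve 4M aligned to 2M. */
void example_reserve_aligned(void)
{
	size_t align = (size_t)2 << 20;
	void *p = reserve_aligned((size_t)4 << 20, align);

	assert(p != MAP_FAILED);
	assert(((uintptr_t)p & (align - 1)) == 0);
	munmap(p, (size_t)4 << 20);
}
```

The caller would then typically mmap() the device region over the reserved
range with MAP_FIXED, so that huge entries become possible for it.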

Tests Done
==========

- Cross-build tests

- run_vmtests.sh

- Hacked e1000e QEMU with 128MB BAR 0, with some prefault, mprotect() and
  fork() tests on the mapped BAR

- x86_64 + AMD GPU
  - Needs Alex's modified QEMU to guarantee proper VA alignment, so that
    all pages are mapped with PUDs
  - Main BAR (8GB) starts to use PUD mappings
  - Sub BAR (??MBs?) starts to use PMD mappings
  - Performance wise, a slight improvement compared to the old PTE mappings

- aarch64 + NIC
  - Detached NIC test to make sure driver loads fine with PMD mappings

All credit goes to Alex for helping test the GPU/NIC use cases above.

Comments welcome, thanks.

[0] https://lore.kernel.org/r/73ad9540-3fb8-4154-9a4f-30a0a2b03d41@lucifer.local
[1] https://lore.kernel.org/r/20240807194812.819412-1-peterx@redhat.com
[2] https://lore.kernel.org/r/498e0731-81a4-4f75-95b4-a8ad0bcc7665@huawei.com

Alex Williamson (1):
  vfio/pci: Implement huge_fault support

Peter Xu (18):
  mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud
  mm: Drop is_huge_zero_pud()
  mm: Mark special bits for huge pfn mappings when inject
  mm: Allow THP orders for PFNMAPs
  mm/gup: Detect huge pfnmap entries in gup-fast
  mm/pagewalk: Check pfnmap for folio_walk_start()
  mm/fork: Accept huge pfnmap entries
  mm: Always define pxx_pgprot()
  mm: New follow_pfnmap API
  KVM: Use follow_pfnmap API
  s390/pci_mmio: Use follow_pfnmap API
  mm/x86/pat: Use the new follow_pfnmap API
  vfio: Use the new follow_pfnmap API
  acrn: Use the new follow_pfnmap API
  mm/access_process_vm: Use the new follow_pfnmap API
  mm: Remove follow_pte()
  mm/x86: Support large pfn mappings
  mm/arm64: Support large pfn mappings

 arch/arm64/Kconfig                  |   1 +
 arch/arm64/include/asm/pgtable.h    |  30 +++++
 arch/powerpc/include/asm/pgtable.h  |   1 +
 arch/s390/include/asm/pgtable.h     |   1 +
 arch/s390/pci/pci_mmio.c            |  22 ++--
 arch/sparc/include/asm/pgtable_64.h |   1 +
 arch/x86/Kconfig                    |   1 +
 arch/x86/include/asm/pgtable.h      |  80 +++++++-----
 arch/x86/mm/pat/memtype.c           |  17 ++-
 drivers/vfio/pci/vfio_pci_core.c    |  60 ++++++---
 drivers/vfio/vfio_iommu_type1.c     |  16 +--
 drivers/virt/acrn/mm.c              |  16 +--
 include/linux/huge_mm.h             |  16 +--
 include/linux/mm.h                  |  57 ++++++++-
 include/linux/pgtable.h             |  12 ++
 mm/Kconfig                          |  13 ++
 mm/gup.c                            |   6 +
 mm/huge_memory.c                    |  50 +++++---
 mm/memory.c                         | 183 ++++++++++++++++++++--------
 mm/pagewalk.c                       |   4 +-
 virt/kvm/kvm_main.c                 |  19 ++-
 21 files changed, 425 insertions(+), 181 deletions(-)

Jiaqi Yan Aug. 27, 2024, 10:36 p.m. UTC | #1

On Mon, Aug 26, 2024 at 1:44 PM Peter Xu <peterx@redhat.com> wrote:
>
> remap_pfn_range() support
> -------------------------
>
> Currently, remap_pfn_range() still only maps PTEs.  With the new option,
> remap_pfn_range() can logically start to inject either PMDs or PUDs when
> the alignment requirements match on the VAs.
>
> When the support is there, it should be able to silently benefit all
> drivers that is using remap_pfn_range() in its mmap() handler on better TLB
> hit rate and overall faster MMIO accesses similar to processor on hugepages.
>

Hi Peter,

I am curious whether any work is needed for unmap_mapping_range(). If a
driver has remap_pfn_range()ed a huge mapping at 1G granularity, can the
driver still unmap at PAGE_SIZE granularity? For example, when handling a
poisoned PFN inside the 1G mapping, it would be great if the mapping could
be split into 2M mappings + 4k mappings, so that only the single poisoned
PFN is lost. (Pretty much like the past proposal* to use HGM** to improve
hugetlb's memory failure handling.)

These questions can probably be answered by reading your code, which I
plan to do, but I just want to ask in case you have an easy answer for me.

* https://patchwork.plctlab.org/project/linux-kernel/cover/20230428004139.2899856-1-jiaqiyan@google.com/
** https://lwn.net/Articles/912017

Peter Xu Aug. 27, 2024, 10:57 p.m. UTC | #2

On Tue, Aug 27, 2024 at 03:36:07PM -0700, Jiaqi Yan wrote:
> Hi Peter,

Hi, Jiaqi,

> I am curious if there is any work needed for unmap_mapping_range? If a
> driver hugely remap_pfn_range()ed at 1G granularity, can the driver
> unmap at PAGE_SIZE granularity? For example, when handling a PFN is

Yes it can, but it'll invoke split_huge_pud(), which by default simply
removes the whole pud right now (that currently only covers DAX mappings
and huge pfnmaps; it won't be the case for anonymous memory if that comes,
for example).

In that case it relies on the driver providing a proper fault() /
huge_fault() to refault things back with smaller sizes when they are
accessed again later.

> poisoned in the 1G mapping, it would be great if the mapping can be
> splitted to 2M mappings + 4k mappings, so only the single poisoned PFN
> is lost. (Pretty much like the past proposal* to use HGM** to improve
> hugetlb's memory failure handling).

Note that we're only talking about MMIO mappings here, in which case the
PFN doesn't even have a struct page, so the whole poison idea shouldn't
apply, afaiu.

Thanks,

Jiaqi Yan Aug. 28, 2024, 12:42 a.m. UTC | #3

On Tue, Aug 27, 2024 at 3:57 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Aug 27, 2024 at 03:36:07PM -0700, Jiaqi Yan wrote:
> > Hi Peter,
>
> Hi, Jiaqi,
>
> > I am curious if there is any work needed for unmap_mapping_range? If a
> > driver hugely remap_pfn_range()ed at 1G granularity, can the driver
> > unmap at PAGE_SIZE granularity? For example, when handling a PFN is
>
> Yes it can, but it'll invoke the split_huge_pud() which default routes to
> removal of the whole pud right now (currently only covers either DAX
> mappings or huge pfnmaps; it won't for anonymous if it comes, for example).
>
> In that case it'll rely on the driver providing proper fault() /
> huge_fault() to refault things back with smaller sizes later when accessed
> again.

I see, so the driver needs to drive the recovery process, and the code
needs to live in the driver.

But it seems to me the recovery process will be more or less the same
across drivers? In that case, does it make sense for memory_failure to do
the common parts for all drivers?

Instead of removing the whole pud, can the driver or memory_failure do
something similar to a non-struct-page version of split_huge_page, so that
the driver doesn't need to re-fault good pages back?

>
> > poisoned in the 1G mapping, it would be great if the mapping can be
> > splitted to 2M mappings + 4k mappings, so only the single poisoned PFN
> > is lost. (Pretty much like the past proposal* to use HGM** to improve
> > hugetlb's memory failure handling).
>
> Note that we're only talking about MMIO mappings here, in which case the
> PFN doesn't even have a struct page, so the whole poison idea shouldn't
> apply, afaiu.

Yes, there won't be any struct page. Ankit proposed this patchset* for
handling poisoning. I wonder: if someday the vfio-nvgrace-gpu-pci driver
adopts your change via a new remap_pfn_range() (installing PMDs/PUDs
instead of PTEs), and memory_failure_pfn still calls
unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT, PAGE_SIZE, 0),
can it somehow just work with no re-fault needed?

* https://lore.kernel.org/lkml/20231123003513.24292-2-ankita@nvidia.com/#t

>
> Thanks,
>
> --
> Peter Xu
>

Jiaqi Yan Aug. 28, 2024, 12:46 a.m. UTC | #4

Adding Ankit in case he has opinions.

On Tue, Aug 27, 2024 at 5:42 PM Jiaqi Yan <jiaqiyan@google.com> wrote:
>
> On Tue, Aug 27, 2024 at 3:57 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Aug 27, 2024 at 03:36:07PM -0700, Jiaqi Yan wrote:
> > > Hi Peter,
> >
> > Hi, Jiaqi,
> >
> > > I am curious if there is any work needed for unmap_mapping_range? If a
> > > driver hugely remap_pfn_range()ed at 1G granularity, can the driver
> > > unmap at PAGE_SIZE granularity? For example, when handling a PFN is
> >
> > Yes it can, but it'll invoke the split_huge_pud() which default routes to
> > removal of the whole pud right now (currently only covers either DAX
> > mappings or huge pfnmaps; it won't for anonymous if it comes, for example).
> >
> > In that case it'll rely on the driver providing proper fault() /
> > huge_fault() to refault things back with smaller sizes later when accessed
> > again.
>
> I see, so the driver needs to drive the recovery process, and code
> needs to be in the driver.
>
> But it seems to me the recovery process will be more or less the same
> to different drivers? In that case does it make sense that
> memory_failure do the common things for all drivers?
>
> Instead of removing the whole pud, can driver or memory_failure do
> something similar to non-struct-page-version of split_huge_page? So
> driver doesn't need to re-fault good pages back?
>
>
> >
> > > poisoned in the 1G mapping, it would be great if the mapping can be
> > > splitted to 2M mappings + 4k mappings, so only the single poisoned PFN
> > > is lost. (Pretty much like the past proposal* to use HGM** to improve
> > > hugetlb's memory failure handling).
> >
> > Note that we're only talking about MMIO mappings here, in which case the
> > PFN doesn't even have a struct page, so the whole poison idea shouldn't
> > apply, afaiu.
>
> Yes, there won't be any struct page. Ankit proposed this patchset* for
> handling poisoning. I wonder if someday the vfio-nvgrace-gpu-pci
> driver adopts your change via new remap_pfn_range (install PMD/PUD
> instead of PTE), and memory_failure_pfn still
> unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT, PAGE_SIZE,
> 0), can it somehow just work and no re-fault needed?
>
> * https://lore.kernel.org/lkml/20231123003513.24292-2-ankita@nvidia.com/#t
>
>
>
> >
> > Thanks,
> >
> > --
> > Peter Xu
> >

Jason Gunthorpe Aug. 28, 2024, 2:24 p.m. UTC | #5

On Tue, Aug 27, 2024 at 05:42:21PM -0700, Jiaqi Yan wrote:

> Instead of removing the whole pud, can driver or memory_failure do
> something similar to non-struct-page-version of split_huge_page? So
> driver doesn't need to re-fault good pages back?

It would be far nicer if we didn't have to poke a hole in a 1G mapping
just for memory failure reporting.

Jason

Peter Xu Aug. 28, 2024, 2:41 p.m. UTC | #6

On Tue, Aug 27, 2024 at 05:42:21PM -0700, Jiaqi Yan wrote:
> On Tue, Aug 27, 2024 at 3:57 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Aug 27, 2024 at 03:36:07PM -0700, Jiaqi Yan wrote:
> > > Hi Peter,
> >
> > Hi, Jiaqi,
> >
> > > I am curious if there is any work needed for unmap_mapping_range? If a
> > > driver hugely remap_pfn_range()ed at 1G granularity, can the driver
> > > unmap at PAGE_SIZE granularity? For example, when handling a PFN is
> >
> > Yes it can, but it'll invoke the split_huge_pud() which default routes to
> > removal of the whole pud right now (currently only covers either DAX
> > mappings or huge pfnmaps; it won't for anonymous if it comes, for example).
> >
> > In that case it'll rely on the driver providing proper fault() /
> > huge_fault() to refault things back with smaller sizes later when accessed
> > again.
> 
> I see, so the driver needs to drive the recovery process, and code
> needs to be in the driver.
> 
> But it seems to me the recovery process will be more or less the same
> to different drivers? In that case does it make sense that
> memory_failure do the common things for all drivers?
> 
> Instead of removing the whole pud, can driver or memory_failure do
> something similar to non-struct-page-version of split_huge_page? So
> driver doesn't need to re-fault good pages back?

I think we can, it's just that we don't yet have a valid use case.

DAX is definitely fault-able.

As for the new huge pfnmap, vfio is currently the only user, and vfio only
needs to either zap all or map all.  In that case there's no real need yet
for what you described.  Meanwhile it's also faultable, so if / when needed
it should hopefully still do the work properly.

I believe it's not a usual requirement for most of the other drivers
either, as most of them don't even support fault() afaiu. remap_pfn_range()
can start to use huge mappings, however I'd expect most of those drivers
aren't ready for random tearing down of their MMIO mappings.

It sounds doable to me when there's a need for what you're describing,
but I don't think I know the use case well enough yet.

> 
> 
> >
> > > poisoned in the 1G mapping, it would be great if the mapping can be
> > > splitted to 2M mappings + 4k mappings, so only the single poisoned PFN
> > > is lost. (Pretty much like the past proposal* to use HGM** to improve
> > > hugetlb's memory failure handling).
> >
> > Note that we're only talking about MMIO mappings here, in which case the
> > PFN doesn't even have a struct page, so the whole poison idea shouldn't
> > apply, afaiu.
> 
> Yes, there won't be any struct page. Ankit proposed this patchset* for
> handling poisoning. I wonder if someday the vfio-nvgrace-gpu-pci
> driver adopts your change via new remap_pfn_range (install PMD/PUD
> instead of PTE), and memory_failure_pfn still
> unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT, PAGE_SIZE,
> 0), can it somehow just work and no re-fault needed?
> 
> * https://lore.kernel.org/lkml/20231123003513.24292-2-ankita@nvidia.com/#t

I see now, interesting.. Thanks for the link.

In the nvgpu case, one way is to do what you said; we can enhance the
pmd/pud split for pfnmap, but maybe that's overkill.

I saw that nvgpu will need a fault() anyway so as to detect poisoned
PFNs.  It's then also feasible that, once the new
nvgrace_gpu_vfio_pci_fault() supports huge pfnmaps, it tries to detect
whether the whole faulting range contains any poisoned PFN, and returns
FALLBACK if so (rather than VM_FAULT_HWPOISON).

E.g., when a 4K page within a 2M range is poisoned, we'll erase the 2M
mapping completely.  When an access happens later, as long as the accessed
4K is not the poisoned 4K, huge_fault() should still detect that a 4K range
within it is poisoned, so it won't inject a pmd but return FALLBACK; then
fault() will see that the accessed 4K range is not poisoned, and install a
pte.
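
To make the flow concrete, here is a toy userspace model of it.  All names
are invented, the poison list is just an array for the example, and 512
pages per PMD are assumed; the real driver would of course use its own
poison tracking and the VM_FAULT_* codes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGES_PER_PMD UINT64_C(512)	/* 2M / 4K, assumed */

/* Toy stand-ins for the fault return codes. */
enum toy_result { TOY_NOPAGE, TOY_FALLBACK, TOY_HWPOISON };

/* True if any poisoned PFN lies in [start_pfn, start_pfn + npages). */
bool range_has_poison(const uint64_t *poisoned, size_t n,
		      uint64_t start_pfn, uint64_t npages)
{
	for (size_t i = 0; i < n; i++)
		if (poisoned[i] >= start_pfn &&
		    poisoned[i] - start_pfn < npages)
			return true;
	return false;
}

/* PMD-level fault: refuse the huge entry if the 2M range has poison. */
enum toy_result toy_huge_fault(const uint64_t *poisoned, size_t n,
			       uint64_t pfn)
{
	uint64_t start = pfn & ~(PAGES_PER_PMD - 1);

	if (range_has_poison(poisoned, n, start, PAGES_PER_PMD))
		return TOY_FALLBACK;
	return TOY_NOPAGE;	/* a pmd would be installed here */
}

/* PTE-level fault: only the accessed page itself matters. */
enum toy_result toy_pte_fault(const uint64_t *poisoned, size_t n,
			      uint64_t pfn)
{
	if (range_has_poison(poisoned, n, pfn, 1))
		return TOY_HWPOISON;
	return TOY_NOPAGE;	/* a pte would be installed here */
}

/* Usage example: one poisoned page inside an otherwise good 2M range. */
void example_poison_fallback(void)
{
	uint64_t poisoned[] = { UINT64_C(0x100205) };

	/* Access to a good page in the same 2M range as the bad one. */
	assert(toy_huge_fault(poisoned, 1, UINT64_C(0x100200)) == TOY_FALLBACK);
	assert(toy_pte_fault(poisoned, 1, UINT64_C(0x100200)) == TOY_NOPAGE);
	/* Access to the bad page itself. */
	assert(toy_pte_fault(poisoned, 1, UINT64_C(0x100205)) == TOY_HWPOISON);
}
```

Only the 2M range that contains the poisoned page loses its huge mapping;
neighbouring 2M ranges can still be faulted in as pmds.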

Thanks,

Jiaqi Yan Aug. 28, 2024, 4:10 p.m. UTC | #7

On Wed, Aug 28, 2024 at 7:24 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Aug 27, 2024 at 05:42:21PM -0700, Jiaqi Yan wrote:
>
> > Instead of removing the whole pud, can driver or memory_failure do
> > something similar to non-struct-page-version of split_huge_page? So
> > driver doesn't need to re-fault good pages back?
>
> It would be far nicer if we didn't have to poke a hole in a 1G mapping
> just for memory failure reporting.

If I follow this, which of the following sounds better? 1. remove the pud
and rely on the driver to re-fault the PFNs that it knows are not poisoned
(what Peter suggested), or 2. keep the pud and allow access to both
good and bad PFNs.

Or provide some knob (configured by ?) so that the kernel + driver can
switch between the two?

>
> Jason

Jiaqi Yan Aug. 28, 2024, 4:23 p.m. UTC | #8

On Wed, Aug 28, 2024 at 7:41 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Aug 27, 2024 at 05:42:21PM -0700, Jiaqi Yan wrote:
> > On Tue, Aug 27, 2024 at 3:57 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Tue, Aug 27, 2024 at 03:36:07PM -0700, Jiaqi Yan wrote:
> > > > Hi Peter,
> > >
> > > Hi, Jiaqi,
> > >
> > > > I am curious if there is any work needed for unmap_mapping_range? If a
> > > > driver hugely remap_pfn_range()ed at 1G granularity, can the driver
> > > > unmap at PAGE_SIZE granularity? For example, when handling a PFN is
> > >
> > > Yes it can, but it'll invoke the split_huge_pud() which default routes to
> > > removal of the whole pud right now (currently only covers either DAX
> > > mappings or huge pfnmaps; it won't for anonymous if it comes, for example).
> > >
> > > In that case it'll rely on the driver providing proper fault() /
> > > huge_fault() to refault things back with smaller sizes later when accessed
> > > again.
> >
> > I see, so the driver needs to drive the recovery process, and code
> > needs to be in the driver.
> >
> > But it seems to me the recovery process will be more or less the same
> > to different drivers? In that case does it make sense that
> > memory_failure do the common things for all drivers?
> >
> > Instead of removing the whole pud, can driver or memory_failure do
> > something similar to non-struct-page-version of split_huge_page? So
> > driver doesn't need to re-fault good pages back?
>
> I think we can, it's just that we don't yet have a valid use case.
>
> DAX is definitely fault-able.
>
> While for the new huge pfnmap, currently vfio is the only user, and vfio
> only requires to either zap all or map all.  In that case there's no real
> need to ask for what you described yet.  Meanwhile it's also faultable, so
> if / when needed it should hopefully still do the work properly.
>
> I believe it's not usual requirement too for most of the rest drivers, as
> most of them don't even support fault() afaiu. remap_pfn_range() can start
> to use huge mappings, however I'd expect they're mostly not ready for
> random tearing down of any MMIO mappings.
>
> It sounds doable to me though when there's a need of what you're
> describing, but I don't think I know well on the use case yet.
>
> >
> >
> > >
> > > > poisoned in the 1G mapping, it would be great if the mapping can be
> > > > splitted to 2M mappings + 4k mappings, so only the single poisoned PFN
> > > > is lost. (Pretty much like the past proposal* to use HGM** to improve
> > > > hugetlb's memory failure handling).
> > >
> > > Note that we're only talking about MMIO mappings here, in which case the
> > > PFN doesn't even have a struct page, so the whole poison idea shouldn't
> > > apply, afaiu.
> >
> > Yes, there won't be any struct page. Ankit proposed this patchset* for
> > handling poisoning. I wonder if someday the vfio-nvgrace-gpu-pci
> > driver adopts your change via new remap_pfn_range (install PMD/PUD
> > instead of PTE), and memory_failure_pfn still
> > unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT, PAGE_SIZE,
> > 0), can it somehow just work and no re-fault needed?
> >
> > * https://lore.kernel.org/lkml/20231123003513.24292-2-ankita@nvidia.com/#t
>
> I see now, interesting.. Thanks for the link.
>
> In that case of nvgpu usage, one way is to do as what you said; we can
> enhance the pmd/pud split for pfnmap, but maybe that's an overkill.

Yeah, I just wanted to poke and see if splitting pmd/pud is some low-hanging fruit.

>
> I saw that the nvgpu will need a fault() anyway so as to detect poisoned
> PFNs, then it's also feasible that in the new nvgrace_gpu_vfio_pci_fault()
> when it supports huge pfnmaps it'll need to try to detect whether the whole
> faulting range contains any poisoned PFNs, then provide FALLBACK if so
> (rather than VM_FAULT_HWPOISON).
>
> E.g., when 4K of 2M is poisoned, we'll erase the 2M completely.  When
> access happens, as long as the accessed 4K is not on top of the poisoned
> 4k, huge_fault() should still detect that there's 4k range poisoned, then
> it'll not inject pmd but return FALLBACK, then in the fault() it'll see
> the accessed 4k range is not poisoned, then install a pte.

Thanks for illustrating the re-fault flow again. I think this should
work well for drivers (having large MMIO size) that care about memory
errors. We can put the pmd/pud split idea on the backlog and see if it is
needed in the future.

>
> Thanks,
>
> --
> Peter Xu
>

Jason Gunthorpe Aug. 28, 2024, 11:49 p.m. UTC | #9

On Wed, Aug 28, 2024 at 09:10:34AM -0700, Jiaqi Yan wrote:
> On Wed, Aug 28, 2024 at 7:24 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Tue, Aug 27, 2024 at 05:42:21PM -0700, Jiaqi Yan wrote:
> >
> > > Instead of removing the whole pud, can driver or memory_failure do
> > > something similar to non-struct-page-version of split_huge_page? So
> > > driver doesn't need to re-fault good pages back?
> >
> > It would be far nicer if we didn't have to poke a hole in a 1G mapping
> > just for memory failure reporting.
> 
> If I follow this, which of the following sounds better? 1. remove pud
> and rely on the driver to re-fault PFNs that it knows are not poisoned
> (what Peter suggested), or 2. keep the pud and allow access to both
> good and bad PFNs.

In practice I think people will need 2, as breaking up a 1G mapping
just because a few bits are bad will destroy the VM performance.

For this the expectation would be for the VM to co-operate and not
keep causing memory failures, or perhaps for the platform to spare in
good memory somehow.

> Or provide some knob (configured by ?) so that kernel + driver can
> switch between the two?

This is also sounding reasonable, especially if we need some
alternative protocol to signal userspace about the failed memory
besides fault and SIGBUS.

Jason

Jiaqi Yan Aug. 29, 2024, 7:21 p.m. UTC | #10

On Wed, Aug 28, 2024 at 4:50 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Aug 28, 2024 at 09:10:34AM -0700, Jiaqi Yan wrote:
> > On Wed, Aug 28, 2024 at 7:24 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Tue, Aug 27, 2024 at 05:42:21PM -0700, Jiaqi Yan wrote:
> > >
> > > > Instead of removing the whole pud, can driver or memory_failure do
> > > > something similar to non-struct-page-version of split_huge_page? So
> > > > driver doesn't need to re-fault good pages back?
> > >
> > > It would be far nicer if we didn't have to poke a hole in a 1G mapping
> > > just for memory failure reporting.
> >
> > If I follow this, which of the following sounds better? 1. remove pud
> > and rely on the driver to re-fault PFNs that it knows are not poisoned
> > (what Peter suggested), or 2. keep the pud and allow access to both
> > good and bad PFNs.
>
> In practice I think people will need 2, as breaking up a 1G mapping
> just because a few bits are bad will destroy the VM performance.
>

Totally agreed.

> For this the expectation would be for the VM to co-operate and not
> keep causing memory failures, or perhaps for the platform to spare in
> good memory somehow.

Yes, whether a VM gets into a memory-error-consumption loop
maliciously or accidentally, a reasonable VMM should have means to
detect and break it.

>
> > Or provide some knob (configured by ?) so that kernel + driver can
> > switch between the two?
>
> This is also sounding reasonable, especially if we need some
> alternative protocol to signal userspace about the failed memory
> besides fault and SIGBUS.

To clarify, what I have in mind is a knob, say named
"sysctl_enable_hard_offline", configured by userspace.

To apply to Ankit's memory_failure_pfn patch[*]:

static int memory_failure_pfn(unsigned long pfn, int flags)
{
  struct interval_tree_node *node;
  int res = MF_FAILED;
  LIST_HEAD(tokill);

  mutex_lock(&pfn_space_lock);
  for (node = interval_tree_iter_first(&pfn_space_itree, pfn, pfn); node;
       node = interval_tree_iter_next(node, pfn, pfn)) {
    struct pfn_address_space *pfn_space =
      container_of(node, struct pfn_address_space, node);

    if (pfn_space->ops)
      pfn_space->ops->failure(pfn_space, pfn);

    collect_procs_pgoff(NULL, pfn_space->mapping, pfn, &tokill);

    if (sysctl_enable_hard_offline)
      unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT,
                          PAGE_SIZE, 0);

    res = MF_RECOVERED;
  }
  mutex_unlock(&pfn_space_lock);

  if (res == MF_FAILED)
    return action_result(pfn, MF_MSG_PFN_MAP, res);

  flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
  kill_procs(&tokill, true, false, pfn, flags);

  return action_result(pfn, MF_MSG_PFN_MAP, MF_RECOVERED);
}

I think we still want to attempt to SIGBUS userspace, regardless of
doing unmap_mapping_range or not.

[*] https://lore.kernel.org/lkml/20231123003513.24292-2-ankita@nvidia.com/#t

>
> Jason

Jason Gunthorpe Sept. 4, 2024, 3:52 p.m. UTC | #11

On Thu, Aug 29, 2024 at 12:21:39PM -0700, Jiaqi Yan wrote:

> I think we still want to attempt to SIGBUS userspace, regardless of
> doing unmap_mapping_range or not.

IMHO we need to eliminate this path if we actually want to keep things
mapped.

There is no way to generate the SIGBUS without poking a 4k hole in the
1G page, as only that 4k should get SIGBUS, every other byte of the 1G
is clean.

Jason

Jiaqi Yan Sept. 4, 2024, 4:38 p.m. UTC | #12

On Wed, Sep 4, 2024 at 8:52 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Aug 29, 2024 at 12:21:39PM -0700, Jiaqi Yan wrote:
>
> > I think we still want to attempt to SIGBUS userspace, regardless of
> > doing unmap_mapping_range or not.
>
> IMHO we need to eliminate this path if we actually want to keep things
> mapped.
>
> There is no way to generate the SIGBUS without poking a 4k hole in the
> 1G page, as only that 4k should get SIGBUS, every other byte of the 1G
> is clean.

Ah, sorry I wasn't clear. The SIGBUS will be only for poisoned PFN;
clean PFNs under the same PUD/PMD for sure don't need any SIGBUS,
which is the whole purpose of not unmapping.

>
> Jason

Jason Gunthorpe Sept. 4, 2024, 4:43 p.m. UTC | #13

On Wed, Sep 04, 2024 at 09:38:22AM -0700, Jiaqi Yan wrote:
> On Wed, Sep 4, 2024 at 8:52 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Thu, Aug 29, 2024 at 12:21:39PM -0700, Jiaqi Yan wrote:
> >
> > > I think we still want to attempt to SIGBUS userspace, regardless of
> > > doing unmap_mapping_range or not.
> >
> > IMHO we need to eliminate this path if we actually want to keep things
> > mapped.
> >
> > There is no way to generate the SIGBUS without poking a 4k hole in the
> > 1G page, as only that 4k should get SIGBUS, every other byte of the 1G
> > is clean.
> 
> Ah, sorry I wasn't clear. The SIGBUS will be only for poisoned PFN;
> clean PFNs under the same PUD/PMD for sure don't need any SIGBUS,
> which is the whole purpose of not unmapping.

You can't get a SIGBUS if the things are still mapped. This is why the
SIGBUS flow requires poking a non-present hole around the poisoned
memory.

So keeping things mapped at 1G also means giving up on SIGBUS.

Jason

Jiaqi Yan Sept. 4, 2024, 4:58 p.m. UTC | #14

On Wed, Sep 4, 2024 at 9:43 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Sep 04, 2024 at 09:38:22AM -0700, Jiaqi Yan wrote:
> > On Wed, Sep 4, 2024 at 8:52 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Thu, Aug 29, 2024 at 12:21:39PM -0700, Jiaqi Yan wrote:
> > >
> > > > I think we still want to attempt to SIGBUS userspace, regardless of
> > > > doing unmap_mapping_range or not.
> > >
> > > IMHO we need to eliminate this path if we actually want to keep things
> > > mapped.
> > >
> > > There is no way to generate the SIGBUS without poking a 4k hole in the
> > > 1G page, as only that 4k should get SIGBUS, every other byte of the 1G
> > > is clean.
> >
> > Ah, sorry I wasn't clear. The SIGBUS will be only for poisoned PFN;
> > clean PFNs under the same PUD/PMD for sure don't need any SIGBUS,
> > which is the whole purpose of not unmapping.
>
> You can't get a SIGBUS if the things are still mapped. This is why the
> SIGBUS flow requires poking a non-present hole around the poisoned
> memory.
>
> So keeping things mapped at 1G also means giving up on SIGBUS.

SIGBUS during page fault is definitely impossible when memory is still
mapped, but the platform still raises an MCE or SEA in case of poison
consumption, right? So I wanted to propose new code to SIGBUS (either
BUS_MCEERR_AR or BUS_OBJERR) as long as the platform notifies the
kernel in the synchronous poison consumption context, e.g. MCE on X86
and SEA on ARM64.
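
For context, this is roughly how a userspace process (e.g. a VMM) can tell
the SIGBUS flavours apart today.  It is a minimal sketch, assuming Linux;
classify_sigbus(), on_sigbus() and install_sigbus_handler() are made-up
names, while BUS_MCEERR_AR / BUS_MCEERR_AO, si_code and si_addr are the
existing signal interface.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Linux ABI values, in case the libc headers do not expose them. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

static_assert(BUS_MCEERR_AR != BUS_MCEERR_AO, "distinct si_code values");

enum sigbus_kind {
	SIGBUS_POISON_SYNC,	/* BUS_MCEERR_AR: this thread consumed poison */
	SIGBUS_POISON_ASYNC,	/* BUS_MCEERR_AO: poison found, not consumed yet */
	SIGBUS_OTHER		/* BUS_ADRERR, BUS_OBJERR, SI_USER, ... */
};

static enum sigbus_kind classify_sigbus(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AR:
		return SIGBUS_POISON_SYNC;
	case BUS_MCEERR_AO:
		return SIGBUS_POISON_ASYNC;
	default:
		return SIGBUS_OTHER;
	}
}

static volatile sig_atomic_t last_kind = SIGBUS_OTHER;
static void *volatile last_addr;	/* faulting address of the last SIGBUS */

/*
 * Only records what happened.  A real handler for BUS_MCEERR_AR must not
 * simply return, or the faulting access is retried and consumes the
 * poison again.
 */
static void on_sigbus(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;
	last_kind = classify_sigbus(info->si_code);
	last_addr = info->si_addr;
}

static int install_sigbus_handler(void)
{
	struct sigaction sa = { 0 };

	sa.sa_sigaction = on_sigbus;
	sa.sa_flags = SA_SIGINFO;	/* needed to receive siginfo_t */
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```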

>
> Jason

Jason Gunthorpe Sept. 4, 2024, 5 p.m. UTC | #15

On Wed, Sep 04, 2024 at 09:58:54AM -0700, Jiaqi Yan wrote:
> On Wed, Sep 4, 2024 at 9:43 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Wed, Sep 04, 2024 at 09:38:22AM -0700, Jiaqi Yan wrote:
> > > On Wed, Sep 4, 2024 at 8:52 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > >
> > > > On Thu, Aug 29, 2024 at 12:21:39PM -0700, Jiaqi Yan wrote:
> > > >
> > > > > I think we still want to attempt to SIGBUS userspace, regardless of
> > > > > doing unmap_mapping_range or not.
> > > >
> > > > IMHO we need to eliminate this path if we actually want to keep things
> > > > mapped.
> > > >
> > > > There is no way to generate the SIGBUS without poking a 4k hole in the
> > > > 1G page, as only that 4k should get SIGBUS, every other byte of the 1G
> > > > is clean.
> > >
> > > Ah, sorry I wasn't clear. The SIGBUS will be only for poisoned PFN;
> > > clean PFNs under the same PUD/PMD for sure don't need any SIGBUS,
> > > which is the whole purpose of not unmapping.
> >
> > You can't get a SIGBUS if the things are still mapped. This is why the
> > SIGBUS flow requires poking a non-present hole around the poisoned
> > memory.
> >
> > So keeping things mapped at 1G also means giving up on SIGBUS.
> 
> SIGBUS during page fault is definitely impossible when memory is still
> mapped, but the platform still MCE or SEA in case of poison
> consumption, right? So I wanted to propose new code to SIGBUS (either
> BUS_MCEERR_AR or BUS_OBJERR) as long as the platform notifies the
> kernel in the synchronous poison consumption context, e.g. MCE on X86
> and SEA on ARM64.

So you want a SIGBUS that is delivered asynchronously instead of via
the page fault handler? Something like that is sort of what I meant by
"eliminate this path", though I didn't think keeping an async SIGBUS
was an option?

Jason

Jiaqi Yan Sept. 4, 2024, 5:07 p.m. UTC | #16

On Wed, Sep 4, 2024 at 10:00 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Sep 04, 2024 at 09:58:54AM -0700, Jiaqi Yan wrote:
> > On Wed, Sep 4, 2024 at 9:43 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Wed, Sep 04, 2024 at 09:38:22AM -0700, Jiaqi Yan wrote:
> > > > On Wed, Sep 4, 2024 at 8:52 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > > >
> > > > > On Thu, Aug 29, 2024 at 12:21:39PM -0700, Jiaqi Yan wrote:
> > > > >
> > > > > > I think we still want to attempt to SIGBUS userspace, regardless of
> > > > > > doing unmap_mapping_range or not.
> > > > >
> > > > > IMHO we need to eliminate this path if we actually want to keep things
> > > > > mapped.
> > > > >
> > > > > There is no way to generate the SIGBUS without poking a 4k hole in the
> > > > > 1G page, as only that 4k should get SIGBUS, every other byte of the 1G
> > > > > is clean.
> > > >
> > > > Ah, sorry I wasn't clear. The SIGBUS will be only for poisoned PFN;
> > > > clean PFNs under the same PUD/PMD for sure don't need any SIGBUS,
> > > > which is the whole purpose of not unmapping.
> > >
> > > You can't get a SIGBUS if the things are still mapped. This is why the
> > > SIGBUS flow requires poking a non-present hole around the poisoned
> > > memory.
> > >
> > > So keeping things mapped at 1G also means giving up on SIGBUS.
> >
> > SIGBUS during page fault is definitely impossible when memory is still
> > mapped, but the platform still MCE or SEA in case of poison
> > consumption, right? So I wanted to propose new code to SIGBUS (either
> > BUS_MCEERR_AR or BUS_OBJERR) as long as the platform notifies the
> > kernel in the synchronous poison consumption context, e.g. MCE on X86
> > and SEA on ARM64.
>
> So you want a SIGBUS that is delivered asynchronously instead of via
> the page fault handler? Something like that is sort of what I meant by
> "eliminate this path", though I didn't think keeping an async SIGBUS
> was an option?

Not really, I don't think a SIGBUS *async* to the poison consuming
thread is critical, at least not as useful as SIGBUS *sync* to the
poison consuming thread.

>
> Jason

Ankit Agrawal Sept. 9, 2024, 3:56 a.m. UTC | #17

> Yes, whether a VM gets into a memory-error-consumption loop
> maliciously or accidentally, a reasonable VMM should have means to
> detect and break it.
Agreed, we need a way to handle it. I suppose it can easily happen if,
among other ways, a malicious app in the VM handles the SIGBUS and then,
say, reads/writes again.

Regarding the following two ways discussed..
> 1. remove pud and rely on the driver to re-fault PFNs that it knows
> are not poisoned (what Peter suggested), or 2. keep the pud and
> allow access to both good and bad PFNs.
As mentioned, 2. has the advantage from the performance POV.
For my understanding, what are the pros of mechanism 1 vs. 2?
I'm wondering whether it is a choice made out of some technical constraint.

Ankit Agrawal Sept. 9, 2024, 4:03 a.m. UTC | #18

> More architectures / More page sizes
> ------------------------------------
> 
> Currently only x86_64 (2M+1G) and arm64 (2M) are supported.  There seems to
> have plan to support arm64 1G later on top of this series [2].
> 
> Any arch will need to first support THP / THP_1G, then provide a special
> bit in pmds/puds to support huge pfnmaps.

Just to confirm, would this not also support 512M for 64K pages on aarch64
with a special PMD? Or am I missing something?

> remap_pfn_range() support
> -------------------------
> 
> Currently, remap_pfn_range() still only maps PTEs.  With the new option,
> remap_pfn_range() can logically start to inject either PMDs or PUDs when
> the alignment requirements match on the VAs.
>
> When the support is there, it should be able to silently benefit all
> drivers that is using remap_pfn_range() in its mmap() handler on better TLB
> hit rate and overall faster MMIO accesses similar to processor on hugepages.

Do Peter or other folks know of any ongoing effort/patches to extend
remap_pfn_range() to use this?

Peter Xu Sept. 9, 2024, 3:03 p.m. UTC | #19

On Mon, Sep 09, 2024 at 04:03:55AM +0000, Ankit Agrawal wrote:
> > More architectures / More page sizes
> > ------------------------------------
> > 
> > Currently only x86_64 (2M+1G) and arm64 (2M) are supported.  There seems to
> > have plan to support arm64 1G later on top of this series [2].
> > 
> > Any arch will need to first support THP / THP_1G, then provide a special
> > bit in pmds/puds to support huge pfnmaps.
> 
> Just to confirm, would this also not support 512M for 64K pages on aarch64
> with special PMD? Or am I missing something?

I don't think it's properly tested yet, but logically it should be
supported indeed, as here what matters is "pmd/pud", not the explicit size
that it uses.
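
To make the sizes concrete, the arithmetic can be sketched as below.  The
assumption (true for x86_64 and arm64 as far as I know) is that each
page-table level is one page of 8-byte entries; pmd_size() and pud_size()
are made-up helper names for this example, not kernel macros.

```c
#include <assert.h>
#include <stdint.h>

/* One table page holds PAGE_SIZE / 8 entries, i.e. 2^(page_shift - 3). */
static unsigned int entries_shift(unsigned int page_shift)
{
	assert(page_shift >= 12 && page_shift <= 16);
	return page_shift - 3;
}

/* Bytes mapped by one pmd entry: one full table of ptes. */
static uint64_t pmd_size(unsigned int page_shift)
{
	return UINT64_C(1) << (page_shift + entries_shift(page_shift));
}

/* Bytes mapped by one pud entry: one full table of pmds. */
static uint64_t pud_size(unsigned int page_shift)
{
	return UINT64_C(1) << (page_shift + 2 * entries_shift(page_shift));
}
```

With 4K pages (page_shift = 12) this gives a 2M pmd and a 1G pud; with 64K
pages (page_shift = 16) the pmd covers 512M, which is the size asked about.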

> 
> > remap_pfn_range() support
> > -------------------------
> > 
> > Currently, remap_pfn_range() still only maps PTEs.  With the new option,
> > remap_pfn_range() can logically start to inject either PMDs or PUDs when
> > the alignment requirements match on the VAs.
> >
> > When the support is there, it should be able to silently benefit all
> > drivers that is using remap_pfn_range() in its mmap() handler on better TLB
> > hit rate and overall faster MMIO accesses similar to processor on hugepages.
> 
> Does Peter or other folks know of an ongoing effort/patches to extend
> remap_pfn_range() to use this?

Not aware of any from my side.
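
For reference, the "alignment requirements" in the quoted text could be
checked along the lines below.  This is only a userspace sketch under
assumptions: huge_mapping_fits() is a hypothetical helper, not an existing
kernel function, and it assumes the physical address has to be aligned in
the same way as the VA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if a single huge entry of huge_size bytes (e.g. 2M or 1G)
 * could map [vaddr, vaddr + huge_size) to paddr.
 */
static bool huge_mapping_fits(uint64_t vaddr, uint64_t paddr,
			      uint64_t remaining, uint64_t huge_size)
{
	/* huge_size must be a non-zero power of two */
	assert(huge_size != 0 && (huge_size & (huge_size - 1)) == 0);

	uint64_t mask = huge_size - 1;

	return (vaddr & mask) == 0 &&	/* VA aligned to the huge size */
	       (paddr & mask) == 0 &&	/* PA aligned to the huge size */
	       remaining >= huge_size;	/* enough left to fill the entry */
}
```

A caller would try the pud size first, then the pmd size, and fall back to
ptes when neither fits.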

Thanks,
