
mm/x86/pat: Only untrack the pfn range if unmap region

Message ID 20240712144244.3090089-1-peterx@redhat.com (mailing list archive)
State New
Series mm/x86/pat: Only untrack the pfn range if unmap region

Commit Message

Peter Xu July 12, 2024, 2:42 p.m. UTC
This is one patch from an old series [1], reposted standalone here in the
hope of fixing some untrack_pfn() issues reported recently [2,3].  There
was another fix [4] for those, but unfortunately it looks like it causes
other issues.  The hope is that this patch fixes things the right way.

X86 uses pfn tracking to do pfnmaps.  AFAICT, the tracking should normally
start at mmap() of device drivers, then be untracked at munmap().  However,
in the current code the untracking is done in unmap_single_vma().  This
might be problematic.
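
(For illustration, a minimal sketch of the usual pattern that starts the
tracking; "demo_mmap" and the pfn are placeholders, but remap_pfn_range()
is the real API, which on x86 calls track_pfn_remap() internally:)

/*
 * Illustrative sketch only: a typical device-driver ->mmap() that starts
 * the tracking.  On x86, remap_pfn_range() calls track_pfn_remap(), which
 * reserves the memtype and sets VM_PAT when the whole vma is covered.
 */
static int demo_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long pfn = 0;	/* the device's MMIO pfn, driver-specific */

	return remap_pfn_range(vma, vma->vm_start, pfn,
			       vma->vm_end - vma->vm_start,
			       vma->vm_page_prot);
}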

For example, unmap_single_vma() can nowadays be used even for zapping a
single page rather than a whole vma.  It's very confusing to do whole-vma
untracking in this function when a caller only wants to zap one page.  It
could simply be wrong.

Such an issue won't be exposed by things like MADV_DONTNEED, which won't
ever work for pfnmaps: the madvise() fails before ever reaching here.
However, it looks like the issue can be triggered as reported, where the
zap is invoked from an unmap request on a file vma.
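
(A paraphrased sketch of why madvise() bails out early on pfnmaps; this
condenses the check in mm/madvise.c and is not the verbatim upstream code:)

/* Paraphrased sketch: MADV_DONTNEED-style requests refuse VM_PFNMAP
 * vmas before any zapping happens, so madvise() never reaches the
 * untrack path. */
static bool madv_dontneed_vma_ok(struct vm_area_struct *vma)
{
	return !(vma->vm_flags & VM_PFNMAP);
}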

There's also work [5] on VFIO (merged now) to allow tearing down MMIO
pgtables before an munmap(), in which case we may not want to untrack the
pfns if we're only tearing down the pgtables.  IOW, we may want to keep the
pfn tracking information, as those pfn mappings can be restored later with
the same vma object.  Currently it's not an immediate problem for VFIO, as
VFIO uses UC- by default, but it looks like there's a plan to extend that
in the near future.

IIUC, this was overlooked when zap_page_range_single() was introduced;
in the past unmap_single_vma() was only used in the munmap() path, which
always wants to unmap the region completely.  E.g., commit f5cc4eef9987
("VM: make zap_page_range() callers that act on a single VMA use separate
helper") is the initial commit that introduced unmap_single_vma(), in
which the untrack_pfn() chunk was moved over from unmap_vmas().

Restore that behavior: untrack the pfnmap only when unmapping a whole region.

[1] https://lore.kernel.org/r/20240523223745.395337-1-peterx@redhat.com
[2] https://groups.google.com/g/syzkaller-bugs/c/FeQZvSbqWbQ/m/tHFmoZthAAAJ
[3] https://lore.kernel.org/r/20240712131931.20207-1-00107082@163.com
[4] https://lore.kernel.org/all/20240710-bug12-v1-1-0e5440f9b8d3@gmail.com/
[5] https://lore.kernel.org/r/20240523195629.218043-1-alex.williamson@redhat.com

Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: x86@kernel.org
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Pei Li <peili.dev@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Wang <00107082@163.com>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
---

NOTE: I massaged the commit message compared to the RFC post [1]; the
patch itself is untouched.  Also removed the RFC tag and added more people
to the loop.  Please kindly help test this patch if you have a reproducer,
as I can't reproduce it myself even with the syzbot reproducer on top of
mm-unstable.  Instead of checking the reproducer further, I decided to
send this out first, as we have a bunch of reproducers on the list now.
---
 mm/memory.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Comments

David Hildenbrand July 13, 2024, 1:18 a.m. UTC | #1
On 12.07.24 16:42, Peter Xu wrote:
> [...]
> diff --git a/mm/memory.c b/mm/memory.c
> index 4bcd79619574..f57cc304b318 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1827,9 +1827,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>   	if (vma->vm_file)
>   		uprobe_munmap(vma, start, end);
>   
> -	if (unlikely(vma->vm_flags & VM_PFNMAP))
> -		untrack_pfn(vma, 0, 0, mm_wr_locked);
> -
>   	if (start != end) {
>   		if (unlikely(is_vm_hugetlb_page(vma))) {
>   			/*
> @@ -1894,6 +1891,8 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
>   		unsigned long start = start_addr;
>   		unsigned long end = end_addr;
>   		hugetlb_zap_begin(vma, &start, &end);
> +		if (unlikely(vma->vm_flags & VM_PFNMAP))
> +			untrack_pfn(vma, 0, 0, mm_wr_locked);
>   		unmap_single_vma(tlb, vma, start, end, &details,
>   				 mm_wr_locked);
>   		hugetlb_zap_end(vma, &details);


Yes, I would prefer that, and it makes sense to me.

Should we then have a "Fixes:" tag (at least for my commit, maybe also the 
original commit that made this be called in the wrong context)?

Acked-by: David Hildenbrand <david@redhat.com>
David Wang July 13, 2024, 3:36 a.m. UTC | #2
Hi,

At 2024-07-12 22:42:44, "Peter Xu" <peterx@redhat.com> wrote:
>
>[...]
>diff --git a/mm/memory.c b/mm/memory.c
>index 4bcd79619574..f57cc304b318 100644
>--- a/mm/memory.c
>+++ b/mm/memory.c
>@@ -1827,9 +1827,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> 	if (vma->vm_file)
> 		uprobe_munmap(vma, start, end);
> 
>-	if (unlikely(vma->vm_flags & VM_PFNMAP))
>-		untrack_pfn(vma, 0, 0, mm_wr_locked);
>-
> 	if (start != end) {
> 		if (unlikely(is_vm_hugetlb_page(vma))) {
> 			/*
>@@ -1894,6 +1891,8 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> 		unsigned long start = start_addr;
> 		unsigned long end = end_addr;
> 		hugetlb_zap_begin(vma, &start, &end);
>+		if (unlikely(vma->vm_flags & VM_PFNMAP))
>+			untrack_pfn(vma, 0, 0, mm_wr_locked);
> 		unmap_single_vma(tlb, vma, start, end, &details,
> 				 mm_wr_locked);
> 		hugetlb_zap_end(vma, &details);
>-- 
>2.45.0

I applied this patch on 6.10.0-rc7 and confirmed that no kernel warning shows up when I suspend my system (with an NVIDIA GPU).
After several rounds of suspend/resume cycles, no errors/warnings were observed in the kernel log.

Thanks
David
David Wang July 14, 2024, 10:59 a.m. UTC | #3
At 2024-07-12 22:42:44, "Peter Xu" <peterx@redhat.com> wrote:
>[...]
>diff --git a/mm/memory.c b/mm/memory.c
>index 4bcd79619574..f57cc304b318 100644
>--- a/mm/memory.c
>+++ b/mm/memory.c
>@@ -1827,9 +1827,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> 	if (vma->vm_file)
> 		uprobe_munmap(vma, start, end);
> 
>-	if (unlikely(vma->vm_flags & VM_PFNMAP))
>-		untrack_pfn(vma, 0, 0, mm_wr_locked);
>-
> 	if (start != end) {
> 		if (unlikely(is_vm_hugetlb_page(vma))) {
> 			/*
>@@ -1894,6 +1891,8 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> 		unsigned long start = start_addr;
> 		unsigned long end = end_addr;
> 		hugetlb_zap_begin(vma, &start, &end);
>+		if (unlikely(vma->vm_flags & VM_PFNMAP))
>+			untrack_pfn(vma, 0, 0, mm_wr_locked);
> 		unmap_single_vma(tlb, vma, start, end, &details,
> 				 mm_wr_locked);
> 		hugetlb_zap_end(vma, &details);
>-- 
>2.45.0

Hi, 

Today I noticed a kernel warning with this patch.


[Sun Jul 14 16:51:38 2024] OOM killer enabled.
[Sun Jul 14 16:51:38 2024] Restarting tasks ... done.
[Sun Jul 14 16:51:38 2024] random: crng reseeded on system resumption
[Sun Jul 14 16:51:38 2024] PM: suspend exit
[Sun Jul 14 16:51:38 2024] ------------[ cut here ]------------
[Sun Jul 14 16:51:38 2024] WARNING: CPU: 1 PID: 2484 at arch/x86/mm/pat/memtype.c:1002 untrack_pfn+0x10c/0x120
[Sun Jul 14 16:51:38 2024] Modules linked in: snd_seq_dummy(E) snd_hrtimer(E) snd_seq(E) ctr(E) ccm(E) nf_conntrack_netlink(E) xfrm_user(E) xfrm_algo(E) xt_addrtype(E) br_netfilter(E) xt_CHECKSUM(E) xt_MASQUERADE(E) xt_conntrack(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_tcpudp(E) nft_compat(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nf_tables(E) nfnetlink(E) bridge(E) stp(E) llc(E) overlay(E) binfmt_misc(E) nls_ascii(E) nls_cp437(E) vfat(E) fat(E) amd_atl(E) intel_rapl_msr(E) intel_rapl_common(E) nvidia_drm(POE) nvidia_modeset(POE) edac_mce_amd(E) kvm_amd(E) snd_hda_codec_realtek(E) kvm(E) iwlmvm(E) snd_hda_codec_generic(E) crct10dif_pclmul(E) snd_hda_scodec_component(E) snd_hda_codec_hdmi(E) ghash_clmulni_intel(E) sha512_ssse3(E) mac80211(E) sha512_generic(E) snd_hda_intel(E) nvidia(POE) sha256_ssse3(E) snd_intel_dspcfg(E) ppdev(E) sha1_ssse3(E) libarc4(E) snd_hda_codec(E) snd_usb_audio(E) snd_usbmidi_lib(E) uvcvideo(E) snd_hda_core(E) iwlwifi(E) aesni_intel(E) snd_rawmidi(E) snd_pcsp(E)
[Sun Jul 14 16:51:38 2024]  snd_hwdep(E) snd_seq_device(E) crypto_simd(E) videobuf2_vmalloc(E) snd_pcm(E) cryptd(E) uvc(E) videobuf2_memops(E) videobuf2_v4l2(E) snd_timer(E) rapl(E) cfg80211(E) k10temp(E) wmi_bmof(E) sp5100_tco(E) acpi_cpufreq(E) ccp(E) snd(E) videodev(E) drm_kms_helper(E) videobuf2_common(E) rfkill(E) video(E) rng_core(E) mc(E) soundcore(E) joydev(E) parport_pc(E) parport(E) sg(E) evdev(E) msr(E) loop(E) fuse(E) drm(E) efi_pstore(E) dm_mod(E) configfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) btrfs(E) blake2b_generic(E) efivarfs(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) raid1(E) raid0(E) md_mod(E) hid_generic(E) usbhid(E) hid(E) sd_mod(E) ahci(E) libahci(E) xhci_pci(E) nvme(E) libata(E) crc32_pclmul(E) nvme_core(E) xhci_hcd(E) t10_pi(E) crc32c_intel(E) i2c_piix4(E) r8169(E) crc64_rocksoft(E) realtek(E) scsi_mod(E) usbcore(E) scsi_common(E) usb_common(E) wmi(E) gpio_amdpt(E) gpio_generic(E) button(E)
[Sun Jul 14 16:51:38 2024] CPU: 1 PID: 2484 Comm: gnome-shell Tainted: P           OE      6.10.0-rc7-linan-1 #283
[Sun Jul 14 16:51:38 2024] Hardware name: Micro-Star International Co., Ltd. MS-7B89/B450M MORTAR MAX (MS-7B89), BIOS 2.80 06/10/2020
[Sun Jul 14 16:51:38 2024] RIP: 0010:untrack_pfn+0x10c/0x120
[Sun Jul 14 16:51:38 2024] Code: e2 01 74 22 8b 98 e0 00 00 00 3b 5d 2c 74 ac 48 8b 7d 30 e8 66 e1 bc 00 89 5d 2c 48 8b 7d 30 e8 0a 6c 09 00 eb 95 0f 0b eb da <0f> 0b eb 95 e8 db b6 bb 00 66 66 2e 0f 1f 84 00 00 00 00 00 90 90
[Sun Jul 14 16:51:38 2024] RSP: 0018:ffffae5b4ab1fbe8 EFLAGS: 00010202
[Sun Jul 14 16:51:38 2024] RAX: 0000000000000028 RBX: 0000000000000000 RCX: 0000000000000000
[Sun Jul 14 16:51:38 2024] RDX: 0000000000000001 RSI: 000fffffffe00000 RDI: ffff91d5be99ea80
[Sun Jul 14 16:51:38 2024] RBP: ffff91d5c44fbe70 R08: 00007f2e5ff32000 R09: 0000000000000001
[Sun Jul 14 16:51:38 2024] R10: ffff91d5b7ad6d1c R11: 00007f2e5ff35fff R12: 00007f2e5ff32000
[Sun Jul 14 16:51:38 2024] R13: 0000000000000000 R14: ffffae5b4ab1fde8 R15: ffff91d5c44fbe70
[Sun Jul 14 16:51:38 2024] FS:  00007f2e5ff59dc0(0000) GS:ffff91d84ec80000(0000) knlGS:0000000000000000
[Sun Jul 14 16:51:38 2024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Sun Jul 14 16:51:38 2024] CR2: 00007fe71316b08c CR3: 000000018468e000 CR4: 0000000000350ef0
[Sun Jul 14 16:51:38 2024] Call Trace:
[Sun Jul 14 16:51:38 2024]  <TASK>
[Sun Jul 14 16:51:38 2024]  ? __warn+0x7c/0x120
[Sun Jul 14 16:51:38 2024]  ? untrack_pfn+0x10c/0x120
[Sun Jul 14 16:51:38 2024]  ? report_bug+0x18d/0x1c0
[Sun Jul 14 16:51:38 2024]  ? handle_bug+0x3c/0x80
[Sun Jul 14 16:51:38 2024]  ? exc_invalid_op+0x13/0x60
[Sun Jul 14 16:51:38 2024]  ? asm_exc_invalid_op+0x16/0x20
[Sun Jul 14 16:51:38 2024]  ? untrack_pfn+0x10c/0x120
[Sun Jul 14 16:51:38 2024]  ? untrack_pfn+0x53/0x120
[Sun Jul 14 16:51:38 2024]  unmap_vmas+0x115/0x1a0
[Sun Jul 14 16:51:38 2024]  unmap_region+0xd4/0x150
[Sun Jul 14 16:51:38 2024]  ? mas_nomem+0x14/0x80
[Sun Jul 14 16:51:38 2024]  ? srso_return_thunk+0x5/0x5f
[Sun Jul 14 16:51:38 2024]  ? mas_store_gfp+0x54/0x110
[Sun Jul 14 16:51:38 2024]  do_vmi_align_munmap+0x2d4/0x530
[Sun Jul 14 16:51:38 2024]  do_vmi_munmap+0xda/0x190
[Sun Jul 14 16:51:38 2024]  __vm_munmap+0xa0/0x160
[Sun Jul 14 16:51:38 2024]  __x64_sys_munmap+0x17/0x20
[Sun Jul 14 16:51:38 2024]  do_syscall_64+0x4b/0x110
[Sun Jul 14 16:51:38 2024]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[Sun Jul 14 16:51:38 2024] RIP: 0033:0x7f2e647208f7
[Sun Jul 14 16:51:38 2024] Code: 00 00 00 48 8b 15 09 05 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 0b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d9 04 0d 00 f7 d8 64 89 01 48
[Sun Jul 14 16:51:38 2024] RSP: 002b:00007ffd289f0a48 EFLAGS: 00000246 ORIG_RAX: 000000000000000b
[Sun Jul 14 16:51:38 2024] RAX: ffffffffffffffda RBX: 00007f2e5ff31000 RCX: 00007f2e647208f7
[Sun Jul 14 16:51:38 2024] RDX: 0000000000000000 RSI: 0000000000001000 RDI: 00007f2e5ff31000
[Sun Jul 14 16:51:38 2024] RBP: 0000557d5a9330a0 R08: 00000000c1d00028 R09: 00000000beef0100
[Sun Jul 14 16:51:38 2024] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[Sun Jul 14 16:51:38 2024] R13: 0000000000000001 R14: 0000000000000002 R15: 0000557d5a8408c0
[Sun Jul 14 16:51:38 2024]  </TASK>
[Sun Jul 14 16:51:38 2024] ---[ end trace 0000000000000000 ]---
[Sun Jul 14 16:51:39 2024] ------------[ cut here ]------------
[Sun Jul 14 16:51:39 2024] WARNING: CPU: 1 PID: 2272 at arch/x86/mm/pat/memtype.c:1002 track_pfn_copy+0x94/0xa0
[Sun Jul 14 16:51:39 2024] Modules linked in: snd_seq_dummy(E) snd_hrtimer(E) snd_seq(E) ctr(E) ccm(E) nf_conntrack_netlink(E) xfrm_user(E) xfrm_algo(E) xt_addrtype(E) br_netfilter(E) xt_CHECKSUM(E) xt_MASQUERADE(E) xt_conntrack(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_tcpudp(E) nft_compat(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nf_tables(E) nfnetlink(E) bridge(E) stp(E) llc(E) overlay(E) binfmt_misc(E) nls_ascii(E) nls_cp437(E) vfat(E) fat(E) amd_atl(E) intel_rapl_msr(E) intel_rapl_common(E) nvidia_drm(POE) nvidia_modeset(POE) edac_mce_amd(E) kvm_amd(E) snd_hda_codec_realtek(E) kvm(E) iwlmvm(E) snd_hda_codec_generic(E) crct10dif_pclmul(E) snd_hda_scodec_component(E) snd_hda_codec_hdmi(E) ghash_clmulni_intel(E) sha512_ssse3(E) mac80211(E) sha512_generic(E) snd_hda_intel(E) nvidia(POE) sha256_ssse3(E) snd_intel_dspcfg(E) ppdev(E) sha1_ssse3(E) libarc4(E) snd_hda_codec(E) snd_usb_audio(E) snd_usbmidi_lib(E) uvcvideo(E) snd_hda_core(E) iwlwifi(E) aesni_intel(E) snd_rawmidi(E) snd_pcsp(E)
[Sun Jul 14 16:51:39 2024]  snd_hwdep(E) snd_seq_device(E) crypto_simd(E) videobuf2_vmalloc(E) snd_pcm(E) cryptd(E) uvc(E) videobuf2_memops(E) videobuf2_v4l2(E) snd_timer(E) rapl(E) cfg80211(E) k10temp(E) wmi_bmof(E) sp5100_tco(E) acpi_cpufreq(E) ccp(E) snd(E) videodev(E) drm_kms_helper(E) videobuf2_common(E) rfkill(E) video(E) rng_core(E) mc(E) soundcore(E) joydev(E) parport_pc(E) parport(E) sg(E) evdev(E) msr(E) loop(E) fuse(E) drm(E) efi_pstore(E) dm_mod(E) configfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) btrfs(E) blake2b_generic(E) efivarfs(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) raid1(E) raid0(E) md_mod(E) hid_generic(E) usbhid(E) hid(E) sd_mod(E) ahci(E) libahci(E) xhci_pci(E) nvme(E) libata(E) crc32_pclmul(E) nvme_core(E) xhci_hcd(E) t10_pi(E) crc32c_intel(E) i2c_piix4(E) r8169(E) crc64_rocksoft(E) realtek(E) scsi_mod(E) usbcore(E) scsi_common(E) usb_common(E) wmi(E) gpio_amdpt(E) gpio_generic(E) button(E)
[Sun Jul 14 16:51:39 2024] CPU: 1 PID: 2272 Comm: Xorg Tainted: P        W  OE      6.10.0-rc7-linan-1 #283
[Sun Jul 14 16:51:39 2024] Hardware name: Micro-Star International Co., Ltd. MS-7B89/B450M MORTAR MAX (MS-7B89), BIOS 2.80 06/10/2020
[Sun Jul 14 16:51:39 2024] RIP: 0010:track_pfn_copy+0x94/0xa0
[Sun Jul 14 16:51:39 2024] Code: ff ff ff eb b4 48 89 ee 48 8b 44 24 10 48 8b 3c 24 b9 01 00 00 00 4c 29 e6 48 8d 54 24 08 48 89 44 24 08 e8 fe fc ff ff eb 8f <0f> 0b eb d0 e8 73 b9 bb 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90
[Sun Jul 14 16:51:39 2024] RSP: 0018:ffffae5b4a04fb68 EFLAGS: 00010202
[Sun Jul 14 16:51:39 2024] RAX: 0000000000000028 RBX: ffff91d546ae1d10 RCX: 0000000000000000
[Sun Jul 14 16:51:39 2024] RDX: 0000000000000001 RSI: 000fffffffe00000 RDI: ffff91d5b969c700
[Sun Jul 14 16:51:39 2024] RBP: 00007fe71316e000 R08: ffff91d639b0b9a0 R09: 00007fe71316e000
[Sun Jul 14 16:51:39 2024] R10: 00007fe71316dfff R11: 00007fe71316efff R12: 00007fe71316d000
[Sun Jul 14 16:51:39 2024] R13: ffff91d543702f40 R14: ffff91d639b0b9a0 R15: 00007fe71316e000
[Sun Jul 14 16:51:39 2024] FS:  00007fe7124f8ac0(0000) GS:ffff91d84ec80000(0000) knlGS:0000000000000000
[Sun Jul 14 16:51:39 2024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Sun Jul 14 16:51:39 2024] CR2: 000055d6ab0453c0 CR3: 0000000179626000 CR4: 0000000000350ef0
[Sun Jul 14 16:51:39 2024] Call Trace:
[Sun Jul 14 16:51:39 2024]  <TASK>
[Sun Jul 14 16:51:39 2024]  ? __warn+0x7c/0x120
[Sun Jul 14 16:51:39 2024]  ? track_pfn_copy+0x94/0xa0
[Sun Jul 14 16:51:39 2024]  ? report_bug+0x18d/0x1c0
[Sun Jul 14 16:51:39 2024]  ? handle_bug+0x3c/0x80
[Sun Jul 14 16:51:39 2024]  ? exc_invalid_op+0x13/0x60
[Sun Jul 14 16:51:39 2024]  ? asm_exc_invalid_op+0x16/0x20
[Sun Jul 14 16:51:39 2024]  ? track_pfn_copy+0x94/0xa0
[Sun Jul 14 16:51:39 2024]  ? track_pfn_copy+0x57/0xa0
[Sun Jul 14 16:51:39 2024]  ? percpu_counter_add_batch+0x2e/0xa0
[Sun Jul 14 16:51:39 2024]  copy_page_range+0x156e/0x1630
[Sun Jul 14 16:51:39 2024]  ? srso_return_thunk+0x5/0x5f
[Sun Jul 14 16:51:39 2024]  ? mod_objcg_state+0xc9/0x2d0
[Sun Jul 14 16:51:39 2024]  ? obj_cgroup_charge+0x13f/0x1c0
[Sun Jul 14 16:51:39 2024]  ? __memcg_slab_post_alloc_hook+0x201/0x380
[Sun Jul 14 16:51:39 2024]  ? srso_return_thunk+0x5/0x5f
[Sun Jul 14 16:51:39 2024]  copy_process+0x1500/0x26e0
[Sun Jul 14 16:51:39 2024]  kernel_clone+0x97/0x3a0
[Sun Jul 14 16:51:39 2024]  ? srso_return_thunk+0x5/0x5f
[Sun Jul 14 16:51:39 2024]  ? preempt_count_add+0x69/0xa0
[Sun Jul 14 16:51:39 2024]  __do_sys_clone+0x66/0x90
[Sun Jul 14 16:51:39 2024]  do_syscall_64+0x4b/0x110
[Sun Jul 14 16:51:39 2024]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[Sun Jul 14 16:51:39 2024] RIP: 0033:0x7fe712b3a293
[Sun Jul 14 16:51:39 2024] Code: 00 00 00 00 00 66 90 64 48 8b 04 25 10 00 00 00 45 31 c0 31 d2 31 f6 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 89 c2 85 c0 75 2c 64 48 8b 04 25 10 00 00
[Sun Jul 14 16:51:39 2024] RSP: 002b:00007ffff7e77a98 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[Sun Jul 14 16:51:39 2024] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fe712b3a293
[Sun Jul 14 16:51:39 2024] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[Sun Jul 14 16:51:39 2024] RBP: 0000000000000001 R08: 0000000000000000 R09: 000055dacf937b10
[Sun Jul 14 16:51:39 2024] R10: 00007fe7124f8d90 R11: 0000000000000246 R12: 0000000000000000
[Sun Jul 14 16:51:39 2024] R13: 00007ffff7e77bb0 R14: 00007ffff7e77aa0 R15: 000055daa65272a0
[Sun Jul 14 16:51:39 2024]  </TASK>
[Sun Jul 14 16:51:39 2024] ---[ end trace 0000000000000000 ]---
[Sun Jul 14 16:51:39 2024] rfkill: input handler disabled
[Sun Jul 14 16:51:39 2024] Generic FE-GE Realtek PHY r8169-0-2200:00: attached PHY driver (mii_bus:phy_addr=r8169-0-2200:00, irq=MAC)

(I don't know a sure procedure to reproduce it yet...)

FYI

David
David Hildenbrand July 14, 2024, 6:27 p.m. UTC | #4
On 14.07.24 12:59, David Wang wrote:
> 
> At 2024-07-12 22:42:44, "Peter Xu" <peterx@redhat.com> wrote:
>> [...]
> 
> Hi,
> 
> Today, I notice a kernel warning with this patch.
> 
> 
> [Sun Jul 14 16:51:38 2024] OOM killer enabled.
> [Sun Jul 14 16:51:38 2024] Restarting tasks ... done.
> [Sun Jul 14 16:51:38 2024] random: crng reseeded on system resumption
> [Sun Jul 14 16:51:38 2024] PM: suspend exit
> [Sun Jul 14 16:51:38 2024] ------------[ cut here ]------------
> [Sun Jul 14 16:51:38 2024] WARNING: CPU: 1 PID: 2484 at arch/x86/mm/pat/memtype.c:1002 untrack_pfn+0x10c/0x120

We fail to find what we need in the page tables, indicating that the 
page tables might have been modified / torn down in the meantime.

Likely we have a previous call to unmap_single_vma() that modifies the 
page tables, and unmaps present PFNs.

PAT is incompatible with that; it relies on information from the page 
tables to know what it has to undo during munmap(), or what it has to do 
during fork().
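
(For context, a paraphrased sketch of that dependency, roughly following
get_pat_info() in arch/x86/mm/pat/memtype.c; not the verbatim upstream
code, and the CoW special-casing is omitted:)

/* Paraphrased sketch: PAT recovers the pfn and the memtype from a
 * present leaf entry of the vma's page table.  If the pgtable was
 * already zapped, follow_phys() finds nothing and we hit the
 * WARN_ON_ONCE() -- the memtype.c:1002 splat above. */
static int get_pat_info_sketch(struct vm_area_struct *vma,
			       resource_size_t *paddr, pgprot_t *pgprot)
{
	unsigned long prot;

	if (follow_phys(vma, vma->vm_start, 0, &prot, paddr)) {
		WARN_ON_ONCE(1);
		return -EINVAL;
	}
	*pgprot = __pgprot(prot);
	return 0;
}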

The splat from the previous discussion [1]:

   follow_phys arch/x86/mm/pat/memtype.c:957 [inline]
   get_pat_info+0xf2/0x510 arch/x86/mm/pat/memtype.c:991
   untrack_pfn+0xf7/0x4d0 arch/x86/mm/pat/memtype.c:1104
   unmap_single_vma+0x1bd/0x2b0 mm/memory.c:1819
   zap_page_range_single+0x326/0x560 mm/memory.c:1920
   unmap_mapping_range_vma mm/memory.c:3684 [inline]
   unmap_mapping_range_tree mm/memory.c:3701 [inline]
   unmap_mapping_pages mm/memory.c:3767 [inline]
   unmap_mapping_range+0x1ee/0x280 mm/memory.c:3804
   truncate_pagecache+0x53/0x90 mm/truncate.c:731
   simple_setattr+0xf2/0x120 fs/libfs.c:886
   notify_change+0xec6/0x11f0 fs/attr.c:499
   do_truncate+0x15c/0x220 fs/open.c:65
   handle_truncate fs/namei.c:3308 [inline]

indicates that file truncation seems to end up messing with a PFNMAP 
mapping that has PAT set. That is ... weird. I would have thought that 
PFNMAP would never really happen with file truncation.

Does this only happen with an OOT driver that seems to do weird truncate 
stuff on files that have a PFNMAP mapping?

[1] 
https://lore.kernel.org/all/3879ee72-84de-4d2a-93a8-c0b3dc3f0a4c@redhat.com/

> [...]
> [Sun Jul 14 16:51:39 2024] ------------[ cut here ]------------
> [Sun Jul 14 16:51:39 2024] WARNING: CPU: 1 PID: 2272 at arch/x86/mm/pat/memtype.c:1002 track_pfn_copy+0x94/0xa0
> [...]
> [Sun Jul 14 16:51:39 2024] Call Trace:
> [Sun Jul 14 16:51:39 2024]  <TASK>
> [Sun Jul 14 16:51:39 2024]  ? __warn+0x7c/0x120
> [Sun Jul 14 16:51:39 2024]  ? track_pfn_copy+0x94/0xa0

Same thing (a follow-up error): during fork() we don't know what to do 
because the page tables were already modified, and we don't know how to 
handle that PFNMAP mapping.
Yan Zhao July 15, 2024, 7:08 a.m. UTC | #5
On Fri, Jul 12, 2024 at 10:42:44AM -0400, Peter Xu wrote:
> [...]
> @@ -1827,9 +1827,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>  	if (vma->vm_file)
>  		uprobe_munmap(vma, start, end);
>  
> -	if (unlikely(vma->vm_flags & VM_PFNMAP))
> -		untrack_pfn(vma, 0, 0, mm_wr_locked);
> -
Specifically for VFIO's case, it looks like it doesn't matter whether
untrack_pfn() is called here, since remap_pfn_range() is not called in
mmap() or the fault handler, and therefore (vma->vm_flags & VM_PAT) is
always 0.

>  	if (start != end) {
>  		if (unlikely(is_vm_hugetlb_page(vma))) {
>  			/*
> @@ -1894,6 +1891,8 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
>  		unsigned long start = start_addr;
>  		unsigned long end = end_addr;
>  		hugetlb_zap_begin(vma, &start, &end);
> +		if (unlikely(vma->vm_flags & VM_PFNMAP))
> +			untrack_pfn(vma, 0, 0, mm_wr_locked);
Same here.

>  		unmap_single_vma(tlb, vma, start, end, &details,
>  				 mm_wr_locked);
>  		hugetlb_zap_end(vma, &details);
> -- 
> 2.45.0
>
Peter Xu July 15, 2024, 2:29 p.m. UTC | #6
On Mon, Jul 15, 2024 at 03:08:58PM +0800, Yan Zhao wrote:
> On Fri, Jul 12, 2024 at 10:42:44AM -0400, Peter Xu wrote:
> > [...]
> > @@ -1827,9 +1827,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> >  	if (vma->vm_file)
> >  		uprobe_munmap(vma, start, end);
> >  
> > -	if (unlikely(vma->vm_flags & VM_PFNMAP))
> > -		untrack_pfn(vma, 0, 0, mm_wr_locked);
> > -
> Specifically to VFIO's case, looks it doesn't matter if untrack_pfn() is
> called here, since remap_pfn_range() is not called in mmap() and fault
> handler, and therefore (vma->vm_flags & VM_PAT) is always 0.

Right, with the current code.  But I'm thinking maybe we should have
VM_PAT there..

See what reserve_pfn_range() does right now: it'll make sure only one
owner reserves an area, e.g. memtype_reserve() will fail if the area has
already been reserved.  Then, having succeeded as the first one to reserve
the region, it'll make sure this memtype is also synchronized to the
kernel map (memtype_kernel_map_sync_pfn_range()).
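
(A paraphrased sketch of that flow, condensed from
arch/x86/mm/pat/memtype.c; the strict-prot fixups and the memtype
downgrade handling are omitted:)

/* Condensed sketch of reserve_pfn_range(), not the verbatim code */
int reserve_pfn_range_sketch(u64 paddr, unsigned long size, pgprot_t *prot)
{
	enum page_cache_mode want_pcm = pgprot2cachemode(*prot);
	enum page_cache_mode pcm;
	int ret;

	/* Fails if already reserved with a conflicting memtype */
	ret = memtype_reserve(paddr, paddr + size, want_pcm, &pcm);
	if (ret)
		return ret;

	/* Propagate the memtype into the kernel identity mapping */
	if (memtype_kernel_map_sync_pfn_range(paddr, size, pcm) < 0) {
		memtype_free(paddr, paddr + size);
		return -EINVAL;
	}
	return 0;
}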

So I feel like when MMIO is disabled for a VFIO card, we may need to call
reserve_pfn_range() anyway, telling the kernel the memtype VFIO uses, even
if the pgtable will be empty, and even if it's currently always UC-:

vfio_pci_core_mmap():
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

Then the memtype will be reserved even while it cannot be used.  Otherwise
I don't know whether there are side effects with the kernel identity
mapping, where it could later mismatch what's mapped in userspace via the
vma when MMIO gets enabled again: the pgtable starts to be injected with a
different memtype that is specified only in the vma's pgprot.

> 
> >  	if (start != end) {
> >  		if (unlikely(is_vm_hugetlb_page(vma))) {
> >  			/*
> > @@ -1894,6 +1891,8 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> >  		unsigned long start = start_addr;
> >  		unsigned long end = end_addr;
> >  		hugetlb_zap_begin(vma, &start, &end);
> > +		if (unlikely(vma->vm_flags & VM_PFNMAP))
> > +			untrack_pfn(vma, 0, 0, mm_wr_locked);
> Same here.
> 
> >  		unmap_single_vma(tlb, vma, start, end, &details,
> >  				 mm_wr_locked);
> >  		hugetlb_zap_end(vma, &details);
> > -- 
> > 2.45.0
> > 
>
Peter Xu July 15, 2024, 3:03 p.m. UTC | #7
On Sun, Jul 14, 2024 at 08:27:25PM +0200, David Hildenbrand wrote:
> On 14.07.24 12:59, David Wang wrote:
> > 
> > At 2024-07-12 22:42:44, "Peter Xu" <peterx@redhat.com> wrote:
> > > [...]
> > 
> > Hi,
> > 
> > Today, I notice a kernel warning with this patch.
> > 
> > 
> > [Sun Jul 14 16:51:38 2024] OOM killer enabled.
> > [Sun Jul 14 16:51:38 2024] Restarting tasks ... done.
> > [Sun Jul 14 16:51:38 2024] random: crng reseeded on system resumption
> > [Sun Jul 14 16:51:38 2024] PM: suspend exit
> > [Sun Jul 14 16:51:38 2024] ------------[ cut here ]------------
> > [Sun Jul 14 16:51:38 2024] WARNING: CPU: 1 PID: 2484 at arch/x86/mm/pat/memtype.c:1002 untrack_pfn+0x10c/0x120
> 
> We fail to find what we need in the page tables, indicating that the page
> tables might have been modified / torn down in the meantime.
> 
> Likely we have a previous call to unmap_single_vma() that modifies the page
> tables, and unmaps present PFNs.
> 
> PAT is incompatible to that, it relies on information from the page tables
> to know what it has to undo during munmap(), or what it has to do during
> fork().
> 
> The splat from the previous discussion [1]:
> 
>   follow_phys arch/x86/mm/pat/memtype.c:957 [inline]
>   get_pat_info+0xf2/0x510 arch/x86/mm/pat/memtype.c:991
>   untrack_pfn+0xf7/0x4d0 arch/x86/mm/pat/memtype.c:1104
>   unmap_single_vma+0x1bd/0x2b0 mm/memory.c:1819
>   zap_page_range_single+0x326/0x560 mm/memory.c:1920
>   unmap_mapping_range_vma mm/memory.c:3684 [inline]
>   unmap_mapping_range_tree mm/memory.c:3701 [inline]
>   unmap_mapping_pages mm/memory.c:3767 [inline]
>   unmap_mapping_range+0x1ee/0x280 mm/memory.c:3804
>   truncate_pagecache+0x53/0x90 mm/truncate.c:731
>   simple_setattr+0xf2/0x120 fs/libfs.c:886
>   notify_change+0xec6/0x11f0 fs/attr.c:499
>   do_truncate+0x15c/0x220 fs/open.c:65
>   handle_truncate fs/namei.c:3308 [inline]
> 
> indicates that file truncation seems to end up messing with a PFNMAP mapping
> that has PAT set. That is ... weird. I would have thought that PFNMAP would
> never really happen with file truncation.
> 
> Does this only happen with an OOT driver, that seems to do weird truncate
> stuff on files that have a PFNMAP mapping?
> 
> [1]
> https://lore.kernel.org/all/3879ee72-84de-4d2a-93a8-c0b3dc3f0a4c@redhat.com/

Ohhh.. I guess this will also stop working for VFIO, but I think it's fine
for now because, as Yan pointed out, VFIO PCI doesn't register those
regions yet, so VM_PAT is not set..

And one thing I said wrong in the previous reply to Yan: obviously,
memtype_check_insert() can work with more than one owner as long as the
memtype matches.. and that's how fork() works, where VM_PAT needs to be
duplicated.  But this whole thing is a bit confusing to me..  I think it
also means that at fork() track_pfn_copy() will call
memtype_kernel_map_sync_pfn_range() one more time, even if we're 100% sure
the pgprot will be the same for the kernel mappings..

I wonder whether there's some way the untrack-pfn framework could avoid
relying on the pgtable to fetch the pfn, because VFIO MMIO region
protection will also zap pgtables like that in the near future, AFAICT.
The pgprot part should be easy to fetch: get_pat_info() could fall back to
the vma's pgprot if no mapping is found; the only outlier should be CoW
pages in reality.  The pfn is the real issue so far, as either
track_pfn_copy() or untrack_pfn() may need to know the pfn to untrack,
even if it only has the vma information.
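
(A hypothetical sketch of that fallback idea, not existing code: derive
the pgprot from the vma when the pgtable is already empty, and recover the
start pfn from vma->vm_pgoff for linear pfnmaps; the CoW outlier mentioned
above is ignored here:)

/* Hypothetical sketch of the fallback discussed above */
static int get_pat_info_fallback(struct vm_area_struct *vma,
				 resource_size_t *paddr, pgprot_t *pgprot)
{
	unsigned long prot;

	if (!follow_phys(vma, vma->vm_start, 0, &prot, paddr)) {
		*pgprot = __pgprot(prot);
		return 0;
	}
	/* Pgtable already zapped: fall back to the vma's pgprot ... */
	*pgprot = vma->vm_page_prot;
	/* ... and, for linear pfnmaps, recover the start pfn from vm_pgoff */
	*paddr = (resource_size_t)vma->vm_pgoff << PAGE_SHIFT;
	return 0;
}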

Thanks,
Yan Zhao July 16, 2024, 9:13 a.m. UTC | #8
On Mon, Jul 15, 2024 at 10:29:59PM +0800, Peter Xu wrote:
> On Mon, Jul 15, 2024 at 03:08:58PM +0800, Yan Zhao wrote:
> > On Fri, Jul 12, 2024 at 10:42:44AM -0400, Peter Xu wrote:
> > > [...]
> > > @@ -1827,9 +1827,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> > >  	if (vma->vm_file)
> > >  		uprobe_munmap(vma, start, end);
> > >  
> > > -	if (unlikely(vma->vm_flags & VM_PFNMAP))
> > > -		untrack_pfn(vma, 0, 0, mm_wr_locked);
> > > -
> > Specifically to VFIO's case, looks it doesn't matter if untrack_pfn() is
> > called here, since remap_pfn_range() is not called in mmap() and fault
> > handler, and therefore (vma->vm_flags & VM_PAT) is always 0.
> 
> Right when with current repo, but I'm thinking maybe we should have VM_PAT
> there..
Yes, I agree.

But, currently for VFIO, it cannot call io_remap_pfn_range() in the fault
handler since vm_flags_set() requires mmap lock held for write while
the fault handler can only hold mmap lock for read.
So, it relies on ioremap()/iounmap() to reserve/de-reserve memtypes,
without VM_PAT being set in vma.
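
To illustrate the constraint (a minimal sketch; the driver names are
hypothetical, only the mm helpers are real):

	/* mmap() hook: called with the mmap lock held for write, so
	 * vm_flags_set() (and hence io_remap_pfn_range()) is legal here. */
	static int my_dev_mmap(struct file *file, struct vm_area_struct *vma)
	{
		vm_flags_set(vma, VM_IO | VM_PFNMAP |
				  VM_DONTEXPAND | VM_DONTDUMP);
		vma->vm_ops = &my_dev_vm_ops;
		return 0;
	}

	/* fault handler: only the mmap read lock is held, so it must not
	 * call vm_flags_set()/io_remap_pfn_range(); vmf_insert_pfn() is
	 * the read-lock-safe way to install a single pte. */
	static vm_fault_t my_dev_fault(struct vm_fault *vmf)
	{
		return vmf_insert_pfn(vmf->vma, vmf->address,
				      my_dev_base_pfn(vmf->vma) + vmf->pgoff);
	}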

> 
> See what reserve_pfn_range() does right now: it'll make sure only one owner
> reserve this area, e.g. memtype_reserve() will fail if it has already been
> reserved.  Then when succeeded as the first one to reserve the region,
> it'll make sure this mem type to also be synchronized to the kernel map
> (memtype_kernel_map_sync).
> 
> So I feel like when MMIO disabled for a VFIO card, we may need to call
> reserve_pfn_range(), telling the kernel the mem type VFIO uses, even if the
> pgtable will be empty, and even if currently it's always UC- for now:
> 
> vfio_pci_core_mmap():
> 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> 
> Then the memtype will be reserved even if it cannot be used.  Otherwise I
> don't know whether there's side effects of kernel identity mapping where it
> mismatch later with what's mapped in the userspace via the vma, when MMIO
> got enabled again: pgtable started to be injected with a different memtype
> that specified only in the vma's pgprot.
Current VFIO relies on pci_iomap() to reserve MMIO pfns as UC-, thus
no VM_PAT in vmas.
 
Calling remap_pfn_range() in the mmap() will cause problem to support
dynamic faulting. Looks there's still a window even with an immediate
unmap after mmap().

Also, calling remap_pfn_range() in mmap() may not meet the requirement of
drivers to dynamic switch backend resources, e.g. as what's in
cxl_mmap_fault(), since one remap_pfn_range() cannot reserve memtypes for
all backend resources at once.

So, is there any way for the driver to reserve memtypes and set VM_PAT in
fault handler?
Peter Xu July 16, 2024, 7:01 p.m. UTC | #9
On Tue, Jul 16, 2024 at 05:13:29PM +0800, Yan Zhao wrote:
> On Mon, Jul 15, 2024 at 10:29:59PM +0800, Peter Xu wrote:
> > On Mon, Jul 15, 2024 at 03:08:58PM +0800, Yan Zhao wrote:
> > > On Fri, Jul 12, 2024 at 10:42:44AM -0400, Peter Xu wrote:
...
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index 4bcd79619574..f57cc304b318 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -1827,9 +1827,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> > > >  	if (vma->vm_file)
> > > >  		uprobe_munmap(vma, start, end);
> > > >  
> > > > -	if (unlikely(vma->vm_flags & VM_PFNMAP))
> > > > -		untrack_pfn(vma, 0, 0, mm_wr_locked);
> > > > -
> > > Specifically to VFIO's case, looks it doesn't matter if untrack_pfn() is
> > > called here, since remap_pfn_range() is not called in mmap() and fault
> > > handler, and therefore (vma->vm_flags & VM_PAT) is always 0.
> > 
> > Right when with current repo, but I'm thinking maybe we should have VM_PAT
> > there..
> Yes, I agree.
> 
> But, currently for VFIO, it cannot call io_remap_pfn_range() in the fault
> handler since vm_flags_set() requires mmap lock held for write while
> the fault handler can only hold mmap lock for read.
> So, it relies on ioremap()/iounmap() to reserve/de-reserve memtypes,
> without VM_PAT being set in vma.

Right, neither vm_flags_set() nor io_remap_pfn_range() should be called in
a fault handler.  They should only be called in mmap() phase.

When you mentioned ioremap(), it's only for the VGA device, right?  I don't
see it's used in the vfio-pci's fault handler.

> 
> > 
> > See what reserve_pfn_range() does right now: it'll make sure only one owner
> > reserve this area, e.g. memtype_reserve() will fail if it has already been
> > reserved.  Then when succeeded as the first one to reserve the region,
> > it'll make sure this mem type to also be synchronized to the kernel map
> > (memtype_kernel_map_sync).
> > 
> > So I feel like when MMIO disabled for a VFIO card, we may need to call
> > reserve_pfn_range(), telling the kernel the mem type VFIO uses, even if the
> > pgtable will be empty, and even if currently it's always UC- for now:
> > 
> > vfio_pci_core_mmap():
> > 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > 
> > Then the memtype will be reserved even if it cannot be used.  Otherwise I
> > don't know whether there's side effects of kernel identity mapping where it
> > mismatch later with what's mapped in the userspace via the vma, when MMIO
> > got enabled again: pgtable started to be injected with a different memtype
> > that specified only in the vma's pgprot.
> Current VFIO relies on pci_iomap() to reserve MMIO pfns as UC-, thus
> no VM_PAT in vmas.
>  
> Calling remap_pfn_range() in the mmap() will cause problem to support
> dynamic faulting. Looks there's still a window even with an immediate
> unmap after mmap().

It can be conditionally injected.  See Alex's commit (not yet posted
anywhere, only used in our internal testing so far):

https://github.com/xzpeter/linux/commit/f432e0e46c34e493943034be4cb9d6eb638f57d1

> 
> Also, calling remap_pfn_range() in mmap() may not meet the requirement of
> drivers to dynamic switch backend resources, e.g. as what's in
> cxl_mmap_fault(), since one remap_pfn_range() cannot reserve memtypes for
> all backend resources at once.
> 
> So, is there any way for the driver to reserve memtypes and set VM_PAT in
> fault handler?

I must confess I'm not familiar with memtype stuff, and just started
looking recently.

Per my reading so far, we have these multiple ways of doing memtype
reservations, and no matter which API we use (ioremap / track_pfn_remap /
pci_iomap), the same memtype needs to be used otherwise the reservation
will fail.  Here ioremap / pci_iomap will involve one more vmap so that the
virtual mapping will present already after the API returns.

Then here IMHO track_pfn_remap() is the one that we should still use in
page-level mmap() controls, because so far we don't want to map any MMIO
range yet, however we want to say "hey this VMA maps something special", by
reserving these memtype and set VM_PAT.  We want the virtual mapping only
later during a fault().

In short, it looks to me the right thing we should do is to manually invoke
track_pfn_remap() in vfio-pci's mmap() hook.

	if (!vdev->pm_runtime_engaged && __vfio_pci_memory_enabled(vdev))
		ret = remap_pfn_range(vma, vma->vm_start, pfn,
				      req_len, vma->vm_page_prot);
	else
                // TODO: manually invoke track_pfn_remap() here
		vm_flags_set(vma, VM_IO | VM_PFNMAP |
				  VM_DONTEXPAND | VM_DONTDUMP);
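
As a minimal sketch of that TODO (assuming the same "pfn" and "req_len"
already computed in vfio_pci_core_mmap(); untested, error handling
omitted), the else branch could become:

	else {
		/* Reserve the memtype and set VM_PAT up front without
		 * installing any ptes; fault() will fill those in later.
		 * track_pfn_remap() sets VM_PAT itself when the range
		 * covers the whole vma. */
		ret = track_pfn_remap(vma, &vma->vm_page_prot, pfn,
				      vma->vm_start, req_len);
		vm_flags_set(vma, VM_IO | VM_PFNMAP |
				  VM_DONTEXPAND | VM_DONTDUMP);
	}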

Then the vma is registered, and untrack_pfn() should be automatically done
when vma unmaps (and that relies on this patch to not do that too early).
From that POV, I still think this patch does the right thing and should be
merged, even if we probably have one more issue to fix as David Wang
reported..

I'll need to think about how to do that right, as I think that'll be needed
as long as pfnmaps will support fault()s: it means when munmap() the
pgtable may not be present, and PAT cannot rely on walking the pgtable to know
the base PFN anymore.
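
For context, a paraphrased sketch of the x86 side (see untrack_pfn() and
get_pat_info() in arch/x86/mm/pat/memtype.c for the real code; this is
from memory, not verbatim):

	void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
			 unsigned long size, bool mm_wr_locked)
	{
		resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;

		if (vma && !(vma->vm_flags & VM_PAT))
			return;

		if (!paddr && !size) {
			/* The pgtable walk in question: get_pat_info()
			 * uses follow_pte() to recover the base pfn, and
			 * that breaks once the ptes are already zapped. */
			if (get_pat_info(vma, &paddr, NULL))
				return;
			size = vma->vm_end - vma->vm_start;
		}
		free_pfn_range(paddr, size);
		/* ... VM_PAT is then cleared on the vma ... */
	}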

Thanks,
Yan Zhao July 17, 2024, 1:38 a.m. UTC | #10
On Tue, Jul 16, 2024 at 03:01:50PM -0400, Peter Xu wrote:
> On Tue, Jul 16, 2024 at 05:13:29PM +0800, Yan Zhao wrote:
> > On Mon, Jul 15, 2024 at 10:29:59PM +0800, Peter Xu wrote:
> > > On Mon, Jul 15, 2024 at 03:08:58PM +0800, Yan Zhao wrote:
> > > > On Fri, Jul 12, 2024 at 10:42:44AM -0400, Peter Xu wrote:
...
> > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > index 4bcd79619574..f57cc304b318 100644
> > > > > --- a/mm/memory.c
> > > > > +++ b/mm/memory.c
> > > > > @@ -1827,9 +1827,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> > > > >  	if (vma->vm_file)
> > > > >  		uprobe_munmap(vma, start, end);
> > > > >  
> > > > > -	if (unlikely(vma->vm_flags & VM_PFNMAP))
> > > > > -		untrack_pfn(vma, 0, 0, mm_wr_locked);
> > > > > -
> > > > Specifically to VFIO's case, looks it doesn't matter if untrack_pfn() is
> > > > called here, since remap_pfn_range() is not called in mmap() and fault
> > > > handler, and therefore (vma->vm_flags & VM_PAT) is always 0.
> > > 
> > > Right when with current repo, but I'm thinking maybe we should have VM_PAT
> > > there..
> > Yes, I agree.
> > 
> > But, currently for VFIO, it cannot call io_remap_pfn_range() in the fault
> > handler since vm_flags_set() requires mmap lock held for write while
> > the fault handler can only hold mmap lock for read.
> > So, it relies on ioremap()/iounmap() to reserve/de-reserve memtypes,
> > without VM_PAT being set in vma.
> 
> Right, neither vm_flags_set() nor io_remap_pfn_range() should be called in
> a fault handler.  They should only be called in mmap() phase.
> 
> When you mentioned ioremap(), it's only for the VGA device, right?  I don't
> see it's used in the vfio-pci's fault handler.
Oh, it's pci_iomap() in the vfio-pci's fault handler, another version of
ioremap() :)

> > 
> > > 
> > > See what reserve_pfn_range() does right now: it'll make sure only one owner
> > > reserve this area, e.g. memtype_reserve() will fail if it has already been
> > > reserved.  Then when succeeded as the first one to reserve the region,
> > > it'll make sure this mem type to also be synchronized to the kernel map
> > > (memtype_kernel_map_sync).
> > > 
> > > So I feel like when MMIO disabled for a VFIO card, we may need to call
> > > reserve_pfn_range(), telling the kernel the mem type VFIO uses, even if the
> > > pgtable will be empty, and even if currently it's always UC- for now:
> > > 
> > > vfio_pci_core_mmap():
> > > 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > > 
> > > Then the memtype will be reserved even if it cannot be used.  Otherwise I
> > > don't know whether there's side effects of kernel identity mapping where it
> > > mismatch later with what's mapped in the userspace via the vma, when MMIO
> > > got enabled again: pgtable started to be injected with a different memtype
> > > that specified only in the vma's pgprot.
> > Current VFIO relies on pci_iomap() to reserve MMIO pfns as UC-, thus
> > no VM_PAT in vmas.
> >  
> > Calling remap_pfn_range() in the mmap() will cause problem to support
> > dynamic faulting. Looks there's still a window even with an immediate
> > unmap after mmap().
> 
> It can be conditionally injected.  See Alex's commit (not yet posted
> anywhere, only used in our internal testing so far):
> 
> https://github.com/xzpeter/linux/commit/f432e0e46c34e493943034be4cb9d6eb638f57d1
>
I'm a bit confused by the code.
It looks like it will do the remap_pfn_range() in mmap() if hardware is ready,
and will just set vma flags if hardware is not ready.

But if the hardware is not ready in mmap(), which code in the fault handler
will reserve pfn memtypes?

> > 
> > Also, calling remap_pfn_range() in mmap() may not meet the requirement of
> > drivers to dynamic switch backend resources, e.g. as what's in
> > cxl_mmap_fault(), since one remap_pfn_range() cannot reserve memtypes for
> > all backend resources at once.
> > 
> > So, is there any way for the driver to reserve memtypes and set VM_PAT in
> > fault handler?
> 
> I must confess I'm not familiar with memtype stuff, and just started
> looking recently.
> 
> Per my reading so far, we have these multiple ways of doing memtype
> reservations, and no matter which API we use (ioremap / track_pfn_remap /
> pci_iomap), the same memtype needs to be used otherwise the reservation
> will fail.  Here ioremap / pci_iomap will involve one more vmap so that the
> virtual mapping will present already after the API returns.
Right, I understand in the same way :)

> 
> Then here IMHO track_pfn_remap() is the one that we should still use in
> page-level mmap() controls, because so far we don't want to map any MMIO
> range yet, however we want to say "hey this VMA maps something special", by
> reserving these memtype and set VM_PAT.  We want the virtual mapping only
> later during a fault().
> 
> In short, it looks to me the right thing we should do is to manually invoke
> track_pfn_remap() in vfio-pci's mmap() hook.
> 
> 	if (!vdev->pm_runtime_engaged && __vfio_pci_memory_enabled(vdev))
> 		ret = remap_pfn_range(vma, vma->vm_start, pfn,
> 				      req_len, vma->vm_page_prot);
> 	else
>                 // TODO: manually invoke track_pfn_remap() here
> 		vm_flags_set(vma, VM_IO | VM_PFNMAP |
> 				  VM_DONTEXPAND | VM_DONTDUMP);
What if we have to remap another set of pfns in the "else" case?

> 
> Then the vma is registered, and untrack_pfn() should be automatically done
> when vma unmaps (and that relies on this patch to not do that too early).
Well, I'm concerned by one use case.

1. The driver calls remap_pfn_range() to reserve memtype for pfn1 + add
   VM_PAT flag.
2. Then unmap_single_vma() is called. With this patch, the pfn1 memtype is
   still reserved.
3. The fault handler calls vmf_insert_pfn() for another pfn2.
4. unmap_vmas() is called. However, untrack_pfn() will only find pfn2
   instead of the previous pfn1.


> From that POV, I still think this patch does the right thing and should be
> merged, even if we probably have one more issue to fix as David Wang
> reported..
> 
> I'll need to think about how to do that right, as I think that'll be needed
> as long as pfnmaps will support fault()s: it means when munmap() the
> > pgtable may not be present, and PAT cannot rely on walking the pgtable to know
> the base PFN anymore.
> 
> Thanks,
> 
> -- 
> Peter Xu
>
David Hildenbrand July 17, 2024, 2:14 p.m. UTC | #11
[catching up on mails]

>> indicates that file truncation seems to end up messing with a PFNMAP mapping
>> that has PAT set. That is ... weird. I would have thought that PFNMAP would
>> never really happen with file truncation.
>>
>> Does this only happen with an OOT driver, that seems to do weird truncate
>> stuff on files that have a PFNMAP mapping?
>>
>> [1]
>> https://lore.kernel.org/all/3879ee72-84de-4d2a-93a8-c0b3dc3f0a4c@redhat.com/
> 
> Ohhh.. I guess this will also stop working in VFIO, but I think it's fine
> for now because as Yan pointed out VFIO PCI doesn't register those regions
> now so VM_PAT is not yet set..

Interesting, I was assuming that VFIO might be relying on that.

> 
> And one thing I said wrong in the previous reply to Yan is, obviously
> memtype_check_insert() can work with >1 owners as long as the memtype
> matches.. and that's how fork() works where VM_PAT needs to be duplicated.
> But this whole thing is a bit confusing to me..  As I think it also means
> when fork the track_pfn_copy() will call memtype_kernel_map_sync one more
> time even if we're 100% sure the pgprot will be the same for the kernel
> mappings..

I consider the VM_PAT code quite ugly and I wish we could just get rid 
of it (especially, the automatic "entire VMA covered" handling thingy).

> 
> I wonder whether there's some way that untrack pfn framework doesn't need
> to rely on the pgtable to fetch the pfn, because VFIO MMIO region
> protection will also do that in the near future, AFAICT.  The pgprot part
> should be easy there to fetch: get_pat_info() should fallback to vma's
> pgprot if no mapping found; the only outlier should be CoW pages in
> reality.  The pfn is the real issue so far, so that either track_pfn_copy()
> or untrack_pfn() may need to know the pfn to untrack, even if it only has
> the vma information.

I had a prototype to store that information per VMA to avoid the page 
table lookup. VMA splitting was a bit "added complication", but I got it 
to work. (maybe I can still find it if there is demand)

The downside was having to consume more memory (even if only 8 bytes) for 
all VMAs in the system, even though only a handful of VMAs in the 
system could be VM_PAT. I decided that's not what we want. I managed to 
avoid that memory consumption in some configurations, but not in all, so I 
discarded that approach.

I did not explore storing that information in some auxiliary data structure.

IMHO the whole VM_PAT model is weird:

1) mmap()
2) remap_pfn_range(): if it covers the whole VMA apply some magic
    reservation.
3) munmap(): we unmap *all* PFNs and, therefore, clean up VM_PAT

(VMA splitting make the whole model weirder, but it works, because we 
never merge these VMAs)

This model cannot properly work if we get partial page table zapping via 
truncation/MADV_DONTNEED or similar things after 2). And likely we also 
shouldn't be doing it that way. We should forbid any partial unmappings 
in that model, just like we already disallow MADV_DONTNEED as you note.

As you mention in your other comment, maybe the relevant (or all?) callers 
should just manage the PAT side independently. So maybe we can move to a 
different model.
Peter Xu July 17, 2024, 2:15 p.m. UTC | #12
On Wed, Jul 17, 2024 at 09:38:34AM +0800, Yan Zhao wrote:
> On Tue, Jul 16, 2024 at 03:01:50PM -0400, Peter Xu wrote:
> > On Tue, Jul 16, 2024 at 05:13:29PM +0800, Yan Zhao wrote:
> > > On Mon, Jul 15, 2024 at 10:29:59PM +0800, Peter Xu wrote:
> > > > On Mon, Jul 15, 2024 at 03:08:58PM +0800, Yan Zhao wrote:
> > > > > On Fri, Jul 12, 2024 at 10:42:44AM -0400, Peter Xu wrote:
...
> > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > index 4bcd79619574..f57cc304b318 100644
> > > > > > --- a/mm/memory.c
> > > > > > +++ b/mm/memory.c
> > > > > > @@ -1827,9 +1827,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> > > > > >  	if (vma->vm_file)
> > > > > >  		uprobe_munmap(vma, start, end);
> > > > > >  
> > > > > > -	if (unlikely(vma->vm_flags & VM_PFNMAP))
> > > > > > -		untrack_pfn(vma, 0, 0, mm_wr_locked);
> > > > > > -
> > > > > Specifically to VFIO's case, looks it doesn't matter if untrack_pfn() is
> > > > > called here, since remap_pfn_range() is not called in mmap() and fault
> > > > > handler, and therefore (vma->vm_flags & VM_PAT) is always 0.
> > > > 
> > > > Right when with current repo, but I'm thinking maybe we should have VM_PAT
> > > > there..
> > > Yes, I agree.
> > > 
> > > But, currently for VFIO, it cannot call io_remap_pfn_range() in the fault
> > > handler since vm_flags_set() requires mmap lock held for write while
> > > the fault handler can only hold mmap lock for read.
> > > So, it relies on ioremap()/iounmap() to reserve/de-reserve memtypes,
> > > without VM_PAT being set in vma.
> > 
> > Right, neither vm_flags_set() nor io_remap_pfn_range() should be called in
> > a fault handler.  They should only be called in mmap() phase.
> > 
> > When you mentioned ioremap(), it's only for the VGA device, right?  I don't
> > see it's used in the vfio-pci's fault handler.
> Oh, it's pci_iomap() in the vfio-pci's fault handler, another version of
> ioremap() :)

Right.  To be explicit, I think it's in mmap(), and it looks like Alex has a
comment for that:

	/*
	 * Even though we don't make use of the barmap for the mmap,
	 * we need to request the region and the barmap tracks that.
	 */

Maybe in 2012 that's a must?  PS: when looking, that looks like a proper
place to reuse vfio_pci_core_setup_barmap() also in the mmap() function.

> 
> > > 
> > > > 
> > > > See what reserve_pfn_range() does right now: it'll make sure only one owner
> > > > reserve this area, e.g. memtype_reserve() will fail if it has already been
> > > > reserved.  Then when succeeded as the first one to reserve the region,
> > > > it'll make sure this mem type to also be synchronized to the kernel map
> > > > (memtype_kernel_map_sync).
> > > > 
> > > > So I feel like when MMIO disabled for a VFIO card, we may need to call
> > > > reserve_pfn_range(), telling the kernel the mem type VFIO uses, even if the
> > > > pgtable will be empty, and even if currently it's always UC- for now:
> > > > 
> > > > vfio_pci_core_mmap():
> > > > 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > > > 
> > > > Then the memtype will be reserved even if it cannot be used.  Otherwise I
> > > > don't know whether there's side effects of kernel identity mapping where it
> > > > mismatch later with what's mapped in the userspace via the vma, when MMIO
> > > > got enabled again: pgtable started to be injected with a different memtype
> > > > that specified only in the vma's pgprot.
> > > Current VFIO relies on pci_iomap() to reserve MMIO pfns as UC-, thus
> > > no VM_PAT in vmas.
> > >  
> > > Calling remap_pfn_range() in the mmap() will cause problem to support
> > > dynamic faulting. Looks there's still a window even with an immediate
> > > unmap after mmap().
> > 
> > It can be conditionally injected.  See Alex's commit (not yet posted
> > anywhere, only used in our internal testing so far):
> > 
> > https://github.com/xzpeter/linux/commit/f432e0e46c34e493943034be4cb9d6eb638f57d1
> >
> I'm a bit confused by the code.
> It looks like it will do the remap_pfn_range() in mmap() if hardware is ready,
> and will just set vma flags if hardware is not ready.
> 
> But if the hardware is not ready in mmap(), which code in the fault handler
> will reserve pfn memtypes?

I thought I answered that below.. :)  I think we should use track_pfn_remap().

> 
> > > 
> > > Also, calling remap_pfn_range() in mmap() may not meet the requirement of
> > > drivers to dynamic switch backend resources, e.g. as what's in
> > > cxl_mmap_fault(), since one remap_pfn_range() cannot reserve memtypes for
> > > all backend resources at once.
> > > 
> > > So, is there any way for the driver to reserve memtypes and set VM_PAT in
> > > fault handler?
> > 
> > I must confess I'm not familiar with memtype stuff, and just started
> > looking recently.
> > 
> > Per my reading so far, we have these multiple ways of doing memtype
> > reservations, and no matter which API we use (ioremap / track_pfn_remap /
> > pci_iomap), the same memtype needs to be used otherwise the reservation
> > will fail.  Here ioremap / pci_iomap will involve one more vmap so that the
> > virtual mapping will present already after the API returns.
> Right, I understand in the same way :)
> 
> > 
> > Then here IMHO track_pfn_remap() is the one that we should still use in
> > page-level mmap() controls, because so far we don't want to map any MMIO
> > range yet, however we want to say "hey this VMA maps something special", by
> > reserving these memtype and set VM_PAT.  We want the virtual mapping only
> > later during a fault().
> > 
> > In short, it looks to me the right thing we should do is to manually invoke
> > track_pfn_remap() in vfio-pci's mmap() hook.
> > 
> > 	if (!vdev->pm_runtime_engaged && __vfio_pci_memory_enabled(vdev))
> > 		ret = remap_pfn_range(vma, vma->vm_start, pfn,
> > 				      req_len, vma->vm_page_prot);
> > 	else
> >                 // TODO: manually invoke track_pfn_remap() here
> > 		vm_flags_set(vma, VM_IO | VM_PFNMAP |
> > 				  VM_DONTEXPAND | VM_DONTDUMP);
> What if we have to remap another set of pfns in the "else" case?

Use vmf_insert_pfn*(): these helpers do not reserve the memtype but only
look it up.
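
For instance (a sketch; the pfn selection helper is hypothetical), the
fault path can point the same vma at a different set of pfns without
touching the memtype reservation:

	static vm_fault_t my_fault(struct vm_fault *vmf)
	{
		/* pick whichever backend resource is active right now */
		unsigned long pfn = active_base_pfn(vmf->vma) + vmf->pgoff;

		/* vmf_insert_pfn() only looks up the memtype reserved at
		 * mmap() time; it does not create a new reservation. */
		return vmf_insert_pfn(vmf->vma, vmf->address, pfn);
	}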

> 
> > 
> > Then the vma is registered, and untrack_pfn() should be automatically done
> > when vma unmaps (and that relies on this patch to not do that too early).
> Well, I'm concerned by one use case.
> 
> 1. The driver calls remap_pfn_range() to reserve memtype for pfn1 + add
>    VM_PAT flag.
> 2. Then unmap_single_vma() is called. With this patch, the pfn1 memtype is
>    still reserved.
> 3. The fault handler calls vmf_insert_pfn() for another pfn2.
> 4. unmap_vmas() is called. However, untrack_pfn() will only find pfn2
> >    instead of the previous pfn1.

It depends on what your exact question is.  If it's about the pgtable: I
don't think munmap() requires PFNMAP pgtables to always exist, and it'll
simply skip none entries.

And if it's about PAT tracking: that is exactly what I'm talking about
below..  where I think untracking shouldn't rely on pgtables.

I'll copy you too if I prepare some patches for the latter.  With that
patchset (plus this patch) it should fix David Wang's issue and similar
ones, AFAICT.

Thanks,

> 
> 
> > From that POV, I still think this patch does the right thing and should be
> > merged, even if we probably have one more issue to fix as David Wang
> > reported..
> > 
> > I'll need to think about how to do that right, as I think that'll be needed
> > as long as pfnmaps will support fault()s: it means when munmap() the
> > pgtable may not be present, and PAT cannot rely on walking the pgtable to know
> > the base PFN anymore.
> > 
> > Thanks,
> > 
> > -- 
> > Peter Xu
> > 
>
David Hildenbrand July 17, 2024, 2:17 p.m. UTC | #13
On 16.07.24 21:01, Peter Xu wrote:
> On Tue, Jul 16, 2024 at 05:13:29PM +0800, Yan Zhao wrote:
>> On Mon, Jul 15, 2024 at 10:29:59PM +0800, Peter Xu wrote:
>>> On Mon, Jul 15, 2024 at 03:08:58PM +0800, Yan Zhao wrote:
>>>> On Fri, Jul 12, 2024 at 10:42:44AM -0400, Peter Xu wrote:
...
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 4bcd79619574..f57cc304b318 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -1827,9 +1827,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>>>>>   	if (vma->vm_file)
>>>>>   		uprobe_munmap(vma, start, end);
>>>>>   
>>>>> -	if (unlikely(vma->vm_flags & VM_PFNMAP))
>>>>> -		untrack_pfn(vma, 0, 0, mm_wr_locked);
>>>>> -
>>>> Specifically to VFIO's case, looks it doesn't matter if untrack_pfn() is
>>>> called here, since remap_pfn_range() is not called in mmap() and fault
>>>> handler, and therefore (vma->vm_flags & VM_PAT) is always 0.
>>>
>>> Right when with current repo, but I'm thinking maybe we should have VM_PAT
>>> there..
>> Yes, I agree.
>>
>> But, currently for VFIO, it cannot call io_remap_pfn_range() in the fault
>> handler since vm_flags_set() requires mmap lock held for write while
>> the fault handler can only hold mmap lock for read.
>> So, it relies on ioremap()/iounmap() to reserve/de-reserve memtypes,
>> without VM_PAT being set in vma.
> 
> Right, neither vm_flags_set() nor io_remap_pfn_range() should be called in
> a fault handler.  They should only be called in mmap() phase.
> 
> When you mentioned ioremap(), it's only for the VGA device, right?  I don't
> see it's used in the vfio-pci's fault handler.
> 
>>
>>>
>>> See what reserve_pfn_range() does right now: it'll make sure only one owner
>>> reserve this area, e.g. memtype_reserve() will fail if it has already been
>>> reserved.  Then when succeeded as the first one to reserve the region,
>>> it'll make sure this mem type to also be synchronized to the kernel map
>>> (memtype_kernel_map_sync).
>>>
>>> So I feel like when MMIO disabled for a VFIO card, we may need to call
>>> reserve_pfn_range(), telling the kernel the mem type VFIO uses, even if the
>>> pgtable will be empty, and even if currently it's always UC- for now:
>>>
>>> vfio_pci_core_mmap():
>>> 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
>>>
>>> Then the memtype will be reserved even if it cannot be used.  Otherwise I
>>> don't know whether there's side effects of kernel identity mapping where it
>>> mismatch later with what's mapped in the userspace via the vma, when MMIO
>>> got enabled again: pgtable started to be injected with a different memtype
>>> that specified only in the vma's pgprot.
>> Current VFIO relies on pci_iomap() to reserve MMIO pfns as UC-, thus
>> no VM_PAT in vmas.
>>   
>> Calling remap_pfn_range() in the mmap() will cause problem to support
>> dynamic faulting. Looks there's still a window even with an immediate
>> unmap after mmap().
> 
> It can be conditionally injected.  See Alex's commit (not yet posted
> anywhere, only used in our internal testing so far):
> 
> https://github.com/xzpeter/linux/commit/f432e0e46c34e493943034be4cb9d6eb638f57d1
> 
>>
>> Also, calling remap_pfn_range() in mmap() may not meet the requirement of
>> drivers to dynamic switch backend resources, e.g. as what's in
>> cxl_mmap_fault(), since one remap_pfn_range() cannot reserve memtypes for
>> all backend resources at once.
>>
>> So, is there any way for the driver to reserve memtypes and set VM_PAT in
>> fault handler?
> 
> I must confess I'm not familiar with memtype stuff, and just started
> looking recently.
> 
> Per my reading so far, we have these multiple ways of doing memtype
> reservations, and no matter which API we use (ioremap / track_pfn_remap /
> pci_iomap), the same memtype needs to be used otherwise the reservation
> will fail.  Here ioremap / pci_iomap will involve one more vmap so that the
> virtual mapping will present already after the API returns.
> 
> Then here IMHO track_pfn_remap() is the one that we should still use in
> page-level mmap() controls, because so far we don't want to map any MMIO
> range yet, however we want to say "hey this VMA maps something special", by
> reserving these memtype and set VM_PAT.  We want the virtual mapping only
> later during a fault().
> 
> In short, it looks to me the right thing we should do is to manually invoke
> track_pfn_remap() in vfio-pci's mmap() hook.
> 
> 	if (!vdev->pm_runtime_engaged && __vfio_pci_memory_enabled(vdev))
> 		ret = remap_pfn_range(vma, vma->vm_start, pfn,
> 				      req_len, vma->vm_page_prot);
> 	else
>                  // TODO: manually invoke track_pfn_remap() here
> 		vm_flags_set(vma, VM_IO | VM_PFNMAP |
> 				  VM_DONTEXPAND | VM_DONTDUMP);
> 
> Then the vma is registered, and untrack_pfn() should be automatically done
> when vma unmaps (and that relies on this patch to not do that too early).
>  From that POV, I still think this patch does the right thing and should be
> merged, even if we probably have one more issue to fix as David Wang
> reported..

Especially if it only applies to OOT drivers. Do we have reports from 
in-tree drivers?

I'd be curious how a previous truncation on a file can tear down a 
PFNMAP mapping -- and if there are simple ways to forbid that (if possible).
Peter Xu July 17, 2024, 4:27 p.m. UTC | #14
On Wed, Jul 17, 2024 at 04:14:15PM +0200, David Hildenbrand wrote:
> [catching up on mails]
> 
> > > indicates that file truncation seems to end up messing with a PFNMAP mapping
> > > that has PAT set. That is ... weird. I would have thought that PFNMAP would
> > > never really happen with file truncation.
> > > 
> > > Does this only happen with an OOT driver, that seems to do weird truncate
> > > stuff on files that have a PFNMAP mapping?
> > > 
> > > [1]
> > > https://lore.kernel.org/all/3879ee72-84de-4d2a-93a8-c0b3dc3f0a4c@redhat.com/
> > 
> > Ohhh.. I guess this will also stop working in VFIO, but I think it's fine
> > for now because as Yan pointed out VFIO PCI doesn't register those regions
> > now so VM_PAT is not yet set..
> 
> Interesting, I was assuming that VFIO might be relying on that.
> 
> > 
> > And one thing I said wrong in the previous reply to Yan is, obviously
> > memtype_check_insert() can work with >1 owners as long as the memtype
> > matches.. and that's how fork() works where VM_PAT needs to be duplicated.
> > But this whole thing is a bit confusing to me..  As I think it also means
> > when fork the track_pfn_copy() will call memtype_kernel_map_sync one more
> > time even if we're 100% sure the pgprot will be the same for the kernel
> > mappings..
> 
> I consider the VM_PAT code quite ugly and I wish we could just get rid of it
> (especially, the automatic "entire VMA covered" handling thingy).

Yep, I agree.

> 
> > 
> > I wonder whether there's some way that untrack pfn framework doesn't need
> > to rely on the pgtable to fetch the pfn, because VFIO MMIO region
> > protection will also do that in the near future, AFAICT.  The pgprot part
> > should be easy there to fetch: get_pat_info() should fallback to vma's
> > pgprot if no mapping found; the only outlier should be CoW pages in
> > reality.  The pfn is the real issue so far, so that either track_pfn_copy()
> > or untrack_pfn() may need to know the pfn to untrack, even if it only has
> > the vma information.
> 
> I had a prototype to store that information per VMA to avoid the page table
> lookup. VMA splitting was a bit "added complication", but I got it to work.
> (maybe I can still find it if there is demand)
> 
> The downside was having to consume more memory (even if only 8 bytes) for
> all VMAs in the system, even though only a handful of VMAs in the system
> could be VM_PAT. I decided that's not what we want. I managed to avoid
> that memory consumption in some configurations, but not in all, so I
> discarded that approach.
> 
> I did not explore storing that information in some auxiliary data structure.

One idea to avoid that is to let drivers opt in to providing such
information, e.g. a hook in vm_operations_struct to fetch the base pfn for
a vma mapping.  But that would require each driver to provide that
information; e.g. for David Wang's case IIUC it's at least an OOT driver,
so an upstream patch with that solution won't fix it (while it should work
for VFIO).
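
For illustration, such an opt-in hook might look like this (entirely
hypothetical; no such hook exists upstream):

	/* in struct vm_operations_struct: return the base pfn that this
	 * pfnmap vma maps, so that untrack_pfn() doesn't need to walk
	 * possibly already-zapped pgtables */
	unsigned long (*pfnmap_base)(struct vm_area_struct *vma);

	/* the PAT side would then prefer the hook over the pgtable walk */
	if (vma->vm_ops && vma->vm_ops->pfnmap_base)
		paddr = (resource_size_t)vma->vm_ops->pfnmap_base(vma)
			<< PAGE_SHIFT;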

> 
> IMHO the whole VM_PAT model is weird:
> 
> 1) mmap()
> 2) remap_pfn_range(): if it covers the whole VMA apply some magic
>    reservation.
> 3) munmap(): we unmap *all* PFNs and, therefore, clean up VM_PAT
> 
> (VMA splitting make the whole model weirder, but it works, because we never
> merge these VMAs)
> 
> This model cannot properly work if we get partial page table zapping via
> truncation/MADV_DONTNEED or similar things after 2). And likely we also
> shouldn't be doing it that way. We should forbid any partial unmappings in
> that model, just like we already disallow MADV_DONTNEED as you note.
> 
> As you mention in your other comment, maybe the relevant (or all?) callers
> should just manage the PAT side independently. So maybe we can move to a
> different model.

Could you elaborate on the new model you're describing?

Thanks,
Peter Xu July 17, 2024, 4:30 p.m. UTC | #15
On Wed, Jul 17, 2024 at 04:17:12PM +0200, David Hildenbrand wrote:
> I'd be curious how a previous truncation on a file can tear down a PFNMAP
> mapping -- and if there are simple ways to forbid that (if possible).

Unfortunately, forbidding it may not work for vfio, as vfio would like to
allow none (or partial) mapping on the bars.. at least so far that's the
plan.
Jason Gunthorpe July 17, 2024, 4:31 p.m. UTC | #16
On Wed, Jul 17, 2024 at 12:30:19PM -0400, Peter Xu wrote:
> On Wed, Jul 17, 2024 at 04:17:12PM +0200, David Hildenbrand wrote:
> > I'd be curious how a previous truncation on a file can tear down a PFNMAP
> > mapping -- and if there are simple ways to forbid that (if possible).
> 
> Unfortunately, forbidding it may not work for vfio, as vfio would like to
> allow none (or partial) mapping on the bars.. at least so far that's the
> plan.

vfio doesn't support truncation? Or does that mean something else
here?

Jason
David Hildenbrand July 17, 2024, 4:32 p.m. UTC | #17
On 17.07.24 18:30, Peter Xu wrote:
> On Wed, Jul 17, 2024 at 04:17:12PM +0200, David Hildenbrand wrote:
>> I'd be curious how a previous truncation on a file can tear down a PFNMAP
>> mapping -- and if there are simple ways to forbid that (if possible).
> 
> Unfortunately, forbidding it may not work for vfio, as vfio would like to
> allow none (or partial) mapping on the bars.. at least so far that's the
> plan.

I think vfio should handle that memtype reservation manually, as you 
proposed. So IIUC that shouldn't be a blocker.
Peter Xu July 17, 2024, 6:10 p.m. UTC | #18
On Wed, Jul 17, 2024 at 01:31:54PM -0300, Jason Gunthorpe wrote:
> On Wed, Jul 17, 2024 at 12:30:19PM -0400, Peter Xu wrote:
> > On Wed, Jul 17, 2024 at 04:17:12PM +0200, David Hildenbrand wrote:
> > > I'd be curious how a previous truncation on a file can tear down a PFNMAP
> > > mapping -- and if there are simple ways to forbid that (if possible).
> > 
> > Unfortunately, forbidding it may not work for vfio, as vfio would like to
> > allow none (or partial) mapping on the bars.. at least so far that's the
> > plan.
> 
> vfio doesn't support truncation? Or does that mean something else
> here?

Here I meant VFIO can explicitly zap pgtables via vfio_pci_zap_bars(),
irrespective of any form of truncation.  I suppose it'll have the same
effect, where we may need to stop assuming PFNMAP vmas will always have
fully installed pgtables.

Thanks,
Peter Xu July 17, 2024, 6:12 p.m. UTC | #19
On Wed, Jul 17, 2024 at 06:32:05PM +0200, David Hildenbrand wrote:
> On 17.07.24 18:30, Peter Xu wrote:
> > On Wed, Jul 17, 2024 at 04:17:12PM +0200, David Hildenbrand wrote:
> > > I'd be curious how a previous truncation on a file can tear down a PFNMAP
> > > mapping -- and if there are simple ways to forbid that (if possible).
> > 
> > Unfortunately, forbidding it may not work for vfio, as vfio would like to
> > allow none (or partial) mapping on the bars.. at least so far that's the
> > plan.
> 
> I think vfio should handle that memtype reservation manually, as you
> proposed. So IIUC that shouldn't be a blocker.

Oh, so maybe I misunderstood here.  As long as we can live with VFIO's vma
(with the pfn tracking on top, but without pgtable installed) I think it
should be fine.

Thanks,
Yan Zhao July 18, 2024, 1:50 a.m. UTC | #20
On Wed, Jul 17, 2024 at 10:15:24PM +0800, Peter Xu wrote:
> On Wed, Jul 17, 2024 at 09:38:34AM +0800, Yan Zhao wrote:
> > On Tue, Jul 16, 2024 at 03:01:50PM -0400, Peter Xu wrote:
> > > On Tue, Jul 16, 2024 at 05:13:29PM +0800, Yan Zhao wrote:
> > > > On Mon, Jul 15, 2024 at 10:29:59PM +0800, Peter Xu wrote:
> > > > > On Mon, Jul 15, 2024 at 03:08:58PM +0800, Yan Zhao wrote:
> > > > > > On Fri, Jul 12, 2024 at 10:42:44AM -0400, Peter Xu wrote:
...
> > > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > > index 4bcd79619574..f57cc304b318 100644
> > > > > > > --- a/mm/memory.c
> > > > > > > +++ b/mm/memory.c
> > > > > > > @@ -1827,9 +1827,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> > > > > > >  	if (vma->vm_file)
> > > > > > >  		uprobe_munmap(vma, start, end);
> > > > > > >  
> > > > > > > -	if (unlikely(vma->vm_flags & VM_PFNMAP))
> > > > > > > -		untrack_pfn(vma, 0, 0, mm_wr_locked);
> > > > > > > -
> > > > > > Specifically to VFIO's case, looks it doesn't matter if untrack_pfn() is
> > > > > > called here, since remap_pfn_range() is not called in mmap() and fault
> > > > > > handler, and therefore (vma->vm_flags & VM_PAT) is always 0.
> > > > > 
> > > > > Right with the current code, but I'm thinking maybe we should have VM_PAT
> > > > > there..
> > > > Yes, I agree.
> > > > 
> > > > But, currently for VFIO, it cannot call io_remap_pfn_range() in the fault
> > > > handler, since vm_flags_set() requires the mmap lock held for write while
> > > > the fault handler can only hold it for read.
> > > > So, it relies on ioremap()/iounmap() to reserve/de-reserve memtypes,
> > > > without VM_PAT being set in the vma.
> > > 
> > > Right, neither vm_flags_set() nor io_remap_pfn_range() should be called in
> > > a fault handler.  They should only be called in the mmap() phase.
> > > 
> > > When you mentioned ioremap(), it's only for the VGA device, right?  I don't
> > > see it used in vfio-pci's fault handler.
> > Oh, it's pci_iomap() in the vfio-pci's fault handler, another version of
> > ioremap() :)
> 
> Right. To be explicit, I think it's in mmap(), and it looks like Alex has a
Yes, in mmap(). (The "fault handler" in the previous reply is a typo :))

> comment for that:
> 

> 	/*
> 	 * Even though we don't make use of the barmap for the mmap,
> 	 * we need to request the region and the barmap tracks that.
> 	 */
> 
> Maybe in 2012 that was a must?

I think ioremap/pci_iomap in mmap() is not a must. At least it's not there in
nvgrace_gpu_mmap().
But nvgrace_gpu_mmap() has remap_pfn_range() and Alex explained that
nvgrace-gpu "only exists on ARM platforms. memtype_reserve() only exists on
x86." [1].

[1] https://lore.kernel.org/all/20240529105012.39756143.alex.williamson@redhat.com/

> PS: while looking, that seems like a proper place to also reuse
> vfio_pci_core_setup_barmap() in the mmap() function.
Yes, looks so.

> > 
> > > > 
> > > > > 
> > > > > See what reserve_pfn_range() does right now: it'll make sure only one owner
> > > > > reserves this area, e.g. memtype_reserve() will fail if it has already been
> > > > > reserved.  Then, having succeeded as the first one to reserve the region,
> > > > > it'll make sure this memtype is also synchronized to the kernel map
> > > > > (memtype_kernel_map_sync).
> > > > > 
> > > > > So I feel like when MMIO is disabled for a VFIO card, we may need to call
> > > > > reserve_pfn_range(), telling the kernel the memtype VFIO uses, even if the
> > > > > pgtable will be empty, and even if it's always UC- for now:
> > > > > 
> > > > > vfio_pci_core_mmap():
> > > > > 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > > > > 
> > > > > Then the memtype will be reserved even if it cannot be used.  Otherwise I
> > > > > don't know whether there are side effects on the kernel identity mapping,
> > > > > where it mismatches later with what's mapped in userspace via the vma when
> > > > > MMIO gets enabled again: the pgtable starts to be injected with a different
> > > > > memtype that is specified only in the vma's pgprot.
> > > > Current VFIO relies on pci_iomap() to reserve MMIO pfns as UC-, thus
> > > > no VM_PAT in vmas.
> > > >  
> > > > Calling remap_pfn_range() in mmap() will cause problems supporting dynamic
> > > > faulting.  Looks like there's still a window even with an immediate unmap
> > > > after mmap().
> > > 
> > > It can be conditionally injected.  See Alex's commit (not yet posted
> > > anywhere, only used in our internal testing so far):
> > > 
> > > https://github.com/xzpeter/linux/commit/f432e0e46c34e493943034be4cb9d6eb638f57d1
> > >
> > I'm a bit confused by the code.
> > It looks like it will do the remap_pfn_range() in mmap() if the hardware is
> > ready, and will just set vma flags if it is not.
> > 
> > But if the hardware is not ready in mmap(), which code in the fault handler
> > will reserve pfn memtypes?
> 
> I thought I answered that below.. :)  I think we should use track_pfn_remap().

Ok. Then if we have two sets of pfns, we can
1. Call remap_pfn_range() in mmap() for pfn set 1.
2. Export track_pfn_remap() and call track_pfn_remap() in mmap() for pfn
   set 2.
3. Unmap and call vmf_insert_pfn() in the fault handler to map pfn set 2.

However, I'm not sure if we can properly untrack both pfn sets 1 and 2.

By exporting untrack_pfn() too? Or will you leave VFIO to use
ioremap()/iounmap() (or their variants) to reserve memtypes by itself?

> 
> > 
> > > > 
> > > > Also, calling remap_pfn_range() in mmap() may not meet the requirement of
> > > > drivers to dynamically switch backend resources, e.g. as in
> > > > cxl_mmap_fault(), since one remap_pfn_range() cannot reserve memtypes for
> > > > all backend resources at once.
> > > > 
> > > > So, is there any way for the driver to reserve memtypes and set VM_PAT in
> > > > the fault handler?
> > > 
> > > I must confess I'm not familiar with memtype stuff, and just started
> > > looking recently.
> > > 
> > > Per my reading so far, we have these multiple ways of doing memtype
> > > reservations, and no matter which API we use (ioremap / track_pfn_remap /
> > > pci_iomap), the same memtype needs to be used, otherwise the reservation
> > > will fail.  Here ioremap / pci_iomap involve one more vmap, so that the
> > > virtual mapping is already present after the API returns.
> > Right, I understand in the same way :)
> > 
> > > 
> > > Then IMHO track_pfn_remap() is the one that we should still use for
> > > page-level mmap() controls, because we don't want to map any MMIO range
> > > yet, but we do want to say "hey, this VMA maps something special" by
> > > reserving the memtype and setting VM_PAT.  We want the virtual mapping
> > > only later, during a fault.
> > > 
> > > In short, it looks to me that the right thing to do is to manually invoke
> > > track_pfn_remap() in vfio-pci's mmap() hook.
> > > 
> > > 	if (!vdev->pm_runtime_engaged && __vfio_pci_memory_enabled(vdev))
> > > 		ret = remap_pfn_range(vma, vma->vm_start, pfn,
> > > 				      req_len, vma->vm_page_prot);
> > > 	else
> > >                 // TODO: manually invoke track_pfn_remap() here
> > > 		vm_flags_set(vma, VM_IO | VM_PFNMAP |
> > > 				  VM_DONTEXPAND | VM_DONTDUMP);
> > What if we have to remap another set of pfns in the "else" case?
> 
> Use vmf_insert_pfn*(): these helpers do not reserve memtypes but only look
> them up.
> 
> > 
> > > 
> > > Then the vma is registered, and untrack_pfn() should be done automatically
> > > when the vma unmaps (and that relies on this patch to not do that too early).
> > Well, I'm concerned about one use case.
> > 
> > 1. The driver calls remap_pfn_range() to reserve memtype for pfn1 + add
> >    VM_PAT flag.
> > 2. Then unmap_single_vma() is called. With this patch, the pfn1 memtype is
> >    still reserved.
> > 3. The fault handler calls vmf_insert_pfn() for another pfn2.
> > 4. unmap_vmas() is called. However, untrack_pfn() will only find pfn2
> >    instead of the previous pfn1.
> 
It depends on what your exact question is.  If it's about pgtables: I don't
think munmap() requires PFNMAP pgtables to always exist, and it'll simply
skip none entries.
> 
> And if it's about PAT tracking: that is exactly what I'm talking about
> below..  where I think untracking shouldn't rely on pgtables.

> I'll copy you too if I prepare some patches for the latter.  With that
> patchset (plus this patch) it should fix David Wang's issue and similar
> ones, AFAICT.
Thank you!

> 
> > 
> > 
> > > From that POV, I still think this patch does the right thing and should be
> > > merged, even if we probably have one more issue to fix as David Wang
> > > reported..
> > > 
> > > I'll need to think about how to do that right, as I think that'll be needed
> > > as long as pfnmaps support fault()s: it means at munmap() time the
> > > pgtable may not be present, and PAT cannot rely on walking the pgtable to
> > > know the base PFN anymore.
> > > 
> > > Thanks,
> > > 
Peter Xu July 18, 2024, 2:03 p.m. UTC | #21
On Thu, Jul 18, 2024 at 09:50:31AM +0800, Yan Zhao wrote:
> Ok. Then if we have two sets of pfns, we can
> 1. Call remap_pfn_range() in mmap() for pfn set 1.

I don't think this will work..  At least from the current implementation,
remap_pfn_range() will only reserve the memtype if the range covers the
whole vma.
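
For reference, the whole-vma special case I mean is in x86's
track_pfn_remap(), roughly (paraphrased from arch/x86/mm/pat/memtype.c):

	/* reserve the whole chunk starting from paddr */
	if (!vma || (addr == vma->vm_start &&
		     size == (vma->vm_end - vma->vm_start))) {
		ret = reserve_pfn_range(paddr, size, prot, 0);
		if (ret == 0 && vma)
			vm_flags_set(vma, VM_PAT);
		return ret;
	}
	/* partial ranges only get a cache-mode lookup, no reservation */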

> 2. Export track_pfn_remap() and call track_pfn_remap() in mmap() for pfn
>    set 2.
> 3. Unmap and call vmf_insert_pfn() in the fault handler to map pfn set 2.

IMO this should be similar to the case of MMIO being disabled on the BAR,
where we can use track_pfn_remap() to register the whole vma in mmap()
first.  Then, if you prefer proactive injection of part of the pfn
mappings, one can do that via vmf_insert_pfn*() in mmap() after the memtype
is registered.
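
To illustrate, an untested sketch of that split (the my_dev_*() names are
hypothetical, and track_pfn_remap() would first need to be exported for
modular drivers):

	static int my_dev_mmap(struct file *file, struct vm_area_struct *vma)
	{
		unsigned long size = vma->vm_end - vma->vm_start;

		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
		vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND |
				  VM_DONTDUMP);
		vma->vm_ops = &my_dev_vm_ops;

		/* Reserve the memtype (and set VM_PAT) for the whole vma
		 * up front, without installing any pgtable entries yet. */
		return track_pfn_remap(vma, &vma->vm_page_prot,
				       my_dev_base_pfn(file),
				       vma->vm_start, size);
	}

	static vm_fault_t my_dev_fault(struct vm_fault *vmf)
	{
		/* Lazily install one pfn; the memtype was reserved at
		 * mmap() time, vmf_insert_pfn() only looks it up. */
		return vmf_insert_pfn(vmf->vma, vmf->address,
				      my_dev_base_pfn(vmf->vma->vm_file) +
				      vmf->pgoff);
	}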

> 
> However, I'm not sure if we can properly untrack both pfn sets 1 and 2.

Untrack should so far only happen per-vma, AFAIU, so "set 1+2" needs to be
done together as they belong to the same vma.

> 
> By exporting untrack_pfn() too? Or will you leave VFIO to use
> ioremap()/iounmap() (or their variants) to reserve memtypes by itself?

Thanks,
Yan Zhao July 18, 2024, 11:18 p.m. UTC | #22
On Thu, Jul 18, 2024 at 10:03:01AM -0400, Peter Xu wrote:
> On Thu, Jul 18, 2024 at 09:50:31AM +0800, Yan Zhao wrote:
> > Ok. Then if we have two sets of pfns, we can
> > 1. Call remap_pfn_range() in mmap() for pfn set 1.
> 
> I don't think this will work..  At least from the current implementation,
> remap_pfn_range() will only reserve the memtype if the range covers the
> whole vma.
Hmm, by referring to pfn set 1 and pfn set 2, I mean that they're both
covering the entire vma, but at different times.

To put it more accurately:

Consider this hypothetical scenario (not the same as what's implemented in
vfio-pci, but seems plausible):

Suppose we have a vma covering only one page, then
(1) Initially, the vma is mapped to pfn1, with remap_pfn_range().
(2) Subsequently, unmap_single_vma() is invoked to unmap the entire VMA.
(3) The driver then maps the entire vma to pfn2 in fault handler

Given this context, my questions are:
1. How can we reserve the memory type for pfn2? Should we call
   track_pfn_remap() in mmap() in advance?
2. How do we untrack the memory type for pfn1 and pfn2, considering they
   belong to the same VMA but mutually exclusively and not concurrently?

Thanks
Yan
David Hildenbrand July 19, 2024, 8:28 a.m. UTC | #23
On 19.07.24 01:18, Yan Zhao wrote:
> On Thu, Jul 18, 2024 at 10:03:01AM -0400, Peter Xu wrote:
>> On Thu, Jul 18, 2024 at 09:50:31AM +0800, Yan Zhao wrote:
>>> Ok. Then if we have two sets of pfns, we can
>>> 1. Call remap_pfn_range() in mmap() for pfn set 1.
>>
>> I don't think this will work..  At least from the current implementation,
>> remap_pfn_range() will only reserve the memtype if the range covers the
>> whole vma.
> Hmm, by referring to pfn set 1 and pfn set 2, I mean that they're both
> covering the entire vma, but at different times.
> 
> To put it more accurately:
> 
> Consider this hypothetical scenario (not the same as what's implemented in
> vfio-pci, but seems plausible):
> 
> Suppose we have a vma covering only one page, then
> (1) Initially, the vma is mapped to pfn1, with remap_pfn_range().
> (2) Subsequently, unmap_single_vma() is invoked to unmap the entire VMA.
> (3) The driver then maps the entire vma to pfn2 in fault handler
> 
> Given this context, my questions are:
> 1. How can we reserve the memory type for pfn2? Should we call
>     track_pfn_remap() in mmap() in advance?
> 2. How do we untrack the memory type for pfn1 and pfn2, considering they
>     belong to the same VMA but mutually exclusively and not concurrently?

Do we really have to support such changing PFNs in a VMA? Are there 
existing use cases that would rely on that?

Would it be a problem if we merged the mmap+track_pfn_remap, such that
such a special VMA can only ever belong to a single, fixed PFN range, but
PFNs can be faulted in lazily?
Peter Xu July 19, 2024, 2:13 p.m. UTC | #24
On Fri, Jul 19, 2024 at 10:28:09AM +0200, David Hildenbrand wrote:
> On 19.07.24 01:18, Yan Zhao wrote:
> > ...
> 
> Do we really have to support such changing PFNs in a VMA? Are there existing
> use cases that would rely on that?

I share the same question as David.  I don't think we support that, and I
don't know whether we should, either.

Such flexibility will already break with the current PAT design.  See:

untrack_pfn:
	if (!paddr && !size) {
		if (get_pat_info(vma, &paddr, NULL))
			return;
		size = vma->vm_end - vma->vm_start;
	}
	free_pfn_range(paddr, size);  <---- assumes PFNs to be continuous

So untrack_pfn() already assumes the pfns are continuous.  I think that
means pfns cannot be randomly faulted in, but must be determined at mmap()
time.

Thanks,
Liam R. Howlett July 20, 2024, 2:18 a.m. UTC | #25
* Peter Xu <peterx@redhat.com> [240712 10:43]:
> This patch is one patch of an old series [1] that got reposted standalone
> here, with the hope to fix some reported untrack_pfn() issues reported
> recently [2,3], where there used to be other fix [4] but unfortunately
> which looks like to cause other issues.  The hope is this patch can fix it
> the right way.
> 
> X86 uses pfn tracking to do pfnmaps.  AFAICT, the tracking should normally
> start at mmap() of device drivers, then untracked when munmap().  However
> in the current code the untrack is done in unmap_single_vma().  This might
> be problematic.
> 
> For example, unmap_single_vma() can be used nowadays even for zapping a
> single page rather than the whole vmas.  It's very confusing to do whole
> vma untracking in this function even if a caller would like to zap one
> page.  It could simply be wrong.
> 
> Such issue won't be exposed by things like MADV_DONTNEED won't ever work
> for pfnmaps and it'll fail the madvise() already before reaching here.
> However looks like it can be triggered like what was reported where invoked
> from an unmap request from a file vma.
> 
> There's also work [5] on VFIO (merged now) to allow tearing down MMIO
> pgtables before an munmap(), in which case we may not want to untrack the
> pfns if we're only tearing down the pgtables.  IOW, we may want to keep the
> pfn tracking information as those pfn mappings can be restored later with
> the same vma object.  Currently it's not an immediate problem for VFIO, as
> VFIO uses UC- by default, but it looks like there's plan to extend that in
> the near future.
> 
> IIUC, this was overlooked when zap_page_range_single() was introduced,
> while in the past it was only used in the munmap() path which wants to
> always unmap the region completely.  E.g., commit f5cc4eef9987 ("VM: make
> zap_page_range() callers that act on a single VMA use separate helper") is
> the initial commit that introduced unmap_single_vma(), in which the chunk
> of untrack_pfn() was moved over from unmap_vmas().
> 
> Recover that behavior to untrack pfnmap only when unmap regions.
> 
> [1] https://lore.kernel.org/r/20240523223745.395337-1-peterx@redhat.com
> [2] https://groups.google.com/g/syzkaller-bugs/c/FeQZvSbqWbQ/m/tHFmoZthAAAJ
> [3] https://lore.kernel.org/r/20240712131931.20207-1-00107082@163.com
> [4] https://lore.kernel.org/all/20240710-bug12-v1-1-0e5440f9b8d3@gmail.com/
> [5] https://lore.kernel.org/r/20240523195629.218043-1-alex.williamson@redhat.com
> 
> ...
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> 
> NOTE: I massaged the commit message comparing to the rfc post [1], the
> patch itself is untouched.  Also removed rfc tag, and added more people
> into the loop. Please kindly help test this patch if you have a reproducer,
> as I can't reproduce it myself even with the syzbot reproducer on top of
> mm-unstable.  Instead of further check on the reproducer, I decided to send
> this out first as we have a bunch of reproducers on the list now..
> ---
>  mm/memory.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 4bcd79619574..f57cc304b318 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1827,9 +1827,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>  	if (vma->vm_file)
>  		uprobe_munmap(vma, start, end);
>  
> -	if (unlikely(vma->vm_flags & VM_PFNMAP))
> -		untrack_pfn(vma, 0, 0, mm_wr_locked);
> -
>  	if (start != end) {
>  		if (unlikely(is_vm_hugetlb_page(vma))) {
>  			/*
> @@ -1894,6 +1891,8 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
>  		unsigned long start = start_addr;
>  		unsigned long end = end_addr;
>  		hugetlb_zap_begin(vma, &start, &end);
> +		if (unlikely(vma->vm_flags & VM_PFNMAP))
> +			untrack_pfn(vma, 0, 0, mm_wr_locked);
>  		unmap_single_vma(tlb, vma, start, end, &details,
>  				 mm_wr_locked);
>  		hugetlb_zap_end(vma, &details);
> -- 
> 2.45.0


...Trying to follow this discussion across several threads and bug
reports.   I was looped in when syzbot found that the [4] fix was a
deadlock.

How are we reaching unmap_vmas() without the mmap lock held in any mode?
We must be holding the read or write lock - otherwise the vma pointer is
unsafe...?

In any case, since this will just keep calling unmap_single_vma() it has
to be an incomplete fix?

Thanks,
Liam
Yan Zhao July 22, 2024, 6:43 a.m. UTC | #26
On Fri, Jul 19, 2024 at 10:28:09AM +0200, David Hildenbrand wrote:
> On 19.07.24 01:18, Yan Zhao wrote:
> > ...
> 
> Do we really have to support such changing PFNs in a VMA? Are there existing
> use cases that would rely on that?
I don't know. But it's not disallowed.

> Would it be a problem if we merged the mmap+track_pfn_remap, such that
> such a special VMA can only ever belong to a single, fixed PFN range, but
> PFNs can be faulted in lazily?
If we allow a fixed PFN range to be faulted in lazily to different physical
PFNs, it looks like we can't rely on track_pfn_remap() in mmap() and
untrack_pfn() in unmap_vmas() to reserve/untrack memtypes.
Yan Zhao July 22, 2024, 6:49 a.m. UTC | #27
On Fri, Jul 19, 2024 at 10:13:33AM -0400, Peter Xu wrote:
> On Fri, Jul 19, 2024 at 10:28:09AM +0200, David Hildenbrand wrote:
> > On 19.07.24 01:18, Yan Zhao wrote:
> > > ...
> > 
> > Do we really have to support such changing PFNs in a VMA? Are there existing
> > use cases that would rely on that?
> 
> I share the same question as David.  I don't think we support that, and I
> don't know whether we should, either.
> 
> Such flexibility will already break with the current PAT design.  See:
Previously, with remap_pfn_range() being callable in fault handlers, this
flexibility was doable, i.e. reserve in the fault handler and untrack in
unmap_single_vma().

> 
> untrack_pfn:
> 	if (!paddr && !size) {
> 		if (get_pat_info(vma, &paddr, NULL))
> 			return;
> 		size = vma->vm_end - vma->vm_start;
> 	}
> 	free_pfn_range(paddr, size);  <---- assumes PFNs to be continuous
> 
> So untrack_pfn() already assumes the pfns are continuous.  I think that
> means pfns cannot be randomly faulted in, but must be determined at mmap()
> time.
Hmm, in the hypothetical scenario, (vma->vm_end - vma->vm_start) == PAGE_SIZE.
David Hildenbrand July 22, 2024, 9:17 a.m. UTC | #28
On 22.07.24 08:43, Yan Zhao wrote:
> On Fri, Jul 19, 2024 at 10:28:09AM +0200, David Hildenbrand wrote:
>> On 19.07.24 01:18, Yan Zhao wrote:
>>> ...
>>
>> Do we really have to support such changing PFNs in a VMA? Are there existing
>> use cases that would rely on that?
> I don't know. But it's not disallowed.

Okay, but we don't know of any such use case.

What we do have is a single VMA, whereby within that VMA we place 
various different PFN ranges. (randomly looking at 
drivers/video/fbdev/smscufx.c)

These wouldn't have triggered VM_PAT code.
Peter Xu July 22, 2024, 1:52 p.m. UTC | #29
On Mon, Jul 22, 2024 at 02:49:12PM +0800, Yan Zhao wrote:
> On Fri, Jul 19, 2024 at 10:13:33AM -0400, Peter Xu wrote:
> > On Fri, Jul 19, 2024 at 10:28:09AM +0200, David Hildenbrand wrote:
> > > On 19.07.24 01:18, Yan Zhao wrote:
> > > > ...
> > > 
> > > Do we really have to support such changing PFNs in a VMA? Are there existing
> > > use cases that would rely on that?
> > 
> > I share the same question as David.  I don't think we support that, and I
> > don't know whether we should, either.
> > 
> > Such flexibility will already break with the current PAT design.  See:
> Previously, with remap_pfn_range() being callable in fault handlers, this
> flexibility was doable, i.e. reserve in the fault handler and untrack in
> unmap_single_vma().

AFAICT, remap_pfn_range() should never be allowed to be called in a fault
handler..  So IMO it's not that "it was allowed before", but that we did it
wrong from when we first used it in the fault path: remap_pfn_range() has
changed VMA flags since day one, and that requires the mmap lock held for
write, while fault paths only hold it for read..

I think it's just that the per-vma lock was added a few years ago (along
with some lock assertions on the vma lock vs. vma flag changes), and only
then did we notice this issue.
Peter Xu July 22, 2024, 3:15 p.m. UTC | #30
On Fri, Jul 19, 2024 at 10:18:12PM -0400, Liam R. Howlett wrote:
> * Peter Xu <peterx@redhat.com> [240712 10:43]:
> > ...
> 
> 
> ...Trying to follow this discussion across several threads and bug
> reports.   I was looped in when syzbot found that the [4] fix was a
> deadlock.
> 
> How are we reaching unmap_vmas() without the mmap lock held in any mode?
> We must be holding the read or write lock - otherwise the vma pointer is
> unsafe...?

The report was not calling unmap_vmas() but unmap_single_vma(), and this
patch proposes to move the untrack operation to unmap_vmas().  We should
always hold the write lock for unmap_vmas(), AFAIU.

> 
> In any case, since this will just keep calling unmap_single_vma() it has
> to be an incomplete fix?

I think there's indeed some issue to settle besides this patch, however I
don't immediately see why this patch is incomplete with regard to this
specific "untrack pfn within unmap_single_vma()" problem.  I thought it was
complete in that regard; could you elaborate otherwise?

For example, I think it's pretty common to use unmap_single_vma() in a
truncation path.

Thanks,
Liam R. Howlett July 22, 2024, 8:22 p.m. UTC | #31
* Peter Xu <peterx@redhat.com> [240722 11:15]:
> On Fri, Jul 19, 2024 at 10:18:12PM -0400, Liam R. Howlett wrote:
> > * Peter Xu <peterx@redhat.com> [240712 10:43]:
> > > ...
> > 
> > 
> > ...Trying to follow this discussion across several threads and bug
> > reports.   I was looped in when syzbot found that the [4] fix was a
> > deadlock.
> > 
> > How are we reaching unmap_vmas() without the mmap lock held in any mode?
> > We must be holding the read or write lock - otherwise the vma pointer is
> > unsafe...?
> 
> The report was not calling unmap_vmas() but unmap_single_vma(), and this
> patch proposes to move the untrack operation to unmap_vmas().  We should
> always hold the write lock for unmap_vmas(), AFAIU.

unmap_single_vma() also takes a vma pointer.  It is in both [2] and [3].

> 
> > 
> > In any case, since this will just keep calling unmap_single_vma() it has
> > to be an incomplete fix?
> 
> I think there's indeed some issue to settle besides this patch, however I
> didn't quickly get why this patch is incomplete from this specific "untrack
> pfn within unmap_single_vma()" problem.  I thought it was complete from
> that regard, or could you elaborate otherwise?
> 

The problem report from [2] and [3] is that we are getting to a call
path that includes unmap_single_vma() without the mmap lock.  This patch
fails to address that issue; it only takes the caller with the assert
out of the call path.

Removing the function with the lock check doesn't fix the locking issue.
If there is no locking issue here, please state the case in the commit
log as you feel it is safe to use a vma pointer without the mmap_lock
held.

Thanks,
Liam
Peter Xu July 22, 2024, 9:17 p.m. UTC | #32
On Mon, Jul 22, 2024 at 04:22:45PM -0400, Liam R. Howlett wrote:
> The problem report from [2] and [3] is that we are getting to a call
> path that includes unmap_single_vma() without the mmap lock.  This patch
> fails to address that issue, it only takes the caller with the assert
> out of the call path.
> 
> Removing the function with the lock check doesn't fix the locking issue.
> If there is no locking issue here, please state the case in the commit
> log as you feel it is safe to use a vma pointer without the mmap_lock
> held.

Could you please state why there's a locking issue, and why this patch (of
an x86 PAT-specific issue...) would need any documentation on that?

I thought it was pretty common that file truncation (or anything similar)
does a file rmap walk over the vmas mapping this file, and the vmas need to
be kept alive during the rmap walk, no?
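
Roughly, the path I have in mind is:

  unmap_mapping_range(mapping, ...)        <- truncation and friends
    i_mmap_lock_read(mapping)              <- rmap lock keeps the vmas alive
    vma_interval_tree_foreach(vma, &mapping->i_mmap, ...)
      unmap_mapping_range_vma()
        zap_page_range_single()
          unmap_single_vma()               <- reached without the mmap lock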

Thanks,
David Hildenbrand July 23, 2024, 10:12 a.m. UTC | #33
On 22.07.24 23:17, Peter Xu wrote:
> On Mon, Jul 22, 2024 at 04:22:45PM -0400, Liam R. Howlett wrote:
>> ...
> 
> Could you please state why there's a locking issue, and why this patch (of
> an x86 PAT-specific issue...) would need any documentation on that?
> 
> I thought it was pretty common that file truncation (or anything similar)
> does a file rmap walk over the vmas mapping this file, and the vmas need to
> be kept alive during the rmap walk, no?

Right, I was also assuming that the rmap locking (from where we obtained 
that VMA -- rmap interval tree) makes sure that using the VMA is safe.
Liam R. Howlett July 23, 2024, 5:58 p.m. UTC | #34
* David Hildenbrand <david@redhat.com> [240723 06:12]:
> On 22.07.24 23:17, Peter Xu wrote:
> > On Mon, Jul 22, 2024 at 04:22:45PM -0400, Liam R. Howlett wrote:
> > > ...
> 
> Right, I was also assuming that the rmap locking (from where we obtained
> that VMA -- rmap interval tree) makes sure that using the VMA is safe.
> 

Ah, that's what I was missing.

Might be worth adding to the change log.

Thanks,
Liam
Peter Xu July 23, 2024, 8:27 p.m. UTC | #35
On Mon, Jul 22, 2024 at 11:17:57AM +0200, David Hildenbrand wrote:
> What we do have is a single VMA, whereby within that VMA we place various
> different PFN ranges. (randomly looking at drivers/video/fbdev/smscufx.c)
> 
> These wouldn't have triggered VM_PAT code.

Right, it looks like VM_PAT only applies to a whole-vma mapping, even
though I don't know why it was designed that way..

I wish vma->vm_pgoff stored the base PFN for VM_SHARED too: right now it
only works like that for CoW mappings in remap_pfn_range_notrack(), which
means VM_SHARED users of remap_pfn_range() can reuse vm_pgoff for their own
purposes, and I think VFIO does reuse it at least..

I am a bit confused about why Linux made that different for VM_SHARED,
probably since b3b9c2932c32 ("mm, x86, pat: rework linear pfn-mmap
tracking").  I wish vm_pgoff were always used for internal maintenance
(even for VM_SHARED), so this issue would be easier to solve.

Maybe we can still re-define vm_pgoff for VM_SHARED pfnmaps? The caller
should always be able to encode information in vm_private_data anyway.
But I think that might break OOT users..
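
For reference, the CoW special case I mean is roughly this, from
remap_pfn_range_notrack() in mm/memory.c:

	if (is_cow_mapping(vma->vm_flags)) {
		if (addr != vma->vm_start || end != vma->vm_end)
			return -EINVAL;
		vma->vm_pgoff = pfn;	/* base PFN stashed in vm_pgoff */
	}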
David Hildenbrand July 23, 2024, 9:36 p.m. UTC | #36
On 23.07.24 22:27, Peter Xu wrote:
> On Mon, Jul 22, 2024 at 11:17:57AM +0200, David Hildenbrand wrote:
> ...
> 
> Maybe we can still re-define vm_pgoff for VM_SHARED pfnmaps? The caller
> should always be able to encode information in vm_private_data anyway.
> But I think that might break OOT users..

Yeah, I played with this idea when trying to remove the page table walk by
storing the information in the VMA, avoiding consuming more memory.  The
vm_pgoff usage for VM_PFNMAP mappings that I could spot looked nasty enough
for me to not dare trying.

I wonder if we could just let relevant users do the PAT handling 
manually: I'm also not sure how many remap_pfn_range users end up 
triggering VM_PAT code although they don't really have to (just because 
they happen to cover a full VMA)?

One nasty thing is fork(): I was wondering if relevant users really rely
on that, or if we could force these VMAs to simply not get copied during
fork.  During fork() we have to "duplicate" the reservation ...

No real solution yet :/ Getting rid of the page table walk is possible (I
can try digging up the patch I was able to come up with), but it feels
like the wrong approach.

The owner of the VMA should have that information ...
Jason Gunthorpe July 23, 2024, 9:44 p.m. UTC | #37
On Tue, Jul 23, 2024 at 11:36:51PM +0200, David Hildenbrand wrote:
> I wonder if we could just let relevant users do the PAT handling manually:
> I'm also not sure how many remap_pfn_range users end up triggering VM_PAT
> code although they don't really have to (just because they happen to cover a
> full VMA)?
> 
> One nasty thing is fork(), I was wondering if relevant users really rely on
> that or if we could force these VMAs to simply not get copied during fork.
> During fork() we have to "duplicate" the reservation ...

I admit I barely understand what x86 uses this PAT stuff for -
allowing WC mappings is part of it?

If yes, then RDMA would expect WC PFNMAP VMAs to copy their WCness
during fork..

Jason
David Hildenbrand July 24, 2024, 8:53 a.m. UTC | #38
On 23.07.24 23:44, Jason Gunthorpe wrote:
> On Tue, Jul 23, 2024 at 11:36:51PM +0200, David Hildenbrand wrote:
>> I wonder if we could just let relevant users do the PAT handling manually:
>> I'm also not sure how many remap_pfn_range users end up triggering VM_PAT
>> code although they don't really have to (just because they happen to cover a
>> full VMA)?
>>
>> One nasty thing is fork(), I was wondering if relevant users really rely on
>> that or if we could force these VMAs to simply not get copied during fork.
>> During fork() we have to "duplicate" the reservation ...
> 
> I admit I barely understand what x86 uses this PAT stuff for -
> allowing WC mappings is part of it?

I'm confused by all of that, but I think yes :)

> 
> If yes, then RDMA would expect WC PFN MAP VMAs to copy their WCness
> during fork..

Okay, that answers that question: we have to support fork().

During fork() we call vm_ops->open when cloning the VMA, so the backend 
can detect when all users (VMAs, whatever) are gone and undo any 
registration.  Maybe that's a path forward.  But still no clue on the 
details ... one important step might be figuring out who exactly relies 
on that VM_PAT handling.
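
As an untested sketch of that direction (the my_dev_*() names are
hypothetical, and it assumes the driver has some way to undo its own
reservation -- today's memtype_reserve()/memtype_free() pair is
x86-internal):

	static void my_dev_vm_open(struct vm_area_struct *vma)
	{
		struct my_dev_region *r = vma->vm_private_data;

		/* fork() cloned the vma: count the new user */
		refcount_inc(&r->users);
	}

	static void my_dev_vm_close(struct vm_area_struct *vma)
	{
		struct my_dev_region *r = vma->vm_private_data;

		/* last vma gone: undo the manual memtype reservation */
		if (refcount_dec_and_test(&r->users))
			my_dev_free_memtype(r);
	}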
diff mbox series

Patch

diff --git a/mm/memory.c b/mm/memory.c
index 4bcd79619574..f57cc304b318 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1827,9 +1827,6 @@  static void unmap_single_vma(struct mmu_gather *tlb,
 	if (vma->vm_file)
 		uprobe_munmap(vma, start, end);
 
-	if (unlikely(vma->vm_flags & VM_PFNMAP))
-		untrack_pfn(vma, 0, 0, mm_wr_locked);
-
 	if (start != end) {
 		if (unlikely(is_vm_hugetlb_page(vma))) {
 			/*
@@ -1894,6 +1891,8 @@  void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
 		unsigned long start = start_addr;
 		unsigned long end = end_addr;
 		hugetlb_zap_begin(vma, &start, &end);
+		if (unlikely(vma->vm_flags & VM_PFNMAP))
+			untrack_pfn(vma, 0, 0, mm_wr_locked);
 		unmap_single_vma(tlb, vma, start, end, &details,
 				 mm_wr_locked);
 		hugetlb_zap_end(vma, &details);