Message ID | 20240312181118.318701-1-david@redhat.com (mailing list archive)
---|---
State | New
Series | [v1] x86/mm/pat: fix VM_PAT handling in COW mappings
On Tue, Mar 12, 2024 at 07:11:18PM +0100, David Hildenbrand wrote:
> PAT handling won't do the right thing in COW mappings: the first PTE
> (or, in fact, all PTEs) can be replaced during write faults to point at
> anon folios. Reliably recovering the correct PFN and cachemode using
> follow_phys() from PTEs will not work in COW mappings.

I guess the first question is: Why do we want to support COW mappings
of VM_PAT areas? What breaks if we just disallow it?
On 12.03.24 20:38, David Hildenbrand wrote:
> On 12.03.24 20:22, Matthew Wilcox wrote:
>> On Tue, Mar 12, 2024 at 07:11:18PM +0100, David Hildenbrand wrote:
>>> PAT handling won't do the right thing in COW mappings: the first PTE
>>> (or, in fact, all PTEs) can be replaced during write faults to point at
>>> anon folios. Reliably recovering the correct PFN and cachemode using
>>> follow_phys() from PTEs will not work in COW mappings.
>>
>> I guess the first question is: Why do we want to support COW mappings
>> of VM_PAT areas? What breaks if we just disallow it?
>
> Well, that was my first approach. Then I decided to be less radical (IOW
> make my life easier by breaking less user space) and "fix it" with
> minimal effort.
>
> Breaking some weird user space is possible, although I believe for most
> such mappings MAP_PRIVATE doesn't make too much sense.
>
> Nasty COW support for VM_PFNMAP mappings dates back forever. So does PAT
> support.
>
> I can try digging through some possible user space users tomorrow.

As discussed, MAP_PRIVATE doesn't make too much sense for most PFNMAP
mappings.

However, /dev/mem and /proc/vmcore are still used with MAP_PRIVATE in
some cases.

Side note: /proc/vmcore is a bit weird: mmap_vmcore() sets VM_MIXEDMAP,
and then we might call remap_pfn_range(), which sets VM_PFNMAP. I'm not
so sure that's what we want to happen ...

As far as I can see, makedumpfile always mmap's the memory to be dumped
(/dev/mem, /proc/vmcore) using PROT_READ+MAP_PRIVATE, resulting in a COW
mapping.

In my opinion, we should use this fairly simple fix to keep it working
for now and look into disabling any MAP_PRIVATE of VM_PFNMAP separately,
for all architectures.

But I'll leave the decision to the x86 maintainers.
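[To make the mapping pattern concrete: a minimal sketch, not
makedumpfile's actual code, of a PROT_READ+MAP_PRIVATE mapping of
/proc/vmcore. The file only exists in a kdump kernel and the fixed 4 KiB
size is a placeholder. Note that PROT_READ alone does not avoid the COW
classification, since MAP_PRIVATE leaves VM_MAYWRITE set unless the
driver clears it; as the correction in the next message notes,
mmap_vmcore() does clear it.]

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/proc/vmcore", O_RDONLY);
	char *map;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/*
	 * PROT_READ + MAP_PRIVATE: the VMA is private but VM_MAYWRITE
	 * remains set (unless the driver clears it), so the kernel
	 * treats this as a COW mapping even if we never write to it.
	 */
	map = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	printf("first byte: %d\n", map[0]);
	munmap(map, 4096);
	close(fd);
	return 0;
}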
On 14.03.24 17:42, David Hildenbrand wrote:
> On 12.03.24 20:38, David Hildenbrand wrote:
>> [...]
>
> As discussed, MAP_PRIVATE doesn't make too much sense for most PFNMAP
> mappings.
>
> However, /dev/mem and /proc/vmcore are still used with MAP_PRIVATE in
> some cases.
>
> Side note: /proc/vmcore is a bit weird: mmap_vmcore() sets VM_MIXEDMAP,
> and then we might call remap_pfn_range(), which sets VM_PFNMAP. I'm not
> so sure that's what we want to happen ...

Correction: at least mmap_vmcore() ends up clearing VM_MAYWRITE, so no
COW mapping. We could do the same to at least keep PROT_READ|MAP_PRIVATE
working. If user space specified PROT_WRITE for whatever reason, it's
not that easy.
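[For reference, the COW test being applied throughout this thread, as
the is_cow_mapping() helper stood in include/linux/mm.h around the time
of this discussion. A mapping counts as COW when it is private
(VM_SHARED clear) but may still become writable (VM_MAYWRITE set),
which is why clearing VM_MAYWRITE in mmap_vmcore() takes that mapping
out of the problematic category.]

static inline bool is_cow_mapping(vm_flags_t flags)
{
	/*
	 * Private (no VM_SHARED) but potentially writable (VM_MAYWRITE):
	 * write faults install anon folios, i.e. copy-on-write.
	 */
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}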
On 2024/3/15 0:42, David Hildenbrand wrote:
> [...]
>
> In my opinion, we should use this fairly simple fix to keep it working
> for now and look into disabling any MAP_PRIVATE of VM_PFNMAP
> separately, for all architectures.
>
> But I'll leave the decision to the x86 maintainers.

Hi, x86 maintainers: kindly ping.
* David Hildenbrand <david@redhat.com> wrote:

> On 12.03.24 20:22, Matthew Wilcox wrote:
> > On Tue, Mar 12, 2024 at 07:11:18PM +0100, David Hildenbrand wrote:
> > > PAT handling won't do the right thing in COW mappings: the first PTE
> > > (or, in fact, all PTEs) can be replaced during write faults to point at
> > > anon folios. Reliably recovering the correct PFN and cachemode using
> > > follow_phys() from PTEs will not work in COW mappings.
> >
> > I guess the first question is: Why do we want to support COW mappings
> > of VM_PAT areas? What breaks if we just disallow it?
>
> Well, that was my first approach. Then I decided to be less radical (IOW
> make my life easier by breaking less user space) and "fix it" with
> minimal effort.
>
> Breaking some weird user space is possible, although I believe for most
> such mappings MAP_PRIVATE doesn't make too much sense.
>
> Nasty COW support for VM_PFNMAP mappings dates back forever. So does PAT
> support.
>
> I can try digging through some possible user space users tomorrow.

I'd much prefer restricting VM_PAT areas than expanding support. Could
we try the trivial restriction approach first, and only go with your
original patch if that fails?

Thanks,

	Ingo
* David Hildenbrand <david@redhat.com> wrote:

> On 26.03.24 09:33, Ingo Molnar wrote:
> > [...]
> >
> > I'd much prefer restricting VM_PAT areas than expanding support.
>
> Note that we're not expanding support, we're fixing what used to be
> possible before but mostly broke silently.

Yeah - that's de-facto expanding support. :-)

> But I agree that we should rather remove these corner cases instead of
> fixing them.

Yeah, especially if no code is hitting it intentionally.

> > Could we try the trivial restriction approach first, and only go with
> > your original patch if that fails?
>
> Which version would you prefer? I had two alternatives (excluding comment
> changes, white-space expected to be broken).
>
> 1) Disallow when we would have set VM_PAT on is_cow_mapping()
>
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index 0d72183b5dd0..6979912b1a5d 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -994,6 +994,9 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>  	    && size == (vma->vm_end - vma->vm_start))) {
>  		int ret;
>  
> +		if (is_cow_mapping(vma->vm_flags))
> +			return -EINVAL;
> +
>  		ret = reserve_pfn_range(paddr, size, prot, 0);
>  		if (ret == 0 && vma)
>  			vm_flags_set(vma, VM_PAT);
>
> 2) Fallback to !VM_PAT
>
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index 0d72183b5dd0..8e97156c9be8 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -990,8 +990,8 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>  	enum page_cache_mode pcm;
>  
>  	/* reserve the whole chunk starting from paddr */
> -	if (!vma || (addr == vma->vm_start
> -		     && size == (vma->vm_end - vma->vm_start))) {
> +	if (!vma || (!is_cow_mapping(vma->vm_flags) && addr == vma->vm_start &&
> +		     size == (vma->vm_end - vma->vm_start))) {
>  		int ret;
>  
>  		ret = reserve_pfn_range(paddr, size, prot, 0);
>
> Personally, I'd go for 2).

So what's the advantage of #2? This is clearly something the user didn't
really intend or think about much. Isn't explicitly failing that mapping
a better option than silently downgrading it to !VM_PAT?

(If I'm reading it right ...)

Thanks,

	Ingo
>>> try the trivial restriction approach first, and only go with your
>>> original patch if that fails?
>>
>> Which version would you prefer? I had two alternatives (excluding
>> comment changes, white-space expected to be broken).
>>
>> 1) Disallow when we would have set VM_PAT on is_cow_mapping()
>>
>> [...]
>>
>> 2) Fallback to !VM_PAT
>>
>> [...]
>>
>> Personally, I'd go for 2).
>
> So what's the advantage of #2? This is clearly something the user didn't
> really intend or think about much. Isn't explicitly failing that mapping
> a better option than silently downgrading it to !VM_PAT?
>
> (If I'm reading it right ...)

I think a simple mmap(MAP_PRIVATE) of /dev/mem will unconditionally fail
with 1), while it keeps on working for 2).

Note that I think we currently set VM_PAT on each and every system if
remap_pfn_range() will cover the whole VMA, even if PAT is not actually
enabled.

It's all a bit of a mess TBH, but I got my hands dirty enough on that.

So 1) can be rather destructive ... 2) at least somehow keeps it working.

For that reason I went with the current patch, because it's hard to tell
which use case you will end up breaking ... :/
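[To make the trade-off concrete: a minimal sketch, assumed rather than
taken from any real consumer, of the kind of mmap(MAP_PRIVATE) of
/dev/mem under discussion. Whether offset 0 is actually mappable
depends on CONFIG_STRICT_DEVMEM and the platform; only the mmap() flags
matter here. With option 1) this mmap() would start failing outright;
with option 2) it would keep succeeding, just without VM_PAT tracking.]

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/mem", O_RDONLY);
	void *map;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* PROT_READ + MAP_PRIVATE of physical memory: a COW mapping. */
	map = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED)
		perror("mmap");
	else
		munmap(map, 4096);
	close(fd);
	return 0;
}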
* David Hildenbrand <david@redhat.com> wrote:

> > > > try the trivial restriction approach first, and only go with your
> > > > original patch if that fails?
> > >
> > > Which version would you prefer? I had two alternatives (excluding
> > > comment changes, white-space expected to be broken).
> > >
> > > [...]
> > >
> > > Personally, I'd go for 2).
> >
> > So what's the advantage of #2? This is clearly something the user didn't
> > really intend or think about much. Isn't explicitly failing that mapping
> > a better option than silently downgrading it to !VM_PAT?
> >
> > (If I'm reading it right ...)
>
> I think a simple mmap(MAP_PRIVATE) of /dev/mem will unconditionally fail
> with 1), while it keeps on working for 2).
>
> Note that I think we currently set VM_PAT on each and every system if
> remap_pfn_range() will cover the whole VMA, even if PAT is not actually
> enabled.
>
> It's all a bit of a mess TBH, but I got my hands dirty enough on that.
>
> So 1) can be rather destructive ... 2) at least somehow keeps it working.
>
> For that reason I went with the current patch, because it's hard to tell
> which use case you will end up breaking ... :/

Yeah, so I think you make valid observations, i.e. your first patch is
probably the best one.

But since it changes mm/memory.c, I'd like to pass that over to Andrew
and the MM folks.

The x86 bits:

  Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo
On 01.04.24 11:45, Ingo Molnar wrote:
> * David Hildenbrand <david@redhat.com> wrote:
>
>> [...]
>>
>> I think a simple mmap(MAP_PRIVATE) of /dev/mem will unconditionally fail
>> with 1), while it keeps on working for 2).
>>
>> [...]
>>
>> For that reason I went with the current patch, because it's hard to tell
>> which use case you will end up breaking ... :/

Hi,

> Yeah, so I think you make valid observations, i.e. your first patch is
> probably the best one.

Okay, so the original patch, thanks.

> But since it changes mm/memory.c, I'd like to pass that over to Andrew
> and the MM folks.
>
> The x86 bits:
>
>   Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks, there is now a conflict with other stuff that already landed in
mm-unstable that moves follow_phys() to arch/x86/mm/pat/memtype.c.

@Andrew, this here is a fix; how should we best handle that? Likely the
fix should go in first, and the fixup of Christoph's patch should be
easy. Just let me know how you want to handle that.
diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 0904d7e8e1260..82857148dbec5 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -950,6 +950,38 @@ static void free_pfn_range(u64 paddr, unsigned long size)
 	memtype_free(paddr, paddr + size);
 }
 
+static int get_pat_info(struct vm_area_struct *vma, resource_size_t *paddr,
+		pgprot_t *pgprot)
+{
+	unsigned long prot;
+
+	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_PAT));
+
+	/*
+	 * We need the starting PFN and cachemode used for track_pfn_remap()
+	 * that covered the whole VMA. For most mappings, we can obtain that
+	 * information from the page tables. For COW mappings, we might now
+	 * suddenly have anon folios mapped and follow_phys() will fail.
+	 *
+	 * Fallback to using vma->vm_pgoff, see remap_pfn_range_notrack(), to
+	 * detect the PFN. If we need the cachemode as well, we're out of luck
+	 * for now and have to fail fork().
+	 */
+	if (!follow_phys(vma, vma->vm_start, 0, &prot, paddr)) {
+		if (pgprot)
+			*pgprot = __pgprot(prot);
+		return 0;
+	}
+	if (is_cow_mapping(vma->vm_flags)) {
+		if (pgprot)
+			return -EINVAL;
+		*paddr = (resource_size_t)vma->vm_pgoff << PAGE_SHIFT;
+		return 0;
+	}
+	WARN_ON_ONCE(1);
+	return -EINVAL;
+}
+
 /*
  * track_pfn_copy is called when vma that is covering the pfnmap gets
  * copied through copy_page_range().
@@ -960,20 +992,13 @@ static void free_pfn_range(u64 paddr, unsigned long size)
 int track_pfn_copy(struct vm_area_struct *vma)
 {
 	resource_size_t paddr;
-	unsigned long prot;
 	unsigned long vma_size = vma->vm_end - vma->vm_start;
 	pgprot_t pgprot;
 
 	if (vma->vm_flags & VM_PAT) {
-		/*
-		 * reserve the whole chunk covered by vma. We need the
-		 * starting address and protection from pte.
-		 */
-		if (follow_phys(vma, vma->vm_start, 0, &prot, &paddr)) {
-			WARN_ON_ONCE(1);
+		if (get_pat_info(vma, &paddr, &pgprot))
 			return -EINVAL;
-		}
-		pgprot = __pgprot(prot);
+		/* reserve the whole chunk covered by vma. */
 		return reserve_pfn_range(paddr, vma_size, &pgprot, 1);
 	}
 
@@ -1048,7 +1073,6 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
 		 unsigned long size, bool mm_wr_locked)
 {
 	resource_size_t paddr;
-	unsigned long prot;
 
 	if (vma && !(vma->vm_flags & VM_PAT))
 		return;
@@ -1056,11 +1080,8 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
 	/* free the chunk starting from pfn or the whole chunk */
 	paddr = (resource_size_t)pfn << PAGE_SHIFT;
 	if (!paddr && !size) {
-		if (follow_phys(vma, vma->vm_start, 0, &prot, &paddr)) {
-			WARN_ON_ONCE(1);
+		if (get_pat_info(vma, &paddr, NULL))
 			return;
-		}
-
 		size = vma->vm_end - vma->vm_start;
 	}
 	free_pfn_range(paddr, size);
diff --git a/mm/memory.c b/mm/memory.c
index f2bc6dd15eb83..bd3336fa5b66a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5971,6 +5971,10 @@ int follow_phys(struct vm_area_struct *vma,
 		goto out;
 	pte = ptep_get(ptep);
 
+	/* Never return PFNs of anon folios in COW mappings. */
+	if (vm_normal_folio(vma, address, pte))
+		goto unlock;
+
 	if ((flags & FOLL_WRITE) && !pte_write(pte))
 		goto unlock;
PAT handling won't do the right thing in COW mappings: the first PTE
(or, in fact, all PTEs) can be replaced during write faults to point at
anon folios. Reliably recovering the correct PFN and cachemode using
follow_phys() from PTEs will not work in COW mappings.

Using follow_phys(), we might just get the address+protection of the
anon folio (which is very wrong), or fail on swap/nonswap entries,
failing follow_phys() and triggering a WARN_ON_ONCE() in untrack_pfn()
and track_pfn_copy(), not properly calling free_pfn_range().

In free_pfn_range(), we either wouldn't call memtype_free() or would
call it with the wrong range, possibly leaking memory.

To fix that, let's update follow_phys() to refuse returning anon
folios, and fall back to using the stored PFN inside vma->vm_pgoff for
COW mappings if we run into that.

We will now properly handle untrack_pfn() with COW mappings, where we
don't need the cachemode. We'll have to fail fork()->track_pfn_copy()
if the first page was replaced by an anon folio, though: we'd have to
store the cachemode in the VMA to make this work, likely growing the
VMA size.

For now, let's keep it simple and let track_pfn_copy() just fail in
that case: it would have failed in the past with swap/nonswap entries
already, and it would have done the wrong thing with anon folios.

Simple reproducer to trigger the WARN_ON_ONCE() in untrack_pfn():

<--- C reproducer --->
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
#include <liburing.h>

int main(void)
{
	struct io_uring_params p = {};
	int ring_fd;
	size_t size;
	char *map;

	ring_fd = io_uring_setup(1, &p);
	if (ring_fd < 0) {
		perror("io_uring_setup");
		return 1;
	}
	size = p.sq_off.array + p.sq_entries * sizeof(unsigned);

	/* Map the submission queue ring MAP_PRIVATE */
	map = mmap(0, size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
		   ring_fd, IORING_OFF_SQ_RING);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* We have at least one page. Let's COW it. */
	*map = 0;
	pause();
	return 0;
}
<--- C reproducer --->

On a system with 16 GiB RAM and swap configured:
 # ./iouring &
 # memhog 16G
 # killall iouring

[  301.552930] ------------[ cut here ]------------
[  301.553285] WARNING: CPU: 7 PID: 1402 at arch/x86/mm/pat/memtype.c:1060 untrack_pfn+0xf4/0x100
[  301.553989] Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_g
[  301.558232] CPU: 7 PID: 1402 Comm: iouring Not tainted 6.7.5-100.fc38.x86_64 #1
[  301.558772] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebu4
[  301.559569] RIP: 0010:untrack_pfn+0xf4/0x100
[  301.559893] Code: 75 c4 eb cf 48 8b 43 10 8b a8 e8 00 00 00 3b 6b 28 74 b8 48 8b 7b 30 e8 ea 1a f7 000
[  301.561189] RSP: 0018:ffffba2c0377fab8 EFLAGS: 00010282
[  301.561590] RAX: 00000000ffffffea RBX: ffff9208c8ce9cc0 RCX: 000000010455e047
[  301.562105] RDX: 07fffffff0eb1e0a RSI: 0000000000000000 RDI: ffff9208c391d200
[  301.562628] RBP: 0000000000000000 R08: ffffba2c0377fab8 R09: 0000000000000000
[  301.563145] R10: ffff9208d2292d50 R11: 0000000000000002 R12: 00007fea890e0000
[  301.563669] R13: 0000000000000000 R14: ffffba2c0377fc08 R15: 0000000000000000
[  301.564186] FS:  0000000000000000(0000) GS:ffff920c2fbc0000(0000) knlGS:0000000000000000
[  301.564773] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  301.565197] CR2: 00007fea88ee8a20 CR3: 00000001033a8000 CR4: 0000000000750ef0
[  301.565725] PKRU: 55555554
[  301.565944] Call Trace:
[  301.566148]  <TASK>
[  301.566325]  ? untrack_pfn+0xf4/0x100
[  301.566618]  ? __warn+0x81/0x130
[  301.566876]  ? untrack_pfn+0xf4/0x100
[  301.567163]  ? report_bug+0x171/0x1a0
[  301.567466]  ? handle_bug+0x3c/0x80
[  301.567743]  ? exc_invalid_op+0x17/0x70
[  301.568038]  ? asm_exc_invalid_op+0x1a/0x20
[  301.568363]  ? untrack_pfn+0xf4/0x100
[  301.568660]  ? untrack_pfn+0x65/0x100
[  301.568947]  unmap_single_vma+0xa6/0xe0
[  301.569247]  unmap_vmas+0xb5/0x190
[  301.569532]  exit_mmap+0xec/0x340
[  301.569801]  __mmput+0x3e/0x130
[  301.570051]  do_exit+0x305/0xaf0
...

Reported-by: Wupeng Ma <mawupeng1@huawei.com>
Closes: https://lkml.kernel.org/r/20240227122814.3781907-1-mawupeng1@huawei.com
Fixes: b1a86e15dc03 ("x86, pat: remove the dependency on 'vm_pgoff' in track/untrack pfn vma routines")
Fixes: 5899329b1910 ("x86: PAT: implement track/untrack of pfnmap regions for x86 - v3")
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---

On top of mm/mm-unstable.

I spent way too much time trying to squeeze the cachemode into the VMA
in a clean way. I even went the extra mile and tried to get rid of
follow_phys() completely, storing the cachemode and the PFN in the VMA.
But I didn't quite like the end result and decided to just keep it
simple and not care about improving the already-broken fork() corner
case.

follow_phys() is rather ugly, and I have some patches to improve the
situation. But let's get this fix here out first and discuss if there
is an alternative.

---
 arch/x86/mm/pat/memtype.c | 49 ++++++++++++++++++++++++++++-----------
 mm/memory.c               |  4 ++++
 2 files changed, 39 insertions(+), 14 deletions(-)
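[A companion sketch to the reproducer above, reusing the same io_uring
setup: it demonstrates the one case this patch makes fail by design,
namely fork() after the first page of a VM_PAT COW mapping was replaced
by an anon folio, where track_pfn_copy() can no longer recover the
cachemode. This assumes a PAT-enabled x86 kernel with the fix applied;
the exact errno is not guaranteed. Before the fix, fork() would already
have failed once that page was swapped out.]

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
#include <liburing.h>

int main(void)
{
	struct io_uring_params p = {};
	int ring_fd;
	size_t size;
	char *map;

	ring_fd = io_uring_setup(1, &p);
	if (ring_fd < 0) {
		perror("io_uring_setup");
		return 1;
	}
	size = p.sq_off.array + p.sq_entries * sizeof(unsigned);

	/* MAP_PRIVATE mapping of the SQ ring: a VM_PAT COW mapping. */
	map = mmap(0, size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
		   ring_fd, IORING_OFF_SQ_RING);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* COW the first page; its PTE now points at an anon folio. */
	*map = 0;

	/* track_pfn_copy() cannot recover the cachemode: fork() fails. */
	if (fork() < 0)
		perror("fork");
	return 0;
}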