Message ID | 20220405014836.14077-1-peterx@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | userfaultfd-wp: Support shmem and hugetlbfs | expand |
Looks ok to me now and should work for all architectures. Reviewed-by: Alistair Popple <apopple@nvidia.com> Peter Xu <peterx@redhat.com> writes: > We used to check against none pte in finish_fault(), with the assumption > that the orig_pte is always none pte. > > This change prepares us to be able to call do_fault() on !none ptes. For > example, we should allow that to happen for pte marker so that we can restore > information out of the pte markers. > > Let's change the "pte_none" check into detecting changes since we fetched > orig_pte. One trivial thing to take care of here is, when pmd==NULL for > the pgtable we may not initialize orig_pte at all in handle_pte_fault(). > > By default orig_pte will be all zeros however the problem is not all > architectures are using all-zeros for a none pte. pte_clear() will be the > right thing to use here so that we'll always have a valid orig_pte value > for the whole handle_pte_fault() call. > > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > mm/memory.c | 9 ++++++++- > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/mm/memory.c b/mm/memory.c > index 3f396241a7db..b1af996b09ca 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4241,7 +4241,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf) > vmf->address, &vmf->ptl); > ret = 0; > /* Re-check under ptl */ > - if (likely(pte_none(*vmf->pte))) > + if (likely(pte_same(*vmf->pte, vmf->orig_pte))) > do_set_pte(vmf, page, vmf->address); > else > ret = VM_FAULT_NOPAGE; > @@ -4709,6 +4709,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) > * concurrent faults and from rmap lookups. > */ > vmf->pte = NULL; > + /* > + * Always initialize orig_pte. This matches with below > + * code to have orig_pte to be the none pte if pte==NULL. > + * This makes the rest code to be always safe to reference > + * it, e.g. in finish_fault() we'll detect pte changes. > + */ > + pte_clear(vmf->vma->vm_mm, vmf->address, &vmf->orig_pte); > } else { > /* > * If a huge pmd materialized under us just retry later. Use
On Tue, Apr 12, 2022 at 12:05:41PM +1000, Alistair Popple wrote: > Looks ok to me now and should work for all architectures. > > Reviewed-by: Alistair Popple <apopple@nvidia.com> Thanks again for taking a look. Currently the series is queued in -mm already, but I'll take these R-bs if there'll be a new version.
Hi, On 05.04.2022 03:48, Peter Xu wrote: > We used to check against none pte in finish_fault(), with the assumption > that the orig_pte is always none pte. > > This change prepares us to be able to call do_fault() on !none ptes. For > example, we should allow that to happen for pte marker so that we can restore > information out of the pte markers. > > Let's change the "pte_none" check into detecting changes since we fetched > orig_pte. One trivial thing to take care of here is, when pmd==NULL for > the pgtable we may not initialize orig_pte at all in handle_pte_fault(). > > By default orig_pte will be all zeros however the problem is not all > architectures are using all-zeros for a none pte. pte_clear() will be the > right thing to use here so that we'll always have a valid orig_pte value > for the whole handle_pte_fault() call. > > Signed-off-by: Peter Xu <peterx@redhat.com> This patch landed in today's linux next-202204213 as commit fa6009949163 ("mm: check against orig_pte for finish_fault()"). Unfortunately it causes serious system instability on some ARM 32bit machines. I've observed it on all tested boards (various Samsung Exynos based, Raspberry Pi 3b and 4b, even QEMU's virt 32bit machine) when kernel was compiled from multi_v7_defconfig. Here is a crash log from QEMU's ARM 32bit virt machine: 8<--- cut here --- Unable to handle kernel paging request at virtual address e093263c [e093263c] *pgd=42083811, *pte=00000000, *ppte=00000000 Internal error: Oops: 807 [#1] SMP ARM Modules linked in: CPU: 1 PID: 37 Comm: kworker/u4:0 Not tainted 5.18.0-rc2-00176-gfa6009949163 #11684 Hardware name: Generic DT based system PC is at cpu_ca15_set_pte_ext+0x4c/0x58 LR is at handle_mm_fault+0x46c/0xbb0 pc : [<c031bdec>] lr : [<c0478144>] psr: 40000013 sp : e0931df8 ip : e0931e54 fp : c26a8000 r10: 00000081 r9 : c2230880 r8 : 00000000 r7 : 00000081 r6 : beffffed r5 : c267f000 r4 : c2230880 r3 : 00000000 r2 : 00000000 r1 : 00000040 r0 : e0931e3c Flags: nZcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none Control: 10c5387d Table: 4020406a DAC: 00000051 Register r0 information: 2-page vmalloc region starting at 0xe0930000 allocated at kernel_clone+0x8c/0x3a8 Register r1 information: non-paged memory Register r2 information: NULL pointer Register r3 information: NULL pointer Register r4 information: slab task_struct start c2230880 pointer offset 0 Register r5 information: slab vm_area_struct start c267f000 pointer offset 0 Register r6 information: non-paged memory Register r7 information: non-paged memory Register r8 information: NULL pointer Register r9 information: slab task_struct start c2230880 pointer offset 0 Register r10 information: non-paged memory Register r11 information: slab mm_struct start c26a8000 pointer offset 0 size 168 Register r12 information: 2-page vmalloc region starting at 0xe0930000 allocated at kernel_clone+0x8c/0x3a8 Process kworker/u4:0 (pid: 37, stack limit = 0x(ptrval)) Stack: (0xe0931df8 to 0xe0932000) ... ---[ end trace 0000000000000000 ]--- CAN device driver interface bgmac_bcma: Broadcom 47xx GBit MAC driver loaded e1000e: Intel(R) PRO/1000 Network Driver e1000e: Copyright(c) 1999 - 2015 Intel Corporation. igb: Intel(R) Gigabit Ethernet Network Driver igb: Copyright (c) 2007-2014 Intel Corporation. pegasus: Pegasus/Pegasus II USB Ethernet driver usbcore: registered new interface driver pegasus usbcore: registered new interface driver asix usbcore: registered new interface driver ax88179_178a usbcore: registered new interface driver cdc_ether usbcore: registered new interface driver smsc75xx usbcore: registered new interface driver smsc95xx usbcore: registered new interface driver net1080 usbcore: registered new interface driver cdc_subset usbcore: registered new interface driver zaurus usbcore: registered new interface driver cdc_ncm ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver ehci-pci: EHCI PCI platform driver ehci-platform: EHCI generic platform driver ehci-omap: OMAP-EHCI Host Controller driver ehci-orion: EHCI orion driver SPEAr-ehci: EHCI SPEAr driver ehci-st: EHCI STMicroelectronics driver ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver ohci-pci: OHCI PCI platform driver ohci-platform: OHCI generic platform driver SPEAr-ohci: OHCI SPEAr driver ohci-st: OHCI STMicroelectronics driver usbcore: registered new interface driver usb-storage rtc-pl031 9010000.pl031: registered as rtc0 rtc-pl031 9010000.pl031: setting system clock to 2022-04-13T13:49:19 UTC (1649857759) i2c_dev: i2c /dev entries driver sdhci: Secure Digital Host Controller Interface driver sdhci: Copyright(c) Pierre Ossman Synopsys Designware Multimedia Card Interface Driver sdhci-pltfm: SDHCI platform and OF driver helper ledtrig-cpu: registered to indicate activity on CPUs usbcore: registered new interface driver usbhid usbhid: USB HID core driver NET: Registered PF_INET6 protocol family Segment Routing with IPv6 In-situ OAM (IOAM) with IPv6 sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver NET: Registered PF_PACKET protocol family can: controller area network core NET: Registered PF_CAN protocol family can: raw protocol can: broadcast manager protocol can: netlink gateway - max_hops=1 Key type dns_resolver registered ThumbEE CPU extension supported. Registering SWP/SWPB emulation handler Loading compiled-in X.509 certificates input: gpio-keys as /devices/platform/gpio-keys/input/input0 uart-pl011 9000000.pl011: no DMA platform data EXT4-fs (vda): mounted filesystem with ordered data mode. Quota mode: disabled. VFS: Mounted root (ext4 filesystem) readonly on device 254:0. devtmpfs: mounted Freeing unused kernel image (initmem) memory: 2048K Run /sbin/init as init process with arguments: /sbin/init with environment: HOME=/ TERM=linux 8<--- cut here --- Unable to handle kernel paging request at virtual address e082662c [e082662c] *pgd=42083811, *pte=00000000, *ppte=00000000 Internal error: Oops: 807 [#2] SMP ARM Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Tainted: G D 5.18.0-rc2-00176-gfa6009949163 #11684 Hardware name: Generic DT based system PC is at cpu_ca15_set_pte_ext+0x4c/0x58 LR is at handle_mm_fault+0x46c/0xbb0 pc : [<c031bdec>] lr : [<c0478144>] psr: 40000013 sp : e0825de8 ip : e0825e44 fp : c213e000 r10: 00000081 r9 : c20e0000 r8 : 00000000 r7 : 00000081 r6 : befffff1 r5 : c2695000 r4 : c20e0000 r3 : 00000000 r2 : 00000000 r1 : 00000040 r0 : e0825e2c Flags: nZcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none Control: 10c5387d Table: 4020406a DAC: 00000051 Register r0 information: 2-page vmalloc region starting at 0xe0824000 allocated at kernel_clone+0x8c/0x3a8 Register r1 information: non-paged memory Register r2 information: NULL pointer Register r3 information: NULL pointer Register r4 information: slab task_struct start c20e0000 pointer offset 0 Register r5 information: slab vm_area_struct start c2695000 pointer offset 0 Register r6 information: non-paged memory Register r7 information: non-paged memory Register r8 information: NULL pointer Register r9 information: slab task_struct start c20e0000 pointer offset 0 Register r10 information: non-paged memory Register r11 information: slab mm_struct start c213e000 pointer offset 0 size 168 Register r12 information: 2-page vmalloc region starting at 0xe0824000 allocated at kernel_clone+0x8c/0x3a8 Process swapper/0 (pid: 1, stack limit = 0x(ptrval)) Stack: (0xe0825de8 to 0xe0826000) ... ---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b CPU1: stopping CPU: 1 PID: 0 Comm: swapper/1 Tainted: G D 5.18.0-rc2-00176-gfa6009949163 #11684 Hardware name: Generic DT based system unwind_backtrace from show_stack+0x10/0x14 show_stack from dump_stack_lvl+0x40/0x4c dump_stack_lvl from do_handle_IPI+0x2c4/0x2fc do_handle_IPI from ipi_handler+0x18/0x20 ipi_handler from handle_percpu_devid_irq+0x8c/0x1e0 handle_percpu_devid_irq from generic_handle_domain_irq+0x40/0x84 generic_handle_domain_irq from gic_handle_irq+0x88/0xa8 gic_handle_irq from generic_handle_arch_irq+0x34/0x44 generic_handle_arch_irq from call_with_stack+0x18/0x20 call_with_stack from __irq_svc+0x98/0xb0 Exception stack(0xe0869f50 to 0xe0869f98) 9f40: 00009ddc 00000000 00000001 c031be20 9f60: c20e5d80 c1b48f20 c1904d10 c1904d6c c183e9e8 c1b47971 00000000 00000000 9f80: c1904e24 e0869fa0 c0307b74 c0307b78 60000113 ffffffff __irq_svc from arch_cpu_idle+0x38/0x3c arch_cpu_idle from default_idle_call+0x3c/0xb8 default_idle_call from do_idle+0x1f8/0x298 do_idle from cpu_startup_entry+0x18/0x1c cpu_startup_entry from 0x40301780 ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]--- > --- > mm/memory.c | 9 ++++++++- > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/mm/memory.c b/mm/memory.c > index 3f396241a7db..b1af996b09ca 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4241,7 +4241,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf) > vmf->address, &vmf->ptl); > ret = 0; > /* Re-check under ptl */ > - if (likely(pte_none(*vmf->pte))) > + if (likely(pte_same(*vmf->pte, vmf->orig_pte))) > do_set_pte(vmf, page, vmf->address); > else > ret = VM_FAULT_NOPAGE; > @@ -4709,6 +4709,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) > * concurrent faults and from rmap lookups. > */ > vmf->pte = NULL; > + /* > + * Always initialize orig_pte. This matches with below > + * code to have orig_pte to be the none pte if pte==NULL. > + * This makes the rest code to be always safe to reference > + * it, e.g. in finish_fault() we'll detect pte changes. > + */ > + pte_clear(vmf->vma->vm_mm, vmf->address, &vmf->orig_pte); > } else { > /* > * If a huge pmd materialized under us just retry later. Use Best regards
On Wed, Apr 13, 2022 at 04:03:28PM +0200, Marek Szyprowski wrote: > Hi, Hi, Marek, > > On 05.04.2022 03:48, Peter Xu wrote: > > We used to check against none pte in finish_fault(), with the assumption > > that the orig_pte is always none pte. > > > > This change prepares us to be able to call do_fault() on !none ptes. For > > example, we should allow that to happen for pte marker so that we can restore > > information out of the pte markers. > > > > Let's change the "pte_none" check into detecting changes since we fetched > > orig_pte. One trivial thing to take care of here is, when pmd==NULL for > > the pgtable we may not initialize orig_pte at all in handle_pte_fault(). > > > > By default orig_pte will be all zeros however the problem is not all > > architectures are using all-zeros for a none pte. pte_clear() will be the > > right thing to use here so that we'll always have a valid orig_pte value > > for the whole handle_pte_fault() call. > > > > Signed-off-by: Peter Xu <peterx@redhat.com> > > This patch landed in today's linux next-202204213 as commit fa6009949163 > ("mm: check against orig_pte for finish_fault()"). Unfortunately it > causes serious system instability on some ARM 32bit machines. I've > observed it on all tested boards (various Samsung Exynos based, > Raspberry Pi 3b and 4b, even QEMU's virt 32bit machine) when kernel was > compiled from multi_v7_defconfig. Thanks for the report. > > Here is a crash log from QEMU's ARM 32bit virt machine: > > 8<--- cut here --- > Unable to handle kernel paging request at virtual address e093263c > [e093263c] *pgd=42083811, *pte=00000000, *ppte=00000000 > Internal error: Oops: 807 [#1] SMP ARM > Modules linked in: > CPU: 1 PID: 37 Comm: kworker/u4:0 Not tainted > 5.18.0-rc2-00176-gfa6009949163 #11684 > Hardware name: Generic DT based system > PC is at cpu_ca15_set_pte_ext+0x4c/0x58 > LR is at handle_mm_fault+0x46c/0xbb0 I had a feeling that for some reason the pte_clear() isn't working right there when it's applying to a kernel stack variable for arm32. I'm totally newbie to arm32, so what I'm reading is this: https://people.kernel.org/linusw/arm32-page-tables Especially: https://dflund.se/~triad/images/classic-mmu-page-table.jpg It does match with what I read from arm32's proc-v7-2level.S of it, where from the comment above cpu_v7_set_pte_ext: * - ptep - pointer to level 2 translation table entry * (hardware version is stored at +2048 bytes) <---------- So it seems to me that arm32 needs to store some metadata at offset 0x800 of any pte_t* pointer passed over to pte_clear(), then it must be a real pgtable or it'll write to random places in the kernel, am I correct? Does it mean that all pte_*() operations upon a kernel stack var will be wrong? I thought it could happen easily in the rest of mm too but I didn't yet check much. The fact shows that it's mostly possible the current code just work well with arm32 and no such violation occured yet. That does sound a bit tricky, IMHO. But I don't have an immediate solution to make it less tricky.. though I have a thought of workaround, by simply not calling pte_clear() on the stack var. Would you try the attached patch to replace this problematic patch? So we need to revert commit fa6009949163 and apply the new one. Please let me know whether it'll solve the problem, so far I only compile tested it, but I'll run some more test to make sure the uffd-wp scenarios will be working right with the new version. Thanks,
Hi Peter, On 13.04.2022 18:43, Peter Xu wrote: > On Wed, Apr 13, 2022 at 04:03:28PM +0200, Marek Szyprowski wrote: >> On 05.04.2022 03:48, Peter Xu wrote: >>> We used to check against none pte in finish_fault(), with the assumption >>> that the orig_pte is always none pte. >>> >>> This change prepares us to be able to call do_fault() on !none ptes. For >>> example, we should allow that to happen for pte marker so that we can restore >>> information out of the pte markers. >>> >>> Let's change the "pte_none" check into detecting changes since we fetched >>> orig_pte. One trivial thing to take care of here is, when pmd==NULL for >>> the pgtable we may not initialize orig_pte at all in handle_pte_fault(). >>> >>> By default orig_pte will be all zeros however the problem is not all >>> architectures are using all-zeros for a none pte. pte_clear() will be the >>> right thing to use here so that we'll always have a valid orig_pte value >>> for the whole handle_pte_fault() call. >>> >>> Signed-off-by: Peter Xu <peterx@redhat.com> >> This patch landed in today's linux next-202204213 as commit fa6009949163 >> ("mm: check against orig_pte for finish_fault()"). Unfortunately it >> causes serious system instability on some ARM 32bit machines. I've >> observed it on all tested boards (various Samsung Exynos based, >> Raspberry Pi 3b and 4b, even QEMU's virt 32bit machine) when kernel was >> compiled from multi_v7_defconfig. > Thanks for the report. > >> Here is a crash log from QEMU's ARM 32bit virt machine: >> >> 8<--- cut here --- >> Unable to handle kernel paging request at virtual address e093263c >> [e093263c] *pgd=42083811, *pte=00000000, *ppte=00000000 >> Internal error: Oops: 807 [#1] SMP ARM >> Modules linked in: >> CPU: 1 PID: 37 Comm: kworker/u4:0 Not tainted >> 5.18.0-rc2-00176-gfa6009949163 #11684 >> Hardware name: Generic DT based system >> PC is at cpu_ca15_set_pte_ext+0x4c/0x58 >> LR is at handle_mm_fault+0x46c/0xbb0 > I had a feeling that for some reason the pte_clear() isn't working right > there when it's applying to a kernel stack variable for arm32. I'm totally > newbie to arm32, so what I'm reading is this: > > https://people.kernel.org/linusw/arm32-page-tables > > Especially: > > https://protect2.fireeye.com/v1/url?k=35bc90ac-6a27a9bd-35bd1be3-0cc47a31cdbc-b032cb1d178dc691&q=1&e=c82daefb-c86b-4ca1-8db1-cadbdc124ed2&u=https%3A%2F%2Fdflund.se%2F%7Etriad%2Fimages%2Fclassic-mmu-page-table.jpg > > It does match with what I read from arm32's proc-v7-2level.S of it, where > from the comment above cpu_v7_set_pte_ext: > > * - ptep - pointer to level 2 translation table entry > * (hardware version is stored at +2048 bytes) <---------- > > So it seems to me that arm32 needs to store some metadata at offset 0x800 > of any pte_t* pointer passed over to pte_clear(), then it must be a real > pgtable or it'll write to random places in the kernel, am I correct? > > Does it mean that all pte_*() operations upon a kernel stack var will be > wrong? I thought it could happen easily in the rest of mm too but I didn't > yet check much. The fact shows that it's mostly possible the current code > just work well with arm32 and no such violation occured yet. > > That does sound a bit tricky, IMHO. But I don't have an immediate solution > to make it less tricky.. though I have a thought of workaround, by simply > not calling pte_clear() on the stack var. > > Would you try the attached patch to replace this problematic patch? So we > need to revert commit fa6009949163 and apply the new one. Please let me > know whether it'll solve the problem, so far I only compile tested it, but > I'll run some more test to make sure the uffd-wp scenarios will be working > right with the new version. I've reverted fa6009949163 and applied the attached patch on top of linux next-20220314. The ARM 32bit issues went away. :) Feel free to add: Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Best regards
On Thu, Apr 14, 2022 at 09:51:01AM +0200, Marek Szyprowski wrote: > Hi Peter, > > On 13.04.2022 18:43, Peter Xu wrote: > > On Wed, Apr 13, 2022 at 04:03:28PM +0200, Marek Szyprowski wrote: > >> On 05.04.2022 03:48, Peter Xu wrote: > >>> We used to check against none pte in finish_fault(), with the assumption > >>> that the orig_pte is always none pte. > >>> > >>> This change prepares us to be able to call do_fault() on !none ptes. For > >>> example, we should allow that to happen for pte marker so that we can restore > >>> information out of the pte markers. > >>> > >>> Let's change the "pte_none" check into detecting changes since we fetched > >>> orig_pte. One trivial thing to take care of here is, when pmd==NULL for > >>> the pgtable we may not initialize orig_pte at all in handle_pte_fault(). > >>> > >>> By default orig_pte will be all zeros however the problem is not all > >>> architectures are using all-zeros for a none pte. pte_clear() will be the > >>> right thing to use here so that we'll always have a valid orig_pte value > >>> for the whole handle_pte_fault() call. > >>> > >>> Signed-off-by: Peter Xu <peterx@redhat.com> > >> This patch landed in today's linux next-202204213 as commit fa6009949163 > >> ("mm: check against orig_pte for finish_fault()"). Unfortunately it > >> causes serious system instability on some ARM 32bit machines. I've > >> observed it on all tested boards (various Samsung Exynos based, > >> Raspberry Pi 3b and 4b, even QEMU's virt 32bit machine) when kernel was > >> compiled from multi_v7_defconfig. > > Thanks for the report. > > > >> Here is a crash log from QEMU's ARM 32bit virt machine: > >> > >> 8<--- cut here --- > >> Unable to handle kernel paging request at virtual address e093263c > >> [e093263c] *pgd=42083811, *pte=00000000, *ppte=00000000 > >> Internal error: Oops: 807 [#1] SMP ARM > >> Modules linked in: > >> CPU: 1 PID: 37 Comm: kworker/u4:0 Not tainted > >> 5.18.0-rc2-00176-gfa6009949163 #11684 > >> Hardware name: Generic DT based system > >> PC is at cpu_ca15_set_pte_ext+0x4c/0x58 > >> LR is at handle_mm_fault+0x46c/0xbb0 > > I had a feeling that for some reason the pte_clear() isn't working right > > there when it's applying to a kernel stack variable for arm32. I'm totally > > newbie to arm32, so what I'm reading is this: > > > > https://people.kernel.org/linusw/arm32-page-tables > > > > Especially: > > > > https://protect2.fireeye.com/v1/url?k=35bc90ac-6a27a9bd-35bd1be3-0cc47a31cdbc-b032cb1d178dc691&q=1&e=c82daefb-c86b-4ca1-8db1-cadbdc124ed2&u=https%3A%2F%2Fdflund.se%2F%7Etriad%2Fimages%2Fclassic-mmu-page-table.jpg > > > > It does match with what I read from arm32's proc-v7-2level.S of it, where > > from the comment above cpu_v7_set_pte_ext: > > > > * - ptep - pointer to level 2 translation table entry > > * (hardware version is stored at +2048 bytes) <---------- > > > > So it seems to me that arm32 needs to store some metadata at offset 0x800 > > of any pte_t* pointer passed over to pte_clear(), then it must be a real > > pgtable or it'll write to random places in the kernel, am I correct? > > > > Does it mean that all pte_*() operations upon a kernel stack var will be > > wrong? I thought it could happen easily in the rest of mm too but I didn't > > yet check much. The fact shows that it's mostly possible the current code > > just work well with arm32 and no such violation occured yet. > > > > That does sound a bit tricky, IMHO. But I don't have an immediate solution > > to make it less tricky.. though I have a thought of workaround, by simply > > not calling pte_clear() on the stack var. > > > > Would you try the attached patch to replace this problematic patch? So we > > need to revert commit fa6009949163 and apply the new one. Please let me > > know whether it'll solve the problem, so far I only compile tested it, but > > I'll run some more test to make sure the uffd-wp scenarios will be working > > right with the new version. > > I've reverted fa6009949163 and applied the attached patch on top of > linux next-20220314. The ARM 32bit issues went away. :) > > Feel free to add: > > Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> > > Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Thanks, Marek, for the fast feedback! I've also verified it for the uffd-wp case so the whole series keeps running as usual and nothing else shows up after the new patch replaced. Andrew, any suggestion on how we proceed with the replacement patch? E.g. do you want me to post it separately to the list? Please let me know your preference, thanks.
On Thu, 14 Apr 2022 12:30:06 -0400 Peter Xu <peterx@redhat.com> wrote: > > Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> > > > > Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> > > Thanks, Marek, for the fast feedback! Certainly. > I've also verified it for the uffd-wp case so the whole series keeps > running as usual and nothing else shows up after the new patch replaced. > > Andrew, any suggestion on how we proceed with the replacement patch? > E.g. do you want me to post it separately to the list? I turned it into an incremental diff and queued it against [03/23]: --- a/include/linux/mm_types.h~mm-check-against-orig_pte-for-finish_fault-fix +++ a/include/linux/mm_types.h @@ -814,6 +814,8 @@ typedef struct { * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to unshare (and mark * exclusive) a possibly shared anonymous page that is * mapped R/O. + * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached. + * We should only access orig_pte if this flag set. * * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify * whether we would allow page faults to retry by specifying these two @@ -850,6 +852,7 @@ enum fault_flag { FAULT_FLAG_INSTRUCTION = 1 << 8, FAULT_FLAG_INTERRUPTIBLE = 1 << 9, FAULT_FLAG_UNSHARE = 1 << 10, + FAULT_FLAG_ORIG_PTE_VALID = 1 << 11, }; #endif /* _LINUX_MM_TYPES_H */ --- a/mm/memory.c~mm-check-against-orig_pte-for-finish_fault-fix +++ a/mm/memory.c @@ -4194,6 +4194,15 @@ void do_set_pte(struct vm_fault *vmf, st set_pte_at(vma->vm_mm, addr, vmf->pte, entry); } +static bool vmf_pte_changed(struct vm_fault *vmf) +{ + if (vmf->flags & FAULT_FLAG_ORIG_PTE_VALID) { + return !pte_same(*vmf->pte, vmf->orig_pte); + } + + return !pte_none(*vmf->pte); +} + /** * finish_fault - finish page fault once we have prepared the page to fault * @@ -4252,7 +4261,7 @@ vm_fault_t finish_fault(struct vm_fault vmf->address, &vmf->ptl); ret = 0; /* Re-check under ptl */ - if (likely(pte_same(*vmf->pte, vmf->orig_pte))) + if (likely(!vmf_pte_changed(vmf))) do_set_pte(vmf, page, vmf->address); else ret = VM_FAULT_NOPAGE; @@ -4720,13 +4729,7 @@ static vm_fault_t handle_pte_fault(struc * concurrent faults and from rmap lookups. */ vmf->pte = NULL; - /* - * Always initialize orig_pte. This matches with below - * code to have orig_pte to be the none pte if pte==NULL. - * This makes the rest code to be always safe to reference - * it, e.g. in finish_fault() we'll detect pte changes. - */ - pte_clear(vmf->vma->vm_mm, vmf->address, &vmf->orig_pte); + vmf->flags &= ~FAULT_FLAG_ORIG_PTE_VALID; } else { /* * If a huge pmd materialized under us just retry later. Use @@ -4750,6 +4753,7 @@ static vm_fault_t handle_pte_fault(struc */ vmf->pte = pte_offset_map(vmf->pmd, vmf->address); vmf->orig_pte = *vmf->pte; + vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID; /* * some architectures can have larger ptes than wordsize,
On Thu, Apr 14, 2022 at 01:57:40PM -0700, Andrew Morton wrote: > On Thu, 14 Apr 2022 12:30:06 -0400 Peter Xu <peterx@redhat.com> wrote: > > > > Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> > > > > > > Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> > > > > Thanks, Marek, for the fast feedback! > > Certainly. > > > I've also verified it for the uffd-wp case so the whole series keeps > > running as usual and nothing else shows up after the new patch replaced. > > > > Andrew, any suggestion on how we proceed with the replacement patch? > > E.g. do you want me to post it separately to the list? > > I turned it into an incremental diff and queued it against [03/23]: > > --- a/include/linux/mm_types.h~mm-check-against-orig_pte-for-finish_fault-fix > +++ a/include/linux/mm_types.h > @@ -814,6 +814,8 @@ typedef struct { > * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to unshare (and mark > * exclusive) a possibly shared anonymous page that is > * mapped R/O. > + * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached. > + * We should only access orig_pte if this flag set. > * > * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify > * whether we would allow page faults to retry by specifying these two > @@ -850,6 +852,7 @@ enum fault_flag { > FAULT_FLAG_INSTRUCTION = 1 << 8, > FAULT_FLAG_INTERRUPTIBLE = 1 << 9, > FAULT_FLAG_UNSHARE = 1 << 10, > + FAULT_FLAG_ORIG_PTE_VALID = 1 << 11, > }; > > #endif /* _LINUX_MM_TYPES_H */ > --- a/mm/memory.c~mm-check-against-orig_pte-for-finish_fault-fix > +++ a/mm/memory.c > @@ -4194,6 +4194,15 @@ void do_set_pte(struct vm_fault *vmf, st > set_pte_at(vma->vm_mm, addr, vmf->pte, entry); > } > > +static bool vmf_pte_changed(struct vm_fault *vmf) > +{ > + if (vmf->flags & FAULT_FLAG_ORIG_PTE_VALID) { > + return !pte_same(*vmf->pte, vmf->orig_pte); > + } > + > + return !pte_none(*vmf->pte); > +} > + > /** > * finish_fault - finish page fault once we have prepared the page to fault > * > @@ -4252,7 +4261,7 @@ vm_fault_t finish_fault(struct vm_fault > vmf->address, &vmf->ptl); > ret = 0; > /* Re-check under ptl */ > - if (likely(pte_same(*vmf->pte, vmf->orig_pte))) > + if (likely(!vmf_pte_changed(vmf))) > do_set_pte(vmf, page, vmf->address); > else > ret = VM_FAULT_NOPAGE; > @@ -4720,13 +4729,7 @@ static vm_fault_t handle_pte_fault(struc > * concurrent faults and from rmap lookups. > */ > vmf->pte = NULL; > - /* > - * Always initialize orig_pte. This matches with below > - * code to have orig_pte to be the none pte if pte==NULL. > - * This makes the rest code to be always safe to reference > - * it, e.g. in finish_fault() we'll detect pte changes. > - */ > - pte_clear(vmf->vma->vm_mm, vmf->address, &vmf->orig_pte); > + vmf->flags &= ~FAULT_FLAG_ORIG_PTE_VALID; > } else { > /* > * If a huge pmd materialized under us just retry later. Use > @@ -4750,6 +4753,7 @@ static vm_fault_t handle_pte_fault(struc > */ > vmf->pte = pte_offset_map(vmf->pmd, vmf->address); > vmf->orig_pte = *vmf->pte; > + vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID; > > /* > * some architectures can have larger ptes than wordsize, > _ > I verified the diff, that matches with what I got. Thanks Andrew.
Hi, On Mon, Apr 04, 2022 at 09:48:36PM -0400, Peter Xu wrote: > We used to check against none pte in finish_fault(), with the assumption > that the orig_pte is always none pte. > > This change prepares us to be able to call do_fault() on !none ptes. For > example, we should allow that to happen for pte marker so that we can restore > information out of the pte markers. > > Let's change the "pte_none" check into detecting changes since we fetched > orig_pte. One trivial thing to take care of here is, when pmd==NULL for > the pgtable we may not initialize orig_pte at all in handle_pte_fault(). > > By default orig_pte will be all zeros however the problem is not all > architectures are using all-zeros for a none pte. pte_clear() will be the > right thing to use here so that we'll always have a valid orig_pte value > for the whole handle_pte_fault() call. > > Signed-off-by: Peter Xu <peterx@redhat.com> This patch crashes pretty much all arm images in linux-next. Reverting it fixes the problem. Sample crash log and bisect results attached. Guenter --- [ 11.232343] 8<--- cut here --- [ 11.232564] Unable to handle kernel paging request at virtual address 88016664 [ 11.232735] [88016664] *pgd=41cfd811, *pte=00000000, *ppte=00000000 [ 11.233128] Internal error: Oops: 807 [#1] ARM [ 11.233385] CPU: 0 PID: 1 Comm: swapper Not tainted 5.18.0-rc2-next-20220414 #1 [ 11.233564] Hardware name: Generic DT based system [ 11.233695] PC is at cpu_arm926_set_pte_ext+0x2c/0x40 [ 11.233863] LR is at handle_mm_fault+0x4b0/0x11a8 [ 11.233963] pc : [<8010e60c>] lr : [<802944ec>] psr: 00000113 [ 11.234080] sp : 88015e20 ip : 88015e7c fp : 00000492 [ 11.234179] r10: 00000000 r9 : 00000000 r8 : 81167e50 [ 11.234280] r7 : 00000000 r6 : 00000081 r5 : 7efffff1 r4 : 83034690 [ 11.234402] r3 : 00000043 r2 : 00000000 r1 : 00000000 r0 : 88016664 [ 11.234549] Flags: nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none [ 11.234691] Control: 00093177 Table: 40004000 DAC: 00000053 [ 11.234816] Register r0 information: non-paged memory [ 11.235031] Register r1 information: NULL pointer [ 11.235127] Register r2 information: NULL pointer [ 11.235219] Register r3 information: non-paged memory [ 11.235316] Register r4 information: slab vm_area_struct start 83034688 data offset 8 pointer offset 0 allocated at vm_area_alloc+0x20/0x5c [ 11.235825] kmem_cache_alloc+0x1fc/0x21c [ 11.235926] vm_area_alloc+0x20/0x5c [ 11.236007] alloc_bprm+0xd0/0x298 [ 11.236082] kernel_execve+0x34/0x194 [ 11.236159] kernel_init+0x6c/0x138 [ 11.236235] ret_from_fork+0x14/0x3c [ 11.236330] Register r5 information: non-paged memory [ 11.236432] Register r6 information: non-paged memory [ 11.236529] Register r7 information: NULL pointer [ 11.236620] Register r8 information: non-slab/vmalloc memory [ 11.236741] Register r9 information: NULL pointer [ 11.236833] Register r10 information: NULL pointer [ 11.236926] Register r11 information: non-paged memory [ 11.237023] Register r12 information: 2-page vmalloc region starting at 0x88014000 allocated at kernel_clone+0xa0/0x440 [ 11.237253] Process swapper (pid: 1, stack limit = 0x88014000) [ 11.237388] Stack: (0x88015e20 to 0x88016000) [ 11.237518] 5e20: ffffffff fffffffe 81d29be0 00000000 a0000193 00000000 81d2a1e8 00007f7e [ 11.237670] 5e40: 816580a8 83034690 00000cc0 0007efff 7efff000 7efffff1 00000081 83199fb8 [ 11.237814] 5e60: 83199fb8 00000000 00000000 00000000 00000000 00000000 00000000 0a363e34 [ 11.237957] 5e80: 88015ea4 83034690 7efffff1 00002017 00000081 81f4dd00 00001fb8 00000000 [ 11.238100] 5ea0: 00000492 8028d160 00000000 81d29be0 00000001 00002017 80deedcc 81d29be0 [ 11.238241] 5ec0: 00000000 81f4dd00 7efffff1 88015f38 81f4dd60 00002017 00000000 8028d64c [ 11.238383] 5ee0: 88015f38 00000000 00000000 7efffff1 81f4dd00 00000000 00000001 00000000 [ 11.238524] 5f00: 00000011 82d80800 00000001 7efffff1 81f4dd00 00000011 7efffff1 0000000b [ 11.238666] 5f20: 82d80800 802ca218 88015f38 00000000 00000000 000001d3 80e0b43c 0a363e34 [ 11.238808] 5f40: 00000ffc 82d80800 81d73140 81d29be0 0000000b 802cb390 81d7315b 802ca0bc [ 11.238950] 5f60: 8110c940 0000000c 82d80800 81d73140 8110c8b0 8110c93c 00000000 00000000 [ 11.239091] 5f80: 00000000 802cbf44 81107820 8110c8b0 00000000 00000000 00000000 80b05400 [ 11.239234] 5fa0: 00000000 80b05394 00000000 801000f8 00000000 00000000 00000000 00000000 [ 11.239376] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 11.239518] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000 [ 11.239770] Code: e31300c0 03822e55 e3130003 13a02000 (e5802000) [ 11.240097] ---[ end trace 0000000000000000 ]--- [ 11.240307] Kernel panic - not syncing: Fatal exception -- # bad: [40354149f4d738dc3492d9998e45b3f02950369a] Add linux-next specific files for 20220414 # good: [ce522ba9ef7e2d9fb22a39eb3371c0c64e2a433e] Linux 5.18-rc2 git bisect start 'HEAD' 'v5.18-rc2' # good: [0f52e407eccb0f7ed62fdd8907b0042f4195159e] Merge branch 'drm-next' of git://git.freedesktop.org/git/drm/drm.git git bisect good 0f52e407eccb0f7ed62fdd8907b0042f4195159e # good: [22b1b3a579c91a6afa945711eac72ab740b8f8e4] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git git bisect good 22b1b3a579c91a6afa945711eac72ab740b8f8e4 # good: [cbb5c08b3182cb498f67fa547392191a1d5622dd] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine.git git bisect good cbb5c08b3182cb498f67fa547392191a1d5622dd # good: [2acd94b759428825f0e8835fa24ad22c7b5c0e2c] Merge branch 'for-next/kspp' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git git bisect good 2acd94b759428825f0e8835fa24ad22c7b5c0e2c # bad: [d2d293faec99124d95590e88030ae3c8382fac7f] mm/shmem: persist uffd-wp bit across zapping for file-backed git bisect bad d2d293faec99124d95590e88030ae3c8382fac7f # good: [8cbcc910aec560e78e879cf82ed17e7e72d8a7d4] doc: update documentation for swap_activate and swap_rw git bisect good 8cbcc910aec560e78e879cf82ed17e7e72d8a7d4 # good: [8c55a1ed1f9b95520b0307ba0ac6ff7f1aadfe9d] mm/page_alloc: simplify update of pgdat in wake_all_kswapds git bisect good 8c55a1ed1f9b95520b0307ba0ac6ff7f1aadfe9d # good: [3e68e467590511e2cf7f47194464a5512583f641] mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP* git bisect good 3e68e467590511e2cf7f47194464a5512583f641 # good: [3fb21f4e38824f4d8a183ffcccc03b357ad836d4] mm: mmap: register suitable readonly file vmas for khugepaged git bisect good 3fb21f4e38824f4d8a183ffcccc03b357ad836d4 # bad: [fa600994916318341cf53e18769be547aa5975d2] mm: check against orig_pte for finish_fault() git bisect bad fa600994916318341cf53e18769be547aa5975d2 # good: [1112411b72b5e9774897538260028a677d616779] fixup! mm: Introduce PTE_MARKER swap entry git bisect good 1112411b72b5e9774897538260028a677d616779 # good: [1ae034d98f81a6cf8896b37c3dee9e099daeb3e7] mm: teach core mm about pte markers git bisect good 1ae034d98f81a6cf8896b37c3dee9e099daeb3e7 # first bad commit: [fa600994916318341cf53e18769be547aa5975d2] mm: check against orig_pte for finish_fault()
On Fri, Apr 15, 2022 at 07:21:12AM -0700, Guenter Roeck wrote: > Hi, Hi, Guenter, > > On Mon, Apr 04, 2022 at 09:48:36PM -0400, Peter Xu wrote: > > We used to check against none pte in finish_fault(), with the assumption > > that the orig_pte is always none pte. > > > > This change prepares us to be able to call do_fault() on !none ptes. For > > example, we should allow that to happen for pte marker so that we can restore > > information out of the pte markers. > > > > Let's change the "pte_none" check into detecting changes since we fetched > > orig_pte. One trivial thing to take care of here is, when pmd==NULL for > > the pgtable we may not initialize orig_pte at all in handle_pte_fault(). > > > > By default orig_pte will be all zeros however the problem is not all > > architectures are using all-zeros for a none pte. pte_clear() will be the > > right thing to use here so that we'll always have a valid orig_pte value > > for the whole handle_pte_fault() call. > > > > Signed-off-by: Peter Xu <peterx@redhat.com> > > This patch crashes pretty much all arm images in linux-next. Reverting it > fixes the problem. Sample crash log and bisect results attached. Sorry for the issue, and thanks for reporting and bisecting. It's already reported by Marek and this problematic patch will be replaced by this one (already updated in -mm, but may land -next later I think): https://lore.kernel.org/all/Ylb9rXJyPm8%2Fao8f@xz-m1.local/ Thanks, > > Guenter > > --- > [ 11.232343] 8<--- cut here --- > [ 11.232564] Unable to handle kernel paging request at virtual address 88016664 > [ 11.232735] [88016664] *pgd=41cfd811, *pte=00000000, *ppte=00000000 > [ 11.233128] Internal error: Oops: 807 [#1] ARM > [ 11.233385] CPU: 0 PID: 1 Comm: swapper Not tainted 5.18.0-rc2-next-20220414 #1 > [ 11.233564] Hardware name: Generic DT based system > [ 11.233695] PC is at cpu_arm926_set_pte_ext+0x2c/0x40 > [ 11.233863] LR is at handle_mm_fault+0x4b0/0x11a8 > [ 11.233963] pc : [<8010e60c>] lr : [<802944ec>] psr: 00000113 > [ 11.234080] sp : 88015e20 ip : 88015e7c fp : 00000492 > [ 11.234179] r10: 00000000 r9 : 00000000 r8 : 81167e50 > [ 11.234280] r7 : 00000000 r6 : 00000081 r5 : 7efffff1 r4 : 83034690 > [ 11.234402] r3 : 00000043 r2 : 00000000 r1 : 00000000 r0 : 88016664 > [ 11.234549] Flags: nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none > [ 11.234691] Control: 00093177 Table: 40004000 DAC: 00000053 > [ 11.234816] Register r0 information: non-paged memory > [ 11.235031] Register r1 information: NULL pointer > [ 11.235127] Register r2 information: NULL pointer > [ 11.235219] Register r3 information: non-paged memory > [ 11.235316] Register r4 information: slab vm_area_struct start 83034688 data offset 8 pointer offset 0 allocated at vm_area_alloc+0x20/0x5c > [ 11.235825] kmem_cache_alloc+0x1fc/0x21c > [ 11.235926] vm_area_alloc+0x20/0x5c > [ 11.236007] alloc_bprm+0xd0/0x298 > [ 11.236082] kernel_execve+0x34/0x194 > [ 11.236159] kernel_init+0x6c/0x138 > [ 11.236235] ret_from_fork+0x14/0x3c > [ 11.236330] Register r5 information: non-paged memory > [ 11.236432] Register r6 information: non-paged memory > [ 11.236529] Register r7 information: NULL pointer > [ 11.236620] Register r8 information: non-slab/vmalloc memory > [ 11.236741] Register r9 information: NULL pointer > [ 11.236833] Register r10 information: NULL pointer > [ 11.236926] Register r11 information: non-paged memory > [ 11.237023] Register r12 information: 2-page vmalloc region starting at 0x88014000 allocated at kernel_clone+0xa0/0x440 > [ 11.237253] Process swapper (pid: 1, stack limit = 0x88014000) > [ 11.237388] Stack: (0x88015e20 to 0x88016000) > [ 11.237518] 5e20: ffffffff fffffffe 81d29be0 00000000 a0000193 00000000 81d2a1e8 00007f7e > [ 11.237670] 5e40: 816580a8 83034690 00000cc0 0007efff 7efff000 7efffff1 00000081 83199fb8 > [ 11.237814] 5e60: 83199fb8 00000000 00000000 00000000 00000000 00000000 00000000 0a363e34 > [ 11.237957] 5e80: 88015ea4 83034690 7efffff1 00002017 00000081 81f4dd00 00001fb8 00000000 > [ 11.238100] 5ea0: 00000492 8028d160 00000000 81d29be0 00000001 00002017 80deedcc 81d29be0 > [ 11.238241] 5ec0: 00000000 81f4dd00 7efffff1 88015f38 81f4dd60 00002017 00000000 8028d64c > [ 11.238383] 5ee0: 88015f38 00000000 00000000 7efffff1 81f4dd00 00000000 00000001 00000000 > [ 11.238524] 5f00: 00000011 82d80800 00000001 7efffff1 81f4dd00 00000011 7efffff1 0000000b > [ 11.238666] 5f20: 82d80800 802ca218 88015f38 00000000 00000000 000001d3 80e0b43c 0a363e34 > [ 11.238808] 5f40: 00000ffc 82d80800 81d73140 81d29be0 0000000b 802cb390 81d7315b 802ca0bc > [ 11.238950] 5f60: 8110c940 0000000c 82d80800 81d73140 8110c8b0 8110c93c 00000000 00000000 > [ 11.239091] 5f80: 00000000 802cbf44 81107820 8110c8b0 00000000 00000000 00000000 80b05400 > [ 11.239234] 5fa0: 00000000 80b05394 00000000 801000f8 00000000 00000000 00000000 00000000 > [ 11.239376] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 > [ 11.239518] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000 > [ 11.239770] Code: e31300c0 03822e55 e3130003 13a02000 (e5802000) > [ 11.240097] ---[ end trace 0000000000000000 ]--- > [ 11.240307] Kernel panic - not syncing: Fatal exception > > -- > # bad: [40354149f4d738dc3492d9998e45b3f02950369a] Add linux-next specific files for 20220414 > # good: [ce522ba9ef7e2d9fb22a39eb3371c0c64e2a433e] Linux 5.18-rc2 > git bisect start 'HEAD' 'v5.18-rc2' > # good: [0f52e407eccb0f7ed62fdd8907b0042f4195159e] Merge branch 'drm-next' of git://git.freedesktop.org/git/drm/drm.git > git bisect good 0f52e407eccb0f7ed62fdd8907b0042f4195159e > # good: [22b1b3a579c91a6afa945711eac72ab740b8f8e4] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git > git bisect good 22b1b3a579c91a6afa945711eac72ab740b8f8e4 > # good: [cbb5c08b3182cb498f67fa547392191a1d5622dd] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine.git > git bisect good cbb5c08b3182cb498f67fa547392191a1d5622dd > # good: [2acd94b759428825f0e8835fa24ad22c7b5c0e2c] Merge branch 'for-next/kspp' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git > git bisect good 2acd94b759428825f0e8835fa24ad22c7b5c0e2c > # bad: [d2d293faec99124d95590e88030ae3c8382fac7f] mm/shmem: persist uffd-wp bit across zapping for file-backed > git bisect bad d2d293faec99124d95590e88030ae3c8382fac7f > # good: [8cbcc910aec560e78e879cf82ed17e7e72d8a7d4] doc: update documentation for swap_activate and swap_rw > git bisect good 8cbcc910aec560e78e879cf82ed17e7e72d8a7d4 > # good: [8c55a1ed1f9b95520b0307ba0ac6ff7f1aadfe9d] mm/page_alloc: simplify update of pgdat in wake_all_kswapds > git bisect good 8c55a1ed1f9b95520b0307ba0ac6ff7f1aadfe9d > # good: [3e68e467590511e2cf7f47194464a5512583f641] mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP* > git bisect good 3e68e467590511e2cf7f47194464a5512583f641 > # good: [3fb21f4e38824f4d8a183ffcccc03b357ad836d4] mm: mmap: register suitable readonly file vmas for khugepaged > git bisect good 3fb21f4e38824f4d8a183ffcccc03b357ad836d4 > # bad: [fa600994916318341cf53e18769be547aa5975d2] mm: check against orig_pte for finish_fault() > git bisect bad fa600994916318341cf53e18769be547aa5975d2 > # good: [1112411b72b5e9774897538260028a677d616779] fixup! mm: Introduce PTE_MARKER swap entry > git bisect good 1112411b72b5e9774897538260028a677d616779 > # good: [1ae034d98f81a6cf8896b37c3dee9e099daeb3e7] mm: teach core mm about pte markers > git bisect good 1ae034d98f81a6cf8896b37c3dee9e099daeb3e7 > # first bad commit: [fa600994916318341cf53e18769be547aa5975d2] mm: check against orig_pte for finish_fault() >
diff --git a/mm/memory.c b/mm/memory.c index 3f396241a7db..b1af996b09ca 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4241,7 +4241,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf) vmf->address, &vmf->ptl); ret = 0; /* Re-check under ptl */ - if (likely(pte_none(*vmf->pte))) + if (likely(pte_same(*vmf->pte, vmf->orig_pte))) do_set_pte(vmf, page, vmf->address); else ret = VM_FAULT_NOPAGE; @@ -4709,6 +4709,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) * concurrent faults and from rmap lookups. */ vmf->pte = NULL; + /* + * Always initialize orig_pte. This matches with below + * code to have orig_pte to be the none pte if pte==NULL. + * This makes the rest code to be always safe to reference + * it, e.g. in finish_fault() we'll detect pte changes. + */ + pte_clear(vmf->vma->vm_mm, vmf->address, &vmf->orig_pte); } else { /* * If a huge pmd materialized under us just retry later. Use
We used to check against none pte in finish_fault(), with the assumption that the orig_pte is always none pte. This change prepares us to be able to call do_fault() on !none ptes. For example, we should allow that to happen for pte marker so that we can restore information out of the pte markers. Let's change the "pte_none" check into detecting changes since we fetched orig_pte. One trivial thing to take care of here is, when pmd==NULL for the pgtable we may not initialize orig_pte at all in handle_pte_fault(). By default orig_pte will be all zeros however the problem is not all architectures are using all-zeros for a none pte. pte_clear() will be the right thing to use here so that we'll always have a valid orig_pte value for the whole handle_pte_fault() call. Signed-off-by: Peter Xu <peterx@redhat.com> --- mm/memory.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)