diff mbox series

[v8,03/23] mm: Check against orig_pte for finish_fault()

Message ID 20220405014836.14077-1-peterx@redhat.com (mailing list archive)
State New
Headers show
Series userfaultfd-wp: Support shmem and hugetlbfs | expand

Commit Message

Peter Xu April 5, 2022, 1:48 a.m. UTC
We used to check against none pte in finish_fault(), with the assumption
that the orig_pte is always none pte.

This change prepares us to be able to call do_fault() on !none ptes.  For
example, we should allow that to happen for pte marker so that we can restore
information out of the pte markers.

Let's change the "pte_none" check into detecting changes since we fetched
orig_pte.  One trivial thing to take care of here is, when pmd==NULL for
the pgtable we may not initialize orig_pte at all in handle_pte_fault().

By default orig_pte will be all zeros however the problem is not all
architectures are using all-zeros for a none pte.  pte_clear() will be the
right thing to use here so that we'll always have a valid orig_pte value
for the whole handle_pte_fault() call.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Comments

Alistair Popple April 12, 2022, 2:05 a.m. UTC | #1
Looks ok to me now and should work for all architectures.

Reviewed-by: Alistair Popple <apopple@nvidia.com>

Peter Xu <peterx@redhat.com> writes:

> We used to check against none pte in finish_fault(), with the assumption
> that the orig_pte is always none pte.
>
> This change prepares us to be able to call do_fault() on !none ptes.  For
> example, we should allow that to happen for pte marker so that we can restore
> information out of the pte markers.
>
> Let's change the "pte_none" check into detecting changes since we fetched
> orig_pte.  One trivial thing to take care of here is, when pmd==NULL for
> the pgtable we may not initialize orig_pte at all in handle_pte_fault().
>
> By default orig_pte will be all zeros however the problem is not all
> architectures are using all-zeros for a none pte.  pte_clear() will be the
> right thing to use here so that we'll always have a valid orig_pte value
> for the whole handle_pte_fault() call.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/memory.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 3f396241a7db..b1af996b09ca 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4241,7 +4241,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>  				      vmf->address, &vmf->ptl);
>  	ret = 0;
>  	/* Re-check under ptl */
> -	if (likely(pte_none(*vmf->pte)))
> +	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
>  		do_set_pte(vmf, page, vmf->address);
>  	else
>  		ret = VM_FAULT_NOPAGE;
> @@ -4709,6 +4709,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  		 * concurrent faults and from rmap lookups.
>  		 */
>  		vmf->pte = NULL;
> +		/*
> +		 * Always initialize orig_pte.  This matches with below
> +		 * code to have orig_pte to be the none pte if pte==NULL.
> +		 * This makes the rest code to be always safe to reference
> +		 * it, e.g. in finish_fault() we'll detect pte changes.
> +		 */
> +		pte_clear(vmf->vma->vm_mm, vmf->address, &vmf->orig_pte);
>  	} else {
>  		/*
>  		 * If a huge pmd materialized under us just retry later.  Use
Peter Xu April 12, 2022, 7:54 p.m. UTC | #2
On Tue, Apr 12, 2022 at 12:05:41PM +1000, Alistair Popple wrote:
> Looks ok to me now and should work for all architectures.
> 
> Reviewed-by: Alistair Popple <apopple@nvidia.com>

Thanks again for taking a look.  Currently the series is queued in -mm
already, but I'll take these R-bs if there'll be a new version.
Marek Szyprowski April 13, 2022, 2:03 p.m. UTC | #3
Hi,

On 05.04.2022 03:48, Peter Xu wrote:
> We used to check against none pte in finish_fault(), with the assumption
> that the orig_pte is always none pte.
>
> This change prepares us to be able to call do_fault() on !none ptes.  For
> example, we should allow that to happen for pte marker so that we can restore
> information out of the pte markers.
>
> Let's change the "pte_none" check into detecting changes since we fetched
> orig_pte.  One trivial thing to take care of here is, when pmd==NULL for
> the pgtable we may not initialize orig_pte at all in handle_pte_fault().
>
> By default orig_pte will be all zeros however the problem is not all
> architectures are using all-zeros for a none pte.  pte_clear() will be the
> right thing to use here so that we'll always have a valid orig_pte value
> for the whole handle_pte_fault() call.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

This patch landed in today's linux next-202204213 as commit fa6009949163 
("mm: check against orig_pte for finish_fault()"). Unfortunately it 
causes serious system instability on some ARM 32bit machines. I've 
observed it on all tested boards (various Samsung Exynos based, 
Raspberry Pi 3b and 4b, even QEMU's virt 32bit machine) when kernel was 
compiled from multi_v7_defconfig.

Here is a crash log from QEMU's ARM 32bit virt machine:

8<--- cut here ---
Unable to handle kernel paging request at virtual address e093263c
[e093263c] *pgd=42083811, *pte=00000000, *ppte=00000000
Internal error: Oops: 807 [#1] SMP ARM
Modules linked in:
CPU: 1 PID: 37 Comm: kworker/u4:0 Not tainted 
5.18.0-rc2-00176-gfa6009949163 #11684
Hardware name: Generic DT based system
PC is at cpu_ca15_set_pte_ext+0x4c/0x58
LR is at handle_mm_fault+0x46c/0xbb0
pc : [<c031bdec>]    lr : [<c0478144>]    psr: 40000013
sp : e0931df8  ip : e0931e54  fp : c26a8000
r10: 00000081  r9 : c2230880  r8 : 00000000
r7 : 00000081  r6 : beffffed  r5 : c267f000  r4 : c2230880
r3 : 00000000  r2 : 00000000  r1 : 00000040  r0 : e0931e3c
Flags: nZcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 4020406a  DAC: 00000051
Register r0 information: 2-page vmalloc region starting at 0xe0930000 
allocated at kernel_clone+0x8c/0x3a8
Register r1 information: non-paged memory
Register r2 information: NULL pointer
Register r3 information: NULL pointer
Register r4 information: slab task_struct start c2230880 pointer offset 0
Register r5 information: slab vm_area_struct start c267f000 pointer offset 0
Register r6 information: non-paged memory
Register r7 information: non-paged memory
Register r8 information: NULL pointer
Register r9 information: slab task_struct start c2230880 pointer offset 0
Register r10 information: non-paged memory
Register r11 information: slab mm_struct start c26a8000 pointer offset 0 
size 168
Register r12 information: 2-page vmalloc region starting at 0xe0930000 
allocated at kernel_clone+0x8c/0x3a8
Process kworker/u4:0 (pid: 37, stack limit = 0x(ptrval))
Stack: (0xe0931df8 to 0xe0932000)
...
---[ end trace 0000000000000000 ]---
CAN device driver interface
bgmac_bcma: Broadcom 47xx GBit MAC driver loaded
e1000e: Intel(R) PRO/1000 Network Driver
e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
igb: Intel(R) Gigabit Ethernet Network Driver
igb: Copyright (c) 2007-2014 Intel Corporation.
pegasus: Pegasus/Pegasus II USB Ethernet driver
usbcore: registered new interface driver pegasus
usbcore: registered new interface driver asix
usbcore: registered new interface driver ax88179_178a
usbcore: registered new interface driver cdc_ether
usbcore: registered new interface driver smsc75xx
usbcore: registered new interface driver smsc95xx
usbcore: registered new interface driver net1080
usbcore: registered new interface driver cdc_subset
usbcore: registered new interface driver zaurus
usbcore: registered new interface driver cdc_ncm
ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
ehci-pci: EHCI PCI platform driver
ehci-platform: EHCI generic platform driver
ehci-omap: OMAP-EHCI Host Controller driver
ehci-orion: EHCI orion driver
SPEAr-ehci: EHCI SPEAr driver
ehci-st: EHCI STMicroelectronics driver
ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
ohci-pci: OHCI PCI platform driver
ohci-platform: OHCI generic platform driver
SPEAr-ohci: OHCI SPEAr driver
ohci-st: OHCI STMicroelectronics driver
usbcore: registered new interface driver usb-storage
rtc-pl031 9010000.pl031: registered as rtc0
rtc-pl031 9010000.pl031: setting system clock to 2022-04-13T13:49:19 UTC 
(1649857759)
i2c_dev: i2c /dev entries driver
sdhci: Secure Digital Host Controller Interface driver
sdhci: Copyright(c) Pierre Ossman
Synopsys Designware Multimedia Card Interface Driver
sdhci-pltfm: SDHCI platform and OF driver helper
ledtrig-cpu: registered to indicate activity on CPUs
usbcore: registered new interface driver usbhid
usbhid: USB HID core driver
NET: Registered PF_INET6 protocol family
Segment Routing with IPv6
In-situ OAM (IOAM) with IPv6
sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
NET: Registered PF_PACKET protocol family
can: controller area network core
NET: Registered PF_CAN protocol family
can: raw protocol
can: broadcast manager protocol
can: netlink gateway - max_hops=1
Key type dns_resolver registered
ThumbEE CPU extension supported.
Registering SWP/SWPB emulation handler
Loading compiled-in X.509 certificates
input: gpio-keys as /devices/platform/gpio-keys/input/input0
uart-pl011 9000000.pl011: no DMA platform data
EXT4-fs (vda): mounted filesystem with ordered data mode. Quota mode: 
disabled.
VFS: Mounted root (ext4 filesystem) readonly on device 254:0.
devtmpfs: mounted
Freeing unused kernel image (initmem) memory: 2048K
Run /sbin/init as init process
   with arguments:
     /sbin/init
   with environment:
     HOME=/
     TERM=linux
8<--- cut here ---
Unable to handle kernel paging request at virtual address e082662c
[e082662c] *pgd=42083811, *pte=00000000, *ppte=00000000
Internal error: Oops: 807 [#2] SMP ARM
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Tainted: G      D 
5.18.0-rc2-00176-gfa6009949163 #11684
Hardware name: Generic DT based system
PC is at cpu_ca15_set_pte_ext+0x4c/0x58
LR is at handle_mm_fault+0x46c/0xbb0
pc : [<c031bdec>]    lr : [<c0478144>]    psr: 40000013
sp : e0825de8  ip : e0825e44  fp : c213e000
r10: 00000081  r9 : c20e0000  r8 : 00000000
r7 : 00000081  r6 : befffff1  r5 : c2695000  r4 : c20e0000
r3 : 00000000  r2 : 00000000  r1 : 00000040  r0 : e0825e2c
Flags: nZcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 4020406a  DAC: 00000051
Register r0 information: 2-page vmalloc region starting at 0xe0824000 
allocated at kernel_clone+0x8c/0x3a8
Register r1 information: non-paged memory
Register r2 information: NULL pointer
Register r3 information: NULL pointer
Register r4 information: slab task_struct start c20e0000 pointer offset 0
Register r5 information: slab vm_area_struct start c2695000 pointer offset 0
Register r6 information: non-paged memory
Register r7 information: non-paged memory
Register r8 information: NULL pointer
Register r9 information: slab task_struct start c20e0000 pointer offset 0
Register r10 information: non-paged memory
Register r11 information: slab mm_struct start c213e000 pointer offset 0 
size 168
Register r12 information: 2-page vmalloc region starting at 0xe0824000 
allocated at kernel_clone+0x8c/0x3a8
Process swapper/0 (pid: 1, stack limit = 0x(ptrval))
Stack: (0xe0825de8 to 0xe0826000)
...
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
CPU1: stopping
CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D 
5.18.0-rc2-00176-gfa6009949163 #11684
Hardware name: Generic DT based system
  unwind_backtrace from show_stack+0x10/0x14
  show_stack from dump_stack_lvl+0x40/0x4c
  dump_stack_lvl from do_handle_IPI+0x2c4/0x2fc
  do_handle_IPI from ipi_handler+0x18/0x20
  ipi_handler from handle_percpu_devid_irq+0x8c/0x1e0
  handle_percpu_devid_irq from generic_handle_domain_irq+0x40/0x84
  generic_handle_domain_irq from gic_handle_irq+0x88/0xa8
  gic_handle_irq from generic_handle_arch_irq+0x34/0x44
  generic_handle_arch_irq from call_with_stack+0x18/0x20
  call_with_stack from __irq_svc+0x98/0xb0
Exception stack(0xe0869f50 to 0xe0869f98)
9f40:                                     00009ddc 00000000 00000001 
c031be20
9f60: c20e5d80 c1b48f20 c1904d10 c1904d6c c183e9e8 c1b47971 00000000 
00000000
9f80: c1904e24 e0869fa0 c0307b74 c0307b78 60000113 ffffffff
  __irq_svc from arch_cpu_idle+0x38/0x3c
  arch_cpu_idle from default_idle_call+0x3c/0xb8
  default_idle_call from do_idle+0x1f8/0x298
  do_idle from cpu_startup_entry+0x18/0x1c
  cpu_startup_entry from 0x40301780
---[ end Kernel panic - not syncing: Attempted to kill init! 
exitcode=0x0000000b ]---


> ---
>   mm/memory.c | 9 ++++++++-
>   1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 3f396241a7db..b1af996b09ca 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4241,7 +4241,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>   				      vmf->address, &vmf->ptl);
>   	ret = 0;
>   	/* Re-check under ptl */
> -	if (likely(pte_none(*vmf->pte)))
> +	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
>   		do_set_pte(vmf, page, vmf->address);
>   	else
>   		ret = VM_FAULT_NOPAGE;
> @@ -4709,6 +4709,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>   		 * concurrent faults and from rmap lookups.
>   		 */
>   		vmf->pte = NULL;
> +		/*
> +		 * Always initialize orig_pte.  This matches with below
> +		 * code to have orig_pte to be the none pte if pte==NULL.
> +		 * This makes the rest code to be always safe to reference
> +		 * it, e.g. in finish_fault() we'll detect pte changes.
> +		 */
> +		pte_clear(vmf->vma->vm_mm, vmf->address, &vmf->orig_pte);
>   	} else {
>   		/*
>   		 * If a huge pmd materialized under us just retry later.  Use

Best regards
Peter Xu April 13, 2022, 4:43 p.m. UTC | #4
On Wed, Apr 13, 2022 at 04:03:28PM +0200, Marek Szyprowski wrote:
> Hi,

Hi, Marek,

> 
> On 05.04.2022 03:48, Peter Xu wrote:
> > We used to check against none pte in finish_fault(), with the assumption
> > that the orig_pte is always none pte.
> >
> > This change prepares us to be able to call do_fault() on !none ptes.  For
> > example, we should allow that to happen for pte marker so that we can restore
> > information out of the pte markers.
> >
> > Let's change the "pte_none" check into detecting changes since we fetched
> > orig_pte.  One trivial thing to take care of here is, when pmd==NULL for
> > the pgtable we may not initialize orig_pte at all in handle_pte_fault().
> >
> > By default orig_pte will be all zeros however the problem is not all
> > architectures are using all-zeros for a none pte.  pte_clear() will be the
> > right thing to use here so that we'll always have a valid orig_pte value
> > for the whole handle_pte_fault() call.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> This patch landed in today's linux next-202204213 as commit fa6009949163 
> ("mm: check against orig_pte for finish_fault()"). Unfortunately it 
> causes serious system instability on some ARM 32bit machines. I've 
> observed it on all tested boards (various Samsung Exynos based, 
> Raspberry Pi 3b and 4b, even QEMU's virt 32bit machine) when kernel was 
> compiled from multi_v7_defconfig.

Thanks for the report.

> 
> Here is a crash log from QEMU's ARM 32bit virt machine:
> 
> 8<--- cut here ---
> Unable to handle kernel paging request at virtual address e093263c
> [e093263c] *pgd=42083811, *pte=00000000, *ppte=00000000
> Internal error: Oops: 807 [#1] SMP ARM
> Modules linked in:
> CPU: 1 PID: 37 Comm: kworker/u4:0 Not tainted 
> 5.18.0-rc2-00176-gfa6009949163 #11684
> Hardware name: Generic DT based system
> PC is at cpu_ca15_set_pte_ext+0x4c/0x58
> LR is at handle_mm_fault+0x46c/0xbb0

I had a feeling that for some reason the pte_clear() isn't working right
there when it's applying to a kernel stack variable for arm32.  I'm totally
newbie to arm32, so what I'm reading is this:

https://people.kernel.org/linusw/arm32-page-tables

Especially:

https://dflund.se/~triad/images/classic-mmu-page-table.jpg

It does match with what I read from arm32's proc-v7-2level.S of it, where
from the comment above cpu_v7_set_pte_ext:

  * - ptep  - pointer to level 2 translation table entry
  *    (hardware version is stored at +2048 bytes)        <----------

So it seems to me that arm32 needs to store some metadata at offset 0x800
of any pte_t* pointer passed over to pte_clear(), then it must be a real
pgtable or it'll write to random places in the kernel, am I correct?

Does it mean that all pte_*() operations upon a kernel stack var will be
wrong?  I thought it could happen easily in the rest of mm too but I didn't
yet check much.  The fact shows that it's mostly possible the current code
just work well with arm32 and no such violation occured yet.

That does sound a bit tricky, IMHO.  But I don't have an immediate solution
to make it less tricky.. though I have a thought of workaround, by simply
not calling pte_clear() on the stack var.

Would you try the attached patch to replace this problematic patch? So we
need to revert commit fa6009949163 and apply the new one.  Please let me
know whether it'll solve the problem, so far I only compile tested it, but
I'll run some more test to make sure the uffd-wp scenarios will be working
right with the new version.

Thanks,
Marek Szyprowski April 14, 2022, 7:51 a.m. UTC | #5
Hi Peter,

On 13.04.2022 18:43, Peter Xu wrote:
> On Wed, Apr 13, 2022 at 04:03:28PM +0200, Marek Szyprowski wrote:
>> On 05.04.2022 03:48, Peter Xu wrote:
>>> We used to check against none pte in finish_fault(), with the assumption
>>> that the orig_pte is always none pte.
>>>
>>> This change prepares us to be able to call do_fault() on !none ptes.  For
>>> example, we should allow that to happen for pte marker so that we can restore
>>> information out of the pte markers.
>>>
>>> Let's change the "pte_none" check into detecting changes since we fetched
>>> orig_pte.  One trivial thing to take care of here is, when pmd==NULL for
>>> the pgtable we may not initialize orig_pte at all in handle_pte_fault().
>>>
>>> By default orig_pte will be all zeros however the problem is not all
>>> architectures are using all-zeros for a none pte.  pte_clear() will be the
>>> right thing to use here so that we'll always have a valid orig_pte value
>>> for the whole handle_pte_fault() call.
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>> This patch landed in today's linux next-202204213 as commit fa6009949163
>> ("mm: check against orig_pte for finish_fault()"). Unfortunately it
>> causes serious system instability on some ARM 32bit machines. I've
>> observed it on all tested boards (various Samsung Exynos based,
>> Raspberry Pi 3b and 4b, even QEMU's virt 32bit machine) when kernel was
>> compiled from multi_v7_defconfig.
> Thanks for the report.
>
>> Here is a crash log from QEMU's ARM 32bit virt machine:
>>
>> 8<--- cut here ---
>> Unable to handle kernel paging request at virtual address e093263c
>> [e093263c] *pgd=42083811, *pte=00000000, *ppte=00000000
>> Internal error: Oops: 807 [#1] SMP ARM
>> Modules linked in:
>> CPU: 1 PID: 37 Comm: kworker/u4:0 Not tainted
>> 5.18.0-rc2-00176-gfa6009949163 #11684
>> Hardware name: Generic DT based system
>> PC is at cpu_ca15_set_pte_ext+0x4c/0x58
>> LR is at handle_mm_fault+0x46c/0xbb0
> I had a feeling that for some reason the pte_clear() isn't working right
> there when it's applying to a kernel stack variable for arm32.  I'm totally
> newbie to arm32, so what I'm reading is this:
>
> https://people.kernel.org/linusw/arm32-page-tables
>
> Especially:
>
> https://protect2.fireeye.com/v1/url?k=35bc90ac-6a27a9bd-35bd1be3-0cc47a31cdbc-b032cb1d178dc691&q=1&e=c82daefb-c86b-4ca1-8db1-cadbdc124ed2&u=https%3A%2F%2Fdflund.se%2F%7Etriad%2Fimages%2Fclassic-mmu-page-table.jpg
>
> It does match with what I read from arm32's proc-v7-2level.S of it, where
> from the comment above cpu_v7_set_pte_ext:
>
>    * - ptep  - pointer to level 2 translation table entry
>    *    (hardware version is stored at +2048 bytes)        <----------
>
> So it seems to me that arm32 needs to store some metadata at offset 0x800
> of any pte_t* pointer passed over to pte_clear(), then it must be a real
> pgtable or it'll write to random places in the kernel, am I correct?
>
> Does it mean that all pte_*() operations upon a kernel stack var will be
> wrong?  I thought it could happen easily in the rest of mm too but I didn't
> yet check much.  The fact shows that it's mostly possible the current code
> just work well with arm32 and no such violation occured yet.
>
> That does sound a bit tricky, IMHO.  But I don't have an immediate solution
> to make it less tricky.. though I have a thought of workaround, by simply
> not calling pte_clear() on the stack var.
>
> Would you try the attached patch to replace this problematic patch? So we
> need to revert commit fa6009949163 and apply the new one.  Please let me
> know whether it'll solve the problem, so far I only compile tested it, but
> I'll run some more test to make sure the uffd-wp scenarios will be working
> right with the new version.

I've reverted fa6009949163 and applied the attached patch on top of 
linux next-20220314. The ARM 32bit issues went away. :)

Feel free to add:

Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>

Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>

Best regards
Peter Xu April 14, 2022, 4:30 p.m. UTC | #6
On Thu, Apr 14, 2022 at 09:51:01AM +0200, Marek Szyprowski wrote:
> Hi Peter,
> 
> On 13.04.2022 18:43, Peter Xu wrote:
> > On Wed, Apr 13, 2022 at 04:03:28PM +0200, Marek Szyprowski wrote:
> >> On 05.04.2022 03:48, Peter Xu wrote:
> >>> We used to check against none pte in finish_fault(), with the assumption
> >>> that the orig_pte is always none pte.
> >>>
> >>> This change prepares us to be able to call do_fault() on !none ptes.  For
> >>> example, we should allow that to happen for pte marker so that we can restore
> >>> information out of the pte markers.
> >>>
> >>> Let's change the "pte_none" check into detecting changes since we fetched
> >>> orig_pte.  One trivial thing to take care of here is, when pmd==NULL for
> >>> the pgtable we may not initialize orig_pte at all in handle_pte_fault().
> >>>
> >>> By default orig_pte will be all zeros however the problem is not all
> >>> architectures are using all-zeros for a none pte.  pte_clear() will be the
> >>> right thing to use here so that we'll always have a valid orig_pte value
> >>> for the whole handle_pte_fault() call.
> >>>
> >>> Signed-off-by: Peter Xu <peterx@redhat.com>
> >> This patch landed in today's linux next-202204213 as commit fa6009949163
> >> ("mm: check against orig_pte for finish_fault()"). Unfortunately it
> >> causes serious system instability on some ARM 32bit machines. I've
> >> observed it on all tested boards (various Samsung Exynos based,
> >> Raspberry Pi 3b and 4b, even QEMU's virt 32bit machine) when kernel was
> >> compiled from multi_v7_defconfig.
> > Thanks for the report.
> >
> >> Here is a crash log from QEMU's ARM 32bit virt machine:
> >>
> >> 8<--- cut here ---
> >> Unable to handle kernel paging request at virtual address e093263c
> >> [e093263c] *pgd=42083811, *pte=00000000, *ppte=00000000
> >> Internal error: Oops: 807 [#1] SMP ARM
> >> Modules linked in:
> >> CPU: 1 PID: 37 Comm: kworker/u4:0 Not tainted
> >> 5.18.0-rc2-00176-gfa6009949163 #11684
> >> Hardware name: Generic DT based system
> >> PC is at cpu_ca15_set_pte_ext+0x4c/0x58
> >> LR is at handle_mm_fault+0x46c/0xbb0
> > I had a feeling that for some reason the pte_clear() isn't working right
> > there when it's applying to a kernel stack variable for arm32.  I'm totally
> > newbie to arm32, so what I'm reading is this:
> >
> > https://people.kernel.org/linusw/arm32-page-tables
> >
> > Especially:
> >
> > https://protect2.fireeye.com/v1/url?k=35bc90ac-6a27a9bd-35bd1be3-0cc47a31cdbc-b032cb1d178dc691&q=1&e=c82daefb-c86b-4ca1-8db1-cadbdc124ed2&u=https%3A%2F%2Fdflund.se%2F%7Etriad%2Fimages%2Fclassic-mmu-page-table.jpg
> >
> > It does match with what I read from arm32's proc-v7-2level.S of it, where
> > from the comment above cpu_v7_set_pte_ext:
> >
> >    * - ptep  - pointer to level 2 translation table entry
> >    *    (hardware version is stored at +2048 bytes)        <----------
> >
> > So it seems to me that arm32 needs to store some metadata at offset 0x800
> > of any pte_t* pointer passed over to pte_clear(), then it must be a real
> > pgtable or it'll write to random places in the kernel, am I correct?
> >
> > Does it mean that all pte_*() operations upon a kernel stack var will be
> > wrong?  I thought it could happen easily in the rest of mm too but I didn't
> > yet check much.  The fact shows that it's mostly possible the current code
> > just work well with arm32 and no such violation occured yet.
> >
> > That does sound a bit tricky, IMHO.  But I don't have an immediate solution
> > to make it less tricky.. though I have a thought of workaround, by simply
> > not calling pte_clear() on the stack var.
> >
> > Would you try the attached patch to replace this problematic patch? So we
> > need to revert commit fa6009949163 and apply the new one.  Please let me
> > know whether it'll solve the problem, so far I only compile tested it, but
> > I'll run some more test to make sure the uffd-wp scenarios will be working
> > right with the new version.
> 
> I've reverted fa6009949163 and applied the attached patch on top of 
> linux next-20220314. The ARM 32bit issues went away. :)
> 
> Feel free to add:
> 
> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
> 
> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>

Thanks, Marek, for the fast feedback!

I've also verified it for the uffd-wp case so the whole series keeps
running as usual and nothing else shows up after the new patch replaced.

Andrew, any suggestion on how we proceed with the replacement patch?
E.g. do you want me to post it separately to the list?

Please let me know your preference, thanks.
Andrew Morton April 14, 2022, 8:57 p.m. UTC | #7
On Thu, 14 Apr 2022 12:30:06 -0400 Peter Xu <peterx@redhat.com> wrote:

> > Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
> > 
> > Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
> 
> Thanks, Marek, for the fast feedback!

Certainly.

> I've also verified it for the uffd-wp case so the whole series keeps
> running as usual and nothing else shows up after the new patch replaced.
> 
> Andrew, any suggestion on how we proceed with the replacement patch?
> E.g. do you want me to post it separately to the list?

I turned it into an incremental diff and queued it against [03/23]:

--- a/include/linux/mm_types.h~mm-check-against-orig_pte-for-finish_fault-fix
+++ a/include/linux/mm_types.h
@@ -814,6 +814,8 @@ typedef struct {
  * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to unshare (and mark
  *                      exclusive) a possibly shared anonymous page that is
  *                      mapped R/O.
+ * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
+ *                        We should only access orig_pte if this flag set.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -850,6 +852,7 @@ enum fault_flag {
 	FAULT_FLAG_INSTRUCTION =	1 << 8,
 	FAULT_FLAG_INTERRUPTIBLE =	1 << 9,
 	FAULT_FLAG_UNSHARE =		1 << 10,
+	FAULT_FLAG_ORIG_PTE_VALID =	1 << 11,
 };
 
 #endif /* _LINUX_MM_TYPES_H */
--- a/mm/memory.c~mm-check-against-orig_pte-for-finish_fault-fix
+++ a/mm/memory.c
@@ -4194,6 +4194,15 @@ void do_set_pte(struct vm_fault *vmf, st
 	set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
 }
 
+static bool vmf_pte_changed(struct vm_fault *vmf)
+{
+	if (vmf->flags & FAULT_FLAG_ORIG_PTE_VALID) {
+		return !pte_same(*vmf->pte, vmf->orig_pte);
+	}
+
+	return !pte_none(*vmf->pte);
+}
+
 /**
  * finish_fault - finish page fault once we have prepared the page to fault
  *
@@ -4252,7 +4261,7 @@ vm_fault_t finish_fault(struct vm_fault
 				      vmf->address, &vmf->ptl);
 	ret = 0;
 	/* Re-check under ptl */
-	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
+	if (likely(!vmf_pte_changed(vmf)))
 		do_set_pte(vmf, page, vmf->address);
 	else
 		ret = VM_FAULT_NOPAGE;
@@ -4720,13 +4729,7 @@ static vm_fault_t handle_pte_fault(struc
 		 * concurrent faults and from rmap lookups.
 		 */
 		vmf->pte = NULL;
-		/*
-		 * Always initialize orig_pte.  This matches with below
-		 * code to have orig_pte to be the none pte if pte==NULL.
-		 * This makes the rest code to be always safe to reference
-		 * it, e.g. in finish_fault() we'll detect pte changes.
-		 */
-		pte_clear(vmf->vma->vm_mm, vmf->address, &vmf->orig_pte);
+		vmf->flags &= ~FAULT_FLAG_ORIG_PTE_VALID;
 	} else {
 		/*
 		 * If a huge pmd materialized under us just retry later.  Use
@@ -4750,6 +4753,7 @@ static vm_fault_t handle_pte_fault(struc
 		 */
 		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
 		vmf->orig_pte = *vmf->pte;
+		vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
 
 		/*
 		 * some architectures can have larger ptes than wordsize,
Peter Xu April 14, 2022, 9:08 p.m. UTC | #8
On Thu, Apr 14, 2022 at 01:57:40PM -0700, Andrew Morton wrote:
> On Thu, 14 Apr 2022 12:30:06 -0400 Peter Xu <peterx@redhat.com> wrote:
> 
> > > Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
> > > 
> > > Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
> > 
> > Thanks, Marek, for the fast feedback!
> 
> Certainly.
> 
> > I've also verified it for the uffd-wp case so the whole series keeps
> > running as usual and nothing else shows up after the new patch replaced.
> > 
> > Andrew, any suggestion on how we proceed with the replacement patch?
> > E.g. do you want me to post it separately to the list?
> 
> I turned it into an incremental diff and queued it against [03/23]:
> 
> --- a/include/linux/mm_types.h~mm-check-against-orig_pte-for-finish_fault-fix
> +++ a/include/linux/mm_types.h
> @@ -814,6 +814,8 @@ typedef struct {
>   * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to unshare (and mark
>   *                      exclusive) a possibly shared anonymous page that is
>   *                      mapped R/O.
> + * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
> + *                        We should only access orig_pte if this flag set.
>   *
>   * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
>   * whether we would allow page faults to retry by specifying these two
> @@ -850,6 +852,7 @@ enum fault_flag {
>  	FAULT_FLAG_INSTRUCTION =	1 << 8,
>  	FAULT_FLAG_INTERRUPTIBLE =	1 << 9,
>  	FAULT_FLAG_UNSHARE =		1 << 10,
> +	FAULT_FLAG_ORIG_PTE_VALID =	1 << 11,
>  };
>  
>  #endif /* _LINUX_MM_TYPES_H */
> --- a/mm/memory.c~mm-check-against-orig_pte-for-finish_fault-fix
> +++ a/mm/memory.c
> @@ -4194,6 +4194,15 @@ void do_set_pte(struct vm_fault *vmf, st
>  	set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
>  }
>  
> +static bool vmf_pte_changed(struct vm_fault *vmf)
> +{
> +	if (vmf->flags & FAULT_FLAG_ORIG_PTE_VALID) {
> +		return !pte_same(*vmf->pte, vmf->orig_pte);
> +	}
> +
> +	return !pte_none(*vmf->pte);
> +}
> +
>  /**
>   * finish_fault - finish page fault once we have prepared the page to fault
>   *
> @@ -4252,7 +4261,7 @@ vm_fault_t finish_fault(struct vm_fault
>  				      vmf->address, &vmf->ptl);
>  	ret = 0;
>  	/* Re-check under ptl */
> -	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
> +	if (likely(!vmf_pte_changed(vmf)))
>  		do_set_pte(vmf, page, vmf->address);
>  	else
>  		ret = VM_FAULT_NOPAGE;
> @@ -4720,13 +4729,7 @@ static vm_fault_t handle_pte_fault(struc
>  		 * concurrent faults and from rmap lookups.
>  		 */
>  		vmf->pte = NULL;
> -		/*
> -		 * Always initialize orig_pte.  This matches with below
> -		 * code to have orig_pte to be the none pte if pte==NULL.
> -		 * This makes the rest code to be always safe to reference
> -		 * it, e.g. in finish_fault() we'll detect pte changes.
> -		 */
> -		pte_clear(vmf->vma->vm_mm, vmf->address, &vmf->orig_pte);
> +		vmf->flags &= ~FAULT_FLAG_ORIG_PTE_VALID;
>  	} else {
>  		/*
>  		 * If a huge pmd materialized under us just retry later.  Use
> @@ -4750,6 +4753,7 @@ static vm_fault_t handle_pte_fault(struc
>  		 */
>  		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>  		vmf->orig_pte = *vmf->pte;
> +		vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>  
>  		/*
>  		 * some architectures can have larger ptes than wordsize,
> _
> 

I verified the diff, that matches with what I got.  Thanks Andrew.
Guenter Roeck April 15, 2022, 2:21 p.m. UTC | #9
Hi,

On Mon, Apr 04, 2022 at 09:48:36PM -0400, Peter Xu wrote:
> We used to check against none pte in finish_fault(), with the assumption
> that the orig_pte is always none pte.
> 
> This change prepares us to be able to call do_fault() on !none ptes.  For
> example, we should allow that to happen for pte marker so that we can restore
> information out of the pte markers.
> 
> Let's change the "pte_none" check into detecting changes since we fetched
> orig_pte.  One trivial thing to take care of here is, when pmd==NULL for
> the pgtable we may not initialize orig_pte at all in handle_pte_fault().
> 
> By default orig_pte will be all zeros however the problem is not all
> architectures are using all-zeros for a none pte.  pte_clear() will be the
> right thing to use here so that we'll always have a valid orig_pte value
> for the whole handle_pte_fault() call.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

This patch crashes pretty much all arm images in linux-next. Reverting it
fixes the problem. Sample crash log and bisect results attached.

Guenter

---
[   11.232343] 8<--- cut here ---
[   11.232564] Unable to handle kernel paging request at virtual address 88016664
[   11.232735] [88016664] *pgd=41cfd811, *pte=00000000, *ppte=00000000
[   11.233128] Internal error: Oops: 807 [#1] ARM
[   11.233385] CPU: 0 PID: 1 Comm: swapper Not tainted 5.18.0-rc2-next-20220414 #1
[   11.233564] Hardware name: Generic DT based system
[   11.233695] PC is at cpu_arm926_set_pte_ext+0x2c/0x40
[   11.233863] LR is at handle_mm_fault+0x4b0/0x11a8
[   11.233963] pc : [<8010e60c>]    lr : [<802944ec>]    psr: 00000113
[   11.234080] sp : 88015e20  ip : 88015e7c  fp : 00000492
[   11.234179] r10: 00000000  r9 : 00000000  r8 : 81167e50
[   11.234280] r7 : 00000000  r6 : 00000081  r5 : 7efffff1  r4 : 83034690
[   11.234402] r3 : 00000043  r2 : 00000000  r1 : 00000000  r0 : 88016664
[   11.234549] Flags: nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[   11.234691] Control: 00093177  Table: 40004000  DAC: 00000053
[   11.234816] Register r0 information: non-paged memory
[   11.235031] Register r1 information: NULL pointer
[   11.235127] Register r2 information: NULL pointer
[   11.235219] Register r3 information: non-paged memory
[   11.235316] Register r4 information: slab vm_area_struct start 83034688 data offset 8 pointer offset 0 allocated at vm_area_alloc+0x20/0x5c
[   11.235825]     kmem_cache_alloc+0x1fc/0x21c
[   11.235926]     vm_area_alloc+0x20/0x5c
[   11.236007]     alloc_bprm+0xd0/0x298
[   11.236082]     kernel_execve+0x34/0x194
[   11.236159]     kernel_init+0x6c/0x138
[   11.236235]     ret_from_fork+0x14/0x3c
[   11.236330] Register r5 information: non-paged memory
[   11.236432] Register r6 information: non-paged memory
[   11.236529] Register r7 information: NULL pointer
[   11.236620] Register r8 information: non-slab/vmalloc memory
[   11.236741] Register r9 information: NULL pointer
[   11.236833] Register r10 information: NULL pointer
[   11.236926] Register r11 information: non-paged memory
[   11.237023] Register r12 information: 2-page vmalloc region starting at 0x88014000 allocated at kernel_clone+0xa0/0x440
[   11.237253] Process swapper (pid: 1, stack limit = 0x88014000)
[   11.237388] Stack: (0x88015e20 to 0x88016000)
[   11.237518] 5e20: ffffffff fffffffe 81d29be0 00000000 a0000193 00000000 81d2a1e8 00007f7e
[   11.237670] 5e40: 816580a8 83034690 00000cc0 0007efff 7efff000 7efffff1 00000081 83199fb8
[   11.237814] 5e60: 83199fb8 00000000 00000000 00000000 00000000 00000000 00000000 0a363e34
[   11.237957] 5e80: 88015ea4 83034690 7efffff1 00002017 00000081 81f4dd00 00001fb8 00000000
[   11.238100] 5ea0: 00000492 8028d160 00000000 81d29be0 00000001 00002017 80deedcc 81d29be0
[   11.238241] 5ec0: 00000000 81f4dd00 7efffff1 88015f38 81f4dd60 00002017 00000000 8028d64c
[   11.238383] 5ee0: 88015f38 00000000 00000000 7efffff1 81f4dd00 00000000 00000001 00000000
[   11.238524] 5f00: 00000011 82d80800 00000001 7efffff1 81f4dd00 00000011 7efffff1 0000000b
[   11.238666] 5f20: 82d80800 802ca218 88015f38 00000000 00000000 000001d3 80e0b43c 0a363e34
[   11.238808] 5f40: 00000ffc 82d80800 81d73140 81d29be0 0000000b 802cb390 81d7315b 802ca0bc
[   11.238950] 5f60: 8110c940 0000000c 82d80800 81d73140 8110c8b0 8110c93c 00000000 00000000
[   11.239091] 5f80: 00000000 802cbf44 81107820 8110c8b0 00000000 00000000 00000000 80b05400
[   11.239234] 5fa0: 00000000 80b05394 00000000 801000f8 00000000 00000000 00000000 00000000
[   11.239376] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[   11.239518] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
[   11.239770] Code: e31300c0 03822e55 e3130003 13a02000 (e5802000)
[   11.240097] ---[ end trace 0000000000000000 ]---
[   11.240307] Kernel panic - not syncing: Fatal exception

--
# bad: [40354149f4d738dc3492d9998e45b3f02950369a] Add linux-next specific files for 20220414
# good: [ce522ba9ef7e2d9fb22a39eb3371c0c64e2a433e] Linux 5.18-rc2
git bisect start 'HEAD' 'v5.18-rc2'
# good: [0f52e407eccb0f7ed62fdd8907b0042f4195159e] Merge branch 'drm-next' of git://git.freedesktop.org/git/drm/drm.git
git bisect good 0f52e407eccb0f7ed62fdd8907b0042f4195159e
# good: [22b1b3a579c91a6afa945711eac72ab740b8f8e4] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
git bisect good 22b1b3a579c91a6afa945711eac72ab740b8f8e4
# good: [cbb5c08b3182cb498f67fa547392191a1d5622dd] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine.git
git bisect good cbb5c08b3182cb498f67fa547392191a1d5622dd
# good: [2acd94b759428825f0e8835fa24ad22c7b5c0e2c] Merge branch 'for-next/kspp' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git
git bisect good 2acd94b759428825f0e8835fa24ad22c7b5c0e2c
# bad: [d2d293faec99124d95590e88030ae3c8382fac7f] mm/shmem: persist uffd-wp bit across zapping for file-backed
git bisect bad d2d293faec99124d95590e88030ae3c8382fac7f
# good: [8cbcc910aec560e78e879cf82ed17e7e72d8a7d4] doc: update documentation for swap_activate and swap_rw
git bisect good 8cbcc910aec560e78e879cf82ed17e7e72d8a7d4
# good: [8c55a1ed1f9b95520b0307ba0ac6ff7f1aadfe9d] mm/page_alloc: simplify update of pgdat in wake_all_kswapds
git bisect good 8c55a1ed1f9b95520b0307ba0ac6ff7f1aadfe9d
# good: [3e68e467590511e2cf7f47194464a5512583f641] mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*
git bisect good 3e68e467590511e2cf7f47194464a5512583f641
# good: [3fb21f4e38824f4d8a183ffcccc03b357ad836d4] mm: mmap: register suitable readonly file vmas for khugepaged
git bisect good 3fb21f4e38824f4d8a183ffcccc03b357ad836d4
# bad: [fa600994916318341cf53e18769be547aa5975d2] mm: check against orig_pte for finish_fault()
git bisect bad fa600994916318341cf53e18769be547aa5975d2
# good: [1112411b72b5e9774897538260028a677d616779] fixup! mm: Introduce PTE_MARKER swap entry
git bisect good 1112411b72b5e9774897538260028a677d616779
# good: [1ae034d98f81a6cf8896b37c3dee9e099daeb3e7] mm: teach core mm about pte markers
git bisect good 1ae034d98f81a6cf8896b37c3dee9e099daeb3e7
# first bad commit: [fa600994916318341cf53e18769be547aa5975d2] mm: check against orig_pte for finish_fault()
Peter Xu April 15, 2022, 2:41 p.m. UTC | #10
On Fri, Apr 15, 2022 at 07:21:12AM -0700, Guenter Roeck wrote:
> Hi,

Hi, Guenter,

> 
> On Mon, Apr 04, 2022 at 09:48:36PM -0400, Peter Xu wrote:
> > We used to check against none pte in finish_fault(), with the assumption
> > that the orig_pte is always none pte.
> > 
> > This change prepares us to be able to call do_fault() on !none ptes.  For
> > example, we should allow that to happen for pte marker so that we can restore
> > information out of the pte markers.
> > 
> > Let's change the "pte_none" check into detecting changes since we fetched
> > orig_pte.  One trivial thing to take care of here is, when pmd==NULL for
> > the pgtable we may not initialize orig_pte at all in handle_pte_fault().
> > 
> > By default orig_pte will be all zeros however the problem is not all
> > architectures are using all-zeros for a none pte.  pte_clear() will be the
> > right thing to use here so that we'll always have a valid orig_pte value
> > for the whole handle_pte_fault() call.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> This patch crashes pretty much all arm images in linux-next. Reverting it
> fixes the problem. Sample crash log and bisect results attached.

Sorry for the issue, and thanks for reporting and bisecting.

It's already reported by Marek and this problematic patch will be replaced
by this one (already updated in -mm, but may land -next later I think):

https://lore.kernel.org/all/Ylb9rXJyPm8%2Fao8f@xz-m1.local/

Thanks,

> 
> Guenter
> 
> ---
> [   11.232343] 8<--- cut here ---
> [   11.232564] Unable to handle kernel paging request at virtual address 88016664
> [   11.232735] [88016664] *pgd=41cfd811, *pte=00000000, *ppte=00000000
> [   11.233128] Internal error: Oops: 807 [#1] ARM
> [   11.233385] CPU: 0 PID: 1 Comm: swapper Not tainted 5.18.0-rc2-next-20220414 #1
> [   11.233564] Hardware name: Generic DT based system
> [   11.233695] PC is at cpu_arm926_set_pte_ext+0x2c/0x40
> [   11.233863] LR is at handle_mm_fault+0x4b0/0x11a8
> [   11.233963] pc : [<8010e60c>]    lr : [<802944ec>]    psr: 00000113
> [   11.234080] sp : 88015e20  ip : 88015e7c  fp : 00000492
> [   11.234179] r10: 00000000  r9 : 00000000  r8 : 81167e50
> [   11.234280] r7 : 00000000  r6 : 00000081  r5 : 7efffff1  r4 : 83034690
> [   11.234402] r3 : 00000043  r2 : 00000000  r1 : 00000000  r0 : 88016664
> [   11.234549] Flags: nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> [   11.234691] Control: 00093177  Table: 40004000  DAC: 00000053
> [   11.234816] Register r0 information: non-paged memory
> [   11.235031] Register r1 information: NULL pointer
> [   11.235127] Register r2 information: NULL pointer
> [   11.235219] Register r3 information: non-paged memory
> [   11.235316] Register r4 information: slab vm_area_struct start 83034688 data offset 8 pointer offset 0 allocated at vm_area_alloc+0x20/0x5c
> [   11.235825]     kmem_cache_alloc+0x1fc/0x21c
> [   11.235926]     vm_area_alloc+0x20/0x5c
> [   11.236007]     alloc_bprm+0xd0/0x298
> [   11.236082]     kernel_execve+0x34/0x194
> [   11.236159]     kernel_init+0x6c/0x138
> [   11.236235]     ret_from_fork+0x14/0x3c
> [   11.236330] Register r5 information: non-paged memory
> [   11.236432] Register r6 information: non-paged memory
> [   11.236529] Register r7 information: NULL pointer
> [   11.236620] Register r8 information: non-slab/vmalloc memory
> [   11.236741] Register r9 information: NULL pointer
> [   11.236833] Register r10 information: NULL pointer
> [   11.236926] Register r11 information: non-paged memory
> [   11.237023] Register r12 information: 2-page vmalloc region starting at 0x88014000 allocated at kernel_clone+0xa0/0x440
> [   11.237253] Process swapper (pid: 1, stack limit = 0x88014000)
> [   11.237388] Stack: (0x88015e20 to 0x88016000)
> [   11.237518] 5e20: ffffffff fffffffe 81d29be0 00000000 a0000193 00000000 81d2a1e8 00007f7e
> [   11.237670] 5e40: 816580a8 83034690 00000cc0 0007efff 7efff000 7efffff1 00000081 83199fb8
> [   11.237814] 5e60: 83199fb8 00000000 00000000 00000000 00000000 00000000 00000000 0a363e34
> [   11.237957] 5e80: 88015ea4 83034690 7efffff1 00002017 00000081 81f4dd00 00001fb8 00000000
> [   11.238100] 5ea0: 00000492 8028d160 00000000 81d29be0 00000001 00002017 80deedcc 81d29be0
> [   11.238241] 5ec0: 00000000 81f4dd00 7efffff1 88015f38 81f4dd60 00002017 00000000 8028d64c
> [   11.238383] 5ee0: 88015f38 00000000 00000000 7efffff1 81f4dd00 00000000 00000001 00000000
> [   11.238524] 5f00: 00000011 82d80800 00000001 7efffff1 81f4dd00 00000011 7efffff1 0000000b
> [   11.238666] 5f20: 82d80800 802ca218 88015f38 00000000 00000000 000001d3 80e0b43c 0a363e34
> [   11.238808] 5f40: 00000ffc 82d80800 81d73140 81d29be0 0000000b 802cb390 81d7315b 802ca0bc
> [   11.238950] 5f60: 8110c940 0000000c 82d80800 81d73140 8110c8b0 8110c93c 00000000 00000000
> [   11.239091] 5f80: 00000000 802cbf44 81107820 8110c8b0 00000000 00000000 00000000 80b05400
> [   11.239234] 5fa0: 00000000 80b05394 00000000 801000f8 00000000 00000000 00000000 00000000
> [   11.239376] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> [   11.239518] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
> [   11.239770] Code: e31300c0 03822e55 e3130003 13a02000 (e5802000)
> [   11.240097] ---[ end trace 0000000000000000 ]---
> [   11.240307] Kernel panic - not syncing: Fatal exception
> 
> --
> # bad: [40354149f4d738dc3492d9998e45b3f02950369a] Add linux-next specific files for 20220414
> # good: [ce522ba9ef7e2d9fb22a39eb3371c0c64e2a433e] Linux 5.18-rc2
> git bisect start 'HEAD' 'v5.18-rc2'
> # good: [0f52e407eccb0f7ed62fdd8907b0042f4195159e] Merge branch 'drm-next' of git://git.freedesktop.org/git/drm/drm.git
> git bisect good 0f52e407eccb0f7ed62fdd8907b0042f4195159e
> # good: [22b1b3a579c91a6afa945711eac72ab740b8f8e4] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
> git bisect good 22b1b3a579c91a6afa945711eac72ab740b8f8e4
> # good: [cbb5c08b3182cb498f67fa547392191a1d5622dd] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine.git
> git bisect good cbb5c08b3182cb498f67fa547392191a1d5622dd
> # good: [2acd94b759428825f0e8835fa24ad22c7b5c0e2c] Merge branch 'for-next/kspp' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git
> git bisect good 2acd94b759428825f0e8835fa24ad22c7b5c0e2c
> # bad: [d2d293faec99124d95590e88030ae3c8382fac7f] mm/shmem: persist uffd-wp bit across zapping for file-backed
> git bisect bad d2d293faec99124d95590e88030ae3c8382fac7f
> # good: [8cbcc910aec560e78e879cf82ed17e7e72d8a7d4] doc: update documentation for swap_activate and swap_rw
> git bisect good 8cbcc910aec560e78e879cf82ed17e7e72d8a7d4
> # good: [8c55a1ed1f9b95520b0307ba0ac6ff7f1aadfe9d] mm/page_alloc: simplify update of pgdat in wake_all_kswapds
> git bisect good 8c55a1ed1f9b95520b0307ba0ac6ff7f1aadfe9d
> # good: [3e68e467590511e2cf7f47194464a5512583f641] mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*
> git bisect good 3e68e467590511e2cf7f47194464a5512583f641
> # good: [3fb21f4e38824f4d8a183ffcccc03b357ad836d4] mm: mmap: register suitable readonly file vmas for khugepaged
> git bisect good 3fb21f4e38824f4d8a183ffcccc03b357ad836d4
> # bad: [fa600994916318341cf53e18769be547aa5975d2] mm: check against orig_pte for finish_fault()
> git bisect bad fa600994916318341cf53e18769be547aa5975d2
> # good: [1112411b72b5e9774897538260028a677d616779] fixup! mm: Introduce PTE_MARKER swap entry
> git bisect good 1112411b72b5e9774897538260028a677d616779
> # good: [1ae034d98f81a6cf8896b37c3dee9e099daeb3e7] mm: teach core mm about pte markers
> git bisect good 1ae034d98f81a6cf8896b37c3dee9e099daeb3e7
> # first bad commit: [fa600994916318341cf53e18769be547aa5975d2] mm: check against orig_pte for finish_fault()
>
diff mbox series

Patch

diff --git a/mm/memory.c b/mm/memory.c
index 3f396241a7db..b1af996b09ca 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4241,7 +4241,7 @@  vm_fault_t finish_fault(struct vm_fault *vmf)
 				      vmf->address, &vmf->ptl);
 	ret = 0;
 	/* Re-check under ptl */
-	if (likely(pte_none(*vmf->pte)))
+	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
 		do_set_pte(vmf, page, vmf->address);
 	else
 		ret = VM_FAULT_NOPAGE;
@@ -4709,6 +4709,13 @@  static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		 * concurrent faults and from rmap lookups.
 		 */
 		vmf->pte = NULL;
+		/*
+		 * Always initialize orig_pte.  This matches with below
+		 * code to have orig_pte to be the none pte if pte==NULL.
+		 * This makes the rest code to be always safe to reference
+		 * it, e.g. in finish_fault() we'll detect pte changes.
+		 */
+		pte_clear(vmf->vma->vm_mm, vmf->address, &vmf->orig_pte);
 	} else {
 		/*
 		 * If a huge pmd materialized under us just retry later.  Use