[v2] mm/hugetlb: fix a addressing exception caused by huge_pte_offset()
diff mbox series

Message ID 1582342427-230392-1-git-send-email-longpeng2@huawei.com
State New
Headers show
Series
  • [v2] mm/hugetlb: fix a addressing exception caused by huge_pte_offset()
Related show

Commit Message

Longpeng(Mike) Feb. 22, 2020, 3:33 a.m. UTC
From: Longpeng <longpeng2@huawei.com>

Our machine encountered a panic(addressing exception) after run
for a long time and the calltrace is:
RIP: 0010:[<ffffffff9dff0587>]  [<ffffffff9dff0587>] hugetlb_fault+0x307/0xbe0
RSP: 0018:ffff9567fc27f808  EFLAGS: 00010286
RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48
RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48
RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080
R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8
R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074
FS:  00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff9df9b71b>] ? unlock_page+0x2b/0x30
 [<ffffffff9dff04a2>] ? hugetlb_fault+0x222/0xbe0
 [<ffffffff9dff1405>] follow_hugetlb_page+0x175/0x540
 [<ffffffff9e15b825>] ? cpumask_next_and+0x35/0x50
 [<ffffffff9dfc7230>] __get_user_pages+0x2a0/0x7e0
 [<ffffffff9dfc648d>] __get_user_pages_unlocked+0x15d/0x210
 [<ffffffffc068cfc5>] __gfn_to_pfn_memslot+0x3c5/0x460 [kvm]
 [<ffffffffc06b28be>] try_async_pf+0x6e/0x2a0 [kvm]
 [<ffffffffc06b4b41>] tdp_page_fault+0x151/0x2d0 [kvm]
 [<ffffffffc075731c>] ? vmx_vcpu_run+0x2ec/0xc80 [kvm_intel]
 [<ffffffffc0757328>] ? vmx_vcpu_run+0x2f8/0xc80 [kvm_intel]
 [<ffffffffc06abc11>] kvm_mmu_page_fault+0x31/0x140 [kvm]
 [<ffffffffc074d1ae>] handle_ept_violation+0x9e/0x170 [kvm_intel]
 [<ffffffffc075579c>] vmx_handle_exit+0x2bc/0xc70 [kvm_intel]
 [<ffffffffc074f1a0>] ? __vmx_complete_interrupts.part.73+0x80/0xd0 [kvm_intel]
 [<ffffffffc07574c0>] ? vmx_vcpu_run+0x490/0xc80 [kvm_intel]
 [<ffffffffc069f3be>] vcpu_enter_guest+0x7be/0x13a0 [kvm]
 [<ffffffffc06cf53e>] ? kvm_check_async_pf_completion+0x8e/0xb0 [kvm]
 [<ffffffffc06a6f90>] kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm]
 [<ffffffffc068d919>] kvm_vcpu_ioctl+0x309/0x6d0 [kvm]
 [<ffffffff9deaa8c2>] ? dequeue_signal+0x32/0x180
 [<ffffffff9deae34d>] ? do_sigtimedwait+0xcd/0x230
 [<ffffffff9e03aed0>] do_vfs_ioctl+0x3f0/0x540
 [<ffffffff9e03b0c1>] SyS_ioctl+0xa1/0xc0
 [<ffffffff9e53879b>] system_call_fastpath+0x22/0x27

( The kernel we used is older, but we think the latest kernel also has this
  bug after dig into this problem. )

For 1G hugepages, huge_pte_offset() wants to return NULL or pudp, but it
may return a wrong 'pmdp' if there is a race. Please look at the following
code snippet:
    ...
    pud = pud_offset(p4d, addr);
    if (sz != PUD_SIZE && pud_none(*pud))
        return NULL;
    /* hugepage or swap? */
    if (pud_huge(*pud) || !pud_present(*pud))
        return (pte_t *)pud;

    pmd = pmd_offset(pud, addr);
    if (sz != PMD_SIZE && pmd_none(*pmd))
        return NULL;
    /* hugepage or swap? */
    if (pmd_huge(*pmd) || !pmd_present(*pmd))
        return (pte_t *)pmd;
    ...

The following sequence would trigger this bug:
1. CPU0: sz = PUD_SIZE and *pud = 0 , continue
1. CPU0: "pud_huge(*pud)" is false
2. CPU1: calling hugetlb_no_page and set *pud to xxxx8e7(PRESENT)
3. CPU0: "!pud_present(*pud)" is false, continue
4. CPU0: pmd = pmd_offset(pud, addr) and maybe return a wrong pmdp
However, we want CPU0 to return NULL or pudp.

We can avoid this race by read the pud only once. What's more, we also use
READ_ONCE to access the entries for safe(e.g. avoid the compilier mischief)

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Longpeng <longpeng2@huawei.com>
---
v1 -> v2:
  - avoid renaming    [Matthew, Mike]

---
 mm/hugetlb.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

Comments

Mike Kravetz March 21, 2020, 11:38 p.m. UTC | #1
On 2/21/20 7:33 PM, Longpeng(Mike) wrote:
> From: Longpeng <longpeng2@huawei.com>
> 
> Our machine encountered a panic(addressing exception) after run
> for a long time and the calltrace is:
> RIP: 0010:[<ffffffff9dff0587>]  [<ffffffff9dff0587>] hugetlb_fault+0x307/0xbe0
> RSP: 0018:ffff9567fc27f808  EFLAGS: 00010286
> RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48
> RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48
> RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080
> R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8
> R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074
> FS:  00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  [<ffffffff9df9b71b>] ? unlock_page+0x2b/0x30
>  [<ffffffff9dff04a2>] ? hugetlb_fault+0x222/0xbe0
>  [<ffffffff9dff1405>] follow_hugetlb_page+0x175/0x540
>  [<ffffffff9e15b825>] ? cpumask_next_and+0x35/0x50
>  [<ffffffff9dfc7230>] __get_user_pages+0x2a0/0x7e0
>  [<ffffffff9dfc648d>] __get_user_pages_unlocked+0x15d/0x210
>  [<ffffffffc068cfc5>] __gfn_to_pfn_memslot+0x3c5/0x460 [kvm]
>  [<ffffffffc06b28be>] try_async_pf+0x6e/0x2a0 [kvm]
>  [<ffffffffc06b4b41>] tdp_page_fault+0x151/0x2d0 [kvm]
>  [<ffffffffc075731c>] ? vmx_vcpu_run+0x2ec/0xc80 [kvm_intel]
>  [<ffffffffc0757328>] ? vmx_vcpu_run+0x2f8/0xc80 [kvm_intel]
>  [<ffffffffc06abc11>] kvm_mmu_page_fault+0x31/0x140 [kvm]
>  [<ffffffffc074d1ae>] handle_ept_violation+0x9e/0x170 [kvm_intel]
>  [<ffffffffc075579c>] vmx_handle_exit+0x2bc/0xc70 [kvm_intel]
>  [<ffffffffc074f1a0>] ? __vmx_complete_interrupts.part.73+0x80/0xd0 [kvm_intel]
>  [<ffffffffc07574c0>] ? vmx_vcpu_run+0x490/0xc80 [kvm_intel]
>  [<ffffffffc069f3be>] vcpu_enter_guest+0x7be/0x13a0 [kvm]
>  [<ffffffffc06cf53e>] ? kvm_check_async_pf_completion+0x8e/0xb0 [kvm]
>  [<ffffffffc06a6f90>] kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm]
>  [<ffffffffc068d919>] kvm_vcpu_ioctl+0x309/0x6d0 [kvm]
>  [<ffffffff9deaa8c2>] ? dequeue_signal+0x32/0x180
>  [<ffffffff9deae34d>] ? do_sigtimedwait+0xcd/0x230
>  [<ffffffff9e03aed0>] do_vfs_ioctl+0x3f0/0x540
>  [<ffffffff9e03b0c1>] SyS_ioctl+0xa1/0xc0
>  [<ffffffff9e53879b>] system_call_fastpath+0x22/0x27
> 
> ( The kernel we used is older, but we think the latest kernel also has this
>   bug after dig into this problem. )
> 
> For 1G hugepages, huge_pte_offset() wants to return NULL or pudp, but it
> may return a wrong 'pmdp' if there is a race. Please look at the following
> code snippet:
>     ...
>     pud = pud_offset(p4d, addr);
>     if (sz != PUD_SIZE && pud_none(*pud))
>         return NULL;
>     /* hugepage or swap? */
>     if (pud_huge(*pud) || !pud_present(*pud))
>         return (pte_t *)pud;
> 
>     pmd = pmd_offset(pud, addr);
>     if (sz != PMD_SIZE && pmd_none(*pmd))
>         return NULL;
>     /* hugepage or swap? */
>     if (pmd_huge(*pmd) || !pmd_present(*pmd))
>         return (pte_t *)pmd;
>     ...
> 
> The following sequence would trigger this bug:
> 1. CPU0: sz = PUD_SIZE and *pud = 0 , continue
> 1. CPU0: "pud_huge(*pud)" is false
> 2. CPU1: calling hugetlb_no_page and set *pud to xxxx8e7(PRESENT)
> 3. CPU0: "!pud_present(*pud)" is false, continue
> 4. CPU0: pmd = pmd_offset(pud, addr) and maybe return a wrong pmdp
> However, we want CPU0 to return NULL or pudp.
> 
> We can avoid this race by read the pud only once. What's more, we also use
> READ_ONCE to access the entries for safe(e.g. avoid the compilier mischief)
> 
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Sean Christopherson <sean.j.christopherson@intel.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Longpeng <longpeng2@huawei.com>

Andrew dropped this patch from his tree which caused me to go back and
look at the status of this patch/issue.

It is pretty obvious that code in the current huge_pte_offset routine
is racy.  I checked out the assembly code produced by my compiler and
verified that the line,

	if (pud_huge(*pud) || !pud_present(*pud))

does actually dereference *pud twice.  So, the value could change between
those two dereferences.   Longpeng (Mike) could easlily recreate the issue
if he put a delay between the two dereferences.  I believe the only
reservations/concerns about the patch below was the use of READ_ONCE().
Is that correct?

Are there any objections to the patch if the READ_ONCE() calls are removed?

Longpeng (Mike), can you recreate the issue by adding the delay and removing
the READ_ONCE() calls?
Longpeng(Mike) March 23, 2020, 2:03 a.m. UTC | #2
On 2020/3/22 7:38, Mike Kravetz wrote:
> On 2/21/20 7:33 PM, Longpeng(Mike) wrote:
>> From: Longpeng <longpeng2@huawei.com>
>>
>> Our machine encountered a panic(addressing exception) after run
>> for a long time and the calltrace is:
>> RIP: 0010:[<ffffffff9dff0587>]  [<ffffffff9dff0587>] hugetlb_fault+0x307/0xbe0
>> RSP: 0018:ffff9567fc27f808  EFLAGS: 00010286
>> RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48
>> RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48
>> RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080
>> R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8
>> R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074
>> FS:  00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Call Trace:
>>  [<ffffffff9df9b71b>] ? unlock_page+0x2b/0x30
>>  [<ffffffff9dff04a2>] ? hugetlb_fault+0x222/0xbe0
>>  [<ffffffff9dff1405>] follow_hugetlb_page+0x175/0x540
>>  [<ffffffff9e15b825>] ? cpumask_next_and+0x35/0x50
>>  [<ffffffff9dfc7230>] __get_user_pages+0x2a0/0x7e0
>>  [<ffffffff9dfc648d>] __get_user_pages_unlocked+0x15d/0x210
>>  [<ffffffffc068cfc5>] __gfn_to_pfn_memslot+0x3c5/0x460 [kvm]
>>  [<ffffffffc06b28be>] try_async_pf+0x6e/0x2a0 [kvm]
>>  [<ffffffffc06b4b41>] tdp_page_fault+0x151/0x2d0 [kvm]
>>  [<ffffffffc075731c>] ? vmx_vcpu_run+0x2ec/0xc80 [kvm_intel]
>>  [<ffffffffc0757328>] ? vmx_vcpu_run+0x2f8/0xc80 [kvm_intel]
>>  [<ffffffffc06abc11>] kvm_mmu_page_fault+0x31/0x140 [kvm]
>>  [<ffffffffc074d1ae>] handle_ept_violation+0x9e/0x170 [kvm_intel]
>>  [<ffffffffc075579c>] vmx_handle_exit+0x2bc/0xc70 [kvm_intel]
>>  [<ffffffffc074f1a0>] ? __vmx_complete_interrupts.part.73+0x80/0xd0 [kvm_intel]
>>  [<ffffffffc07574c0>] ? vmx_vcpu_run+0x490/0xc80 [kvm_intel]
>>  [<ffffffffc069f3be>] vcpu_enter_guest+0x7be/0x13a0 [kvm]
>>  [<ffffffffc06cf53e>] ? kvm_check_async_pf_completion+0x8e/0xb0 [kvm]
>>  [<ffffffffc06a6f90>] kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm]
>>  [<ffffffffc068d919>] kvm_vcpu_ioctl+0x309/0x6d0 [kvm]
>>  [<ffffffff9deaa8c2>] ? dequeue_signal+0x32/0x180
>>  [<ffffffff9deae34d>] ? do_sigtimedwait+0xcd/0x230
>>  [<ffffffff9e03aed0>] do_vfs_ioctl+0x3f0/0x540
>>  [<ffffffff9e03b0c1>] SyS_ioctl+0xa1/0xc0
>>  [<ffffffff9e53879b>] system_call_fastpath+0x22/0x27
>>
>> ( The kernel we used is older, but we think the latest kernel also has this
>>   bug after dig into this problem. )
>>
>> For 1G hugepages, huge_pte_offset() wants to return NULL or pudp, but it
>> may return a wrong 'pmdp' if there is a race. Please look at the following
>> code snippet:
>>     ...
>>     pud = pud_offset(p4d, addr);
>>     if (sz != PUD_SIZE && pud_none(*pud))
>>         return NULL;
>>     /* hugepage or swap? */
>>     if (pud_huge(*pud) || !pud_present(*pud))
>>         return (pte_t *)pud;
>>
>>     pmd = pmd_offset(pud, addr);
>>     if (sz != PMD_SIZE && pmd_none(*pmd))
>>         return NULL;
>>     /* hugepage or swap? */
>>     if (pmd_huge(*pmd) || !pmd_present(*pmd))
>>         return (pte_t *)pmd;
>>     ...
>>
>> The following sequence would trigger this bug:
>> 1. CPU0: sz = PUD_SIZE and *pud = 0 , continue
>> 1. CPU0: "pud_huge(*pud)" is false
>> 2. CPU1: calling hugetlb_no_page and set *pud to xxxx8e7(PRESENT)
>> 3. CPU0: "!pud_present(*pud)" is false, continue
>> 4. CPU0: pmd = pmd_offset(pud, addr) and maybe return a wrong pmdp
>> However, we want CPU0 to return NULL or pudp.
>>
>> We can avoid this race by read the pud only once. What's more, we also use
>> READ_ONCE to access the entries for safe(e.g. avoid the compilier mischief)
>>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Sean Christopherson <sean.j.christopherson@intel.com>
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Longpeng <longpeng2@huawei.com>
> 
> Andrew dropped this patch from his tree which caused me to go back and
> look at the status of this patch/issue.
> 
> It is pretty obvious that code in the current huge_pte_offset routine
> is racy.  I checked out the assembly code produced by my compiler and
> verified that the line,
> 
> 	if (pud_huge(*pud) || !pud_present(*pud))
> 
> does actually dereference *pud twice.  So, the value could change between
> those two dereferences.   Longpeng (Mike) could easlily recreate the issue
> if he put a delay between the two dereferences.  I believe the only
> reservations/concerns about the patch below was the use of READ_ONCE().
> Is that correct?
> 
Hi Mike,

It seems I've missed your another mail in my client, I found it here
(https://lkml.org/lkml/2020/2/27/1927) just now.

I think we have reached an agreement that the pud/pmd need READ_ONCE in
huge_pte_offset() and disagreement is whether the pgd/p4d also need READ_ONCE,
right ?

> Are there any objections to the patch if the READ_ONCE() calls are removed?
> 
Because the pgd/p4g are only accessed and dereferenced once here, so some guys
want to remove it.

But we must make sure they are *really* accessed once, in other words, this
makes we need to care about both the implementation of pgd_present/p4d_present
and the behavior of any compiler, for example:

'''
static inline int func(int val)
{
    return subfunc1(val) & subfunc2(val);
}

func(*p); // int *p
'''
We must make sure there's no strange compiler to generate an assemble code that
access and dereference 'p' more than once.

I've not found any backwards with READ_ONCE here. However, if you also agree to
remove READ_ONCE around pgd/p4d, I'll do.

> Longpeng (Mike), can you recreate the issue by adding the delay and removing
> the READ_ONCE() calls?
> 
I think remove the READ_ONCE around pgd/p4d won't cause any fucntional change.

---
Regards,
Longpeng(Mike)
Mike Kravetz March 23, 2020, 2:54 a.m. UTC | #3
On 3/22/20 7:03 PM, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> 
> 
> On 2020/3/22 7:38, Mike Kravetz wrote:
>> On 2/21/20 7:33 PM, Longpeng(Mike) wrote:
>>> From: Longpeng <longpeng2@huawei.com>
>>>
>>> Our machine encountered a panic(addressing exception) after run
>>> for a long time and the calltrace is:
>>> RIP: 0010:[<ffffffff9dff0587>]  [<ffffffff9dff0587>] hugetlb_fault+0x307/0xbe0
>>> RSP: 0018:ffff9567fc27f808  EFLAGS: 00010286
>>> RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48
>>> RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48
>>> RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080
>>> R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8
>>> R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074
>>> FS:  00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0
>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> Call Trace:
>>>  [<ffffffff9df9b71b>] ? unlock_page+0x2b/0x30
>>>  [<ffffffff9dff04a2>] ? hugetlb_fault+0x222/0xbe0
>>>  [<ffffffff9dff1405>] follow_hugetlb_page+0x175/0x540
>>>  [<ffffffff9e15b825>] ? cpumask_next_and+0x35/0x50
>>>  [<ffffffff9dfc7230>] __get_user_pages+0x2a0/0x7e0
>>>  [<ffffffff9dfc648d>] __get_user_pages_unlocked+0x15d/0x210
>>>  [<ffffffffc068cfc5>] __gfn_to_pfn_memslot+0x3c5/0x460 [kvm]
>>>  [<ffffffffc06b28be>] try_async_pf+0x6e/0x2a0 [kvm]
>>>  [<ffffffffc06b4b41>] tdp_page_fault+0x151/0x2d0 [kvm]
>>>  [<ffffffffc075731c>] ? vmx_vcpu_run+0x2ec/0xc80 [kvm_intel]
>>>  [<ffffffffc0757328>] ? vmx_vcpu_run+0x2f8/0xc80 [kvm_intel]
>>>  [<ffffffffc06abc11>] kvm_mmu_page_fault+0x31/0x140 [kvm]
>>>  [<ffffffffc074d1ae>] handle_ept_violation+0x9e/0x170 [kvm_intel]
>>>  [<ffffffffc075579c>] vmx_handle_exit+0x2bc/0xc70 [kvm_intel]
>>>  [<ffffffffc074f1a0>] ? __vmx_complete_interrupts.part.73+0x80/0xd0 [kvm_intel]
>>>  [<ffffffffc07574c0>] ? vmx_vcpu_run+0x490/0xc80 [kvm_intel]
>>>  [<ffffffffc069f3be>] vcpu_enter_guest+0x7be/0x13a0 [kvm]
>>>  [<ffffffffc06cf53e>] ? kvm_check_async_pf_completion+0x8e/0xb0 [kvm]
>>>  [<ffffffffc06a6f90>] kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm]
>>>  [<ffffffffc068d919>] kvm_vcpu_ioctl+0x309/0x6d0 [kvm]
>>>  [<ffffffff9deaa8c2>] ? dequeue_signal+0x32/0x180
>>>  [<ffffffff9deae34d>] ? do_sigtimedwait+0xcd/0x230
>>>  [<ffffffff9e03aed0>] do_vfs_ioctl+0x3f0/0x540
>>>  [<ffffffff9e03b0c1>] SyS_ioctl+0xa1/0xc0
>>>  [<ffffffff9e53879b>] system_call_fastpath+0x22/0x27
>>>
>>> ( The kernel we used is older, but we think the latest kernel also has this
>>>   bug after dig into this problem. )
>>>
>>> For 1G hugepages, huge_pte_offset() wants to return NULL or pudp, but it
>>> may return a wrong 'pmdp' if there is a race. Please look at the following
>>> code snippet:
>>>     ...
>>>     pud = pud_offset(p4d, addr);
>>>     if (sz != PUD_SIZE && pud_none(*pud))
>>>         return NULL;
>>>     /* hugepage or swap? */
>>>     if (pud_huge(*pud) || !pud_present(*pud))
>>>         return (pte_t *)pud;
>>>
>>>     pmd = pmd_offset(pud, addr);
>>>     if (sz != PMD_SIZE && pmd_none(*pmd))
>>>         return NULL;
>>>     /* hugepage or swap? */
>>>     if (pmd_huge(*pmd) || !pmd_present(*pmd))
>>>         return (pte_t *)pmd;
>>>     ...
>>>
>>> The following sequence would trigger this bug:
>>> 1. CPU0: sz = PUD_SIZE and *pud = 0 , continue
>>> 1. CPU0: "pud_huge(*pud)" is false
>>> 2. CPU1: calling hugetlb_no_page and set *pud to xxxx8e7(PRESENT)
>>> 3. CPU0: "!pud_present(*pud)" is false, continue
>>> 4. CPU0: pmd = pmd_offset(pud, addr) and maybe return a wrong pmdp
>>> However, we want CPU0 to return NULL or pudp.
>>>
>>> We can avoid this race by read the pud only once. What's more, we also use
>>> READ_ONCE to access the entries for safe(e.g. avoid the compilier mischief)
>>>
>>> Cc: Matthew Wilcox <willy@infradead.org>
>>> Cc: Sean Christopherson <sean.j.christopherson@intel.com>
>>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Longpeng <longpeng2@huawei.com>
>>
>> Andrew dropped this patch from his tree which caused me to go back and
>> look at the status of this patch/issue.
>>
>> It is pretty obvious that code in the current huge_pte_offset routine
>> is racy.  I checked out the assembly code produced by my compiler and
>> verified that the line,
>>
>> 	if (pud_huge(*pud) || !pud_present(*pud))
>>
>> does actually dereference *pud twice.  So, the value could change between
>> those two dereferences.   Longpeng (Mike) could easlily recreate the issue
>> if he put a delay between the two dereferences.  I believe the only
>> reservations/concerns about the patch below was the use of READ_ONCE().
>> Is that correct?
>>
> Hi Mike,
> 
> It seems I've missed your another mail in my client, I found it here
> (https://lkml.org/lkml/2020/2/27/1927) just now.
> 
> I think we have reached an agreement that the pud/pmd need READ_ONCE in
> huge_pte_offset() and disagreement is whether the pgd/p4d also need READ_ONCE,
> right ?

Correct.

Sorry, I did not reply to the mail thread with more context.

>> Are there any objections to the patch if the READ_ONCE() calls are removed?
>>
> Because the pgd/p4g are only accessed and dereferenced once here, so some guys
> want to remove it.
> 
> But we must make sure they are *really* accessed once, in other words, this
> makes we need to care about both the implementation of pgd_present/p4d_present
> and the behavior of any compiler, for example:
> 
> '''
> static inline int func(int val)
> {
>     return subfunc1(val) & subfunc2(val);
> }
> 
> func(*p); // int *p
> '''
> We must make sure there's no strange compiler to generate an assemble code that
> access and dereference 'p' more than once.
> 
> I've not found any backwards with READ_ONCE here. However, if you also agree to
> remove READ_ONCE around pgd/p4d, I'll do.
> 

I would like to remove the READ_ONCE calls and move the patch forward.  It
does address a real issue you are seeing.

To be honest, I am more worried about the races in lookup_address_in_pgd()
than using or not using READ_ONCE for  pgd/p4d in this patch.

I have not looked closely at the generated code for lookup_address_in_pgd.
It appears that it would dereference p4d, pud and pmd multiple times.  Sean
seemed to think there was something about the calling context that would
make issues like those seen with huge_pte_offset less likely to happen.  I
do not know if this is accurate or not.

Let's remove the two READ_ONCE calls and move this patch forward.  We can
look closer at lookup_address_in_pgd and generate another patch if that needs
to be fixed as well.

Thanks
Longpeng(Mike) March 23, 2020, 3:43 a.m. UTC | #4
On 2020/3/23 10:54, Mike Kravetz wrote:
> On 3/22/20 7:03 PM, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
>>
>>
>> On 2020/3/22 7:38, Mike Kravetz wrote:
>>> On 2/21/20 7:33 PM, Longpeng(Mike) wrote:
>>>> From: Longpeng <longpeng2@huawei.com>
>>>>
>>>> Our machine encountered a panic(addressing exception) after run
>>>> for a long time and the calltrace is:

[snip]

>>>>
>>>> We can avoid this race by read the pud only once. What's more, we also use
>>>> READ_ONCE to access the entries for safe(e.g. avoid the compilier mischief)
>>>>
>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>> Cc: Sean Christopherson <sean.j.christopherson@intel.com>
>>>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>>>> Cc: stable@vger.kernel.org
>>>> Signed-off-by: Longpeng <longpeng2@huawei.com>
>>>
>>> Andrew dropped this patch from his tree which caused me to go back and
>>> look at the status of this patch/issue.
>>>
>>> It is pretty obvious that code in the current huge_pte_offset routine
>>> is racy.  I checked out the assembly code produced by my compiler and
>>> verified that the line,
>>>
>>> 	if (pud_huge(*pud) || !pud_present(*pud))
>>>
>>> does actually dereference *pud twice.  So, the value could change between
>>> those two dereferences.   Longpeng (Mike) could easlily recreate the issue
>>> if he put a delay between the two dereferences.  I believe the only
>>> reservations/concerns about the patch below was the use of READ_ONCE().
>>> Is that correct?
>>>
>> Hi Mike,
>>
>> It seems I've missed your another mail in my client, I found it here
>> (https://lkml.org/lkml/2020/2/27/1927) just now.
>>
>> I think we have reached an agreement that the pud/pmd need READ_ONCE in
>> huge_pte_offset() and disagreement is whether the pgd/p4d also need READ_ONCE,
>> right ?
> 
> Correct.
> 
> Sorry, I did not reply to the mail thread with more context.
> 
>>> Are there any objections to the patch if the READ_ONCE() calls are removed?
>>>
>> Because the pgd/p4g are only accessed and dereferenced once here, so some guys
>> want to remove it.
>>
>> But we must make sure they are *really* accessed once, in other words, this
>> makes we need to care about both the implementation of pgd_present/p4d_present
>> and the behavior of any compiler, for example:
>>
>> '''
>> static inline int func(int val)
>> {
>>     return subfunc1(val) & subfunc2(val);
>> }
>>
>> func(*p); // int *p
>> '''
>> We must make sure there's no strange compiler to generate an assemble code that
>> access and dereference 'p' more than once.
>>
>> I've not found any backwards with READ_ONCE here. However, if you also agree to
>> remove READ_ONCE around pgd/p4d, I'll do.
>>
> 
> I would like to remove the READ_ONCE calls and move the patch forward.  It
> does address a real issue you are seeing.
> 
> To be honest, I am more worried about the races in lookup_address_in_pgd()
> than using or not using READ_ONCE for  pgd/p4d in this patch.
> 
I had the same worry, we've discussed in another thread
(https://lkml.org/lkml/2020/2/20/1182) where I asked you `Is it possible the pud
changes from pud_huge() to pud_none() while another CPU is walking the
pagetable` and you thought it's possible.
The reason why I didn't do something in lookup_address_in_pgd together is just
because I haven't went into trouble caused by it yet.

> I have not looked closely at the generated code for lookup_address_in_pgd.
> It appears that it would dereference p4d, pud and pmd multiple times.  Sean
> seemed to think there was something about the calling context that would
> make issues like those seen with huge_pte_offset less likely to happen.  I
> do not know if this is accurate or not.
> 
> Let's remove the two READ_ONCE calls and move this patch forward.  We can
> look closer at lookup_address_in_pgd and generate another patch if that needs
> to be fixed as well.
> 
OK, I'll remove them in v3.

I'll do some fault injection or add some delays in lookup_address_in_pgd to test
if it can work well.

> Thanks
> 

---
Regards,
Longpeng(Mike)
Sean Christopherson March 23, 2020, 2:40 p.m. UTC | #5
On Sun, Mar 22, 2020 at 07:54:32PM -0700, Mike Kravetz wrote:
> On 3/22/20 7:03 PM, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> > 
> > On 2020/3/22 7:38, Mike Kravetz wrote:
> >> On 2/21/20 7:33 PM, Longpeng(Mike) wrote:
> >>> From: Longpeng <longpeng2@huawei.com>
> I have not looked closely at the generated code for lookup_address_in_pgd.
> It appears that it would dereference p4d, pud and pmd multiple times.  Sean
> seemed to think there was something about the calling context that would
> make issues like those seen with huge_pte_offset less likely to happen.  I
> do not know if this is accurate or not.

Only for KVM's calls to lookup_address_in_mm(), I can't speak to other
calls that funnel into to lookup_address_in_pgd().

KVM uses a combination of tracking and blocking mmu_notifier calls to ensure
PTE changes/invalidations between gup() and lookup_address_in_pgd() cause a
restart of the faulting instruction, and that pending changes/invalidations
are blocked until installation of the pfn in KVM's secondary MMU completes.

kvm_mmu_page_fault():

	mmu_seq = kvm->mmu_notifier_seq;
	smp_rmb();

	pfn = gup(hva);

	spin_lock(&kvm->mmu_lock);
	smp_rmb();
	if (kvm->mmu_notifier_seq != mmu_seq)
		goto out_unlock: // Restart guest, i.e. retry the fault

	lookup_address_in_mm(hva, ...);

	...

  out_unlock:
	spin_unlock(&kvm->mmu_lock);


kvm_mmu_notifier_change_pte() / kvm_mmu_notifier_invalidate_range_end():

	spin_lock(&kvm->mmu_lock);
	kvm->mmu_notifier_seq++;
	smp_wmb();
	spin_unlock(&kvm->mmu_lock);


> Let's remove the two READ_ONCE calls and move this patch forward.  We can
> look closer at lookup_address_in_pgd and generate another patch if that needs
> to be fixed as well.
> 
> Thanks
> -- 
> Mike Kravetz
Jason Gunthorpe March 23, 2020, 4:09 p.m. UTC | #6
On Sat, Mar 21, 2020 at 04:38:19PM -0700, Mike Kravetz wrote:

> Andrew dropped this patch from his tree which caused me to go back and
> look at the status of this patch/issue.
> 
> It is pretty obvious that code in the current huge_pte_offset routine
> is racy.  I checked out the assembly code produced by my compiler and
> verified that the line,
> 
> 	if (pud_huge(*pud) || !pud_present(*pud))
> 
> does actually dereference *pud twice.  So, the value could change between
> those two dereferences.   Longpeng (Mike) could easlily recreate the issue
> if he put a delay between the two dereferences.  I believe the only
> reservations/concerns about the patch below was the use of READ_ONCE().
> Is that correct?

I'm looking at a similar situation in pagewalk.c right now with PUD,
and it is very confusing to see that locks are being held, memory
accessed without READ_ONCE, but actually it has concurrent writes.

I think it is valuable to annotate with READ_ONCE when the author
knows this is an unlocked data access, regardless of what the compiler
does.

pagewalk probably has the same racy bug you show here, I'm going to
send a very similar looking patch to pagewalk hopefully soon.

Also, the remark about pmd_offset() seems accurate. The
get_user_fast_pages() pattern seems like the correct one to emulate:

  pud = READ_ONCE(*pudp);
  if (pud_none(pud)) 
     ..
  if (!pud_'is a pmd pointer')
     ..
  pmdp = pmd_offset(&pud, address);
  pmd = READ_ONCE(*pmd);
  [...]

Passing &pud in avoids another de-reference of the pudp. Honestly all
these APIs that take in page table pointers and internally
de-reference them seem very hard to use correctly when the page table
access isn't fully locked against write.

This also relies on 'some kind of locking' to prevent the pmdp from
becoming freed concurrently while this is running.

.. also this only works if READ_ONCE() is atomic, ie the pud can't be
64 bit on a 32 bit platform. At least pmd has this problem, I haven't
figured out if pud does??

> Are there any objections to the patch if the READ_ONCE() calls are removed?

I think if we know there is no concurrent data access then it makes
sense to keep the READ_ONCE.

It looks like at least the p4d read from the pgd is also unlocked here
as handle_mm_fault() writes to it??

Jason
Jason Gunthorpe March 23, 2020, 4:44 p.m. UTC | #7
On Mon, Mar 23, 2020 at 07:40:31AM -0700, Sean Christopherson wrote:
> On Sun, Mar 22, 2020 at 07:54:32PM -0700, Mike Kravetz wrote:
> > On 3/22/20 7:03 PM, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> > > 
> > > On 2020/3/22 7:38, Mike Kravetz wrote:
> > >> On 2/21/20 7:33 PM, Longpeng(Mike) wrote:
> > >>> From: Longpeng <longpeng2@huawei.com>
> > I have not looked closely at the generated code for lookup_address_in_pgd.
> > It appears that it would dereference p4d, pud and pmd multiple times.  Sean
> > seemed to think there was something about the calling context that would
> > make issues like those seen with huge_pte_offset less likely to happen.  I
> > do not know if this is accurate or not.
> 
> Only for KVM's calls to lookup_address_in_mm(), I can't speak to other
> calls that funnel into to lookup_address_in_pgd().
> 
> KVM uses a combination of tracking and blocking mmu_notifier calls to ensure
> PTE changes/invalidations between gup() and lookup_address_in_pgd() cause a
> restart of the faulting instruction, and that pending changes/invalidations
> are blocked until installation of the pfn in KVM's secondary MMU completes.
> 
> kvm_mmu_page_fault():
> 
> 	mmu_seq = kvm->mmu_notifier_seq;
> 	smp_rmb();
> 
> 	pfn = gup(hva);
> 
> 	spin_lock(&kvm->mmu_lock);
> 	smp_rmb();
> 	if (kvm->mmu_notifier_seq != mmu_seq)
> 		goto out_unlock: // Restart guest, i.e. retry the fault
> 
> 	lookup_address_in_mm(hva, ...);

It works because the mmu_lock spinlock is taken before and after any
change to the page table via invalidate_range_start/end() callbacks.

So if you are in the spinlock and mmu_notifier_count == 0, then nobody
can be writing to the page tables. 

It is effectively a full page table lock, so any page table read under
that lock do not need to worry about any data races.

Jason
Mike Kravetz March 23, 2020, 5:27 p.m. UTC | #8
On 3/23/20 9:09 AM, Jason Gunthorpe wrote:
> On Sat, Mar 21, 2020 at 04:38:19PM -0700, Mike Kravetz wrote:
> 
>> Andrew dropped this patch from his tree which caused me to go back and
>> look at the status of this patch/issue.
>>
>> It is pretty obvious that code in the current huge_pte_offset routine
>> is racy.  I checked out the assembly code produced by my compiler and
>> verified that the line,
>>
>> 	if (pud_huge(*pud) || !pud_present(*pud))
>>
>> does actually dereference *pud twice.  So, the value could change between
>> those two dereferences.   Longpeng (Mike) could easlily recreate the issue
>> if he put a delay between the two dereferences.  I believe the only
>> reservations/concerns about the patch below was the use of READ_ONCE().
>> Is that correct?
> 
> I'm looking at a similar situation in pagewalk.c right now with PUD,
> and it is very confusing to see that locks are being held, memory
> accessed without READ_ONCE, but actually it has concurrent writes.
> 
> I think it is valuable to annotate with READ_ONCE when the author
> knows this is an unlocked data access, regardless of what the compiler
> does.
> 
> pagewalk probably has the same racy bug you show here, I'm going to
> send a very similar looking patch to pagewalk hopefully soon.

Thanks Jason.

Unfortunately, I replied to the thread without full context for the discussion
we were having.  The primary objection to this patch was the use of READ_ONCE
in these two instances:

>  	pgd = pgd_offset(mm, addr);
> -	if (!pgd_present(*pgd))
> +	if (!pgd_present(READ_ONCE(*pgd)))
>  		return NULL;
>  	p4d = p4d_offset(pgd, addr);
> -	if (!p4d_present(*p4d))
> +	if (!p4d_present(READ_ONCE(*p4d)))
>  		return NULL;
>  
>       pud = pud_offset(p4d, addr);

One would argue that pgd and p4d can not change from present to !present
during the execution of this code.  To me, that seems like the issue which
would cause an issue.  Of course, I could be missing something.

> Also, the remark about pmd_offset() seems accurate. The
> get_user_fast_pages() pattern seems like the correct one to emulate:
> 
>   pud = READ_ONCE(*pudp);
>   if (pud_none(pud)) 
>      ..
>   if (!pud_'is a pmd pointer')
>      ..
>   pmdp = pmd_offset(&pud, address);
>   pmd = READ_ONCE(*pmd);
>   [...]
> 
> Passing &pud in avoids another de-reference of the pudp. Honestly all
> these APIs that take in page table pointers and internally
> de-reference them seem very hard to use correctly when the page table
> access isn't fully locked against write.
> 
> This also relies on 'some kind of locking' to prevent the pmdp from
> becoming freed concurrently while this is running.
> 
> .. also this only works if READ_ONCE() is atomic, ie the pud can't be
> 64 bit on a 32 bit platform. At least pmd has this problem, I haven't
> figured out if pud does??
> 
>> Are there any objections to the patch if the READ_ONCE() calls are removed?
> 
> I think if we know there is no concurrent data access then it makes
> sense to keep the READ_ONCE.
> 
> It looks like at least the p4d read from the pgd is also unlocked here
> as handle_mm_fault() writes to it??

Yes, there is no locking required to call huge_pte_offset().
Jason Gunthorpe March 23, 2020, 6:07 p.m. UTC | #9
On Mon, Mar 23, 2020 at 10:27:48AM -0700, Mike Kravetz wrote:

> >  	pgd = pgd_offset(mm, addr);
> > -	if (!pgd_present(*pgd))
> > +	if (!pgd_present(READ_ONCE(*pgd)))
> >  		return NULL;
> >  	p4d = p4d_offset(pgd, addr);
> > -	if (!p4d_present(*p4d))
> > +	if (!p4d_present(READ_ONCE(*p4d)))
> >  		return NULL;
> >  
> >       pud = pud_offset(p4d, addr);
> 
> One would argue that pgd and p4d can not change from present to !present
> during the execution of this code.  To me, that seems like the issue which
> would cause an issue.  Of course, I could be missing something.

This I am not sure of, I think it must be true under the read side of
the mmap_sem, but probably not guarenteed under RCU..

In any case, it doesn't matter, the fact that *p4d can change at all
is problematic. Unwinding the above inlines we get:

  p4d = p4d_offset(pgd, addr)
  if (!p4d_present(*p4d))
      return NULL;
  pud = (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);

According to our memory model the compiler/CPU is free to execute this
as:

  p4d = p4d_offset(pgd, addr)
  p4d_for_vaddr = *p4d;
  if (!p4d_present(*p4d))
      return NULL;
  pud = (pud_t *)p4d_page_vaddr(p4d_for_vaddr) + pud_index(address);

In the case where p4 goes from !present -> present (ie
handle_mm_fault()):

p4d_for_vaddr == p4d_none, and p4d_present(*p4d) == true, meaning the
p4d_page_vaddr() will crash.

Basically the problem here is not just missing READ_ONCE, but that the
p4d is read multiple times at all. It should be written like gup_fast
does, to guarantee a single CPU read of the unstable data:

  p4d = READ_ONCE(*p4d_offset(pgdp, addr));
  if (!p4d_present(p4))
      return NULL;
  pud = pud_offset(&p4d, addr);

At least this is what I've been able to figure out :\

> > Also, the remark about pmd_offset() seems accurate. The
> > get_user_fast_pages() pattern seems like the correct one to emulate:
> > 
> >   pud = READ_ONCE(*pudp);
> >   if (pud_none(pud)) 
> >      ..
> >   if (!pud_'is a pmd pointer')
> >      ..
> >   pmdp = pmd_offset(&pud, address);
> >   pmd = READ_ONCE(*pmd);
> >   [...]
> > 
> > Passing &pud in avoids another de-reference of the pudp. Honestly all
> > these APIs that take in page table pointers and internally
> > de-reference them seem very hard to use correctly when the page table
> > access isn't fully locked against write.

And the same protocol for the PUD, etc.

> > It looks like at least the p4d read from the pgd is also unlocked here
> > as handle_mm_fault() writes to it??
> 
> Yes, there is no locking required to call huge_pte_offset().

None? Not RCU or read mmap_sem?

Jason
Mike Kravetz March 23, 2020, 8:35 p.m. UTC | #10
On 3/23/20 11:07 AM, Jason Gunthorpe wrote:
> On Mon, Mar 23, 2020 at 10:27:48AM -0700, Mike Kravetz wrote:
> 
>>>  	pgd = pgd_offset(mm, addr);
>>> -	if (!pgd_present(*pgd))
>>> +	if (!pgd_present(READ_ONCE(*pgd)))
>>>  		return NULL;
>>>  	p4d = p4d_offset(pgd, addr);
>>> -	if (!p4d_present(*p4d))
>>> +	if (!p4d_present(READ_ONCE(*p4d)))
>>>  		return NULL;
>>>  
>>>       pud = pud_offset(p4d, addr);
>>
>> One would argue that pgd and p4d can not change from present to !present
>> during the execution of this code.  To me, that seems like the issue which
>> would cause an issue.  Of course, I could be missing something.
> 
> This I am not sure of, I think it must be true under the read side of
> the mmap_sem, but probably not guarenteed under RCU..
> 
> In any case, it doesn't matter, the fact that *p4d can change at all
> is problematic. Unwinding the above inlines we get:
> 
>   p4d = p4d_offset(pgd, addr)
>   if (!p4d_present(*p4d))
>       return NULL;
>   pud = (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
> 
> According to our memory model the compiler/CPU is free to execute this
> as:
> 
>   p4d = p4d_offset(pgd, addr)
>   p4d_for_vaddr = *p4d;
>   if (!p4d_present(*p4d))
>       return NULL;
>   pud = (pud_t *)p4d_page_vaddr(p4d_for_vaddr) + pud_index(address);
> 

Wow!  How do you know this?  You don't need to answer :)

> In the case where p4 goes from !present -> present (ie
> handle_mm_fault()):
> 
> p4d_for_vaddr == p4d_none, and p4d_present(*p4d) == true, meaning the
> p4d_page_vaddr() will crash.
> 
> Basically the problem here is not just missing READ_ONCE, but that the
> p4d is read multiple times at all. It should be written like gup_fast
> does, to guarantee a single CPU read of the unstable data:
> 
>   p4d = READ_ONCE(*p4d_offset(pgdp, addr));
>   if (!p4d_present(p4))
>       return NULL;
>   pud = pud_offset(&p4d, addr);
> 
> At least this is what I've been able to figure out :\

In that case, I believe there are a bunch of similar routines with this issue.

For this patch, I was primarily interested in seeing the obvious multiple
dereferences in C fixed up.  This is above and beyond that! :)

>>> Also, the remark about pmd_offset() seems accurate. The
>>> get_user_fast_pages() pattern seems like the correct one to emulate:
>>>
>>>   pud = READ_ONCE(*pudp);
>>>   if (pud_none(pud)) 
>>>      ..
>>>   if (!pud_'is a pmd pointer')
>>>      ..
>>>   pmdp = pmd_offset(&pud, address);
>>>   pmd = READ_ONCE(*pmd);
>>>   [...]
>>>
>>> Passing &pud in avoids another de-reference of the pudp. Honestly all
>>> these APIs that take in page table pointers and internally
>>> de-reference them seem very hard to use correctly when the page table
>>> access isn't fully locked against write.
> 
> And the same protocol for the PUD, etc.
> 
>>> It looks like at least the p4d read from the pgd is also unlocked here
>>> as handle_mm_fault() writes to it??
>>
>> Yes, there is no locking required to call huge_pte_offset().
> 
> None? Not RCU or read mmap_sem?

Yes, mmap_sem in read mode.
Sorry, I was confusing this with additional locking requirements for hugetlb
specific code.
Jason Gunthorpe March 23, 2020, 10:52 p.m. UTC | #11
On Mon, Mar 23, 2020 at 01:35:07PM -0700, Mike Kravetz wrote:
> On 3/23/20 11:07 AM, Jason Gunthorpe wrote:
> > On Mon, Mar 23, 2020 at 10:27:48AM -0700, Mike Kravetz wrote:
> > 
> >>>  	pgd = pgd_offset(mm, addr);
> >>> -	if (!pgd_present(*pgd))
> >>> +	if (!pgd_present(READ_ONCE(*pgd)))
> >>>  		return NULL;
> >>>  	p4d = p4d_offset(pgd, addr);
> >>> -	if (!p4d_present(*p4d))
> >>> +	if (!p4d_present(READ_ONCE(*p4d)))
> >>>  		return NULL;
> >>>  
> >>>       pud = pud_offset(p4d, addr);
> >>
> >> One would argue that pgd and p4d can not change from present to !present
> >> during the execution of this code.  To me, that seems like the issue which
> >> would cause an issue.  Of course, I could be missing something.
> > 
> > This I am not sure of, I think it must be true under the read side of
> > the mmap_sem, but probably not guarenteed under RCU..
> > 
> > In any case, it doesn't matter, the fact that *p4d can change at all
> > is problematic. Unwinding the above inlines we get:
> > 
> >   p4d = p4d_offset(pgd, addr)
> >   if (!p4d_present(*p4d))
> >       return NULL;
> >   pud = (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
> > 
> > According to our memory model the compiler/CPU is free to execute this
> > as:
> > 
> >   p4d = p4d_offset(pgd, addr)
> >   p4d_for_vaddr = *p4d;
> >   if (!p4d_present(*p4d))
> >       return NULL;
> >   pud = (pud_t *)p4d_page_vaddr(p4d_for_vaddr) + pud_index(address);
> > 
> 
> Wow!  How do you know this?  You don't need to answer :)

It says explicitly in Documentation/memory-barriers.txt - see
section COMPILER BARRIER:

 (*) The compiler is within its rights to reorder loads and stores
     to the same variable, and in some cases, the CPU is within its
     rights to reorder loads to the same variable.  This means that
     the following code:

        a[0] = x;
        a[1] = x;

     Might result in an older value of x stored in a[1] than in a[0].

It also says READ_ONCE puts things in program order, but we don't use
READ_ONCE inside pud_offset(), so it doesn't help us.

Best answer is to code things so there is exactly one dereference of
the pointer protected by READ_ONCE. Very clear to read, very safe.

Maybe Longpeng can rework the patch around these principles?

Also I wonder if the READ_ONCE(*pmdp) is OK. gup_pmd_range() uses it,
but I can't explain why it shouldn't be pmd_read_atomic().

> > In the case where p4 goes from !present -> present (ie
> > handle_mm_fault()):
> > 
> > p4d_for_vaddr == p4d_none, and p4d_present(*p4d) == true, meaning the
> > p4d_page_vaddr() will crash.
> > 
> > Basically the problem here is not just missing READ_ONCE, but that the
> > p4d is read multiple times at all. It should be written like gup_fast
> > does, to guarantee a single CPU read of the unstable data:
> > 
> >   p4d = READ_ONCE(*p4d_offset(pgdp, addr));
> >   if (!p4d_present(p4))
> >       return NULL;
> >   pud = pud_offset(&p4d, addr);
> > 
> > At least this is what I've been able to figure out :\
> 
> In that case, I believe there are a bunch of similar routines with this issue.

Yes, my look around page walk related users makes me come to a similar
worry.

Fortunately, I think this is largely theoretical as most likely the
compiler will generate a single store for these coding patterns. 

That said, there have been bugs in the past, see commit 26c191788f18
("mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race
condition") which is significantly related to the compiler lifting a
load inside pte_offset to before the required 'if (pmd_*)' checks.

> For this patch, I was primarily interested in seeing the obvious
> multiple dereferences in C fixed up.  This is above and beyond that!
> :)

Well, I think it is worth solving the underlying problem
properly. Otherwise we get weird solutions to data races like
pmd_trans_unstable()...

Jason
Longpeng(Mike) March 24, 2020, 2:37 a.m. UTC | #12
On 2020/3/24 6:52, Jason Gunthorpe wrote:
> On Mon, Mar 23, 2020 at 01:35:07PM -0700, Mike Kravetz wrote:
>> On 3/23/20 11:07 AM, Jason Gunthorpe wrote:
>>> On Mon, Mar 23, 2020 at 10:27:48AM -0700, Mike Kravetz wrote:
>>>
>>>>>  	pgd = pgd_offset(mm, addr);
>>>>> -	if (!pgd_present(*pgd))
>>>>> +	if (!pgd_present(READ_ONCE(*pgd)))
>>>>>  		return NULL;
>>>>>  	p4d = p4d_offset(pgd, addr);
>>>>> -	if (!p4d_present(*p4d))
>>>>> +	if (!p4d_present(READ_ONCE(*p4d)))
>>>>>  		return NULL;
>>>>>  
>>>>>       pud = pud_offset(p4d, addr);
>>>>
>>>> One would argue that pgd and p4d can not change from present to !present
>>>> during the execution of this code.  To me, that seems like the issue which
>>>> would cause an issue.  Of course, I could be missing something.
>>>
>>> This I am not sure of, I think it must be true under the read side of
>>> the mmap_sem, but probably not guarenteed under RCU..
>>>
>>> In any case, it doesn't matter, the fact that *p4d can change at all
>>> is problematic. Unwinding the above inlines we get:
>>>
>>>   p4d = p4d_offset(pgd, addr)
>>>   if (!p4d_present(*p4d))
>>>       return NULL;
>>>   pud = (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
>>>
>>> According to our memory model the compiler/CPU is free to execute this
>>> as:
>>>
>>>   p4d = p4d_offset(pgd, addr)
>>>   p4d_for_vaddr = *p4d;
>>>   if (!p4d_present(*p4d))
>>>       return NULL;
>>>   pud = (pud_t *)p4d_page_vaddr(p4d_for_vaddr) + pud_index(address);
>>>
>>
>> Wow!  How do you know this?  You don't need to answer :)
> 
> It says explicitly in Documentation/memory-barriers.txt - see
> section COMPILER BARRIER:
> 
>  (*) The compiler is within its rights to reorder loads and stores
>      to the same variable, and in some cases, the CPU is within its
>      rights to reorder loads to the same variable.  This means that
>      the following code:
> 
>         a[0] = x;
>         a[1] = x;
> 
>      Might result in an older value of x stored in a[1] than in a[0].
> 
> It also says READ_ONCE puts things in program order, but we don't use
> READ_ONCE inside pud_offset(), so it doesn't help us.
> 
> Best answer is to code things so there is exactly one dereference of
> the pointer protected by READ_ONCE. Very clear to read, very safe.
> 
> Maybe Longpeng can rework the patch around these principles?
> 
Thanks Jason and Mike, I learn a lot from your analysis.

So... the patch should like this ?

@@ -4909,29 +4909,33 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
-	p4d_t *p4d;
-	pud_t *pud;
-	pmd_t *pmd;
+	p4d_t *p4g, p4d_entry;
+	pud_t *pud, pud_entry;
+	pmd_t *pmd, pmd_entry;

 	pgd = pgd_offset(mm, addr);
 	if (!pgd_present(*pgd))
 		return NULL;
-	p4d = p4d_offset(pgd, addr);
-	if (!p4d_present(*p4d))
+
+	p4g = p4d_offset(pgd, addr);
+	p4d_entry = READ_ONCE(*p4g);
+	if (!p4d_present(p4d_entry))
 		return NULL;

-	pud = pud_offset(p4d, addr);
-	if (sz != PUD_SIZE && pud_none(*pud))
+	pud = pud_offset(&p4d_entry, addr);
+	pud_entry = READ_ONCE(*pud);
+	if (sz != PUD_SIZE && pud_none(pud_entry))
 		return NULL;
 	/* hugepage or swap? */
-	if (pud_huge(*pud) || !pud_present(*pud))
+	if (pud_huge(pud_entry) || !pud_present(pud_entry))
 		return (pte_t *)pud;

-	pmd = pmd_offset(pud, addr);
-	if (sz != PMD_SIZE && pmd_none(*pmd))
+	pmd = pmd_offset(&pud_entry, addr);
+	pmd_entry = READ_ONCE(*pmd);
+	if (sz != PMD_SIZE && pmd_none(pmd_entry))
 		return NULL;
 	/* hugepage or swap? */
-	if (pmd_huge(*pmd) || !pmd_present(*pmd))
+	if (pmd_huge(pmd_entry) || !pmd_present(pmd_entry))
 		return (pte_t *)pmd;

> Also I wonder if the READ_ONCE(*pmdp) is OK. gup_pmd_range() uses it,
> but I can't explain why it shouldn't be pmd_read_atomic().
> 
>>> In the case where p4 goes from !present -> present (ie
>>> handle_mm_fault()):
>>>
>>> p4d_for_vaddr == p4d_none, and p4d_present(*p4d) == true, meaning the
>>> p4d_page_vaddr() will crash.
>>>
>>> Basically the problem here is not just missing READ_ONCE, but that the
>>> p4d is read multiple times at all. It should be written like gup_fast
>>> does, to guarantee a single CPU read of the unstable data:
>>>
>>>   p4d = READ_ONCE(*p4d_offset(pgdp, addr));
>>>   if (!p4d_present(p4))
>>>       return NULL;
>>>   pud = pud_offset(&p4d, addr);
>>>
>>> At least this is what I've been able to figure out :\
>>
>> In that case, I believe there are a bunch of similar routines with this issue.
> 
> Yes, my look around page walk related users makes me come to a similar
> worry.
> 
> Fortunately, I think this is largely theoretical as most likely the
> compiler will generate a single store for these coding patterns. 
> 
> That said, there have been bugs in the past, see commit 26c191788f18
> ("mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race
> condition") which is significantly related to the compiler lifting a
> load inside pte_offset to before the required 'if (pmd_*)' checks.
> 
>> For this patch, I was primarily interested in seeing the obvious
>> multiple dereferences in C fixed up.  This is above and beyond that!
>> :)
> 
> Well, I think it is worth solving the underlying problem
> properly. Otherwise we get weird solutions to data races like
> pmd_trans_unstable()...
> 
> Jason
> .
> 

---
Regards,
Longpeng(Mike)
Jason Gunthorpe March 24, 2020, 11:55 a.m. UTC | #13
On Tue, Mar 24, 2020 at 10:37:49AM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> 
> 
> On 2020/3/24 6:52, Jason Gunthorpe wrote:
> > On Mon, Mar 23, 2020 at 01:35:07PM -0700, Mike Kravetz wrote:
> >> On 3/23/20 11:07 AM, Jason Gunthorpe wrote:
> >>> On Mon, Mar 23, 2020 at 10:27:48AM -0700, Mike Kravetz wrote:
> >>>
> >>>>>  	pgd = pgd_offset(mm, addr);
> >>>>> -	if (!pgd_present(*pgd))
> >>>>> +	if (!pgd_present(READ_ONCE(*pgd)))
> >>>>>  		return NULL;
> >>>>>  	p4d = p4d_offset(pgd, addr);
> >>>>> -	if (!p4d_present(*p4d))
> >>>>> +	if (!p4d_present(READ_ONCE(*p4d)))
> >>>>>  		return NULL;
> >>>>>  
> >>>>>       pud = pud_offset(p4d, addr);
> >>>>
> >>>> One would argue that pgd and p4d can not change from present to !present
> >>>> during the execution of this code.  To me, that seems like the issue which
> >>>> would cause an issue.  Of course, I could be missing something.
> >>>
> >>> This I am not sure of, I think it must be true under the read side of
> >>> the mmap_sem, but probably not guarenteed under RCU..
> >>>
> >>> In any case, it doesn't matter, the fact that *p4d can change at all
> >>> is problematic. Unwinding the above inlines we get:
> >>>
> >>>   p4d = p4d_offset(pgd, addr)
> >>>   if (!p4d_present(*p4d))
> >>>       return NULL;
> >>>   pud = (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
> >>>
> >>> According to our memory model the compiler/CPU is free to execute this
> >>> as:
> >>>
> >>>   p4d = p4d_offset(pgd, addr)
> >>>   p4d_for_vaddr = *p4d;
> >>>   if (!p4d_present(*p4d))
> >>>       return NULL;
> >>>   pud = (pud_t *)p4d_page_vaddr(p4d_for_vaddr) + pud_index(address);
> >>>
> >>
> >> Wow!  How do you know this?  You don't need to answer :)
> > 
> > It says explicitly in Documentation/memory-barriers.txt - see
> > section COMPILER BARRIER:
> > 
> >  (*) The compiler is within its rights to reorder loads and stores
> >      to the same variable, and in some cases, the CPU is within its
> >      rights to reorder loads to the same variable.  This means that
> >      the following code:
> > 
> >         a[0] = x;
> >         a[1] = x;
> > 
> >      Might result in an older value of x stored in a[1] than in a[0].
> > 
> > It also says READ_ONCE puts things in program order, but we don't use
> > READ_ONCE inside pud_offset(), so it doesn't help us.
> > 
> > Best answer is to code things so there is exactly one dereference of
> > the pointer protected by READ_ONCE. Very clear to read, very safe.
> > 
> > Maybe Longpeng can rework the patch around these principles?
> > 
> Thanks Jason and Mike, I learn a lot from your analysis.
> 
> So... the patch should like this ?

Yes, the pattern looks right

The commit message should reference the above section of COMPILER
BARRIER and explain that de-referencing the entries is a data race, so
we must consolidate all the reads into one single place.

Also, since CH moved all the get_user_pages_fast code out of the
arch's many/all archs can drop their arch specific version of this
routine. This is really just a specialized version of gup_fast's
algorithm..

(also the arch versions seem different, why do some return actual
 ptes, not null?)

Jason
Mike Kravetz March 24, 2020, 3:25 p.m. UTC | #14
On 3/24/20 4:55 AM, Jason Gunthorpe wrote:
> Also, since CH moved all the get_user_pages_fast code out of the
> arch's many/all archs can drop their arch specific version of this
> routine. This is really just a specialized version of gup_fast's
> algorithm..
> 
> (also the arch versions seem different, why do some return actual
>  ptes, not null?)

Not sure I understand that last question.  The return value should be
a *pte or null.

/*
 * huge_pte_offset() - Walk the page table to resolve the hugepage
 * entry at address @addr
 *
 * Return: Pointer to page table or swap entry (PUD or PMD) for
 * address @addr, or NULL if a p*d_none() entry is encountered and the
 * size @sz doesn't match the hugepage size at this level of the page
 * table.
 */
Jason Gunthorpe March 24, 2020, 3:55 p.m. UTC | #15
On Tue, Mar 24, 2020 at 08:25:09AM -0700, Mike Kravetz wrote:
> On 3/24/20 4:55 AM, Jason Gunthorpe wrote:
> > Also, since CH moved all the get_user_pages_fast code out of the
> > arch's many/all archs can drop their arch specific version of this
> > routine. This is really just a specialized version of gup_fast's
> > algorithm..
> > 
> > (also the arch versions seem different, why do some return actual
> >  ptes, not null?)
> 
> Not sure I understand that last question.  The return value should be
> a *pte or null.

I mean the common code ends like this:

	pmd = pmd_offset(pud, addr);
	if (sz != PMD_SIZE && pmd_none(*pmd))
		return NULL;
	/* hugepage or swap? */
	if (pmd_huge(*pmd) || !pmd_present(*pmd))
		return (pte_t *)pmd;

	return NULL;

So it always returns a pointer into a PUD or PMD, while say, ppc
in __find_linux_pte() ends like:

	return pte_offset_kernel(&pmd, ea);

Which is pointing to a PTE

So does sparc:

        pmd = pmd_offset(pud, addr);
        if (pmd_none(*pmd))
                return NULL;
        if (is_hugetlb_pmd(*pmd))
                return (pte_t *)pmd;
        return pte_offset_map(pmd, addr);

Which is even worse because it is leaking a kmap..

etc

> /*
>  * huge_pte_offset() - Walk the page table to resolve the hugepage
>  * entry at address @addr
>  *
>  * Return: Pointer to page table or swap entry (PUD or PMD) for
                                              ^^^^^^^^^^^^^^^^^^^

Ie the above is not followed by the archs

I'm also scratching my head that a function that returns a pte_t *
always returns a PUD or PMD. Strange bit of type casting..

Jason
Mike Kravetz March 24, 2020, 4:19 p.m. UTC | #16
On 3/24/20 8:55 AM, Jason Gunthorpe wrote:
> On Tue, Mar 24, 2020 at 08:25:09AM -0700, Mike Kravetz wrote:
>> On 3/24/20 4:55 AM, Jason Gunthorpe wrote:
>>> Also, since CH moved all the get_user_pages_fast code out of the
>>> arch's many/all archs can drop their arch specific version of this
>>> routine. This is really just a specialized version of gup_fast's
>>> algorithm..
>>>
>>> (also the arch versions seem different, why do some return actual
>>>  ptes, not null?)
>>
>> Not sure I understand that last question.  The return value should be
>> a *pte or null.
> 
> I mean the common code ends like this:
> 
> 	pmd = pmd_offset(pud, addr);
> 	if (sz != PMD_SIZE && pmd_none(*pmd))
> 		return NULL;
> 	/* hugepage or swap? */
> 	if (pmd_huge(*pmd) || !pmd_present(*pmd))
> 		return (pte_t *)pmd;
> 
> 	return NULL;
> 
> So it always returns a pointer into a PUD or PMD, while say, ppc
> in __find_linux_pte() ends like:
> 
> 	return pte_offset_kernel(&pmd, ea);
> 
> Which is pointing to a PTE

Ok, now I understand the question.  huge_pte_offset will/should only be
called for addresses that are in a vma backed by hugetlb pages.  So,
pte_offset_kernel() will only return page table type (PUD/PMD/etc) associated
with a huge page supported by the particular arch.

> So does sparc:
> 
>         pmd = pmd_offset(pud, addr);
>         if (pmd_none(*pmd))
>                 return NULL;
>         if (is_hugetlb_pmd(*pmd))
>                 return (pte_t *)pmd;
>         return pte_offset_map(pmd, addr);
> 
> Which is even worse because it is leaking a kmap..
> 
> etc
> 
>> /*
>>  * huge_pte_offset() - Walk the page table to resolve the hugepage
>>  * entry at address @addr
>>  *
>>  * Return: Pointer to page table or swap entry (PUD or PMD) for
>                                               ^^^^^^^^^^^^^^^^^^^
> 
> Ie the above is not followed by the archs
> 
> I'm also scratching my head that a function that returns a pte_t *
> always returns a PUD or PMD. Strange bit of type casting..

Yes, the casting is curious.  The casting continues in potential subsequent
calls to huge_pte_alloc().
Jason Gunthorpe March 24, 2020, 5:59 p.m. UTC | #17
On Tue, Mar 24, 2020 at 09:19:29AM -0700, Mike Kravetz wrote:
> On 3/24/20 8:55 AM, Jason Gunthorpe wrote:
> > On Tue, Mar 24, 2020 at 08:25:09AM -0700, Mike Kravetz wrote:
> >> On 3/24/20 4:55 AM, Jason Gunthorpe wrote:
> >>> Also, since CH moved all the get_user_pages_fast code out of the
> >>> arch's many/all archs can drop their arch specific version of this
> >>> routine. This is really just a specialized version of gup_fast's
> >>> algorithm..
> >>>
> >>> (also the arch versions seem different, why do some return actual
> >>>  ptes, not null?)
> >>
> >> Not sure I understand that last question.  The return value should be
> >> a *pte or null.
> > 
> > I mean the common code ends like this:
> > 
> > 	pmd = pmd_offset(pud, addr);
> > 	if (sz != PMD_SIZE && pmd_none(*pmd))
> > 		return NULL;
> > 	/* hugepage or swap? */
> > 	if (pmd_huge(*pmd) || !pmd_present(*pmd))
> > 		return (pte_t *)pmd;
> > 
> > 	return NULL;
> > 
> > So it always returns a pointer into a PUD or PMD, while say, ppc
> > in __find_linux_pte() ends like:
> > 
> > 	return pte_offset_kernel(&pmd, ea);
> > 
> > Which is pointing to a PTE
> 
> Ok, now I understand the question.  huge_pte_offset will/should only be
> called for addresses that are in a vma backed by hugetlb pages.  So,
> pte_offset_kernel() will only return page table type (PUD/PMD/etc) associated
> with a huge page supported by the particular arch.

I thought pte_offset_kernel always returns PTEs (ie the 4k entries on
x86), I suppose what you are saying is that since the caller knows
this is always a PUD or PMD due to the VMA the pte_offset is dead code.

> > So does sparc:
> > 
> >         pmd = pmd_offset(pud, addr);
> >         if (pmd_none(*pmd))
> >                 return NULL;
> >         if (is_hugetlb_pmd(*pmd))
> >                 return (pte_t *)pmd;
> >         return pte_offset_map(pmd, addr);
> > 
> > Which is even worse because it is leaking a kmap..

Particularly here which is buggy dead code :)

Jason
Mike Kravetz March 24, 2020, 7:47 p.m. UTC | #18
On 3/24/20 10:59 AM, Jason Gunthorpe wrote:
> On Tue, Mar 24, 2020 at 09:19:29AM -0700, Mike Kravetz wrote:
>> On 3/24/20 8:55 AM, Jason Gunthorpe wrote:
>>> On Tue, Mar 24, 2020 at 08:25:09AM -0700, Mike Kravetz wrote:
>>>> On 3/24/20 4:55 AM, Jason Gunthorpe wrote:
>>>>> Also, since CH moved all the get_user_pages_fast code out of the
>>>>> arch's many/all archs can drop their arch specific version of this
>>>>> routine. This is really just a specialized version of gup_fast's
>>>>> algorithm..
>>>>>
>>>>> (also the arch versions seem different, why do some return actual
>>>>>  ptes, not null?)
>>>>
>>>> Not sure I understand that last question.  The return value should be
>>>> a *pte or null.
>>>
>>> I mean the common code ends like this:
>>>
>>> 	pmd = pmd_offset(pud, addr);
>>> 	if (sz != PMD_SIZE && pmd_none(*pmd))
>>> 		return NULL;
>>> 	/* hugepage or swap? */
>>> 	if (pmd_huge(*pmd) || !pmd_present(*pmd))
>>> 		return (pte_t *)pmd;
>>>
>>> 	return NULL;
>>>
>>> So it always returns a pointer into a PUD or PMD, while say, ppc
>>> in __find_linux_pte() ends like:
>>>
>>> 	return pte_offset_kernel(&pmd, ea);
>>>
>>> Which is pointing to a PTE
>>
>> Ok, now I understand the question.  huge_pte_offset will/should only be
>> called for addresses that are in a vma backed by hugetlb pages.  So,
>> pte_offset_kernel() will only return page table type (PUD/PMD/etc) associated
>> with a huge page supported by the particular arch.
> 
> I thought pte_offset_kernel always returns PTEs (ie the 4k entries on
> x86), I suppose what you are saying is that since the caller knows
> this is always a PUD or PMD due to the VMA the pte_offset is dead code.

Yes, for x86 the address will correspond to a PUD or PMD or NULL.  For huge
page mappings/vmas on x86, there are no corresponding PTEs.

Patch
diff mbox series

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dd8737a..90daf37 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4910,28 +4910,30 @@  pte_t *huge_pte_offset(struct mm_struct *mm,
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
-	pud_t *pud;
-	pmd_t *pmd;
+	pud_t *pud, pud_entry;
+	pmd_t *pmd, pmd_entry;
 
 	pgd = pgd_offset(mm, addr);
-	if (!pgd_present(*pgd))
+	if (!pgd_present(READ_ONCE(*pgd)))
 		return NULL;
 	p4d = p4d_offset(pgd, addr);
-	if (!p4d_present(*p4d))
+	if (!p4d_present(READ_ONCE(*p4d)))
 		return NULL;
 
 	pud = pud_offset(p4d, addr);
-	if (sz != PUD_SIZE && pud_none(*pud))
+	pud_entry = READ_ONCE(*pud);
+	if (sz != PUD_SIZE && pud_none(pud_entry))
 		return NULL;
 	/* hugepage or swap? */
-	if (pud_huge(*pud) || !pud_present(*pud))
+	if (pud_huge(pud_entry) || !pud_present(pud_entry))
 		return (pte_t *)pud;
 
 	pmd = pmd_offset(pud, addr);
-	if (sz != PMD_SIZE && pmd_none(*pmd))
+	pmd_entry = READ_ONCE(*pmd);
+	if (sz != PMD_SIZE && pmd_none(pmd_entry))
 		return NULL;
 	/* hugepage or swap? */
-	if (pmd_huge(*pmd) || !pmd_present(*pmd))
+	if (pmd_huge(pmd_entry) || !pmd_present(pmd_entry))
 		return (pte_t *)pmd;
 
 	return NULL;