mm/hugetlb: avoid get wrong ptep caused by race

Message ID 1582027825-112728-1-git-send-email-longpeng2@huawei.com (mailing list archive)
State New, archived

Commit Message

Longpeng(Mike) Feb. 18, 2020, 12:10 p.m. UTC
Our machine panicked after running for a long time. The call trace is:
RIP: 0010:[<ffffffff9dff0587>]  [<ffffffff9dff0587>] hugetlb_fault+0x307/0xbe0
RSP: 0018:ffff9567fc27f808  EFLAGS: 00010286
RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48
RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48
RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080
R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8
R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074
FS:  00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff9df9b71b>] ? unlock_page+0x2b/0x30
 [<ffffffff9dff04a2>] ? hugetlb_fault+0x222/0xbe0
 [<ffffffff9dff1405>] follow_hugetlb_page+0x175/0x540
 [<ffffffff9e15b825>] ? cpumask_next_and+0x35/0x50
 [<ffffffff9dfc7230>] __get_user_pages+0x2a0/0x7e0
 [<ffffffff9dfc648d>] __get_user_pages_unlocked+0x15d/0x210
 [<ffffffffc068cfc5>] __gfn_to_pfn_memslot+0x3c5/0x460 [kvm]
 [<ffffffffc06b28be>] try_async_pf+0x6e/0x2a0 [kvm]
 [<ffffffffc06b4b41>] tdp_page_fault+0x151/0x2d0 [kvm]
 [<ffffffffc075731c>] ? vmx_vcpu_run+0x2ec/0xc80 [kvm_intel]
 [<ffffffffc0757328>] ? vmx_vcpu_run+0x2f8/0xc80 [kvm_intel]
 [<ffffffffc06abc11>] kvm_mmu_page_fault+0x31/0x140 [kvm]
 [<ffffffffc074d1ae>] handle_ept_violation+0x9e/0x170 [kvm_intel]
 [<ffffffffc075579c>] vmx_handle_exit+0x2bc/0xc70 [kvm_intel]
 [<ffffffffc074f1a0>] ? __vmx_complete_interrupts.part.73+0x80/0xd0 [kvm_intel]
 [<ffffffffc07574c0>] ? vmx_vcpu_run+0x490/0xc80 [kvm_intel]
 [<ffffffffc069f3be>] vcpu_enter_guest+0x7be/0x13a0 [kvm]
 [<ffffffffc06cf53e>] ? kvm_check_async_pf_completion+0x8e/0xb0 [kvm]
 [<ffffffffc06a6f90>] kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm]
 [<ffffffffc068d919>] kvm_vcpu_ioctl+0x309/0x6d0 [kvm]
 [<ffffffff9deaa8c2>] ? dequeue_signal+0x32/0x180
 [<ffffffff9deae34d>] ? do_sigtimedwait+0xcd/0x230
 [<ffffffff9e03aed0>] do_vfs_ioctl+0x3f0/0x540
 [<ffffffff9e03b0c1>] SyS_ioctl+0xa1/0xc0
 [<ffffffff9e53879b>] system_call_fastpath+0x22/0x27

(The kernel we used is older, but after digging into the problem we think
 the latest kernel also has this bug.)

For 1G hugepages, huge_pte_offset() wants to return NULL or pudp, but it
may return a wrong 'pmdp' if there is a race. Please look at the following
code snippet:
    ...
    pud = pud_offset(p4d, addr);
    if (sz != PUD_SIZE && pud_none(*pud))
        return NULL;
    /* hugepage or swap? */
    if (pud_huge(*pud) || !pud_present(*pud))
        return (pte_t *)pud;

    pmd = pmd_offset(pud, addr);
    if (sz != PMD_SIZE && pmd_none(*pmd))
        return NULL;
    /* hugepage or swap? */
    if (pmd_huge(*pmd) || !pmd_present(*pmd))
        return (pte_t *)pmd;
    ...

The following sequence triggers the bug:
1. CPU0: sz == PUD_SIZE and *pud == 0, so the pud_none() check does not
   return; continue
2. CPU0: "pud_huge(*pud)" is false
3. CPU1: calls hugetlb_no_page() and sets *pud to xxxx8e7 (PRESENT)
4. CPU0: "!pud_present(*pud)" is false; continue
5. CPU0: pmd = pmd_offset(pud, addr), which may yield a wrong pmdp
However, we want CPU0 to return NULL or pudp.

We can avoid this race by reading the pud only once.

Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
---
 mm/hugetlb.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)
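
The race described in the commit message can be modelled deterministically
in user space. The sketch below is only an illustration under stated
assumptions: the bit values are modelled on x86-64, and every sk_* name,
including the hook that stands in for "CPU1 runs here", is hypothetical
and not part of the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit positions modelled on x86-64; treat them as assumptions of this sketch. */
#define SK_PRESENT 0x001ull   /* entry maps something */
#define SK_HUGE    0x080ull   /* entry maps a huge page directly */

/* Hypothetical hook standing in for "another CPU runs here". */
typedef void (*sk_race_hook)(volatile uint64_t *slot);

enum sk_result { SK_RETURN_ENTRY, SK_DESCEND };

/* Buggy shape: the slot is dereferenced once per predicate. */
static enum sk_result sk_check_twice(volatile uint64_t *slot, sk_race_hook hook)
{
	int huge, present;

	assert(slot != NULL);
	huge = (*slot & SK_HUGE) != 0;          /* first read */
	if (hook)
		hook(slot);                     /* concurrent update lands here */
	present = (*slot & SK_PRESENT) != 0;    /* second read */

	return (huge || !present) ? SK_RETURN_ENTRY : SK_DESCEND;
}

/* Fixed shape: one snapshot feeds both predicates. */
static enum sk_result sk_check_once(volatile uint64_t *slot, sk_race_hook hook)
{
	uint64_t entry;

	assert(slot != NULL);
	entry = *slot;                          /* single read */
	if (hook)
		hook(slot);                     /* a later update cannot split the checks */

	return ((entry & SK_HUGE) || !(entry & SK_PRESENT))
		? SK_RETURN_ENTRY : SK_DESCEND;
}

/* Models CPU1 installing a present huge mapping (low bits 0x8e7). */
static void sk_install_huge(volatile uint64_t *slot)
{
	*slot = 0x1400008e7ull;
}
```

Starting from an empty slot, sk_check_twice() with sk_install_huge as the
hook decides to descend, which is the wrong outcome from the commit
message, while sk_check_once() returns the entry.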

Comments

Sean Christopherson Feb. 18, 2020, 8:37 p.m. UTC | #1
On Tue, Feb 18, 2020 at 08:10:25PM +0800, Longpeng(Mike) wrote:
> Our machine encountered a panic after run for a long time and
> the calltrace is:

What's the actual panic?  Is it a BUG() in hugetlb_fault(), a bad pointer
dereference, etc...?

> RIP: 0010:[<ffffffff9dff0587>]  [<ffffffff9dff0587>] hugetlb_fault+0x307/0xbe0
> RSP: 0018:ffff9567fc27f808  EFLAGS: 00010286
> RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48
> RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48
> RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080
> R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8
> R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074
> FS:  00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  [<ffffffff9df9b71b>] ? unlock_page+0x2b/0x30
>  [<ffffffff9dff04a2>] ? hugetlb_fault+0x222/0xbe0
>  [<ffffffff9dff1405>] follow_hugetlb_page+0x175/0x540
>  [<ffffffff9e15b825>] ? cpumask_next_and+0x35/0x50
>  [<ffffffff9dfc7230>] __get_user_pages+0x2a0/0x7e0
>  [<ffffffff9dfc648d>] __get_user_pages_unlocked+0x15d/0x210
>  [<ffffffffc068cfc5>] __gfn_to_pfn_memslot+0x3c5/0x460 [kvm]
>  [<ffffffffc06b28be>] try_async_pf+0x6e/0x2a0 [kvm]
>  [<ffffffffc06b4b41>] tdp_page_fault+0x151/0x2d0 [kvm]
>  [<ffffffffc075731c>] ? vmx_vcpu_run+0x2ec/0xc80 [kvm_intel]
>  [<ffffffffc0757328>] ? vmx_vcpu_run+0x2f8/0xc80 [kvm_intel]
>  [<ffffffffc06abc11>] kvm_mmu_page_fault+0x31/0x140 [kvm]
>  [<ffffffffc074d1ae>] handle_ept_violation+0x9e/0x170 [kvm_intel]
>  [<ffffffffc075579c>] vmx_handle_exit+0x2bc/0xc70 [kvm_intel]
>  [<ffffffffc074f1a0>] ? __vmx_complete_interrupts.part.73+0x80/0xd0 [kvm_intel]
>  [<ffffffffc07574c0>] ? vmx_vcpu_run+0x490/0xc80 [kvm_intel]
>  [<ffffffffc069f3be>] vcpu_enter_guest+0x7be/0x13a0 [kvm]
>  [<ffffffffc06cf53e>] ? kvm_check_async_pf_completion+0x8e/0xb0 [kvm]
>  [<ffffffffc06a6f90>] kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm]
>  [<ffffffffc068d919>] kvm_vcpu_ioctl+0x309/0x6d0 [kvm]
>  [<ffffffff9deaa8c2>] ? dequeue_signal+0x32/0x180
>  [<ffffffff9deae34d>] ? do_sigtimedwait+0xcd/0x230
>  [<ffffffff9e03aed0>] do_vfs_ioctl+0x3f0/0x540
>  [<ffffffff9e03b0c1>] SyS_ioctl+0xa1/0xc0
>  [<ffffffff9e53879b>] system_call_fastpath+0x22/0x27
> 
> ( The kernel we used is older, but we think the latest kernel also has this
>   bug after dig into this problem. )
> 
> For 1G hugepages, huge_pte_offset() wants to return NULL or pudp, but it
> may return a wrong 'pmdp' if there is a race. Please look at the following
> code snippet:
>     ...
>     pud = pud_offset(p4d, addr);
>     if (sz != PUD_SIZE && pud_none(*pud))
>         return NULL;
>     /* hugepage or swap? */
>     if (pud_huge(*pud) || !pud_present(*pud))
>         return (pte_t *)pud;
> 
>     pmd = pmd_offset(pud, addr);
>     if (sz != PMD_SIZE && pmd_none(*pmd))
>         return NULL;
>     /* hugepage or swap? */
>     if (pmd_huge(*pmd) || !pmd_present(*pmd))
>         return (pte_t *)pmd;
>     ...
> 
> The following sequence would trigger this bug:
> 1. CPU0: sz = PUD_SIZE and *pud = 0 , continue
> 1. CPU0: "pud_huge(*pud)" is false
> 2. CPU1: calling hugetlb_no_page and set *pud to xxxx8e7(PRESENT)
> 3. CPU0: "!pud_present(*pud)" is false, continue
> 4. CPU0: pmd = pmd_offset(pud, addr) and maybe return a wrong pmdp
> However, we want CPU0 to return NULL or pudp.
> 
> We can avoid this race by read the pud only once.

Are there any other options for avoiding the panic you hit?  I ask because
there are a variety of flows that use a very similar code pattern, e.g.
lookup_address_in_pgd(), and using READ_ONCE() in huge_pte_offset() but not
other flows could be confusing (or in my case, anxiety inducing[*]).  At
the least, adding a comment in huge_pte_offset() to explain the need for
READ_ONCE() would be helpful.

[*] In kernel 5.6, KVM is moving to using lookup_address_in_pgd() (via
    lookup_address_in_mm()) to identify large page mappings.  The function
    itself is susceptible to such a race, but KVM only does the lookup
    after it has done gup() and also ensures any zapping of ptes will cause
    KVM to restart the faulting (guest) instruction or that the zap will be
    blocked until after KVM does the lookup, i.e. racing with a transition
    from !PRESENT -> PRESENT should be impossible (in theory).

> Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
> ---
>  mm/hugetlb.c | 34 ++++++++++++++++++----------------
>  1 file changed, 18 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index dd8737a..3bde229 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4908,31 +4908,33 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
>  pte_t *huge_pte_offset(struct mm_struct *mm,
>  		       unsigned long addr, unsigned long sz)
>  {
> -	pgd_t *pgd;
> -	p4d_t *p4d;
> -	pud_t *pud;
> -	pmd_t *pmd;
> +	pgd_t *pgdp;
> +	p4d_t *p4dp;
> +	pud_t *pudp, pud;
> +	pmd_t *pmdp, pmd;
>  
> -	pgd = pgd_offset(mm, addr);
> -	if (!pgd_present(*pgd))
> +	pgdp = pgd_offset(mm, addr);
> +	if (!pgd_present(*pgdp))
>  		return NULL;
> -	p4d = p4d_offset(pgd, addr);
> -	if (!p4d_present(*p4d))
> +	p4dp = p4d_offset(pgdp, addr);
> +	if (!p4d_present(*p4dp))
>  		return NULL;
>  
> -	pud = pud_offset(p4d, addr);
> -	if (sz != PUD_SIZE && pud_none(*pud))
> +	pudp = pud_offset(p4dp, addr);
> +	pud = READ_ONCE(*pudp);
> +	if (sz != PUD_SIZE && pud_none(pud))
>  		return NULL;
>  	/* hugepage or swap? */
> -	if (pud_huge(*pud) || !pud_present(*pud))
> -		return (pte_t *)pud;
> +	if (pud_huge(pud) || !pud_present(pud))
> +		return (pte_t *)pudp;
>  
> -	pmd = pmd_offset(pud, addr);
> -	if (sz != PMD_SIZE && pmd_none(*pmd))
> +	pmdp = pmd_offset(pudp, addr);
> +	pmd = READ_ONCE(*pmdp);
> +	if (sz != PMD_SIZE && pmd_none(pmd))
>  		return NULL;
>  	/* hugepage or swap? */
> -	if (pmd_huge(*pmd) || !pmd_present(*pmd))
> -		return (pte_t *)pmd;
> +	if (pmd_huge(pmd) || !pmd_present(pmd))
> +		return (pte_t *)pmdp;
>  
>  	return NULL;
>  }
> -- 
> 1.8.3.1
> 
>
Matthew Wilcox Feb. 18, 2020, 8:52 p.m. UTC | #2
On Tue, Feb 18, 2020 at 08:10:25PM +0800, Longpeng(Mike) wrote:
>  {
> -	pgd_t *pgd;
> -	p4d_t *p4d;
> -	pud_t *pud;
> -	pmd_t *pmd;
> +	pgd_t *pgdp;
> +	p4d_t *p4dp;
> +	pud_t *pudp, pud;
> +	pmd_t *pmdp, pmd;

Renaming the variables as part of a fix is a really bad idea.  It obscures
the actual fix and makes everybody's life harder.  Plus, it's not even
renaming to follow the normal convention -- there are only two places
(migrate.c and gup.c) which follow this pattern in mm/ while there are
33 that do not.
Mike Kravetz Feb. 19, 2020, 12:51 a.m. UTC | #3
On 2/18/20 12:37 PM, Sean Christopherson wrote:
> On Tue, Feb 18, 2020 at 08:10:25PM +0800, Longpeng(Mike) wrote:
>> Our machine encountered a panic after run for a long time and
>> the calltrace is:
> 
> What's the actual panic?  Is it a BUG() in hugetlb_fault(), a bad pointer
> dereference, etc...?

I too would like some more information on the panic.
If your analysis is correct, then I would expect the 'ptep' returned by
huge_pte_offset() to not point to a pte but rather some random address.
This is because the 'pmd' calculated by pmd_offset(pud, addr) is not
really the address of a pmd.  So, perhaps there is an addressing exception
at huge_ptep_get() near the beginning of hugetlb_fault()?

	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
	if (ptep) {
		entry = huge_ptep_get(ptep);
		...
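
To illustrate why the pointer computed in that case is not the address of
a pmd, here is a small user-space sketch of the arithmetic. It assumes
x86-64 4-level paging constants and works on physical addresses only; the
sk_* helpers are hypothetical and greatly simplified compared with the
kernel's pmd_offset().

```c
#include <assert.h>
#include <stdint.h>

/* Constants modelled on x86-64 4-level paging; assumptions of this sketch. */
#define SK_PMD_SHIFT    21                      /* each PMD entry covers 2 MiB */
#define SK_PTRS_PER_PMD 512
#define SK_ADDR_MASK    0x000ffffffffff000ull   /* physical address bits 12..51 */
#define SK_PUD_SIZE     (1ull << 30)            /* 1 GiB */
#define SK_HUGE         0x080ull

/* Index of the PMD slot that a virtual address selects. */
static uint64_t sk_pmd_index(uint64_t vaddr)
{
	return (vaddr >> SK_PMD_SHIFT) & (SK_PTRS_PER_PMD - 1);
}

/*
 * Physical address of the "PMD slot" a walker would compute if it treated
 * pud_entry as a pointer to a PMD table.
 */
static uint64_t sk_pmd_slot_phys(uint64_t pud_entry, uint64_t vaddr)
{
	uint64_t table_phys = pud_entry & SK_ADDR_MASK;

	return table_phys + sk_pmd_index(vaddr) * sizeof(uint64_t);
}

/* True if phys lies inside the 1 GiB page that a huge pud_entry maps. */
static int sk_inside_hugepage(uint64_t pud_entry, uint64_t phys)
{
	uint64_t base = pud_entry & SK_ADDR_MASK & ~(SK_PUD_SIZE - 1);

	assert((pud_entry & SK_HUGE) != 0);     /* only meaningful for a huge entry */
	return phys >= base && phys < base + SK_PUD_SIZE;
}
```

With a huge entry such as 0x1400008e7, the computed slot lands inside the
user's 1G page itself, so whatever is read there is user data rather than
a page table entry.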
Longpeng(Mike) Feb. 19, 2020, 1:39 a.m. UTC | #4
On 2020/2/19 4:37, Sean Christopherson wrote:
> On Tue, Feb 18, 2020 at 08:10:25PM +0800, Longpeng(Mike) wrote:
>> Our machine encountered a panic after run for a long time and
>> the calltrace is:
> 
> What's the actual panic?  Is it a BUG() in hugetlb_fault(), a bad pointer
> dereference, etc...?
> 
A bad pointer dereference.

pgd -> pud -> user 1G hugepage
huge_pte_offset() should return NULL or pud (a pointer to the entry), but it may
instead return a bad pointer into the user 1G hugepage.

>> RIP: 0010:[<ffffffff9dff0587>]  [<ffffffff9dff0587>] hugetlb_fault+0x307/0xbe0
>> RSP: 0018:ffff9567fc27f808  EFLAGS: 00010286
>> RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48
>> RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48
>> RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080
>> R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8
>> R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074
>> FS:  00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Call Trace:
>>  [<ffffffff9df9b71b>] ? unlock_page+0x2b/0x30
>>  [<ffffffff9dff04a2>] ? hugetlb_fault+0x222/0xbe0
>>  [<ffffffff9dff1405>] follow_hugetlb_page+0x175/0x540
>>  [<ffffffff9e15b825>] ? cpumask_next_and+0x35/0x50
>>  [<ffffffff9dfc7230>] __get_user_pages+0x2a0/0x7e0
>>  [<ffffffff9dfc648d>] __get_user_pages_unlocked+0x15d/0x210
>>  [<ffffffffc068cfc5>] __gfn_to_pfn_memslot+0x3c5/0x460 [kvm]
>>  [<ffffffffc06b28be>] try_async_pf+0x6e/0x2a0 [kvm]
>>  [<ffffffffc06b4b41>] tdp_page_fault+0x151/0x2d0 [kvm]
>>  [<ffffffffc075731c>] ? vmx_vcpu_run+0x2ec/0xc80 [kvm_intel]
>>  [<ffffffffc0757328>] ? vmx_vcpu_run+0x2f8/0xc80 [kvm_intel]
>>  [<ffffffffc06abc11>] kvm_mmu_page_fault+0x31/0x140 [kvm]
>>  [<ffffffffc074d1ae>] handle_ept_violation+0x9e/0x170 [kvm_intel]
>>  [<ffffffffc075579c>] vmx_handle_exit+0x2bc/0xc70 [kvm_intel]
>>  [<ffffffffc074f1a0>] ? __vmx_complete_interrupts.part.73+0x80/0xd0 [kvm_intel]
>>  [<ffffffffc07574c0>] ? vmx_vcpu_run+0x490/0xc80 [kvm_intel]
>>  [<ffffffffc069f3be>] vcpu_enter_guest+0x7be/0x13a0 [kvm]
>>  [<ffffffffc06cf53e>] ? kvm_check_async_pf_completion+0x8e/0xb0 [kvm]
>>  [<ffffffffc06a6f90>] kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm]
>>  [<ffffffffc068d919>] kvm_vcpu_ioctl+0x309/0x6d0 [kvm]
>>  [<ffffffff9deaa8c2>] ? dequeue_signal+0x32/0x180
>>  [<ffffffff9deae34d>] ? do_sigtimedwait+0xcd/0x230
>>  [<ffffffff9e03aed0>] do_vfs_ioctl+0x3f0/0x540
>>  [<ffffffff9e03b0c1>] SyS_ioctl+0xa1/0xc0
>>  [<ffffffff9e53879b>] system_call_fastpath+0x22/0x27
>>
>> ( The kernel we used is older, but we think the latest kernel also has this
>>   bug after dig into this problem. )
>>
>> For 1G hugepages, huge_pte_offset() wants to return NULL or pudp, but it
>> may return a wrong 'pmdp' if there is a race. Please look at the following
>> code snippet:
>>     ...
>>     pud = pud_offset(p4d, addr);
>>     if (sz != PUD_SIZE && pud_none(*pud))
>>         return NULL;
>>     /* hugepage or swap? */
>>     if (pud_huge(*pud) || !pud_present(*pud))
>>         return (pte_t *)pud;
>>
>>     pmd = pmd_offset(pud, addr);
>>     if (sz != PMD_SIZE && pmd_none(*pmd))
>>         return NULL;
>>     /* hugepage or swap? */
>>     if (pmd_huge(*pmd) || !pmd_present(*pmd))
>>         return (pte_t *)pmd;
>>     ...
>>
>> The following sequence would trigger this bug:
>> 1. CPU0: sz = PUD_SIZE and *pud = 0 , continue
>> 1. CPU0: "pud_huge(*pud)" is false
>> 2. CPU1: calling hugetlb_no_page and set *pud to xxxx8e7(PRESENT)
>> 3. CPU0: "!pud_present(*pud)" is false, continue
>> 4. CPU0: pmd = pmd_offset(pud, addr) and maybe return a wrong pmdp
>> However, we want CPU0 to return NULL or pudp.
>>
>> We can avoid this race by read the pud only once.
> 
> Are there any other options for avoiding the panic you hit?  I ask because
> there are a variety of flows that use a very similar code pattern, e.g.
> lookup_address_in_pgd(), and using READ_ONCE() in huge_pte_offset() but not
> other flows could be confusing (or in my case, anxiety inducing[*]).  At
> the least, adding a comment in huge_pte_offset() to explain the need for
> READ_ONCE() would be helpful.
>
I hope the hugetlb and mm maintainers can suggest other options if they
agree that this is a bug.
To reproduce it, we changed the code from
	if (pud_huge(*pud) || !pud_present(*pud))
to
	if (pud_huge(*pud))
		return (pte_t *)pud;
	/* busy loop for 500ms */
	if (!pud_present(*pud))
		return (pte_t *)pud;
and the panic is then hit quickly.

ARM64 already uses READ_ONCE/WRITE_ONCE to access the page tables; see
commit 20a004e7 (arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables).

The root cause is that 'if (pud_huge(*pud) || !pud_present(*pud))' reads the
entry from pud twice, and *pud may change between the two reads because of the
race, so the fix is to read the pud only once.
I use READ_ONCE() here just to be safe, to prevent any compiler mischief.
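
As a rough user-space illustration of the read-once idea, the sketch below
takes one snapshot through a volatile-qualified access and evaluates both
predicates on it. SK_READ_ONCE is a hypothetical approximation that relies
on GNU C __typeof__; it is not the kernel's READ_ONCE() definition, and
the bit values are assumptions modelled on x86-64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * User-space approximation of a read-once accessor (GNU C __typeof__).
 * The volatile-qualified access asks the compiler to perform one load for
 * this expression instead of re-reading or merging accesses.
 */
#define SK_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

#define SK_PRESENT 0x001ull
#define SK_HUGE    0x080ull

/* Classify an entry from one snapshot; returns 1 if the walk must stop here. */
static int sk_stop_at_entry(uint64_t *slot)
{
	uint64_t entry;

	assert(slot != NULL);
	entry = SK_READ_ONCE(*slot);            /* the only load of *slot */

	return (entry & SK_HUGE) || !(entry & SK_PRESENT);
}
```

Because both checks use the local copy, a concurrent store to the slot can
change which answer is returned, but it cannot make the two checks
disagree with each other.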

I'll add comments in v2.

> [*] In kernel 5.6, KVM is moving to using lookup_address_in_pgd() (via
>     lookup_address_in_mm()) to identify large page mappings.  The function
>     itself is susceptible to such a race, but KVM only does the lookup
>     after it has done gup() and also ensures any zapping of ptes will cause
>     KVM to restart the faulting (guest) instruction or that the zap will be
>     blocked until after KVM does the lookup, i.e. racing with a transition
>     from !PRESENT -> PRESENT should be impossible (in theory).
> 
This bug is in the hugetlb core; other users could trigger it even if the
latest KVM does not.

>> Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
>> ---
>>  mm/hugetlb.c | 34 ++++++++++++++++++----------------
>>  1 file changed, 18 insertions(+), 16 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index dd8737a..3bde229 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -4908,31 +4908,33 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
>>  pte_t *huge_pte_offset(struct mm_struct *mm,
>>  		       unsigned long addr, unsigned long sz)
>>  {
>> -	pgd_t *pgd;
>> -	p4d_t *p4d;
>> -	pud_t *pud;
>> -	pmd_t *pmd;
>> +	pgd_t *pgdp;
>> +	p4d_t *p4dp;
>> +	pud_t *pudp, pud;
>> +	pmd_t *pmdp, pmd;
>>  
>> -	pgd = pgd_offset(mm, addr);
>> -	if (!pgd_present(*pgd))
>> +	pgdp = pgd_offset(mm, addr);
>> +	if (!pgd_present(*pgdp))
>>  		return NULL;
>> -	p4d = p4d_offset(pgd, addr);
>> -	if (!p4d_present(*p4d))
>> +	p4dp = p4d_offset(pgdp, addr);
>> +	if (!p4d_present(*p4dp))
>>  		return NULL;
>>  
>> -	pud = pud_offset(p4d, addr);
>> -	if (sz != PUD_SIZE && pud_none(*pud))
>> +	pudp = pud_offset(p4dp, addr);
>> +	pud = READ_ONCE(*pudp);
>> +	if (sz != PUD_SIZE && pud_none(pud))
>>  		return NULL;
>>  	/* hugepage or swap? */
>> -	if (pud_huge(*pud) || !pud_present(*pud))
>> -		return (pte_t *)pud;
>> +	if (pud_huge(pud) || !pud_present(pud))
>> +		return (pte_t *)pudp;
>>  
>> -	pmd = pmd_offset(pud, addr);
>> -	if (sz != PMD_SIZE && pmd_none(*pmd))
>> +	pmdp = pmd_offset(pudp, addr);
>> +	pmd = READ_ONCE(*pmdp);
>> +	if (sz != PMD_SIZE && pmd_none(pmd))
>>  		return NULL;
>>  	/* hugepage or swap? */
>> -	if (pmd_huge(*pmd) || !pmd_present(*pmd))
>> -		return (pte_t *)pmd;
>> +	if (pmd_huge(pmd) || !pmd_present(pmd))
>> +		return (pte_t *)pmdp;
>>  
>>  	return NULL;
>>  }
>> -- 
>> 1.8.3.1
>>
>>
> 
> .
>
Sean Christopherson Feb. 19, 2020, 1:58 a.m. UTC | #5
On Wed, Feb 19, 2020 at 09:39:59AM +0800, Longpeng (Mike) wrote:
> On 2020/2/19 4:37, Sean Christopherson wrote:
> > On Tue, Feb 18, 2020 at 08:10:25PM +0800, Longpeng(Mike) wrote:
> >> Our machine encountered a panic after run for a long time and
> >> the calltrace is:
> > 
> > What's the actual panic?  Is it a BUG() in hugetlb_fault(), a bad pointer
> > dereference, etc...?
> > 
> A bad pointer dereference.
> 
> pgd -> pud -> user 1G hugepage
> huge_pte_offset() wants to return NULL or pud (point to the entry), but it maybe
> return the a bad pointer of the user 1G hugepage.
> 
> >> RIP: 0010:[<ffffffff9dff0587>]  [<ffffffff9dff0587>] hugetlb_fault+0x307/0xbe0
> >> RSP: 0018:ffff9567fc27f808  EFLAGS: 00010286
> >> RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48
> >> RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48
> >> RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080
> >> R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8
> >> R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074
> >> FS:  00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000
> >> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0
> >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >> Call Trace:
> >>  [<ffffffff9df9b71b>] ? unlock_page+0x2b/0x30
> >>  [<ffffffff9dff04a2>] ? hugetlb_fault+0x222/0xbe0
> >>  [<ffffffff9dff1405>] follow_hugetlb_page+0x175/0x540
> >>  [<ffffffff9e15b825>] ? cpumask_next_and+0x35/0x50
> >>  [<ffffffff9dfc7230>] __get_user_pages+0x2a0/0x7e0
> >>  [<ffffffff9dfc648d>] __get_user_pages_unlocked+0x15d/0x210
> >>  [<ffffffffc068cfc5>] __gfn_to_pfn_memslot+0x3c5/0x460 [kvm]
> >>  [<ffffffffc06b28be>] try_async_pf+0x6e/0x2a0 [kvm]
> >>  [<ffffffffc06b4b41>] tdp_page_fault+0x151/0x2d0 [kvm]
> >>  [<ffffffffc075731c>] ? vmx_vcpu_run+0x2ec/0xc80 [kvm_intel]
> >>  [<ffffffffc0757328>] ? vmx_vcpu_run+0x2f8/0xc80 [kvm_intel]
> >>  [<ffffffffc06abc11>] kvm_mmu_page_fault+0x31/0x140 [kvm]
> >>  [<ffffffffc074d1ae>] handle_ept_violation+0x9e/0x170 [kvm_intel]
> >>  [<ffffffffc075579c>] vmx_handle_exit+0x2bc/0xc70 [kvm_intel]
> >>  [<ffffffffc074f1a0>] ? __vmx_complete_interrupts.part.73+0x80/0xd0 [kvm_intel]
> >>  [<ffffffffc07574c0>] ? vmx_vcpu_run+0x490/0xc80 [kvm_intel]
> >>  [<ffffffffc069f3be>] vcpu_enter_guest+0x7be/0x13a0 [kvm]
> >>  [<ffffffffc06cf53e>] ? kvm_check_async_pf_completion+0x8e/0xb0 [kvm]
> >>  [<ffffffffc06a6f90>] kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm]
> >>  [<ffffffffc068d919>] kvm_vcpu_ioctl+0x309/0x6d0 [kvm]
> >>  [<ffffffff9deaa8c2>] ? dequeue_signal+0x32/0x180
> >>  [<ffffffff9deae34d>] ? do_sigtimedwait+0xcd/0x230
> >>  [<ffffffff9e03aed0>] do_vfs_ioctl+0x3f0/0x540
> >>  [<ffffffff9e03b0c1>] SyS_ioctl+0xa1/0xc0
> >>  [<ffffffff9e53879b>] system_call_fastpath+0x22/0x27
> >>
> >> ( The kernel we used is older, but we think the latest kernel also has this
> >>   bug after dig into this problem. )
> >>
> >> For 1G hugepages, huge_pte_offset() wants to return NULL or pudp, but it
> >> may return a wrong 'pmdp' if there is a race. Please look at the following
> >> code snippet:
> >>     ...
> >>     pud = pud_offset(p4d, addr);
> >>     if (sz != PUD_SIZE && pud_none(*pud))
> >>         return NULL;
> >>     /* hugepage or swap? */
> >>     if (pud_huge(*pud) || !pud_present(*pud))
> >>         return (pte_t *)pud;
> >>
> >>     pmd = pmd_offset(pud, addr);
> >>     if (sz != PMD_SIZE && pmd_none(*pmd))
> >>         return NULL;
> >>     /* hugepage or swap? */
> >>     if (pmd_huge(*pmd) || !pmd_present(*pmd))
> >>         return (pte_t *)pmd;
> >>     ...
> >>
> >> The following sequence would trigger this bug:
> >> 1. CPU0: sz = PUD_SIZE and *pud = 0 , continue
> >> 1. CPU0: "pud_huge(*pud)" is false
> >> 2. CPU1: calling hugetlb_no_page and set *pud to xxxx8e7(PRESENT)
> >> 3. CPU0: "!pud_present(*pud)" is false, continue
> >> 4. CPU0: pmd = pmd_offset(pud, addr) and maybe return a wrong pmdp
> >> However, we want CPU0 to return NULL or pudp.
> >>
> >> We can avoid this race by read the pud only once.
> > 
> > Are there any other options for avoiding the panic you hit?  I ask because
> > there are a variety of flows that use a very similar code pattern, e.g.
> > lookup_address_in_pgd(), and using READ_ONCE() in huge_pte_offset() but not
> > other flows could be confusing (or in my case, anxiety inducing[*]).  At
> > the least, adding a comment in huge_pte_offset() to explain the need for
> > READ_ONCE() would be helpful.
> >
> I hope the hugetlb and mm maintainers could give some other options if they
> approve this bug.

The race and the fix make sense.  I assumed dereferencing garbage from the
huge page was the issue, but I wasn't 100% sure that was the case, which is
why I asked about alternative fixes.

> We change the code from
> 	if (pud_huge(*pud) || !pud_present(*pud))
> to
> 	if (pud_huge(*pud)
> 		return (pte_t *)pud;
> 	busy loop for 500ms
> 	if (!pud_present(*pud))
> 		return (pte_t *)pud;
> and the panic will be hit quickly.
> 
> ARM64 has already use READ/WRITE_ONCE to access the pagetable, look at this
> commit 20a004e7 (arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables).
> 
> The root cause is: 'if (pud_huge(*pud) || !pud_present(*pud))' read entry from
> pud twice and the *pud maybe change in a race, so if we only read the pud once.
> I use READ_ONCE here is just for safe, to prevents the complier mischief if
> possible.

FWIW, I'd be in favor of going the READ/WRITE_ONCE() route for x86, e.g.
convert everything as a follow-up patch (or patches).  I'm fairly confident
that KVM's usage of lookup_address_in_mm() is safe, but I wouldn't exactly
bet my life on it.  I'd much rather the failing scenario be that KVM uses
a sub-optimal page size as opposed to exploding on a bad pointer.

> I'll add comments in v2.
> 
> > [*] In kernel 5.6, KVM is moving to using lookup_address_in_pgd() (via
> >     lookup_address_in_mm()) to identify large page mappings.  The function
> >     itself is susceptible to such a race, but KVM only does the lookup
> >     after it has done gup() and also ensures any zapping of ptes will cause
> >     KVM to restart the faulting (guest) instruction or that the zap will be
> >     blocked until after KVM does the lookup, i.e. racing with a transition
> >     from !PRESENT -> PRESENT should be impossible (in theory).
> > 
> This bug is from hugetlb core, we could trigger it in other usages even if the
> latest KVM won't.

I was actually worried about the opposite, introducing a bug by moving to
lookup_address_in_mm().

> >> Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
> >> ---
> >>  mm/hugetlb.c | 34 ++++++++++++++++++----------------
> >>  1 file changed, 18 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index dd8737a..3bde229 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -4908,31 +4908,33 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
> >>  pte_t *huge_pte_offset(struct mm_struct *mm,
> >>  		       unsigned long addr, unsigned long sz)
> >>  {
> >> -	pgd_t *pgd;
> >> -	p4d_t *p4d;
> >> -	pud_t *pud;
> >> -	pmd_t *pmd;
> >> +	pgd_t *pgdp;
> >> +	p4d_t *p4dp;
> >> +	pud_t *pudp, pud;
> >> +	pmd_t *pmdp, pmd;
> >>  
> >> -	pgd = pgd_offset(mm, addr);
> >> -	if (!pgd_present(*pgd))
> >> +	pgdp = pgd_offset(mm, addr);
> >> +	if (!pgd_present(*pgdp))
> >>  		return NULL;
> >> -	p4d = p4d_offset(pgd, addr);
> >> -	if (!p4d_present(*p4d))
> >> +	p4dp = p4d_offset(pgdp, addr);
> >> +	if (!p4d_present(*p4dp))
> >>  		return NULL;
> >>  
> >> -	pud = pud_offset(p4d, addr);
> >> -	if (sz != PUD_SIZE && pud_none(*pud))
> >> +	pudp = pud_offset(p4dp, addr);
> >> +	pud = READ_ONCE(*pudp);
> >> +	if (sz != PUD_SIZE && pud_none(pud))
> >>  		return NULL;
> >>  	/* hugepage or swap? */
> >> -	if (pud_huge(*pud) || !pud_present(*pud))
> >> -		return (pte_t *)pud;
> >> +	if (pud_huge(pud) || !pud_present(pud))
> >> +		return (pte_t *)pudp;
> >>  
> >> -	pmd = pmd_offset(pud, addr);
> >> -	if (sz != PMD_SIZE && pmd_none(*pmd))
> >> +	pmdp = pmd_offset(pudp, addr);
> >> +	pmd = READ_ONCE(*pmdp);
> >> +	if (sz != PMD_SIZE && pmd_none(pmd))
> >>  		return NULL;
> >>  	/* hugepage or swap? */
> >> -	if (pmd_huge(*pmd) || !pmd_present(*pmd))
> >> -		return (pte_t *)pmd;
> >> +	if (pmd_huge(pmd) || !pmd_present(pmd))
> >> +		return (pte_t *)pmdp;
> >>  
> >>  	return NULL;
> >>  }
> >> -- 
> >> 1.8.3.1
> >>
> >>
> > 
> > .
> > 
> 
> 
> -- 
> Regards,
> Longpeng(Mike)
>
Longpeng(Mike) Feb. 19, 2020, 2:09 a.m. UTC | #6
On 2020/2/19 4:52, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 08:10:25PM +0800, Longpeng(Mike) wrote:
>>  {
>> -	pgd_t *pgd;
>> -	p4d_t *p4d;
>> -	pud_t *pud;
>> -	pmd_t *pmd;
>> +	pgd_t *pgdp;
>> +	p4d_t *p4dp;
>> +	pud_t *pudp, pud;
>> +	pmd_t *pmdp, pmd;
> 
> Renaming the variables as part of a fix is a really bad idea.  It obscures
> the actual fix and makes everybody's life harder.  Plus, it's not even
> renaming to follow the normal convention -- there are only two places
> (migrate.c and gup.c) which follow this pattern in mm/ while there are
> 33 that do not.
> 
Good suggestion, I had never noticed this, thanks.
By the way, could you give an example of how you would write the fix that way?

> 
> .
>
Mike Kravetz Feb. 19, 2020, 3:49 a.m. UTC | #7
On 2/18/20 6:09 PM, Longpeng (Mike) wrote:
> On 2020/2/19 4:52, Matthew Wilcox wrote:
>> On Tue, Feb 18, 2020 at 08:10:25PM +0800, Longpeng(Mike) wrote:
>>>  {
>>> -	pgd_t *pgd;
>>> -	p4d_t *p4d;
>>> -	pud_t *pud;
>>> -	pmd_t *pmd;
>>> +	pgd_t *pgdp;
>>> +	p4d_t *p4dp;
>>> +	pud_t *pudp, pud;
>>> +	pmd_t *pmdp, pmd;
>>
>> Renaming the variables as part of a fix is a really bad idea.  It obscures
>> the actual fix and makes everybody's life harder.  Plus, it's not even
>> renaming to follow the normal convention -- there are only two places
>> (migrate.c and gup.c) which follow this pattern in mm/ while there are
>> 33 that do not.
>>
> Good suggestion, I've never noticed this, thanks.
> By the way, could you give an example if we use this way to fix the bug?

Matthew and others may have better suggestions for naming.  However, I would
keep the existing names and add:

pud_t pud_entry;
pmd_t pmd_entry;

Then the *_entry variables are the target of the READ_ONCE()

pud_entry = READ_ONCE(*pud);
if (sz != PUD_SIZE && pud_none(pud_entry))
...
...
pmd_entry = READ_ONCE(*pmd);
if (sz != PMD_SIZE && pmd_none(pmd_entry))
...
...

BTW, thank you for finding this issue!
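
For readers who want to see the suggested naming in a complete function,
here is a user-space mock of the lower half of the walk. It is a sketch
under stated assumptions, not the kernel's huge_pte_offset(): the types,
the sk_* helpers, and the choice to pass the pmd slot in directly are all
simplifications made so the example can run without real page tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock types and helpers; simplified stand-ins, not the kernel's definitions. */
typedef struct { uint64_t val; } pud_t;
typedef struct { uint64_t val; } pmd_t;

#define SK_PRESENT  0x001ull
#define SK_HUGE     0x080ull
#define SK_PUD_SIZE (1ull << 30)
#define SK_PMD_SIZE (1ull << 21)
#define SK_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

static int sk_none(uint64_t v)    { return v == 0; }
static int sk_present(uint64_t v) { return (v & SK_PRESENT) != 0; }
static int sk_huge(uint64_t v)    { return (v & SK_HUGE) != 0; }

/*
 * 'pud' and 'pmd' stay pointers, as in the existing code; the snapshots
 * get the *_entry names and are the only values the predicates look at.
 */
static void *sk_huge_pte_offset(pud_t *pud, pmd_t *pmd, uint64_t sz)
{
	pud_t pud_entry;
	pmd_t pmd_entry;

	assert(pud != NULL && pmd != NULL);

	pud_entry.val = SK_READ_ONCE(pud->val);
	if (sz != SK_PUD_SIZE && sk_none(pud_entry.val))
		return NULL;
	/* hugepage or swap? */
	if (sk_huge(pud_entry.val) || !sk_present(pud_entry.val))
		return pud;

	pmd_entry.val = SK_READ_ONCE(pmd->val);
	if (sz != SK_PMD_SIZE && sk_none(pmd_entry.val))
		return NULL;
	/* hugepage or swap? */
	if (sk_huge(pmd_entry.val) || !sk_present(pmd_entry.val))
		return pmd;

	return NULL;
}
```

Keeping the pointer names unchanged means the fix itself only adds the two
snapshot variables and switches the predicates over to them.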
Longpeng(Mike) Feb. 19, 2020, 12:21 p.m. UTC | #8
On 2020/2/19 9:58, Sean Christopherson wrote:
> On Wed, Feb 19, 2020 at 09:39:59AM +0800, Longpeng (Mike) wrote:
>> On 2020/2/19 4:37, Sean Christopherson wrote:
>>> On Tue, Feb 18, 2020 at 08:10:25PM +0800, Longpeng(Mike) wrote:
>>>> Our machine encountered a panic after run for a long time and
>>>> the calltrace is:
>>>
>>> What's the actual panic?  Is it a BUG() in hugetlb_fault(), a bad pointer
>>> dereference, etc...?
>>>
>> A bad pointer dereference.
>>
>> pgd -> pud -> user 1G hugepage
>> huge_pte_offset() wants to return NULL or pud (point to the entry), but it maybe
>> return the a bad pointer of the user 1G hugepage.
>>
>>>> RIP: 0010:[<ffffffff9dff0587>]  [<ffffffff9dff0587>] hugetlb_fault+0x307/0xbe0
>>>> RSP: 0018:ffff9567fc27f808  EFLAGS: 00010286
>>>> RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48
>>>> RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48
>>>> RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080
>>>> R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8
>>>> R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074
>>>> FS:  00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000
>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0
>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>> Call Trace:
>>>>  [<ffffffff9df9b71b>] ? unlock_page+0x2b/0x30
>>>>  [<ffffffff9dff04a2>] ? hugetlb_fault+0x222/0xbe0
>>>>  [<ffffffff9dff1405>] follow_hugetlb_page+0x175/0x540
>>>>  [<ffffffff9e15b825>] ? cpumask_next_and+0x35/0x50
>>>>  [<ffffffff9dfc7230>] __get_user_pages+0x2a0/0x7e0
>>>>  [<ffffffff9dfc648d>] __get_user_pages_unlocked+0x15d/0x210
>>>>  [<ffffffffc068cfc5>] __gfn_to_pfn_memslot+0x3c5/0x460 [kvm]
>>>>  [<ffffffffc06b28be>] try_async_pf+0x6e/0x2a0 [kvm]
>>>>  [<ffffffffc06b4b41>] tdp_page_fault+0x151/0x2d0 [kvm]
>>>>  [<ffffffffc075731c>] ? vmx_vcpu_run+0x2ec/0xc80 [kvm_intel]
>>>>  [<ffffffffc0757328>] ? vmx_vcpu_run+0x2f8/0xc80 [kvm_intel]
>>>>  [<ffffffffc06abc11>] kvm_mmu_page_fault+0x31/0x140 [kvm]
>>>>  [<ffffffffc074d1ae>] handle_ept_violation+0x9e/0x170 [kvm_intel]
>>>>  [<ffffffffc075579c>] vmx_handle_exit+0x2bc/0xc70 [kvm_intel]
>>>>  [<ffffffffc074f1a0>] ? __vmx_complete_interrupts.part.73+0x80/0xd0 [kvm_intel]
>>>>  [<ffffffffc07574c0>] ? vmx_vcpu_run+0x490/0xc80 [kvm_intel]
>>>>  [<ffffffffc069f3be>] vcpu_enter_guest+0x7be/0x13a0 [kvm]
>>>>  [<ffffffffc06cf53e>] ? kvm_check_async_pf_completion+0x8e/0xb0 [kvm]
>>>>  [<ffffffffc06a6f90>] kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm]
>>>>  [<ffffffffc068d919>] kvm_vcpu_ioctl+0x309/0x6d0 [kvm]
>>>>  [<ffffffff9deaa8c2>] ? dequeue_signal+0x32/0x180
>>>>  [<ffffffff9deae34d>] ? do_sigtimedwait+0xcd/0x230
>>>>  [<ffffffff9e03aed0>] do_vfs_ioctl+0x3f0/0x540
>>>>  [<ffffffff9e03b0c1>] SyS_ioctl+0xa1/0xc0
>>>>  [<ffffffff9e53879b>] system_call_fastpath+0x22/0x27
>>>>
>>>> ( The kernel we used is older, but after digging into this problem we think
>>>>   the latest kernel also has this bug. )
>>>>
>>>> For 1G hugepages, huge_pte_offset() wants to return NULL or pudp, but it
>>>> may return a wrong 'pmdp' if there is a race. Please look at the following
>>>> code snippet:
>>>>     ...
>>>>     pud = pud_offset(p4d, addr);
>>>>     if (sz != PUD_SIZE && pud_none(*pud))
>>>>         return NULL;
>>>>     /* hugepage or swap? */
>>>>     if (pud_huge(*pud) || !pud_present(*pud))
>>>>         return (pte_t *)pud;
>>>>
>>>>     pmd = pmd_offset(pud, addr);
>>>>     if (sz != PMD_SIZE && pmd_none(*pmd))
>>>>         return NULL;
>>>>     /* hugepage or swap? */
>>>>     if (pmd_huge(*pmd) || !pmd_present(*pmd))
>>>>         return (pte_t *)pmd;
>>>>     ...
>>>>
>>>> The following sequence would trigger this bug:
>>>> 1. CPU0: sz = PUD_SIZE and *pud = 0, continue
>>>> 2. CPU0: "pud_huge(*pud)" is false
>>>> 3. CPU1: calls hugetlb_no_page and sets *pud to xxxx8e7 (PRESENT)
>>>> 4. CPU0: "!pud_present(*pud)" is false, continue
>>>> 5. CPU0: pmd = pmd_offset(pud, addr), which may return a wrong pmdp
>>>> However, we want CPU0 to return NULL or pudp.
>>>>
>>>> We can avoid this race by reading the pud only once.
>>>
>>> Are there any other options for avoiding the panic you hit?  I ask because
>>> there are a variety of flows that use a very similar code pattern, e.g.
>>> lookup_address_in_pgd(), and using READ_ONCE() in huge_pte_offset() but not
>>> other flows could be confusing (or in my case, anxiety inducing[*]).  At
>>> the least, adding a comment in huge_pte_offset() to explain the need for
>>> READ_ONCE() would be helpful.
>>>
>> I hope the hugetlb and mm maintainers can suggest some other options if they
>> confirm this bug.
> 
> The race and the fix make sense.  I assumed dereferencing garbage from the
> huge page was the issue, but I wasn't 100% that was the case, which is why
> I asked about alternative fixes.
> 
>> We change the code from
>> 	if (pud_huge(*pud) || !pud_present(*pud))
>> to
>> 	if (pud_huge(*pud))
>> 		return (pte_t *)pud;
>> 	/* busy loop for 500ms */
>> 	if (!pud_present(*pud))
>> 		return (pte_t *)pud;
>> and the panic will be hit quickly.
>>
>> ARM64 already uses READ/WRITE_ONCE to access the page tables; see
>> commit 20a004e7 (arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables).
>>
>> The root cause is that 'if (pud_huge(*pud) || !pud_present(*pud))' reads the
>> entry from pud twice, and *pud may change in between because of a race, so we
>> should read the pud only once. I use READ_ONCE here just to be safe, to prevent
>> any compiler mischief.
> 
> FWIW, I'd be in favor of going the READ/WRITE_ONCE() route for x86, e.g.
> convert everything as a follow-up patch (or patches).  I'm fairly confident
> that KVM's usage of lookup_address_in_mm() is safe, but I wouldn't exactly
> bet my life on it.  I'd much rather the failing scenario be that KVM uses
> a sub-optimal page size as opposed to exploding on a bad pointer.
> 
Um... our test case starts 50 VMs with 2U4G (using 1G hugepages) and then does a
live-upgrade (a private feature that only modifies QEMU and libvirt) and a
live-migrate of each one in turn. However, our live-upgraded QEMU won't do
touch_all_pages.
Suppose we start a VM without touch_all_pages in QEMU: the VM's guest memory is
not yet mapped in the CR3 page table. When the 2 vCPUs run, they can access
pages that belong to the same 1G hugepage. Both of them will vmexit due to an
EPT violation and then call gup-->follow_hugetlb_page-->hugetlb_fault, so the
race can occur, right?

>> I'll add comments in v2.
>>
>>> [*] In kernel 5.6, KVM is moving to using lookup_address_in_pgd() (via
>>>     lookup_address_in_mm()) to identify large page mappings.  The function
>>>     itself is susceptible to such a race, but KVM only does the lookup
>>>     after it has done gup() and also ensures any zapping of ptes will cause
>>>     KVM to restart the faulting (guest) instruction or that the zap will be
>>>     blocked until after KVM does the lookup, i.e. racing with a transition
>>>     from !PRESENT -> PRESENT should be impossible (in theory).
>>>
>> This bug is in the hugetlb core; other users could trigger it even if the
>> latest KVM won't.
> 
> I was actually worried about the opposite, introducing a bug by moving to
> lookup_address_in_mm().
> 
>>>> Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
>>>> ---
>>>>  mm/hugetlb.c | 34 ++++++++++++++++++----------------
>>>>  1 file changed, 18 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index dd8737a..3bde229 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -4908,31 +4908,33 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
>>>>  pte_t *huge_pte_offset(struct mm_struct *mm,
>>>>  		       unsigned long addr, unsigned long sz)
>>>>  {
>>>> -	pgd_t *pgd;
>>>> -	p4d_t *p4d;
>>>> -	pud_t *pud;
>>>> -	pmd_t *pmd;
>>>> +	pgd_t *pgdp;
>>>> +	p4d_t *p4dp;
>>>> +	pud_t *pudp, pud;
>>>> +	pmd_t *pmdp, pmd;
>>>>  
>>>> -	pgd = pgd_offset(mm, addr);
>>>> -	if (!pgd_present(*pgd))
>>>> +	pgdp = pgd_offset(mm, addr);
>>>> +	if (!pgd_present(*pgdp))
>>>>  		return NULL;
>>>> -	p4d = p4d_offset(pgd, addr);
>>>> -	if (!p4d_present(*p4d))
>>>> +	p4dp = p4d_offset(pgdp, addr);
>>>> +	if (!p4d_present(*p4dp))
>>>>  		return NULL;
>>>>  
>>>> -	pud = pud_offset(p4d, addr);
>>>> -	if (sz != PUD_SIZE && pud_none(*pud))
>>>> +	pudp = pud_offset(p4dp, addr);
>>>> +	pud = READ_ONCE(*pudp);
>>>> +	if (sz != PUD_SIZE && pud_none(pud))
>>>>  		return NULL;
>>>>  	/* hugepage or swap? */
>>>> -	if (pud_huge(*pud) || !pud_present(*pud))
>>>> -		return (pte_t *)pud;
>>>> +	if (pud_huge(pud) || !pud_present(pud))
>>>> +		return (pte_t *)pudp;
>>>>  
>>>> -	pmd = pmd_offset(pud, addr);
>>>> -	if (sz != PMD_SIZE && pmd_none(*pmd))
>>>> +	pmdp = pmd_offset(pudp, addr);
>>>> +	pmd = READ_ONCE(*pmdp);
>>>> +	if (sz != PMD_SIZE && pmd_none(pmd))
>>>>  		return NULL;
>>>>  	/* hugepage or swap? */
>>>> -	if (pmd_huge(*pmd) || !pmd_present(*pmd))
>>>> -		return (pte_t *)pmd;
>>>> +	if (pmd_huge(pmd) || !pmd_present(pmd))
>>>> +		return (pte_t *)pmdp;
>>>>  
>>>>  	return NULL;
>>>>  }
>>>> -- 
>>>> 1.8.3.1
>>>>
>>>>
>>>
>>> .
>>>
>>
>>
>> -- 
>> Regards,
>> Longpeng(Mike)
>>
>
Longpeng(Mike) Feb. 19, 2020, 12:52 p.m. UTC | #9
On 2020/2/19 11:49, Mike Kravetz wrote:
> On 2/18/20 6:09 PM, Longpeng (Mike) wrote:
>> On 2020/2/19 4:52, Matthew Wilcox wrote:
>>> On Tue, Feb 18, 2020 at 08:10:25PM +0800, Longpeng(Mike) wrote:
>>>>  {
>>>> -	pgd_t *pgd;
>>>> -	p4d_t *p4d;
>>>> -	pud_t *pud;
>>>> -	pmd_t *pmd;
>>>> +	pgd_t *pgdp;
>>>> +	p4d_t *p4dp;
>>>> +	pud_t *pudp, pud;
>>>> +	pmd_t *pmdp, pmd;
>>>
>>> Renaming the variables as part of a fix is a really bad idea.  It obscures
>>> the actual fix and makes everybody's life harder.  Plus, it's not even
>>> renaming to follow the normal convention -- there are only two places
>>> (migrate.c and gup.c) which follow this pattern in mm/ while there are
>>> 33 that do not.
>>>
>> Good suggestion, I've never noticed this, thanks.
>> By the way, could you give an example if we use this way to fix the bug?
> 
> Matthew and others may have better suggestions for naming.  However, I would
> keep the existing names and add:
> 
> pud_t pud_entry;
> pmd_t pmd_entry;
> 
> Then the *_entry variables are the target of the READ_ONCE()
> 
> pud_entry = READ_ONCE(*pud);
> if (sz != PUD_SIZE && pud_none(pud_entry))
> ...
> ...
> pmd_entry = READ_ONCE(*pmd);
> if (sz != PMD_SIZE && pmd_none(pmd_entry))
> ...
> ...
> 
Uh, looks much better.
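
To check my understanding of the naming, here is a minimal userspace sketch of
the walk with pud_entry/pmd_entry as the snapshot variables. It is only a
model under stated assumptions: every model_* name, the MODEL_* bits and the
separate child[] array are hypothetical stand-ins, not kernel types or
helpers, and model_read_once() only approximates READ_ONCE() with a volatile
load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define MODEL_PRESENT 0x001ULL	/* entry maps something */
#define MODEL_HUGE    0x080ULL	/* entry maps a huge page directly */
#define MODEL_ENTRIES 4

enum model_size { MODEL_PUD_SIZE, MODEL_PMD_SIZE };

struct model_pmd_table {
	uint64_t pmd[MODEL_ENTRIES];
};

struct model_pud_table {
	uint64_t pud[MODEL_ENTRIES];
	/* Stand-in for the next-level address a real entry would encode. */
	struct model_pmd_table *child[MODEL_ENTRIES];
};

/* Load the entry exactly once; a rough stand-in for READ_ONCE(). */
static uint64_t model_read_once(const uint64_t *p)
{
	return *(const volatile uint64_t *)p;
}

static uint64_t *model_huge_pte_offset(struct model_pud_table *t,
				       unsigned int pud_idx,
				       unsigned int pmd_idx,
				       enum model_size sz)
{
	uint64_t *pud, *pmd;
	uint64_t pud_entry, pmd_entry;

	assert(pud_idx < MODEL_ENTRIES && pmd_idx < MODEL_ENTRIES);

	pud = &t->pud[pud_idx];
	pud_entry = model_read_once(pud);
	if (sz != MODEL_PUD_SIZE && pud_entry == 0)
		return NULL;
	/* hugepage or swap? Both tests use the same snapshot. */
	if ((pud_entry & MODEL_HUGE) || !(pud_entry & MODEL_PRESENT))
		return pud;

	assert(t->child[pud_idx] != NULL);
	pmd = &t->child[pud_idx]->pmd[pmd_idx];
	pmd_entry = model_read_once(pmd);
	if (sz != MODEL_PMD_SIZE && pmd_entry == 0)
		return NULL;
	/* hugepage or swap? */
	if ((pmd_entry & MODEL_HUGE) || !(pmd_entry & MODEL_PRESENT))
		return pmd;

	return NULL;
}
```

The pointer names (pud, pmd) stay as they are today, so the diff for the real
function would only add the two *_entry variables and the loads.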

BTW, I missed one of your emails in my mail client, but I found it on lkml.org.
'''
I too would like some more information on the panic.
If your analysis is correct, then I would expect the 'ptep' returned by
huge_pte_offset() to not point to a pte but rather some random address.
This is because the 'pmd' calculated by pmd_offset(pud, addr) is not
really the address of a pmd.  So, perhaps there is an addressing exception
at huge_ptep_get() near the beginning of hugetlb_fault()?

	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
	if (ptep) {
		entry = huge_ptep_get(ptep);
		...
'''
Yep, your analysis above is the same as mine: we got a 'dummy pmd', and
dereferencing it accessed a bad address.

What's your opinion on how to fix this problem? It is not only
huge_pte_offset(); some other places have the same problem (e.g.
lookup_address_in_pgd()).

> BTW, thank you for finding this issue!
>
Sean Christopherson Feb. 19, 2020, 4:22 p.m. UTC | #10
On Wed, Feb 19, 2020 at 08:21:26PM +0800, Longpeng (Mike) wrote:
> On 2020/2/19 9:58, Sean Christopherson wrote:
> > FWIW, I'd be in favor of going the READ/WRITE_ONCE() route for x86, e.g.
> > convert everything as a follow-up patch (or patches).  I'm fairly confident
> > that KVM's usage of lookup_address_in_mm() is safe, but I wouldn't exactly
> > bet my life on it.  I'd much rather the failing scenario be that KVM uses
> > a sub-optimal page size as opposed to exploding on a bad pointer.
> > 
> Um...our testcase starts 50 VMs with 2U4G(use 1G hugepage) and then do
> live-upgrade(private feature that just modify the qemu and libvirt) and
> live-migrate in turns for each one. However our live upgraded new QEMU won't do
> touch_all_pages.
> Suppose we start a VM without touch_all_pages in QEMU, the VM's guest memory is
> not mapped in the CR3 pagetable at the moment. When the 2 vcpus running, they
> could access some pages belong to the same 1G-hugepage, both of them will vmexit
> due to ept_violation and then call gup-->follow_hugetlb_page-->hugetlb_fault, so
> the race may encounter, right?

Yep.  The code I'm referring to is similar but different code that just
happened to go into KVM for kernel 5.6.  It has no effect on the gup() flow
that leads to this bug.  I mentioned it above as an example of code outside
of hugetlb_fault() that would also benefit from moving to READ/WRITE_ONCE().
Mike Kravetz Feb. 19, 2020, 7:33 p.m. UTC | #11
+ Kirill
On 2/18/20 5:58 PM, Sean Christopherson wrote:
> On Wed, Feb 19, 2020 at 09:39:59AM +0800, Longpeng (Mike) wrote:
<snip>
> The race and the fix make sense.  I assumed dereferencing garbage from the
> huge page was the issue, but I wasn't 100% that was the case, which is why
> I asked about alternative fixes.
> 
>> We change the code from
>> 	if (pud_huge(*pud) || !pud_present(*pud))
>> to
>> 	if (pud_huge(*pud)
>> 		return (pte_t *)pud;
>> 	busy loop for 500ms
>> 	if (!pud_present(*pud))
>> 		return (pte_t *)pud;
>> and the panic will be hit quickly.
>>
>> ARM64 has already use READ/WRITE_ONCE to access the pagetable, look at this
>> commit 20a004e7 (arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables).
>>
>> The root cause is: 'if (pud_huge(*pud) || !pud_present(*pud))' read entry from
>> pud twice and the *pud maybe change in a race, so if we only read the pud once.
>> I use READ_ONCE here is just for safe, to prevents the complier mischief if
>> possible.
> 
> FWIW, I'd be in favor of going the READ/WRITE_ONCE() route for x86, e.g.
> convert everything as a follow-up patch (or patches).  I'm fairly confident
> that KVM's usage of lookup_address_in_mm() is safe, but I wouldn't exactly
> bet my life on it.  I'd much rather the failing scenario be that KVM uses
> a sub-optimal page size as opposed to exploding on a bad pointer.

Longpeng(Mike) asked in another e-mail specifically about making similar
changes to lookup_address_in_mm().  Replying here as there is more context.

I 'think' lookup_address_in_mm is safe from this issue.  Why?  IIUC, the
problem with the huge_pte_offset routine is that the pud changes from
pud_none() to pud_huge() in the middle of
'if (pud_huge(*pud) || !pud_present(*pud))'.  In the case of
lookup_address_in_mm, we know pud was not pud_none() as it was previously
checked.  I am not aware of any other state transitions which could cause
us trouble.  However, I am no expert in this area.
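
To make the none -> huge transition concrete, here is a small deterministic
userspace sketch. It is a model under stated assumptions, not kernel code:
struct model_racy_pud, model_load() and the MODEL_* bits are hypothetical
names, and the "other CPU" is simulated by having the first load return one
value and every later load return another.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PRESENT 0x001ULL
#define MODEL_HUGE    0x080ULL

enum model_walk { MODEL_RETURN_PUD, MODEL_DESCEND_TO_PMD };

/*
 * Hypothetical stand-in for a pud slot that another CPU rewrites between
 * two loads: the first load sees 'before', every later load sees 'after'.
 */
struct model_racy_pud {
	uint64_t before;
	uint64_t after;
	unsigned int loads;
};

static uint64_t model_load(struct model_racy_pud *p)
{
	return p->loads++ == 0 ? p->before : p->after;
}

/*
 * Mirrors 'if (pud_huge(*pud) || !pud_present(*pud))' for sz == PUD_SIZE:
 * each test performs its own load, so the two tests can disagree.
 */
static enum model_walk model_walk_two_loads(struct model_racy_pud *p)
{
	if ((model_load(p) & MODEL_HUGE) || !(model_load(p) & MODEL_PRESENT))
		return MODEL_RETURN_PUD;
	return MODEL_DESCEND_TO_PMD;
}

/* Same tests against a single snapshot of the entry. */
static enum model_walk model_walk_one_load(struct model_racy_pud *p)
{
	uint64_t pud_entry = model_load(p);

	if ((pud_entry & MODEL_HUGE) || !(pud_entry & MODEL_PRESENT))
		return MODEL_RETURN_PUD;
	return MODEL_DESCEND_TO_PMD;
}
```

With before = 0 and after = 0x8e7 (the value from the report), the two-load
version sees "not huge" and then "present" and goes on to the pmd level,
while the one-load version returns the pud whichever value it happens to see.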
Longpeng(Mike) Feb. 20, 2020, 2:30 a.m. UTC | #12
On 2020/2/20 3:33, Mike Kravetz wrote:
> + Kirill
> On 2/18/20 5:58 PM, Sean Christopherson wrote:
>> On Wed, Feb 19, 2020 at 09:39:59AM +0800, Longpeng (Mike) wrote:
> <snip>
>> The race and the fix make sense.  I assumed dereferencing garbage from the
>> huge page was the issue, but I wasn't 100% that was the case, which is why
>> I asked about alternative fixes.
>>
>>> We change the code from
>>> 	if (pud_huge(*pud) || !pud_present(*pud))
>>> to
>>> 	if (pud_huge(*pud)
>>> 		return (pte_t *)pud;
>>> 	busy loop for 500ms
>>> 	if (!pud_present(*pud))
>>> 		return (pte_t *)pud;
>>> and the panic will be hit quickly.
>>>
>>> ARM64 has already use READ/WRITE_ONCE to access the pagetable, look at this
>>> commit 20a004e7 (arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables).
>>>
>>> The root cause is: 'if (pud_huge(*pud) || !pud_present(*pud))' read entry from
>>> pud twice and the *pud maybe change in a race, so if we only read the pud once.
>>> I use READ_ONCE here is just for safe, to prevents the complier mischief if
>>> possible.
>>
>> FWIW, I'd be in favor of going the READ/WRITE_ONCE() route for x86, e.g.
>> convert everything as a follow-up patch (or patches).  I'm fairly confident
>> that KVM's usage of lookup_address_in_mm() is safe, but I wouldn't exactly
>> bet my life on it.  I'd much rather the failing scenario be that KVM uses
>> a sub-optimal page size as opposed to exploding on a bad pointer.
> 
> Longpeng(Mike) asked in another e-mail specifically about making similar
> changes to lookup_address_in_mm().  Replying here as there is more context.
> 
> I 'think' lookup_address_in_mm is safe from this issue.  Why?  IIUC, the
> problem with the huge_pte_offset routine is that the pud changes from
> pud_none() to pud_huge() in the middle of
> 'if (pud_huge(*pud) || !pud_present(*pud))'.  In the case of
> lookup_address_in_mm, we know pud was not pud_none() as it was previously
> checked.  I am not aware of any other state transitions which could cause
> us trouble.  However, I am no expert in this area.
> 
So... I just need to fix huge_pte_offset() in mm/hugetlb.c, right?

Is it possible for the pud to change from pud_huge() to pud_none() while another
CPU is walking the page table?
Longpeng(Mike) Feb. 20, 2020, 2:32 a.m. UTC | #13
On 2020/2/20 0:22, Sean Christopherson wrote:
> On Wed, Feb 19, 2020 at 08:21:26PM +0800, Longpeng (Mike) wrote:
>> On 2020/2/19 9:58, Sean Christopherson wrote:
>>> FWIW, I'd be in favor of going the READ/WRITE_ONCE() route for x86, e.g.
>>> convert everything as a follow-up patch (or patches).  I'm fairly confident
>>> that KVM's usage of lookup_address_in_mm() is safe, but I wouldn't exactly
>>> bet my life on it.  I'd much rather the failing scenario be that KVM uses
>>> a sub-optimal page size as opposed to exploding on a bad pointer.
>>>
>> Um...our testcase starts 50 VMs with 2U4G(use 1G hugepage) and then do
>> live-upgrade(private feature that just modify the qemu and libvirt) and
>> live-migrate in turns for each one. However our live upgraded new QEMU won't do
>> touch_all_pages.
>> Suppose we start a VM without touch_all_pages in QEMU, the VM's guest memory is
>> not mapped in the CR3 pagetable at the moment. When the 2 vcpus running, they
>> could access some pages belong to the same 1G-hugepage, both of them will vmexit
>> due to ept_violation and then call gup-->follow_hugetlb_page-->hugetlb_fault, so
>> the race may encounter, right?
> 
> Yep.  The code I'm referring to is similar but different code that just
> happened to go into KVM for kernel 5.6.  It has no effect on the gup() flow
> that leads to this bug.  I mentioned it above as an example of code outside
> of hugetlb_fault() that would also benefit from moving to READ/WRITE_ONCE().
> 
> 
I understand better now, thanks for your patience. :)
Mike Kravetz Feb. 21, 2020, 12:22 a.m. UTC | #14
On 2/19/20 6:30 PM, Longpeng (Mike) wrote:
> On 2020/2/20 3:33, Mike Kravetz wrote:
>> + Kirill
>> On 2/18/20 5:58 PM, Sean Christopherson wrote:
>>> On Wed, Feb 19, 2020 at 09:39:59AM +0800, Longpeng (Mike) wrote:
<snip>
>>> The race and the fix make sense.  I assumed dereferencing garbage from the
>>> huge page was the issue, but I wasn't 100% that was the case, which is why
>>> I asked about alternative fixes.
>>>
>>>> We change the code from
>>>> 	if (pud_huge(*pud) || !pud_present(*pud))
>>>> to
>>>> 	if (pud_huge(*pud)
>>>> 		return (pte_t *)pud;
>>>> 	busy loop for 500ms
>>>> 	if (!pud_present(*pud))
>>>> 		return (pte_t *)pud;
>>>> and the panic will be hit quickly.
>>>>
>>>> ARM64 has already use READ/WRITE_ONCE to access the pagetable, look at this
>>>> commit 20a004e7 (arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables).
>>>>
>>>> The root cause is: 'if (pud_huge(*pud) || !pud_present(*pud))' read entry from
>>>> pud twice and the *pud maybe change in a race, so if we only read the pud once.
>>>> I use READ_ONCE here is just for safe, to prevents the complier mischief if
>>>> possible.
>>>
>>> FWIW, I'd be in favor of going the READ/WRITE_ONCE() route for x86, e.g.
>>> convert everything as a follow-up patch (or patches).  I'm fairly confident
>>> that KVM's usage of lookup_address_in_mm() is safe, but I wouldn't exactly
>>> bet my life on it.  I'd much rather the failing scenario be that KVM uses
>>> a sub-optimal page size as opposed to exploding on a bad pointer.
>>
>> Longpeng(Mike) asked in another e-mail specifically about making similar
>> changes to lookup_address_in_mm().  Replying here as there is more context.
>>
>> I 'think' lookup_address_in_mm is safe from this issue.  Why?  IIUC, the
>> problem with the huge_pte_offset routine is that the pud changes from
>> pud_none() to pud_huge() in the middle of
>> 'if (pud_huge(*pud) || !pud_present(*pud))'.  In the case of
>> lookup_address_in_mm, we know pud was not pud_none() as it was previously
>> checked.  I am not aware of any other state transitions which could cause
>> us trouble.  However, I am no expert in this area.

Bad copy/paste by me.  Longpeng(Mike) was asking about lookup_address_in_pgd.

> So... I need just fix huge_pte_offset in mm/hugetlb.c, right?

Let's start with just a fix for huge_pte_offset() as you can easily reproduce
that issue by adding a delay.

> Is it possible the pud changes from pud_huge() to pud_none() while another CPU
> is walking the pagetable ?

I believe it is possible.  If we hole punch a hugetlbfs file, we will clear
the corresponding pud's.  Hence, we can go from pud_huge() to pud_none().
Unless I am missing something, that does imply we could have issues in places
such as lookup_address_in_pgd:

	pud = pud_offset(p4d, address);
	if (pud_none(*pud))
		return NULL;

	*level = PG_LEVEL_1G;
	if (pud_large(*pud) || !pud_present(*pud))
		return (pte_t *)pud;

I hope I am wrong, but it seems like pud_none(*pud) could become true after
the initial check, and before the (pud_large) check.  If so, there could be
a problem (addressing exception) when the code continues and looks up the pmd.

	pmd = pmd_offset(pud, address);
	if (pmd_none(*pmd))
		return NULL;

It has been mentioned before that there are many page table walks like this.
What am I missing that prevents races like this?  Or, have we just been lucky?
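
For discussion, a minimal userspace sketch of what a snapshot-based pud step
could look like for this kind of walk. Everything here is an assumption for
illustration: model_classify_pud(), model_lookup_pud() and the MODEL_* bits
are hypothetical names, and the volatile load only approximates READ_ONCE().
The point is that "none", "leaf" and "table" are decided from one load, so a
concurrent clear of the entry cannot make the three checks disagree.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PRESENT 0x001ULL
#define MODEL_LARGE   0x080ULL

enum model_pud_state {
	MODEL_PUD_NONE,		/* the walk returns NULL */
	MODEL_PUD_LEAF,		/* the walk returns the pud (large or not present) */
	MODEL_PUD_TABLE,	/* only in this state would the walk go on to the pmd */
};

/* Classify one snapshot; exactly one of the three states applies. */
static enum model_pud_state model_classify_pud(uint64_t pud_entry)
{
	if (pud_entry == 0)
		return MODEL_PUD_NONE;
	if ((pud_entry & MODEL_LARGE) || !(pud_entry & MODEL_PRESENT))
		return MODEL_PUD_LEAF;
	return MODEL_PUD_TABLE;
}

static enum model_pud_state model_lookup_pud(const uint64_t *pud)
{
	/* One load; later changes to *pud cannot alter this decision. */
	uint64_t pud_entry = *(const volatile uint64_t *)pud;

	return model_classify_pud(pud_entry);
}
```

This says nothing about whether the table a pud points to can go away under
the walker; it only removes the window between the individual checks.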
Longpeng(Mike) Feb. 22, 2020, 2:15 a.m. UTC | #15
On 2020/2/21 8:22, Mike Kravetz wrote:
> On 2/19/20 6:30 PM, Longpeng (Mike) wrote:
>> On 2020/2/20 3:33, Mike Kravetz wrote:
>>> + Kirill
>>> On 2/18/20 5:58 PM, Sean Christopherson wrote:
>>>> On Wed, Feb 19, 2020 at 09:39:59AM +0800, Longpeng (Mike) wrote:
> <snip>
>>>> The race and the fix make sense.  I assumed dereferencing garbage from the
>>>> huge page was the issue, but I wasn't 100% that was the case, which is why
>>>> I asked about alternative fixes.
>>>>
>>>>> We change the code from
>>>>> 	if (pud_huge(*pud) || !pud_present(*pud))
>>>>> to
>>>>> 	if (pud_huge(*pud)
>>>>> 		return (pte_t *)pud;
>>>>> 	busy loop for 500ms
>>>>> 	if (!pud_present(*pud))
>>>>> 		return (pte_t *)pud;
>>>>> and the panic will be hit quickly.
>>>>>
>>>>> ARM64 has already use READ/WRITE_ONCE to access the pagetable, look at this
>>>>> commit 20a004e7 (arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables).
>>>>>
>>>>> The root cause is: 'if (pud_huge(*pud) || !pud_present(*pud))' read entry from
>>>>> pud twice and the *pud maybe change in a race, so if we only read the pud once.
>>>>> I use READ_ONCE here is just for safe, to prevents the complier mischief if
>>>>> possible.
>>>>
>>>> FWIW, I'd be in favor of going the READ/WRITE_ONCE() route for x86, e.g.
>>>> convert everything as a follow-up patch (or patches).  I'm fairly confident
>>>> that KVM's usage of lookup_address_in_mm() is safe, but I wouldn't exactly
>>>> bet my life on it.  I'd much rather the failing scenario be that KVM uses
>>>> a sub-optimal page size as opposed to exploding on a bad pointer.
>>>
>>> Longpeng(Mike) asked in another e-mail specifically about making similar
>>> changes to lookup_address_in_mm().  Replying here as there is more context.
>>>
>>> I 'think' lookup_address_in_mm is safe from this issue.  Why?  IIUC, the
>>> problem with the huge_pte_offset routine is that the pud changes from
>>> pud_none() to pud_huge() in the middle of
>>> 'if (pud_huge(*pud) || !pud_present(*pud))'.  In the case of
>>> lookup_address_in_mm, we know pud was not pud_none() as it was previously
>>> checked.  I am not aware of any other state transitions which could cause
>>> us trouble.  However, I am no expert in this area.
> 
> Bad copy/paste by me.  Longpeng(Mike) was asking about lookup_address_in_pgd.
> 
>> So... I just need to fix huge_pte_offset in mm/hugetlb.c, right?
> 
> Let's start with just a fix for huge_pte_offset() as you can easily reproduce
> that issue by adding a delay.
> 
>> Is it possible the pud changes from pud_huge() to pud_none() while another CPU
>> is walking the pagetable ?
> 
All right, I'll send V2 to fix it, thanks :)

> I believe it is possible.  If we hole punch a hugetlbfs file, we will clear
> the corresponding pud's.  Hence, we can go from pud_huge() to pud_none().
> Unless I am missing something, that does imply we could have issues in places
> such as lookup_address_in_pgd:
> 
> 	pud = pud_offset(p4d, address);
> 	if (pud_none(*pud))
> 		return NULL;
> 
> 	*level = PG_LEVEL_1G;
> 	if (pud_large(*pud) || !pud_present(*pud))
> 		return (pte_t *)pud;
> 
> I hope I am wrong, but it seems like pud_none(*pud) could become true after
> the initial check, and before the (pud_large) check.  If so, there could be
> a problem (addressing exception) when the code continues and looks up the pmd.
> 
> 	pmd = pmd_offset(pud, address);
> 	if (pmd_none(*pmd))
> 		return NULL;
> 
> It has been mentioned before that there are many page table walks like this.
> What am I missing that prevents races like this?  Or, have we just been lucky?
> 
That's what I worry about. Maybe there is no use case that hits it.
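
For reference, the huge_pte_offset() race described earlier in the thread (the
entry going from pud_none() to pud_huge() in the middle of
'if (pud_huge(*pud) || !pud_present(*pud))') can be modeled in userspace.  This
is only a sketch: entry_t, the E_* bits, the e_*() predicates, the racer hook
and both walk_*() functions are hypothetical stand-ins, not kernel code, and
the volatile load is merely an approximation of READ_ONCE().  The hook plays
the role of the 500ms busy loop used to reproduce the panic, and the model
covers only the huge-or-not-present check (as if sz == PUD_SIZE).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for pud_t; the bit layout is made up. */
typedef struct { uint64_t val; } entry_t;

#define E_PRESENT (1ULL << 0)
#define E_HUGE    (1ULL << 7)

enum walk_result { WALK_LEAF, WALK_DESCEND };

static inline bool e_present(entry_t e) { return (e.val & E_PRESENT) != 0; }
static inline bool e_huge(entry_t e)    { return e_present(e) && (e.val & E_HUGE) != 0; }

/* Invoked between the two checks to act as the racing CPU. */
typedef void (*racer_fn)(entry_t *slot);

/* Racing writer: an empty slot becomes a present huge mapping. */
void install_huge(entry_t *slot) { slot->val = E_PRESENT | E_HUGE; }

/* Models the old code: the slot is dereferenced once per predicate. */
enum walk_result walk_double_read(entry_t *slot, racer_fn racer)
{
	assert(slot != NULL);
	if (e_huge(*slot))		/* first read */
		return WALK_LEAF;
	if (racer)
		racer(slot);		/* slot may change here */
	if (!e_present(*slot))		/* second read */
		return WALK_LEAF;
	return WALK_DESCEND;		/* caller would treat the entry as a table */
}

/* Models the patched code: one snapshot, both predicates use the copy. */
enum walk_result walk_single_read(entry_t *slot, racer_fn racer)
{
	assert(slot != NULL);
	/* Approximation of READ_ONCE(): a single volatile load. */
	entry_t e = { *(const volatile uint64_t *)&slot->val };

	if (e_huge(e))
		return WALK_LEAF;
	if (racer)
		racer(slot);		/* slot changes, the snapshot does not */
	if (!e_present(e))
		return WALK_LEAF;
	return WALK_DESCEND;
}
```

Starting from an empty slot with install_huge as the hook, walk_double_read()
reports WALK_DESCEND (a huge entry would be misread as a lower-level table),
while walk_single_read() reports WALK_LEAF because both checks see the same
value.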

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dd8737a..3bde229 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4908,31 +4908,33 @@  pte_t *huge_pte_alloc(struct mm_struct *mm,
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz)
 {
-	pgd_t *pgd;
-	p4d_t *p4d;
-	pud_t *pud;
-	pmd_t *pmd;
+	pgd_t *pgdp;
+	p4d_t *p4dp;
+	pud_t *pudp, pud;
+	pmd_t *pmdp, pmd;
 
-	pgd = pgd_offset(mm, addr);
-	if (!pgd_present(*pgd))
+	pgdp = pgd_offset(mm, addr);
+	if (!pgd_present(*pgdp))
 		return NULL;
-	p4d = p4d_offset(pgd, addr);
-	if (!p4d_present(*p4d))
+	p4dp = p4d_offset(pgdp, addr);
+	if (!p4d_present(*p4dp))
 		return NULL;
 
-	pud = pud_offset(p4d, addr);
-	if (sz != PUD_SIZE && pud_none(*pud))
+	pudp = pud_offset(p4dp, addr);
+	pud = READ_ONCE(*pudp);
+	if (sz != PUD_SIZE && pud_none(pud))
 		return NULL;
 	/* hugepage or swap? */
-	if (pud_huge(*pud) || !pud_present(*pud))
-		return (pte_t *)pud;
+	if (pud_huge(pud) || !pud_present(pud))
+		return (pte_t *)pudp;
 
-	pmd = pmd_offset(pud, addr);
-	if (sz != PMD_SIZE && pmd_none(*pmd))
+	pmdp = pmd_offset(pudp, addr);
+	pmd = READ_ONCE(*pmdp);
+	if (sz != PMD_SIZE && pmd_none(pmd))
 		return NULL;
 	/* hugepage or swap? */
-	if (pmd_huge(*pmd) || !pmd_present(*pmd))
-		return (pte_t *)pmd;
+	if (pmd_huge(pmd) || !pmd_present(pmd))
+		return (pte_t *)pmdp;
 
 	return NULL;
 }