Message ID | 1582027825-112728-1-git-send-email-longpeng2@huawei.com (mailing list archive) |
---|---|
State | New, archived |
Series | mm/hugetlb: avoid get wrong ptep caused by race |
On Tue, Feb 18, 2020 at 08:10:25PM +0800, Longpeng(Mike) wrote:
> Our machine encountered a panic after run for a long time and
> the calltrace is:

What's the actual panic? Is it a BUG() in hugetlb_fault(), a bad pointer
dereference, etc...?

> RIP: 0010:[<ffffffff9dff0587>] [<ffffffff9dff0587>] hugetlb_fault+0x307/0xbe0
> RSP: 0018:ffff9567fc27f808 EFLAGS: 00010286
> RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48
> RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48
> RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080
> R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8
> R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074
> FS: 00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> [<ffffffff9df9b71b>] ? unlock_page+0x2b/0x30
> [<ffffffff9dff04a2>] ? hugetlb_fault+0x222/0xbe0
> [<ffffffff9dff1405>] follow_hugetlb_page+0x175/0x540
> [<ffffffff9e15b825>] ? cpumask_next_and+0x35/0x50
> [<ffffffff9dfc7230>] __get_user_pages+0x2a0/0x7e0
> [<ffffffff9dfc648d>] __get_user_pages_unlocked+0x15d/0x210
> [<ffffffffc068cfc5>] __gfn_to_pfn_memslot+0x3c5/0x460 [kvm]
> [<ffffffffc06b28be>] try_async_pf+0x6e/0x2a0 [kvm]
> [<ffffffffc06b4b41>] tdp_page_fault+0x151/0x2d0 [kvm]
> [<ffffffffc075731c>] ? vmx_vcpu_run+0x2ec/0xc80 [kvm_intel]
> [<ffffffffc0757328>] ? vmx_vcpu_run+0x2f8/0xc80 [kvm_intel]
> [<ffffffffc06abc11>] kvm_mmu_page_fault+0x31/0x140 [kvm]
> [<ffffffffc074d1ae>] handle_ept_violation+0x9e/0x170 [kvm_intel]
> [<ffffffffc075579c>] vmx_handle_exit+0x2bc/0xc70 [kvm_intel]
> [<ffffffffc074f1a0>] ? __vmx_complete_interrupts.part.73+0x80/0xd0 [kvm_intel]
> [<ffffffffc07574c0>] ? vmx_vcpu_run+0x490/0xc80 [kvm_intel]
> [<ffffffffc069f3be>] vcpu_enter_guest+0x7be/0x13a0 [kvm]
> [<ffffffffc06cf53e>] ? kvm_check_async_pf_completion+0x8e/0xb0 [kvm]
> [<ffffffffc06a6f90>] kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm]
> [<ffffffffc068d919>] kvm_vcpu_ioctl+0x309/0x6d0 [kvm]
> [<ffffffff9deaa8c2>] ? dequeue_signal+0x32/0x180
> [<ffffffff9deae34d>] ? do_sigtimedwait+0xcd/0x230
> [<ffffffff9e03aed0>] do_vfs_ioctl+0x3f0/0x540
> [<ffffffff9e03b0c1>] SyS_ioctl+0xa1/0xc0
> [<ffffffff9e53879b>] system_call_fastpath+0x22/0x27
>
> ( The kernel we used is older, but we think the latest kernel also has this
> bug after dig into this problem. )
>
> For 1G hugepages, huge_pte_offset() wants to return NULL or pudp, but it
> may return a wrong 'pmdp' if there is a race. Please look at the following
> code snippet:
>	...
>	pud = pud_offset(p4d, addr);
>	if (sz != PUD_SIZE && pud_none(*pud))
>		return NULL;
>	/* hugepage or swap? */
>	if (pud_huge(*pud) || !pud_present(*pud))
>		return (pte_t *)pud;
>
>	pmd = pmd_offset(pud, addr);
>	if (sz != PMD_SIZE && pmd_none(*pmd))
>		return NULL;
>	/* hugepage or swap? */
>	if (pmd_huge(*pmd) || !pmd_present(*pmd))
>		return (pte_t *)pmd;
>	...
>
> The following sequence would trigger this bug:
> 1. CPU0: sz = PUD_SIZE and *pud = 0 , continue
> 1. CPU0: "pud_huge(*pud)" is false
> 2. CPU1: calling hugetlb_no_page and set *pud to xxxx8e7(PRESENT)
> 3. CPU0: "!pud_present(*pud)" is false, continue
> 4. CPU0: pmd = pmd_offset(pud, addr) and maybe return a wrong pmdp
> However, we want CPU0 to return NULL or pudp.
>
> We can avoid this race by read the pud only once.

Are there any other options for avoiding the panic you hit? I ask because
there are a variety of flows that use a very similar code pattern, e.g.
lookup_address_in_pgd(), and using READ_ONCE() in huge_pte_offset() but not
other flows could be confusing (or in my case, anxiety inducing[*]). At
the least, adding a comment in huge_pte_offset() to explain the need for
READ_ONCE() would be helpful.

[*] In kernel 5.6, KVM is moving to using lookup_address_in_pgd() (via
    lookup_address_in_mm()) to identify large page mappings. The function
    itself is susceptible to such a race, but KVM only does the lookup
    after it has done gup() and also ensures any zapping of ptes will cause
    KVM to restart the faulting (guest) instruction or that the zap will be
    blocked until after KVM does the lookup, i.e. racing with a transition
    from !PRESENT -> PRESENT should be impossible (in theory).

> Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
> ---
>  mm/hugetlb.c | 34 ++++++++++++++++++----------------
>  1 file changed, 18 insertions(+), 16 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index dd8737a..3bde229 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4908,31 +4908,33 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
>  pte_t *huge_pte_offset(struct mm_struct *mm,
>  		       unsigned long addr, unsigned long sz)
>  {
> -	pgd_t *pgd;
> -	p4d_t *p4d;
> -	pud_t *pud;
> -	pmd_t *pmd;
> +	pgd_t *pgdp;
> +	p4d_t *p4dp;
> +	pud_t *pudp, pud;
> +	pmd_t *pmdp, pmd;
>
> -	pgd = pgd_offset(mm, addr);
> -	if (!pgd_present(*pgd))
> +	pgdp = pgd_offset(mm, addr);
> +	if (!pgd_present(*pgdp))
>  		return NULL;
> -	p4d = p4d_offset(pgd, addr);
> -	if (!p4d_present(*p4d))
> +	p4dp = p4d_offset(pgdp, addr);
> +	if (!p4d_present(*p4dp))
>  		return NULL;
>
> -	pud = pud_offset(p4d, addr);
> -	if (sz != PUD_SIZE && pud_none(*pud))
> +	pudp = pud_offset(p4dp, addr);
> +	pud = READ_ONCE(*pudp);
> +	if (sz != PUD_SIZE && pud_none(pud))
>  		return NULL;
>  	/* hugepage or swap? */
> -	if (pud_huge(*pud) || !pud_present(*pud))
> -		return (pte_t *)pud;
> +	if (pud_huge(pud) || !pud_present(pud))
> +		return (pte_t *)pudp;
>
> -	pmd = pmd_offset(pud, addr);
> -	if (sz != PMD_SIZE && pmd_none(*pmd))
> +	pmdp = pmd_offset(pudp, addr);
> +	pmd = READ_ONCE(*pmdp);
> +	if (sz != PMD_SIZE && pmd_none(pmd))
>  		return NULL;
>  	/* hugepage or swap? */
> -	if (pmd_huge(*pmd) || !pmd_present(*pmd))
> -		return (pte_t *)pmd;
> +	if (pmd_huge(pmd) || !pmd_present(pmd))
> +		return (pte_t *)pmdp;
>
>  	return NULL;
>  }
> --
> 1.8.3.1
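
The CPU0/CPU1 sequence quoted above can be replayed deterministically in userspace. What follows is only a sketch under stated assumptions: a PUD entry is modeled as a plain 64-bit word, MODEL_PRESENT and MODEL_PSE are assumed x86-style bit positions, and every type and function name here is hypothetical rather than a kernel API. A callback plays the part of CPU1 and runs exactly in the window between the two loads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PRESENT 0x001ULL  /* assumed: bit 0 = present   */
#define MODEL_PSE     0x080ULL  /* assumed: bit 7 = huge page */

enum outcome { RETURN_PUDP, WALK_TO_PMD };

/* Hypothetical hook that stands in for the other CPU. */
typedef void (*window_fn)(volatile uint64_t *pudp);

static int model_huge(uint64_t e)    { return (e & MODEL_PSE) != 0; }
static int model_present(uint64_t e) { return (e & MODEL_PRESENT) != 0; }

/* Original shape: the entry is loaded once per predicate. */
static enum outcome check_double_read(volatile uint64_t *pudp, window_fn window)
{
    if (model_huge(*pudp))          /* first load */
        return RETURN_PUDP;
    if (window)
        window(pudp);               /* CPU1 updates the entry here */
    if (!model_present(*pudp))      /* second load */
        return RETURN_PUDP;
    return WALK_TO_PMD;             /* reached with a huge entry: the bug */
}

/* Fixed shape: one load, both predicates test the same snapshot. */
static enum outcome check_single_read(volatile uint64_t *pudp, window_fn window)
{
    uint64_t entry = *pudp;         /* single load (READ_ONCE in the patch) */

    if (window)
        window(pudp);               /* a concurrent update no longer matters */
    if (model_huge(entry) || !model_present(entry))
        return RETURN_PUDP;
    return WALK_TO_PMD;
}

/* Plays CPU1: install a present 1G mapping (flag bits 0x8e7 as in the report). */
static void install_huge_mapping(volatile uint64_t *pudp)
{
    *pudp = 0x140000000ULL | 0x8e7ULL;
}
```

Starting from an empty entry, calling check_double_read() with install_huge_mapping as the hook falls through to the PMD walk, while check_single_read() with the same hook returns the PUD pointer. The single-read variant may act on a stale snapshot, but a stale "not present" answer is harmless here because the caller retries the fault path.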
On Tue, Feb 18, 2020 at 08:10:25PM +0800, Longpeng(Mike) wrote:
>  {
> -	pgd_t *pgd;
> -	p4d_t *p4d;
> -	pud_t *pud;
> -	pmd_t *pmd;
> +	pgd_t *pgdp;
> +	p4d_t *p4dp;
> +	pud_t *pudp, pud;
> +	pmd_t *pmdp, pmd;

Renaming the variables as part of a fix is a really bad idea. It obscures
the actual fix and makes everybody's life harder. Plus, it's not even
renaming to follow the normal convention -- there are only two places
(migrate.c and gup.c) which follow this pattern in mm/ while there are
33 that do not.
On 2/18/20 12:37 PM, Sean Christopherson wrote:
> On Tue, Feb 18, 2020 at 08:10:25PM +0800, Longpeng(Mike) wrote:
>> Our machine encountered a panic after run for a long time and
>> the calltrace is:
>
> What's the actual panic? Is it a BUG() in hugetlb_fault(), a bad pointer
> dereference, etc...?

I too would like some more information on the panic.
If your analysis is correct, then I would expect the 'ptep' returned by
huge_pte_offset() to not point to a pte but rather some random address.
This is because the 'pmd' calculated by pmd_offset(pud, addr) is not
really the address of a pmd. So, perhaps there is an addressing exception
at huge_ptep_get() near the beginning of hugetlb_fault()?

	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
	if (ptep) {
		entry = huge_ptep_get(ptep);
		...
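
To make the "not really the address of a pmd" point concrete, here is a rough userspace sketch of the address arithmetic. It assumes x86-64 4-level paging constants (a PMD shift of 21 and 512 entries per table) and a simplified physical-address mask; model_pmd_offset(), inside_1g_page() and the MODEL_* constants are hypothetical names, not kernel APIs, and the function returns a physical address where the kernel would return a virtual one.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 constants, simplified for illustration only. */
#define MODEL_PMD_SHIFT     21
#define MODEL_PTRS_PER_PMD  512
#define MODEL_PFN_MASK      0x000ffffffffff000ULL  /* bits 12..51 */

/*
 * Hypothetical model of pmd_offset(): take the physical address stored in
 * the PUD entry, treat it as the base of a PMD table, and index into it.
 */
static uint64_t model_pmd_offset(uint64_t pud_entry, uint64_t addr)
{
    uint64_t base  = pud_entry & MODEL_PFN_MASK;
    uint64_t index = (addr >> MODEL_PMD_SHIFT) & (MODEL_PTRS_PER_PMD - 1);

    assert(index < MODEL_PTRS_PER_PMD);     /* sanity check on the mask */
    return base + index * sizeof(uint64_t);
}

/* True if 'pa' falls inside the 1G page that starts at 'huge_base'. */
static int inside_1g_page(uint64_t pa, uint64_t huge_base)
{
    return pa >= huge_base && pa < huge_base + (1ULL << 30);
}
```

If the PUD entry maps a 1G huge page, the "table" base is the start of the user's data page, so the computed slot lies within the first 4 KiB of user data. Reading it as a page-table entry yields whatever the guest or application happened to store there.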
On 2020/2/19 4:37, Sean Christopherson wrote:
> On Tue, Feb 18, 2020 at 08:10:25PM +0800, Longpeng(Mike) wrote:
>> Our machine encountered a panic after run for a long time and
>> the calltrace is:
>
> What's the actual panic? Is it a BUG() in hugetlb_fault(), a bad pointer
> dereference, etc...?
>
A bad pointer dereference.

pgd -> pud -> user 1G hugepage
huge_pte_offset() wants to return NULL or pud (pointing to the entry), but it
may return a bad pointer into the user 1G hugepage.

>> [...]
>>
>> We can avoid this race by read the pud only once.
>
> Are there any other options for avoiding the panic you hit? I ask because
> there are a variety of flows that use a very similar code pattern, e.g.
> lookup_address_in_pgd(), and using READ_ONCE() in huge_pte_offset() but not
> other flows could be confusing (or in my case, anxiety inducing[*]). At
> the least, adding a comment in huge_pte_offset() to explain the need for
> READ_ONCE() would be helpful.
>
I hope the hugetlb and mm maintainers can suggest other options if they
confirm this bug.

We changed the code from
	if (pud_huge(*pud) || !pud_present(*pud))
to
	if (pud_huge(*pud))
		return (pte_t *)pud;
	/* busy loop for 500ms */
	if (!pud_present(*pud))
		return (pte_t *)pud;
and the panic is hit quickly.

ARM64 already uses READ/WRITE_ONCE to access the page tables; see
commit 20a004e7 (arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables).

The root cause is that 'if (pud_huge(*pud) || !pud_present(*pud))' reads the
entry from pud twice, and *pud may change in between because of a race, so we
should read the pud only once. I use READ_ONCE here just to be safe, to prevent
compiler mischief where possible.

I'll add comments in v2.

> [*] In kernel 5.6, KVM is moving to using lookup_address_in_pgd() (via
>     lookup_address_in_mm()) to identify large page mappings. The function
>     itself is susceptible to such a race, but KVM only does the lookup
>     after it has done gup() and also ensures any zapping of ptes will cause
>     KVM to restart the faulting (guest) instruction or that the zap will be
>     blocked until after KVM does the lookup, i.e. racing with a transition
>     from !PRESENT -> PRESENT should be impossible (in theory).
>
This bug is in the hugetlb core; we could trigger it through other users even
if the latest KVM won't.
On Wed, Feb 19, 2020 at 09:39:59AM +0800, Longpeng (Mike) wrote:
> On 2020/2/19 4:37, Sean Christopherson wrote:
> > On Tue, Feb 18, 2020 at 08:10:25PM +0800, Longpeng(Mike) wrote:
> >> Our machine encountered a panic after run for a long time and
> >> the calltrace is:
> >
> > What's the actual panic? Is it a BUG() in hugetlb_fault(), a bad pointer
> > dereference, etc...?
> >
> A bad pointer dereference.
>
> pgd -> pud -> user 1G hugepage
> huge_pte_offset() wants to return NULL or pud (pointing to the entry), but it
> may return a bad pointer into the user 1G hugepage.
>
> >> [...]
> >>
> >> We can avoid this race by read the pud only once.
> >
> > Are there any other options for avoiding the panic you hit? I ask because
> > there are a variety of flows that use a very similar code pattern, e.g.
> > lookup_address_in_pgd(), and using READ_ONCE() in huge_pte_offset() but not
> > other flows could be confusing (or in my case, anxiety inducing[*]). At
> > the least, adding a comment in huge_pte_offset() to explain the need for
> > READ_ONCE() would be helpful.
> >
> I hope the hugetlb and mm maintainers can suggest other options if they
> confirm this bug.

The race and the fix make sense. I assumed dereferencing garbage from the
huge page was the issue, but I wasn't 100% that was the case, which is why
I asked about alternative fixes.

> We changed the code from
>	if (pud_huge(*pud) || !pud_present(*pud))
> to
>	if (pud_huge(*pud))
>		return (pte_t *)pud;
>	/* busy loop for 500ms */
>	if (!pud_present(*pud))
>		return (pte_t *)pud;
> and the panic is hit quickly.
>
> ARM64 already uses READ/WRITE_ONCE to access the page tables; see
> commit 20a004e7 (arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables).
>
> The root cause is that 'if (pud_huge(*pud) || !pud_present(*pud))' reads the
> entry from pud twice, and *pud may change in between because of a race, so we
> should read the pud only once. I use READ_ONCE here just to be safe, to prevent
> compiler mischief where possible.

FWIW, I'd be in favor of going the READ/WRITE_ONCE() route for x86, e.g.
convert everything as a follow-up patch (or patches). I'm fairly confident
that KVM's usage of lookup_address_in_mm() is safe, but I wouldn't exactly
bet my life on it. I'd much rather the failing scenario be that KVM uses
a sub-optimal page size as opposed to exploding on a bad pointer.

> I'll add comments in v2.
>
> > [*] In kernel 5.6, KVM is moving to using lookup_address_in_pgd() (via
> >     lookup_address_in_mm()) to identify large page mappings. The function
> >     itself is susceptible to such a race, but KVM only does the lookup
> >     after it has done gup() and also ensures any zapping of ptes will cause
> >     KVM to restart the faulting (guest) instruction or that the zap will be
> >     blocked until after KVM does the lookup, i.e. racing with a transition
> >     from !PRESENT -> PRESENT should be impossible (in theory).
> >
> This bug is in the hugetlb core; we could trigger it through other users even
> if the latest KVM won't.

I was actually worried about the opposite, introducing a bug by moving to
lookup_address_in_mm().
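
For readers less familiar with the macros under discussion, the sketch below shows roughly what a READ_ONCE()/WRITE_ONCE() pair does, using simplified userspace stand-ins. MODEL_READ_ONCE, MODEL_WRITE_ONCE, model_pud_t and the two helpers are hypothetical names; the macros rely on the GCC/Clang `__typeof__` extension and only make sense for scalar-sized objects, and the real kernel definitions do more than this. The volatile access asks the compiler to emit exactly one load or store, so it cannot quietly re-read the entry between two tests.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins; not the kernel's definitions. */
#define MODEL_READ_ONCE(x)      (*(const volatile __typeof__(x) *)&(x))
#define MODEL_WRITE_ONCE(x, v)  (*(volatile __typeof__(x) *)&(x) = (v))

typedef struct { uint64_t val; } model_pud_t;   /* hypothetical entry type */

/* Snapshot the raw value once; callers test only the returned copy. */
static uint64_t snapshot_entry(model_pud_t *pudp)
{
    return MODEL_READ_ONCE(pudp->val);
}

/* Publish a new entry with a single store. */
static void publish_entry(model_pud_t *pudp, uint64_t val)
{
    assert(pudp != 0);
    MODEL_WRITE_ONCE(pudp->val, val);
}
```

Note that volatile by itself is not a C-standard atomicity guarantee; this sketch assumes naturally aligned 64-bit accesses that the target performs as a single load or store, which is the usual expectation on x86-64.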
On 2020/2/19 4:52, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 08:10:25PM +0800, Longpeng(Mike) wrote:
>>  {
>> -	pgd_t *pgd;
>> -	p4d_t *p4d;
>> -	pud_t *pud;
>> -	pmd_t *pmd;
>> +	pgd_t *pgdp;
>> +	p4d_t *p4dp;
>> +	pud_t *pudp, pud;
>> +	pmd_t *pmdp, pmd;
>
> Renaming the variables as part of a fix is a really bad idea. It obscures
> the actual fix and makes everybody's life harder. Plus, it's not even
> renaming to follow the normal convention -- there are only two places
> (migrate.c and gup.c) which follow this pattern in mm/ while there are
> 33 that do not.
>
Good suggestion, I've never noticed this, thanks.
By the way, could you give an example of fixing the bug this way?
On 2/18/20 6:09 PM, Longpeng (Mike) wrote:
> On 2020/2/19 4:52, Matthew Wilcox wrote:
>> On Tue, Feb 18, 2020 at 08:10:25PM +0800, Longpeng(Mike) wrote:
>>>  {
>>> -	pgd_t *pgd;
>>> -	p4d_t *p4d;
>>> -	pud_t *pud;
>>> -	pmd_t *pmd;
>>> +	pgd_t *pgdp;
>>> +	p4d_t *p4dp;
>>> +	pud_t *pudp, pud;
>>> +	pmd_t *pmdp, pmd;
>>
>> Renaming the variables as part of a fix is a really bad idea. It obscures
>> the actual fix and makes everybody's life harder. Plus, it's not even
>> renaming to follow the normal convention -- there are only two places
>> (migrate.c and gup.c) which follow this pattern in mm/ while there are
>> 33 that do not.
>>
> Good suggestion, I've never noticed this, thanks.
> By the way, could you give an example of fixing the bug this way?

Matthew and others may have better suggestions for naming. However, I would
keep the existing names and add:

	pud_t pud_entry;
	pmd_t pmd_entry;

Then the *_entry variables are the target of the READ_ONCE()

	pud_entry = READ_ONCE(*pud);
	if (sz != PUD_SIZE && pud_none(pud_entry))
	...
	...
	pmd_entry = READ_ONCE(*pmd);
	if (sz != PMD_SIZE && pmd_none(pmd_entry))
	...
	...

BTW, thank you for finding this issue!
On 2020/2/19 9:58, Sean Christopherson wrote:
> On Wed, Feb 19, 2020 at 09:39:59AM +0800, Longpeng (Mike) wrote:
> [...]
>
> FWIW, I'd be in favor of going the READ/WRITE_ONCE() route for x86, e.g.
> convert everything as a follow-up patch (or patches). I'm fairly confident
> that KVM's usage of lookup_address_in_mm() is safe, but I wouldn't exactly
> bet my life on it. I'd much rather the failing scenario be that KVM uses
> a sub-optimal page size as opposed to exploding on a bad pointer.
>
Um... our test case starts 50 VMs with 2U4G (using 1G hugepages) and then does
live-upgrade (a private feature that just modifies QEMU and libvirt) and
live-migrate in turn for each one. However, our live-upgraded new QEMU won't do
touch_all_pages.
Suppose we start a VM without touch_all_pages in QEMU: the VM's guest memory is
not mapped in the CR3 pagetable at that moment. When the 2 vcpus are running,
they could access pages belonging to the same 1G hugepage; both of them will
vmexit due to ept_violation and then call
gup-->follow_hugetlb_page-->hugetlb_fault, so the race may be encountered, right?
On 2020/2/19 11:49, Mike Kravetz wrote:
> On 2/18/20 6:09 PM, Longpeng (Mike) wrote:
>> On 2020/2/19 4:52, Matthew Wilcox wrote:
>>> On Tue, Feb 18, 2020 at 08:10:25PM +0800, Longpeng(Mike) wrote:
>>>> {
>>>> -	pgd_t *pgd;
>>>> -	p4d_t *p4d;
>>>> -	pud_t *pud;
>>>> -	pmd_t *pmd;
>>>> +	pgd_t *pgdp;
>>>> +	p4d_t *p4dp;
>>>> +	pud_t *pudp, pud;
>>>> +	pmd_t *pmdp, pmd;
>>>
>>> Renaming the variables as part of a fix is a really bad idea.  It obscures
>>> the actual fix and makes everybody's life harder.  Plus, it's not even
>>> renaming to follow the normal convention -- there are only two places
>>> (migrate.c and gup.c) which follow this pattern in mm/ while there are
>>> 33 that do not.
>>>
>> Good suggestion, I've never noticed this, thanks.
>> By the way, could you give an example if we use this way to fix the bug?
>
> Matthew and others may have better suggestions for naming.  However, I would
> keep the existing names and add:
>
> pud_t pud_entry;
> pmd_t pmd_entry;
>
> Then the *_entry variables are the target of the READ_ONCE()
>
> pud_entry = READ_ONCE(*pud);
> if (sz != PUD_SIZE && pud_none(pud_entry))
> ...
> ...
> pmd_entry = READ_ONCE(*pmd);
> if (sz != PMD_SIZE && pmd_none(pmd_entry))
> ...
> ...
>
Uh, that looks much better.

BTW, I missed one of your emails in my mail client, but I found it on lkml.org.
'''
I too would like some more information on the panic.
If your analysis is correct, then I would expect the 'ptep' returned by
huge_pte_offset() to not point to a pte but rather some random address.
This is because the 'pmd' calculated by pmd_offset(pud, addr) is not
really the address of a pmd.  So, perhaps there is an addressing exception
at huge_ptep_get() near the beginning of hugetlb_fault()?

	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
	if (ptep) {
		entry = huge_ptep_get(ptep);
		...
'''
Yep, your analysis above is the same as mine: we got a 'dummy pmd' and then
accessed a bad address.

What's your opinion on how to fix this problem? It is not only
huge_pte_offset(); some other places have the same problem (e.g.
lookup_address_in_pgd()).

> BTW, thank you for finding this issue!
>
On Wed, Feb 19, 2020 at 08:21:26PM +0800, Longpeng (Mike) wrote:
> On 2020/2/19 9:58, Sean Christopherson wrote:
> > FWIW, I'd be in favor of going the READ/WRITE_ONCE() route for x86, e.g.
> > convert everything as a follow-up patch (or patches).  I'm fairly confident
> > that KVM's usage of lookup_address_in_mm() is safe, but I wouldn't exactly
> > bet my life on it.  I'd much rather the failing scenario be that KVM uses
> > a sub-optimal page size as opposed to exploding on a bad pointer.
> >
> Um...our testcase starts 50 VMs with 2U4G(use 1G hugepage) and then do
> live-upgrade(private feature that just modify the qemu and libvirt) and
> live-migrate in turns for each one. However our live upgraded new QEMU won't do
> touch_all_pages.
> Suppose we start a VM without touch_all_pages in QEMU, the VM's guest memory is
> not mapped in the CR3 pagetable at the moment. When the 2 vcpus running, they
> could access some pages belong to the same 1G-hugepage, both of them will vmexit
> due to ept_violation and then call gup-->follow_hugetlb_page-->hugetlb_fault, so
> the race may encounter, right?

Yep.  The code I'm referring to is similar but different code that just
happened to go into KVM for kernel 5.6.  It has no effect on the gup() flow
that leads to this bug.  I mentioned it above as an example of code outside
of hugetlb_fault() that would also benefit from moving to READ/WRITE_ONCE().
+ Kirill

On 2/18/20 5:58 PM, Sean Christopherson wrote:
> On Wed, Feb 19, 2020 at 09:39:59AM +0800, Longpeng (Mike) wrote:
>> On 2020/2/19 4:37, Sean Christopherson wrote:
>>> On Tue, Feb 18, 2020 at 08:10:25PM +0800, Longpeng(Mike) wrote:
>>>> Our machine encountered a panic after run for a long time and
>>>> the calltrace is:
>>>
>>> What's the actual panic?  Is it a BUG() in hugetlb_fault(), a bad pointer
>>> dereference, etc...?
>>>
>> A bad pointer dereference.
>>
>> pgd -> pud -> user 1G hugepage
>> huge_pte_offset() wants to return NULL or pud (point to the entry), but it maybe
>> return the a bad pointer of the user 1G hugepage.
>>
>>>> RIP: 0010:[<ffffffff9dff0587>] [<ffffffff9dff0587>] hugetlb_fault+0x307/0xbe0
>>>> RSP: 0018:ffff9567fc27f808 EFLAGS: 00010286
>>>> RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48
>>>> RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48
>>>> RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080
>>>> R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8
>>>> R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074
>>>> FS: 00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000
>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0
>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>> Call Trace:
>>>> [<ffffffff9df9b71b>] ? unlock_page+0x2b/0x30
>>>> [<ffffffff9dff04a2>] ? hugetlb_fault+0x222/0xbe0
>>>> [<ffffffff9dff1405>] follow_hugetlb_page+0x175/0x540
>>>> [<ffffffff9e15b825>] ? cpumask_next_and+0x35/0x50
>>>> [<ffffffff9dfc7230>] __get_user_pages+0x2a0/0x7e0
>>>> [<ffffffff9dfc648d>] __get_user_pages_unlocked+0x15d/0x210
>>>> [<ffffffffc068cfc5>] __gfn_to_pfn_memslot+0x3c5/0x460 [kvm]
>>>> [<ffffffffc06b28be>] try_async_pf+0x6e/0x2a0 [kvm]
>>>> [<ffffffffc06b4b41>] tdp_page_fault+0x151/0x2d0 [kvm]
>>>> [<ffffffffc075731c>] ? vmx_vcpu_run+0x2ec/0xc80 [kvm_intel]
>>>> [<ffffffffc0757328>] ? vmx_vcpu_run+0x2f8/0xc80 [kvm_intel]
>>>> [<ffffffffc06abc11>] kvm_mmu_page_fault+0x31/0x140 [kvm]
>>>> [<ffffffffc074d1ae>] handle_ept_violation+0x9e/0x170 [kvm_intel]
>>>> [<ffffffffc075579c>] vmx_handle_exit+0x2bc/0xc70 [kvm_intel]
>>>> [<ffffffffc074f1a0>] ? __vmx_complete_interrupts.part.73+0x80/0xd0 [kvm_intel]
>>>> [<ffffffffc07574c0>] ? vmx_vcpu_run+0x490/0xc80 [kvm_intel]
>>>> [<ffffffffc069f3be>] vcpu_enter_guest+0x7be/0x13a0 [kvm]
>>>> [<ffffffffc06cf53e>] ? kvm_check_async_pf_completion+0x8e/0xb0 [kvm]
>>>> [<ffffffffc06a6f90>] kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm]
>>>> [<ffffffffc068d919>] kvm_vcpu_ioctl+0x309/0x6d0 [kvm]
>>>> [<ffffffff9deaa8c2>] ? dequeue_signal+0x32/0x180
>>>> [<ffffffff9deae34d>] ? do_sigtimedwait+0xcd/0x230
>>>> [<ffffffff9e03aed0>] do_vfs_ioctl+0x3f0/0x540
>>>> [<ffffffff9e03b0c1>] SyS_ioctl+0xa1/0xc0
>>>> [<ffffffff9e53879b>] system_call_fastpath+0x22/0x27
>>>>
>>>> ( The kernel we used is older, but we think the latest kernel also has this
>>>> bug after dig into this problem. )
>>>>
>>>> For 1G hugepages, huge_pte_offset() wants to return NULL or pudp, but it
>>>> may return a wrong 'pmdp' if there is a race. Please look at the following
>>>> code snippet:
>>>> ...
>>>> 	pud = pud_offset(p4d, addr);
>>>> 	if (sz != PUD_SIZE && pud_none(*pud))
>>>> 		return NULL;
>>>> 	/* hugepage or swap? */
>>>> 	if (pud_huge(*pud) || !pud_present(*pud))
>>>> 		return (pte_t *)pud;
>>>>
>>>> 	pmd = pmd_offset(pud, addr);
>>>> 	if (sz != PMD_SIZE && pmd_none(*pmd))
>>>> 		return NULL;
>>>> 	/* hugepage or swap? */
>>>> 	if (pmd_huge(*pmd) || !pmd_present(*pmd))
>>>> 		return (pte_t *)pmd;
>>>> ...
>>>>
>>>> The following sequence would trigger this bug:
>>>> 1. CPU0: sz = PUD_SIZE and *pud = 0 , continue
>>>> 1. CPU0: "pud_huge(*pud)" is false
>>>> 2. CPU1: calling hugetlb_no_page and set *pud to xxxx8e7(PRESENT)
>>>> 3. CPU0: "!pud_present(*pud)" is false, continue
>>>> 4. CPU0: pmd = pmd_offset(pud, addr) and maybe return a wrong pmdp
>>>> However, we want CPU0 to return NULL or pudp.
>>>>
>>>> We can avoid this race by read the pud only once.
>>>
>>> Are there any other options for avoiding the panic you hit?  I ask because
>>> there are a variety of flows that use a very similar code pattern, e.g.
>>> lookup_address_in_pgd(), and using READ_ONCE() in huge_pte_offset() but not
>>> other flows could be confusing (or in my case, anxiety inducing[*]).  At
>>> the least, adding a comment in huge_pte_offset() to explain the need for
>>> READ_ONCE() would be helpful.
>>>
>> I hope the hugetlb and mm maintainers could give some other options if they
>> approve this bug.
>
> The race and the fix make sense.  I assumed dereferencing garbage from the
> huge page was the issue, but I wasn't 100% that was the case, which is why
> I asked about alternative fixes.
>
>> We change the code from
>> 	if (pud_huge(*pud) || !pud_present(*pud))
>> to
>> 	if (pud_huge(*pud))
>> 		return (pte_t *)pud;
>> 	busy loop for 500ms
>> 	if (!pud_present(*pud))
>> 		return (pte_t *)pud;
>> and the panic will be hit quickly.
>>
>> ARM64 has already use READ/WRITE_ONCE to access the pagetable, look at this
>> commit 20a004e7 (arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables).
>>
>> The root cause is: 'if (pud_huge(*pud) || !pud_present(*pud))' read entry from
>> pud twice and the *pud maybe change in a race, so if we only read the pud once.
>> I use READ_ONCE here is just for safe, to prevents the compiler mischief if
>> possible.
>
> FWIW, I'd be in favor of going the READ/WRITE_ONCE() route for x86, e.g.
> convert everything as a follow-up patch (or patches).  I'm fairly confident
> that KVM's usage of lookup_address_in_mm() is safe, but I wouldn't exactly
> bet my life on it.  I'd much rather the failing scenario be that KVM uses
> a sub-optimal page size as opposed to exploding on a bad pointer.

Longpeng(Mike) asked in another e-mail specifically about making similar
changes to lookup_address_in_mm().  Replying here as there is more context.

I 'think' lookup_address_in_mm is safe from this issue.  Why?  IIUC, the
problem with the huge_pte_offset routine is that the pud changes from
pud_none() to pud_huge() in the middle of
'if (pud_huge(*pud) || !pud_present(*pud))'.  In the case of
lookup_address_in_mm, we know pud was not pud_none() as it was previously
checked.  I am not aware of any other state transitions which could cause
us trouble.  However, I am no expert in this area.
On 2020/2/20 3:33, Mike Kravetz wrote:
> + Kirill
> On 2/18/20 5:58 PM, Sean Christopherson wrote:
>> On Wed, Feb 19, 2020 at 09:39:59AM +0800, Longpeng (Mike) wrote:
<snip>
>> FWIW, I'd be in favor of going the READ/WRITE_ONCE() route for x86, e.g.
>> convert everything as a follow-up patch (or patches).  I'm fairly confident
>> that KVM's usage of lookup_address_in_mm() is safe, but I wouldn't exactly
>> bet my life on it.  I'd much rather the failing scenario be that KVM uses
>> a sub-optimal page size as opposed to exploding on a bad pointer.
>
> Longpeng(Mike) asked in another e-mail specifically about making similar
> changes to lookup_address_in_mm().  Replying here as there is more context.
>
> I 'think' lookup_address_in_mm is safe from this issue.  Why?  IIUC, the
> problem with the huge_pte_offset routine is that the pud changes from
> pud_none() to pud_huge() in the middle of
> 'if (pud_huge(*pud) || !pud_present(*pud))'.  In the case of
> lookup_address_in_mm, we know pud was not pud_none() as it was previously
> checked.  I am not aware of any other state transitions which could cause
> us trouble.  However, I am no expert in this area.
>
So... I just need to fix huge_pte_offset() in mm/hugetlb.c, right?

Is it possible for the pud to change from pud_huge() to pud_none() while
another CPU is walking the page table?
On 2020/2/20 0:22, Sean Christopherson wrote:
> On Wed, Feb 19, 2020 at 08:21:26PM +0800, Longpeng (Mike) wrote:
>> On 2020/2/19 9:58, Sean Christopherson wrote:
>>> FWIW, I'd be in favor of going the READ/WRITE_ONCE() route for x86, e.g.
>>> convert everything as a follow-up patch (or patches).  I'm fairly confident
>>> that KVM's usage of lookup_address_in_mm() is safe, but I wouldn't exactly
>>> bet my life on it.  I'd much rather the failing scenario be that KVM uses
>>> a sub-optimal page size as opposed to exploding on a bad pointer.
>>>
>> Um...our testcase starts 50 VMs with 2U4G(use 1G hugepage) and then do
>> live-upgrade(private feature that just modify the qemu and libvirt) and
>> live-migrate in turns for each one. However our live upgraded new QEMU won't do
>> touch_all_pages.
>> Suppose we start a VM without touch_all_pages in QEMU, the VM's guest memory is
>> not mapped in the CR3 pagetable at the moment. When the 2 vcpus running, they
>> could access some pages belong to the same 1G-hugepage, both of them will vmexit
>> due to ept_violation and then call gup-->follow_hugetlb_page-->hugetlb_fault, so
>> the race may encounter, right?
>
> Yep.  The code I'm referring to is similar but different code that just
> happened to go into KVM for kernel 5.6.  It has no effect on the gup() flow
> that leads to this bug.  I mentioned it above as an example of code outside
> of hugetlb_fault() that would also benefit from moving to READ/WRITE_ONCE().
>
>
I understand better now, thanks for your patience. :)
On 2/19/20 6:30 PM, Longpeng (Mike) wrote:
> On 2020/2/20 3:33, Mike Kravetz wrote:
>> + Kirill
>> On 2/18/20 5:58 PM, Sean Christopherson wrote:
>>> On Wed, Feb 19, 2020 at 09:39:59AM +0800, Longpeng (Mike) wrote:
<snip>
>>> The race and the fix make sense.  I assumed dereferencing garbage from the
>>> huge page was the issue, but I wasn't 100% that was the case, which is why
>>> I asked about alternative fixes.
>>>
>>>> We change the code from
>>>> 	if (pud_huge(*pud) || !pud_present(*pud))
>>>> to
>>>> 	if (pud_huge(*pud))
>>>> 		return (pte_t *)pud;
>>>> 	busy loop for 500ms
>>>> 	if (!pud_present(*pud))
>>>> 		return (pte_t *)pud;
>>>> and the panic will be hit quickly.
>>>>
>>>> ARM64 has already use READ/WRITE_ONCE to access the pagetable, look at this
>>>> commit 20a004e7 (arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables).
>>>>
>>>> The root cause is: 'if (pud_huge(*pud) || !pud_present(*pud))' read entry from
>>>> pud twice and the *pud maybe change in a race, so if we only read the pud once.
>>>> I use READ_ONCE here is just for safe, to prevents the compiler mischief if
>>>> possible.
>>>
>>> FWIW, I'd be in favor of going the READ/WRITE_ONCE() route for x86, e.g.
>>> convert everything as a follow-up patch (or patches).  I'm fairly confident
>>> that KVM's usage of lookup_address_in_mm() is safe, but I wouldn't exactly
>>> bet my life on it.  I'd much rather the failing scenario be that KVM uses
>>> a sub-optimal page size as opposed to exploding on a bad pointer.
>>
>> Longpeng(Mike) asked in another e-mail specifically about making similar
>> changes to lookup_address_in_mm().  Replying here as there is more context.
>>
>> I 'think' lookup_address_in_mm is safe from this issue.  Why?  IIUC, the
>> problem with the huge_pte_offset routine is that the pud changes from
>> pud_none() to pud_huge() in the middle of
>> 'if (pud_huge(*pud) || !pud_present(*pud))'.  In the case of
>> lookup_address_in_mm, we know pud was not pud_none() as it was previously
>> checked.  I am not aware of any other state transitions which could cause
>> us trouble.  However, I am no expert in this area.

Bad copy/paste by me.  Longpeng(Mike) was asking about lookup_address_in_pgd.

> So... I need just fix huge_pte_offset in mm/hugetlb.c, right?

Let's start with just a fix for huge_pte_offset() as you can easily reproduce
that issue by adding a delay.

> Is it possible the pud changes from pud_huge() to pud_none() while another CPU
> is walking the pagetable ?

I believe it is possible.  If we hole punch a hugetlbfs file, we will clear
the corresponding pud's.  Hence, we can go from pud_huge() to pud_none().
Unless I am missing something, that does imply we could have issues in places
such as lookup_address_in_pgd:

	pud = pud_offset(p4d, address);
	if (pud_none(*pud))
		return NULL;

	*level = PG_LEVEL_1G;
	if (pud_large(*pud) || !pud_present(*pud))
		return (pte_t *)pud;

I hope I am wrong, but it seems like pud_none(*pud) could become true after
the initial check, and before the (pud_large) check.  If so, there could be
a problem (addressing exception) when the code continues and looks up the pmd.

	pmd = pmd_offset(pud, address);
	if (pmd_none(*pmd))
		return NULL;

It has been mentioned before that there are many page table walks like this.
What am I missing that prevents races like this?  Or, have we just been lucky?
On 2020/2/21 8:22, Mike Kravetz wrote:
> On 2/19/20 6:30 PM, Longpeng (Mike) wrote:
>> On 2020/2/20 3:33, Mike Kravetz wrote:
>>> + Kirill
>>> On 2/18/20 5:58 PM, Sean Christopherson wrote:
>>>> On Wed, Feb 19, 2020 at 09:39:59AM +0800, Longpeng (Mike) wrote:
> <snip>
>>>> The race and the fix make sense.  I assumed dereferencing garbage from the
>>>> huge page was the issue, but I wasn't 100% that was the case, which is why
>>>> I asked about alternative fixes.
>>>>
>>>>> We change the code from
>>>>> 	if (pud_huge(*pud) || !pud_present(*pud))
>>>>> to
>>>>> 	if (pud_huge(*pud))
>>>>> 		return (pte_t *)pud;
>>>>> 	busy loop for 500ms
>>>>> 	if (!pud_present(*pud))
>>>>> 		return (pte_t *)pud;
>>>>> and the panic will be hit quickly.
>>>>>
>>>>> ARM64 has already use READ/WRITE_ONCE to access the pagetable, look at this
>>>>> commit 20a004e7 (arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables).
>>>>>
>>>>> The root cause is: 'if (pud_huge(*pud) || !pud_present(*pud))' read entry from
>>>>> pud twice and the *pud maybe change in a race, so if we only read the pud once.
>>>>> I use READ_ONCE here is just for safe, to prevents the compiler mischief if
>>>>> possible.
>>>>
>>>> FWIW, I'd be in favor of going the READ/WRITE_ONCE() route for x86, e.g.
>>>> convert everything as a follow-up patch (or patches).  I'm fairly confident
>>>> that KVM's usage of lookup_address_in_mm() is safe, but I wouldn't exactly
>>>> bet my life on it.  I'd much rather the failing scenario be that KVM uses
>>>> a sub-optimal page size as opposed to exploding on a bad pointer.
>>>
>>> Longpeng(Mike) asked in another e-mail specifically about making similar
>>> changes to lookup_address_in_mm().  Replying here as there is more context.
>>>
>>> I 'think' lookup_address_in_mm is safe from this issue.  Why?  IIUC, the
>>> problem with the huge_pte_offset routine is that the pud changes from
>>> pud_none() to pud_huge() in the middle of
>>> 'if (pud_huge(*pud) || !pud_present(*pud))'.  In the case of
>>> lookup_address_in_mm, we know pud was not pud_none() as it was previously
>>> checked.  I am not aware of any other state transitions which could cause
>>> us trouble.  However, I am no expert in this area.
>
> Bad copy/paste by me.  Longpeng(Mike) was asking about lookup_address_in_pgd.
>
>> So... I need just fix huge_pte_offset in mm/hugetlb.c, right?
>
> Let's start with just a fix for huge_pte_offset() as you can easily reproduce
> that issue by adding a delay.
>
>> Is it possible the pud changes from pud_huge() to pud_none() while another CPU
>> is walking the pagetable ?
>
All right, I'll send a v2 to fix it, thanks :)

> I believe it is possible.  If we hole punch a hugetlbfs file, we will clear
> the corresponding pud's.  Hence, we can go from pud_huge() to pud_none().
> Unless I am missing something, that does imply we could have issues in places
> such as lookup_address_in_pgd:
>
> 	pud = pud_offset(p4d, address);
> 	if (pud_none(*pud))
> 		return NULL;
>
> 	*level = PG_LEVEL_1G;
> 	if (pud_large(*pud) || !pud_present(*pud))
> 		return (pte_t *)pud;
>
> I hope I am wrong, but it seems like pud_none(*pud) could become true after
> the initial check, and before the (pud_large) check.  If so, there could be
> a problem (addressing exception) when the code continues and looks up the pmd.
>
> 	pmd = pmd_offset(pud, address);
> 	if (pmd_none(*pmd))
> 		return NULL;
>
> It has been mentioned before that there are many page table walks like this.
> What am I missing that prevents races like this?  Or, have we just been lucky?
>
That's what I worry about. Maybe there is simply no use case that hits it.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dd8737a..3bde229 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4908,31 +4908,33 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz)
 {
-	pgd_t *pgd;
-	p4d_t *p4d;
-	pud_t *pud;
-	pmd_t *pmd;
+	pgd_t *pgdp;
+	p4d_t *p4dp;
+	pud_t *pudp, pud;
+	pmd_t *pmdp, pmd;
 
-	pgd = pgd_offset(mm, addr);
-	if (!pgd_present(*pgd))
+	pgdp = pgd_offset(mm, addr);
+	if (!pgd_present(*pgdp))
 		return NULL;
-	p4d = p4d_offset(pgd, addr);
-	if (!p4d_present(*p4d))
+	p4dp = p4d_offset(pgdp, addr);
+	if (!p4d_present(*p4dp))
 		return NULL;
 
-	pud = pud_offset(p4d, addr);
-	if (sz != PUD_SIZE && pud_none(*pud))
+	pudp = pud_offset(p4dp, addr);
+	pud = READ_ONCE(*pudp);
+	if (sz != PUD_SIZE && pud_none(pud))
 		return NULL;
 	/* hugepage or swap? */
-	if (pud_huge(*pud) || !pud_present(*pud))
-		return (pte_t *)pud;
+	if (pud_huge(pud) || !pud_present(pud))
+		return (pte_t *)pudp;
 
-	pmd = pmd_offset(pud, addr);
-	if (sz != PMD_SIZE && pmd_none(*pmd))
+	pmdp = pmd_offset(pudp, addr);
+	pmd = READ_ONCE(*pmdp);
+	if (sz != PMD_SIZE && pmd_none(pmd))
 		return NULL;
 	/* hugepage or swap? */
-	if (pmd_huge(*pmd) || !pmd_present(*pmd))
-		return (pte_t *)pmd;
+	if (pmd_huge(pmd) || !pmd_present(pmd))
+		return (pte_t *)pmdp;
 
 	return NULL;
 }
Our machine encountered a panic after running for a long time and
the calltrace is:

RIP: 0010:[<ffffffff9dff0587>] [<ffffffff9dff0587>] hugetlb_fault+0x307/0xbe0
RSP: 0018:ffff9567fc27f808 EFLAGS: 00010286
RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48
RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48
RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080
R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8
R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074
FS: 00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
[<ffffffff9df9b71b>] ? unlock_page+0x2b/0x30
[<ffffffff9dff04a2>] ? hugetlb_fault+0x222/0xbe0
[<ffffffff9dff1405>] follow_hugetlb_page+0x175/0x540
[<ffffffff9e15b825>] ? cpumask_next_and+0x35/0x50
[<ffffffff9dfc7230>] __get_user_pages+0x2a0/0x7e0
[<ffffffff9dfc648d>] __get_user_pages_unlocked+0x15d/0x210
[<ffffffffc068cfc5>] __gfn_to_pfn_memslot+0x3c5/0x460 [kvm]
[<ffffffffc06b28be>] try_async_pf+0x6e/0x2a0 [kvm]
[<ffffffffc06b4b41>] tdp_page_fault+0x151/0x2d0 [kvm]
[<ffffffffc075731c>] ? vmx_vcpu_run+0x2ec/0xc80 [kvm_intel]
[<ffffffffc0757328>] ? vmx_vcpu_run+0x2f8/0xc80 [kvm_intel]
[<ffffffffc06abc11>] kvm_mmu_page_fault+0x31/0x140 [kvm]
[<ffffffffc074d1ae>] handle_ept_violation+0x9e/0x170 [kvm_intel]
[<ffffffffc075579c>] vmx_handle_exit+0x2bc/0xc70 [kvm_intel]
[<ffffffffc074f1a0>] ? __vmx_complete_interrupts.part.73+0x80/0xd0 [kvm_intel]
[<ffffffffc07574c0>] ? vmx_vcpu_run+0x490/0xc80 [kvm_intel]
[<ffffffffc069f3be>] vcpu_enter_guest+0x7be/0x13a0 [kvm]
[<ffffffffc06cf53e>] ? kvm_check_async_pf_completion+0x8e/0xb0 [kvm]
[<ffffffffc06a6f90>] kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm]
[<ffffffffc068d919>] kvm_vcpu_ioctl+0x309/0x6d0 [kvm]
[<ffffffff9deaa8c2>] ? dequeue_signal+0x32/0x180
[<ffffffff9deae34d>] ? do_sigtimedwait+0xcd/0x230
[<ffffffff9e03aed0>] do_vfs_ioctl+0x3f0/0x540
[<ffffffff9e03b0c1>] SyS_ioctl+0xa1/0xc0
[<ffffffff9e53879b>] system_call_fastpath+0x22/0x27

( The kernel we used is older, but after digging into this problem we think
the latest kernel also has this bug. )

For 1G hugepages, huge_pte_offset() wants to return NULL or pudp, but it
may return a wrong 'pmdp' if there is a race. Please look at the following
code snippet:
	...
	pud = pud_offset(p4d, addr);
	if (sz != PUD_SIZE && pud_none(*pud))
		return NULL;
	/* hugepage or swap? */
	if (pud_huge(*pud) || !pud_present(*pud))
		return (pte_t *)pud;

	pmd = pmd_offset(pud, addr);
	if (sz != PMD_SIZE && pmd_none(*pmd))
		return NULL;
	/* hugepage or swap? */
	if (pmd_huge(*pmd) || !pmd_present(*pmd))
		return (pte_t *)pmd;
	...

The following sequence would trigger this bug:
1. CPU0: sz = PUD_SIZE and *pud = 0, continue
2. CPU0: "pud_huge(*pud)" is false
3. CPU1: calls hugetlb_no_page and sets *pud to xxxx8e7 (PRESENT)
4. CPU0: "!pud_present(*pud)" is false, continue
5. CPU0: pmd = pmd_offset(pud, addr), which may return a wrong pmdp
However, we want CPU0 to return NULL or pudp.

We can avoid this race by reading the pud only once.

Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
---
 mm/hugetlb.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)