Message ID | 20240501143310.1381675-1-ryan.roberts@arm.com (mailing list archive)
---|---
State | New, archived
Series | [v3] mm: Fix race between __split_huge_pmd_locked() and GUP-fast
On 1 May 2024, at 10:33, Ryan Roberts wrote:

> __split_huge_pmd_locked() can be called for a present THP, devmap or
> (non-present) migration entry. It calls pmdp_invalidate()
> unconditionally on the pmdp and only determines if it is present or not
> based on the returned old pmd. This is a problem for the migration entry
> case because pmd_mkinvalid(), called by pmdp_invalidate(), must only be
> called for a present pmd.
>
> On arm64 at least, pmd_mkinvalid() will mark the pmd such that any
> future call to pmd_present() will return true. And therefore any
> lockless pgtable walker could see the migration entry pmd in this state
> and start interpreting the fields as if it were present, leading to
> BadThings (TM). GUP-fast appears to be one such lockless pgtable walker.
>
> x86 does not suffer the above problem, but instead pmd_mkinvalid() will
> corrupt the offset field of the swap entry within the swap pte. See link
> below for discussion of that problem.
>
> Fix all of this by only calling pmdp_invalidate() for a present pmd. And
> for good measure let's add a warning to all implementations of
> pmdp_invalidate[_ad](). I've manually reviewed all other
> pmdp_invalidate[_ad]() call sites and believe all others to be
> conformant.
>
> This is a theoretical bug found during code review. I don't have any
> test case to trigger it in practice.
>
> Cc: stable@vger.kernel.org
> Link: https://lore.kernel.org/all/0dd7827a-6334-439a-8fd0-43c98e6af22b@arm.com/
> Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> Right, v3; this goes back to the original approach in v1 to fix core-mm rather
> than push the fix into arm64, since we discovered that x86 can't handle
> pmd_mkinvalid() being called for non-present pmds either.
>
> I'm pulling in more arch maintainers because this version adds some warnings in
> arch code to help spot incorrect usage.
>
> Although Catalin had already accepted v2 (fixing arm64) [2] into for-next/fixes,
> he's agreed to either remove or revert it.
>
>
> Changes since v1 [1]
> ====================
>
> - Improve pmd_mkinvalid() docs to make it clear it can only be called for a
>   present pmd (per JohnH, Zi Yan)
> - Added warnings to arch overrides of pmdp_invalidate[_ad]() (per Zi Yan)
> - Moved comment next to new location of pmdp_invalidate() (per Zi Yan)
>
>
> [1] https://lore.kernel.org/linux-mm/20240425170704.3379492-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/all/20240430133138.732088-1-ryan.roberts@arm.com/
>
> Thanks,
> Ryan
>
>
>  Documentation/mm/arch_pgtable_helpers.rst |  6 ++-
>  arch/powerpc/mm/book3s64/pgtable.c        |  1 +
>  arch/s390/include/asm/pgtable.h           |  4 +-
>  arch/sparc/mm/tlb.c                       |  1 +
>  arch/x86/mm/pgtable.c                     |  2 +
>  mm/huge_memory.c                          | 49 ++++++++++++-----------
>  mm/pgtable-generic.c                      |  2 +
>  7 files changed, 39 insertions(+), 26 deletions(-)

The changes in Documentation/mm and mm/* look good to me. Thanks.
Reviewed-by: Zi Yan <ziy@nvidia.com>

I wonder if making Documentation/mm and mm/* changes in a separate patch
would be better, since you will not need acks from arch maintainers and
get the patch in quicker.

--
Best Regards,
Yan, Zi
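To make the race in the quoted commit message concrete, here is a minimal sketch of the problematic interleaving before the fix. It assumes the arm64 behaviour described above, i.e. that pmd_mkinvalid() applied to a migration entry yields a value for which pmd_present() returns true; the walker side is a simplified stand-in for GUP-fast, not the literal kernel code:

    /*
     * CPU0 splits a PMD migration entry under the PTL; CPU1 walks the
     * same page tables locklessly (GUP-fast style), so it can observe
     * the transient value written by pmd_mkinvalid().
     */

    /* CPU0: __split_huge_pmd_locked(), before this fix */
    old_pmd = pmdp_invalidate(vma, haddr, pmd); /* calls pmd_mkinvalid()
                                                 * even when *pmd is a
                                                 * non-present migration
                                                 * entry */

    /* CPU1: lockless walker, running concurrently */
    pmd_t pmdval = READ_ONCE(*pmd);
    if (pmd_present(pmdval)) {                  /* spuriously true on arm64 */
            unsigned long pfn = pmd_pfn(pmdval); /* swap-entry bits misread
                                                  * as a physical frame */
            /* ... walker carries on with garbage: BadThings (TM) ... */
    }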
On 5/1/24 20:03, Ryan Roberts wrote:
> __split_huge_pmd_locked() can be called for a present THP, devmap or
> (non-present) migration entry. It calls pmdp_invalidate()
> unconditionally on the pmdp and only determines if it is present or not
> based on the returned old pmd. This is a problem for the migration entry
> case because pmd_mkinvalid(), called by pmdp_invalidate(), must only be
> called for a present pmd.
>
> On arm64 at least, pmd_mkinvalid() will mark the pmd such that any
> future call to pmd_present() will return true. And therefore any
> lockless pgtable walker could see the migration entry pmd in this state
> and start interpreting the fields as if it were present, leading to
> BadThings (TM). GUP-fast appears to be one such lockless pgtable walker.
>
> x86 does not suffer the above problem, but instead pmd_mkinvalid() will
> corrupt the offset field of the swap entry within the swap pte. See link
> below for discussion of that problem.
>
> Fix all of this by only calling pmdp_invalidate() for a present pmd. And
> for good measure let's add a warning to all implementations of
> pmdp_invalidate[_ad](). I've manually reviewed all other
> pmdp_invalidate[_ad]() call sites and believe all others to be
> conformant.
>
> This is a theoretical bug found during code review. I don't have any
> test case to trigger it in practice.
>
> Cc: stable@vger.kernel.org
> Link: https://lore.kernel.org/all/0dd7827a-6334-439a-8fd0-43c98e6af22b@arm.com/
> Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> Right, v3; this goes back to the original approach in v1 to fix core-mm rather
> than push the fix into arm64, since we discovered that x86 can't handle
> pmd_mkinvalid() being called for non-present pmds either.

This is a better approach indeed.

>
> I'm pulling in more arch maintainers because this version adds some warnings in
> arch code to help spot incorrect usage.
>
> Although Catalin had already accepted v2 (fixing arm64) [2] into for-next/fixes,
> he's agreed to either remove or revert it.
>
>
> Changes since v1 [1]
> ====================
>
> - Improve pmd_mkinvalid() docs to make it clear it can only be called for a
>   present pmd (per JohnH, Zi Yan)
> - Added warnings to arch overrides of pmdp_invalidate[_ad]() (per Zi Yan)
> - Moved comment next to new location of pmdp_invalidate() (per Zi Yan)
>
>
> [1] https://lore.kernel.org/linux-mm/20240425170704.3379492-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/all/20240430133138.732088-1-ryan.roberts@arm.com/
>
> Thanks,
> Ryan
>
>
>  Documentation/mm/arch_pgtable_helpers.rst |  6 ++-
>  arch/powerpc/mm/book3s64/pgtable.c        |  1 +
>  arch/s390/include/asm/pgtable.h           |  4 +-
>  arch/sparc/mm/tlb.c                       |  1 +
>  arch/x86/mm/pgtable.c                     |  2 +
>  mm/huge_memory.c                          | 49 ++++++++++++-----------
>  mm/pgtable-generic.c                      |  2 +
>  7 files changed, 39 insertions(+), 26 deletions(-)
>
> diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst
> index 2466d3363af7..ad50ca6f495e 100644
> --- a/Documentation/mm/arch_pgtable_helpers.rst
> +++ b/Documentation/mm/arch_pgtable_helpers.rst
> @@ -140,7 +140,8 @@ PMD Page Table Helpers
>  +---------------------------+--------------------------------------------------+
>  | pmd_swp_clear_soft_dirty  | Clears a soft dirty swapped PMD                  |
>  +---------------------------+--------------------------------------------------+
> -| pmd_mkinvalid             | Invalidates a mapped PMD [1]                     |
> +| pmd_mkinvalid             | Invalidates a present PMD; do not call for      |
> +|                           | non-present PMD [1]                              |
>  +---------------------------+--------------------------------------------------+
>  | pmd_set_huge              | Creates a PMD huge mapping                       |
>  +---------------------------+--------------------------------------------------+
> @@ -196,7 +197,8 @@ PUD Page Table Helpers
>  +---------------------------+--------------------------------------------------+
>  | pud_mkdevmap              | Creates a ZONE_DEVICE mapped PUD                 |
>  +---------------------------+--------------------------------------------------+
> -| pud_mkinvalid             | Invalidates a mapped PUD [1]                     |
> +| pud_mkinvalid             | Invalidates a present PUD; do not call for      |
> +|                           | non-present PUD [1]                              |
>  +---------------------------+--------------------------------------------------+
>  | pud_set_huge              | Creates a PUD huge mapping                       |
>  +---------------------------+--------------------------------------------------+

LGTM, but I guess this will conflict with your other patch for mm/debug_vm_pgtable.c
if you choose to update the pud_mkinvalid() description for pmd_leaf().

https://lore.kernel.org/all/20240501144439.1389048-1-ryan.roberts@arm.com/

> diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
> index 83823db3488b..2975ea0841ba 100644
> --- a/arch/powerpc/mm/book3s64/pgtable.c
> +++ b/arch/powerpc/mm/book3s64/pgtable.c
> @@ -170,6 +170,7 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>  {
>          unsigned long old_pmd;
>
> +        VM_WARN_ON_ONCE(!pmd_present(*pmdp));
>          old_pmd = pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, _PAGE_INVALID);
>          flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>          return __pmd(old_pmd);
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 60950e7a25f5..480bea44559d 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1768,8 +1768,10 @@ static inline pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma,
>  static inline pmd_t pmdp_invalidate(struct vm_area_struct *vma,
>                                      unsigned long addr, pmd_t *pmdp)
>  {
> -        pmd_t pmd = __pmd(pmd_val(*pmdp) | _SEGMENT_ENTRY_INVALID);
> +        pmd_t pmd;
>
> +        VM_WARN_ON_ONCE(!pmd_present(*pmdp));
> +        pmd = __pmd(pmd_val(*pmdp) | _SEGMENT_ENTRY_INVALID);
>          return pmdp_xchg_direct(vma->vm_mm, addr, pmdp, pmd);
>  }
>
> diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
> index b44d79d778c7..ef69127d7e5e 100644
> --- a/arch/sparc/mm/tlb.c
> +++ b/arch/sparc/mm/tlb.c
> @@ -249,6 +249,7 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>  {
>          pmd_t old, entry;
>
> +        VM_WARN_ON_ONCE(!pmd_present(*pmdp));
>          entry = __pmd(pmd_val(*pmdp) & ~_PAGE_VALID);
>          old = pmdp_establish(vma, address, pmdp, entry);
>          flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index d007591b8059..103cbccf1d7d 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -631,6 +631,8 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
>  pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
>                           pmd_t *pmdp)
>  {
> +        VM_WARN_ON_ONCE(!pmd_present(*pmdp));
> +
>          /*
>           * No flush is necessary. Once an invalid PTE is established, the PTE's
>           * access and dirty bits cannot be updated.
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 89f58c7603b2..dd1fc105f70b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2493,32 +2493,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>                  return __split_huge_zero_page_pmd(vma, haddr, pmd);
>          }
>
> -        /*
> -         * Up to this point the pmd is present and huge and userland has the
> -         * whole access to the hugepage during the split (which happens in
> -         * place). If we overwrite the pmd with the not-huge version pointing
> -         * to the pte here (which of course we could if all CPUs were bug
> -         * free), userland could trigger a small page size TLB miss on the
> -         * small sized TLB while the hugepage TLB entry is still established in
> -         * the huge TLB. Some CPU doesn't like that.
> -         * See http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf, Erratum
> -         * 383 on page 105. Intel should be safe but is also warns that it's
> -         * only safe if the permission and cache attributes of the two entries
> -         * loaded in the two TLB is identical (which should be the case here).
> -         * But it is generally safer to never allow small and huge TLB entries
> -         * for the same virtual address to be loaded simultaneously. So instead
> -         * of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the
> -         * current pmd notpresent (atomically because here the pmd_trans_huge
> -         * must remain set at all times on the pmd until the split is complete
> -         * for this pmd), then we flush the SMP TLB and finally we write the
> -         * non-huge version of the pmd entry with pmd_populate.
> -         */
> -        old_pmd = pmdp_invalidate(vma, haddr, pmd);
> -
> -        pmd_migration = is_pmd_migration_entry(old_pmd);
> +        pmd_migration = is_pmd_migration_entry(*pmd);
>          if (unlikely(pmd_migration)) {
>                  swp_entry_t entry;
>
> +                old_pmd = *pmd;
>                  entry = pmd_to_swp_entry(old_pmd);
>                  page = pfn_swap_entry_to_page(entry);
>                  write = is_writable_migration_entry(entry);
> @@ -2529,6 +2508,30 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>                  soft_dirty = pmd_swp_soft_dirty(old_pmd);
>                  uffd_wp = pmd_swp_uffd_wp(old_pmd);
>          } else {
> +                /*
> +                 * Up to this point the pmd is present and huge and userland has
> +                 * the whole access to the hugepage during the split (which
> +                 * happens in place). If we overwrite the pmd with the not-huge
> +                 * version pointing to the pte here (which of course we could if
> +                 * all CPUs were bug free), userland could trigger a small page
> +                 * size TLB miss on the small sized TLB while the hugepage TLB
> +                 * entry is still established in the huge TLB. Some CPU doesn't
> +                 * like that. See
> +                 * http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf, Erratum
> +                 * 383 on page 105. Intel should be safe but is also warns that
> +                 * it's only safe if the permission and cache attributes of the
> +                 * two entries loaded in the two TLB is identical (which should
> +                 * be the case here). But it is generally safer to never allow
> +                 * small and huge TLB entries for the same virtual address to be
> +                 * loaded simultaneously. So instead of doing "pmd_populate();
> +                 * flush_pmd_tlb_range();" we first mark the current pmd
> +                 * notpresent (atomically because here the pmd_trans_huge must
> +                 * remain set at all times on the pmd until the split is
> +                 * complete for this pmd), then we flush the SMP TLB and finally
> +                 * we write the non-huge version of the pmd entry with
> +                 * pmd_populate.
> +                 */
> +                old_pmd = pmdp_invalidate(vma, haddr, pmd);
>                  page = pmd_page(old_pmd);
>                  folio = page_folio(page);
>                  if (pmd_dirty(old_pmd)) {
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 4fcd959dcc4d..a78a4adf711a 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -198,6 +198,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>  pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>                        pmd_t *pmdp)
>  {
> +        VM_WARN_ON_ONCE(!pmd_present(*pmdp));
>          pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
>          flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>          return old;
> @@ -208,6 +209,7 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>  pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
>                           pmd_t *pmdp)
>  {
> +        VM_WARN_ON_ONCE(!pmd_present(*pmdp));
>          return pmdp_invalidate(vma, address, pmdp);
>  }
>  #endif

Rest LGTM, but let's wait for this to run on multiple platforms.

Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
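A note on the "run on multiple platforms" point: the added checks are VM_WARN_ON_ONCE(), which only produces a runtime warning when CONFIG_DEBUG_VM is enabled. Paraphrasing the shape of include/linux/mmdebug.h (not a verbatim quote):

    #ifdef CONFIG_DEBUG_VM
    #define VM_WARN_ON_ONCE(cond)   (void)WARN_ON_ONCE(cond)
    #else
    #define VM_WARN_ON_ONCE(cond)   BUILD_BUG_ON_INVALID(cond) /* expression is
                                                                * type-checked
                                                                * only; no code
                                                                * is generated */
    #endif

So the warnings cost nothing on production builds, and test coverage needs CONFIG_DEBUG_VM configs for them to catch a non-conformant caller.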
On 02/05/2024 02:27, Zi Yan wrote:
> On 1 May 2024, at 10:33, Ryan Roberts wrote:
>
>> __split_huge_pmd_locked() can be called for a present THP, devmap or
>> (non-present) migration entry. It calls pmdp_invalidate()
>> unconditionally on the pmdp and only determines if it is present or not
>> based on the returned old pmd. This is a problem for the migration entry
>> case because pmd_mkinvalid(), called by pmdp_invalidate(), must only be
>> called for a present pmd.
>>
>> On arm64 at least, pmd_mkinvalid() will mark the pmd such that any
>> future call to pmd_present() will return true. And therefore any
>> lockless pgtable walker could see the migration entry pmd in this state
>> and start interpreting the fields as if it were present, leading to
>> BadThings (TM). GUP-fast appears to be one such lockless pgtable walker.
>>
>> x86 does not suffer the above problem, but instead pmd_mkinvalid() will
>> corrupt the offset field of the swap entry within the swap pte. See link
>> below for discussion of that problem.
>>
>> Fix all of this by only calling pmdp_invalidate() for a present pmd. And
>> for good measure let's add a warning to all implementations of
>> pmdp_invalidate[_ad](). I've manually reviewed all other
>> pmdp_invalidate[_ad]() call sites and believe all others to be
>> conformant.
>>
>> This is a theoretical bug found during code review. I don't have any
>> test case to trigger it in practice.
>>
>> Cc: stable@vger.kernel.org
>> Link: https://lore.kernel.org/all/0dd7827a-6334-439a-8fd0-43c98e6af22b@arm.com/
>> Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>
>> Right, v3; this goes back to the original approach in v1 to fix core-mm rather
>> than push the fix into arm64, since we discovered that x86 can't handle
>> pmd_mkinvalid() being called for non-present pmds either.
>>
>> I'm pulling in more arch maintainers because this version adds some warnings in
>> arch code to help spot incorrect usage.
>>
>> Although Catalin had already accepted v2 (fixing arm64) [2] into for-next/fixes,
>> he's agreed to either remove or revert it.
>>
>>
>> Changes since v1 [1]
>> ====================
>>
>> - Improve pmd_mkinvalid() docs to make it clear it can only be called for a
>>   present pmd (per JohnH, Zi Yan)
>> - Added warnings to arch overrides of pmdp_invalidate[_ad]() (per Zi Yan)
>> - Moved comment next to new location of pmdp_invalidate() (per Zi Yan)
>>
>>
>> [1] https://lore.kernel.org/linux-mm/20240425170704.3379492-1-ryan.roberts@arm.com/
>> [2] https://lore.kernel.org/all/20240430133138.732088-1-ryan.roberts@arm.com/
>>
>> Thanks,
>> Ryan
>>
>>
>>  Documentation/mm/arch_pgtable_helpers.rst |  6 ++-
>>  arch/powerpc/mm/book3s64/pgtable.c        |  1 +
>>  arch/s390/include/asm/pgtable.h           |  4 +-
>>  arch/sparc/mm/tlb.c                       |  1 +
>>  arch/x86/mm/pgtable.c                     |  2 +
>>  mm/huge_memory.c                          | 49 ++++++++++++-----------
>>  mm/pgtable-generic.c                      |  2 +
>>  7 files changed, 39 insertions(+), 26 deletions(-)
>
> The changes in Documentation/mm and mm/* look good to me. Thanks.
> Reviewed-by: Zi Yan <ziy@nvidia.com>

Thanks!

>
> I wonder if making Documentation/mm and mm/* changes in a separate patch
> would be better, since you will not need acks from arch maintainers and
> get the patch in quicker.

Yeah maybe - I considered that, but then decided I'm literally just adding a
debug warning to the arch code so it shouldn't be too controversial. Perhaps
wait a few days for acks and if nothing turns up then I'll re-post with it split?

>
> --
> Best Regards,
> Yan, Zi
On 02/05/2024 04:03, Anshuman Khandual wrote:
>
>
> On 5/1/24 20:03, Ryan Roberts wrote:
>> __split_huge_pmd_locked() can be called for a present THP, devmap or
>> (non-present) migration entry. It calls pmdp_invalidate()
>> unconditionally on the pmdp and only determines if it is present or not
>> based on the returned old pmd. This is a problem for the migration entry
>> case because pmd_mkinvalid(), called by pmdp_invalidate(), must only be
>> called for a present pmd.
>>
>> On arm64 at least, pmd_mkinvalid() will mark the pmd such that any
>> future call to pmd_present() will return true. And therefore any
>> lockless pgtable walker could see the migration entry pmd in this state
>> and start interpreting the fields as if it were present, leading to
>> BadThings (TM). GUP-fast appears to be one such lockless pgtable walker.
>>
>> x86 does not suffer the above problem, but instead pmd_mkinvalid() will
>> corrupt the offset field of the swap entry within the swap pte. See link
>> below for discussion of that problem.
>>
>> Fix all of this by only calling pmdp_invalidate() for a present pmd. And
>> for good measure let's add a warning to all implementations of
>> pmdp_invalidate[_ad](). I've manually reviewed all other
>> pmdp_invalidate[_ad]() call sites and believe all others to be
>> conformant.
>>
>> This is a theoretical bug found during code review. I don't have any
>> test case to trigger it in practice.
>>
>> Cc: stable@vger.kernel.org
>> Link: https://lore.kernel.org/all/0dd7827a-6334-439a-8fd0-43c98e6af22b@arm.com/
>> Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>
>> Right, v3; this goes back to the original approach in v1 to fix core-mm rather
>> than push the fix into arm64, since we discovered that x86 can't handle
>> pmd_mkinvalid() being called for non-present pmds either.
>
> This is a better approach indeed.
>
>>
>> I'm pulling in more arch maintainers because this version adds some warnings in
>> arch code to help spot incorrect usage.
>>
>> Although Catalin had already accepted v2 (fixing arm64) [2] into for-next/fixes,
>> he's agreed to either remove or revert it.
>>
>>
>> Changes since v1 [1]
>> ====================
>>
>> - Improve pmd_mkinvalid() docs to make it clear it can only be called for a
>>   present pmd (per JohnH, Zi Yan)
>> - Added warnings to arch overrides of pmdp_invalidate[_ad]() (per Zi Yan)
>> - Moved comment next to new location of pmdp_invalidate() (per Zi Yan)
>>
>>
>> [1] https://lore.kernel.org/linux-mm/20240425170704.3379492-1-ryan.roberts@arm.com/
>> [2] https://lore.kernel.org/all/20240430133138.732088-1-ryan.roberts@arm.com/
>>
>> Thanks,
>> Ryan
>>
>>
>>  Documentation/mm/arch_pgtable_helpers.rst |  6 ++-
>>  arch/powerpc/mm/book3s64/pgtable.c        |  1 +
>>  arch/s390/include/asm/pgtable.h           |  4 +-
>>  arch/sparc/mm/tlb.c                       |  1 +
>>  arch/x86/mm/pgtable.c                     |  2 +
>>  mm/huge_memory.c                          | 49 ++++++++++++-----------
>>  mm/pgtable-generic.c                      |  2 +
>>  7 files changed, 39 insertions(+), 26 deletions(-)
>>
>> diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst
>> index 2466d3363af7..ad50ca6f495e 100644
>> --- a/Documentation/mm/arch_pgtable_helpers.rst
>> +++ b/Documentation/mm/arch_pgtable_helpers.rst
>> @@ -140,7 +140,8 @@ PMD Page Table Helpers
>>  +---------------------------+--------------------------------------------------+
>>  | pmd_swp_clear_soft_dirty  | Clears a soft dirty swapped PMD                  |
>>  +---------------------------+--------------------------------------------------+
>> -| pmd_mkinvalid             | Invalidates a mapped PMD [1]                     |
>> +| pmd_mkinvalid             | Invalidates a present PMD; do not call for      |
>> +|                           | non-present PMD [1]                              |
>>  +---------------------------+--------------------------------------------------+
>>  | pmd_set_huge              | Creates a PMD huge mapping                       |
>>  +---------------------------+--------------------------------------------------+
>> @@ -196,7 +197,8 @@ PUD Page Table Helpers
>>  +---------------------------+--------------------------------------------------+
>>  | pud_mkdevmap              | Creates a ZONE_DEVICE mapped PUD                 |
>>  +---------------------------+--------------------------------------------------+
>> -| pud_mkinvalid             | Invalidates a mapped PUD [1]                     |
>> +| pud_mkinvalid             | Invalidates a present PUD; do not call for      |
>> +|                           | non-present PUD [1]                              |
>>  +---------------------------+--------------------------------------------------+
>>  | pud_set_huge              | Creates a PUD huge mapping                       |
>>  +---------------------------+--------------------------------------------------+
>
> LGTM, but I guess this will conflict with your other patch for mm/debug_vm_pgtable.c
> if you choose to update the pud_mkinvalid() description for pmd_leaf().
>
> https://lore.kernel.org/all/20240501144439.1389048-1-ryan.roberts@arm.com/

Indeed, a reason to avoid adding docs to that patch :)

>
>> diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
>> index 83823db3488b..2975ea0841ba 100644
>> --- a/arch/powerpc/mm/book3s64/pgtable.c
>> +++ b/arch/powerpc/mm/book3s64/pgtable.c
>> @@ -170,6 +170,7 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>>  {
>>          unsigned long old_pmd;
>>
>> +        VM_WARN_ON_ONCE(!pmd_present(*pmdp));
>>          old_pmd = pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, _PAGE_INVALID);
>>          flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>>          return __pmd(old_pmd);
>> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
>> index 60950e7a25f5..480bea44559d 100644
>> --- a/arch/s390/include/asm/pgtable.h
>> +++ b/arch/s390/include/asm/pgtable.h
>> @@ -1768,8 +1768,10 @@ static inline pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma,
>>  static inline pmd_t pmdp_invalidate(struct vm_area_struct *vma,
>>                                      unsigned long addr, pmd_t *pmdp)
>>  {
>> -        pmd_t pmd = __pmd(pmd_val(*pmdp) | _SEGMENT_ENTRY_INVALID);
>> +        pmd_t pmd;
>>
>> +        VM_WARN_ON_ONCE(!pmd_present(*pmdp));
>> +        pmd = __pmd(pmd_val(*pmdp) | _SEGMENT_ENTRY_INVALID);
>>          return pmdp_xchg_direct(vma->vm_mm, addr, pmdp, pmd);
>>  }
>>
>> diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
>> index b44d79d778c7..ef69127d7e5e 100644
>> --- a/arch/sparc/mm/tlb.c
>> +++ b/arch/sparc/mm/tlb.c
>> @@ -249,6 +249,7 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>>  {
>>          pmd_t old, entry;
>>
>> +        VM_WARN_ON_ONCE(!pmd_present(*pmdp));
>>          entry = __pmd(pmd_val(*pmdp) & ~_PAGE_VALID);
>>          old = pmdp_establish(vma, address, pmdp, entry);
>>          flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
>> index d007591b8059..103cbccf1d7d 100644
>> --- a/arch/x86/mm/pgtable.c
>> +++ b/arch/x86/mm/pgtable.c
>> @@ -631,6 +631,8 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
>>  pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
>>                           pmd_t *pmdp)
>>  {
>> +        VM_WARN_ON_ONCE(!pmd_present(*pmdp));
>> +
>>          /*
>>           * No flush is necessary. Once an invalid PTE is established, the PTE's
>>           * access and dirty bits cannot be updated.
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 89f58c7603b2..dd1fc105f70b 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2493,32 +2493,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>                  return __split_huge_zero_page_pmd(vma, haddr, pmd);
>>          }
>>
>> -        /*
>> -         * Up to this point the pmd is present and huge and userland has the
>> -         * whole access to the hugepage during the split (which happens in
>> -         * place). If we overwrite the pmd with the not-huge version pointing
>> -         * to the pte here (which of course we could if all CPUs were bug
>> -         * free), userland could trigger a small page size TLB miss on the
>> -         * small sized TLB while the hugepage TLB entry is still established in
>> -         * the huge TLB. Some CPU doesn't like that.
>> -         * See http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf, Erratum
>> -         * 383 on page 105. Intel should be safe but is also warns that it's
>> -         * only safe if the permission and cache attributes of the two entries
>> -         * loaded in the two TLB is identical (which should be the case here).
>> -         * But it is generally safer to never allow small and huge TLB entries
>> -         * for the same virtual address to be loaded simultaneously. So instead
>> -         * of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the
>> -         * current pmd notpresent (atomically because here the pmd_trans_huge
>> -         * must remain set at all times on the pmd until the split is complete
>> -         * for this pmd), then we flush the SMP TLB and finally we write the
>> -         * non-huge version of the pmd entry with pmd_populate.
>> -         */
>> -        old_pmd = pmdp_invalidate(vma, haddr, pmd);
>> -
>> -        pmd_migration = is_pmd_migration_entry(old_pmd);
>> +        pmd_migration = is_pmd_migration_entry(*pmd);
>>          if (unlikely(pmd_migration)) {
>>                  swp_entry_t entry;
>>
>> +                old_pmd = *pmd;
>>                  entry = pmd_to_swp_entry(old_pmd);
>>                  page = pfn_swap_entry_to_page(entry);
>>                  write = is_writable_migration_entry(entry);
>> @@ -2529,6 +2508,30 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>                  soft_dirty = pmd_swp_soft_dirty(old_pmd);
>>                  uffd_wp = pmd_swp_uffd_wp(old_pmd);
>>          } else {
>> +                /*
>> +                 * Up to this point the pmd is present and huge and userland has
>> +                 * the whole access to the hugepage during the split (which
>> +                 * happens in place). If we overwrite the pmd with the not-huge
>> +                 * version pointing to the pte here (which of course we could if
>> +                 * all CPUs were bug free), userland could trigger a small page
>> +                 * size TLB miss on the small sized TLB while the hugepage TLB
>> +                 * entry is still established in the huge TLB. Some CPU doesn't
>> +                 * like that. See
>> +                 * http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf, Erratum
>> +                 * 383 on page 105. Intel should be safe but is also warns that
>> +                 * it's only safe if the permission and cache attributes of the
>> +                 * two entries loaded in the two TLB is identical (which should
>> +                 * be the case here). But it is generally safer to never allow
>> +                 * small and huge TLB entries for the same virtual address to be
>> +                 * loaded simultaneously. So instead of doing "pmd_populate();
>> +                 * flush_pmd_tlb_range();" we first mark the current pmd
>> +                 * notpresent (atomically because here the pmd_trans_huge must
>> +                 * remain set at all times on the pmd until the split is
>> +                 * complete for this pmd), then we flush the SMP TLB and finally
>> +                 * we write the non-huge version of the pmd entry with
>> +                 * pmd_populate.
>> +                 */
>> +                old_pmd = pmdp_invalidate(vma, haddr, pmd);
>>                  page = pmd_page(old_pmd);
>>                  folio = page_folio(page);
>>                  if (pmd_dirty(old_pmd)) {
>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index 4fcd959dcc4d..a78a4adf711a 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -198,6 +198,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>>  pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>>                        pmd_t *pmdp)
>>  {
>> +        VM_WARN_ON_ONCE(!pmd_present(*pmdp));
>>          pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
>>          flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>>          return old;
>> @@ -208,6 +209,7 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>>  pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
>>                           pmd_t *pmdp)
>>  {
>> +        VM_WARN_ON_ONCE(!pmd_present(*pmdp));
>>          return pmdp_invalidate(vma, address, pmdp);
>>  }
>>  #endif
>
> Rest LGTM, but let's wait for this to run on multiple platforms.
>
> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>

Thanks!
On 01.05.24 16:33, Ryan Roberts wrote:
> __split_huge_pmd_locked() can be called for a present THP, devmap or
> (non-present) migration entry. It calls pmdp_invalidate()
> unconditionally on the pmdp and only determines if it is present or not
> based on the returned old pmd. This is a problem for the migration entry
> case because pmd_mkinvalid(), called by pmdp_invalidate(), must only be
> called for a present pmd.
>
> On arm64 at least, pmd_mkinvalid() will mark the pmd such that any
> future call to pmd_present() will return true. And therefore any
> lockless pgtable walker could see the migration entry pmd in this state
> and start interpreting the fields as if it were present, leading to
> BadThings (TM). GUP-fast appears to be one such lockless pgtable walker.
>
> x86 does not suffer the above problem, but instead pmd_mkinvalid() will
> corrupt the offset field of the swap entry within the swap pte. See link
> below for discussion of that problem.

Could that explain:

https://lore.kernel.org/all/YjoGbhreg8lGCGIJ@linutronix.de/

Where the PFN of a migration entry might have been corrupted?

Ccing Felix

Patch itself looks good to me.

Acked-by: David Hildenbrand <david@redhat.com>
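For readers following the x86 angle: a sketch of the mechanism, based on the thread linked from the commit message (the helper is paraphrased from the x86 code at the time, so treat the details as illustrative rather than authoritative):

    /*
     * x86's pmd_mkinvalid() round-trips the entry through the pfn
     * helpers, roughly:
     *
     *      static inline pmd_t pmd_mkinvalid(pmd_t pmd)
     *      {
     *              return pfn_pmd(pmd_pfn(pmd),
     *                             __pgprot(pmd_flags(pmd) &
     *                                      ~(_PAGE_PRESENT | _PAGE_PROTNONE)));
     *      }
     *
     * pmd_pfn() and pfn_pmd() assume the *present* entry layout. A PMD
     * migration entry instead encodes swp_type/swp_offset in those bits,
     * so masking and re-shifting "the pfn" mangles part of the swap
     * offset. A silently changed migration target pfn would be
     * consistent with the corrupted-PFN report linked above.
     */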
On 2 May 2024, at 3:33, Ryan Roberts wrote:

> On 02/05/2024 02:27, Zi Yan wrote:
>> On 1 May 2024, at 10:33, Ryan Roberts wrote:
>>
>>> __split_huge_pmd_locked() can be called for a present THP, devmap or
>>> (non-present) migration entry. It calls pmdp_invalidate()
>>> unconditionally on the pmdp and only determines if it is present or not
>>> based on the returned old pmd. This is a problem for the migration entry
>>> case because pmd_mkinvalid(), called by pmdp_invalidate(), must only be
>>> called for a present pmd.
>>>
>>> On arm64 at least, pmd_mkinvalid() will mark the pmd such that any
>>> future call to pmd_present() will return true. And therefore any
>>> lockless pgtable walker could see the migration entry pmd in this state
>>> and start interpreting the fields as if it were present, leading to
>>> BadThings (TM). GUP-fast appears to be one such lockless pgtable walker.
>>>
>>> x86 does not suffer the above problem, but instead pmd_mkinvalid() will
>>> corrupt the offset field of the swap entry within the swap pte. See link
>>> below for discussion of that problem.
>>>
>>> Fix all of this by only calling pmdp_invalidate() for a present pmd. And
>>> for good measure let's add a warning to all implementations of
>>> pmdp_invalidate[_ad](). I've manually reviewed all other
>>> pmdp_invalidate[_ad]() call sites and believe all others to be
>>> conformant.
>>>
>>> This is a theoretical bug found during code review. I don't have any
>>> test case to trigger it in practice.
>>>
>>> Cc: stable@vger.kernel.org
>>> Link: https://lore.kernel.org/all/0dd7827a-6334-439a-8fd0-43c98e6af22b@arm.com/
>>> Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>
>>> Right, v3; this goes back to the original approach in v1 to fix core-mm rather
>>> than push the fix into arm64, since we discovered that x86 can't handle
>>> pmd_mkinvalid() being called for non-present pmds either.
>>>
>>> I'm pulling in more arch maintainers because this version adds some warnings in
>>> arch code to help spot incorrect usage.
>>>
>>> Although Catalin had already accepted v2 (fixing arm64) [2] into for-next/fixes,
>>> he's agreed to either remove or revert it.
>>>
>>>
>>> Changes since v1 [1]
>>> ====================
>>>
>>> - Improve pmd_mkinvalid() docs to make it clear it can only be called for a
>>>   present pmd (per JohnH, Zi Yan)
>>> - Added warnings to arch overrides of pmdp_invalidate[_ad]() (per Zi Yan)
>>> - Moved comment next to new location of pmdp_invalidate() (per Zi Yan)
>>>
>>>
>>> [1] https://lore.kernel.org/linux-mm/20240425170704.3379492-1-ryan.roberts@arm.com/
>>> [2] https://lore.kernel.org/all/20240430133138.732088-1-ryan.roberts@arm.com/
>>>
>>> Thanks,
>>> Ryan
>>>
>>>
>>>  Documentation/mm/arch_pgtable_helpers.rst |  6 ++-
>>>  arch/powerpc/mm/book3s64/pgtable.c        |  1 +
>>>  arch/s390/include/asm/pgtable.h           |  4 +-
>>>  arch/sparc/mm/tlb.c                       |  1 +
>>>  arch/x86/mm/pgtable.c                     |  2 +
>>>  mm/huge_memory.c                          | 49 ++++++++++++-----------
>>>  mm/pgtable-generic.c                      |  2 +
>>>  7 files changed, 39 insertions(+), 26 deletions(-)
>>
>> The changes in Documentation/mm and mm/* look good to me. Thanks.
>> Reviewed-by: Zi Yan <ziy@nvidia.com>
>
> Thanks!
>
>>
>> I wonder if making Documentation/mm and mm/* changes in a separate patch
>> would be better, since you will not need acks from arch maintainers and
>> get the patch in quicker.
>
> Yeah maybe - I considered that, but then decided I'm literally just adding a
> debug warning to the arch code so it shouldn't be too controversial. Perhaps
> wait a few days for acks and if nothing turns up then I'll re-post with it split?

Sure, sounds good to me. Thanks.

--
Best Regards,
Yan, Zi
On 02/05/2024 14:08, David Hildenbrand wrote:
> On 01.05.24 16:33, Ryan Roberts wrote:
>> __split_huge_pmd_locked() can be called for a present THP, devmap or
>> (non-present) migration entry. It calls pmdp_invalidate()
>> unconditionally on the pmdp and only determines if it is present or not
>> based on the returned old pmd. This is a problem for the migration entry
>> case because pmd_mkinvalid(), called by pmdp_invalidate(), must only be
>> called for a present pmd.
>>
>> On arm64 at least, pmd_mkinvalid() will mark the pmd such that any
>> future call to pmd_present() will return true. And therefore any
>> lockless pgtable walker could see the migration entry pmd in this state
>> and start interpreting the fields as if it were present, leading to
>> BadThings (TM). GUP-fast appears to be one such lockless pgtable walker.
>>
>> x86 does not suffer the above problem, but instead pmd_mkinvalid() will
>> corrupt the offset field of the swap entry within the swap pte. See link
>> below for discussion of that problem.
>
> Could that explain:
>
> https://lore.kernel.org/all/YjoGbhreg8lGCGIJ@linutronix.de/
>
> Where the PFN of a migration entry might have been corrupted?

Ahh interesting! Yes, it seems to fit...

>
> Ccing Felix

Are you able to reliably reproduce the bug, Felix? If so, would you mind trying
with this patch to see if it goes away?

>
>
> Patch itself looks good to me.
>
> Acked-by: David Hildenbrand <david@redhat.com>

Thanks!
On 2024-05-02 09:47, Ryan Roberts wrote:
> On 02/05/2024 14:08, David Hildenbrand wrote:
>> On 01.05.24 16:33, Ryan Roberts wrote:
>>> __split_huge_pmd_locked() can be called for a present THP, devmap or
>>> (non-present) migration entry. It calls pmdp_invalidate()
>>> unconditionally on the pmdp and only determines if it is present or not
>>> based on the returned old pmd. This is a problem for the migration entry
>>> case because pmd_mkinvalid(), called by pmdp_invalidate(), must only be
>>> called for a present pmd.
>>>
>>> On arm64 at least, pmd_mkinvalid() will mark the pmd such that any
>>> future call to pmd_present() will return true. And therefore any
>>> lockless pgtable walker could see the migration entry pmd in this state
>>> and start interpreting the fields as if it were present, leading to
>>> BadThings (TM). GUP-fast appears to be one such lockless pgtable walker.
>>>
>>> x86 does not suffer the above problem, but instead pmd_mkinvalid() will
>>> corrupt the offset field of the swap entry within the swap pte. See link
>>> below for discussion of that problem.
>> Could that explain:
>>
>> https://lore.kernel.org/all/YjoGbhreg8lGCGIJ@linutronix.de/
>>
>> Where the PFN of a migration entry might have been corrupted?
> Ahh interesting! Yes, it seems to fit...
>
>> Ccing Felix
> Are you able to reliably reproduce the bug, Felix? If so, would you mind trying
> with this patch to see if it goes away?

Sorry, this question got lost and I found it while cleaning my inbox. The team
that ran into this problem reported that it was a one-off failure. They didn't
see this problem again. So I can't help with verifying the fix.

Regards,
  Felix

>
>> Patch itself looks good to me.
>>
>> Acked-by: David Hildenbrand <david@redhat.com>
> Thanks!
>
diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst
index 2466d3363af7..ad50ca6f495e 100644
--- a/Documentation/mm/arch_pgtable_helpers.rst
+++ b/Documentation/mm/arch_pgtable_helpers.rst
@@ -140,7 +140,8 @@ PMD Page Table Helpers
 +---------------------------+--------------------------------------------------+
 | pmd_swp_clear_soft_dirty  | Clears a soft dirty swapped PMD                  |
 +---------------------------+--------------------------------------------------+
-| pmd_mkinvalid             | Invalidates a mapped PMD [1]                     |
+| pmd_mkinvalid             | Invalidates a present PMD; do not call for      |
+|                           | non-present PMD [1]                              |
 +---------------------------+--------------------------------------------------+
 | pmd_set_huge              | Creates a PMD huge mapping                       |
 +---------------------------+--------------------------------------------------+
@@ -196,7 +197,8 @@ PUD Page Table Helpers
 +---------------------------+--------------------------------------------------+
 | pud_mkdevmap              | Creates a ZONE_DEVICE mapped PUD                 |
 +---------------------------+--------------------------------------------------+
-| pud_mkinvalid             | Invalidates a mapped PUD [1]                     |
+| pud_mkinvalid             | Invalidates a present PUD; do not call for      |
+|                           | non-present PUD [1]                              |
 +---------------------------+--------------------------------------------------+
 | pud_set_huge              | Creates a PUD huge mapping                       |
 +---------------------------+--------------------------------------------------+
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 83823db3488b..2975ea0841ba 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -170,6 +170,7 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 {
         unsigned long old_pmd;

+        VM_WARN_ON_ONCE(!pmd_present(*pmdp));
         old_pmd = pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, _PAGE_INVALID);
         flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
         return __pmd(old_pmd);
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 60950e7a25f5..480bea44559d 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1768,8 +1768,10 @@ static inline pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma,
 static inline pmd_t pmdp_invalidate(struct vm_area_struct *vma,
                                     unsigned long addr, pmd_t *pmdp)
 {
-        pmd_t pmd = __pmd(pmd_val(*pmdp) | _SEGMENT_ENTRY_INVALID);
+        pmd_t pmd;

+        VM_WARN_ON_ONCE(!pmd_present(*pmdp));
+        pmd = __pmd(pmd_val(*pmdp) | _SEGMENT_ENTRY_INVALID);
         return pmdp_xchg_direct(vma->vm_mm, addr, pmdp, pmd);
 }

diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index b44d79d778c7..ef69127d7e5e 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -249,6 +249,7 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 {
         pmd_t old, entry;

+        VM_WARN_ON_ONCE(!pmd_present(*pmdp));
         entry = __pmd(pmd_val(*pmdp) & ~_PAGE_VALID);
         old = pmdp_establish(vma, address, pmdp, entry);
         flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index d007591b8059..103cbccf1d7d 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -631,6 +631,8 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
 pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
                          pmd_t *pmdp)
 {
+        VM_WARN_ON_ONCE(!pmd_present(*pmdp));
+
         /*
          * No flush is necessary. Once an invalid PTE is established, the PTE's
          * access and dirty bits cannot be updated.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 89f58c7603b2..dd1fc105f70b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2493,32 +2493,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
                 return __split_huge_zero_page_pmd(vma, haddr, pmd);
         }

-        /*
-         * Up to this point the pmd is present and huge and userland has the
-         * whole access to the hugepage during the split (which happens in
-         * place). If we overwrite the pmd with the not-huge version pointing
-         * to the pte here (which of course we could if all CPUs were bug
-         * free), userland could trigger a small page size TLB miss on the
-         * small sized TLB while the hugepage TLB entry is still established in
-         * the huge TLB. Some CPU doesn't like that.
-         * See http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf, Erratum
-         * 383 on page 105. Intel should be safe but is also warns that it's
-         * only safe if the permission and cache attributes of the two entries
-         * loaded in the two TLB is identical (which should be the case here).
-         * But it is generally safer to never allow small and huge TLB entries
-         * for the same virtual address to be loaded simultaneously. So instead
-         * of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the
-         * current pmd notpresent (atomically because here the pmd_trans_huge
-         * must remain set at all times on the pmd until the split is complete
-         * for this pmd), then we flush the SMP TLB and finally we write the
-         * non-huge version of the pmd entry with pmd_populate.
-         */
-        old_pmd = pmdp_invalidate(vma, haddr, pmd);
-
-        pmd_migration = is_pmd_migration_entry(old_pmd);
+        pmd_migration = is_pmd_migration_entry(*pmd);
         if (unlikely(pmd_migration)) {
                 swp_entry_t entry;

+                old_pmd = *pmd;
                 entry = pmd_to_swp_entry(old_pmd);
                 page = pfn_swap_entry_to_page(entry);
                 write = is_writable_migration_entry(entry);
@@ -2529,6 +2508,30 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
                 soft_dirty = pmd_swp_soft_dirty(old_pmd);
                 uffd_wp = pmd_swp_uffd_wp(old_pmd);
         } else {
+                /*
+                 * Up to this point the pmd is present and huge and userland has
+                 * the whole access to the hugepage during the split (which
+                 * happens in place). If we overwrite the pmd with the not-huge
+                 * version pointing to the pte here (which of course we could if
+                 * all CPUs were bug free), userland could trigger a small page
+                 * size TLB miss on the small sized TLB while the hugepage TLB
+                 * entry is still established in the huge TLB. Some CPU doesn't
+                 * like that. See
+                 * http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf, Erratum
+                 * 383 on page 105. Intel should be safe but is also warns that
+                 * it's only safe if the permission and cache attributes of the
+                 * two entries loaded in the two TLB is identical (which should
+                 * be the case here). But it is generally safer to never allow
+                 * small and huge TLB entries for the same virtual address to be
+                 * loaded simultaneously. So instead of doing "pmd_populate();
+                 * flush_pmd_tlb_range();" we first mark the current pmd
+                 * notpresent (atomically because here the pmd_trans_huge must
+                 * remain set at all times on the pmd until the split is
+                 * complete for this pmd), then we flush the SMP TLB and finally
+                 * we write the non-huge version of the pmd entry with
+                 * pmd_populate.
+                 */
+                old_pmd = pmdp_invalidate(vma, haddr, pmd);
                 page = pmd_page(old_pmd);
                 folio = page_folio(page);
                 if (pmd_dirty(old_pmd)) {
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4fcd959dcc4d..a78a4adf711a 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -198,6 +198,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
                       pmd_t *pmdp)
 {
+        VM_WARN_ON_ONCE(!pmd_present(*pmdp));
         pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
         flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
         return old;
@@ -208,6 +209,7 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
                          pmd_t *pmdp)
 {
+        VM_WARN_ON_ONCE(!pmd_present(*pmdp));
         return pmdp_invalidate(vma, address, pmdp);
 }
 #endif
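Condensed, the mm/huge_memory.c change leaves __split_huge_pmd_locked() with the following shape (a sketch of the resulting control flow, eliding everything not relevant to the fix):

        pmd_migration = is_pmd_migration_entry(*pmd);
        if (unlikely(pmd_migration)) {
                /*
                 * Non-present migration entry: read the fields directly;
                 * pmdp_invalidate() / pmd_mkinvalid() are never called
                 * on it.
                 */
                old_pmd = *pmd;
                entry = pmd_to_swp_entry(old_pmd);
                /* ... decode write/young/dirty/soft_dirty/uffd_wp ... */
        } else {
                /*
                 * Present pmd: invalidate first (see the Erratum 383
                 * comment) so huge and small TLB entries for the same
                 * address cannot coexist, then decode the returned old
                 * value.
                 */
                old_pmd = pmdp_invalidate(vma, haddr, pmd);
                page = pmd_page(old_pmd);
                /* ... decode from old_pmd ... */
        }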
__split_huge_pmd_locked() can be called for a present THP, devmap or
(non-present) migration entry. It calls pmdp_invalidate()
unconditionally on the pmdp and only determines if it is present or not
based on the returned old pmd. This is a problem for the migration entry
case because pmd_mkinvalid(), called by pmdp_invalidate(), must only be
called for a present pmd.

On arm64 at least, pmd_mkinvalid() will mark the pmd such that any
future call to pmd_present() will return true. And therefore any
lockless pgtable walker could see the migration entry pmd in this state
and start interpreting the fields as if it were present, leading to
BadThings (TM). GUP-fast appears to be one such lockless pgtable walker.

x86 does not suffer the above problem, but instead pmd_mkinvalid() will
corrupt the offset field of the swap entry within the swap pte. See link
below for discussion of that problem.

Fix all of this by only calling pmdp_invalidate() for a present pmd. And
for good measure let's add a warning to all implementations of
pmdp_invalidate[_ad](). I've manually reviewed all other
pmdp_invalidate[_ad]() call sites and believe all others to be
conformant.

This is a theoretical bug found during code review. I don't have any
test case to trigger it in practice.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/0dd7827a-6334-439a-8fd0-43c98e6af22b@arm.com/
Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---

Right, v3; this goes back to the original approach in v1 to fix core-mm rather
than push the fix into arm64, since we discovered that x86 can't handle
pmd_mkinvalid() being called for non-present pmds either.

I'm pulling in more arch maintainers because this version adds some warnings in
arch code to help spot incorrect usage.

Although Catalin had already accepted v2 (fixing arm64) [2] into for-next/fixes,
he's agreed to either remove or revert it.


Changes since v1 [1]
====================

- Improve pmd_mkinvalid() docs to make it clear it can only be called for a
  present pmd (per JohnH, Zi Yan)
- Added warnings to arch overrides of pmdp_invalidate[_ad]() (per Zi Yan)
- Moved comment next to new location of pmdp_invalidate() (per Zi Yan)


[1] https://lore.kernel.org/linux-mm/20240425170704.3379492-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/all/20240430133138.732088-1-ryan.roberts@arm.com/

Thanks,
Ryan


 Documentation/mm/arch_pgtable_helpers.rst |  6 ++-
 arch/powerpc/mm/book3s64/pgtable.c        |  1 +
 arch/s390/include/asm/pgtable.h           |  4 +-
 arch/sparc/mm/tlb.c                       |  1 +
 arch/x86/mm/pgtable.c                     |  2 +
 mm/huge_memory.c                          | 49 ++++++++++++-----------
 mm/pgtable-generic.c                      |  2 +
 7 files changed, 39 insertions(+), 26 deletions(-)

--
2.25.1