Message ID | 20240813-uffd-thp-flip-fix-v2-1-5efa61078a41@google.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | userfaultfd: fix races around pmd_trans_huge() check | expand |
On 13.08.24 22:25, Jann Horn wrote: > This fixes two issues. > > I discovered that the following race can occur: > > mfill_atomic other thread > ============ ============ > <zap PMD> > pmdp_get_lockless() [reads none pmd] > <bail if trans_huge> > <if none:> > <pagefault creates transhuge zeropage> > __pte_alloc [no-op] > <zap PMD> > <bail if pmd_trans_huge(*dst_pmd)> > BUG_ON(pmd_none(*dst_pmd)) > > I have experimentally verified this in a kernel with extra mdelay() calls; > the BUG_ON(pmd_none(*dst_pmd)) triggers. > > On kernels newer than commit 0d940a9b270b ("mm/pgtable: allow > pte_offset_map[_lock]() to fail"), this can't lead to anything worse than > a BUG_ON(), since the page table access helpers are actually designed to > deal with page tables concurrently disappearing; but on older kernels > (<=6.4), I think we could probably theoretically race past the two BUG_ON() > checks and end up treating a hugepage as a page table. > > The second issue is that, as Qi Zheng pointed out, there are other types of > huge PMDs that pmd_trans_huge() can't catch: devmap PMDs and swap PMDs > (in particular, migration PMDs). > On <=6.4, this is worse than the first issue: If mfill_atomic() runs on a > PMD that contains a migration entry (which just requires winning a single, > fairly wide race), it will pass the PMD to pte_offset_map_lock(), which > assumes that the PMD points to a page table. > Breakage follows: First, the kernel tries to take the PTE lock (which will > crash or maybe worse if there is no "struct page" for the address bits in > the migration entry PMD - I think at least on X86 there usually is no > corresponding "struct page" thanks to the PTE inversion mitigation, amd64 > looks different). > If that didn't crash, the kernel would next try to write a PTE into what it > wrongly thinks is a page table. > > As part of fixing these issues, get rid of the check for pmd_trans_huge() > before __pte_alloc() - that's redundant, we're going to have to check for > that after the __pte_alloc() anyway. > > Backport note: pmdp_get_lockless() is pmd_read_atomic() in older > kernels. > > Reported-by: Qi Zheng <zhengqi.arch@bytedance.com> > Closes: https://lore.kernel.org/r/59bf3c2e-d58b-41af-ab10-3e631d802229@bytedance.com > Cc: stable@vger.kernel.org > Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation") > Signed-off-by: Jann Horn <jannh@google.com> > --- > mm/userfaultfd.c | 22 ++++++++++++---------- > 1 file changed, 12 insertions(+), 10 deletions(-) > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > index e54e5c8907fa..290b2a0d84ac 100644 > --- a/mm/userfaultfd.c > +++ b/mm/userfaultfd.c > @@ -787,21 +787,23 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, > } > > dst_pmdval = pmdp_get_lockless(dst_pmd); > - /* > - * If the dst_pmd is mapped as THP don't > - * override it and just be strict. > - */ > - if (unlikely(pmd_trans_huge(dst_pmdval))) { > - err = -EEXIST; > - break; > - } > if (unlikely(pmd_none(dst_pmdval)) && > unlikely(__pte_alloc(dst_mm, dst_pmd))) { > err = -ENOMEM; > break; > } > - /* If an huge pmd materialized from under us fail */ > - if (unlikely(pmd_trans_huge(*dst_pmd))) { > + dst_pmdval = pmdp_get_lockless(dst_pmd); > + /* > + * If the dst_pmd is THP don't override it and just be strict. > + * (This includes the case where the PMD used to be THP and > + * changed back to none after __pte_alloc().) > + */ > + if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) || > + pmd_devmap(dst_pmdval))) { Likely in the future we should turn the latter part into a "pmd_leaf()" check. > + err = -EEXIST; > + break; > + } > + if (unlikely(pmd_bad(dst_pmdval))) { > err = -EFAULT; > break; > } > Acked-by: David Hildenbrand <david@redhat.com>
On Tue, Aug 13, 2024 at 10:37 PM David Hildenbrand <david@redhat.com> wrote: > On 13.08.24 22:25, Jann Horn wrote: > > + if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) || > > + pmd_devmap(dst_pmdval))) { > > Likely in the future we should turn the latter part into a "pmd_leaf()" > check. Yeah, it'd be a good idea to refactor that as a followup.
On 2024/8/14 04:25, Jann Horn wrote: > This fixes two issues. > > I discovered that the following race can occur: > > mfill_atomic other thread > ============ ============ > <zap PMD> > pmdp_get_lockless() [reads none pmd] > <bail if trans_huge> > <if none:> > <pagefault creates transhuge zeropage> > __pte_alloc [no-op] > <zap PMD> > <bail if pmd_trans_huge(*dst_pmd)> > BUG_ON(pmd_none(*dst_pmd)) > > I have experimentally verified this in a kernel with extra mdelay() calls; > the BUG_ON(pmd_none(*dst_pmd)) triggers. > > On kernels newer than commit 0d940a9b270b ("mm/pgtable: allow > pte_offset_map[_lock]() to fail"), this can't lead to anything worse than > a BUG_ON(), since the page table access helpers are actually designed to > deal with page tables concurrently disappearing; but on older kernels > (<=6.4), I think we could probably theoretically race past the two BUG_ON() > checks and end up treating a hugepage as a page table. > > The second issue is that, as Qi Zheng pointed out, there are other types of > huge PMDs that pmd_trans_huge() can't catch: devmap PMDs and swap PMDs > (in particular, migration PMDs). > On <=6.4, this is worse than the first issue: If mfill_atomic() runs on a > PMD that contains a migration entry (which just requires winning a single, > fairly wide race), it will pass the PMD to pte_offset_map_lock(), which > assumes that the PMD points to a page table. > Breakage follows: First, the kernel tries to take the PTE lock (which will > crash or maybe worse if there is no "struct page" for the address bits in > the migration entry PMD - I think at least on X86 there usually is no > corresponding "struct page" thanks to the PTE inversion mitigation, amd64 > looks different). > If that didn't crash, the kernel would next try to write a PTE into what it > wrongly thinks is a page table. > > As part of fixing these issues, get rid of the check for pmd_trans_huge() > before __pte_alloc() - that's redundant, we're going to have to check for > that after the __pte_alloc() anyway. > > Backport note: pmdp_get_lockless() is pmd_read_atomic() in older > kernels. > > Reported-by: Qi Zheng <zhengqi.arch@bytedance.com> > Closes: https://lore.kernel.org/r/59bf3c2e-d58b-41af-ab10-3e631d802229@bytedance.com Ah, the issue fixed by this patch was not reported by me, so I think that this Reported-by does not need to be added. ;) > Cc: stable@vger.kernel.org > Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation") > Signed-off-by: Jann Horn <jannh@google.com> > --- > mm/userfaultfd.c | 22 ++++++++++++---------- > 1 file changed, 12 insertions(+), 10 deletions(-) > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > index e54e5c8907fa..290b2a0d84ac 100644 > --- a/mm/userfaultfd.c > +++ b/mm/userfaultfd.c > @@ -787,21 +787,23 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, > } > > dst_pmdval = pmdp_get_lockless(dst_pmd); > - /* > - * If the dst_pmd is mapped as THP don't > - * override it and just be strict. > - */ > - if (unlikely(pmd_trans_huge(dst_pmdval))) { > - err = -EEXIST; > - break; > - } > if (unlikely(pmd_none(dst_pmdval)) && > unlikely(__pte_alloc(dst_mm, dst_pmd))) { > err = -ENOMEM; > break; > } > - /* If an huge pmd materialized from under us fail */ > - if (unlikely(pmd_trans_huge(*dst_pmd))) { > + dst_pmdval = pmdp_get_lockless(dst_pmd); > + /* > + * If the dst_pmd is THP don't override it and just be strict. > + * (This includes the case where the PMD used to be THP and > + * changed back to none after __pte_alloc().) > + */ > + if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) || > + pmd_devmap(dst_pmdval))) { > + err = -EEXIST; > + break; > + } > + if (unlikely(pmd_bad(dst_pmdval))) { > err = -EFAULT; > break; > } Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com> >
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index e54e5c8907fa..290b2a0d84ac 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -787,21 +787,23 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, } dst_pmdval = pmdp_get_lockless(dst_pmd); - /* - * If the dst_pmd is mapped as THP don't - * override it and just be strict. - */ - if (unlikely(pmd_trans_huge(dst_pmdval))) { - err = -EEXIST; - break; - } if (unlikely(pmd_none(dst_pmdval)) && unlikely(__pte_alloc(dst_mm, dst_pmd))) { err = -ENOMEM; break; } - /* If an huge pmd materialized from under us fail */ - if (unlikely(pmd_trans_huge(*dst_pmd))) { + dst_pmdval = pmdp_get_lockless(dst_pmd); + /* + * If the dst_pmd is THP don't override it and just be strict. + * (This includes the case where the PMD used to be THP and + * changed back to none after __pte_alloc().) + */ + if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) || + pmd_devmap(dst_pmdval))) { + err = -EEXIST; + break; + } + if (unlikely(pmd_bad(dst_pmdval))) { err = -EFAULT; break; }
This fixes two issues. I discovered that the following race can occur: mfill_atomic other thread ============ ============ <zap PMD> pmdp_get_lockless() [reads none pmd] <bail if trans_huge> <if none:> <pagefault creates transhuge zeropage> __pte_alloc [no-op] <zap PMD> <bail if pmd_trans_huge(*dst_pmd)> BUG_ON(pmd_none(*dst_pmd)) I have experimentally verified this in a kernel with extra mdelay() calls; the BUG_ON(pmd_none(*dst_pmd)) triggers. On kernels newer than commit 0d940a9b270b ("mm/pgtable: allow pte_offset_map[_lock]() to fail"), this can't lead to anything worse than a BUG_ON(), since the page table access helpers are actually designed to deal with page tables concurrently disappearing; but on older kernels (<=6.4), I think we could probably theoretically race past the two BUG_ON() checks and end up treating a hugepage as a page table. The second issue is that, as Qi Zheng pointed out, there are other types of huge PMDs that pmd_trans_huge() can't catch: devmap PMDs and swap PMDs (in particular, migration PMDs). On <=6.4, this is worse than the first issue: If mfill_atomic() runs on a PMD that contains a migration entry (which just requires winning a single, fairly wide race), it will pass the PMD to pte_offset_map_lock(), which assumes that the PMD points to a page table. Breakage follows: First, the kernel tries to take the PTE lock (which will crash or maybe worse if there is no "struct page" for the address bits in the migration entry PMD - I think at least on X86 there usually is no corresponding "struct page" thanks to the PTE inversion mitigation, amd64 looks different). If that didn't crash, the kernel would next try to write a PTE into what it wrongly thinks is a page table. As part of fixing these issues, get rid of the check for pmd_trans_huge() before __pte_alloc() - that's redundant, we're going to have to check for that after the __pte_alloc() anyway. Backport note: pmdp_get_lockless() is pmd_read_atomic() in older kernels. Reported-by: Qi Zheng <zhengqi.arch@bytedance.com> Closes: https://lore.kernel.org/r/59bf3c2e-d58b-41af-ab10-3e631d802229@bytedance.com Cc: stable@vger.kernel.org Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation") Signed-off-by: Jann Horn <jannh@google.com> --- mm/userfaultfd.c | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-)