
[BUG] page table UAF, Re: [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()

Message ID CAG48ez0ZpGzxi=-5O_uGQ0xKXOmbjeQ0LjZsRJ1Qtf2X5eOr1w@mail.gmail.com (mailing list archive)
State New
Series [BUG] page table UAF, Re: [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()

Commit Message

Jann Horn Oct. 7, 2024, 7:05 p.m. UTC
On Fri, Aug 30, 2024 at 6:00 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> Instead of zeroing the vma tree and then overwriting the area, let the
> area be overwritten and then clean up the gathered vmas using
> vms_complete_munmap_vmas().
>
> To ensure locking is downgraded correctly, the mm is set regardless of
> MAP_FIXED or not (NULL vma).
>
> If a driver is mapping over an existing vma, then clear the ptes before
> the call_mmap() invocation.  This is done using the vms_clean_up_area()
> helper.  If there is a close vm_ops, that must also be called to ensure
> any cleanup is done before mapping over the area.  This also means that
> calling open has been added to the abort of an unmap operation, for now.

As currently implemented, this is not a valid optimization because it
violates the (unwritten?) rule that you must not call free_pgd_range()
on a region in the page tables which can concurrently be walked. A
region in the page tables can be concurrently walked if it overlaps a
VMA which is linked into rmaps which are not write-locked.
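
To spell the rule out with a simplified sketch (not the code in this
series): in the normal munmap path a VMA is unhooked from the file rmap
under the rmap write lock before the page tables under it are freed,
which is roughly what free_pgtables() does:

        i_mmap_lock_write(mapping);
        /* rmap walkers can no longer reach the range through this VMA */
        vma_interval_tree_remove(vma, &mapping->i_mmap);
        i_mmap_unlock_write(mapping);
        ...
        /* only safe if no other rmap-visible VMA still overlaps the range */
        free_pgd_range(tlb, addr, vma->vm_end, floor, ceiling);

and the last condition is what the newly expanded VMA described below
ends up violating.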

On Linux 6.12-rc2, when you mmap(MAP_FIXED) over an existing VMA, and
the new mapping is created by expanding an adjacent VMA, the following
race with an ftruncate() is possible (because page tables for the old
mapping are removed while the new VMA in the same location is already
fully set up and linked into the rmap):


task 1 (mmap, MAP_FIXED)     task 2 (ftruncate)
========================     ==================
mmap_region
  vma_merge_new_range
    vma_expand
      commit_merge
        vma_prepare
          [take rmap locks]
        vma_set_range
          [expand adjacent mapping]
        vma_complete
          [drop rmap locks]
  vms_complete_munmap_vmas
    vms_clear_ptes
      unmap_vmas
        [removes ptes]
      free_pgtables
        [unlinks old vma from rmap]
                             unmap_mapping_range
                               unmap_mapping_pages
                                 i_mmap_lock_read
                                 unmap_mapping_range_tree
                                   [loop]
                                     unmap_mapping_range_vma
                                       zap_page_range_single
                                         unmap_single_vma
                                           unmap_page_range
                                             zap_p4d_range
                                               zap_pud_range
                                                 zap_pmd_range
                                                   [looks up pmd entry]
        free_pgd_range
          [frees pmd]
                                                   [UAF pmd entry access]

To reproduce this, apply the attached mmap-vs-truncate-racewiden.diff
to widen the race windows, then build and run the attached reproducer
mmap-fixed-race.c.
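
The reproducer attachment is not inline in this archive (the
race-widening diff is reproduced at the end of this page).  Purely as an
illustration of the steps above - this is not the attached
mmap-fixed-race.c, and the sizes/placement needed for free_pgd_range()
to actually free a page-table page are glossed over - a reproducer has
roughly this shape; the SLOWME1/SLOWME2 task names match the mdelay()s
injected by the race-widening diff:

#include <err.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <unistd.h>

#define MB (1024UL * 1024)

int main(void)
{
        /* tmpfs file, so a shrinking ftruncate() goes through
         * shmem_setattr() -> unmap_mapping_range() */
        int fd = open("/dev/shm/mmap-fixed-race", O_RDWR | O_CREAT, 0600);
        if (fd < 0)
                err(1, "open");
        if (ftruncate(fd, 6 * MB))
                err(1, "ftruncate grow");

        /* reserve a contiguous range, then place two adjacent file-backed
         * VMAs in it: a writable "neighbor" and a read-only "victim"
         * (different protections keep them separate VMAs) */
        char *area = mmap(NULL, 6 * MB, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (area == MAP_FAILED)
                err(1, "mmap reserve");
        if (mmap(area, 2 * MB, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED)
                err(1, "mmap neighbor");
        if (mmap(area + 2 * MB, 4 * MB, PROT_READ,
                 MAP_SHARED | MAP_FIXED, fd, 2 * MB) == MAP_FAILED)
                err(1, "mmap victim");

        /* fault in the victim so it has page tables to tear down */
        for (unsigned long off = 0; off < 4 * MB; off += 4096)
                (void)*(volatile char *)(area + 2 * MB + off);

        pid_t child = fork();
        if (child == 0) {
                /* drop the child's copies so only the parent's VMAs are
                 * walked by the truncate */
                munmap(area, 6 * MB);
                prctl(PR_SET_NAME, "SLOWME2");  /* delayed in zap_pmd_range() */
                usleep(500 * 1000);     /* let the parent reach free_pgtables() */
                /* shrinking truncate -> rmap walk over the file's VMAs */
                if (ftruncate(fd, 0))
                        err(1, "ftruncate shrink");
                _exit(0);
        }

        prctl(PR_SET_NAME, "SLOWME1");          /* delayed in free_pgtables() */
        /* MAP_FIXED over the victim, compatible with the neighbor, so the
         * new mapping is created by expanding the neighbor VMA and the
         * victim's page tables are then freed behind it */
        if (mmap(area + 2 * MB, 4 * MB, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, fd, 2 * MB) == MAP_FAILED)
                err(1, "mmap MAP_FIXED");

        waitpid(child, NULL, 0);
        return 0;
}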

Under a kernel with KASAN, you should ideally get a KASAN splat like this:

[   90.012655][ T1113]
==================================================================
[   90.013937][ T1113] BUG: KASAN: use-after-free in
unmap_page_range+0x2d4e/0x3a80
[   90.015136][ T1113] Read of size 8 at addr ffff888006a3b000 by task
SLOWME2/1113
[   90.016322][ T1113]
[   90.016695][ T1113] CPU: 12 UID: 1000 PID: 1113 Comm: SLOWME2
Tainted: G        W          6.12.0-rc2-dirty #500
[   90.018355][ T1113] Tainted: [W]=WARN
[   90.018959][ T1113] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   90.020598][ T1113] Call Trace:
[   90.021126][ T1113]  <TASK>
[   90.021590][ T1113]  dump_stack_lvl+0x53/0x70
[   90.022307][ T1113]  print_report+0xce/0x670
[   90.023008][ T1113]  ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[   90.023942][ T1113]  ? unmap_page_range+0x2d4e/0x3a80
[   90.024763][ T1113]  kasan_report+0xe2/0x120
[   90.025468][ T1113]  ? unmap_page_range+0x2d4e/0x3a80
[   90.026293][ T1113]  unmap_page_range+0x2d4e/0x3a80
[   90.027087][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.027855][ T1113]  ? next_uptodate_folio+0x148/0x890
[   90.029299][ T1113]  ? set_pte_range+0x265/0x6c0
[   90.030058][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.030826][ T1113]  ? page_table_check_set.part.0+0x2ab/0x3e0
[   90.031773][ T1113]  ? __pfx_unmap_page_range+0x10/0x10
[   90.032622][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.033394][ T1113]  ? unmap_single_vma+0xc6/0x2c0
[   90.034211][ T1113]  zap_page_range_single+0x28a/0x4b0
[   90.035052][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.035821][ T1113]  ? __pfx_zap_page_range_single+0x10/0x10
[   90.036739][ T1113]  ? __pte_offset_map+0x1d/0x250
[   90.037528][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.038295][ T1113]  ? do_fault+0x6c4/0x1270
[   90.038999][ T1113]  ? __pfx___handle_mm_fault+0x10/0x10
[   90.039862][ T1113]  unmap_mapping_range+0x1b2/0x240
[   90.040671][ T1113]  ? __pfx_unmap_mapping_range+0x10/0x10
[   90.041563][ T1113]  ? setattr_prepare+0xed/0x7e0
[   90.042330][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.043097][ T1113]  ? current_time+0x88/0x200
[   90.043826][ T1113]  shmem_setattr+0x880/0xea0
[   90.044556][ T1113]  notify_change+0x42b/0xea0
[   90.045913][ T1113]  ? do_truncate+0x10b/0x1b0
[   90.046641][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.047407][ T1113]  do_truncate+0x10b/0x1b0
[   90.048107][ T1113]  ? __pfx_do_truncate+0x10/0x10
[   90.048890][ T1113]  ? __set_task_comm+0x53/0x140
[   90.049668][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.050472][ T1113]  ? __do_sys_prctl+0x71e/0x11f0
[   90.051262][ T1113]  do_ftruncate+0x334/0x470
[   90.051976][ T1113]  ? srso_return_thunk+0x5/0x5f
[   90.052745][ T1113]  do_sys_ftruncate+0x3d/0x80
[   90.053493][ T1113]  do_syscall_64+0x4b/0x110
[   90.054209][ T1113]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   90.055142][ T1113] RIP: 0033:0x7f89cbf2bf47
[   90.055844][ T1113] Code: 77 01 c3 48 8b 15 b9 1e 0d 00 f7 d8 64 89
02 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 b8 4d
00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 89 1e 0d 00 f7 d8
64 89 02 b8
[   90.058924][ T1113] RSP: 002b:00007fff96968aa8 EFLAGS: 00000213
ORIG_RAX: 000000000000004d
[   90.060248][ T1113] RAX: ffffffffffffffda RBX: 00007fff96968c28
RCX: 00007f89cbf2bf47
[   90.061507][ T1113] RDX: 0000000000000000 RSI: 0000000000001000
RDI: 0000000000000003
[   90.063471][ T1113] RBP: 00007fff96968b10 R08: 0000000000000000
R09: 00007f89cbe29740
[   90.064727][ T1113] R10: 00007f89cbefb443 R11: 0000000000000213
R12: 0000000000000000
[   90.065990][ T1113] R13: 00007fff96968c38 R14: 000055ce5a58add8
R15: 00007f89cc05f020
[   90.067291][ T1113]  </TASK>
[   90.067772][ T1113]
[   90.068147][ T1113] The buggy address belongs to the physical page:
[   90.069168][ T1113] page: refcount:0 mapcount:0
mapping:0000000000000000 index:0xffff888006a3bf50 pfn:0x6a3b
[   90.070741][ T1113] flags: 0x100000000000000(node=0|zone=1)
[   90.071649][ T1113] raw: 0100000000000000 ffffea0000111748
ffffea000020c488 0000000000000000
[   90.073009][ T1113] raw: ffff888006a3bf50 0000000000000000
00000000ffffffff 0000000000000000
[   90.074368][ T1113] page dumped because: kasan: bad access detected
[   90.075382][ T1113]
[   90.075751][ T1113] Memory state around the buggy address:
[   90.076636][ T1113]  ffff888006a3af00: 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00
[   90.077905][ T1113]  ffff888006a3af80: 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00
[   90.079874][ T1113] >ffff888006a3b000: ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff
[   90.081173][ T1113]                    ^
[   90.081821][ T1113]  ffff888006a3b080: ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff
[   90.083134][ T1113]  ffff888006a3b100: ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff
[   90.084416][ T1113]
==================================================================

Comments

Liam R. Howlett Oct. 7, 2024, 8:31 p.m. UTC | #1
* Jann Horn <jannh@google.com> [241007 15:06]:
> On Fri, Aug 30, 2024 at 6:00 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > Instead of zeroing the vma tree and then overwriting the area, let the
> > area be overwritten and then clean up the gathered vmas using
> > vms_complete_munmap_vmas().
> >
> > To ensure locking is downgraded correctly, the mm is set regardless of
> > MAP_FIXED or not (NULL vma).
> >
> > If a driver is mapping over an existing vma, then clear the ptes before
> > the call_mmap() invocation.  This is done using the vms_clean_up_area()
> > helper.  If there is a close vm_ops, that must also be called to ensure
> > any cleanup is done before mapping over the area.  This also means that
> > calling open has been added to the abort of an unmap operation, for now.
> 
> As currently implemented, this is not a valid optimization because it
> violates the (unwritten?) rule that you must not call free_pgd_range()
> on a region in the page tables which can concurrently be walked. A
> region in the page tables can be concurrently walked if it overlaps a
> VMA which is linked into rmaps which are not write-locked.

Just for clarity, this is the rmap write lock.

> 
> On Linux 6.12-rc2, when you mmap(MAP_FIXED) over an existing VMA, and
> the new mapping is created by expanding an adjacent VMA, the following
> race with an ftruncate() is possible (because page tables for the old
> mapping are removed while the new VMA in the same location is already
> fully set up and linked into the rmap):
> 
> 
> task 1 (mmap, MAP_FIXED)     task 2 (ftruncate)
> ========================     ==================
> mmap_region
>   vma_merge_new_range
>     vma_expand
>       commit_merge
>         vma_prepare
>           [take rmap locks]
>         vma_set_range
>           [expand adjacent mapping]
>         vma_complete
>           [drop rmap locks]
>   vms_complete_munmap_vmas
>     vms_clear_ptes
>       unmap_vmas
>         [removes ptes]
>       free_pgtables
>         [unlinks old vma from rmap]
>                              unmap_mapping_range
>                                unmap_mapping_pages
>                                  i_mmap_lock_read
>                                  unmap_mapping_range_tree
>                                    [loop]
>                                      unmap_mapping_range_vma
>                                        zap_page_range_single
>                                          unmap_single_vma
>                                            unmap_page_range
>                                              zap_p4d_range
>                                                zap_pud_range
>                                                  zap_pmd_range
>                                                    [looks up pmd entry]
>         free_pgd_range
>           [frees pmd]
>                                                    [UAF pmd entry access]
> 
> To reproduce this, apply the attached mmap-vs-truncate-racewiden.diff
> to widen the race windows, then build and run the attached reproducer
> mmap-fixed-race.c.
> 
> Under a kernel with KASAN, you should ideally get a KASAN splat like this:

Thanks for all the work you did finding the root cause here, I
appreciate it.

I think the correct fix is to take the rmap lock on free_pgtables, when
necessary.  There are a few code paths (error recovery) that are not
regularly run that will also need to change.

Regards,
Liam
Jann Horn Oct. 7, 2024, 9:31 p.m. UTC | #2
On Mon, Oct 7, 2024 at 10:31 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> * Jann Horn <jannh@google.com> [241007 15:06]:
> > On Fri, Aug 30, 2024 at 6:00 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > Instead of zeroing the vma tree and then overwriting the area, let the
> > > area be overwritten and then clean up the gathered vmas using
> > > vms_complete_munmap_vmas().
> > >
> > > To ensure locking is downgraded correctly, the mm is set regardless of
> > > MAP_FIXED or not (NULL vma).
> > >
> > > If a driver is mapping over an existing vma, then clear the ptes before
> > > the call_mmap() invocation.  This is done using the vms_clean_up_area()
> > > helper.  If there is a close vm_ops, that must also be called to ensure
> > > any cleanup is done before mapping over the area.  This also means that
> > > calling open has been added to the abort of an unmap operation, for now.
> >
> > As currently implemented, this is not a valid optimization because it
> > violates the (unwritten?) rule that you must not call free_pgd_range()
> > on a region in the page tables which can concurrently be walked. A
> > region in the page tables can be concurrently walked if it overlaps a
> > VMA which is linked into rmaps which are not write-locked.
>
> Just for clarity, this is the rmap write lock.

Ah, yes.

> > On Linux 6.12-rc2, when you mmap(MAP_FIXED) over an existing VMA, and
> > the new mapping is created by expanding an adjacent VMA, the following
> > race with an ftruncate() is possible (because page tables for the old
> > mapping are removed while the new VMA in the same location is already
> > fully set up and linked into the rmap):
> >
> >
> > task 1 (mmap, MAP_FIXED)     task 2 (ftruncate)
> > ========================     ==================
> > mmap_region
> >   vma_merge_new_range
> >     vma_expand
> >       commit_merge
> >         vma_prepare
> >           [take rmap locks]
> >         vma_set_range
> >           [expand adjacent mapping]
> >         vma_complete
> >           [drop rmap locks]
> >   vms_complete_munmap_vmas
> >     vms_clear_ptes
> >       unmap_vmas
> >         [removes ptes]
> >       free_pgtables
> >         [unlinks old vma from rmap]
> >                              unmap_mapping_range
> >                                unmap_mapping_pages
> >                                  i_mmap_lock_read
> >                                  unmap_mapping_range_tree
> >                                    [loop]
> >                                      unmap_mapping_range_vma
> >                                        zap_page_range_single
> >                                          unmap_single_vma
> >                                            unmap_page_range
> >                                              zap_p4d_range
> >                                                zap_pud_range
> >                                                  zap_pmd_range
> >                                                    [looks up pmd entry]
> >         free_pgd_range
> >           [frees pmd]
> >                                                    [UAF pmd entry access]
> >
> > To reproduce this, apply the attached mmap-vs-truncate-racewiden.diff
> > to widen the race windows, then build and run the attached reproducer
> > mmap-fixed-race.c.
> >
> > Under a kernel with KASAN, you should ideally get a KASAN splat like this:
>
> Thanks for all the work you did finding the root cause here, I
> appreciate it.

Ah, this is not a bug I ran into while testing, it's a bug I found
while reading the patch. It's much easier to explain the issue and
come up with a nice reproducer this way than when you start out from a
crash. :P

> I think the correct fix is to take the rmap lock on free_pgtables, when
> necessary.  There are a few code paths (error recovery) that are not
> regularly run that will also need to change.

Hmm, yes, I guess that might work. Though I think there might be more
races: One related aspect of this optimization that is unintuitive to
me is that, directly after vma_merge_new_range(), a concurrent rmap
walk could probably be walking the newly-extended VMA but still
observe PTEs belonging to the previous VMA. I don't know how robust
the various rmap walks are to things like encountering pfnmap PTEs in
non-pfnmap VMAs, or hugetlb PUD entries in non-hugetlb VMAs. For
example, page_vma_mapped_walk() looks like, if you called it on a page
table range with huge PUD entries, but with a VMA without VM_HUGETLB,
something might go wrong on the "pmd_offset(pud, pvmw->address)" call,
and a 1G hugepage might get misinterpreted as a page table? But I
haven't experimentally verified that.
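
For reference, the descent I mean has roughly this shape (simplified
sketch, not verbatim kernel code, error/retry handling omitted):

        pud = pud_offset(p4d, pvmw->address);
        if (!pud_present(*pud))
                goto not_mapped;        /* step past this range */
        /*
         * No pud_leaf()/pud_trans_huge() handling on this path: if *pud is
         * a 1G leaf entry, pmd_offset() below treats the huge page's pfn as
         * if it pointed to a pmd table, and the walk descends into the huge
         * page's contents.
         */
        pvmw->pmd = pmd_offset(pud, pvmw->address);
        pmde = pmdp_get_lockless(pvmw->pmd);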
Liam R. Howlett Oct. 8, 2024, 1:50 a.m. UTC | #3
* Jann Horn <jannh@google.com> [241007 17:31]:
> On Mon, Oct 7, 2024 at 10:31 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > * Jann Horn <jannh@google.com> [241007 15:06]:
> > > On Fri, Aug 30, 2024 at 6:00 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > > Instead of zeroing the vma tree and then overwriting the area, let the
> > > > area be overwritten and then clean up the gathered vmas using
> > > > vms_complete_munmap_vmas().
> > > >
> > > > To ensure locking is downgraded correctly, the mm is set regardless of
> > > > MAP_FIXED or not (NULL vma).
> > > >
> > > > If a driver is mapping over an existing vma, then clear the ptes before
> > > > the call_mmap() invocation.  This is done using the vms_clean_up_area()
> > > > helper.  If there is a close vm_ops, that must also be called to ensure
> > > > any cleanup is done before mapping over the area.  This also means that
> > > > calling open has been added to the abort of an unmap operation, for now.
> > >
> > > As currently implemented, this is not a valid optimization because it
> > > violates the (unwritten?) rule that you must not call free_pgd_range()
> > > on a region in the page tables which can concurrently be walked. A
> > > region in the page tables can be concurrently walked if it overlaps a
> > > VMA which is linked into rmaps which are not write-locked.
> >
> > Just for clarity, this is the rmap write lock.
> 
> Ah, yes.
> 
> > > On Linux 6.12-rc2, when you mmap(MAP_FIXED) over an existing VMA, and
> > > the new mapping is created by expanding an adjacent VMA, the following
> > > race with an ftruncate() is possible (because page tables for the old
> > > mapping are removed while the new VMA in the same location is already
> > > fully set up and linked into the rmap):
> > >
> > >
> > > task 1 (mmap, MAP_FIXED)     task 2 (ftruncate)
> > > ========================     ==================
> > > mmap_region
> > >   vma_merge_new_range
> > >     vma_expand
> > >       commit_merge
> > >         vma_prepare
> > >           [take rmap locks]
> > >         vma_set_range
> > >           [expand adjacent mapping]
> > >         vma_complete
> > >           [drop rmap locks]
> > >   vms_complete_munmap_vmas
> > >     vms_clear_ptes
> > >       unmap_vmas
> > >         [removes ptes]
> > >       free_pgtables
> > >         [unlinks old vma from rmap]
> > >                              unmap_mapping_range
> > >                                unmap_mapping_pages
> > >                                  i_mmap_lock_read
> > >                                  unmap_mapping_range_tree
> > >                                    [loop]
> > >                                      unmap_mapping_range_vma
> > >                                        zap_page_range_single
> > >                                          unmap_single_vma
> > >                                            unmap_page_range
> > >                                              zap_p4d_range
> > >                                                zap_pud_range
> > >                                                  zap_pmd_range
> > >                                                    [looks up pmd entry]
> > >         free_pgd_range
> > >           [frees pmd]
> > >                                                    [UAF pmd entry access]
> > >
> > > To reproduce this, apply the attached mmap-vs-truncate-racewiden.diff
> > > to widen the race windows, then build and run the attached reproducer
> > > mmap-fixed-race.c.
> > >
> > > Under a kernel with KASAN, you should ideally get a KASAN splat like this:
> >
> > Thanks for all the work you did finding the root cause here, I
> > appreciate it.
> 
> Ah, this is not a bug I ran into while testing, it's a bug I found
> while reading the patch. It's much easier to explain the issue and
> come up with a nice reproducer this way than when you start out from a
> crash. :P
> 
> > I think the correct fix is to take the rmap lock on free_pgtables, when
> > necessary.  There are a few code paths (error recovery) that are not
> > regularly run that will also need to change.
> 
> Hmm, yes, I guess that might work. Though I think there might be more
> races: One related aspect of this optimization that is unintuitive to
> me is that, directly after vma_merge_new_range(), a concurrent rmap
> walk could probably be walking the newly-extended VMA but still
> observe PTEs belonging to the previous VMA. I don't know how robust
> the various rmap walks are to things like encountering pfnmap PTEs in
> non-pfnmap VMAs, or hugetlb PUD entries in non-hugetlb VMAs. For
> example, page_vma_mapped_walk() looks like, if you called it on a page
> table range with huge PUD entries, but with a VMA without VM_HUGETLB,
> something might go wrong on the "pmd_offset(pud, pvmw->address)" call,
> and a 1G hugepage might get misinterpreted as a page table? But I
> haven't experimentally verified that.

Yes, I am also concerned that reacquiring the lock will result in
another race.  I also don't think holding the lock for longer is a good
idea as it will most likely cause a regression by extending the lock for
the duration of the mmap() setup.  Although, maybe it would be fine if
we only keep it held if we are going to be removing a vma in the
MAP_FIXED case.

Another idea would be to change the pte to know if a vma is being
modified using the per-vma locking, but I'm not sure what action to take
if we detect the vma is being modified to avoid the issue.  This would
also need to be done to all walkers (or low enough in the stack).

By the way, this isn't an optimisation; this is to fix RCU walkers of
the vma tree seeing a hole between the underlying implementation of the
MAP_FIXED operations of munmap() and mmap().  This is needed for things
like the /proc/{pid}/maps rcu walker.  The page tables currently fall
back to the old way of locking if a hole is seen (and sane applications
shouldn't really be page faulting something being removed anyways..)
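
Roughly, the kind of lockless reader this is about looks like the below;
mm_seq and the exact fallback are made-up names for illustration, not an
existing API:

        rcu_read_lock();
        seq = read_seqcount_begin(&mm->mm_seq);
        vma = mas_walk(&mas);           /* lockless maple tree lookup */
        /* ... copy out what we need from vma ... */
        if (!vma || read_seqcount_retry(&mm->mm_seq, seq)) {
                /* hit a hole or raced with a writer: fall back to
                 * mmap_read_lock() and look the VMA up again */
        }
        rcu_read_unlock();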

Thanks,
Liam
Jann Horn Oct. 8, 2024, 5:15 p.m. UTC | #4
On Tue, Oct 8, 2024 at 3:51 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> * Jann Horn <jannh@google.com> [241007 17:31]:
> > On Mon, Oct 7, 2024 at 10:31 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > * Jann Horn <jannh@google.com> [241007 15:06]:
> > > > On Fri, Aug 30, 2024 at 6:00 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > > > Instead of zeroing the vma tree and then overwriting the area, let the
> > > > > area be overwritten and then clean up the gathered vmas using
> > > > > vms_complete_munmap_vmas().
> > > > >
> > > > > To ensure locking is downgraded correctly, the mm is set regardless of
> > > > > MAP_FIXED or not (NULL vma).
> > > > >
> > > > > If a driver is mapping over an existing vma, then clear the ptes before
> > > > > the call_mmap() invocation.  This is done using the vms_clean_up_area()
> > > > > helper.  If there is a close vm_ops, that must also be called to ensure
> > > > > any cleanup is done before mapping over the area.  This also means that
> > > > > calling open has been added to the abort of an unmap operation, for now.
> > > >
> > > > As currently implemented, this is not a valid optimization because it
> > > > violates the (unwritten?) rule that you must not call free_pgd_range()
> > > > on a region in the page tables which can concurrently be walked. A
> > > > region in the page tables can be concurrently walked if it overlaps a
> > > > VMA which is linked into rmaps which are not write-locked.
> > >
> > > Just for clarity, this is the rmap write lock.
> >
> > Ah, yes.
> >
> > > > On Linux 6.12-rc2, when you mmap(MAP_FIXED) over an existing VMA, and
> > > > the new mapping is created by expanding an adjacent VMA, the following
> > > > race with an ftruncate() is possible (because page tables for the old
> > > > mapping are removed while the new VMA in the same location is already
> > > > fully set up and linked into the rmap):
> > > >
> > > >
> > > > task 1 (mmap, MAP_FIXED)     task 2 (ftruncate)
> > > > ========================     ==================
> > > > mmap_region
> > > >   vma_merge_new_range
> > > >     vma_expand
> > > >       commit_merge
> > > >         vma_prepare
> > > >           [take rmap locks]
> > > >         vma_set_range
> > > >           [expand adjacent mapping]
> > > >         vma_complete
> > > >           [drop rmap locks]
> > > >   vms_complete_munmap_vmas
> > > >     vms_clear_ptes
> > > >       unmap_vmas
> > > >         [removes ptes]
> > > >       free_pgtables
> > > >         [unlinks old vma from rmap]
> > > >                              unmap_mapping_range
> > > >                                unmap_mapping_pages
> > > >                                  i_mmap_lock_read
> > > >                                  unmap_mapping_range_tree
> > > >                                    [loop]
> > > >                                      unmap_mapping_range_vma
> > > >                                        zap_page_range_single
> > > >                                          unmap_single_vma
> > > >                                            unmap_page_range
> > > >                                              zap_p4d_range
> > > >                                                zap_pud_range
> > > >                                                  zap_pmd_range
> > > >                                                    [looks up pmd entry]
> > > >         free_pgd_range
> > > >           [frees pmd]
> > > >                                                    [UAF pmd entry access]
> > > >
> > > > To reproduce this, apply the attached mmap-vs-truncate-racewiden.diff
> > > > to widen the race windows, then build and run the attached reproducer
> > > > mmap-fixed-race.c.
> > > >
> > > > Under a kernel with KASAN, you should ideally get a KASAN splat like this:
> > >
> > > Thanks for all the work you did finding the root cause here, I
> > > appreciate it.
> >
> > Ah, this is not a bug I ran into while testing, it's a bug I found
> > while reading the patch. It's much easier to explain the issue and
> > come up with a nice reproducer this way than when you start out from a
> > crash. :P
> >
> > > I think the correct fix is to take the rmap lock on free_pgtables, when
> > > necessary.  There are a few code paths (error recovery) that are not
> > > regularly run that will also need to change.
> >
> > Hmm, yes, I guess that might work. Though I think there might be more
> > races: One related aspect of this optimization that is unintuitive to
> > me is that, directly after vma_merge_new_range(), a concurrent rmap
> > walk could probably be walking the newly-extended VMA but still
> > observe PTEs belonging to the previous VMA. I don't know how robust
> > the various rmap walks are to things like encountering pfnmap PTEs in
> > non-pfnmap VMAs, or hugetlb PUD entries in non-hugetlb VMAs. For
> > example, page_vma_mapped_walk() looks like, if you called it on a page
> > table range with huge PUD entries, but with a VMA without VM_HUGETLB,
> > something might go wrong on the "pmd_offset(pud, pvmw->address)" call,
> > and a 1G hugepage might get misinterpreted as a page table? But I
> > haven't experimentally verified that.
>
> Yes, I am also concerned that reacquiring the lock will result in
> another race.  I also don't think holding the lock for longer is a good
> idea as it will most likely cause a regression by extending the lock for
> the duration of the mmap() setup.  Although, maybe it would be fine if
> we only keep it held if we are going to be removing a vma in the
> MAP_FIXED case.

I guess you could have a separate seqcount on the mm for "there might
be temporary holes in the tree", or temporarily store some markers in
the maple tree that say "there is nothing here right now, but if you
want to see a full picture while iterating, you have to try again
later"?

Or you could basically unmap the VMA while it is still in the VMA tree
but is already locked and marked as detached? So first you do
unmap_vmas() and free_pgtables() (which clears the PTEs, removes the
rmap links, and deletes page tables), then prepare the new VMAs, and
then replace the old VMA's entries in the VMA tree with the new
entries? I guess in the end the result would semantically be pretty
similar to having markers in the maple tree.
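
As a rough ordering sketch (pseudocode, arguments elided, just to show
the order I mean, not a worked-out patch):

        vma_start_write(old_vma);       /* per-VMA-lock readers back off */
        vma_mark_detached(old_vma, ...);
        unmap_vmas(...);                /* clear the PTEs */
        free_pgtables(...);             /* unlink from rmap, free page tables */
        /* ... set up the replacement VMA(s) ... */
        vma_iter_store(&vmi, new_vma);  /* only now swap the maple tree entries */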

> Another idea would be to change the pte to know if a vma is being
> modified using the per-vma locking, but I'm not sure what action to take
> if we detect the vma is being modified to avoid the issue.  This would
> also need to be done to all walkers (or low enough in the stack).

Sorry, can you clarify what pte the "change the pte" is referring to?

> By the way, this isn't an optimisation; this is to fix RCU walkers of
> the vma tree seeing a hole between the underlying implementation of the
> MAP_FIXED operations of munmap() and mmap().  This is needed for things
> like the /proc/{pid}/maps rcu walker.  The page tables currently fall
> back to the old way of locking if a hole is seen (and sane applications
> shouldn't really be page faulting something being removed anyways..)

Is that code in a tree somewhere?

What locking will those RCU walkers use when accessing VMAs? I guess
they probably anyway have to take the VMA locks to ensure they see
consistent state, though I guess with enough effort you could avoid it
(seqlock-style) on some fastpaths when the vma is not concurrently
modified and the fastpath doesn't need access to the VMA's file?
Suren Baghdasaryan Oct. 8, 2024, 5:51 p.m. UTC | #5
On Tue, Oct 8, 2024 at 10:16 AM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Oct 8, 2024 at 3:51 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > * Jann Horn <jannh@google.com> [241007 17:31]:
> > > On Mon, Oct 7, 2024 at 10:31 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > > * Jann Horn <jannh@google.com> [241007 15:06]:
> > > > > On Fri, Aug 30, 2024 at 6:00 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > > > > Instead of zeroing the vma tree and then overwriting the area, let the
> > > > > > area be overwritten and then clean up the gathered vmas using
> > > > > > vms_complete_munmap_vmas().
> > > > > >
> > > > > > To ensure locking is downgraded correctly, the mm is set regardless of
> > > > > > MAP_FIXED or not (NULL vma).
> > > > > >
> > > > > > If a driver is mapping over an existing vma, then clear the ptes before
> > > > > > the call_mmap() invocation.  This is done using the vms_clean_up_area()
> > > > > > helper.  If there is a close vm_ops, that must also be called to ensure
> > > > > > any cleanup is done before mapping over the area.  This also means that
> > > > > > calling open has been added to the abort of an unmap operation, for now.
> > > > >
> > > > > As currently implemented, this is not a valid optimization because it
> > > > > violates the (unwritten?) rule that you must not call free_pgd_range()
> > > > > on a region in the page tables which can concurrently be walked. A
> > > > > region in the page tables can be concurrently walked if it overlaps a
> > > > > VMA which is linked into rmaps which are not write-locked.
> > > >
> > > > Just for clarity, this is the rmap write lock.
> > >
> > > Ah, yes.
> > >
> > > > > On Linux 6.12-rc2, when you mmap(MAP_FIXED) over an existing VMA, and
> > > > > the new mapping is created by expanding an adjacent VMA, the following
> > > > > race with an ftruncate() is possible (because page tables for the old
> > > > > mapping are removed while the new VMA in the same location is already
> > > > > fully set up and linked into the rmap):
> > > > >
> > > > >
> > > > > task 1 (mmap, MAP_FIXED)     task 2 (ftruncate)
> > > > > ========================     ==================
> > > > > mmap_region
> > > > >   vma_merge_new_range
> > > > >     vma_expand
> > > > >       commit_merge
> > > > >         vma_prepare
> > > > >           [take rmap locks]
> > > > >         vma_set_range
> > > > >           [expand adjacent mapping]
> > > > >         vma_complete
> > > > >           [drop rmap locks]
> > > > >   vms_complete_munmap_vmas
> > > > >     vms_clear_ptes
> > > > >       unmap_vmas
> > > > >         [removes ptes]
> > > > >       free_pgtables
> > > > >         [unlinks old vma from rmap]
> > > > >                              unmap_mapping_range
> > > > >                                unmap_mapping_pages
> > > > >                                  i_mmap_lock_read
> > > > >                                  unmap_mapping_range_tree
> > > > >                                    [loop]
> > > > >                                      unmap_mapping_range_vma
> > > > >                                        zap_page_range_single
> > > > >                                          unmap_single_vma
> > > > >                                            unmap_page_range
> > > > >                                              zap_p4d_range
> > > > >                                                zap_pud_range
> > > > >                                                  zap_pmd_range
> > > > >                                                    [looks up pmd entry]
> > > > >         free_pgd_range
> > > > >           [frees pmd]
> > > > >                                                    [UAF pmd entry access]
> > > > >
> > > > > To reproduce this, apply the attached mmap-vs-truncate-racewiden.diff
> > > > > to widen the race windows, then build and run the attached reproducer
> > > > > mmap-fixed-race.c.
> > > > >
> > > > > Under a kernel with KASAN, you should ideally get a KASAN splat like this:
> > > >
> > > > Thanks for all the work you did finding the root cause here, I
> > > > appreciate it.
> > >
> > > Ah, this is not a bug I ran into while testing, it's a bug I found
> > > while reading the patch. It's much easier to explain the issue and
> > > come up with a nice reproducer this way than when you start out from a
> > > crash. :P
> > >
> > > > I think the correct fix is to take the rmap lock on free_pgtables, when
> > > > necessary.  There are a few code paths (error recovery) that are not
> > > > regularly run that will also need to change.
> > >
> > > Hmm, yes, I guess that might work. Though I think there might be more
> > > races: One related aspect of this optimization that is unintuitive to
> > > me is that, directly after vma_merge_new_range(), a concurrent rmap
> > > walk could probably be walking the newly-extended VMA but still
> > > observe PTEs belonging to the previous VMA. I don't know how robust
> > > the various rmap walks are to things like encountering pfnmap PTEs in
> > > non-pfnmap VMAs, or hugetlb PUD entries in non-hugetlb VMAs. For
> > > example, page_vma_mapped_walk() looks like, if you called it on a page
> > > table range with huge PUD entries, but with a VMA without VM_HUGETLB,
> > > something might go wrong on the "pmd_offset(pud, pvmw->address)" call,
> > > and a 1G hugepage might get misinterpreted as a page table? But I
> > > haven't experimentally verified that.
> >
> > Yes, I am also concerned that reacquiring the lock will result in
> > another race.  I also don't think holding the lock for longer is a good
> > idea as it will most likely cause a regression by extending the lock for
> > the duration of the mmap() setup.  Although, maybe it would be fine if
> > we only keep it held if we are going to be removing a vma in the
> > MAP_FIXED case.
>
> I guess you could have a separate seqcount on the mm for "there might
> be temporary holes in the tree", or temporarily store some markers in
> the maple tree that say "there is nothing here right now, but if you
> want to see a full picture while iterating, you have to try again
> later"?
>
> Or you could basically unmap the VMA while it is still in the VMA tree
> but is already locked and marked as detached? So first you do
> unmap_vmas() and free_pgtables() (which clears the PTEs, removes the
> rmap links, and deletes page tables), then prepare the new VMAs, and
> then replace the old VMA's entries in the VMA tree with the new
> entries? I guess in the end the result would semantically be pretty
> similar to having markers in the maple tree.
>
> > Another idea would be to change the pte to know if a vma is being
> > modified using the per-vma locking, but I'm not sure what action to take
> > if we detect the vma is being modified to avoid the issue.  This would
> > also need to be done to all walkers (or low enough in the stack).
>
> Sorry, can you clarify what pte the "change the pte" is referring to?
>
> > By the way, this isn't an optimisation; this is to fix RCU walkers of
> > the vma tree seeing a hole between the underlying implementation of the
> > MAP_FIXED operations of munmap() and mmap().  This is needed for things
> > like the /proc/{pid}/maps rcu walker.  The page tables currently fall
> > back to the old way of locking if a hole is seen (and sane applications
> > shouldn't really be page faulting something being removed anyways..)
>
> Is that code in a tree somewhere?
>
> What locking will those RCU walkers use when accessing VMAs? I guess
> they probably anyway have to take the VMA locks to ensure they see
> consistent state, though I guess with enough effort you could avoid it
> (seqlock-style) on some fastpaths when the vma is not concurrently
> modified and the fastpath doesn't need access to the VMA's file?

Sorry, it's not posted upstream yet but yes, the idea is to walk the
tree under RCU and detect concurrent changes using seq-counters. A
prototype was posted here:
https://lore.kernel.org/all/20240123231014.3801041-3-surenb@google.com/
but it had some issues I'm yet to resolve.
Jann Horn Oct. 8, 2024, 6:06 p.m. UTC | #6
On Tue, Oct 8, 2024 at 7:52 PM Suren Baghdasaryan <surenb@google.com> wrote:
> On Tue, Oct 8, 2024 at 10:16 AM Jann Horn <jannh@google.com> wrote:
> > Is that code in a tree somewhere?
> >
> > What locking will those RCU walkers use when accessing VMAs? I guess
> > they probably anyway have to take the VMA locks to ensure they see
> > consistent state, though I guess with enough effort you could avoid it
> > (seqlock-style) on some fastpaths when the vma is not concurrently
> > modified and the fastpath doesn't need access to the VMA's file?
>
> Sorry, it's not posted upstream yet but yes, the idea is to walk the
> tree under RCU and detect concurrent changes using seq-counters. A
> prototype was posted here:
> https://lore.kernel.org/all/20240123231014.3801041-3-surenb@google.com/
> but it had some issues I'm yet to resolve.

Ah, thanks for the pointer.
Liam R. Howlett Oct. 11, 2024, 2:26 p.m. UTC | #7
* Jann Horn <jannh@google.com> [241008 13:16]:
...
> > > > >
> > > > > task 1 (mmap, MAP_FIXED)     task 2 (ftruncate)
> > > > > ========================     ==================
> > > > > mmap_region
> > > > >   vma_merge_new_range
> > > > >     vma_expand
> > > > >       commit_merge
> > > > >         vma_prepare
> > > > >           [take rmap locks]
> > > > >         vma_set_range
> > > > >           [expand adjacent mapping]
> > > > >         vma_complete
> > > > >           [drop rmap locks]
> > > > >   vms_complete_munmap_vmas
> > > > >     vms_clear_ptes
> > > > >       unmap_vmas
> > > > >         [removes ptes]
> > > > >       free_pgtables
> > > > >         [unlinks old vma from rmap]
> > > > >                              unmap_mapping_range
> > > > >                                unmap_mapping_pages
> > > > >                                  i_mmap_lock_read
> > > > >                                  unmap_mapping_range_tree
> > > > >                                    [loop]
> > > > >                                      unmap_mapping_range_vma
> > > > >                                        zap_page_range_single
> > > > >                                          unmap_single_vma
> > > > >                                            unmap_page_range
> > > > >                                              zap_p4d_range
> > > > >                                                zap_pud_range
> > > > >                                                  zap_pmd_range
> > > > >                                                    [looks up pmd entry]
> > > > >         free_pgd_range
> > > > >           [frees pmd]
> > > > >                                                    [UAF pmd entry access]
> > > > >
> > > > > To reproduce this, apply the attached mmap-vs-truncate-racewiden.diff
> > > > > to widen the race windows, then build and run the attached reproducer
> > > > > mmap-fixed-race.c.
> > > > >
> > > > > Under a kernel with KASAN, you should ideally get a KASAN splat like this:
> > > >
...

> 
> Or you could basically unmap the VMA while it is still in the VMA tree
> but is already locked and marked as detached? So first you do
> unmap_vmas() and free_pgtables() (which clears the PTEs, removes the
> rmap links, and deletes page tables), then prepare the new VMAs, and
> then replace the old VMA's entries in the VMA tree with the new
> entries? I guess in the end the result would semantically be pretty
> similar to having markers in the maple tree.
> 

After trying a few other methods, I ended up doing something like you
said above.  I already had to do this if call_mmap() was to be used, so
the code change isn't that large.  Doing it unconditionally on MAP_FIXED
seems like the safest plan.

The other methods were unsuccessful due to the locking order that exists
in fsreclaim and other areas.

Basically, the vma tree will not see a gap, but the rmap will see a gap.

Unfortunately this expands the number of failure paths that cannot be
undone with my design, but there are still fewer of them than existed
before.  Most errors will produce the historic vma gap, sadly.

Thanks,
Liam

Patch (the attached race-widening diff, mmap-vs-truncate-racewiden.diff)

diff --git a/mm/memory.c b/mm/memory.c
index 2366578015ad..b95a43221058 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -78,6 +78,7 @@ 
 #include <linux/ptrace.h>
 #include <linux/vmalloc.h>
 #include <linux/sched/sysctl.h>
+#include <linux/delay.h>
 
 #include <trace/events/kmem.h>
 
@@ -410,6 +411,13 @@  void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 				unlink_file_vma_batch_add(&vb, vma);
 			}
 			unlink_file_vma_batch_final(&vb);
+
+			if (strcmp(current->comm, "SLOWME1") == 0) {
+				pr_warn("%s: starting delay\n", __func__);
+				mdelay(2000);
+				pr_warn("%s: ending delay\n", __func__);
+			}
+
 			free_pgd_range(tlb, addr, vma->vm_end,
 				floor, next ? next->vm_start : ceiling);
 		}
@@ -1711,6 +1719,13 @@  static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 	unsigned long next;
 
 	pmd = pmd_offset(pud, addr);
+
+	if (strcmp(current->comm, "SLOWME2") == 0) {
+		pr_warn("%s: starting delay\n", __func__);
+		mdelay(2000);
+		pr_warn("%s: ending delay\n", __func__);
+	}
+
 	do {
 		next = pmd_addr_end(addr, end);
 		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {