
mm: Lock VMA in dup_anon_vma() before setting ->anon_vma

Message ID 20230721034643.616851-1-jannh@google.com (mailing list archive)
State New
Series mm: Lock VMA in dup_anon_vma() before setting ->anon_vma

Commit Message

Jann Horn July 21, 2023, 3:46 a.m. UTC
When VMAs are merged, dup_anon_vma() is called with `dst` pointing to the
VMA that is being expanded to cover the area previously occupied by another
VMA. This currently happens while `dst` is not write-locked.

This means that, in the `src->anon_vma && !dst->anon_vma` case, as soon as
the assignment `dst->anon_vma = src->anon_vma` has happened, concurrent
page faults can happen on `dst` under the per-VMA lock.
This is already icky in itself, since such page faults can now install
pages into `dst` that are attached to an `anon_vma` that is not yet tied
back to the `anon_vma` with an `anon_vma_chain`.
But if `anon_vma_clone()` fails due to an out-of-memory error, things get
much worse: `anon_vma_clone()` then reverts `dst->anon_vma` back to NULL,
and `dst` remains completely unconnected to the `anon_vma`, even though we
can have pages in the area covered by `dst` that point to the `anon_vma`.
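
The race described above can be sketched as the following interleaving (an illustrative timeline, not literal kernel code; the fault side is simplified):

```c
/*
 * Thread A: vma_merge() path             Thread B: page fault on `dst`
 * ------------------------------------   ------------------------------------
 * dup_anon_vma(dst, src)
 *   dst->anon_vma = src->anon_vma;
 *                                        per-VMA read lock on `dst` succeeds
 *                                        (dst was never write-locked here);
 *                                        fault installs an anon page whose
 *                                        rmap points at src's anon_vma
 *   anon_vma_clone(dst, src) fails (-ENOMEM)
 *     dst->anon_vma = NULL;              // `dst` no longer references the
 *                                        // anon_vma, but the installed
 *                                        // page still does
 *
 * Later the anon_vma is freed while the page is still mapped, and a
 * subsequent rmap walk (e.g. folio_lock_anon_vma_read()) hits the UAF.
 */
```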

This means the `anon_vma` of such pages can be freed while the pages are
still mapped into userspace, which leads to UAF when a helper like
folio_lock_anon_vma_read() tries to look up the anon_vma of such a page.

This theoretically is a security bug, but I believe it is really hard to
actually trigger as an unprivileged user because it requires that you can
make an order-0 GFP_KERNEL allocation fail, and the page allocator tries
pretty hard to prevent that.

I think doing the vma_start_write() call inside dup_anon_vma() is the most
straightforward fix for now.

For a kernel-assisted reproducer, see the notes section of the patch mail.

Fixes: 5e31275cc997 ("mm: add per-VMA lock and helper functions to control it")
Cc: stable@vger.kernel.org
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Jann Horn <jannh@google.com>
---
To reproduce, patch mm/rmap.c by adding "#include <linux/delay.h>" and
changing anon_vma_chain_alloc() like this:
 
 static inline struct anon_vma_chain *anon_vma_chain_alloc(gfp_t gfp)
 {
+       if (strcmp(current->comm, "FAILME") == 0) {
+               // inject delay and error
+               mdelay(2000);
+               return NULL;
+       }
        return kmem_cache_alloc(anon_vma_chain_cachep, gfp);
 }

Then build with KASAN and run this reproducer:


#define _GNU_SOURCE
#include <pthread.h>
#include <err.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/prctl.h>

#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1L) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

static char *area;
static volatile int fault_thread_done;
static volatile int spin_launch;

static void *fault_thread(void *dummy) {
  while (!spin_launch) /*spin*/;
  sleep(1);
  area[0] = 1;
  fault_thread_done = 1;
  return NULL;
}

int main(void) {
  fault_thread_done = 0;
  pthread_t thread;
  if (pthread_create(&thread, NULL, fault_thread, NULL))
    errx(1, "pthread_create");

  // allocator spam
  int fd = SYSCHK(open("/etc/hostname", O_RDONLY));
  char *vmas[10000];
  for (int i=0; i<5000; i++) {
    vmas[i] = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0));
    *vmas[i] = 1;
  }

  // create a 3-page area, no anon_vma at this point, with guard vma behind it to prevent merging with neighboring anon_vmas
  area = SYSCHK(mmap((void*)0x10000, 0x4000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0));
  SYSCHK(mmap(area+0x3000, 0x1000, PROT_READ, MAP_SHARED|MAP_FIXED, fd, 0));
  // turn it into 3 VMAs
  SYSCHK(mprotect(area+0x1000, 0x1000, PROT_READ|PROT_WRITE|PROT_EXEC));

  // create an anon_vma for the tail VMA
  area[0x2000] = 1;

  // more allocator spam
  for (int i=5000; i<10000; i++) {
    vmas[i] = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0));
    *vmas[i] = 1;
  }

  printf("with anon_vma on tail VMA:\n\n");
  system("cat /proc/$PPID/smaps | head -n55");
  printf("\n\n");

  spin_launch=1;
  // mprotect() will try to merge the VMAs but bail out due to the injected
  // allocator failure
  SYSCHK(prctl(PR_SET_NAME, "FAILME"));
  SYSCHK(mprotect(area+0x1000, 0x1000, PROT_READ|PROT_WRITE));
  SYSCHK(prctl(PR_SET_NAME, "normal"));

  printf("after merge from mprotect:\n\n");
  if (!fault_thread_done)
    errx(1, "fault thread not done yet???");
  system("cat /proc/$PPID/smaps | head -n55");
  printf("\n\n");

  // release the anon_vma
  SYSCHK(munmap(area+0x1000, 0x2000));

  // release spam
  for (int i=0; i<10000; i++)
    SYSCHK(munmap(vmas[i], 0x1000));

  // wait for RCU
  sleep(2);

  // trigger UAF?
  printf("trying to trigger uaf...\n");
  SYSCHK(madvise(area, 0x1000, 21/*MADV_PAGEOUT*/));
}


You should get a KASAN splat like:

BUG: KASAN: use-after-free in folio_lock_anon_vma_read+0x9d/0x2f0
Read of size 8 at addr ffff8880053a2660 by task normal/549

CPU: 1 PID: 549 Comm: normal Not tainted 6.5.0-rc2-00073-ge599e16c16a1-dirty #292
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x36/0x50
 print_report+0xcf/0x660
[...]
 kasan_report+0xc7/0x100
[...]
 folio_lock_anon_vma_read+0x9d/0x2f0
 rmap_walk_anon+0x282/0x350
[...]
 folio_referenced+0x277/0x2a0
[...]
 shrink_folio_list+0xc9f/0x15c0
[...]
 reclaim_folio_list+0xdc/0x1f0
[...]
 reclaim_pages+0x211/0x280
[...]
 madvise_cold_or_pageout_pte_range+0x2ea/0x6a0
[...]
 walk_pgd_range+0x6c5/0xb90
[...]
 __walk_page_range+0x27f/0x290
[...]
 walk_page_range+0x1fd/0x230
[...]
 madvise_pageout+0x1cd/0x2d0
[...]
 do_madvise+0xb58/0x1280
[...]
 __x64_sys_madvise+0x62/0x70
 do_syscall_64+0x3b/0x90
[...]


 mm/mmap.c | 1 +
 1 file changed, 1 insertion(+)


base-commit: e599e16c16a16be9907fb00608212df56d08d57b

Comments

Suren Baghdasaryan July 22, 2023, 12:05 a.m. UTC | #1
On Thu, Jul 20, 2023 at 8:46 PM Jann Horn <jannh@google.com> wrote:
>
> When VMAs are merged, dup_anon_vma() is called with `dst` pointing to the
> VMA that is being expanded to cover the area previously occupied by another
> VMA. This currently happens while `dst` is not write-locked.
>
> This means that, in the `src->anon_vma && !dst->anon_vma` case, as soon as
> the assignment `dst->anon_vma = src->anon_vma` has happened, concurrent
> page faults can happen on `dst` under the per-VMA lock.
> This is already icky in itself, since such page faults can now install
> pages into `dst` that are attached to an `anon_vma` that is not yet tied
> back to the `anon_vma` with an `anon_vma_chain`.
> But if `anon_vma_clone()` fails due to an out-of-memory error, things get
> much worse: `anon_vma_clone()` then reverts `dst->anon_vma` back to NULL,
> and `dst` remains completely unconnected to the `anon_vma`, even though we
> can have pages in the area covered by `dst` that point to the `anon_vma`.
>
> This means the `anon_vma` of such pages can be freed while the pages are
> still mapped into userspace, which leads to UAF when a helper like
> folio_lock_anon_vma_read() tries to look up the anon_vma of such a page.
>
> This theoretically is a security bug, but I believe it is really hard to
> actually trigger as an unprivileged user because it requires that you can
> make an order-0 GFP_KERNEL allocation fail, and the page allocator tries
> pretty hard to prevent that.
>
> I think doing the vma_start_write() call inside dup_anon_vma() is the most
> straightforward fix for now.

Indeed, this is a valid fix, because we end up modifying 'dst' without
locking it. The locking in vma_merge()/vma_expand() happens inside
vma_prepare(), but that's too late because dup_anon_vma() has already
run by then.

>
> For a kernel-assisted reproducer, see the notes section of the patch mail.
>
> Fixes: 5e31275cc997 ("mm: add per-VMA lock and helper functions to control it")
> Cc: stable@vger.kernel.org
> Cc: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Jann Horn <jannh@google.com>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
> To reproduce, patch mm/rmap.c by adding "#include <linux/delay.h>" and
> changing anon_vma_chain_alloc() like this:
>
>  static inline struct anon_vma_chain *anon_vma_chain_alloc(gfp_t gfp)
>  {
> +       if (strcmp(current->comm, "FAILME") == 0) {
> +               // inject delay and error
> +               mdelay(2000);
> +               return NULL;
> +       }
>         return kmem_cache_alloc(anon_vma_chain_cachep, gfp);
>  }
>
> Then build with KASAN and run this reproducer:
>
>
> #define _GNU_SOURCE
> #include <pthread.h>
> #include <err.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <sys/mman.h>
> #include <sys/prctl.h>
>
> #define SYSCHK(x) ({          \
>   typeof(x) __res = (x);      \
>   if (__res == (typeof(x))-1L) \
>     err(1, "SYSCHK(" #x ")"); \
>   __res;                      \
> })
>
> static char *area;
> static volatile int fault_thread_done;
> static volatile int spin_launch;
>
> static void *fault_thread(void *dummy) {
>   while (!spin_launch) /*spin*/;
>   sleep(1);
>   area[0] = 1;
>   fault_thread_done = 1;
>   return NULL;
> }
>
> int main(void) {
>   fault_thread_done = 0;
>   pthread_t thread;
>   if (pthread_create(&thread, NULL, fault_thread, NULL))
>     errx(1, "pthread_create");
>
>   // allocator spam
>   int fd = SYSCHK(open("/etc/hostname", O_RDONLY));
>   char *vmas[10000];
>   for (int i=0; i<5000; i++) {
>     vmas[i] = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0));
>     *vmas[i] = 1;
>   }
>
>   // create a 3-page area, no anon_vma at this point, with guard vma behind it to prevent merging with neighboring anon_vmas
>   area = SYSCHK(mmap((void*)0x10000, 0x4000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0));
>   SYSCHK(mmap(area+0x3000, 0x1000, PROT_READ, MAP_SHARED|MAP_FIXED, fd, 0));
>   // turn it into 3 VMAs
>   SYSCHK(mprotect(area+0x1000, 0x1000, PROT_READ|PROT_WRITE|PROT_EXEC));
>
>   // create an anon_vma for the tail VMA
>   area[0x2000] = 1;
>
>   // more allocator spam
>   for (int i=5000; i<10000; i++) {
>     vmas[i] = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0));
>     *vmas[i] = 1;
>   }
>
>   printf("with anon_vma on tail VMA:\n\n");
>   system("cat /proc/$PPID/smaps | head -n55");
>   printf("\n\n");
>
>   spin_launch=1;
>   // mprotect() will try to merge the VMAs but bail out due to the injected
>   // allocator failure
>   SYSCHK(prctl(PR_SET_NAME, "FAILME"));
>   SYSCHK(mprotect(area+0x1000, 0x1000, PROT_READ|PROT_WRITE));
>   SYSCHK(prctl(PR_SET_NAME, "normal"));
>
>   printf("after merge from mprotect:\n\n");
>   if (!fault_thread_done)
>     errx(1, "fault thread not done yet???");
>   system("cat /proc/$PPID/smaps | head -n55");
>   printf("\n\n");
>
>   // release the anon_vma
>   SYSCHK(munmap(area+0x1000, 0x2000));
>
>   // release spam
>   for (int i=0; i<10000; i++)
>     SYSCHK(munmap(vmas[i], 0x1000));
>
>   // wait for RCU
>   sleep(2);
>
>   // trigger UAF?
>   printf("trying to trigger uaf...\n");
>   SYSCHK(madvise(area, 0x1000, 21/*MADV_PAGEOUT*/));
> }
>
>
> You should get a KASAN splat like:
>
> BUG: KASAN: use-after-free in folio_lock_anon_vma_read+0x9d/0x2f0
> Read of size 8 at addr ffff8880053a2660 by task normal/549
>
> CPU: 1 PID: 549 Comm: normal Not tainted 6.5.0-rc2-00073-ge599e16c16a1-dirty #292
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> Call Trace:
>  <TASK>
>  dump_stack_lvl+0x36/0x50
>  print_report+0xcf/0x660
> [...]
>  kasan_report+0xc7/0x100
> [...]
>  folio_lock_anon_vma_read+0x9d/0x2f0
>  rmap_walk_anon+0x282/0x350
> [...]
>  folio_referenced+0x277/0x2a0
> [...]
>  shrink_folio_list+0xc9f/0x15c0
> [...]
>  reclaim_folio_list+0xdc/0x1f0
> [...]
>  reclaim_pages+0x211/0x280
> [...]
>  madvise_cold_or_pageout_pte_range+0x2ea/0x6a0
> [...]
>  walk_pgd_range+0x6c5/0xb90
> [...]
>  __walk_page_range+0x27f/0x290
> [...]
>  walk_page_range+0x1fd/0x230
> [...]
>  madvise_pageout+0x1cd/0x2d0
> [...]
>  do_madvise+0xb58/0x1280
> [...]
>  __x64_sys_madvise+0x62/0x70
>  do_syscall_64+0x3b/0x90
> [...]
>
>
>  mm/mmap.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 3eda23c9ebe7..3937479d0e07 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -615,6 +615,7 @@ static inline int dup_anon_vma(struct vm_area_struct *dst,
>          * anon pages imported.
>          */
>         if (src->anon_vma && !dst->anon_vma) {
> +               vma_start_write(dst);
>                 dst->anon_vma = src->anon_vma;
>                 return anon_vma_clone(dst, src);
>         }
>
> base-commit: e599e16c16a16be9907fb00608212df56d08d57b
> --
> 2.41.0.487.g6d72f3e995-goog
>

Patch

diff --git a/mm/mmap.c b/mm/mmap.c
index 3eda23c9ebe7..3937479d0e07 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -615,6 +615,7 @@  static inline int dup_anon_vma(struct vm_area_struct *dst,
 	 * anon pages imported.
 	 */
 	if (src->anon_vma && !dst->anon_vma) {
+		vma_start_write(dst);
 		dst->anon_vma = src->anon_vma;
 		return anon_vma_clone(dst, src);
 	}