[-next] fork: annotate a data race in vm_area_dup()

Message ID 1581712403-27243-1-git-send-email-cai@lca.pw
State New

Commit Message

Qian Cai Feb. 14, 2020, 8:33 p.m. UTC
struct vm_area_struct could be accessed concurrently as noticed by
KCSAN,

 write to 0xffff9cf8bba08ad8 of 8 bytes by task 14263 on cpu 35:
  vma_interval_tree_insert+0x101/0x150:
  rb_insert_augmented_cached at include/linux/rbtree_augmented.h:58
  (inlined by) vma_interval_tree_insert at mm/interval_tree.c:23
  __vma_link_file+0x6e/0xe0
  __vma_link_file at mm/mmap.c:629
  vma_link+0xa2/0x120
  mmap_region+0x753/0xb90
  do_mmap+0x45c/0x710
  vm_mmap_pgoff+0xc0/0x130
  ksys_mmap_pgoff+0x1d1/0x300
  __x64_sys_mmap+0x33/0x40
  do_syscall_64+0x91/0xc44
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 read to 0xffff9cf8bba08a80 of 200 bytes by task 14262 on cpu 122:
  vm_area_dup+0x6a/0xe0
  vm_area_dup at kernel/fork.c:362
  __split_vma+0x72/0x2a0
  __split_vma at mm/mmap.c:2661
  split_vma+0x5a/0x80
  mprotect_fixup+0x368/0x3f0
  do_mprotect_pkey+0x263/0x420
  __x64_sys_mprotect+0x51/0x70
  do_syscall_64+0x91/0xc44
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The write is holding mmap_sem while changing vm_area_struct.shared.rb.
Even though the read is lockless while making a copy, the clone will
have its own shared.rb reinitialized. Thus, mark it as an intentional
data race using the data_race() macro.

Signed-off-by: Qian Cai <cai@lca.pw>
---
 kernel/fork.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Kirill A. Shutemov Feb. 17, 2020, 10:31 p.m. UTC | #1
On Fri, Feb 14, 2020 at 03:33:23PM -0500, Qian Cai wrote:
> struct vm_area_struct could be accessed concurrently as noticed by
> KCSAN,
> 
>  write to 0xffff9cf8bba08ad8 of 8 bytes by task 14263 on cpu 35:
>   vma_interval_tree_insert+0x101/0x150:
>   rb_insert_augmented_cached at include/linux/rbtree_augmented.h:58
>   (inlined by) vma_interval_tree_insert at mm/interval_tree.c:23
>   __vma_link_file+0x6e/0xe0
>   __vma_link_file at mm/mmap.c:629
>   vma_link+0xa2/0x120
>   mmap_region+0x753/0xb90
>   do_mmap+0x45c/0x710
>   vm_mmap_pgoff+0xc0/0x130
>   ksys_mmap_pgoff+0x1d1/0x300
>   __x64_sys_mmap+0x33/0x40
>   do_syscall_64+0x91/0xc44
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
>  read to 0xffff9cf8bba08a80 of 200 bytes by task 14262 on cpu 122:
>   vm_area_dup+0x6a/0xe0
>   vm_area_dup at kernel/fork.c:362
>   __split_vma+0x72/0x2a0
>   __split_vma at mm/mmap.c:2661
>   split_vma+0x5a/0x80
>   mprotect_fixup+0x368/0x3f0
>   do_mprotect_pkey+0x263/0x420
>   __x64_sys_mprotect+0x51/0x70
>   do_syscall_64+0x91/0xc44
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> The write is holding mmap_sem while changing vm_area_struct.shared.rb.
> Even though the read is lockless while making a copy, the clone will
> have its own shared.rb reinitialized. Thus, mark it as an intentional
> data race using the data_race() macro.

I'm confused. AFAICS both sides hold mmap_sem on write:

 - vm_mmap_pgoff() takes mmap_sem for the write on the write side

 - do_mprotect_pkey() takes mmap_sem for the write on the read side


What do I miss?
Qian Cai Feb. 18, 2020, 3:59 a.m. UTC | #2
> On Feb 17, 2020, at 5:31 PM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> I'm confused. AFAICS both sides hold mmap_sem on write:
> 
> - vm_mmap_pgoff() takes mmap_sem for the write on the write side
> 
> - do_mprotect_pkey() takes mmap_sem for the write on the read side
> 
> 
> What do I miss?

Ah, good catch. I missed the locking for the read there. This is interesting because Marco
did confirm that this concurrency could happen,

https://lore.kernel.org/lkml/20191025173511.181416-1-elver@google.com/

If that means KCSAN is not at fault, then I could think of two things,

1) someone downgrades the lock.

I don’t think that’s the case here. Only __do_munmap() will do that, but I did not see how
it would affect us here.

2) the reader and writer are two different processes.

So, they would hold different mmap_sems, but I can’t see how two processes could share
the same vm_area_struct. Also, file->f_mapping->i_mmap was stored in the
writer, but I can’t see how it was also loaded in the reader.

Any ideas?
Kirill A. Shutemov Feb. 18, 2020, 10:30 a.m. UTC | #3
On Mon, Feb 17, 2020 at 10:59:47PM -0500, Qian Cai wrote:
> 
> 
> > On Feb 17, 2020, at 5:31 PM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > 
> > I'm confused. AFAICS both sides hold mmap_sem on write:
> > 
> > - vm_mmap_pgoff() takes mmap_sem for the write on the write side
> > 
> > - do_mprotect_pkey() takes mmap_sem for the write on the read side
> > 
> > 
> > What do I miss?
> 
> Ah, good catch. I missed the locking for the read there. This is interesting because Marco
> did confirm that this concurrency could happen,
> 
> https://lore.kernel.org/lkml/20191025173511.181416-1-elver@google.com/
> 
> If that means KCSAN is not at fault, then I could think of two things,
> 
> 1) someone downgrades the lock.
> 
> I don’t think that’s the case here. Only __do_munmap() will do that, but I did not see how
> it would affect us here.
> 
> 2) the reader and writer are two different processes.
> 
> So, they would hold different mmap_sems, but I can’t see how two processes could share
> the same vm_area_struct. Also, file->f_mapping->i_mmap was stored in the
> writer, but I can’t see how it was also loaded in the reader.
> 
> Any ideas?

I think I've got this:

vm_area_dup() blindly copies all fields of the original VMA to the new one.
This includes copying vm_area_struct::shared.rb, which is normally protected
by i_mmap_lock. But this is fine because the read value will be
overwritten on the following __vma_link_file() under proper protection.

So the fix is correct, but the justification is lacking.

Also, I would like a more fine-grained annotation: marking a 200-byte copy
with data_race() may hide other issues.
Qian Cai Feb. 18, 2020, 12:40 p.m. UTC | #4
> On Feb 18, 2020, at 5:29 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> I think I've got this:
> 
> vm_area_dup() blindly copies all fields of the original VMA to the new one.
> This includes copying vm_area_struct::shared.rb, which is normally protected
> by i_mmap_lock. But this is fine because the read value will be
> overwritten on the following __vma_link_file() under proper protection.

Right, multiple processes could share the same file-based address space where those VMAs have been linked into address_space::i_mmap via vm_area_struct::shared.rb. Thus, the reader could see its shared.rb linkage pointers getting updated by other processes.

> 
> So the fix is correct, but the justification is lacking.
> 
> Also, I would like a more fine-grained annotation: marking a 200-byte copy
> with data_race() may hide other issues.

That is the harder part; I don’t think we have anything for that today. Marco, any suggestions? ASSERT_IGNORE_FIELD()?
Marco Elver Feb. 18, 2020, 2:09 p.m. UTC | #5
On Tue, 18 Feb 2020 at 13:40, Qian Cai <cai@lca.pw> wrote:
>
>
>
> > On Feb 18, 2020, at 5:29 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > I think I've got this:
> >
> > vm_area_dup() blindly copies all fields of the original VMA to the new one.
> > This includes copying vm_area_struct::shared.rb, which is normally protected
> > by i_mmap_lock. But this is fine because the read value will be
> > overwritten on the following __vma_link_file() under proper protection.
>
> Right, multiple processes could share the same file-based address space where those VMAs have been linked into address_space::i_mmap via vm_area_struct::shared.rb. Thus, the reader could see its shared.rb linkage pointers getting updated by other processes.
>
> >
> > So the fix is correct, but the justification is lacking.
> > 
> > Also, I would like a more fine-grained annotation: marking a 200-byte copy
> > with data_race() may hide other issues.
>
> That is the harder part; I don’t think we have anything for that today. Marco, any suggestions? ASSERT_IGNORE_FIELD()?

There is no nice interface I can think of. All options will just cause
more problems, inconsistencies, or annoyances.

Ideally, to not introduce more types of macros and keep it consistent,
ASSERT_EXCLUSIVE_FIELDS_EXCEPT(var, ...) may be what you're after:
"Check no concurrent writers to struct, except ignore the provided
fields".

This option doesn't quite work, unless you just restrict it to 1 field
(we can't use ranges, because e.g. vm_area_struct has
__randomize_layout). The next time around, you'll want 2 fields, and
it won't work. Also, do we know that 'shared.rb' is the only thing we
want to ignore?

If you wanted something similar to ASSERT_EXCLUSIVE_BITS, it'd have to
be ASSERT_EXCLUSIVE_FIELDS(var, ...), however, this is quite annoying
for structs with many fields as you'd have to list all of them. It's
similar to what you can already do currently (but I also don't
recommend because it's unmaintainable):

ASSERT_EXCLUSIVE_WRITER(orig->vm_start);
ASSERT_EXCLUSIVE_WRITER(orig->vm_end);
ASSERT_EXCLUSIVE_WRITER(orig->vm_next);
ASSERT_EXCLUSIVE_WRITER(orig->vm_prev);
... and so on ...
*new = data_race(*orig);

Also note that vm_area_struct has __randomize_layout, which makes
using ranges impossible. All in all, I don't see a terribly nice
option.

If, however, you knew that there are 1 or 2 fields only that you want
to make sure are not modified concurrently, ASSERT_EXCLUSIVE_WRITER +
data_race() would probably work well (or even ASSERT_EXCLUSIVE_ACCESS
if you want to make sure there are no writers nor _readers_).

Thanks,
-- Marco
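To make the suggested combination concrete, here is a userspace sketch of the pattern Marco describes, with the KCSAN annotations modelled as no-ops. The macro names mirror the real ones from include/linux/kcsan-checks.h, but the bodies, the struct, and the choice of asserted fields are illustrative assumptions only:

```c
#include <assert.h>
#include <stddef.h>

/* No-op userspace models: under KCSAN, ASSERT_EXCLUSIVE_WRITER(var)
 * reports a race if any other thread writes var concurrently, and
 * data_race(expr) suppresses race reports for this one access. */
#define ASSERT_EXCLUSIVE_WRITER(var)	((void)&(var))
#define data_race(expr)			(expr)

struct vma_fields {		/* hypothetical miniature of vm_area_struct */
	unsigned long vm_flags;
	void *vm_file;
	long shared_rb;		/* deliberately not asserted: known-racy */
};

/* The combination discussed above: assert exclusivity for the fields
 * that must be stable, then mark the whole-struct copy as an
 * intentional data race. */
static inline struct vma_fields vma_dup_checked(const struct vma_fields *orig)
{
	ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
	ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
	return data_race(*orig);
}
```

This keeps the benign race on shared.rb silenced while still letting KCSAN flag unexpected concurrent writers to the asserted fields.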
Qian Cai Feb. 18, 2020, 3 p.m. UTC | #6
On Tue, 2020-02-18 at 15:09 +0100, Marco Elver wrote:
> On Tue, 18 Feb 2020 at 13:40, Qian Cai <cai@lca.pw> wrote:
> > 
> > 
> > 
> > > On Feb 18, 2020, at 5:29 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > > 
> > > I think I've got this:
> > > 
> > > vm_area_dup() blindly copies all fields of the original VMA to the new one.
> > > This includes copying vm_area_struct::shared.rb, which is normally protected
> > > by i_mmap_lock. But this is fine because the read value will be
> > > overwritten on the following __vma_link_file() under proper protection.
> > 
> > Right, multiple processes could share the same file-based address space where those VMAs have been linked into address_space::i_mmap via vm_area_struct::shared.rb. Thus, the reader could see its shared.rb linkage pointers getting updated by other processes.
> > 
> > > 
> > > So the fix is correct, but the justification is lacking.
> > > 
> > > Also, I would like a more fine-grained annotation: marking a 200-byte copy
> > > with data_race() may hide other issues.
> > 
> > That is the harder part; I don’t think we have anything for that today. Marco, any suggestions? ASSERT_IGNORE_FIELD()?
> 
> There is no nice interface I can think of. All options will just cause
> more problems, inconsistencies, or annoyances.
> 
> Ideally, to not introduce more types of macros and keep it consistent,
> ASSERT_EXCLUSIVE_FIELDS_EXCEPT(var, ...) may be what you're after:
> "Check no concurrent writers to struct, except ignore the provided
> fields".
> 
> This option doesn't quite work, unless you just restrict it to 1 field
> (we can't use ranges, because e.g. vm_area_struct has
> __randomize_layout). The next time around, you'll want 2 fields, and
> it won't work. Also, do we know that 'shared.rb' is the only thing we
> want to ignore?
> 
> If you wanted something similar to ASSERT_EXCLUSIVE_BITS, it'd have to
> be ASSERT_EXCLUSIVE_FIELDS(var, ...), however, this is quite annoying
> for structs with many fields as you'd have to list all of them. It's
> similar to what you can already do currently (but I also don't
> recommend because it's unmaintainable):
> 
> ASSERT_EXCLUSIVE_WRITER(orig->vm_start);
> ASSERT_EXCLUSIVE_WRITER(orig->vm_end);
> ASSERT_EXCLUSIVE_WRITER(orig->vm_next);
> ASSERT_EXCLUSIVE_WRITER(orig->vm_prev);
> ... and so on ...
> *new = data_race(*orig);
> 
> Also note that vm_area_struct has __randomize_layout, which makes
> using ranges impossible. All in all, I don't see a terribly nice
> option.
> 
> If, however, you knew that there are 1 or 2 fields only that you want
> to make sure are not modified concurrently, ASSERT_EXCLUSIVE_WRITER +
> data_race() would probably work well (or even ASSERT_EXCLUSIVE_ACCESS
> if you want to make sure there are no writers nor _readers_).

I am testing an idea that just does,

lockdep_assert_held_write(&orig->vm_mm->mmap_sem);
*new = data_race(*orig);

The idea is that as long as we have the exclusive mmap_sem held in all paths
(auditing indicated so), no writer should be able to mess up our vm_area_struct
except the "shared.rb" field, which is harmless.
Kirill A. Shutemov Feb. 18, 2020, 3:18 p.m. UTC | #7
On Tue, Feb 18, 2020 at 10:00:35AM -0500, Qian Cai wrote:
> On Tue, 2020-02-18 at 15:09 +0100, Marco Elver wrote:
> > On Tue, 18 Feb 2020 at 13:40, Qian Cai <cai@lca.pw> wrote:
> > > 
> > > 
> > > 
> > > > On Feb 18, 2020, at 5:29 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > > > 
> > > > I think I've got this:
> > > > 
> > > > vm_area_dup() blindly copies all fields of the original VMA to the new one.
> > > > This includes copying vm_area_struct::shared.rb, which is normally protected
> > > > by i_mmap_lock. But this is fine because the read value will be
> > > > overwritten on the following __vma_link_file() under proper protection.
> > > 
> > > Right, multiple processes could share the same file-based address space where those VMAs have been linked into address_space::i_mmap via vm_area_struct::shared.rb. Thus, the reader could see its shared.rb linkage pointers getting updated by other processes.
> > > 
> > > > 
> > > > So the fix is correct, but the justification is lacking.
> > > > 
> > > > Also, I would like a more fine-grained annotation: marking a 200-byte copy
> > > > with data_race() may hide other issues.
> > > 
> > > That is the harder part; I don’t think we have anything for that today. Marco, any suggestions? ASSERT_IGNORE_FIELD()?
> > 
> > There is no nice interface I can think of. All options will just cause
> > more problems, inconsistencies, or annoyances.
> > 
> > Ideally, to not introduce more types of macros and keep it consistent,
> > ASSERT_EXCLUSIVE_FIELDS_EXCEPT(var, ...) may be what you're after:
> > "Check no concurrent writers to struct, except ignore the provided
> > fields".
> > 
> > This option doesn't quite work, unless you just restrict it to 1 field
> > (we can't use ranges, because e.g. vm_area_struct has
> > __randomize_layout). The next time around, you'll want 2 fields, and
> > it won't work. Also, do we know that 'shared.rb' is the only thing we
> > want to ignore?
> > 
> > If you wanted something similar to ASSERT_EXCLUSIVE_BITS, it'd have to
> > be ASSERT_EXCLUSIVE_FIELDS(var, ...), however, this is quite annoying
> > for structs with many fields as you'd have to list all of them. It's
> > similar to what you can already do currently (but I also don't
> > recommend because it's unmaintainable):
> > 
> > ASSERT_EXCLUSIVE_WRITER(orig->vm_start);
> > ASSERT_EXCLUSIVE_WRITER(orig->vm_end);
> > ASSERT_EXCLUSIVE_WRITER(orig->vm_next);
> > ASSERT_EXCLUSIVE_WRITER(orig->vm_prev);
> > ... and so on ...
> > *new = data_race(*orig);
> > 
> > Also note that vm_area_struct has __randomize_layout, which makes
> > using ranges impossible. All in all, I don't see a terribly nice
> > option.
> > 
> > If, however, you knew that there are 1 or 2 fields only that you want
> > to make sure are not modified concurrently, ASSERT_EXCLUSIVE_WRITER +
> > data_race() would probably work well (or even ASSERT_EXCLUSIVE_ACCESS
> > if you want to make sure there are no writers nor _readers_).
> 
> I am testing an idea that just does,
> 
> lockdep_assert_held_write(&orig->vm_mm->mmap_sem);
> *new = data_race(*orig);
> 
> The idea is that as long as we have the exclusive mmap_sem held in all paths
> (auditing indicated so), no writer should be able to mess up our vm_area_struct
> except the "shared.rb" field, which is harmless.

Well, some fields are protected by page_table_lock and can be written to
without exclusive mmap_sem. Probably even without any mmap_sem: pinning the
mm_struct + page_table_lock should be enough.
Qian Cai Feb. 18, 2020, 4:46 p.m. UTC | #8
On Tue, 2020-02-18 at 18:18 +0300, Kirill A. Shutemov wrote:
> On Tue, Feb 18, 2020 at 10:00:35AM -0500, Qian Cai wrote:
> > On Tue, 2020-02-18 at 15:09 +0100, Marco Elver wrote:
> > > On Tue, 18 Feb 2020 at 13:40, Qian Cai <cai@lca.pw> wrote:
> > > > 
> > > > 
> > > > 
> > > > > On Feb 18, 2020, at 5:29 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > > > > 
> > > > > I think I've got this:
> > > > > 
> > > > > vm_area_dup() blindly copies all fields of the original VMA to the new one.
> > > > > This includes copying vm_area_struct::shared.rb, which is normally protected
> > > > > by i_mmap_lock. But this is fine because the read value will be
> > > > > overwritten on the following __vma_link_file() under proper protection.
> > > > 
> > > > Right, multiple processes could share the same file-based address space where those VMAs have been linked into address_space::i_mmap via vm_area_struct::shared.rb. Thus, the reader could see its shared.rb linkage pointers getting updated by other processes.
> > > > 
> > > > > 
> > > > > So the fix is correct, but the justification is lacking.
> > > > > 
> > > > > Also, I would like a more fine-grained annotation: marking a 200-byte copy
> > > > > with data_race() may hide other issues.
> > > > 
> > > > That is the harder part; I don’t think we have anything for that today. Marco, any suggestions? ASSERT_IGNORE_FIELD()?
> > > 
> > > There is no nice interface I can think of. All options will just cause
> > > more problems, inconsistencies, or annoyances.
> > > 
> > > Ideally, to not introduce more types of macros and keep it consistent,
> > > ASSERT_EXCLUSIVE_FIELDS_EXCEPT(var, ...) may be what you're after:
> > > "Check no concurrent writers to struct, except ignore the provided
> > > fields".
> > > 
> > > This option doesn't quite work, unless you just restrict it to 1 field
> > > (we can't use ranges, because e.g. vm_area_struct has
> > > __randomize_layout). The next time around, you'll want 2 fields, and
> > > it won't work. Also, do we know that 'shared.rb' is the only thing we
> > > want to ignore?
> > > 
> > > If you wanted something similar to ASSERT_EXCLUSIVE_BITS, it'd have to
> > > be ASSERT_EXCLUSIVE_FIELDS(var, ...), however, this is quite annoying
> > > for structs with many fields as you'd have to list all of them. It's
> > > similar to what you can already do currently (but I also don't
> > > recommend because it's unmaintainable):
> > > 
> > > ASSERT_EXCLUSIVE_WRITER(orig->vm_start);
> > > ASSERT_EXCLUSIVE_WRITER(orig->vm_end);
> > > ASSERT_EXCLUSIVE_WRITER(orig->vm_next);
> > > ASSERT_EXCLUSIVE_WRITER(orig->vm_prev);
> > > ... and so on ...
> > > *new = data_race(*orig);
> > > 
> > > Also note that vm_area_struct has __randomize_layout, which makes
> > > using ranges impossible. All in all, I don't see a terribly nice
> > > option.
> > > 
> > > If, however, you knew that there are 1 or 2 fields only that you want
> > > to make sure are not modified concurrently, ASSERT_EXCLUSIVE_WRITER +
> > > data_race() would probably work well (or even ASSERT_EXCLUSIVE_ACCESS
> > > if you want to make sure there are no writers nor _readers_).
> > 
> > I am testing an idea that just does,
> > 
> > lockdep_assert_held_write(&orig->vm_mm->mmap_sem);
> > *new = data_race(*orig);
> > 
> > The idea is that as long as we have the exclusive mmap_sem held in all paths
> > (auditing indicated so), no writer should be able to mess up our vm_area_struct
> > except the "shared.rb" field, which is harmless.
> 
> Well, some fields are protected by page_table_lock and can be written to
> without exclusive mmap_sem. Probably even without any mmap_sem: pinning the
> mm_struct + page_table_lock should be enough.
> 

How about this?

diff --git a/kernel/fork.c b/kernel/fork.c
index cb2ae49e497e..68f273e0ebf4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -359,7 +359,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
 
 	if (new) {
-		*new = *orig;
+		ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
+		ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
+
+		*new = data_race(*orig);
 		INIT_LIST_HEAD(&new->anon_vma_chain);
 		new->vm_next = new->vm_prev = NULL;
 	}

Patch

diff --git a/kernel/fork.c b/kernel/fork.c
index 41f784b6203a..81bdc6e8a6cf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -359,7 +359,11 @@  struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
 
 	if (new) {
-		*new = *orig;
+		/*
+		 * @orig may be modified concurrently, but the clone will be
+		 * reinitialized.
+		 */
+		*new = data_race(*orig);
 		INIT_LIST_HEAD(&new->anon_vma_chain);
 	}
 	return new;