Message ID: 1581712403-27243-1-git-send-email-cai@lca.pw (mailing list archive)
State: New, archived
Series: [-next] fork: annotate a data race in vm_area_dup()
On Fri, Feb 14, 2020 at 03:33:23PM -0500, Qian Cai wrote:
> struct vm_area_struct could be accessed concurrently as noticed by
> KCSAN,
>
> write to 0xffff9cf8bba08ad8 of 8 bytes by task 14263 on cpu 35:
>  vma_interval_tree_insert+0x101/0x150:
>  rb_insert_augmented_cached at include/linux/rbtree_augmented.h:58
>  (inlined by) vma_interval_tree_insert at mm/interval_tree.c:23
>  __vma_link_file+0x6e/0xe0
>  __vma_link_file at mm/mmap.c:629
>  vma_link+0xa2/0x120
>  mmap_region+0x753/0xb90
>  do_mmap+0x45c/0x710
>  vm_mmap_pgoff+0xc0/0x130
>  ksys_mmap_pgoff+0x1d1/0x300
>  __x64_sys_mmap+0x33/0x40
>  do_syscall_64+0x91/0xc44
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> read to 0xffff9cf8bba08a80 of 200 bytes by task 14262 on cpu 122:
>  vm_area_dup+0x6a/0xe0
>  vm_area_dup at kernel/fork.c:362
>  __split_vma+0x72/0x2a0
>  __split_vma at mm/mmap.c:2661
>  split_vma+0x5a/0x80
>  mprotect_fixup+0x368/0x3f0
>  do_mprotect_pkey+0x263/0x420
>  __x64_sys_mprotect+0x51/0x70
>  do_syscall_64+0x91/0xc44
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> The write is holding mmap_sem while changing vm_area_struct.shared.rb.
> Even though the read is lockless while making a copy, the clone will
> have its own shared.rb reinitialized. Thus, mark it as an intentional
> data race using the data_race() macro.

I'm confused. AFAICS both sides hold mmap_sem on write:

 - vm_mmap_pgoff() takes mmap_sem for the write on the write side
 - do_mprotect_pkey() takes mmap_sem for the write on the read side

What do I miss?
> On Feb 17, 2020, at 5:31 PM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> I'm confused. AFAICS both sides hold mmap_sem on write:
>
>  - vm_mmap_pgoff() takes mmap_sem for the write on the write side
>  - do_mprotect_pkey() takes mmap_sem for the write on the read side
>
> What do I miss?

Ah, good catch. I missed the locking for the read there. This is interesting, because Marco did confirm that the concurrency could happen:

https://lore.kernel.org/lkml/20191025173511.181416-1-elver@google.com/

If that means KCSAN is not at fault, then I can think of two things:

1) Someone downgrades the lock.

I don't think that is the case here. Only __do_munmap() will do that, but I did not see how it would affect us here.

2) The reader and the writer are two different processes.

They would then hold different mmap_sems, but I can't see how two processes could share the same vm_area_struct. Also, file->f_mapping->i_mmap was stored in the writer, but I can't see how it was also loaded in the reader.

Any ideas?
On Mon, Feb 17, 2020 at 10:59:47PM -0500, Qian Cai wrote:
> > On Feb 17, 2020, at 5:31 PM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > [...]
> > What do I miss?
>
> Ah, good catch. I missed the locking for the read there. This is
> interesting, because Marco did confirm that the concurrency could happen:
>
> https://lore.kernel.org/lkml/20191025173511.181416-1-elver@google.com/
>
> If that means KCSAN is not at fault, then I can think of two things:
>
> 1) Someone downgrades the lock.
>
> I don't think that is the case here. Only __do_munmap() will do that, but
> I did not see how it would affect us here.
>
> 2) The reader and the writer are two different processes.
>
> They would then hold different mmap_sems, but I can't see how two
> processes could share the same vm_area_struct. Also,
> file->f_mapping->i_mmap was stored in the writer, but I can't see how it
> was also loaded in the reader.
>
> Any ideas?

I think I've got it:

vm_area_dup() blindly copies all fields of the original VMA to the new one. This includes copying vm_area_struct::shared.rb, which is normally protected by i_mmap_lock. But this is fine, because the read value will be overwritten by the following __vma_link_file() under proper protection.

So the fix is correct, but the justification is lacking.

Also, I would like a more fine-grained annotation: marking a 200-byte copy with data_race() may hide other issues.
> On Feb 18, 2020, at 5:29 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> I think I've got it:
>
> vm_area_dup() blindly copies all fields of the original VMA to the new
> one. This includes copying vm_area_struct::shared.rb, which is normally
> protected by i_mmap_lock. But this is fine, because the read value will
> be overwritten by the following __vma_link_file() under proper
> protection.

Right. Multiple processes could share the same file-based address space, where those VMAs have been linked into address_space::i_mmap via vm_area_struct::shared.rb. Thus, the reader could see its shared.rb linkage pointers get updated by other processes.

> So the fix is correct, but the justification is lacking.
>
> Also, I would like a more fine-grained annotation: marking a 200-byte
> copy with data_race() may hide other issues.

That is the harder part, where I don't think we have anything for it today. Marco, any suggestions? ASSERT_IGNORE_FIELD()?
On Tue, 18 Feb 2020 at 13:40, Qian Cai <cai@lca.pw> wrote:
> > On Feb 18, 2020, at 5:29 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > [...]
> > Also, I would like a more fine-grained annotation: marking a 200-byte
> > copy with data_race() may hide other issues.
>
> That is the harder part, where I don't think we have anything for it
> today. Marco, any suggestions? ASSERT_IGNORE_FIELD()?

There is no nice interface I can think of. All options will just cause more problems, inconsistencies, or annoyances.

Ideally, to not introduce more types of macros and keep it consistent, ASSERT_EXCLUSIVE_FIELDS_EXCEPT(var, ...) may be what you're after: "check no concurrent writers to the struct, except ignore the provided fields".

This option doesn't quite work unless you restrict it to one field (we can't use ranges, because e.g. vm_area_struct has __randomize_layout). The next time around, you'll want two fields, and it won't work. Also, do we know that 'shared.rb' is the only thing we want to ignore?

If you wanted something similar to ASSERT_EXCLUSIVE_BITS, it'd have to be ASSERT_EXCLUSIVE_FIELDS(var, ...); however, this is quite annoying for structs with many fields, as you'd have to list all of them. It's similar to what you can already do currently (which I also don't recommend, because it's unmaintainable):

	ASSERT_EXCLUSIVE_WRITER(orig->vm_start);
	ASSERT_EXCLUSIVE_WRITER(orig->vm_end);
	ASSERT_EXCLUSIVE_WRITER(orig->vm_next);
	ASSERT_EXCLUSIVE_WRITER(orig->vm_prev);
	... and so on ...
	*new = data_race(*orig);

Also note that vm_area_struct has __randomize_layout, which makes using ranges impossible. All in all, I don't see a terribly nice option.

If, however, you knew that there are only one or two fields that you want to make sure are not modified concurrently, ASSERT_EXCLUSIVE_WRITER + data_race() would probably work well (or even ASSERT_EXCLUSIVE_ACCESS, if you want to make sure there are no writers nor _readers_).

Thanks,
-- Marco
On Tue, 2020-02-18 at 15:09 +0100, Marco Elver wrote:
> On Tue, 18 Feb 2020 at 13:40, Qian Cai <cai@lca.pw> wrote:
> > [...]
> > That is the harder part, where I don't think we have anything for it
> > today. Marco, any suggestions? ASSERT_IGNORE_FIELD()?
>
> [...]
>
> If, however, you knew that there are only one or two fields that you want
> to make sure are not modified concurrently, ASSERT_EXCLUSIVE_WRITER +
> data_race() would probably work well (or even ASSERT_EXCLUSIVE_ACCESS,
> if you want to make sure there are no writers nor _readers_).

I am testing an idea that just does:

	lockdep_assert_held_write(&orig->vm_mm->mmap_sem);
	*new = data_race(*orig);

The idea is that, as long as we have the exclusive mmap_sem held in all paths (auditing indicated so), no writer should be able to mess up our vm_area_struct except for the "shared.rb" field, which does no harm.
On Tue, Feb 18, 2020 at 10:00:35AM -0500, Qian Cai wrote:
> [...]
>
> I am testing an idea that just does:
>
> 	lockdep_assert_held_write(&orig->vm_mm->mmap_sem);
> 	*new = data_race(*orig);
>
> The idea is that, as long as we have the exclusive mmap_sem held in all
> paths (auditing indicated so), no writer should be able to mess up our
> vm_area_struct except for the "shared.rb" field, which does no harm.

Well, some fields are protected by page_table_lock and can be written to without exclusive mmap_sem. Probably even without any mmap_sem: pinning the mm_struct + page_table_lock should be enough.
On Tue, 2020-02-18 at 18:18 +0300, Kirill A. Shutemov wrote:
> On Tue, Feb 18, 2020 at 10:00:35AM -0500, Qian Cai wrote:
> > [...]
> > The idea is that, as long as we have the exclusive mmap_sem held in all
> > paths (auditing indicated so), no writer should be able to mess up our
> > vm_area_struct except for the "shared.rb" field, which does no harm.
>
> Well, some fields are protected by page_table_lock and can be written to
> without exclusive mmap_sem. Probably even without any mmap_sem: pinning
> the mm_struct + page_table_lock should be enough.

How about this?

diff --git a/kernel/fork.c b/kernel/fork.c
index cb2ae49e497e..68f273e0ebf4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -359,7 +359,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
 
 	if (new) {
-		*new = *orig;
+		ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
+		ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
+
+		*new = data_race(*orig);
 		INIT_LIST_HEAD(&new->anon_vma_chain);
 		new->vm_next = new->vm_prev = NULL;
 	}
diff --git a/kernel/fork.c b/kernel/fork.c
index 41f784b6203a..81bdc6e8a6cf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -359,7 +359,11 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
 
 	if (new) {
-		*new = *orig;
+		/*
+		 * @orig may be modified concurrently, but the clone will be
+		 * reinitialized.
+		 */
+		*new = data_race(*orig);
 		INIT_LIST_HEAD(&new->anon_vma_chain);
 	}
 	return new;
struct vm_area_struct could be accessed concurrently, as noticed by KCSAN:

write to 0xffff9cf8bba08ad8 of 8 bytes by task 14263 on cpu 35:
 vma_interval_tree_insert+0x101/0x150:
 rb_insert_augmented_cached at include/linux/rbtree_augmented.h:58
 (inlined by) vma_interval_tree_insert at mm/interval_tree.c:23
 __vma_link_file+0x6e/0xe0
 __vma_link_file at mm/mmap.c:629
 vma_link+0xa2/0x120
 mmap_region+0x753/0xb90
 do_mmap+0x45c/0x710
 vm_mmap_pgoff+0xc0/0x130
 ksys_mmap_pgoff+0x1d1/0x300
 __x64_sys_mmap+0x33/0x40
 do_syscall_64+0x91/0xc44
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

read to 0xffff9cf8bba08a80 of 200 bytes by task 14262 on cpu 122:
 vm_area_dup+0x6a/0xe0
 vm_area_dup at kernel/fork.c:362
 __split_vma+0x72/0x2a0
 __split_vma at mm/mmap.c:2661
 split_vma+0x5a/0x80
 mprotect_fixup+0x368/0x3f0
 do_mprotect_pkey+0x263/0x420
 __x64_sys_mprotect+0x51/0x70
 do_syscall_64+0x91/0xc44
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The write is holding mmap_sem while changing vm_area_struct.shared.rb.
Even though the read is lockless while making a copy, the clone will
have its own shared.rb reinitialized. Thus, mark it as an intentional
data race using the data_race() macro.

Signed-off-by: Qian Cai <cai@lca.pw>
---
 kernel/fork.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)