[v2] mm/filemap: fix a data race in filemap_fault()

Message ID 20200211030134.1847-1-cai@lca.pw (mailing list archive)
State New, archived

Commit Message

Qian Cai Feb. 11, 2020, 3:01 a.m. UTC
struct file_ra_state ra.mmap_miss could be accessed concurrently during
page faults as noticed by KCSAN,

 BUG: KCSAN: data-race in filemap_fault / filemap_map_pages

 write to 0xffff9b1700a2c1b4 of 4 bytes by task 3292 on cpu 30:
  filemap_fault+0x920/0xfc0
  do_sync_mmap_readahead at mm/filemap.c:2384
  (inlined by) filemap_fault at mm/filemap.c:2486
  __xfs_filemap_fault+0x112/0x3e0 [xfs]
  xfs_filemap_fault+0x74/0x90 [xfs]
  __do_fault+0x9e/0x220
  do_fault+0x4a0/0x920
  __handle_mm_fault+0xc69/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 read to 0xffff9b1700a2c1b4 of 4 bytes by task 3313 on cpu 32:
  filemap_map_pages+0xc2e/0xd80
  filemap_map_pages at mm/filemap.c:2625
  do_fault+0x3da/0x920
  __handle_mm_fault+0xc69/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 32 PID: 3313 Comm: systemd-udevd Tainted: G        W    L 5.5.0-next-20200210+ #1
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

ra.mmap_miss contributes to the readahead decisions, so a data race on
it is undesirable. Both the read and the write happen only under the
non-exclusive mmap_sem, so two concurrent writers could even underflow
the counter. Fix the underflow by working on a local variable and
committing a single final store to ra.mmap_miss, given that a small
inaccuracy of the counter should be acceptable.

Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Qian Cai <cai@lca.pw>
---

v2: fix the underflow issue pointed out by Matthew.

 mm/filemap.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)
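
For readers outside mm/, the pattern the commit message describes can be
sketched in userspace C roughly as follows. This is only an illustration
under stated assumptions: READ_ONCE_UINT/WRITE_ONCE_UINT are simplified
volatile-cast stand-ins for the kernel's READ_ONCE()/WRITE_ONCE(), struct
ra_state is a hypothetical one-field stand-in for struct file_ra_state,
and MISS_CAP is a made-up value playing the role of MMAP_LOTSAMISS * 10.
It is not the kernel code itself.

```c
/* Simplified stand-ins for the kernel macros (assumption, not the real ones). */
#define READ_ONCE_UINT(x)     (*(volatile unsigned int *)&(x))
#define WRITE_ONCE_UINT(x, v) (*(volatile unsigned int *)&(x) = (v))

/* Hypothetical cap, playing the role of MMAP_LOTSAMISS * 10. */
#define MISS_CAP 1000u

/* Hypothetical one-field stand-in for struct file_ra_state. */
struct ra_state {
	unsigned int mmap_miss;
};

/*
 * Saturating increment: load the shared counter once into a local,
 * decide on the local copy, and store at most once. The returned local
 * value is what later comparisons should use, so the check and the
 * update always agree on the same snapshot.
 */
static unsigned int miss_inc(struct ra_state *ra)
{
	unsigned int miss = READ_ONCE_UINT(ra->mmap_miss);

	if (miss < MISS_CAP)
		WRITE_ONCE_UINT(ra->mmap_miss, ++miss);
	return miss;
}

/*
 * Decrement that cannot wrap below zero: the test and the subtraction
 * both operate on the local snapshot, never on a re-read of the field.
 */
static unsigned int miss_dec(struct ra_state *ra)
{
	unsigned int miss = READ_ONCE_UINT(ra->mmap_miss);

	if (miss)
		WRITE_ONCE_UINT(ra->mmap_miss, --miss);
	return miss;
}
```

With the unpatched "if (ra->mmap_miss > 0) ra->mmap_miss--;" form, two
racing callers can both pass the check while the counter is 1 and then
both decrement, wrapping the unsigned value. In the sketch each caller
stores a value derived from its own snapshot, so the worst case is a lost
update rather than a wrap.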

Comments

Matthew Wilcox Feb. 11, 2020, 3:49 a.m. UTC | #1
On Mon, Feb 10, 2020 at 10:01:34PM -0500, Qian Cai wrote:
> struct file_ra_state ra.mmap_miss could be accessed concurrently during
> page faults as noticed by KCSAN,
> 
>  BUG: KCSAN: data-race in filemap_fault / filemap_map_pages
> 
>  write to 0xffff9b1700a2c1b4 of 4 bytes by task 3292 on cpu 30:
>   filemap_fault+0x920/0xfc0
>   do_sync_mmap_readahead at mm/filemap.c:2384
>   (inlined by) filemap_fault at mm/filemap.c:2486
>   __xfs_filemap_fault+0x112/0x3e0 [xfs]
>   xfs_filemap_fault+0x74/0x90 [xfs]
>   __do_fault+0x9e/0x220
>   do_fault+0x4a0/0x920
>   __handle_mm_fault+0xc69/0xd00
>   handle_mm_fault+0xfc/0x2f0
>   do_page_fault+0x263/0x6f9
>   page_fault+0x34/0x40
> 
>  read to 0xffff9b1700a2c1b4 of 4 bytes by task 3313 on cpu 32:
>   filemap_map_pages+0xc2e/0xd80
>   filemap_map_pages at mm/filemap.c:2625
>   do_fault+0x3da/0x920
>   __handle_mm_fault+0xc69/0xd00
>   handle_mm_fault+0xfc/0x2f0
>   do_page_fault+0x263/0x6f9
>   page_fault+0x34/0x40
> 
>  Reported by Kernel Concurrency Sanitizer on:
>  CPU: 32 PID: 3313 Comm: systemd-udevd Tainted: G        W    L 5.5.0-next-20200210+ #1
>  Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
> 
> ra.mmap_miss is used to contribute the readahead decisions, a data race
> could be undesirable. Both the read and write is only under
> non-exclusive mmap_sem, two concurrent writers could even overflow the
> counter. Fixing the underflow by writing to a local variable before
> committing a final store to ra.mmap_miss given a small inaccuracy of the
> counter should be acceptable.
> 
> Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
> Signed-off-by: Qian Cai <cai@lca.pw>

That's more than Suggested-by.  The correct way to submit this patch is:

From: Kirill A. Shutemov <kirill@shutemov.name>
(at the top of the patch, so it gets credited to Kirill)

then in this section:

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Tested-by: Qian Cai <cai@lca.pw>

And now you can add:

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Qian Cai Feb. 11, 2020, 3:55 a.m. UTC | #2
> On Feb 10, 2020, at 10:49 PM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Mon, Feb 10, 2020 at 10:01:34PM -0500, Qian Cai wrote:
>> struct file_ra_state ra.mmap_miss could be accessed concurrently during
>> page faults as noticed by KCSAN,
>> 
>> BUG: KCSAN: data-race in filemap_fault / filemap_map_pages
>> 
>> write to 0xffff9b1700a2c1b4 of 4 bytes by task 3292 on cpu 30:
>>  filemap_fault+0x920/0xfc0
>>  do_sync_mmap_readahead at mm/filemap.c:2384
>>  (inlined by) filemap_fault at mm/filemap.c:2486
>>  __xfs_filemap_fault+0x112/0x3e0 [xfs]
>>  xfs_filemap_fault+0x74/0x90 [xfs]
>>  __do_fault+0x9e/0x220
>>  do_fault+0x4a0/0x920
>>  __handle_mm_fault+0xc69/0xd00
>>  handle_mm_fault+0xfc/0x2f0
>>  do_page_fault+0x263/0x6f9
>>  page_fault+0x34/0x40
>> 
>> read to 0xffff9b1700a2c1b4 of 4 bytes by task 3313 on cpu 32:
>>  filemap_map_pages+0xc2e/0xd80
>>  filemap_map_pages at mm/filemap.c:2625
>>  do_fault+0x3da/0x920
>>  __handle_mm_fault+0xc69/0xd00
>>  handle_mm_fault+0xfc/0x2f0
>>  do_page_fault+0x263/0x6f9
>>  page_fault+0x34/0x40
>> 
>> Reported by Kernel Concurrency Sanitizer on:
>> CPU: 32 PID: 3313 Comm: systemd-udevd Tainted: G        W    L 5.5.0-next-20200210+ #1
>> Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
>> 
>> ra.mmap_miss is used to contribute the readahead decisions, a data race
>> could be undesirable. Both the read and write is only under
>> non-exclusive mmap_sem, two concurrent writers could even overflow the
>> counter. Fixing the underflow by writing to a local variable before
>> committing a final store to ra.mmap_miss given a small inaccuracy of the
>> counter should be acceptable.
>> 
>> Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
>> Signed-off-by: Qian Cai <cai@lca.pw>
> 
> That's more than Suggested-by.  The correct way to submit this patch is:
> 
> From: Kirill A. Shutemov <kirill@shutemov.name>
> (at the top of the patch, so it gets credited to Kirill)

Sure, if Kirill is going to provide his Signed-off-by in the first place, I’ll be happy to
submit it on his behalf.

> 
> then in this section:
> 
> Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
> Tested-by: Qian Cai <cai@lca.pw>
> 
> And now you can add:
> 
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Kirill A. Shutemov Feb. 11, 2020, 1:24 p.m. UTC | #3
On Mon, Feb 10, 2020 at 10:55:45PM -0500, Qian Cai wrote:
> 
> 
> > On Feb 10, 2020, at 10:49 PM, Matthew Wilcox <willy@infradead.org> wrote:
> > 
> > On Mon, Feb 10, 2020 at 10:01:34PM -0500, Qian Cai wrote:
> >> struct file_ra_state ra.mmap_miss could be accessed concurrently during
> >> page faults as noticed by KCSAN,
> >> 
> >> BUG: KCSAN: data-race in filemap_fault / filemap_map_pages
> >> 
> >> write to 0xffff9b1700a2c1b4 of 4 bytes by task 3292 on cpu 30:
> >>  filemap_fault+0x920/0xfc0
> >>  do_sync_mmap_readahead at mm/filemap.c:2384
> >>  (inlined by) filemap_fault at mm/filemap.c:2486
> >>  __xfs_filemap_fault+0x112/0x3e0 [xfs]
> >>  xfs_filemap_fault+0x74/0x90 [xfs]
> >>  __do_fault+0x9e/0x220
> >>  do_fault+0x4a0/0x920
> >>  __handle_mm_fault+0xc69/0xd00
> >>  handle_mm_fault+0xfc/0x2f0
> >>  do_page_fault+0x263/0x6f9
> >>  page_fault+0x34/0x40
> >> 
> >> read to 0xffff9b1700a2c1b4 of 4 bytes by task 3313 on cpu 32:
> >>  filemap_map_pages+0xc2e/0xd80
> >>  filemap_map_pages at mm/filemap.c:2625
> >>  do_fault+0x3da/0x920
> >>  __handle_mm_fault+0xc69/0xd00
> >>  handle_mm_fault+0xfc/0x2f0
> >>  do_page_fault+0x263/0x6f9
> >>  page_fault+0x34/0x40
> >> 
> >> Reported by Kernel Concurrency Sanitizer on:
> >> CPU: 32 PID: 3313 Comm: systemd-udevd Tainted: G        W    L 5.5.0-next-20200210+ #1
> >> Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
> >> 
> >> ra.mmap_miss is used to contribute the readahead decisions, a data race
> >> could be undesirable. Both the read and write is only under
> >> non-exclusive mmap_sem, two concurrent writers could even overflow the
> >> counter. Fixing the underflow by writing to a local variable before
> >> committing a final store to ra.mmap_miss given a small inaccuracy of the
> >> counter should be acceptable.
> >> 
> >> Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
> >> Signed-off-by: Qian Cai <cai@lca.pw>
> > 
> > That's more than Suggested-by.  The correct way to submit this patch is:
> > 
> > From: Kirill A. Shutemov <kirill@shutemov.name>
> > (at the top of the patch, so it gets credited to Kirill)
> 
> Sure, if Kirill is going to provide his Signed-off-by in the first place, I’ll be happy to
> submit it on his behalf.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Patch

diff --git a/mm/filemap.c b/mm/filemap.c
index 1784478270e1..2e298db2e80f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2365,6 +2365,7 @@  static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	struct address_space *mapping = file->f_mapping;
 	struct file *fpin = NULL;
 	pgoff_t offset = vmf->pgoff;
+	unsigned int mmap_miss;
 
 	/* If we don't want any read-ahead, don't bother */
 	if (vmf->vma->vm_flags & VM_RAND_READ)
@@ -2380,14 +2381,15 @@  static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	}
 
 	/* Avoid banging the cache line if not needed */
-	if (ra->mmap_miss < MMAP_LOTSAMISS * 10)
-		ra->mmap_miss++;
+	mmap_miss = READ_ONCE(ra->mmap_miss);
+	if (mmap_miss < MMAP_LOTSAMISS * 10)
+		WRITE_ONCE(ra->mmap_miss, ++mmap_miss);
 
 	/*
 	 * Do we miss much more than hit in this file? If so,
 	 * stop bothering with read-ahead. It will only hurt.
 	 */
-	if (ra->mmap_miss > MMAP_LOTSAMISS)
+	if (mmap_miss > MMAP_LOTSAMISS)
 		return fpin;
 
 	/*
@@ -2413,13 +2415,15 @@  static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
 	struct file_ra_state *ra = &file->f_ra;
 	struct address_space *mapping = file->f_mapping;
 	struct file *fpin = NULL;
+	unsigned int mmap_miss;
 	pgoff_t offset = vmf->pgoff;
 
 	/* If we don't want any read-ahead, don't bother */
 	if (vmf->vma->vm_flags & VM_RAND_READ)
 		return fpin;
-	if (ra->mmap_miss > 0)
-		ra->mmap_miss--;
+	mmap_miss = READ_ONCE(ra->mmap_miss);
+	if (mmap_miss)
+		WRITE_ONCE(ra->mmap_miss, --mmap_miss);
 	if (PageReadahead(page)) {
 		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 		page_cache_async_readahead(mapping, ra, file,
@@ -2586,6 +2590,7 @@  void filemap_map_pages(struct vm_fault *vmf,
 	unsigned long max_idx;
 	XA_STATE(xas, &mapping->i_pages, start_pgoff);
 	struct page *page;
+	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
 
 	rcu_read_lock();
 	xas_for_each(&xas, page, end_pgoff) {
@@ -2622,8 +2627,8 @@  void filemap_map_pages(struct vm_fault *vmf,
 		if (page->index >= max_idx)
 			goto unlock;
 
-		if (file->f_ra.mmap_miss > 0)
-			file->f_ra.mmap_miss--;
+		if (mmap_miss > 0)
+			mmap_miss--;
 
 		vmf->address += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
 		if (vmf->pte)
@@ -2643,6 +2648,7 @@  void filemap_map_pages(struct vm_fault *vmf,
 			break;
 	}
 	rcu_read_unlock();
+	WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
 }
 EXPORT_SYMBOL(filemap_map_pages);
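
The filemap_map_pages() hunks above take one snapshot before the loop,
decrement only the local copy per mapped page, and publish the result
with a single store after the loop. A minimal userspace sketch of that
shape, assuming simplified volatile-cast stand-ins for READ_ONCE() and
WRITE_ONCE() and a hypothetical struct ra_state and account_hits() helper
that do not exist in the kernel:

```c
/* Simplified stand-ins for the kernel macros (assumption, not the real ones). */
#define READ_ONCE_UINT(x)     (*(volatile unsigned int *)&(x))
#define WRITE_ONCE_UINT(x, v) (*(volatile unsigned int *)&(x) = (v))

/* Hypothetical one-field stand-in for struct file_ra_state. */
struct ra_state {
	unsigned int mmap_miss;
};

/*
 * Hypothetical helper: account for `hits` successfully mapped pages.
 * The shared counter is read once, the loop touches only the local
 * copy, and the result is published with one store at the end.
 */
static void account_hits(struct ra_state *ra, unsigned int hits)
{
	unsigned int miss = READ_ONCE_UINT(ra->mmap_miss);
	unsigned int i;

	for (i = 0; i < hits; i++) {
		if (miss > 0)
			miss--;	/* clamps at zero, never wraps */
	}

	WRITE_ONCE_UINT(ra->mmap_miss, miss);
}
```

Any update another CPU makes between the snapshot and the final store is
overwritten. That is the small inaccuracy the commit message accepts in
exchange for a counter that cannot wrap and a shared cache line that is
written once per call instead of once per page.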