diff mbox series

fs: fix a data race in i_size_write/i_size_read

Message ID 20200219040426.1140-1-cai@lca.pw (mailing list archive)
State New, archived
Headers show
Series fs: fix a data race in i_size_write/i_size_read | expand

Commit Message

Qian Cai Feb. 19, 2020, 4:04 a.m. UTC
inode::i_size could be accessed concurently as noticed by KCSAN,

 BUG: KCSAN: data-race in iomap_do_writepage / iomap_write_end

 write to 0xffff8bf68fc0cac0 of 8 bytes by task 7484 on cpu 71:
  iomap_write_end+0xea/0x530
  i_size_write at include/linux/fs.h:888
  (inlined by) iomap_write_end at fs/iomap/buffered-io.c:782
  iomap_write_actor+0x132/0x200
  iomap_apply+0x245/0x8a5
  iomap_file_buffered_write+0xbd/0xf0
  xfs_file_buffered_aio_write+0x1c2/0x790 [xfs]
  xfs_file_write_iter+0x232/0x2d0 [xfs]
  new_sync_write+0x29c/0x3b0
  __vfs_write+0x92/0xa0
  vfs_write+0x103/0x260
  ksys_write+0x9d/0x130
  __x64_sys_write+0x4c/0x60
  do_syscall_64+0x91/0xb05
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 read to 0xffff8bf68fc0cac0 of 8 bytes by task 5901 on cpu 70:
  iomap_do_writepage+0xf4/0x450
  i_size_read at include/linux/fs.h:866
  (inlined by) iomap_do_writepage at fs/iomap/buffered-io.c:1558
  write_cache_pages+0x523/0xb20
  iomap_writepages+0x47/0x80
  xfs_vm_writepages+0xc7/0x100 [xfs]
  do_writepages+0x5e/0x130
  __writeback_single_inode+0xd5/0xb20
  writeback_sb_inodes+0x429/0x910
  __writeback_inodes_wb+0xc4/0x150
  wb_writeback+0x47b/0x830
  wb_workfn+0x688/0x930
  process_one_work+0x54f/0xb90
  worker_thread+0x80/0x5f0
  kthread+0x1cd/0x1f0
  ret_from_fork+0x27/0x50

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 70 PID: 5901 Comm: kworker/u257:2 Tainted: G             L    5.6.0-rc2-next-20200218+ #2
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
 Workqueue: writeback wb_workfn (flush-254:0)

The write is protected by exclusive inode::i_rwsem (in
xfs_file_buffered_aio_write()) but the read is not. A shattered value
could introduce a logic bug. Fixed it using a pair of WRITE/READ_ONCE().

Signed-off-by: Qian Cai <cai@lca.pw>
---
 include/linux/fs.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Al Viro Feb. 19, 2020, 4:52 a.m. UTC | #1
On Tue, Feb 18, 2020 at 11:04:26PM -0500, Qian Cai wrote:
> inode::i_size could be accessed concurently as noticed by KCSAN,
> 
>  BUG: KCSAN: data-race in iomap_do_writepage / iomap_write_end
> 
>  write to 0xffff8bf68fc0cac0 of 8 bytes by task 7484 on cpu 71:
>   iomap_write_end+0xea/0x530
>   i_size_write at include/linux/fs.h:888
>   (inlined by) iomap_write_end at fs/iomap/buffered-io.c:782
>   iomap_write_actor+0x132/0x200
>   iomap_apply+0x245/0x8a5
>   iomap_file_buffered_write+0xbd/0xf0
>   xfs_file_buffered_aio_write+0x1c2/0x790 [xfs]
>   xfs_file_write_iter+0x232/0x2d0 [xfs]
>   new_sync_write+0x29c/0x3b0
>   __vfs_write+0x92/0xa0
>   vfs_write+0x103/0x260
>   ksys_write+0x9d/0x130
>   __x64_sys_write+0x4c/0x60
>   do_syscall_64+0x91/0xb05
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
>  read to 0xffff8bf68fc0cac0 of 8 bytes by task 5901 on cpu 70:
>   iomap_do_writepage+0xf4/0x450
>   i_size_read at include/linux/fs.h:866
>   (inlined by) iomap_do_writepage at fs/iomap/buffered-io.c:1558
>   write_cache_pages+0x523/0xb20
>   iomap_writepages+0x47/0x80
>   xfs_vm_writepages+0xc7/0x100 [xfs]
>   do_writepages+0x5e/0x130
>   __writeback_single_inode+0xd5/0xb20
>   writeback_sb_inodes+0x429/0x910
>   __writeback_inodes_wb+0xc4/0x150
>   wb_writeback+0x47b/0x830
>   wb_workfn+0x688/0x930
>   process_one_work+0x54f/0xb90
>   worker_thread+0x80/0x5f0
>   kthread+0x1cd/0x1f0
>   ret_from_fork+0x27/0x50
> 
>  Reported by Kernel Concurrency Sanitizer on:
>  CPU: 70 PID: 5901 Comm: kworker/u257:2 Tainted: G             L    5.6.0-rc2-next-20200218+ #2
>  Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
>  Workqueue: writeback wb_workfn (flush-254:0)
> 
> The write is protected by exclusive inode::i_rwsem (in
> xfs_file_buffered_aio_write()) but the read is not. A shattered value
> could introduce a logic bug. Fixed it using a pair of WRITE/READ_ONCE().

If aligned 64bit stores on 64bit host (note the BITS_PER_LONG ifdefs) end up
being split, the kernel is FUBAR anyway.  Details, please - how could that
end up happening?
Qian Cai Feb. 19, 2020, 5:08 a.m. UTC | #2
> On Feb 18, 2020, at 11:52 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> If aligned 64bit stores on 64bit host (note the BITS_PER_LONG ifdefs) end up
> being split, the kernel is FUBAR anyway.  Details, please - how could that
> end up happening?

My understanding is the compiler might decide to split the load into saying two 4-byte loads. Then, we might have,

Load1
Store
Load2

where the load value could be a garbage. Also, Marco (the KCSAN maintainer) who knew more of compiler than me mentioned that there is no guarantee that the store will not be split either. Thus, the WRITE_ONCE().
Al Viro Feb. 19, 2020, 5:23 a.m. UTC | #3
On Wed, Feb 19, 2020 at 12:08:40AM -0500, Qian Cai wrote:
> 
> 
> > On Feb 18, 2020, at 11:52 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > 
> > If aligned 64bit stores on 64bit host (note the BITS_PER_LONG ifdefs) end up
> > being split, the kernel is FUBAR anyway.  Details, please - how could that
> > end up happening?
> 
> My understanding is the compiler might decide to split the load into saying two 4-byte loads. Then, we might have,
> 
> Load1
> Store
> Load2
> 
> where the load value could be a garbage. Also, Marco (the KCSAN maintainer) who knew more of compiler than me mentioned that there is no guarantee that the store will not be split either. Thus, the WRITE_ONCE().
> 

	I would suggest
* if some compiler does that, ask the persons responsible for that
"optimization" which flags should be used to disable it.
* if they fail to provide such, educate them regarding the
usefulness of their idea
* if that does not help, don't use the bloody piece of garbage.

	Again, is that pure theory (because I can't come up with
any reason why splitting a 32bit load would be any less legitimate
than doing the same to a 64bit one on a 64bit architecture),
or is there anything that really would pull that off?
Marco Elver Feb. 19, 2020, 9:21 a.m. UTC | #4
On Wed, 19 Feb 2020 at 06:23, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Feb 19, 2020 at 12:08:40AM -0500, Qian Cai wrote:
> >
> >
> > > On Feb 18, 2020, at 11:52 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > If aligned 64bit stores on 64bit host (note the BITS_PER_LONG ifdefs) end up
> > > being split, the kernel is FUBAR anyway.  Details, please - how could that
> > > end up happening?
> >
> > My understanding is the compiler might decide to split the load into saying two 4-byte loads. Then, we might have,
> >
> > Load1
> > Store
> > Load2
> >
> > where the load value could be a garbage. Also, Marco (the KCSAN maintainer) who knew more of compiler than me mentioned that there is no guarantee that the store will not be split either. Thus, the WRITE_ONCE().
> >

In theory, yes. But by now we're aware of the current convention of
assuming plain aligned writes up to word-size are atomic (which KCSAN
respects with default Kconfig).

>         I would suggest
> * if some compiler does that, ask the persons responsible for that
> "optimization" which flags should be used to disable it.
> * if they fail to provide such, educate them regarding the
> usefulness of their idea
> * if that does not help, don't use the bloody piece of garbage.
>
>         Again, is that pure theory (because I can't come up with
> any reason why splitting a 32bit load would be any less legitimate
> than doing the same to a 64bit one on a 64bit architecture),
> or is there anything that really would pull that off?

Right. In reality, for mainstream architectures, it appears quite unlikely.

There may be other valid reasons, such as documenting the fact the
write can happen concurrently with loads.

Let's assume the WRITE_ONCE can be dropped.

The load is a different story. While load tearing may not be an issue,
it's more likely that other optimizations can break the code. For
example load fusing can break code that expects repeated loads in a
loop. E.g. I found these uses of i_size_read in loops:

git grep -E '(for|while) \(.*i_size_read'
fs/ocfs2/dir.c: while (ctx->pos < i_size_read(inode)) {
fs/ocfs2/dir.c:                 for (i = 0; i < i_size_read(inode) &&
i < offset; ) {
fs/ocfs2/dir.c: while (ctx->pos < i_size_read(inode)) {
fs/ocfs2/dir.c:         while (ctx->pos < i_size_read(inode)
fs/squashfs/dir.c:      while (length < i_size_read(inode)) {
fs/squashfs/namei.c:    while (length < i_size_read(dir)) {

Can i_size writes happen concurrently, and if so will these break if
the compiler decides to just do i_size_read's load once, and keep the
result in a register?

Thanks,
-- Marco
David Sterba Feb. 19, 2020, 12:08 p.m. UTC | #5
On Tue, Feb 18, 2020 at 11:04:26PM -0500, Qian Cai wrote:
> inode::i_size could be accessed concurently as noticed by KCSAN,
> 
>  BUG: KCSAN: data-race in iomap_do_writepage / iomap_write_end
> 
> The write is protected by exclusive inode::i_rwsem (in
> xfs_file_buffered_aio_write()) but the read is not. A shattered value
> could introduce a logic bug. Fixed it using a pair of WRITE/READ_ONCE().

We had a different problem with lack of READ_ONCE/WRITE_ONCE for i_size,
the fix was the same though. There was i_size_read(inode) used in max()
macro and compiled code two reads (unlocked), and this led to a race
when where different value was checked and then used.

The thread:
https://lore.kernel.org/linux-fsdevel/20191011202050.8656-1-josef@toxicpanda.com/

We had to apply a workaround to btrfs code because the generic fix was
not merged, even with several reviews and fixing a real bug. The report
from KCSAN seems to require some sort of splitting the values. What we
saw happened on 64bit platform without that effect so I'd call that a
more likely to happen because the pattern max(i_size_read(inode), ...) is
not something we'd instinctively consider unsafe.
David Sterba Feb. 19, 2020, 12:18 p.m. UTC | #6
On Wed, Feb 19, 2020 at 10:21:46AM +0100, Marco Elver wrote:
> Right. In reality, for mainstream architectures, it appears quite unlikely.
> 
> There may be other valid reasons, such as documenting the fact the
> write can happen concurrently with loads.
> 
> Let's assume the WRITE_ONCE can be dropped.
> 
> The load is a different story. While load tearing may not be an issue,
> it's more likely that other optimizations can break the code. For
> example load fusing can break code that expects repeated loads in a
> loop. E.g. I found these uses of i_size_read in loops:
> 
> git grep -E '(for|while) \(.*i_size_read'
> fs/ocfs2/dir.c: while (ctx->pos < i_size_read(inode)) {
> fs/ocfs2/dir.c:                 for (i = 0; i < i_size_read(inode) &&
> i < offset; ) {
> fs/ocfs2/dir.c: while (ctx->pos < i_size_read(inode)) {
> fs/ocfs2/dir.c:         while (ctx->pos < i_size_read(inode)
> fs/squashfs/dir.c:      while (length < i_size_read(inode)) {
> fs/squashfs/namei.c:    while (length < i_size_read(dir)) {
> 
> Can i_size writes happen concurrently, and if so will these break if
> the compiler decides to just do i_size_read's load once, and keep the
> result in a register?

It depends on the semantics and the behaviour when the value is not
cached in a register might be the wrong one. A concrete example with
assembly and analysis can be found in d98da49977f6 ("btrfs: save i_size
to avoid double evaluation of i_size_read in compress_file_range"),
which is the workardound mentioned in the my other mail.

C:
    actual_end = min_t(u64, i_size_read(inode), end + 1);

Asm:

        mov    0x20(%rsp),%rax
        cmp    %rax,0x48(%r15)           # read
        movl   $0x0,0x18(%rsp)
        mov    %rax,%r12
        mov    %r14,%rax
        cmovbe 0x48(%r15),%r12           # eval
    
      Where r15 is inode and 0x48 is offset of i_size.
    
      The original fix was to revert 62b37622718c that would do an
      intermediate assignment and this would also avoid the doulble
      evaluation but is not future-proof, should the compiler merge the
      stores and call i_size_read anyway.
    
      There's a patch adding READ_ONCE to i_size_read but that's not being
      applied at the moment and we need to fix the bug. Instead, emulate
      READ_ONCE by two barrier()s that's what effectively happens. The
      assembly confirms single evaluation:
    
        mov    0x48(%rbp),%rax          # read once
        mov    0x20(%rsp),%rcx
        mov    $0x20,%edx
        cmp    %rax,%rcx
        cmovbe %rcx,%rax
        mov    %rax,(%rsp)
        mov    %rax,%rcx
        mov    %r14,%rax
    
      Where 0x48(%rbp) is inode->i_size stored to %eax.
Qian Cai Feb. 21, 2020, 4:19 a.m. UTC | #7
> On Feb 19, 2020, at 4:21 AM, Marco Elver <elver@google.com> wrote:
> 
> Let's assume the WRITE_ONCE can be dropped.
> 
> The load is a different story. While load tearing may not be an issue,
> it's more likely that other optimizations can break the code. For
> example load fusing can break code that expects repeated loads in a
> loop. E.g. I found these uses of i_size_read in loops:
> 
> git grep -E '(for|while) \(.*i_size_read'
> fs/ocfs2/dir.c: while (ctx->pos < i_size_read(inode)) {
> fs/ocfs2/dir.c:                 for (i = 0; i < i_size_read(inode) &&
> i < offset; ) {
> fs/ocfs2/dir.c: while (ctx->pos < i_size_read(inode)) {
> fs/ocfs2/dir.c:         while (ctx->pos < i_size_read(inode)
> fs/squashfs/dir.c:      while (length < i_size_read(inode)) {
> fs/squashfs/namei.c:    while (length < i_size_read(dir)) {
> 
> Can i_size writes happen concurrently, and if so will these break if
> the compiler decides to just do i_size_read's load once, and keep the
> result in a register?

Al, is it more acceptable to add READ_ONCE() only then?
diff mbox series

Patch

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3cd4fe6b845e..25f98da90cf3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -863,7 +863,7 @@  static inline loff_t i_size_read(const struct inode *inode)
 	preempt_enable();
 	return i_size;
 #else
-	return inode->i_size;
+	return READ_ONCE(inode->i_size);
 #endif
 }
 
@@ -885,7 +885,7 @@  static inline void i_size_write(struct inode *inode, loff_t i_size)
 	inode->i_size = i_size;
 	preempt_enable();
 #else
-	inode->i_size = i_size;
+	WRITE_ONCE(inode->i_size, i_size);
 #endif
 }