fs: use READ_ONCE/WRITE_ONCE with the i_size helpers
diff mbox series

Message ID 20191011202050.8656-1-josef@toxicpanda.com
State New
Headers show
Series
  • fs: use READ_ONCE/WRITE_ONCE with the i_size helpers
Related show

Commit Message

Josef Bacik Oct. 11, 2019, 8:20 p.m. UTC
I spent the last few weeks running down a weird regression in btrfs we
were seeing in production.  It turned out to be introduced by
62b37622718c, which took the following

loff_t isize = i_size_read(inode);

actual_end = min_t(u64, isize, end + 1);

and turned it into

actual_end = min_t(u64, i_size_read(inode), end + 1);

The problem here is that the compiler is optimizing out the temporary
variables used in __cmp_once, so the resulting assembly looks like this

498             actual_end = min_t(u64, i_size_read(inode), end + 1);
   0xffffffff814b08c1 <+145>:   48 8b 44 24 28  mov    0x28(%rsp),%rax
   0xffffffff814b08c6 <+150>:   48 39 45 50     cmp    %rax,0x50(%rbp)
   0xffffffff814b08ca <+154>:   48 89 c6        mov    %rax,%rsi
   0xffffffff814b08cd <+157>:   48 0f 46 75 50  cmovbe 0x50(%rbp),%rsi

as you can see we read the value of the inode to compare, and then we
read it a second time to assign it.

This code is simply an optimization, so there's no locking to keep
i_size from changing, however we really need min_t to actually return
the minimum value for these two values, which it is failing to do.

We've reverted that patch for now to fix the problem, but it's only a
matter of time before the compiler becomes smart enough to optimize out
the loff_t isize intermediate variable as well.

Instead we want to make it explicit that i_size_read() should only read
the value once.  This will keep this class of problem from happening in
the future, regardless of what the compiler chooses to do.  With this
change we get the following assembly generated for this code

491             actual_end = min_t(u64, i_size_read(inode), end + 1);
   0xffffffff8148f625 <+149>:   48 8b 44 24 20  mov    0x20(%rsp),%rax

./include/linux/compiler.h:
199             __READ_ONCE_SIZE;
   0xffffffff8148f62a <+154>:   4c 8b 75 50     mov    0x50(%rbp),%r14

fs/btrfs/inode.c:
491             actual_end = min_t(u64, i_size_read(inode), end + 1);
   0xffffffff8148f62e <+158>:   49 39 c6        cmp    %rax,%r14
   0xffffffff8148f631 <+161>:   4c 0f 47 f0     cmova  %rax,%r14

and this works out properly, we only read the value once and so we won't
trip over this problem again.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 include/linux/fs.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Rik van Riel Oct. 12, 2019, 12:34 a.m. UTC | #1
On Fri, 2019-10-11 at 16:20 -0400, Josef Bacik wrote:
> 
> and this works out properly, we only read the value once and so we
> won't
> trip over this problem again.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Rik van Riel <riel@surriel.com>
Jan Kara Oct. 14, 2019, 8:51 a.m. UTC | #2
On Fri 11-10-19 16:20:50, Josef Bacik wrote:
> I spent the last few weeks running down a weird regression in btrfs we
> were seeing in production.  It turned out to be introduced by
> 62b37622718c, which took the following
> 
> loff_t isize = i_size_read(inode);
> 
> actual_end = min_t(u64, isize, end + 1);
> 
> and turned it into
> 
> actual_end = min_t(u64, i_size_read(inode), end + 1);
> 
> The problem here is that the compiler is optimizing out the temporary
> variables used in __cmp_once, so the resulting assembly looks like this
> 
> 498             actual_end = min_t(u64, i_size_read(inode), end + 1);
>    0xffffffff814b08c1 <+145>:   48 8b 44 24 28  mov    0x28(%rsp),%rax
>    0xffffffff814b08c6 <+150>:   48 39 45 50     cmp    %rax,0x50(%rbp)
>    0xffffffff814b08ca <+154>:   48 89 c6        mov    %rax,%rsi
>    0xffffffff814b08cd <+157>:   48 0f 46 75 50  cmovbe 0x50(%rbp),%rsi
> 
> as you can see we read the value of the inode to compare, and then we
> read it a second time to assign it.
> 
> This code is simply an optimization, so there's no locking to keep
> i_size from changing, however we really need min_t to actually return
> the minimum value for these two values, which it is failing to do.
> 
> We've reverted that patch for now to fix the problem, but it's only a
> matter of time before the compiler becomes smart enough to optimize out
> the loff_t isize intermediate variable as well.
> 
> Instead we want to make it explicit that i_size_read() should only read
> the value once.  This will keep this class of problem from happening in
> the future, regardless of what the compiler chooses to do.  With this
> change we get the following assembly generated for this code
> 
> 491             actual_end = min_t(u64, i_size_read(inode), end + 1);
>    0xffffffff8148f625 <+149>:   48 8b 44 24 20  mov    0x20(%rsp),%rax
> 
> ./include/linux/compiler.h:
> 199             __READ_ONCE_SIZE;
>    0xffffffff8148f62a <+154>:   4c 8b 75 50     mov    0x50(%rbp),%r14
> 
> fs/btrfs/inode.c:
> 491             actual_end = min_t(u64, i_size_read(inode), end + 1);
>    0xffffffff8148f62e <+158>:   49 39 c6        cmp    %rax,%r14
>    0xffffffff8148f631 <+161>:   4c 0f 47 f0     cmova  %rax,%r14
> 
> and this works out properly, we only read the value once and so we won't
> trip over this problem again.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Yeah, given i_size_read() is specifically meant for unlocked access to
i_size, it makes sense to hide the READ_ONCE() in that function (can
corresponding WRITE_ONCE() in i_size_write()). So feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  include/linux/fs.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e0d909d35763..0e3f887e2dc5 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -863,7 +863,7 @@ static inline loff_t i_size_read(const struct inode *inode)
>  	preempt_enable();
>  	return i_size;
>  #else
> -	return inode->i_size;
> +	return READ_ONCE(inode->i_size);
>  #endif
>  }
>  
> @@ -885,7 +885,7 @@ static inline void i_size_write(struct inode *inode, loff_t i_size)
>  	inode->i_size = i_size;
>  	preempt_enable();
>  #else
> -	inode->i_size = i_size;
> +	WRITE_ONCE(inode->i_size, i_size);
>  #endif
>  }
>  
> -- 
> 2.21.0
>
David Sterba Oct. 14, 2019, 12:02 p.m. UTC | #3
On Fri, Oct 11, 2019 at 04:20:50PM -0400, Josef Bacik wrote:
> I spent the last few weeks running down a weird regression in btrfs we
> were seeing in production.  It turned out to be introduced by
> 62b37622718c, which took the following
> 
> loff_t isize = i_size_read(inode);
> 
> actual_end = min_t(u64, isize, end + 1);
> 
> and turned it into
> 
> actual_end = min_t(u64, i_size_read(inode), end + 1);
> 
> The problem here is that the compiler is optimizing out the temporary
> variables used in __cmp_once, so the resulting assembly looks like this
> 
> 498             actual_end = min_t(u64, i_size_read(inode), end + 1);
>    0xffffffff814b08c1 <+145>:   48 8b 44 24 28  mov    0x28(%rsp),%rax
>    0xffffffff814b08c6 <+150>:   48 39 45 50     cmp    %rax,0x50(%rbp)
>    0xffffffff814b08ca <+154>:   48 89 c6        mov    %rax,%rsi
>    0xffffffff814b08cd <+157>:   48 0f 46 75 50  cmovbe 0x50(%rbp),%rsi
> 
> as you can see we read the value of the inode to compare, and then we
> read it a second time to assign it.
> 
> This code is simply an optimization, so there's no locking to keep
> i_size from changing, however we really need min_t to actually return
> the minimum value for these two values, which it is failing to do.
> 
> We've reverted that patch for now to fix the problem, but it's only a
> matter of time before the compiler becomes smart enough to optimize out
> the loff_t isize intermediate variable as well.

The cleanup patch proably made it more likely but otherwise does not fix
the problem so you got lucky with reverting it.

I'll repeat what I sent as reply to the btrfs patch https://patchwork.kernel.org/patch/11185435/ ,
addin the temporary variable is still not entirely correct and depends
on the compiler.

This patch adding READ_ONCE to i_size_read is IMHO the only safe way, as
the accessors are intended for unlocked use and for that
READ_ONCE/WRITE_ONCE are a must, per LKMM (linux kernel memody model). I
slightly wonder why this hasn't been so since long ago.

> Instead we want to make it explicit that i_size_read() should only read
> the value once.  This will keep this class of problem from happening in
> the future, regardless of what the compiler chooses to do.  With this
> change we get the following assembly generated for this code
> 
> 491             actual_end = min_t(u64, i_size_read(inode), end + 1);
>    0xffffffff8148f625 <+149>:   48 8b 44 24 20  mov    0x20(%rsp),%rax
> 
> ./include/linux/compiler.h:
> 199             __READ_ONCE_SIZE;
>    0xffffffff8148f62a <+154>:   4c 8b 75 50     mov    0x50(%rbp),%r14
> 
> fs/btrfs/inode.c:
> 491             actual_end = min_t(u64, i_size_read(inode), end + 1);
>    0xffffffff8148f62e <+158>:   49 39 c6        cmp    %rax,%r14
>    0xffffffff8148f631 <+161>:   4c 0f 47 f0     cmova  %rax,%r14
> 
> and this works out properly, we only read the value once and so we won't
> trip over this problem again.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

And CC: stable too.

Reviewed-by: David Sterba <dsterba@suse.com>

> ---
>  include/linux/fs.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e0d909d35763..0e3f887e2dc5 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -863,7 +863,7 @@ static inline loff_t i_size_read(const struct inode *inode)
>  	preempt_enable();
>  	return i_size;
>  #else
> -	return inode->i_size;
> +	return READ_ONCE(inode->i_size);

Regarding the other ways to access i_size, preempt_enable() are compiler
barriers, so this should be safe without explicit READ_ONCE.
Josef Bacik Oct. 24, 2019, 12:08 p.m. UTC | #4
On Fri, Oct 11, 2019 at 04:20:50PM -0400, Josef Bacik wrote:
> I spent the last few weeks running down a weird regression in btrfs we
> were seeing in production.  It turned out to be introduced by
> 62b37622718c, which took the following
> 
> loff_t isize = i_size_read(inode);
> 
> actual_end = min_t(u64, isize, end + 1);
> 
> and turned it into
> 
> actual_end = min_t(u64, i_size_read(inode), end + 1);
> 
> The problem here is that the compiler is optimizing out the temporary
> variables used in __cmp_once, so the resulting assembly looks like this
> 
> 498             actual_end = min_t(u64, i_size_read(inode), end + 1);
>    0xffffffff814b08c1 <+145>:   48 8b 44 24 28  mov    0x28(%rsp),%rax
>    0xffffffff814b08c6 <+150>:   48 39 45 50     cmp    %rax,0x50(%rbp)
>    0xffffffff814b08ca <+154>:   48 89 c6        mov    %rax,%rsi
>    0xffffffff814b08cd <+157>:   48 0f 46 75 50  cmovbe 0x50(%rbp),%rsi
> 
> as you can see we read the value of the inode to compare, and then we
> read it a second time to assign it.
> 
> This code is simply an optimization, so there's no locking to keep
> i_size from changing, however we really need min_t to actually return
> the minimum value for these two values, which it is failing to do.
> 
> We've reverted that patch for now to fix the problem, but it's only a
> matter of time before the compiler becomes smart enough to optimize out
> the loff_t isize intermediate variable as well.
> 
> Instead we want to make it explicit that i_size_read() should only read
> the value once.  This will keep this class of problem from happening in
> the future, regardless of what the compiler chooses to do.  With this
> change we get the following assembly generated for this code
> 
> 491             actual_end = min_t(u64, i_size_read(inode), end + 1);
>    0xffffffff8148f625 <+149>:   48 8b 44 24 20  mov    0x20(%rsp),%rax
> 
> ./include/linux/compiler.h:
> 199             __READ_ONCE_SIZE;
>    0xffffffff8148f62a <+154>:   4c 8b 75 50     mov    0x50(%rbp),%r14
> 
> fs/btrfs/inode.c:
> 491             actual_end = min_t(u64, i_size_read(inode), end + 1);
>    0xffffffff8148f62e <+158>:   49 39 c6        cmp    %rax,%r14
>    0xffffffff8148f631 <+161>:   4c 0f 47 f0     cmova  %rax,%r14
> 
> and this works out properly, we only read the value once and so we won't
> trip over this problem again.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Al,

Will you pick this up, or do you want me to send it along?  Thanks,

Josef
David Sterba Oct. 29, 2019, 5:22 p.m. UTC | #5
On Thu, Oct 24, 2019 at 08:08:44AM -0400, Josef Bacik wrote:
> Al,
> 
> Will you pick this up, or do you want me to send it along?  Thanks,

So is this patch on the way to 5.4-rc or shall we apply the
btrfs-specific fix? Thanks.

Patch
diff mbox series

diff --git a/include/linux/fs.h b/include/linux/fs.h
index e0d909d35763..0e3f887e2dc5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -863,7 +863,7 @@  static inline loff_t i_size_read(const struct inode *inode)
 	preempt_enable();
 	return i_size;
 #else
-	return inode->i_size;
+	return READ_ONCE(inode->i_size);
 #endif
 }
 
@@ -885,7 +885,7 @@  static inline void i_size_write(struct inode *inode, loff_t i_size)
 	inode->i_size = i_size;
 	preempt_enable();
 #else
-	inode->i_size = i_size;
+	WRITE_ONCE(inode->i_size, i_size);
 #endif
 }