diff mbox series

[5.10.x] btrfs: fix crash after non-aligned direct IO write with O_DSYNC

Message ID 94663c8a2172dc96b760d356a538d45c36f46040.1613062764.git.fdmanana@suse.com (mailing list archive)
State New, archived
Headers show
Series [5.10.x] btrfs: fix crash after non-aligned direct IO write with O_DSYNC | expand

Commit Message

Filipe Manana Feb. 11, 2021, 5 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

Whenever we attempt to do a non-aligned direct IO write with O_DSYNC, we
end up triggering an assertion and crashing. Example reproducer:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdj
  MNT=/mnt/sdj

  mkfs.btrfs -f $DEV > /dev/null
  mount $DEV $MNT

  # Do a direct IO write with O_DSYNC into a non-aligned range...
  xfs_io -f -d -s -c "pwrite -S 0xab -b 64K 1111 64K" $MNT/foobar

  umount $MNT

When running the reproducer an assertion fails and produces the following
trace:

  [ 2418.403134] assertion failed: !current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA, in fs/btrfs/space-info.c:1467
  [ 2418.403745] ------------[ cut here ]------------
  [ 2418.404306] kernel BUG at fs/btrfs/ctree.h:3286!
  [ 2418.404862] invalid opcode: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC PTI
  [ 2418.405451] CPU: 1 PID: 64705 Comm: xfs_io Tainted: G      D           5.10.15-btrfs-next-87 #1
  [ 2418.406026] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  [ 2418.407228] RIP: 0010:assertfail.constprop.0+0x18/0x26 [btrfs]
  [ 2418.407835] Code: e6 48 c7 (...)
  [ 2418.409078] RSP: 0018:ffffb06080d13c98 EFLAGS: 00010246
  [ 2418.409696] RAX: 000000000000006c RBX: ffff994c1debbf08 RCX: 0000000000000000
  [ 2418.410302] RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
  [ 2418.410904] RBP: ffff994c21770000 R08: 0000000000000000 R09: 0000000000000000
  [ 2418.411504] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000010000
  [ 2418.412111] R13: ffff994c22198400 R14: ffff994c21770000 R15: 0000000000000000
  [ 2418.412713] FS:  00007f54fd7aff00(0000) GS:ffff994d35200000(0000) knlGS:0000000000000000
  [ 2418.413326] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 2418.413933] CR2: 000056549596d000 CR3: 000000010b928003 CR4: 0000000000370ee0
  [ 2418.414528] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [ 2418.415109] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [ 2418.415669] Call Trace:
  [ 2418.416254]  btrfs_reserve_data_bytes.cold+0x22/0x22 [btrfs]
  [ 2418.416812]  btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
  [ 2418.417380]  btrfs_buffered_write+0x1b0/0x7f0 [btrfs]
  [ 2418.418315]  btrfs_file_write_iter+0x2a9/0x770 [btrfs]
  [ 2418.418920]  new_sync_write+0x11f/0x1c0
  [ 2418.419430]  vfs_write+0x2bb/0x3b0
  [ 2418.419972]  __x64_sys_pwrite64+0x90/0xc0
  [ 2418.420486]  do_syscall_64+0x33/0x80
  [ 2418.420979]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [ 2418.421486] RIP: 0033:0x7f54fda0b986
  [ 2418.421981] Code: 48 c7 c0 (...)
  [ 2418.423019] RSP: 002b:00007ffc40569c38 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
  [ 2418.423547] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f54fda0b986
  [ 2418.424075] RDX: 0000000000010000 RSI: 000056549595e000 RDI: 0000000000000003
  [ 2418.424596] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000400
  [ 2418.425119] R10: 0000000000000400 R11: 0000000000000246 R12: 00000000ffffffff
  [ 2418.425644] R13: 0000000000000400 R14: 0000000000010000 R15: 0000000000000000
  [ 2418.426148] Modules linked in: btrfs blake2b_generic (...)
  [ 2418.429540] ---[ end trace ef2aeb44dc0afa34 ]---

1) At btrfs_file_write_iter() we set current->journal_info to
   BTRFS_DIO_SYNC_STUB;

2) We then call __btrfs_direct_write(), which calls btrfs_direct_IO();

3) We can't do the direct IO write because it starts at a non-aligned
   offset (1111). So at btrfs_direct_IO() we return -EINVAL (coming from
   check_direct_IO() which does the alignment check), but we leave
   current->journal_info set to BTRFS_DIO_SYNC_STUB - we only clear it
   at btrfs_dio_iomap_begin(), because we assume we always get there;

4) Then at __btrfs_direct_write() we see that the attempt to do the
   direct IO write was not successful, 0 bytes written, so we fallback
   to a buffered write by calling btrfs_buffered_write();

5) There we call btrfs_check_data_free_space() which in turn calls
   btrfs_alloc_data_chunk_ondemand() and that calls
   btrfs_reserve_data_bytes() with flush == BTRFS_RESERVE_FLUSH_DATA;

6) Then at btrfs_reserve_data_bytes() we have current->journal_info set to
   BTRFS_DIO_SYNC_STUB, therefore not NULL, and flush has the value
   BTRFS_RESERVE_FLUSH_DATA, triggering the second assertion:

  int btrfs_reserve_data_bytes(struct btrfs_fs_info *fs_info, u64 bytes,
                               enum btrfs_reserve_flush_enum flush)
  {
      struct btrfs_space_info *data_sinfo = fs_info->data_sinfo;
      int ret;

      ASSERT(flush == BTRFS_RESERVE_FLUSH_DATA ||
             flush == BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE);
      ASSERT(!current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA);
  (...)

So fix that by setting the journal to NULL whenever check_direct_IO()
returns a failure.

This bug only affects 5.10 kernels, and the regression was introduced in
5.10-rc1 by commit 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround").
The bug does not exist in 5.11 kernels due to commit ecfdc08b8cc65d
("btrfs: remove dio iomap DSYNC workaround"), which depends on other
changes that went into the merge window for 5.11. So this is a fix only
for 5.10.x stable kernels, as there are people hitting this.

Fixes: 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround")
CC: stable@vger.kernel.org # 5.10 (and only 5.10)
Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1181605
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Wang Yugui Feb. 13, 2021, 1:04 a.m. UTC | #1
Hi,

> This bug only affects 5.10 kernels, and the regression was introduced in
> 5.10-rc1 by commit 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround").
> The bug does not exist in 5.11 kernels due to commit ecfdc08b8cc65d
> ("btrfs: remove dio iomap DSYNC workaround"), which depends on other
> changes that went into the merge window for 5.11. So this is a fix only
> for 5.10.x stable kernels, as there are people hitting this.

It is OK too to backport commit ecfdc08b8cc65d
 ("btrfs: remove dio iomap DSYNC workaround") to 5.10 for this problem?

the iomap issue for commit 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround")
is already fixed in 5.10?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/02/13


> Fixes: 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround")
> CC: stable@vger.kernel.org # 5.10 (and only 5.10)
> Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1181605
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/inode.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index acc47e2ffb46..b536d21541a9 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8026,8 +8026,12 @@ ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>  	bool relock = false;
>  	ssize_t ret;
>  
> -	if (check_direct_IO(fs_info, iter, offset))
> +	if (check_direct_IO(fs_info, iter, offset)) {
> +		ASSERT(current->journal_info == NULL ||
> +		       current->journal_info == BTRFS_DIO_SYNC_STUB);
> +		current->journal_info = NULL;
>  		return 0;
> +	}
>  
>  	count = iov_iter_count(iter);
>  	if (iov_iter_rw(iter) == WRITE) {
> -- 
> 2.28.0
Filipe Manana Feb. 15, 2021, 11:05 a.m. UTC | #2
On Sat, Feb 13, 2021 at 1:07 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
>
> Hi,
>
> > This bug only affects 5.10 kernels, and the regression was introduced in
> > 5.10-rc1 by commit 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround").
> > The bug does not exist in 5.11 kernels due to commit ecfdc08b8cc65d
> > ("btrfs: remove dio iomap DSYNC workaround"), which depends on other
> > changes that went into the merge window for 5.11. So this is a fix only
> > for 5.10.x stable kernels, as there are people hitting this.
>
> It is OK too to backport commit ecfdc08b8cc65d
>  ("btrfs: remove dio iomap DSYNC workaround") to 5.10 for this problem?
>
> the iomap issue for commit 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround")
> is already fixed in 5.10?

Quoting the changelog:

"commit ecfdc08b8cc65d
("btrfs: remove dio iomap DSYNC workaround"), which depends on other
changes that went into the merge window for 5.11."

All the changes, are (at least):

commit ecfdc08b8cc65d737eebc26a1ee1875a097fd6a0   --> 5.11-rc1
Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
Date:   Thu Sep 24 11:39:21 2020 -0500

    btrfs: remove dio iomap DSYNC workaround

commit a42fa643169d2325602572633fcaa16862990e28
Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
Date:   Thu Sep 24 11:39:20 2020 -0500

    btrfs: call iomap_dio_complete() without inode_lock

commit 502756b380938022c848761837f8fa3976906aa1
Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
Date:   Thu Sep 24 11:39:19 2020 -0500

    btrfs: remove btrfs_inode::dio_sem

commit e9adabb9712ef9424cbbeeaa027d962ab5262e19
Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
Date:   Thu Sep 24 11:39:18 2020 -0500

    btrfs: use shared lock for direct writes within EOF

commit c352370633400d13765cc88080c969799ea51108
Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
Date:   Thu Sep 24 11:39:17 2020 -0500

    btrfs: push inode locking and unlocking into buffered/direct write

commit a14b78ad06aba0fa7e76d2bc13c5ba581a7f331a
Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
Date:   Thu Sep 24 11:39:16 2020 -0500

    btrfs: introduce btrfs_inode_lock()/unlock()

commit b8d8e1fd570a194904f545b135efc880d96a41a4
Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
Date:   Thu Sep 24 11:39:15 2020 -0500

    btrfs: introduce btrfs_write_check()

That's probably too much to add to stable at once, plus I'm assuming
all required iomap dependencies are in 5.10 already (it seems so,
unless I missed something).

Usually we don't add patches to stable that didn't go through Linus'
tree either (there were 1 or 2 very rare exceptions in the past I
think), but when a backport depends on so many patches, and not all
from the same patchset, the risk of getting something wrong is
significant. That's why I opted to send this patch, which is much more
simple.

David has more experience on that and it's up to him to decide.



>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/02/13
>
>
> > Fixes: 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround")
> > CC: stable@vger.kernel.org # 5.10 (and only 5.10)
> > Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1181605
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/inode.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index acc47e2ffb46..b536d21541a9 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -8026,8 +8026,12 @@ ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> >       bool relock = false;
> >       ssize_t ret;
> >
> > -     if (check_direct_IO(fs_info, iter, offset))
> > +     if (check_direct_IO(fs_info, iter, offset)) {
> > +             ASSERT(current->journal_info == NULL ||
> > +                    current->journal_info == BTRFS_DIO_SYNC_STUB);
> > +             current->journal_info = NULL;
> >               return 0;
> > +     }
> >
> >       count = iov_iter_count(iter);
> >       if (iov_iter_rw(iter) == WRITE) {
> > --
> > 2.28.0
>
>
David Sterba Feb. 16, 2021, 2:23 p.m. UTC | #3
On Mon, Feb 15, 2021 at 11:05:33AM +0000, Filipe Manana wrote:
> On Sat, Feb 13, 2021 at 1:07 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
> > > This bug only affects 5.10 kernels, and the regression was introduced in
> > > 5.10-rc1 by commit 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround").
> > > The bug does not exist in 5.11 kernels due to commit ecfdc08b8cc65d
> > > ("btrfs: remove dio iomap DSYNC workaround"), which depends on other
> > > changes that went into the merge window for 5.11. So this is a fix only
> > > for 5.10.x stable kernels, as there are people hitting this.
> >
> > It is OK too to backport commit ecfdc08b8cc65d
> >  ("btrfs: remove dio iomap DSYNC workaround") to 5.10 for this problem?
> >
> > the iomap issue for commit 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround")
> > is already fixed in 5.10?
> 
> Quoting the changelog:
> 
> "commit ecfdc08b8cc65d
> ("btrfs: remove dio iomap DSYNC workaround"), which depends on other
> changes that went into the merge window for 5.11."
> 
> All the changes, are (at least):
> 
> commit ecfdc08b8cc65d737eebc26a1ee1875a097fd6a0   --> 5.11-rc1
> Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Date:   Thu Sep 24 11:39:21 2020 -0500
> 
>     btrfs: remove dio iomap DSYNC workaround
> 
> commit a42fa643169d2325602572633fcaa16862990e28
> Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Date:   Thu Sep 24 11:39:20 2020 -0500
> 
>     btrfs: call iomap_dio_complete() without inode_lock
> 
> commit 502756b380938022c848761837f8fa3976906aa1
> Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Date:   Thu Sep 24 11:39:19 2020 -0500
> 
>     btrfs: remove btrfs_inode::dio_sem
> 
> commit e9adabb9712ef9424cbbeeaa027d962ab5262e19
> Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Date:   Thu Sep 24 11:39:18 2020 -0500
> 
>     btrfs: use shared lock for direct writes within EOF
> 
> commit c352370633400d13765cc88080c969799ea51108
> Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Date:   Thu Sep 24 11:39:17 2020 -0500
> 
>     btrfs: push inode locking and unlocking into buffered/direct write
> 
> commit a14b78ad06aba0fa7e76d2bc13c5ba581a7f331a
> Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Date:   Thu Sep 24 11:39:16 2020 -0500
> 
>     btrfs: introduce btrfs_inode_lock()/unlock()
> 
> commit b8d8e1fd570a194904f545b135efc880d96a41a4
> Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Date:   Thu Sep 24 11:39:15 2020 -0500
> 
>     btrfs: introduce btrfs_write_check()
> 
> That's probably too much to add to stable at once, plus I'm assuming
> all required iomap dependencies are in 5.10 already (it seems so,
> unless I missed something).
> 
> Usually we don't add patches to stable that didn't go through Linus'
> tree either (there were 1 or 2 very rare exceptions in the past I
> think), but when a backport depends on so many patches, and not all
> from the same patchset, the risk of getting something wrong is
> significant. That's why I opted to send this patch, which is much more
> simple.

Agreed, in this case the backport would be too big, just the diffstat
between b8d8e1fd570^..ecfdc08b8cc6 is

 5 files changed, 213 insertions(+), 240 deletions(-)

This fix is minimal and suitable for stable as an exception. You did not
CC stable@vger.kernel.org so you'll need to send it again. Please CC me
too in case there are some questions from stable team. Thanks.
Filipe Manana Feb. 16, 2021, 2:41 p.m. UTC | #4
On Tue, Feb 16, 2021 at 2:25 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Mon, Feb 15, 2021 at 11:05:33AM +0000, Filipe Manana wrote:
> > On Sat, Feb 13, 2021 at 1:07 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
> > > > This bug only affects 5.10 kernels, and the regression was introduced in
> > > > 5.10-rc1 by commit 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround").
> > > > The bug does not exist in 5.11 kernels due to commit ecfdc08b8cc65d
> > > > ("btrfs: remove dio iomap DSYNC workaround"), which depends on other
> > > > changes that went into the merge window for 5.11. So this is a fix only
> > > > for 5.10.x stable kernels, as there are people hitting this.
> > >
> > > It is OK too to backport commit ecfdc08b8cc65d
> > >  ("btrfs: remove dio iomap DSYNC workaround") to 5.10 for this problem?
> > >
> > > the iomap issue for commit 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround")
> > > is already fixed in 5.10?
> >
> > Quoting the changelog:
> >
> > "commit ecfdc08b8cc65d
> > ("btrfs: remove dio iomap DSYNC workaround"), which depends on other
> > changes that went into the merge window for 5.11."
> >
> > All the changes, are (at least):
> >
> > commit ecfdc08b8cc65d737eebc26a1ee1875a097fd6a0   --> 5.11-rc1
> > Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > Date:   Thu Sep 24 11:39:21 2020 -0500
> >
> >     btrfs: remove dio iomap DSYNC workaround
> >
> > commit a42fa643169d2325602572633fcaa16862990e28
> > Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > Date:   Thu Sep 24 11:39:20 2020 -0500
> >
> >     btrfs: call iomap_dio_complete() without inode_lock
> >
> > commit 502756b380938022c848761837f8fa3976906aa1
> > Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > Date:   Thu Sep 24 11:39:19 2020 -0500
> >
> >     btrfs: remove btrfs_inode::dio_sem
> >
> > commit e9adabb9712ef9424cbbeeaa027d962ab5262e19
> > Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > Date:   Thu Sep 24 11:39:18 2020 -0500
> >
> >     btrfs: use shared lock for direct writes within EOF
> >
> > commit c352370633400d13765cc88080c969799ea51108
> > Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > Date:   Thu Sep 24 11:39:17 2020 -0500
> >
> >     btrfs: push inode locking and unlocking into buffered/direct write
> >
> > commit a14b78ad06aba0fa7e76d2bc13c5ba581a7f331a
> > Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > Date:   Thu Sep 24 11:39:16 2020 -0500
> >
> >     btrfs: introduce btrfs_inode_lock()/unlock()
> >
> > commit b8d8e1fd570a194904f545b135efc880d96a41a4
> > Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > Date:   Thu Sep 24 11:39:15 2020 -0500
> >
> >     btrfs: introduce btrfs_write_check()
> >
> > That's probably too much to add to stable at once, plus I'm assuming
> > all required iomap dependencies are in 5.10 already (it seems so,
> > unless I missed something).
> >
> > Usually we don't add patches to stable that didn't go through Linus'
> > tree either (there were 1 or 2 very rare exceptions in the past I
> > think), but when a backport depends on so many patches, and not all
> > from the same patchset, the risk of getting something wrong is
> > significant. That's why I opted to send this patch, which is much more
> > simple.
>
> Agreed, in this case the backport would be too big, just the diffstat
> between b8d8e1fd570^..ecfdc08b8cc6 is
>
>  5 files changed, 213 insertions(+), 240 deletions(-)
>
> This fix is minimal and suitable for stable as an exception. You did not
> CC stable@vger.kernel.org so you'll need to send it again. Please CC me
> too in case there are some questions from stable team. Thanks.

Ah yes, my usual script to send patches suppresses all cc by default,
I missed that.
Ok, sent again. Thanks.
Greg KH Feb. 16, 2021, 2:50 p.m. UTC | #5
On Tue, Feb 16, 2021 at 02:40:31PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Whenever we attempt to do a non-aligned direct IO write with O_DSYNC, we
> end up triggering an assertion and crashing. Example reproducer:
> 
>   $ cat test.sh
>   #!/bin/bash
> 
>   DEV=/dev/sdj
>   MNT=/mnt/sdj
> 
>   mkfs.btrfs -f $DEV > /dev/null
>   mount $DEV $MNT
> 
>   # Do a direct IO write with O_DSYNC into a non-aligned range...
>   xfs_io -f -d -s -c "pwrite -S 0xab -b 64K 1111 64K" $MNT/foobar
> 
>   umount $MNT
> 
> When running the reproducer an assertion fails and produces the following
> trace:
> 
>   [ 2418.403134] assertion failed: !current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA, in fs/btrfs/space-info.c:1467
>   [ 2418.403745] ------------[ cut here ]------------
>   [ 2418.404306] kernel BUG at fs/btrfs/ctree.h:3286!
>   [ 2418.404862] invalid opcode: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC PTI
>   [ 2418.405451] CPU: 1 PID: 64705 Comm: xfs_io Tainted: G      D           5.10.15-btrfs-next-87 #1
>   [ 2418.406026] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>   [ 2418.407228] RIP: 0010:assertfail.constprop.0+0x18/0x26 [btrfs]
>   [ 2418.407835] Code: e6 48 c7 (...)
>   [ 2418.409078] RSP: 0018:ffffb06080d13c98 EFLAGS: 00010246
>   [ 2418.409696] RAX: 000000000000006c RBX: ffff994c1debbf08 RCX: 0000000000000000
>   [ 2418.410302] RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
>   [ 2418.410904] RBP: ffff994c21770000 R08: 0000000000000000 R09: 0000000000000000
>   [ 2418.411504] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000010000
>   [ 2418.412111] R13: ffff994c22198400 R14: ffff994c21770000 R15: 0000000000000000
>   [ 2418.412713] FS:  00007f54fd7aff00(0000) GS:ffff994d35200000(0000) knlGS:0000000000000000
>   [ 2418.413326] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   [ 2418.413933] CR2: 000056549596d000 CR3: 000000010b928003 CR4: 0000000000370ee0
>   [ 2418.414528] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>   [ 2418.415109] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>   [ 2418.415669] Call Trace:
>   [ 2418.416254]  btrfs_reserve_data_bytes.cold+0x22/0x22 [btrfs]
>   [ 2418.416812]  btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
>   [ 2418.417380]  btrfs_buffered_write+0x1b0/0x7f0 [btrfs]
>   [ 2418.418315]  btrfs_file_write_iter+0x2a9/0x770 [btrfs]
>   [ 2418.418920]  new_sync_write+0x11f/0x1c0
>   [ 2418.419430]  vfs_write+0x2bb/0x3b0
>   [ 2418.419972]  __x64_sys_pwrite64+0x90/0xc0
>   [ 2418.420486]  do_syscall_64+0x33/0x80
>   [ 2418.420979]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>   [ 2418.421486] RIP: 0033:0x7f54fda0b986
>   [ 2418.421981] Code: 48 c7 c0 (...)
>   [ 2418.423019] RSP: 002b:00007ffc40569c38 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
>   [ 2418.423547] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f54fda0b986
>   [ 2418.424075] RDX: 0000000000010000 RSI: 000056549595e000 RDI: 0000000000000003
>   [ 2418.424596] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000400
>   [ 2418.425119] R10: 0000000000000400 R11: 0000000000000246 R12: 00000000ffffffff
>   [ 2418.425644] R13: 0000000000000400 R14: 0000000000010000 R15: 0000000000000000
>   [ 2418.426148] Modules linked in: btrfs blake2b_generic (...)
>   [ 2418.429540] ---[ end trace ef2aeb44dc0afa34 ]---
> 
> 1) At btrfs_file_write_iter() we set current->journal_info to
>    BTRFS_DIO_SYNC_STUB;
> 
> 2) We then call __btrfs_direct_write(), which calls btrfs_direct_IO();
> 
> 3) We can't do the direct IO write because it starts at a non-aligned
>    offset (1111). So at btrfs_direct_IO() we return -EINVAL (coming from
>    check_direct_IO() which does the alignment check), but we leave
>    current->journal_info set to BTRFS_DIO_SYNC_STUB - we only clear it
>    at btrfs_dio_iomap_begin(), because we assume we always get there;
> 
> 4) Then at __btrfs_direct_write() we see that the attempt to do the
>    direct IO write was not successful, 0 bytes written, so we fallback
>    to a buffered write by calling btrfs_buffered_write();
> 
> 5) There we call btrfs_check_data_free_space() which in turn calls
>    btrfs_alloc_data_chunk_ondemand() and that calls
>    btrfs_reserve_data_bytes() with flush == BTRFS_RESERVE_FLUSH_DATA;
> 
> 6) Then at btrfs_reserve_data_bytes() we have current->journal_info set to
>    BTRFS_DIO_SYNC_STUB, therefore not NULL, and flush has the value
>    BTRFS_RESERVE_FLUSH_DATA, triggering the second assertion:
> 
>   int btrfs_reserve_data_bytes(struct btrfs_fs_info *fs_info, u64 bytes,
>                                enum btrfs_reserve_flush_enum flush)
>   {
>       struct btrfs_space_info *data_sinfo = fs_info->data_sinfo;
>       int ret;
> 
>       ASSERT(flush == BTRFS_RESERVE_FLUSH_DATA ||
>              flush == BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE);
>       ASSERT(!current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA);
>   (...)
> 
> So fix that by setting the journal to NULL whenever check_direct_IO()
> returns a failure.
> 
> This bug only affects 5.10 kernels, and the regression was introduced in
> 5.10-rc1 by commit 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround").
> The bug does not exist in 5.11 kernels due to commit ecfdc08b8cc65d
> ("btrfs: remove dio iomap DSYNC workaround"), which depends on a large
> patchset that went into the merge window for 5.11. So this is a fix only
> for 5.10.x stable kernels, as there are people hitting this bug.
> 
> Fixes: 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround")
> CC: stable@vger.kernel.org # 5.10 (and only 5.10)
> CC: David Sterba <dsterba@suse.cz>
> Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1181605
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/inode.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)

As this is a one-off patch, I need the btrfs maintainers to ack this and
really justify why we can't take the larger patch or patch series here
instead, as that is almost always the correct thing to do instead.

thanks,

greg k-h
Filipe Manana Feb. 16, 2021, 2:52 p.m. UTC | #6
On 16/02/21 14:50, Greg KH wrote:
> On Tue, Feb 16, 2021 at 02:40:31PM +0000, fdmanana@kernel.org wrote:
>> From: Filipe Manana <fdmanana@suse.com>
>>
>> Whenever we attempt to do a non-aligned direct IO write with O_DSYNC, we
>> end up triggering an assertion and crashing. Example reproducer:
>>
>>   $ cat test.sh
>>   #!/bin/bash
>>
>>   DEV=/dev/sdj
>>   MNT=/mnt/sdj
>>
>>   mkfs.btrfs -f $DEV > /dev/null
>>   mount $DEV $MNT
>>
>>   # Do a direct IO write with O_DSYNC into a non-aligned range...
>>   xfs_io -f -d -s -c "pwrite -S 0xab -b 64K 1111 64K" $MNT/foobar
>>
>>   umount $MNT
>>
>> When running the reproducer an assertion fails and produces the following
>> trace:
>>
>>   [ 2418.403134] assertion failed: !current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA, in fs/btrfs/space-info.c:1467
>>   [ 2418.403745] ------------[ cut here ]------------
>>   [ 2418.404306] kernel BUG at fs/btrfs/ctree.h:3286!
>>   [ 2418.404862] invalid opcode: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC PTI
>>   [ 2418.405451] CPU: 1 PID: 64705 Comm: xfs_io Tainted: G      D           5.10.15-btrfs-next-87 #1
>>   [ 2418.406026] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>>   [ 2418.407228] RIP: 0010:assertfail.constprop.0+0x18/0x26 [btrfs]
>>   [ 2418.407835] Code: e6 48 c7 (...)
>>   [ 2418.409078] RSP: 0018:ffffb06080d13c98 EFLAGS: 00010246
>>   [ 2418.409696] RAX: 000000000000006c RBX: ffff994c1debbf08 RCX: 0000000000000000
>>   [ 2418.410302] RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
>>   [ 2418.410904] RBP: ffff994c21770000 R08: 0000000000000000 R09: 0000000000000000
>>   [ 2418.411504] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000010000
>>   [ 2418.412111] R13: ffff994c22198400 R14: ffff994c21770000 R15: 0000000000000000
>>   [ 2418.412713] FS:  00007f54fd7aff00(0000) GS:ffff994d35200000(0000) knlGS:0000000000000000
>>   [ 2418.413326] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>   [ 2418.413933] CR2: 000056549596d000 CR3: 000000010b928003 CR4: 0000000000370ee0
>>   [ 2418.414528] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>   [ 2418.415109] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>   [ 2418.415669] Call Trace:
>>   [ 2418.416254]  btrfs_reserve_data_bytes.cold+0x22/0x22 [btrfs]
>>   [ 2418.416812]  btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
>>   [ 2418.417380]  btrfs_buffered_write+0x1b0/0x7f0 [btrfs]
>>   [ 2418.418315]  btrfs_file_write_iter+0x2a9/0x770 [btrfs]
>>   [ 2418.418920]  new_sync_write+0x11f/0x1c0
>>   [ 2418.419430]  vfs_write+0x2bb/0x3b0
>>   [ 2418.419972]  __x64_sys_pwrite64+0x90/0xc0
>>   [ 2418.420486]  do_syscall_64+0x33/0x80
>>   [ 2418.420979]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>   [ 2418.421486] RIP: 0033:0x7f54fda0b986
>>   [ 2418.421981] Code: 48 c7 c0 (...)
>>   [ 2418.423019] RSP: 002b:00007ffc40569c38 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
>>   [ 2418.423547] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f54fda0b986
>>   [ 2418.424075] RDX: 0000000000010000 RSI: 000056549595e000 RDI: 0000000000000003
>>   [ 2418.424596] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000400
>>   [ 2418.425119] R10: 0000000000000400 R11: 0000000000000246 R12: 00000000ffffffff
>>   [ 2418.425644] R13: 0000000000000400 R14: 0000000000010000 R15: 0000000000000000
>>   [ 2418.426148] Modules linked in: btrfs blake2b_generic (...)
>>   [ 2418.429540] ---[ end trace ef2aeb44dc0afa34 ]---
>>
>> 1) At btrfs_file_write_iter() we set current->journal_info to
>>    BTRFS_DIO_SYNC_STUB;
>>
>> 2) We then call __btrfs_direct_write(), which calls btrfs_direct_IO();
>>
>> 3) We can't do the direct IO write because it starts at a non-aligned
>>    offset (1111). So at btrfs_direct_IO() we return -EINVAL (coming from
>>    check_direct_IO() which does the alignment check), but we leave
>>    current->journal_info set to BTRFS_DIO_SYNC_STUB - we only clear it
>>    at btrfs_dio_iomap_begin(), because we assume we always get there;
>>
>> 4) Then at __btrfs_direct_write() we see that the attempt to do the
>>    direct IO write was not successful, 0 bytes written, so we fallback
>>    to a buffered write by calling btrfs_buffered_write();
>>
>> 5) There we call btrfs_check_data_free_space() which in turn calls
>>    btrfs_alloc_data_chunk_ondemand() and that calls
>>    btrfs_reserve_data_bytes() with flush == BTRFS_RESERVE_FLUSH_DATA;
>>
>> 6) Then at btrfs_reserve_data_bytes() we have current->journal_info set to
>>    BTRFS_DIO_SYNC_STUB, therefore not NULL, and flush has the value
>>    BTRFS_RESERVE_FLUSH_DATA, triggering the second assertion:
>>
>>   int btrfs_reserve_data_bytes(struct btrfs_fs_info *fs_info, u64 bytes,
>>                                enum btrfs_reserve_flush_enum flush)
>>   {
>>       struct btrfs_space_info *data_sinfo = fs_info->data_sinfo;
>>       int ret;
>>
>>       ASSERT(flush == BTRFS_RESERVE_FLUSH_DATA ||
>>              flush == BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE);
>>       ASSERT(!current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA);
>>   (...)
>>
>> So fix that by setting the journal to NULL whenever check_direct_IO()
>> returns a failure.
>>
>> This bug only affects 5.10 kernels, and the regression was introduced in
>> 5.10-rc1 by commit 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround").
>> The bug does not exist in 5.11 kernels due to commit ecfdc08b8cc65d
>> ("btrfs: remove dio iomap DSYNC workaround"), which depends on a large
>> patchset that went into the merge window for 5.11. So this is a fix only
>> for 5.10.x stable kernels, as there are people hitting this bug.
>>
>> Fixes: 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround")
>> CC: stable@vger.kernel.org # 5.10 (and only 5.10)
>> CC: David Sterba <dsterba@suse.cz>
>> Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1181605
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>>  fs/btrfs/inode.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> As this is a one-off patch, I need the btrfs maintainers to ack this and
> really justify why we can't take the larger patch or patch series here
> instead, as that is almost always the correct thing to do instead.

David will likely reply soon, but from another thread:

https://lore.kernel.org/linux-btrfs/20210216142324.GO1993@twin.jikos.cz/

Thanks.
> 
> thanks,
> 
> greg k-h
>
David Sterba Feb. 16, 2021, 3:15 p.m. UTC | #7
On Tue, Feb 16, 2021 at 03:50:36PM +0100, Greg KH wrote:
> On Tue, Feb 16, 2021 at 02:40:31PM +0000, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> > 
> > Whenever we attempt to do a non-aligned direct IO write with O_DSYNC, we
> > end up triggering an assertion and crashing. Example reproducer:
> > 
> >   $ cat test.sh
> >   #!/bin/bash
> > 
> >   DEV=/dev/sdj
> >   MNT=/mnt/sdj
> > 
> >   mkfs.btrfs -f $DEV > /dev/null
> >   mount $DEV $MNT
> > 
> >   # Do a direct IO write with O_DSYNC into a non-aligned range...
> >   xfs_io -f -d -s -c "pwrite -S 0xab -b 64K 1111 64K" $MNT/foobar
> > 
> >   umount $MNT
> > 
> > When running the reproducer an assertion fails and produces the following
> > trace:
> > 
> >   [ 2418.403134] assertion failed: !current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA, in fs/btrfs/space-info.c:1467
> >   [ 2418.403745] ------------[ cut here ]------------
> >   [ 2418.404306] kernel BUG at fs/btrfs/ctree.h:3286!
> >   [ 2418.404862] invalid opcode: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC PTI
> >   [ 2418.405451] CPU: 1 PID: 64705 Comm: xfs_io Tainted: G      D           5.10.15-btrfs-next-87 #1
> >   [ 2418.406026] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> >   [ 2418.407228] RIP: 0010:assertfail.constprop.0+0x18/0x26 [btrfs]
> >   [ 2418.407835] Code: e6 48 c7 (...)
> >   [ 2418.409078] RSP: 0018:ffffb06080d13c98 EFLAGS: 00010246
> >   [ 2418.409696] RAX: 000000000000006c RBX: ffff994c1debbf08 RCX: 0000000000000000
> >   [ 2418.410302] RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
> >   [ 2418.410904] RBP: ffff994c21770000 R08: 0000000000000000 R09: 0000000000000000
> >   [ 2418.411504] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000010000
> >   [ 2418.412111] R13: ffff994c22198400 R14: ffff994c21770000 R15: 0000000000000000
> >   [ 2418.412713] FS:  00007f54fd7aff00(0000) GS:ffff994d35200000(0000) knlGS:0000000000000000
> >   [ 2418.413326] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >   [ 2418.413933] CR2: 000056549596d000 CR3: 000000010b928003 CR4: 0000000000370ee0
> >   [ 2418.414528] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >   [ 2418.415109] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >   [ 2418.415669] Call Trace:
> >   [ 2418.416254]  btrfs_reserve_data_bytes.cold+0x22/0x22 [btrfs]
> >   [ 2418.416812]  btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
> >   [ 2418.417380]  btrfs_buffered_write+0x1b0/0x7f0 [btrfs]
> >   [ 2418.418315]  btrfs_file_write_iter+0x2a9/0x770 [btrfs]
> >   [ 2418.418920]  new_sync_write+0x11f/0x1c0
> >   [ 2418.419430]  vfs_write+0x2bb/0x3b0
> >   [ 2418.419972]  __x64_sys_pwrite64+0x90/0xc0
> >   [ 2418.420486]  do_syscall_64+0x33/0x80
> >   [ 2418.420979]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >   [ 2418.421486] RIP: 0033:0x7f54fda0b986
> >   [ 2418.421981] Code: 48 c7 c0 (...)
> >   [ 2418.423019] RSP: 002b:00007ffc40569c38 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
> >   [ 2418.423547] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f54fda0b986
> >   [ 2418.424075] RDX: 0000000000010000 RSI: 000056549595e000 RDI: 0000000000000003
> >   [ 2418.424596] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000400
> >   [ 2418.425119] R10: 0000000000000400 R11: 0000000000000246 R12: 00000000ffffffff
> >   [ 2418.425644] R13: 0000000000000400 R14: 0000000000010000 R15: 0000000000000000
> >   [ 2418.426148] Modules linked in: btrfs blake2b_generic (...)
> >   [ 2418.429540] ---[ end trace ef2aeb44dc0afa34 ]---
> > 
> > 1) At btrfs_file_write_iter() we set current->journal_info to
> >    BTRFS_DIO_SYNC_STUB;
> > 
> > 2) We then call __btrfs_direct_write(), which calls btrfs_direct_IO();
> > 
> > 3) We can't do the direct IO write because it starts at a non-aligned
> >    offset (1111). So at btrfs_direct_IO() we return -EINVAL (coming from
> >    check_direct_IO() which does the alignment check), but we leave
> >    current->journal_info set to BTRFS_DIO_SYNC_STUB - we only clear it
> >    at btrfs_dio_iomap_begin(), because we assume we always get there;
> > 
> > 4) Then at __btrfs_direct_write() we see that the attempt to do the
> >    direct IO write was not successful, 0 bytes written, so we fallback
> >    to a buffered write by calling btrfs_buffered_write();
> > 
> > 5) There we call btrfs_check_data_free_space() which in turn calls
> >    btrfs_alloc_data_chunk_ondemand() and that calls
> >    btrfs_reserve_data_bytes() with flush == BTRFS_RESERVE_FLUSH_DATA;
> > 
> > 6) Then at btrfs_reserve_data_bytes() we have current->journal_info set to
> >    BTRFS_DIO_SYNC_STUB, therefore not NULL, and flush has the value
> >    BTRFS_RESERVE_FLUSH_DATA, triggering the second assertion:
> > 
> >   int btrfs_reserve_data_bytes(struct btrfs_fs_info *fs_info, u64 bytes,
> >                                enum btrfs_reserve_flush_enum flush)
> >   {
> >       struct btrfs_space_info *data_sinfo = fs_info->data_sinfo;
> >       int ret;
> > 
> >       ASSERT(flush == BTRFS_RESERVE_FLUSH_DATA ||
> >              flush == BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE);
> >       ASSERT(!current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA);
> >   (...)
> > 
> > So fix that by setting the journal to NULL whenever check_direct_IO()
> > returns a failure.
> > 
> > This bug only affects 5.10 kernels, and the regression was introduced in
> > 5.10-rc1 by commit 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround").
> > The bug does not exist in 5.11 kernels due to commit ecfdc08b8cc65d
> > ("btrfs: remove dio iomap DSYNC workaround"), which depends on a large
> > patchset that went into the merge window for 5.11. So this is a fix only
> > for 5.10.x stable kernels, as there are people hitting this bug.
> > 
> > Fixes: 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround")
> > CC: stable@vger.kernel.org # 5.10 (and only 5.10)
> > CC: David Sterba <dsterba@suse.cz>
> > Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1181605
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/inode.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> As this is a one-off patch, I need the btrfs maintainers to ack this and
> really justify why we can't take the larger patch or patch series here
> instead, as that is almost always the correct thing to do instead.

Acked-by: David Sterba <dsterba@suse.com>

The full backport would be patches

ecfdc08b8cc6 btrfs: remove dio iomap DSYNC workaround
a42fa643169d btrfs: call iomap_dio_complete() without inode_lock
502756b38093 btrfs: remove btrfs_inode::dio_sem
e9adabb9712e btrfs: use shared lock for direct writes within EOF
c35237063340 btrfs: push inode locking and unlocking into buffered/direct write
a14b78ad06ab btrfs: introduce btrfs_inode_lock()/unlock()
b8d8e1fd570a btrfs: introduce btrfs_write_check()

and maybe more.

$ git diff b8d8e1fd570a^..ecfdc08b8cc6 | diffstat
 btrfs_inode.h |   10 -
 ctree.h       |    8 +
 file.c        |  338 +++++++++++++++++++++++++++-------------------------------
 inode.c       |   96 +++++++---------
 transaction.h |    1 
 5 files changed, 213 insertions(+), 240 deletions(-)

That seems too much for a backport, the fix Filipe implemented is
simpler and IMO qualifies as the exceptional stable-only patch.
Greg KH Feb. 16, 2021, 3:34 p.m. UTC | #8
On Tue, Feb 16, 2021 at 04:15:46PM +0100, David Sterba wrote:
> On Tue, Feb 16, 2021 at 03:50:36PM +0100, Greg KH wrote:
> > On Tue, Feb 16, 2021 at 02:40:31PM +0000, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > > 
> > > Whenever we attempt to do a non-aligned direct IO write with O_DSYNC, we
> > > end up triggering an assertion and crashing. Example reproducer:
> > > 
> > >   $ cat test.sh
> > >   #!/bin/bash
> > > 
> > >   DEV=/dev/sdj
> > >   MNT=/mnt/sdj
> > > 
> > >   mkfs.btrfs -f $DEV > /dev/null
> > >   mount $DEV $MNT
> > > 
> > >   # Do a direct IO write with O_DSYNC into a non-aligned range...
> > >   xfs_io -f -d -s -c "pwrite -S 0xab -b 64K 1111 64K" $MNT/foobar
> > > 
> > >   umount $MNT
> > > 
> > > When running the reproducer an assertion fails and produces the following
> > > trace:
> > > 
> > >   [ 2418.403134] assertion failed: !current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA, in fs/btrfs/space-info.c:1467
> > >   [ 2418.403745] ------------[ cut here ]------------
> > >   [ 2418.404306] kernel BUG at fs/btrfs/ctree.h:3286!
> > >   [ 2418.404862] invalid opcode: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC PTI
> > >   [ 2418.405451] CPU: 1 PID: 64705 Comm: xfs_io Tainted: G      D           5.10.15-btrfs-next-87 #1
> > >   [ 2418.406026] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> > >   [ 2418.407228] RIP: 0010:assertfail.constprop.0+0x18/0x26 [btrfs]
> > >   [ 2418.407835] Code: e6 48 c7 (...)
> > >   [ 2418.409078] RSP: 0018:ffffb06080d13c98 EFLAGS: 00010246
> > >   [ 2418.409696] RAX: 000000000000006c RBX: ffff994c1debbf08 RCX: 0000000000000000
> > >   [ 2418.410302] RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
> > >   [ 2418.410904] RBP: ffff994c21770000 R08: 0000000000000000 R09: 0000000000000000
> > >   [ 2418.411504] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000010000
> > >   [ 2418.412111] R13: ffff994c22198400 R14: ffff994c21770000 R15: 0000000000000000
> > >   [ 2418.412713] FS:  00007f54fd7aff00(0000) GS:ffff994d35200000(0000) knlGS:0000000000000000
> > >   [ 2418.413326] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >   [ 2418.413933] CR2: 000056549596d000 CR3: 000000010b928003 CR4: 0000000000370ee0
> > >   [ 2418.414528] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > >   [ 2418.415109] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > >   [ 2418.415669] Call Trace:
> > >   [ 2418.416254]  btrfs_reserve_data_bytes.cold+0x22/0x22 [btrfs]
> > >   [ 2418.416812]  btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
> > >   [ 2418.417380]  btrfs_buffered_write+0x1b0/0x7f0 [btrfs]
> > >   [ 2418.418315]  btrfs_file_write_iter+0x2a9/0x770 [btrfs]
> > >   [ 2418.418920]  new_sync_write+0x11f/0x1c0
> > >   [ 2418.419430]  vfs_write+0x2bb/0x3b0
> > >   [ 2418.419972]  __x64_sys_pwrite64+0x90/0xc0
> > >   [ 2418.420486]  do_syscall_64+0x33/0x80
> > >   [ 2418.420979]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > >   [ 2418.421486] RIP: 0033:0x7f54fda0b986
> > >   [ 2418.421981] Code: 48 c7 c0 (...)
> > >   [ 2418.423019] RSP: 002b:00007ffc40569c38 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
> > >   [ 2418.423547] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f54fda0b986
> > >   [ 2418.424075] RDX: 0000000000010000 RSI: 000056549595e000 RDI: 0000000000000003
> > >   [ 2418.424596] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000400
> > >   [ 2418.425119] R10: 0000000000000400 R11: 0000000000000246 R12: 00000000ffffffff
> > >   [ 2418.425644] R13: 0000000000000400 R14: 0000000000010000 R15: 0000000000000000
> > >   [ 2418.426148] Modules linked in: btrfs blake2b_generic (...)
> > >   [ 2418.429540] ---[ end trace ef2aeb44dc0afa34 ]---
> > > 
> > > 1) At btrfs_file_write_iter() we set current->journal_info to
> > >    BTRFS_DIO_SYNC_STUB;
> > > 
> > > 2) We then call __btrfs_direct_write(), which calls btrfs_direct_IO();
> > > 
> > > 3) We can't do the direct IO write because it starts at a non-aligned
> > >    offset (1111). So at btrfs_direct_IO() we return -EINVAL (coming from
> > >    check_direct_IO() which does the alignment check), but we leave
> > >    current->journal_info set to BTRFS_DIO_SYNC_STUB - we only clear it
> > >    at btrfs_dio_iomap_begin(), because we assume we always get there;
> > > 
> > > 4) Then at __btrfs_direct_write() we see that the attempt to do the
> > >    direct IO write was not successful, 0 bytes written, so we fallback
> > >    to a buffered write by calling btrfs_buffered_write();
> > > 
> > > 5) There we call btrfs_check_data_free_space() which in turn calls
> > >    btrfs_alloc_data_chunk_ondemand() and that calls
> > >    btrfs_reserve_data_bytes() with flush == BTRFS_RESERVE_FLUSH_DATA;
> > > 
> > > 6) Then at btrfs_reserve_data_bytes() we have current->journal_info set to
> > >    BTRFS_DIO_SYNC_STUB, therefore not NULL, and flush has the value
> > >    BTRFS_RESERVE_FLUSH_DATA, triggering the second assertion:
> > > 
> > >   int btrfs_reserve_data_bytes(struct btrfs_fs_info *fs_info, u64 bytes,
> > >                                enum btrfs_reserve_flush_enum flush)
> > >   {
> > >       struct btrfs_space_info *data_sinfo = fs_info->data_sinfo;
> > >       int ret;
> > > 
> > >       ASSERT(flush == BTRFS_RESERVE_FLUSH_DATA ||
> > >              flush == BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE);
> > >       ASSERT(!current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA);
> > >   (...)
> > > 
> > > So fix that by setting the journal to NULL whenever check_direct_IO()
> > > returns a failure.
> > > 
> > > This bug only affects 5.10 kernels, and the regression was introduced in
> > > 5.10-rc1 by commit 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround").
> > > The bug does not exist in 5.11 kernels due to commit ecfdc08b8cc65d
> > > ("btrfs: remove dio iomap DSYNC workaround"), which depends on a large
> > > patchset that went into the merge window for 5.11. So this is a fix only
> > > for 5.10.x stable kernels, as there are people hitting this bug.
> > > 
> > > Fixes: 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround")
> > > CC: stable@vger.kernel.org # 5.10 (and only 5.10)
> > > CC: David Sterba <dsterba@suse.cz>
> > > Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1181605
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > ---
> > >  fs/btrfs/inode.c | 6 +++++-
> > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > As this is a one-off patch, I need the btrfs maintainers to ack this and
> > really justify why we can't take the larger patch or patch series here
> > instead, as that is almost always the correct thing to do instead.
> 
> Acked-by: David Sterba <dsterba@suse.com>
> 
> The full backport would be patches
> 
> ecfdc08b8cc6 btrfs: remove dio iomap DSYNC workaround
> a42fa643169d btrfs: call iomap_dio_complete() without inode_lock
> 502756b38093 btrfs: remove btrfs_inode::dio_sem
> e9adabb9712e btrfs: use shared lock for direct writes within EOF
> c35237063340 btrfs: push inode locking and unlocking into buffered/direct write
> a14b78ad06ab btrfs: introduce btrfs_inode_lock()/unlock()
> b8d8e1fd570a btrfs: introduce btrfs_write_check()
> 
> and maybe more.
> 
> $ git diff b8d8e1fd570a^..ecfdc08b8cc6 | diffstat
>  btrfs_inode.h |   10 -
>  ctree.h       |    8 +
>  file.c        |  338 +++++++++++++++++++++++++++-------------------------------
>  inode.c       |   96 +++++++---------
>  transaction.h |    1 
>  5 files changed, 213 insertions(+), 240 deletions(-)
> 
> That seems too much for a backport, the fix Filipe implemented is
> simpler and IMO qualifies as the exceptional stable-only patch.

Why is that too much?  For 7 patches that's a small overall diffstat.
And you match identically what is upstream in Linus's tree.  That means
over time, backporting fixing is much easier, and understanding the code
for everyone is simpler.

It's almost always better to track what is in Linus's tree than to do
one-off patches as 95% of the time we do one-off patches they are buggy
and cause problems as no one else is running them.

So how about sending the above backported series instead please.

thanks,

greg k-h
David Sterba Feb. 16, 2021, 5:52 p.m. UTC | #9
On Tue, Feb 16, 2021 at 04:34:27PM +0100, Greg KH wrote:
> On Tue, Feb 16, 2021 at 04:15:46PM +0100, David Sterba wrote:
> > On Tue, Feb 16, 2021 at 03:50:36PM +0100, Greg KH wrote:
> > > On Tue, Feb 16, 2021 at 02:40:31PM +0000, fdmanana@kernel.org wrote:
> > > As this is a one-off patch, I need the btrfs maintainers to ack this and
> > > really justify why we can't take the larger patch or patch series here
> > > instead, as that is almost always the correct thing to do instead.
> > 
> > Acked-by: David Sterba <dsterba@suse.com>
> > 
> > The full backport would be patches
> > 
> > ecfdc08b8cc6 btrfs: remove dio iomap DSYNC workaround
> > a42fa643169d btrfs: call iomap_dio_complete() without inode_lock
> > 502756b38093 btrfs: remove btrfs_inode::dio_sem
> > e9adabb9712e btrfs: use shared lock for direct writes within EOF
> > c35237063340 btrfs: push inode locking and unlocking into buffered/direct write
> > a14b78ad06ab btrfs: introduce btrfs_inode_lock()/unlock()
> > b8d8e1fd570a btrfs: introduce btrfs_write_check()
> > 
> > and maybe more.
> > 
> > $ git diff b8d8e1fd570a^..ecfdc08b8cc6 | diffstat
> >  btrfs_inode.h |   10 -
> >  ctree.h       |    8 +
> >  file.c        |  338 +++++++++++++++++++++++++++-------------------------------
> >  inode.c       |   96 +++++++---------
> >  transaction.h |    1 
> >  5 files changed, 213 insertions(+), 240 deletions(-)
> > 
> > That seems too much for a backport, the fix Filipe implemented is
> > simpler and IMO qualifies as the exceptional stable-only patch.
> 
> Why is that too much?  For 7 patches that's a small overall diffstat.
> And you match identically what is upstream in Linus's tree.  That means
> over time, backporting fixing is much easier, and understanding the code
> for everyone is simpler.

The changes are not trivial and touch eg. inode locking and other
subsystems (iomap), so they're not self contained inside btrfs. And the
list of possibly related patches is not entirely known at this moment,
the above is an example that was obvious, but Filipe has expressed
doubts that it's complete and I agree.

Backporting them to 5.10.x would need same amount of testing and
validation that the 5.11 version got during the whole development cycle.

> It's almost always better to track what is in Linus's tree than to do
> one-off patches as 95% of the time we do one-off patches they are buggy
> and cause problems as no one else is running them.

While I understand that concern in general, in this case it's trading
changes by lots of code with a targeted fix with a reproducer, basically
fixing the buggy error handling path.

> So how about sending the above backported series instead please.

Considering the risk I don't want to do that.
Greg KH Feb. 22, 2021, 11:07 a.m. UTC | #10
On Tue, Feb 16, 2021 at 06:52:21PM +0100, David Sterba wrote:
> On Tue, Feb 16, 2021 at 04:34:27PM +0100, Greg KH wrote:
> > On Tue, Feb 16, 2021 at 04:15:46PM +0100, David Sterba wrote:
> > > On Tue, Feb 16, 2021 at 03:50:36PM +0100, Greg KH wrote:
> > > > On Tue, Feb 16, 2021 at 02:40:31PM +0000, fdmanana@kernel.org wrote:
> > > > As this is a one-off patch, I need the btrfs maintainers to ack this and
> > > > really justify why we can't take the larger patch or patch series here
> > > > instead, as that is almost always the correct thing to do instead.
> > > 
> > > Acked-by: David Sterba <dsterba@suse.com>
> > > 
> > > The full backport would be patches
> > > 
> > > ecfdc08b8cc6 btrfs: remove dio iomap DSYNC workaround
> > > a42fa643169d btrfs: call iomap_dio_complete() without inode_lock
> > > 502756b38093 btrfs: remove btrfs_inode::dio_sem
> > > e9adabb9712e btrfs: use shared lock for direct writes within EOF
> > > c35237063340 btrfs: push inode locking and unlocking into buffered/direct write
> > > a14b78ad06ab btrfs: introduce btrfs_inode_lock()/unlock()
> > > b8d8e1fd570a btrfs: introduce btrfs_write_check()
> > > 
> > > and maybe more.
> > > 
> > > $ git diff b8d8e1fd570a^..ecfdc08b8cc6 | diffstat
> > >  btrfs_inode.h |   10 -
> > >  ctree.h       |    8 +
> > >  file.c        |  338 +++++++++++++++++++++++++++-------------------------------
> > >  inode.c       |   96 +++++++---------
> > >  transaction.h |    1 
> > >  5 files changed, 213 insertions(+), 240 deletions(-)
> > > 
> > > That seems too much for a backport, the fix Filipe implemented is
> > > simpler and IMO qualifies as the exceptional stable-only patch.
> > 
> > Why is that too much?  For 7 patches that's a small overall diffstat.
> > And you match identically what is upstream in Linus's tree.  That means
> > over time, backporting fixing is much easier, and understanding the code
> > for everyone is simpler.
> 
> The changes are not trivial and touch eg. inode locking and other
> subsystems (iomap), so they're not self contained inside btrfs. And the
> list of possibly related patches is not entirely known at this moment,
> the above is an example that was obvious, but Filipe has expressed
> doubts that it's complete and I agree.
> 
> Backporting them to 5.10.x would need same amount of testing and
> validation that the 5.11 version got during the whole development cycle.
> 
> > It's almost always better to track what is in Linus's tree than to do
> > one-off patches as 95% of the time we do one-off patches they are buggy
> > and cause problems as no one else is running them.
> 
> While I understand that concern in general, in this case it's trading
> changes by lots of code with a targeted fix with a reproducer, basically
> fixing the buggy error handling path.
> 
> > So how about sending the above backported series instead please.
> 
> Considering the risk I don't want to do that.

Ok, thanks, now queued up.

greg k-h
diff mbox series

Patch

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index acc47e2ffb46..b536d21541a9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8026,8 +8026,12 @@  ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	bool relock = false;
 	ssize_t ret;
 
-	if (check_direct_IO(fs_info, iter, offset))
+	if (check_direct_IO(fs_info, iter, offset)) {
+		ASSERT(current->journal_info == NULL ||
+		       current->journal_info == BTRFS_DIO_SYNC_STUB);
+		current->journal_info = NULL;
 		return 0;
+	}
 
 	count = iov_iter_count(iter);
 	if (iov_iter_rw(iter) == WRITE) {