[RFC] btrfs: don't call btrfs_sync_file from iomap context

Message ID 20200901130644.12655-1-johannes.thumshirn@wdc.com (mailing list archive)
State New, archived

Commit Message

Johannes Thumshirn Sept. 1, 2020, 1:06 p.m. UTC
Fstests generic/113 exposes a deadlock introduced by the switch to iomap
for direct I/O.

[ 18.291293]
[ 18.291532] ============================================
[ 18.292115] WARNING: possible recursive locking detected
[ 18.292723] 5.9.0-rc2+ #746 Not tainted
[ 18.293145] --------------------------------------------
[ 18.293718] aio-stress/922 is trying to acquire lock:
[ 18.294274] ffff888217412010 (&sb->s_type->i_mutex_key#11){++++}-{3:3}, at: btrfs_sync_file+0xf7/0x560 [btrfs]
[ 18.295450]
[ 18.295450] but task is already holding lock:
[ 18.296086] ffff888217412010 (&sb->s_type->i_mutex_key#11){++++}-{3:3}, at: btrfs_file_write_iter+0x6e/0x630 [btrfs]
[ 18.297249]
[ 18.297249] other info that might help us debug this:
[ 18.297960] Possible unsafe locking scenario:
[ 18.297960]
[ 18.298605] CPU0
[ 18.298880] ----
[ 18.299151] lock(&sb->s_type->i_mutex_key#11);
[ 18.299653] lock(&sb->s_type->i_mutex_key#11);
[ 18.300156]
[ 18.300156] *** DEADLOCK ***
[ 18.300156]
[ 18.300802] May be due to missing lock nesting notation
[ 18.300802]
[ 18.301542] 2 locks held by aio-stress/922:
[ 18.302000] #0: ffff888217412010 (&sb->s_type->i_mutex_key#11){++++}-{3:3}, at: btrfs_file_write_iter+0x6e/0x630 [btrfs]
[ 18.303194] #1: ffff888217411ea0 (&ei->dio_sem){++++}-{3:3}, at: btrfs_direct_IO+0x113/0x160 [btrfs]
[ 18.304223]
[ 18.304223] stack backtrace:
[ 18.304695] CPU: 0 PID: 922 Comm: aio-stress Not tainted 5.9.0-rc2+ #746
[ 18.305383] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
[ 18.306532] Call Trace:
[ 18.306796] dump_stack+0x78/0xa0
[ 18.307145] __lock_acquire.cold+0x121/0x29f
[ 18.307613] ? btrfs_dio_iomap_end+0x65/0x130 [btrfs]
[ 18.308140] lock_acquire+0x93/0x3b0
[ 18.308544] ? btrfs_sync_file+0xf7/0x560 [btrfs]
[ 18.309036] down_write+0x33/0x70
[ 18.309402] ? btrfs_sync_file+0xf7/0x560 [btrfs]
[ 18.309912] btrfs_sync_file+0xf7/0x560 [btrfs]
[ 18.310384] iomap_dio_complete+0x10d/0x120
[ 18.310824] iomap_dio_rw+0x3c8/0x520
[ 18.311225] btrfs_direct_IO+0xd3/0x160 [btrfs]
[ 18.311727] btrfs_file_write_iter+0x1fe/0x630 [btrfs]
[ 18.312264] ? find_held_lock+0x2b/0x80
[ 18.312662] aio_write+0xcd/0x180
[ 18.313011] ? __might_fault+0x31/0x80
[ 18.313408] ? find_held_lock+0x2b/0x80
[ 18.313817] ? __might_fault+0x31/0x80
[ 18.314217] io_submit_one+0x4e1/0xb30
[ 18.314606] ? find_held_lock+0x2b/0x80
[ 18.315010] __x64_sys_io_submit+0x71/0x220
[ 18.315449] do_syscall_64+0x33/0x40
[ 18.315829] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 18.316363] RIP: 0033:0x7f5940881f79
[ 18.316740] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e7 4e 0c 00 f7 d8 64 89 01 48
[ 18.318651] RSP: 002b:00007f5934f51d88 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
[ 18.319428] RAX: ffffffffffffffda RBX: 00007f5934f52680 RCX: 00007f5940881f79
[ 18.320168] RDX: 0000000000b56030 RSI: 0000000000000008 RDI: 00007f593171f000
[ 18.320895] RBP: 00007f593171f000 R08: 0000000000000000 R09: 0000000000b56030
[ 18.321630] R10: 00007fffd599e080 R11: 0000000000000246 R12: 0000000000000008
[ 18.322369] R13: 0000000000000000 R14: 0000000000b56030 R15: 0000000000b56070

This happens because iomap_dio_complete() calls into generic_write_sync()
if we have the data-sync flag set. But as we're still under the
inode_lock() from btrfs_file_write_iter() we will deadlock once
btrfs_sync_file() tries to acquire the inode_lock().

Calling into generic_write_sync() is not needed as __btrfs_direct_write()
already takes care of persisting the data on disk. We can temporarily drop
the IOCB_DSYNC flag before calling into __btrfs_direct_write() so the
iomap code won't try to call into the sync routines as well.

References: https://github.com/btrfs/fstests/issues/12
Fixes: da4d7c1b4c45 ("btrfs: switch to iomap for direct IO")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/file.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
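
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b62679382799..c75c0f2a5f72 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2023,6 +2023,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 		atomic_inc(&BTRFS_I(inode)->sync_writers);
 
 	if (iocb->ki_flags & IOCB_DIRECT) {
+		iocb->ki_flags &= ~IOCB_DSYNC;
 		num_written = __btrfs_direct_write(iocb, from);
 	} else {
 		num_written = btrfs_buffered_write(iocb, from);
@@ -2046,8 +2047,10 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	if (num_written > 0)
 		num_written = generic_write_sync(iocb, num_written);
 
-	if (sync)
+	if (sync) {
+		iocb->ki_flags |= IOCB_DSYNC;
 		atomic_dec(&BTRFS_I(inode)->sync_writers);
+	}
 out:
 	current->backing_dev_info = NULL;
 	return num_written ? num_written : err;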

Comments

Johannes Thumshirn Sept. 1, 2020, 1:11 p.m. UTC | #1
Oops forgot to Cc Goldwyn

On 01/09/2020 15:07, Johannes Thumshirn wrote:
> Fstests generic/113 exposes a deadlock introduced by the switch to iomap
> for direct I/O.
> 
> [...]
Goldwyn Rodrigues Sept. 1, 2020, 2:17 p.m. UTC | #2
Hi Johannes,

Thanks for the fix.

On 22:06 01/09, Johannes Thumshirn wrote:
> Fstests generic/113 exposes a deadlock introduced by the switch to iomap
> for direct I/O.
> 
> [...]
> 
> This happens because iomap_dio_complete() calls into generic_write_sync()
> if we have the data-sync flag set. But as we're still under the
> inode_lock() from btrfs_file_write_iter() we will deadlock once
> btrfs_sync_file() tries to acquire the inode_lock().
> 
> Calling into generic_write_sync() is not needed as __btrfs_direct_write()
> already takes care of persisting the data on disk. We can temporarily drop
> the IOCB_DSYNC flag before calling into __btrfs_direct_write() so the
> iomap code won't try to call into the sync routines as well.
> 
> References: https://github.com/btrfs/fstests/issues/12
> Fixes: da4d7c1b4c45 ("btrfs: switch to iomap for direct IO")
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/file.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index b62679382799..c75c0f2a5f72 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2023,6 +2023,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>  		atomic_inc(&BTRFS_I(inode)->sync_writers);
>  
>  	if (iocb->ki_flags & IOCB_DIRECT) {
> +		iocb->ki_flags &= ~IOCB_DSYNC;
>  		num_written = __btrfs_direct_write(iocb, from);
>  	} else {
>  		num_written = btrfs_buffered_write(iocb, from);
> @@ -2046,8 +2047,10 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>  	if (num_written > 0)
>  		num_written = generic_write_sync(iocb, num_written);
>  
> -	if (sync)
> +	if (sync) {
> +		iocb->ki_flags |= IOCB_DSYNC;
>  		atomic_dec(&BTRFS_I(inode)->sync_writers);
> +	}
>  out:
>  	current->backing_dev_info = NULL;
>  	return num_written ? num_written : err;

The flag must be restored before calling generic_write_sync(), not after,
so that generic_write_sync() sees the correct flags. This is required for
the AIO cases.
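
For illustration, a sketch of that ordering (untested; a hypothetical
restructuring of the posted hunk, not a new patch): save the IOCB_DSYNC
bit, hide it across __btrfs_direct_write(), and restore it as soon as the
call returns so the later generic_write_sync() sees the original flags:

	if (iocb->ki_flags & IOCB_DIRECT) {
		const bool dsync = iocb->ki_flags & IOCB_DSYNC;

		/*
		 * Hide IOCB_DSYNC so iomap_dio_complete() won't call
		 * generic_write_sync() while we hold the inode_lock().
		 */
		iocb->ki_flags &= ~IOCB_DSYNC;
		num_written = __btrfs_direct_write(iocb, from);
		/* Restore before the generic_write_sync() call below. */
		if (dsync)
			iocb->ki_flags |= IOCB_DSYNC;
	} else {
		num_written = btrfs_buffered_write(iocb, from);
	}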
Johannes Thumshirn Sept. 1, 2020, 2:20 p.m. UTC | #3
On 01/09/2020 16:17, Goldwyn Rodrigues wrote:
> The flag must be restored before calling generic_write_sync(), not after,
> so that generic_write_sync() sees the correct flags. This is required for
> the AIO cases.

Thanks for spotting, will update and retest.

The question I have, though, is whether this is a correct fix or whether
I'm papering over something (hence the RFC tag). Maybe we can shorten the
time we're under the inode_lock() a bit so we can avoid this deadlock.
David, Josef, Filipe, any comments?
Filipe Manana Sept. 1, 2020, 2:37 p.m. UTC | #4
On Tue, Sep 1, 2020 at 2:06 PM Johannes Thumshirn
<johannes.thumshirn@wdc.com> wrote:
>
> Fstests generic/113 exposes a deadlock introduced by the switch to iomap
> for direct I/O.
>
> [...]
>
> This happens because iomap_dio_complete() calls into generic_write_sync()
> if we have the data-sync flag set. But as we're still under the
> inode_lock() from btrfs_file_write_iter() we will deadlock once
> btrfs_sync_file() tries to acquire the inode_lock().
>
> Calling into generic_write_sync() is not needed as __btrfs_direct_write()
> already takes care of persisting the data on disk. We can temporarily drop
> the IOCB_DSYNC flag before calling into __btrfs_direct_write() so the
> iomap code won't try to call into the sync routines as well.
>
> References: https://github.com/btrfs/fstests/issues/12
> Fixes: da4d7c1b4c45 ("btrfs: switch to iomap for direct IO")
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/file.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index b62679382799..c75c0f2a5f72 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2023,6 +2023,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>                 atomic_inc(&BTRFS_I(inode)->sync_writers);
>
>         if (iocb->ki_flags & IOCB_DIRECT) {
> +               iocb->ki_flags &= ~IOCB_DSYNC;
>                 num_written = __btrfs_direct_write(iocb, from);
>         } else {
>                 num_written = btrfs_buffered_write(iocb, from);
> @@ -2046,8 +2047,10 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>         if (num_written > 0)
>                 num_written = generic_write_sync(iocb, num_written);
>
> -       if (sync)
> +       if (sync) {
> +               iocb->ki_flags |= IOCB_DSYNC;

This should be set before calling generic_write_sync() above,
otherwise we don't do the persistence required by dsync.

I have to be honest, I don't like this approach that much, it's a bit
fragile - what if some other code, invoked in between clearing and
setting back the flag, needs that flag to operate correctly?

Thanks.

>                 atomic_dec(&BTRFS_I(inode)->sync_writers);
> +       }
>  out:
>         current->backing_dev_info = NULL;
>         return num_written ? num_written : err;
> --
> 2.26.2
>
Johannes Thumshirn Sept. 1, 2020, 2:44 p.m. UTC | #5
On 01/09/2020 16:37, Filipe Manana wrote:
>> +               iocb->ki_flags |= IOCB_DSYNC;
> 
> This should be set before calling generic_write_sync() above,
> otherwise we don't do the persistence required by dsync.
> 
> I have to be honest, I don't like this approach that much, it's a bit
> fragile - what if some other code, invoked in between clearing and
> setting back the flag, needs that flag to operate correctly?

Yes, I don't like it either.

I've compared btrfs to ext4 and xfs, and both don't hold the inode
lock for (nearly) the whole xxx_file_write_iter() time, so I think
the correct fix would be to shorten that window. But I haven't found
any documentation yet on which functions need to be under the
inode_lock and which do not.

Any input is appreciated here.
Josef Bacik Sept. 1, 2020, 3:11 p.m. UTC | #6
On 9/1/20 9:06 AM, Johannes Thumshirn wrote:
> Fstests generic/113 exposes a deadlock introduced by the switch to iomap
> for direct I/O.
> 
> [...]
> 
> This happens because iomap_dio_complete() calls into generic_write_sync()
> if we have the data-sync flag set. But as we're still under the
> inode_lock() from btrfs_file_write_iter() we will deadlock once
> btrfs_sync_file() tries to acquire the inode_lock().
> 
> Calling into generic_write_sync() is not needed as __btrfs_direct_write()
> already takes care of persisting the data on disk. We can temporarily drop
> the IOCB_DSYNC flag before calling into __btrfs_direct_write() so the
> iomap code won't try to call into the sync routines as well.
> 
> [...]

Christoph, I feel like this is broken.  Xfs and ext4 get away with this for 
different reasons, ext4 doesn't take the inode_lock() at all in fsync, and xfs 
takes the ILOCK instead of the IOLOCK, so it's fine.  However btrfs uses 
inode_lock() in ->fsync (not for the IO, just for the logging part).  A long 
time ago I specifically pushed the inode locking down into ->fsync() handlers to 
give us this sort of control.

I'm not 100% on the iomap stuff, but the fix seems like we need to move the 
generic_write_sync() out of iomap_dio_complete() completely, and the callers do 
their own thing, much like the normal generic_file_write_iter() does.  And then 
I'd like to add a WARN_ON(lockdep_is_held()) in vfs_fsync_range() so we can 
avoid this sort of thing in the future.  What do you think?  Thanks,

Josef
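
A minimal sketch of the assertion Josef is proposing (hypothetical; it
assumes i_rwsem is the lock of interest and elides the rest of the real
function):

	int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
			    int datasync)
	{
		struct inode *inode = file->f_mapping->host;

	#ifdef CONFIG_LOCKDEP
		/* Complain if a caller enters ->fsync holding the inode lock. */
		WARN_ON_ONCE(lockdep_is_held(&inode->i_rwsem));
	#endif

		if (!file->f_op->fsync)
			return -EINVAL;
		return file->f_op->fsync(file, start, end, datasync);
	}

Dave Chinner objects to exactly this further down the thread, since some
filesystems legitimately call vfs_fsync_range() under the inode_lock().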
Darrick J. Wong Sept. 1, 2020, 5:45 p.m. UTC | #7
On Tue, Sep 01, 2020 at 11:11:58AM -0400, Josef Bacik wrote:
> On 9/1/20 9:06 AM, Johannes Thumshirn wrote:
> > Fstests generic/113 exposes a deadlock introduced by the switch to iomap
> > for direct I/O.
> > 
> > [...]
> > 
> > This happens because iomap_dio_complete() calls into generic_write_sync()
> > if we have the data-sync flag set. But as we're still under the
> > inode_lock() from btrfs_file_write_iter() we will deadlock once
> > btrfs_sync_file() tries to acquire the inode_lock().
> > 
> > Calling into generic_write_sync() is not needed as __btrfs_direct_write()
> > already takes care of persisting the data on disk. We can temporarily drop
> > the IOCB_DSYNC flag before calling into __btrfs_direct_write() so the
> > iomap code won't try to call into the sync routines as well.
> > 
> > [...]
> 
> Christoph, I feel like this is broken.  Xfs and ext4 get away with this for
> different reasons, ext4 doesn't take the inode_lock() at all in fsync, and
> xfs takes the ILOCK instead of the IOLOCK, so it's fine.  However btrfs uses
> inode_lock() in ->fsync (not for the IO, just for the logging part).  A long
> time ago I specifically pushed the inode locking down into ->fsync()
> handlers to give us this sort of control.
> 
> I'm not 100% on the iomap stuff, but the fix seems like we need to move the
> generic_write_sync() out of iomap_dio_complete() completely, and the callers
> do their own thing, much like the normal generic_file_write_iter() does.
> And then I'd like to add a WARN_ON(lockdep_is_held()) in vfs_fsync_range()
> so we can avoid this sort of thing in the future.  What do you think?

Hmm, I was under the impression that the direct write completion in
either path (iomap or classic) could call generic_write_sync?  How did
this work in btrfs before the iomap conversion?

--D

> Thanks,
> 
> Josef
>
Josef Bacik Sept. 1, 2020, 5:55 p.m. UTC | #8
On 9/1/20 1:45 PM, Darrick J. Wong wrote:
> On Tue, Sep 01, 2020 at 11:11:58AM -0400, Josef Bacik wrote:
>> On 9/1/20 9:06 AM, Johannes Thumshirn wrote:
>>> Fstests generic/113 exposes a deadlock introduced by the switch to iomap
>>> for direct I/O.
>>>
>>> [...]
>>
>> Christoph, I feel like this is broken.  Xfs and ext4 get away with this for
>> different reasons, ext4 doesn't take the inode_lock() at all in fsync, and
>> xfs takes the ILOCK instead of the IOLOCK, so it's fine.  However btrfs uses
>> inode_lock() in ->fsync (not for the IO, just for the logging part).  A long
>> time ago I specifically pushed the inode locking down into ->fsync()
>> handlers to give us this sort of control.
>>
>> I'm not 100% on the iomap stuff, but the fix seems like we need to move the
>> generic_write_sync() out of iomap_dio_complete() completely, and the callers
>> do their own thing, much like the normal generic_file_write_iter() does.
>> And then I'd like to add a WARN_ON(lockdep_is_held()) in vfs_fsync_range()
>> so we can avoid this sort of thing in the future.  What do you think?
> 
> Hmm, I was under the impression that the direct write completion in
> either path (iomap or classic) could call generic_write_sync?  How did
> this work in btrfs before the iomap conversion?

Looking at the code, we have this

		if (iocb->ki_flags & IOCB_DSYNC)
			retval = dio_set_defer_completion(dio);

it only happens for async I/O, and then generic_write_sync() runs outside
the context of the submitting task; since we're async, we aren't waiting
on the IO inside the region where we're holding the inode_lock().  Thanks,

Josef
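
Condensed from the pre-iomap fs/direct-io.c behaviour Josef is describing
(a paraphrase, not a verbatim quote of the old code): an async O_DSYNC
write defers its completion to the superblock's dio workqueue, so the
generic_write_sync() done from dio_complete() runs in worker context,
where no inode_lock() is held:

	/* Submission side: defer completion for async O_DSYNC writes. */
	if (dio->is_async && (iocb->ki_flags & IOCB_DSYNC))
		retval = dio_set_defer_completion(dio);

	/* Completion side: runs from sb->s_dio_done_wq, not the submitter. */
	static void dio_aio_complete_work(struct work_struct *work)
	{
		struct dio *dio = container_of(work, struct dio, complete_work);

		/*
		 * No inode_lock() is held here, so the generic_write_sync()
		 * inside dio_complete() can take it safely.
		 */
		dio_complete(dio, 0, DIO_COMPLETE_ASYNC | DIO_COMPLETE_INVALIDATE);
	}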
Goldwyn Rodrigues Sept. 1, 2020, 6:40 p.m. UTC | #9
On 14:44 01/09, Johannes Thumshirn wrote:
> On 01/09/2020 16:37, Filipe Manana wrote:
> >> +               iocb->ki_flags |= IOCB_DSYNC;
> > 
> > This should be set before calling generic_write_sync() above,
> > otherwise we don't do the persistence required by dsync.
> > 
> > I have to be honest, I don't like this approach that much, it's a bit
> > fragile - what if some other code, invoked in between clearing and
> > setting back the flag, needs that flag to operate correctly?
> 
> Yes, I don't like it either.
>
> I've compared btrfs to ext4 and xfs, and both don't hold the inode
> lock for (nearly) the whole xxx_file_write_iter() time, so I think
> the correct fix would be to shorten that window. But I haven't found
> any documentation yet on which functions need to be under the
> inode_lock and which do not.

I think you have got this reversed. ext4 and xfs hold the inode lock for
the entire xxx_file_write_iter(). OTOH, they don't take the inode lock
during fsync operations. The comments in btrfs_sync_file() mention that
it needs to acquire the inode lock to make sure no other process dirties
the pages while the sync is running.

> 
> Any input is appreciated here.
>
Dave Chinner Sept. 1, 2020, 9:46 p.m. UTC | #10
On Tue, Sep 01, 2020 at 11:11:58AM -0400, Josef Bacik wrote:
> On 9/1/20 9:06 AM, Johannes Thumshirn wrote:
> > This happens because iomap_dio_complete() calls into generic_write_sync()
> > if we have the data-sync flag set. But as we're still under the
> > inode_lock() from btrfs_file_write_iter() we will deadlock once
> > btrfs_sync_file() tries to acquire the inode_lock().
> > 
> > Calling into generic_write_sync() is not needed as __btrfs_direct_write()
> > already takes care of persisting the data on disk. We can temporarily drop
> > the IOCB_DSYNC flag before calling into __btrfs_direct_write() so the
> > iomap code won't try to call into the sync routines as well.
> > 
> > [...]
> > 
> 
> Christoph, I feel like this is broken.

No, it isn't broken, it's just a -different design- to the old
direct IO path. It was done this way by design because the old
way of requiring separate paths for calling generic_write_sync() for
sync and AIO is ....  nasty, and doesn't allow for optimisation of
IO completion functionality that may be wholly dependent on
submission time inode state.

e.g. moving the O_DSYNC completion out of the context of the
IOMAP_F_DIRTY submission context means we can't reliably do FUA
writes to avoid calls to generic_write_sync() completely.
Compromising that functionality is going to cause major performance
regressions for high performance enterprise databases using O_DSYNC
AIO+DIO...
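
The FUA optimisation Dave is referring to, heavily condensed (abbreviated
from fs/iomap/direct-io.c of this era; not the complete logic): the sync
decision is made at submission time, and if every bio went out as FUA the
completion-side generic_write_sync() is skipped entirely:

	/* Submission: pure O_DSYNC (not O_SYNC) writes are FUA candidates. */
	if (iocb->ki_flags & IOCB_DSYNC)
		dio->flags |= IOMAP_DIO_NEED_SYNC;
	if ((iocb->ki_flags & (IOCB_DSYNC | IOCB_SYNC)) == IOCB_DSYNC)
		dio->flags |= IOMAP_DIO_WRITE_FUA;

	/*
	 * Per mapping: FUA only if no metadata needs updating, i.e. the
	 * extent is already allocated and unshared and the inode is clean.
	 */
	if (!(iomap->flags & (IOMAP_F_SHARED | IOMAP_F_DIRTY)) &&
	    (dio->flags & IOMAP_DIO_WRITE_FUA) &&
	    blk_queue_fua(bdev_get_queue(iomap->bdev)))
		use_fua = true;

	/* If all bios were FUA, drop the completion-side sync entirely. */
	if (dio->flags & IOMAP_DIO_WRITE_FUA)
		dio->flags &= ~IOMAP_DIO_NEED_SYNC;

This is why moving generic_write_sync() out to the callers would hurt: the
decision to elide it depends on submission-time state like IOMAP_F_DIRTY.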

> Xfs and ext4 get away with this for
> different reasons,

No, they "don't get away with it", this is how it was designed to
work.

> ext4 doesn't take the inode_lock() at all in fsync, and
> xfs takes the ILOCK instead of the IOLOCK, so it's fine.  However btrfs uses
> inode_lock() in ->fsync (not for the IO, just for the logging part).  A long
> time ago I specifically pushed the inode locking down into ->fsync()
> handlers to give us this sort of control.
> 
> I'm not 100% on the iomap stuff, but the fix seems like we need to move the
> generic_write_sync() out of iomap_dio_complete() completely, and the callers
> do their own thing, much like the normal generic_file_write_iter() does.

That effectively breaks O_DSYNC AIO and requires us to reintroduce
all the nasty code that the old direct IO path required both the
infrastructure and the filesystems to handle it. That's really not an
acceptable solution to an internal btrfs locking issue...

> And then I'd like to add a WARN_ON(lockdep_is_held()) in vfs_fsync_range()
> so we can avoid this sort of thing in the future.  What do you think?

That's not going to work, either. There are filesystems that call
vfs_fsync_range() directly from under the inode_lock(). For example,
the fallocate() path in gfs2. And it's called under the ext4 and XFS
MMAPLOCK from the dax page fault path, which is the page fault
equivalent of the inode_lock(). IOWs, if you know that you aren't
going to take inode locks in your ->fsync() method, there's nothing
that says you cannot call vfs_fsync_range() while holding those
inode locks.

Cheers,

Dave.
Josef Bacik Sept. 1, 2020, 10:19 p.m. UTC | #11
On 9/1/20 5:46 PM, Dave Chinner wrote:
> On Tue, Sep 01, 2020 at 11:11:58AM -0400, Josef Bacik wrote:
>> On 9/1/20 9:06 AM, Johannes Thumshirn wrote:
>>> [...]
>>
>> Christoph, I feel like this is broken.
> 
> No, it isn't broken, it's just a -different design- to the old
> direct IO path. It was done this way by design because the old
> way of requiring separate paths for calling generic_write_sync() for
> sync and AIO is ....  nasty, and doesn't allow for optimisation of
> IO completion functionality that may be wholly dependent on
> submission time inode state.
> 
> e.g. moving the O_DSYNC completion out of the context of the
> IOMAP_F_DIRTY submission context means we can't reliably do FUA
> writes to avoid calls to generic_write_sync() completely.
> Compromising that functionality is going to cause major performance
> regressions for high performance enterprise databases using O_DSYNC
> AIO+DIO...
> 
>> Xfs and ext4 get away with this for
>> different reasons,
> 
> No, they "don't get away with it", this is how it was designed to
> work.
> 

Didn't mean this as a slight, just saying this is why it works fine for you guys 
and doesn't work for us.  Because when we first were looking at this we couldn't 
understand how it didn't blow up for you and it did blow up for us.  I'm 
providing context, not saying you guys are broken or doing it wrong.

>> ext4 doesn't take the inode_lock() at all in fsync, and
>> xfs takes the ILOCK instead of the IOLOCK, so it's fine.  However btrfs uses
>> inode_lock() in ->fsync (not for the IO, just for the logging part).  A long
>> time ago I specifically pushed the inode locking down into ->fsync()
>> handlers to give us this sort of control.
>>
>> I'm not 100% on the iomap stuff, but the fix seems like we need to move the
>> generic_write_sync() out of iomap_dio_complete() completely, and the callers
>> do their own thing, much like the normal generic_file_write_iter() does.
> 
> That effectively breaks O_DSYNC AIO and requires us to reintroduce
> all the nasty code that the old direct IO path required both the
> > infrastructure and the filesystems to handle it. That's really not an
> > acceptable solution to an internal btrfs locking issue...
> 
>> And then I'd like to add a WARN_ON(lockdep_is_held()) in vfs_fsync_range()
>> so we can avoid this sort of thing in the future.  What do you think?
> 
> That's not going to work, either. There are filesystems that call
> vfs_fsync_range() directly from under the inode_lock(). For example,
> the fallocate() path in gfs2. And it's called under the ext4 and XFS
> MMAPLOCK from the dax page fault path, which is the page fault
> equivalent of the inode_lock(). IOWs, if you know that you aren't
> going to take inode locks in your ->fsync() method, there's nothing
> that says you cannot call vfs_fsync_range() while holding those
> inode locks.

I converted ->fsync to not have the i_mutex taken before calling it _years_ ago

02c24a82187d5a628c68edfe71ae60dc135cd178

and part of what I did was update the locking document around it.  So in my 
head, the locking rules were "No VFS locks held on entry".  Obviously that's not 
true today, but if we're going to change the assumptions around these things 
then we really ought to

1) Make sure they're true for _all_ file systems.
2) Document it when it's changed.

Ok so iomap was designed assuming it was safe to take the inode_lock() before 
calling ->fsync.  That's fine, but this is kind of a bad way to find out.  We 
really shouldn't have generic helpers that have different generic locking rules 
based on which file system uses them.  Because then we end up with situations 
like this, where suddenly we're having to come up with some weird solution 
because the generic thing only works for a subset of file systems.  Thanks,

Josef
Dave Chinner Sept. 1, 2020, 11:58 p.m. UTC | #12
On Tue, Sep 01, 2020 at 06:19:17PM -0400, Josef Bacik wrote:
> On 9/1/20 5:46 PM, Dave Chinner wrote:
> > On Tue, Sep 01, 2020 at 11:11:58AM -0400, Josef Bacik wrote:
> > > On 9/1/20 9:06 AM, Johannes Thumshirn wrote:
> > > > [...]
> > > 
> > [...]
> > 
> > > Xfs and ext4 get away with this for
> > > different reasons,
> > 
> > No, they "don't get away with it", this is how it was designed to
> > work.
> > 
> 
> Didn't mean this as a slight, just saying this is why it works fine for you
> guys and doesn't work for us.  Because when we first were looking at this we
> couldn't understand how it didn't blow up for you and it did blow up for us.
> I'm providing context, not saying you guys are broken or doing it wrong.
> 
> > [...]
> > 
> > > And then I'd like to add a WARN_ON(lockdep_is_held()) in vfs_fsync_range()
> > > so we can avoid this sort of thing in the future.  What do you think?
> > 
> > That's not going to work, either. There are filesystems that call
> > vfs_fsync_range() directly from under the inode_lock(). For example,
> > the fallocate() path in gfs2. And it's called under the ext4 and XFS
> > MMAPLOCK from the dax page fault path, which is the page fault
> > equivalent of the inode_lock(). IOWs, if you know that you aren't
> > going to take inode locks in your ->fsync() method, there's nothing
> > that says you cannot call vfs_fsync_range() while holding those
> > inode locks.
> 
> I converted ->fsync to not have the i_mutex taken before calling _years_ ago
> 
> 02c24a82187d5a628c68edfe71ae60dc135cd178
> 
> and part of what I did was update the locking document around it.  So in my
> head, the locking rules were "No VFS locks held on entry".  Obviously that's
> not true today, but if we're going to change the assumptions around these
> things then we really ought to
> 
> 1) Make sure they're true for _all_ file systems.
> 2) Document it when it's changed.
> 
> Ok so iomap was designed assuming it was safe to take the inode_lock()
> before calling ->fsync.

No, I did not say that. Stop trying to put your words in my mouth,
Josef.

The iomap design simply assumes that ->fsync can run without needing
to hold the same locks as IO submission requires. iomap
requires a ->fsync implementation that doesn't take the inode_lock()
if the inode_lock() is used to serialise IO submission. i.e. iomap
separates IO exclusion from -internal inode state serialisation-.

This design assumption means DIO submission can safely -flush dirty
data-, invalidate the page cache, sync the inode/journal to stable
storage, etc all while holding the IO exclusion lock to ensure
submission won't race with other metadata modification operations
like truncate, holepunch, etc.

IOWs, iomap expects the inode_lock() to work as an -IO submission
exclusion lock-, not as an inode _state protection_ lock. Yes, this
means iomap has a different IO locking model to the original direct
and bufferhead based buffered IO paths. However, these architectural
changes are required to replace bufferhead based locking, implement
the physical storage exclusion DAX requires, close mmap vs hole
punch races, etc.

In practice, this means all the inode state in XFS (and ext4) is
protected by internal inode locks rather than the inode_lock(). As
such, they do not require the inode_lock() to be able to write
back dirty data or modify inode state or flush inode state to stable
storage.
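
In code, that separation looks roughly like this - a minimal sketch
with hypothetical fs_* names and a spinlock standing in for the
internal state lock, not actual XFS or ext4 code:

	/* IO submission: the IO exclusion lock is held across submission. */
	static ssize_t fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
	{
		struct inode *inode = file_inode(iocb->ki_filp);
		ssize_t ret;

		inode_lock(inode);		/* IO exclusion only */
		ret = iomap_dio_rw(iocb, from, &fs_iomap_ops, &fs_dio_ops,
				   is_sync_kiocb(iocb));
		inode_unlock(inode);
		return ret;
	}

	/* ->fsync: internal state locks only, never the IO exclusion lock,
	 * so it is safe to call from DIO completion context. */
	static int fs_fsync(struct file *file, loff_t start, loff_t end, int datasync)
	{
		struct fs_inode *fi = FS_I(file_inode(file));
		int error;

		error = file_write_and_wait_range(file, start, end);
		if (error)
			return error;

		spin_lock(&fi->state_lock);
		/* ... work out which metadata needs to reach the log ... */
		spin_unlock(&fi->state_lock);
		return fs_flush_log(fi);	/* hypothetical log flush helper */
	}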

Put simply: converting a filesystem to use iomap is not a "change
the filesystem interfacing code and it will work" modification.  We
ask that filesystems are modified to conform to the iomap IO
exclusion model; adding special cases for every potential
locking and mapping quirk every different filesystem has is part of
what turned the old direct IO code into an unmaintainable nightmare.

> That's fine, but this is kind of a bad way to find
> out.  We really shouldn't have generic helpers that have different generic
> locking rules based on which file system uses them.

We certainly can change the rules for new infrastructure. Indeed, we
had to change the rules to support DAX.  The whole point of the
iomap infrastructure was that it enabled us to use code that already
worked for DAX (the XFS code) in multiple filesystems. And as people
have realised that the DIO via iomap is much faster than the old DIO
code and is a much more efficient way of doing large buffered IO,
other filesystems have started to use it.

However....

> Because then we end up
> with situations like this, where suddenly we're having to come up with some
> weird solution because the generic thing only works for a subset of file
> systems.  Thanks,

.... we've always said "you need to change the filesystem code to
use iomap". This is simply a reflection on the fact that iomap has
different rules and constraints to the old code and so it's not a
direct plug in replacement. There are no short cuts here...
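
FWIW, the FUA optimisation referred to above works by deciding at
submission time whether an O_DSYNC write can go out with REQ_FUA and
skip generic_write_sync() at completion. Condensed and paraphrased
from the v5.9-era fs/iomap/direct-io.c - a sketch, not the exact
source:

	/* O_DSYNC (but not O_SYNC) writes are FUA candidates: */
	if (iocb->ki_flags & IOCB_DSYNC)
		dio->flags |= IOMAP_DIO_NEED_SYNC;
	if ((iocb->ki_flags & (IOCB_DSYNC | IOCB_SYNC)) == IOCB_DSYNC)
		dio->flags |= IOMAP_DIO_WRITE_FUA;

	/* Per mapping: FUA is only usable if the inode has no dirty
	 * in-core state at submission time (IOMAP_F_DIRTY clear) and
	 * the device supports FUA: */
	if (!(iomap->flags & (IOMAP_F_SHARED | IOMAP_F_DIRTY)) &&
	    (dio->flags & IOMAP_DIO_WRITE_FUA) &&
	    blk_queue_fua(bdev_get_queue(iomap->bdev)))
		use_fua = true;		/* issue this bio with REQ_FUA */
	else
		dio->flags &= ~IOMAP_DIO_WRITE_FUA;

	/* If every bio went out with REQ_FUA, completion can skip
	 * generic_write_sync() entirely: */
	if (dio->flags & IOMAP_DIO_WRITE_FUA)
		dio->flags &= ~IOMAP_DIO_NEED_SYNC;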

Cheers,

Dave.
Josef Bacik Sept. 2, 2020, 12:22 a.m. UTC | #13
On 9/1/20 7:58 PM, Dave Chinner wrote:
> On Tue, Sep 01, 2020 at 06:19:17PM -0400, Josef Bacik wrote:
>> On 9/1/20 5:46 PM, Dave Chinner wrote:
>>> On Tue, Sep 01, 2020 at 11:11:58AM -0400, Josef Bacik wrote:
>>>> On 9/1/20 9:06 AM, Johannes Thumshirn wrote:
>>>>> This happens because iomap_dio_complete() calls into generic_write_sync()
>>>>> if we have the data-sync flag set. But as we're still under the
>>>>> inode_lock() from btrfs_file_write_iter() we will deadlock once
>>>>> btrfs_sync_file() tries to acquire the inode_lock().
>>>>>
>>>>> Calling into generic_write_sync() is not needed as __btrfs_direct_write()
>>>>> already takes care of persisting the data on disk. We can temporarily drop
>>>>> the IOCB_DSYNC flag before calling into __btrfs_direct_write() so the
>>>>> iomap code won't try to call into the sync routines as well.
>>>>>
>>>>> References: https://github.com/btrfs/fstests/issues/12
>>>>> Fixes: da4d7c1b4c45 ("btrfs: switch to iomap for direct IO")
>>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>>> ---
>>>>>     fs/btrfs/file.c | 5 ++++-
>>>>>     1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>>>> index b62679382799..c75c0f2a5f72 100644
>>>>> --- a/fs/btrfs/file.c
>>>>> +++ b/fs/btrfs/file.c
>>>>> @@ -2023,6 +2023,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>>>>>     		atomic_inc(&BTRFS_I(inode)->sync_writers);
>>>>>     	if (iocb->ki_flags & IOCB_DIRECT) {
>>>>> +		iocb->ki_flags &= ~IOCB_DSYNC;
>>>>>     		num_written = __btrfs_direct_write(iocb, from);
>>>>>     	} else {
>>>>>     		num_written = btrfs_buffered_write(iocb, from);
>>>>> @@ -2046,8 +2047,10 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>>>>>     	if (num_written > 0)
>>>>>     		num_written = generic_write_sync(iocb, num_written);
>>>>> -	if (sync)
>>>>> +	if (sync) {
>>>>> +		iocb->ki_flags |= IOCB_DSYNC;
>>>>>     		atomic_dec(&BTRFS_I(inode)->sync_writers);
>>>>> +	}
>>>>>     out:
>>>>>     	current->backing_dev_info = NULL;
>>>>>     	return num_written ? num_written : err;
>>>>>
>>>>
>>>> Christoph, I feel like this is broken.
>>>
>>> No, it isn't broken, it's just a -different design- to the old
>>> direct IO path. It was done this way by design because the old
>>> way of requiring separate paths for calling generic_write_sync() for
>>> sync and AIO is ....  nasty, and doesn't allow for optimisation of
>>> IO completion functionality that may be wholly dependent on
>>> submission time inode state.
>>>
>>> e.g. moving the O_DSYNC completion out of the context of the
>>> IOMAP_F_DIRTY submission context means we can't reliably do FUA
>>> writes to avoid calls to generic_write_sync() completely.
>>> Compromising that functionality is going to cause major performance
>>> regressions for high performance enterprise databases using O_DSYNC
>>> AIO+DIO...
>>>
>>>> Xfs and ext4 get away with this for
>>>> different reasons,
>>>
>>> No, they "don't get away with it", this is how it was designed to
>>> work.
>>>
>>
>> Didn't mean this as a slight, just saying this is why it works fine for you
>> guys and doesn't work for us.  Because when we first were looking at this we
>> couldn't understand how it didn't blow up for you but did blow up for us.
>> I'm providing context, not saying you guys are broken or doing it wrong.
>>
>>>> ext4 doesn't take the inode_lock() at all in fsync, and
>>>> xfs takes the ILOCK instead of the IOLOCK, so it's fine.  However btrfs uses
>>>> inode_lock() in ->fsync (not for the IO, just for the logging part).  A long
>>>> time ago I specifically pushed the inode locking down into ->fsync()
>>>> handlers to give us this sort of control.
>>>>
>>>> I'm not 100% on the iomap stuff, but the fix seems like we need to move the
>>>> generic_write_sync() out of iomap_dio_complete() completely, and the callers
>>>> do their own thing, much like the normal generic_file_write_iter() does.
>>>
>>> That effectively breaks O_DSYNC AIO and requires us to reintroduce
>>> all the nasty code that the old direct IO path required both the
>>> infrastructure and the filesystems to handle. That's really not an
>>> acceptable solution to an internal btrfs locking issue...
>>>
>>>> And then I'd like to add a WARN_ON(lockdep_is_held()) in vfs_fsync_range()
>>>> so we can avoid this sort of thing in the future.  What do you think?
>>>
>>> That's not going to work, either. There are filesystems that call
>>> vfs_fsync_range() directly from under the inode_lock(). For example,
>>> the fallocate() path in gfs2. And it's called under the ext4 and XFS
>>> MMAPLOCK from the dax page fault path, which is the page fault
>>> equivalent of the inode_lock(). IOWs, if you know that you aren't
>>> going to take inode locks in your ->fsync() method, there's nothing
>>> that says you cannot call vfs_fsync_range() while holding those
>>> inode locks.
>>
>> I converted ->fsync to not have the i_mutex taken before calling _years_ ago
>>
>> 02c24a82187d5a628c68edfe71ae60dc135cd178
>>
>> and part of what I did was update the locking document around it.  So in my
>> head, the locking rules were "No VFS locks held on entry".  Obviously that's
>> not true today, but if we're going to change the assumptions around these
>> things then we really ought to
>>
>> 1) Make sure they're true for _all_ file systems.
>> 2) Document it when it's changed.
>>
>> Ok so iomap was designed assuming it was safe to take the inode_lock()
>> before calling ->fsync.
> 
> No, I did not say that. Stop trying to put your words in my mouth,
> Josef.
> 

Why are you being combative?  I'm simply trying to work towards a
reasonable solution and figure out how we keep this class of problems
from happening again.  I literally do not care at all about the design
decisions for iomap; you are an intelligent person, I assume good
intent, and I assume you did it for good reasons.  I'm just trying to
work out the holes in my understanding and get to a solution we're all
happy with.  I'm not trying to argue with you or throw mud, only to
understand.

> The iomap design simply assumes that ->fsync can run without needing
> to hold the same locks as IO submission requires. iomap
> requires a ->fsync implementation that doesn't take the inode_lock()
> if the inode_lock() is used to serialise IO submission. i.e. iomap
> separates IO exclusion from -internal inode state serialisation-.
> 

This is only slightly different from what I was saying, and in reality 
is the same thing because we all currently use the inode_lock() for the 
submission side.  So while the assumption may be slightly different, the 
outcome is the same.

> This design assumption means DIO submission can safely -flush dirty
> data-, invalidate the page cache, sync the inode/journal to stable
> storage, etc all while holding the IO exclusion lock to ensure
> submission won't race with other metadata modification operations
> like truncate, holepunch, etc.
> 
> IOWs, iomap expects the inode_lock() to work as an -IO submission
> exclusion lock-, not as an inode _state protection_ lock. Yes, this
> means iomap has a different IO locking model to the original direct
> and bufferhead based buffered IO paths. However, these architectural
> changes are required to replace bufferhead based locking, implement
> the physical storage exclusion DAX requires, close mmap vs hole
> punch races, etc.
> 
> In practice, this means all the inode state in XFS (and ext4) is
> protected by internal inode locks rather than the inode_lock(). As
> such, they do not require the inode_lock() to be able to write
> back dirty data or modify inode state or flush inode state to stable
> storage.
> 
> Put simply: converting a filesystem to use iomap is not a "change
> the filesystem interfacing code and it will work" modification.  We
> ask that filesystems are modified to conform to the iomap IO
> exclusion model; adding special cases for every potential
> locking and mapping quirk every different filesystem has is part of
> what turned the old direct IO code into an unmaintainable nightmare.
> 
>> That's fine, but this is kind of a bad way to find
>> out.  We really shouldn't have generic helpers that have different generic
>> locking rules based on which file system uses them.
> 
> We certainly can change the rules for new infrastructure. Indeed, we
> had to change the rules to support DAX.  The whole point of the
> iomap infrastructure was that it enabled us to use code that already
> worked for DAX (the XFS code) in multiple filesystems. And as people
> have realised that the DIO via iomap is much faster than the old DIO
> code and is a much more efficient way of doing large buffered IO,
> other filesystems have started to use it.
> 
> However....
> 
>> Because then we end up
>> with situations like this, where suddenly we're having to come up with some
>> weird solution because the generic thing only works for a subset of file
>> systems.  Thanks,
> 
> .... we've always said "you need to change the filesystem code to
> use iomap". This is simply a reflection on the fact that iomap has
> different rules and constraints to the old code and so it's not a
> direct plug in replacement. There are no short cuts here...
> 

That's fine; again, I'm not saying you did anything wrong with the
implementation.  I'm saying these requirements were not clear, and with
no warning, annotations, or documentation we simply found out by random
chance while Johannes was running a non-standard xfstests setup.
Perhaps that's what the system working looks like, but honestly I would 
have rather found out at the point that I applied, built, and reviewed 
the code so we could have addressed it then.  Instead now we have to rip 
it out until we figure out what to do about it.  Thanks,

Josef
Johannes Thumshirn Sept. 2, 2020, 7:12 a.m. UTC | #14
On 02/09/2020 02:22, Josef Bacik wrote:
> Instead now we have to rip 
> it out until we figure out what to do about it.

I don't think we need to rip out the iomap conversion. We can 
take my fix, albeit not pretty, until we have reworked the locking
around ->fsync(). Probably with a big fat comment attached to it.

Byte,
	Johannes
Josef Bacik Sept. 2, 2020, 11:10 a.m. UTC | #15
On 9/2/20 3:12 AM, Johannes Thumshirn wrote:
> On 02/09/2020 02:22, Josef Bacik wrote:
>> Instead now we have to rip
>> it out until we figure out what to do about it.
> 
> I don't think we need to rip out the iomap conversion. We can
> take my fix, albeit not pretty, until we have reworked the locking
> around ->fsync(). Probably with a big fat comment attached to it.
> 

We do, because your fix breaks DSYNC for AIO.  You didn't hit this with
direct IO, you hit it with AIO, and the reason you hit it is that you
are on zram, so your bios completed before we exited iomap_dio_rw.  So
that was the last put on the iomap_dio, and thus we ran
iomap_dio_complete() and deadlocked.  We can't just drop the DSYNC thing
for AIO, because in the normal case where this doesn't happen we need to
know when the last thing is finished in order to run ->fsync(); we can't
just run it after submission.  Thanks,

Josef
Matthew Wilcox Sept. 2, 2020, 11:44 a.m. UTC | #16
On Wed, Sep 02, 2020 at 09:58:30AM +1000, Dave Chinner wrote:
> Put simply: converting a filesystem to use iomap is not a "change
> the filesystem interfacing code and it will work" modification.  We
> ask that filesystems are modified to conform to the iomap IO
> exclusion model; adding special cases for every potential
> locking and mapping quirk every different filesystem has is part of
> what turned the old direct IO code into an unmaintainable nightmare.
> 
> > That's fine, but this is kind of a bad way to find
> > out.  We really shouldn't have generic helpers that have different generic
> > locking rules based on which file system uses them.
> 
> We certainly can change the rules for new infrastructure. Indeed, we
> had to change the rules to support DAX.  The whole point of the
> iomap infrastructure was that it enabled us to use code that already
> worked for DAX (the XFS code) in multiple filesystems. And as people
> have realised that the DIO via iomap is much faster than the old DIO
> code and is a much more efficient way of doing large buffered IO,
> other filesystems have started to use it.
> 
> However....
> 
> > Because then we end up
> > with situations like this, where suddenly we're having to come up with some
> > weird solution because the generic thing only works for a subset of file
> > systems.  Thanks,
> 
> .... we've always said "you need to change the filesystem code to
> use iomap". This is simply a reflection on the fact that iomap has
> different rules and constraints to the old code and so it's not a
> direct plug in replacement. There are no short cuts here...

Can you point me (and I suspect Josef!) towards the documentation of the
locking model?  I was hoping to find Documentation/filesystems/iomap.rst
but all the 'iomap' strings in Documentation/ refer to pci_iomap and
similar, except for this in the DAX documentation:

: - implementing ->read_iter and ->write_iter operations which use dax_iomap_rw()
:   when inode has S_DAX flag set
: - implementing an mmap file operation for DAX files which sets the
:   VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to
:   include handlers for fault, pmd_fault, page_mkwrite, pfn_mkwrite. These
:   handlers should probably call dax_iomap_fault() passing the appropriate
:   fault size and iomap operations.
: - calling iomap_zero_range() passing appropriate iomap operations instead of
:   block_truncate_page() for DAX files
: - ensuring that there is sufficient locking between reads, writes,
:   truncates and page faults
: 
: The iomap handlers for allocating blocks must make sure that allocated blocks
: are zeroed out and converted to written extents before being returned to avoid
: exposure of uninitialized data through mmap.

which doesn't bear on this situation.
Dave Chinner Sept. 2, 2020, 12:20 p.m. UTC | #17
On Wed, Sep 02, 2020 at 12:44:14PM +0100, Matthew Wilcox wrote:
> On Wed, Sep 02, 2020 at 09:58:30AM +1000, Dave Chinner wrote:
> > Put simply: converting a filesystem to use iomap is not a "change
> > the filesystem interfacing code and it will work" modification.  We
> > ask that filesystems are modified to conform to the iomap IO
> > exclusion model; adding special cases for every potential
> > locking and mapping quirk every different filesystem has is part of
> > what turned the old direct IO code into an unmaintainable nightmare.
> > 
> > > That's fine, but this is kind of a bad way to find
> > > out.  We really shouldn't have generic helpers that have different generic
> > > locking rules based on which file system uses them.
> > 
> > We certainly can change the rules for new infrastructure. Indeed, we
> > had to change the rules to support DAX.  The whole point of the
> > iomap infrastructure was that it enabled us to use code that already
> > worked for DAX (the XFS code) in multiple filesystems. And as people
> > have realised that the DIO via iomap is much faster than the old DIO
> > code and is a much more efficient way of doing large buffered IO,
> > other filesystems have started to use it.
> > 
> > However....
> > 
> > > Because then we end up
> > > with situations like this, where suddenly we're having to come up with some
> > > weird solution because the generic thing only works for a subset of file
> > > systems.  Thanks,
> > 
> > .... we've always said "you need to change the filesystem code to
> > use iomap". This is simply a reflection on the fact that iomap has
> > different rules and constraints to the old code and so it's not a
> > direct plug in replacement. There are no short cuts here...
> 
> Can you point me (and I suspect Josef!) towards the documentation of the
> locking model?  I was hoping to find Documentation/filesystems/iomap.rst
> but all the 'iomap' strings in Documentation/ refer to pci_iomap and
> similar, except for this in the DAX documentation:

There's no locking model documentation because there is no locking
in the iomap direct IO code. The filesystem defines and does all the
locking, so there's pretty much nothing to document for iomap.

IOWs, the only thing iomap_dio_rw requires is that the IO completion
paths do not take the same locks that the IO submission path
requires. And that's because:

/*
 * iomap_dio_rw() always completes O_[D]SYNC writes regardless of whether the IO
 * is being issued as AIO or not. [...]

So you obviously can't sit waiting for dio completion in
iomap_dio_rw() while holding the submission lock if completion
requires the submission lock to make progress.

FWIW, iomap_dio_rw() originally required the inode_lock() to be held
and contained a lockdep assert to verify this, but....

commit 3ad99bec6e82e32fa9faf2f84e74b134586b46f7
Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
Date:   Sat Nov 30 09:59:25 2019 -0600

    iomap: remove lockdep_assert_held()
    
    Filesystems such as btrfs can perform direct I/O without holding the
    inode->i_rwsem in some of the cases like writing within i_size.  So,
    remove the check for lockdep_assert_held() in iomap_dio_rw().
    
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>

... btrfs has special corner cases for direct IO locking and hence
we removed the lockdep assert....

IOWs, iomap_dio_rw() really does not care what strategy filesystems
use to serialise DIO against other operations.  Filesystems can use
whatever IO serialisation mechanism they want (mutex, rwsem, range
locks, etc) as long as they obey the one simple requirement: do not
take the DIO submission lock in the DIO completion path.

Cheers,

Dave.
Josef Bacik Sept. 2, 2020, 12:42 p.m. UTC | #18
On 9/2/20 8:20 AM, Dave Chinner wrote:
> On Wed, Sep 02, 2020 at 12:44:14PM +0100, Matthew Wilcox wrote:
>> On Wed, Sep 02, 2020 at 09:58:30AM +1000, Dave Chinner wrote:
>>> Put simply: converting a filesystem to use iomap is not a "change
>>> the filesystem interfacing code and it will work" modification.  We
>>> ask that filesystems are modified to conform to the iomap IO
>>> exclusion model; adding special cases for every potential
>>> locking and mapping quirk every different filesystem has is part of
>>> what turned the old direct IO code into an unmaintainable nightmare.
>>>
>>>> That's fine, but this is kind of a bad way to find
>>>> out.  We really shouldn't have generic helpers that have different generic
>>>> locking rules based on which file system uses them.
>>>
>>> We certainly can change the rules for new infrastructure. Indeed, we
>>> had to change the rules to support DAX.  The whole point of the
>>> iomap infrastructure was that it enabled us to use code that already
>>> worked for DAX (the XFS code) in multiple filesystems. And as people
>>> have realised that the DIO via iomap is much faster than the old DIO
>>> code and is a much more efficient way of doing large buffered IO,
>>> other filesystems have started to use it.
>>>
>>> However....
>>>
>>>> Because then we end up
>>>> with situations like this, where suddenly we're having to come up with some
>>>> weird solution because the generic thing only works for a subset of file
>>>> systems.  Thanks,
>>>
>>> .... we've always said "you need to change the filesystem code to
>>> use iomap". This is simply a reflection on the fact that iomap has
>>> different rules and constraints to the old code and so it's not a
>>> direct plug in replacement. There are no short cuts here...
>>
>> Can you point me (and I suspect Josef!) towards the documentation of the
>> locking model?  I was hoping to find Documentation/filesystems/iomap.rst
>> but all the 'iomap' strings in Documentation/ refer to pci_iomap and
>> similar, except for this in the DAX documentation:
> 
> There's no locking model documentation because there is no locking
> in the iomap direct IO code. The filesystem defines and does all the
> locking, so there's pretty much nothing to document for iomap.
> 
> IOWs, the only thing iomap_dio_rw requires is that the IO completion
> paths do not take the same locks that the IO submission path
> requires. And that's because:
> 
> /*
>   * iomap_dio_rw() always completes O_[D]SYNC writes regardless of whether the IO
>   * is being issued as AIO or not. [...]
> 
> So you obviously can't sit waiting for dio completion in
> iomap_dio_rw() while holding the submission lock if completion
> requires the submission lock to make progress.
> 
> FWIW, iomap_dio_rw() originally required the inode_lock() to be held
> and contained a lockdep assert to verify this, but....
> 
> commit 3ad99bec6e82e32fa9faf2f84e74b134586b46f7
> Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Date:   Sat Nov 30 09:59:25 2019 -0600
> 
>      iomap: remove lockdep_assert_held()
>      
>      Filesystems such as btrfs can perform direct I/O without holding the
>      inode->i_rwsem in some of the cases like writing within i_size.  So,
>      remove the check for lockdep_assert_held() in iomap_dio_rw().
>      
>      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
>      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
>      Signed-off-by: David Sterba <dsterba@suse.com>
> 
> ... btrfs has special corner cases for direct IO locking and hence
> we removed the lockdep assert....
> 
> IOWs, iomap_dio_rw() really does not care what strategy filesystems
> use to serialise DIO against other operations.  Filesystems can use
> whatever IO serialisation mechanism they want (mutex, rwsem, range
> locks, etc) as long as they obey the one simple requirement: do not
> take the DIO submission lock in the DIO completion path.
> 

Goldwyn has been working on these patches for a long time, and is 
actually familiar with this code, and he missed that these two 
interfaces are being mixed.  This is a problem that I want to solve.  He 
didn't notice it in any of his testing, which IIRC was like 6 months to 
get this stuff actually into the btrfs tree.  If we're going to mix 
interfaces then it should be blatantly obvious to developers that's 
what's happening so they find out during development, not after the
patches have landed, and certainly not after they've made it out to 
users.  Thanks,

Josef
Darrick J. Wong Sept. 2, 2020, 4:29 p.m. UTC | #19
On Wed, Sep 02, 2020 at 07:10:08AM -0400, Josef Bacik wrote:
> On 9/2/20 3:12 AM, Johannes Thumshirn wrote:
> > On 02/09/2020 02:22, Josef Bacik wrote:
> > > Instead now we have to rip
> > > it out until we figure out what to do about it.
> > 
> > I don't think we need to rip out the iomap conversion. We can
> > take my fix, albeit not pretty, until we have reworked the locking
> > around ->fsync(). Probably with a big fat comment attached to it.
> > 
> 
> We do, because your fix breaks DSYNC for AIO.  You didn't hit this with
> direct io, you hit it with AIO, and the reason you hit it is because you are
> on zram, so your bios completed before we exited iomap_dio_rw.  So that was
> the last put on the iomap_dio, and thus we ran
> iomap_dio_complete() and deadlocked.  We can't just drop the DSYNC thing for
> AIO because in the normal case where this doesn't happen we need to know
> when the last thing is finished in order to run ->fsync(), we can't just run
> it after submission.  Thanks,

Bleh, Oracle mail (or vger or something) is being slow again...

It occurred to me that we added iomap_dio_ops.submit_io for the benefit
of btrfs.  Could we solve all this for now by adding a ->write_sync
function pointer to iomap_dio_ops that could lead back into a btrfs
function that would flush the necessary bits without itself taking the
inode lock?  And if a ->write_sync is not supplied, then the caller gets
generic_write_sync?
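
Something like this, say (entirely hypothetical - no ->write_sync
member exists; this is just the shape of the proposal):

	struct iomap_dio_ops {
		int (*end_io)(struct kiocb *iocb, ssize_t size, int error,
			      unsigned flags);
		blk_qc_t (*submit_io)(struct inode *inode, struct iomap *iomap,
				      struct bio *bio, loff_t file_offset);
		/* proposed: make an O_[D]SYNC write stable without
		 * taking locks held across submission */
		int (*write_sync)(struct kiocb *iocb, ssize_t count);
	};

	/* and in iomap_dio_complete(), roughly: */
	if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC)) {
		if (dio->dops && dio->dops->write_sync)
			ret = dio->dops->write_sync(iocb, ret);
		else
			ret = generic_write_sync(iocb, ret);
	}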

It's kind of a bandaid, but maybe less bad of one than restructuring the
btrfs locking model under time pressure...

--D

> Josef
Josef Bacik Sept. 2, 2020, 4:47 p.m. UTC | #20
On 9/2/20 12:29 PM, Darrick J. Wong wrote:
> On Wed, Sep 02, 2020 at 07:10:08AM -0400, Josef Bacik wrote:
>> On 9/2/20 3:12 AM, Johannes Thumshirn wrote:
>>> On 02/09/2020 02:22, Josef Bacik wrote:
>>>> Instead now we have to rip
>>>> it out until we figure out what to do about it.
>>>
>>> I don't think we need to rip out the iomap conversion. We can
>>> take my fix, albeit not pretty, until we have reworked the locking
>>> around ->fsync(). Probably with a big fat comment attached to it.
>>>
>>
>> We do, because your fix breaks DSYNC for AIO.  You didn't hit this with
>> direct io, you hit it with AIO, and the reason you hit it is because you are
>> on zram, so your bios completed before we exited iomap_dio_rw.  So that was
>> the last put on the iomap_dio, and thus we ran
>> iomap_dio_complete() and deadlocked.  We can't just drop the DSYNC thing for
>> AIO because in the normal case where this doesn't happen we need to know
>> when the last thing is finished in order to run ->fsync(), we can't just run
>> it after submission.  Thanks,
> 
> Bleh, Oracle mail (or vger or something) is being slow again...
> 
> It occurred to me that we added iomap_dio_ops.submit_io for the benefit
> of btrfs.  Could we solve all this for now by adding a ->write_sync
> function pointer to iomap_dio_ops that could lead back into a btrfs
> function that would flush the necessary bits without itself taking the
> inode lock?  And if a ->write_sync is not supplied, then the caller gets
> generic_write_sync?
> 
> It's kind of a bandaid, but maybe less bad of one than restructuring the
> btrfs locking model under time pressure...
> 

I'd rather not mess around with the generic iomap stuff for this; coordinating 
changes between generic and fs stuff is annoying enough as it is.  We've got a 
strategy to work around this in btrfs so we don't have to rip out the iomap work 
right now.  And then we'll rip out the workaround once we've reworked the 
locking, since the locking stuff will require a fair bit of testing and soak 
time to be sure it's safe.  Thanks,

Josef
Dave Chinner Sept. 3, 2020, 2:28 a.m. UTC | #21
On Wed, Sep 02, 2020 at 08:42:25AM -0400, Josef Bacik wrote:
> On 9/2/20 8:20 AM, Dave Chinner wrote:
> > On Wed, Sep 02, 2020 at 12:44:14PM +0100, Matthew Wilcox wrote:
> > > On Wed, Sep 02, 2020 at 09:58:30AM +1000, Dave Chinner wrote:
> > > > Put simply: converting a filesystem to use iomap is not a "change
> > > > the filesystem interfacing code and it will work" modification.  We
> > > > ask that filesystems are modified to conform to the iomap IO
> > > > exclusion model; adding special cases for every potential
> > > > locking and mapping quirk every different filesystem has is part of
> > > > what turned the old direct IO code into an unmaintainable nightmare.
> > > > 
> > > > > That's fine, but this is kind of a bad way to find
> > > > > out.  We really shouldn't have generic helpers that have different generic
> > > > > locking rules based on which file system uses them.
> > > > 
> > > > We certainly can change the rules for new infrastructure. Indeed, we
> > > > had to change the rules to support DAX.  The whole point of the
> > > > iomap infrastructure was that it enabled us to use code that already
> > > > worked for DAX (the XFS code) in multiple filesystems. And as people
> > > > have realised that the DIO via iomap is much faster than the old DIO
> > > > code and is a much more efficient way of doing large buffered IO,
> > > > other filesystems have started to use it.
> > > > 
> > > > However....
> > > > 
> > > > > Because then we end up
> > > > > with situations like this, where suddenly we're having to come up with some
> > > > > weird solution because the generic thing only works for a subset of file
> > > > > systems.  Thanks,
> > > > 
> > > > .... we've always said "you need to change the filesystem code to
> > > > use iomap". This is simply a reflection on the fact that iomap has
> > > > different rules and constraints to the old code and so it's not a
> > > > direct plug in replacement. There are no short cuts here...
> > > 
> > > Can you point me (and I suspect Josef!) towards the documentation of the
> > > locking model?  I was hoping to find Documentation/filesystems/iomap.rst
> > > but all the 'iomap' strings in Documentation/ refer to pci_iomap and
> > > similar, except for this in the DAX documentation:
> > 
> > There's no locking model documentation because there is no locking
> > in the iomap direct IO code. The filesystem defines and does all the
> > locking, so there's pretty much nothing to document for iomap.
> > 
> > IOWs, the only thing iomap_dio_rw requires is that the IO completion
> > paths do not take the same locks that the IO submission path
> > requires. And that's because:
> > 
> > /*
> >   * iomap_dio_rw() always completes O_[D]SYNC writes regardless of whether the IO
> >   * is being issued as AIO or not. [...]
> > 
> > So you obviously can't sit waiting for dio completion in
> > iomap_dio_rw() while holding the submission lock if completion
> > requires the submission lock to make progress.
> > 
> > FWIW, iomap_dio_rw() originally required the inode_lock() to be held
> > and contained a lockdep assert to verify this, but....
> > 
> > commit 3ad99bec6e82e32fa9faf2f84e74b134586b46f7
> > Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > Date:   Sat Nov 30 09:59:25 2019 -0600
> > 
> >      iomap: remove lockdep_assert_held()
> >      Filesystems such as btrfs can perform direct I/O without holding the
> >      inode->i_rwsem in some of the cases like writing within i_size.  So,
> >      remove the check for lockdep_assert_held() in iomap_dio_rw().
> >      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> >      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> >      Signed-off-by: David Sterba <dsterba@suse.com>
> > 
> > ... btrfs has special corner cases for direct IO locking and hence
> > we removed the lockdep assert....
> > 
> > IOWs, iomap_dio_rw() really does not care what strategy filesystems
> > use to serialise DIO against other operations.  Filesystems can use
> > whatever IO serialisation mechanism they want (mutex, rwsem, range
> > locks, etc) as long as they obey the one simple requirement: do not
> > take the DIO submission lock in the DIO completion path.
> > 
> 
> Goldwyn has been working on these patches for a long time, and is actually
> familiar with this code, and he missed that these two interfaces are being
> mixed.  This is a problem that I want to solve.  He didn't notice it in any
> of his testing, which IIRC was like 6 months to get this stuff actually into
> the btrfs tree.

And that "been working on it for 6 months" is what really worries me
the most. Hence I've done some post-mortem analysis of the
"inode_lock" deadlock situation. I decided to do this after I
realised this same patch set was merged last cycle and reverted late in
the cycle because of problems it was found to have with page
invalidation when mixed IO types were used.

I did a bit of looking around yesterday afternoon, both in the btrfs
code to understand the iomap_dio_rw changes and at our test
coverage for DIO functionality.

The more I dug, the more I'm finding that the most important
question we need to answer is not "why wasn't this iomap locking
requirement documented", the question we should be trying to answer
is this:

	Why wasn't this *obvious* deadlock detected months ago?

First of all, the inode_lock() complaints are largely a red herring
as the typical btrfs DIO write path through the patchset does not
take the inode lock and hence it is actually "safe" to take the
inode lock in the completion path in the common case. i.e. for DIO
writes wholly inside EOF, submission drops the inode_lock() before
calling iomap_dio_rw() and so there is no inode_lock() deadlock
present. From that view point, it looks like there's only a specific
corner case where this deadlock might occur and that would explain
why it hasn't been discovered until now.

However.

The new btrfs dio write path always does
down_read(&BTRFS_I(inode)->dio_sem), even when the inode lock has not
been taken.  That's important because btrfs_sync_file() always does a
down_write(&BTRFS_I(inode)->dio_sem) call and will deadlock in IO
completion if the IO submission code is holding the dio_sem.

So regardless of the inode_lock(), non-AIO O_DSYNC DIO writes
through iomap_dio_rw() should deadlock immediately on the first IO
with this locking arrangement. It will deadlock on either the
inode_lock or the dio_sem, depending on whether the O_DSYNC DIO write
is wholly within EOF or not, but the deadlock is guaranteed to
occur. Hence we can completely ignore the "inode_lock vs fsync" side
show, because other btrfs internal locks will trigger the same
issue.
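
Spelled out, the dio_sem inversion is (schematic, using the btrfs
code paths named above):

	/* btrfs O_DSYNC DIO write submission: */
	down_read(&BTRFS_I(inode)->dio_sem);
	iomap_dio_rw(...);
		/* completion, possibly inline in this context:
		 *   iomap_dio_complete()
		 *     generic_write_sync()
		 *       btrfs_sync_file()
		 *         down_write(&BTRFS_I(inode)->dio_sem);
		 *         ^ deadlocks: this task still holds the
		 *           read side taken above */
	up_read(&BTRFS_I(inode)->dio_sem);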

If this is correct, then how did this "should happen instantly"
bug go undiscovered for months?

Well....  It appears that fstests has no coverage of non-AIO
O_DSYNC DIO writes.  Tools like fsx and fsstress don't have O_SYNC,
O_DSYNC or RWF_DSYNC modes, and O_SYNC and O_DSYNC are not used by any
of the common DIO test programs. xfs_io can open files O_SYNC and
there's a test that uses xfs_io's RWF_DSYNC capability, but they all
use buffered IO, so none of the tests that open or write data
synchronously use direct IO.
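
For example, a unit test would only need something like this (xfs_io's
-d opens O_DIRECT, -s opens O_SYNC; path is illustrative) to exercise
the non-AIO synchronous DIO write path:

	$ xfs_io -f -d -s -c "pwrite -b 64k 0 1m" /mnt/scratch/testfile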

The only conclusion I can make from this is that the case that
should deadlock instantly isn't actually covered by fstests at all.
This conclusion only stands up if this O_DSYNC code path was only
"tested" for feature coverage with fstests. However, it does imply
that no pre-implementation audit was done to determine if fstests
actually covered all the functionality that needed to be tested
here....

I tend to use xfs_io and fio for DIO feature correctness testing
long before I run fstests on new code.  That's how I developed and
tested the FUA optimisations - xfs_io and fio w/ RWF_DSYNC on XFS on
iscsi - so I've never noticed that fstests doesn't actually exercise
this syscall path directly.

Granted, the problem was eventually discovered by fstests, but this
also raised questions. The failing test was an AIO+DIO O_DSYNC test,
but the trigger has been described as an "unusual event on a rarely
tested configuration".

That "unusual event" was an DIO completion being run from submission
context because the IO completed before the submission had been
finish. This is not actually unusual - it's how all AIO on
synchronous IO devices complete. i.e. if you have a ram device or a
(fake) pmem device, every single AIO will complete in this way as
the "IO reference" held by submit_bio() is completed inside
submit_bio() before it returns to the submission context. Hence the
submission context always drops the last IO reference and completes
the IO.
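
In the iomap DIO code, that reference pattern looks roughly like this
(heavily simplified from fs/iomap/direct-io.c; part of it is visible
in Christoph's patch later in this thread):

	atomic_set(&dio->ref, 1);	/* submission context's reference */
	/* each bio: atomic_inc(&dio->ref) before submit_bio(), and
	 * atomic_dec_and_test() in its end_io handler; whoever drops
	 * the last reference runs iomap_dio_complete(). */

	/* at the end of submission: */
	if (!atomic_dec_and_test(&dio->ref)) {
		if (!wait_for_completion)
			return -EIOCBQUEUED;	/* normal AIO case */
		/* sync IO: wait for the bios to finish */
	}
	/* if the bios beat us here (zram, ramdisk, fake pmem), this
	 * was the last reference, and the *submitter* now runs
	 * iomap_dio_complete() - and generic_write_sync() - under
	 * whatever locks it is still holding. */
	return iomap_dio_complete(dio);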

Therefore running fstests on a ramdisk or (fake) pmem would have
triggered this deadlock *instantly* on the first O_DSYNC AIO+DIO
write that fstests issued. This implies that btrfs is rarely tested
on fast synchronous storage devices despite ramdisks being available
on every test machine that can run fstests. To provide a contrast,
the iomap infrastructure is regularly tested on such devices - both
Darrick and I have (fake) pmem test setups and exercise
synchronous completion code paths like this on a daily basis.

Hence it looks to me like the answer to the "why wasn't this found
earlier" question is a combination of multiple factors:

1. fstests has no specific non-AIO O_DSYNC DIO write unit tests, nor
do the stress tests allow the use of O_DSYNC or RWF_DSYNC.

2. No test coverage audit was performed prior to making a critical
change to the btrfs IO path so this specific lack of coverage was
not noticed until now.

3. after the first revert of this functionality, post-mortem
analysis either wasn't performed or didn't identify process and/or
test coverage issues that allowed serious issues in the patch set to
go undiscovered.

4. tools and benchmarks that could have easily discovered the
problem either weren't run or they weren't configured to test
and exercise all the IO path features the change affected.

5. btrfs is not regularly tested on a variety of storage that have
distinctly different IO path behaviours.

> If we're going to mix interfaces then it should be
> blatantly obvious to developers that's what's happening so they find out
> during development, not after the patches have landed, and certainly not
> after they've made it out to users.  Thanks,

As the above indicates, this issue _should_ have been blatantly
obvious months ago and documentation would not change this fact.
IOWs, even if the iomap requirement was documented and followed, a
locking bug in the btrfs implementation would still not have been
discovered until now because that's how long it took to actually
exercise the buggy code path and expose it.

So, yeah, the lack of documentation contributed to the bug being
present. But taking 6 months to actually exercise the new code
containing the bug is most definitely not an interface documentation
problem, nor a problem that can be fixed by correcting the interface
documentation....

Cheers,

Dave.
Filipe Manana Sept. 3, 2020, 9:49 a.m. UTC | #22
On Thu, Sep 3, 2020 at 3:28 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Sep 02, 2020 at 08:42:25AM -0400, Josef Bacik wrote:
> > On 9/2/20 8:20 AM, Dave Chinner wrote:
> > > On Wed, Sep 02, 2020 at 12:44:14PM +0100, Matthew Wilcox wrote:
> > > > On Wed, Sep 02, 2020 at 09:58:30AM +1000, Dave Chinner wrote:
> > > > > Put simply: converting a filesystem to use iomap is not a "change
> > > > > the filesystem interfacing code and it will work" modification.  We
> > > > > ask that filesystems are modified to conform to the iomap IO
> > > > > exclusion model; adding special cases for every potential
> > > > > locking and mapping quirk every different filesystem has is part of
> > > > > what turned the old direct IO code into an unmaintainable nightmare.
> > > > >
> > > > > > That's fine, but this is kind of a bad way to find
> > > > > > out.  We really shouldn't have generic helpers that have different generic
> > > > > > locking rules based on which file system uses them.
> > > > >
> > > > > We certainly can change the rules for new infrastructure. Indeed, we
> > > > > had to change the rules to support DAX.  The whole point of the
> > > > > iomap infrastructure was that it enabled us to use code that already
> > > > > worked for DAX (the XFS code) in multiple filesystems. And as people
> > > > > have realised that the DIO via iomap is much faster than the old DIO
> > > > > code and is a much more efficient way of doing large buffered IO,
> > > > > other filesystems have started to use it.
> > > > >
> > > > > However....
> > > > >
> > > > > > Because then we end up
> > > > > > with situations like this, where suddenly we're having to come up with some
> > > > > > weird solution because the generic thing only works for a subset of file
> > > > > > systems.  Thanks,
> > > > >
> > > > > .... we've always said "you need to change the filesystem code to
> > > > > use iomap". This is simply a reflection on the fact that iomap has
> > > > > different rules and constraints to the old code and so it's not a
> > > > > direct plug in replacement. There are no short cuts here...
> > > >
> > > > Can you point me (and I suspect Josef!) towards the documentation of the
> > > > locking model?  I was hoping to find Documentation/filesystems/iomap.rst
> > > > but all the 'iomap' strings in Documentation/ refer to pci_iomap and
> > > > similar, except for this in the DAX documentation:
> > >
> > > There's no locking model documentation because there is no locking
> > > in the iomap direct IO code. The filesystem defines and does all the
> > > locking, so there's pretty much nothing to document for iomap.
> > >
> > > IOWs, the only thing iomap_dio_rw requires is that the IO completion
> > > paths do not take the same locks that the IO submission path
> > > requires. And that's because:
> > >
> > > /*
> > >   * iomap_dio_rw() always completes O_[D]SYNC writes regardless of whether the IO
> > >   * is being issued as AIO or not. [...]
> > >
> > > So you obviously can't sit waiting for dio completion in
> > > iomap_dio_rw() while holding the submission lock if completion
> > > requires the submission lock to make progress.
> > >
> > > FWIW, iomap_dio_rw() originally required the inode_lock() to be held
> > > and contained a lockdep assert to verify this, but....
> > >
> > > commit 3ad99bec6e82e32fa9faf2f84e74b134586b46f7
> > > Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > Date:   Sat Nov 30 09:59:25 2019 -0600
> > >
> > >      iomap: remove lockdep_assert_held()
> > >      Filesystems such as btrfs can perform direct I/O without holding the
> > >      inode->i_rwsem in some of the cases like writing within i_size.  So,
> > >      remove the check for lockdep_assert_held() in iomap_dio_rw().
> > >      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > >      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > >      Signed-off-by: David Sterba <dsterba@suse.com>
> > >
> > > ... btrfs has special corner cases for direct IO locking and hence
> > > we removed the lockdep assert....
> > >
> > > IOWs, iomap_dio_rw() really does not care what strategy filesystems
> > > use to serialise DIO against other operations.  Filesystems can use
> > > whatever IO serialisation mechanism they want (mutex, rwsem, range
> > > locks, etc) as long as they obey the one simple requirement: do not
> > > take the DIO submission lock in the DIO completion path.
> > >
> >
> > Goldwyn has been working on these patches for a long time, and is actually
> > familiar with this code, and he missed that these two interfaces are being
> > mixed.  This is a problem that I want to solve.  He didn't notice it in any
> > of his testing, which IIRC was like 6 months to get this stuff actually into
> > the btrfs tree.
>
> And that "been working on it for 6 months" is what really worries me
> the most. Hence I've done some post-mortem analysis of the
> "inode_lock" deadlock situation. I decided to do this after I
> realised this same patch set was merged last cycle and reverted late in
> the cycle because of problems it was found to have with page
> invalidation when mixed IO types were used.
>
> I did a bit of looking around yesterday afternoon, both in the btrfs
> code to understand the iomap_dio_rw changes and at our test
> coverage for DIO functionality.
>
> The more I dug, the more I'm finding that the most important
> question we need to answer is not "why wasn't this iomap locking
> requirement documented", the question we should be trying to answer
> is this:
>
>         Why wasn't this *obvious* deadlock detected months ago?
>
> First of all, the inode_lock() complaints are largely a red herring
> as the typical btrfs DIO write path through the patchset does not
> take the inode lock and hence it is actually "safe" to take the
> inode lock in the completion path in the common case. i.e. for DIO
> writes wholly inside EOF, submission drops the inode_lock() before
> calling iomap_dio_rw() and so there is no inode_lock() deadlock
> present. From that view point, it looks like there's only a specific
> corner case where this deadlock might occur and that would explain
> why it hasn't been discovered until now.
>
> However.
>
> The new btrfs dio write path always does
> down_read(&BTRFS_I(inode)->dio_sem), even when the inode lock has not
> been taken.  That's important because btrfs_sync_file() always does a
> down_write(&BTRFS_I(inode)->dio_sem) call and will deadlock in IO
> completion if the IO submission code is holding the dio_sem.

Taking a look at the latest integration branch (misc-next) after 2
weeks away, the new iomap dio switch does indeed still have that
problem.
I reported that in June:

https://lore.kernel.org/linux-btrfs/CAL3q7H4F9iQJy3tgwZrWOKwenAnnn7oSthQZUMEJ_vWx3WE3WQ@mail.gmail.com/

For me, the deadlock always happens with generic/113.


>
> So regardless of the inode_lock(), non-AIO O_DSYNC DIO writes
> through iomap_dio_rw() should deadlock immediately on the first IO
> with this locking arrangement. It will deadlock on either the
> inode_lock or the dio_sem, depending on whether the O_DSYNC DIO write
> is wholly within EOF or not, but the deadlock is guaranteed to
> occur. Hence we can completely ignore the "inode_lock vs fsync" side
> show, because other btrfs internal locks will trigger the same
> issue.
>
> If this is correct, then how did this "should happen instantly"
> bug go undiscovered for months?
>
> Well....  It appears that fstests has no coverage of non-AIO
> O_DSYNC DIO writes.  Tools like fsx and fsstress don't have O_SYNC,
> O_DSYNC or RWF_DSYNC modes, and O_SYNC and O_DSYNC are not used by any
> of the common DIO test programs. xfs_io can open files O_SYNC and
> there's a test that uses xfs_io's RWF_DSYNC capability, but they all
> use buffered IO, so none of the tests that open or write data
> synchronously use direct IO.
>
> The only conclusion I can make from this is that the case that
> should deadlock instantly isn't actually covered by fstests at all.
> This conclusion only stands up if this O_DSYNC code path was only
> "tested" for feature coverage with fstests. However, it does imply
> that no pre-implementation audit was done to determine if fstests
> actually covered all the functionality that needed to be tested
> here....
>
> I tend to use xfs_io and fio for DIO feature correctness testing
> long before I run fstests on new code.  That's how I developed and
> tested the FUA optimisations - xfs_io and fio w/ RWF_DSYNC on XFS on
> iscsi - so I've never noticed that fstests doesn't actually exercise
> this syscall path directly.
>
> Granted, the problem was eventually discovered by fstests, but this
> also raised questions. The failing test was an AIO+DIO O_DSYNC test,
> but the trigger has been described as an "unusual event on a rarely
> tested configuration".
>
> That "unusual event" was an DIO completion being run from submission
> context because the IO completed before the submission had been
> finish. This is not actually unusual - it's how all AIO on
> synchronous IO devices complete. i.e. if you have a ram device or a
> (fake) pmem device, every single AIO will complete in this way as
> the "IO reference" held by submit_bio() is completed inside
> submit_bio() before it returns to the submission context. Hence the
> submission context always drops the last IO reference and completes
> the IO.
>
> Therefore running fstests on a ramdisk or (fake) pmem would have
> triggered this deadlock *instantly* on the first O_DSYNC AIO+DIO
> write that fstests issued. This implies that btrfs is rarely tested
> on fast synchronous storage devices despite ramdisks being available
> on every test machine that can run fstests. To provide a contrast,
> the iomap infrastructure is regularly tested on such devices - both
> Darrick and I have (fake) pmem test setups and exercise
> synchronous completion code paths like this on a daily basis.
>
> Hence it looks to me like the answer to the "why wasn't this found
> earlier" question is a combination of multiple factors:
>
> 1. fstests has no specific non-AIO O_DSYNC DIO write unit tests, nor
> do the stress tests allow the use of O_DSYNC or RWF_DSYNC.
>
> 2. No test coverage audit was performed prior to making a critical
> change to the btrfs IO path so this specific lack of coverage was
> not noticed until now.
>
> 3. after the first revert of this functionality, post-mortem
> analysis either wasn't performed or didn't identify process and/or
> test coverage issues that allowed serious issues in the patch set to
> go undiscovered.
>
> 4. tools and benchmarks that could have easily discovered the
> problem either weren't run or they weren't configured to test
> and exercise all the IO path features the change affected.
>
> 5. btrfs is not regularly tested on a variety of storage that have
> distinctly different IO path behaviours.
>
> > If we're going to mix interfaces then it should be
> > blatantly obvious to developers that's what's happening so they find out
> > during development, not after the patches have landed, and certainly not
> > after they've made it out to users.  Thanks,
>
> As the above indicates, this issue _should_ have been blatantly
> obvious months ago and documentation would not change this fact.
> IOWs, even if the iomap requirement was documented and followed, a
> locking bug in the btrfs implementation would still not have been
> discovered until now because that's how long it took to actually
> exercise the buggy code path and expose it.
>
> So, yeah, the lack of documentation contributed to the bug being
> present. But taking 6 months to actually exercise the new code
> containing the bug is most definitely not an interface documentation
> problem, nor a problem that can be fixed by correcting the interface
> documentation....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
Christoph Hellwig Sept. 3, 2020, 4:32 p.m. UTC | #23
We could trivially do something like this to allow the file system
to call iomap_dio_complete without i_rwsem:

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index c1aafb2ab99072..d970c6bbbe115d 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -76,7 +76,7 @@ static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
 		dio->submit.cookie = submit_bio(bio);
 }
 
-static ssize_t iomap_dio_complete(struct iomap_dio *dio)
+ssize_t iomap_dio_complete(struct iomap_dio *dio)
 {
 	const struct iomap_dio_ops *dops = dio->dops;
 	struct kiocb *iocb = dio->iocb;
@@ -130,6 +130,7 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(iomap_dio_complete);
 
 static void iomap_dio_complete_work(struct work_struct *work)
 {
@@ -406,8 +407,8 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
  * Returns -ENOTBLK In case of a page invalidation invalidation failure for
  * writes.  The callers needs to fall back to buffered I/O in this case.
  */
-ssize_t
-iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
+struct iomap_dio *
+__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
 		bool wait_for_completion)
 {
@@ -421,14 +422,14 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	struct iomap_dio *dio;
 
 	if (!count)
-		return 0;
+		return NULL;
 
 	if (WARN_ON(is_sync_kiocb(iocb) && !wait_for_completion))
-		return -EIO;
+		return ERR_PTR(-EIO);
 
 	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
 	if (!dio)
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 
 	dio->iocb = iocb;
 	atomic_set(&dio->ref, 1);
@@ -558,7 +559,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	dio->wait_for_completion = wait_for_completion;
 	if (!atomic_dec_and_test(&dio->ref)) {
 		if (!wait_for_completion)
-			return -EIOCBQUEUED;
+			return ERR_PTR(-EIOCBQUEUED);
 
 		for (;;) {
 			set_current_state(TASK_UNINTERRUPTIBLE);
@@ -574,10 +575,25 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		__set_current_state(TASK_RUNNING);
 	}
 
-	return iomap_dio_complete(dio);
+	return dio;
 
 out_free_dio:
 	kfree(dio);
-	return ret;
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL_GPL(__iomap_dio_rw);
+
+ssize_t
+iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
+		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
+		bool wait_for_completion)
+{
+	struct iomap_dio *dio;
+
+	dio = __iomap_dio_rw(iocb, iter, ops, dops, wait_for_completion);
+	if (IS_ERR_OR_NULL(dio))
+		return PTR_ERR_OR_ZERO(dio);
+	return iomap_dio_complete(dio);
 }
 EXPORT_SYMBOL_GPL(iomap_dio_rw);
+
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 4d1d3c3469e9a4..172b3397a1a371 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -13,6 +13,7 @@
 struct address_space;
 struct fiemap_extent_info;
 struct inode;
+struct iomap_dio;
 struct iomap_writepage_ctx;
 struct iov_iter;
 struct kiocb;
@@ -258,6 +259,10 @@ struct iomap_dio_ops {
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
 		bool wait_for_completion);
+struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
+		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
+		bool wait_for_completion);
+ssize_t iomap_dio_complete(struct iomap_dio *dio);
 int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
 
 #ifdef CONFIG_SWAP
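
A minimal sketch of how a caller could consume this split API: submit
with i_rwsem held, then run the completion after dropping the lock so
that ->fsync may take it safely.  The example_dio_write() wrapper below
is an illustrative assumption, not part of the patch:

static ssize_t example_dio_write(struct kiocb *iocb, struct iov_iter *from,
		const struct iomap_ops *ops, const struct iomap_dio_ops *dops)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	struct iomap_dio *dio;

	inode_lock(inode);
	/* ... filesystem-specific checks, reservations, etc. ... */
	dio = __iomap_dio_rw(iocb, from, ops, dops, is_sync_kiocb(iocb));
	inode_unlock(inode);

	if (IS_ERR_OR_NULL(dio))
		return PTR_ERR_OR_ZERO(dio);
	/* i_rwsem is no longer held, so ->fsync may take it safely */
	return iomap_dio_complete(dio);
}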
Josef Bacik Sept. 3, 2020, 4:46 p.m. UTC | #24
On 9/3/20 12:32 PM, Christoph Hellwig wrote:
> We could trivially do something like this to allow the file system
> to call iomap_dio_complete without i_rwsem:
> 

This is sort of how I envisioned it working, which is why I was surprised when
it bit us.  This would be the cleanest stop-gap, but as it stands I've already
sent a patch to work around the problem in btrfs for now, and we plan to take
up the locking rework next go-around.  Thanks,

Josef
Dave Chinner Sept. 7, 2020, 12:04 a.m. UTC | #25
On Thu, Sep 03, 2020 at 06:32:36PM +0200, Christoph Hellwig wrote:
> We could trivially do something like this to allow the file system
> to call iomap_dio_complete without i_rwsem:

That just exposes another deadlock vector:

P0			P1
inode_lock()		fallocate(FALLOC_FL_ZERO_RANGE)
__iomap_dio_rw()	inode_lock()
			<block>
<submits IO>
<completes IO>
inode_unlock()
			<gets inode_lock()>
			inode_dio_wait()
iomap_dio_complete()
  generic_write_sync()
    btrfs_file_fsync()
      inode_lock()
      <deadlock>

Basically, the only safe thing to do is to implement ->fsync without
holding the DIO submission lock....

Cheers,

Dave.
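
One possible shape of that, assuming hypothetical example_* names and a
private fsync_mutex member: ->fsync serializes on a dedicated lock
rather than i_rwsem, so it can run from DIO completion while the
submitter still holds i_rwsem.

struct example_inode {
	struct inode	vfs_inode;
	struct mutex	fsync_mutex;	/* hypothetical, replaces i_rwsem here */
};

static int example_fsync(struct file *file, loff_t start, loff_t end,
		int datasync)
{
	struct example_inode *ei = container_of(file_inode(file),
						struct example_inode, vfs_inode);
	int ret;

	/* wait for data writeback without taking i_rwsem */
	ret = file_write_and_wait_range(file, start, end);
	if (ret)
		return ret;

	mutex_lock(&ei->fsync_mutex);	/* deliberately not inode_lock() */
	/* ... flush metadata / force the log for the range ... */
	mutex_unlock(&ei->fsync_mutex);
	return 0;
}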
Goldwyn Rodrigues Sept. 15, 2020, 9:48 p.m. UTC | #26
On 10:04 07/09, Dave Chinner wrote:
> On Thu, Sep 03, 2020 at 06:32:36PM +0200, Christoph Hellwig wrote:
> > We could trivially do something like this to allow the file system
> > to call iomap_dio_complete without i_rwsem:
> 
> That just exposes another deadlock vector:
> 
> P0			P1
> inode_lock()		fallocate(FALLOC_FL_ZERO_RANGE)
> __iomap_dio_rw()	inode_lock()
> 			<block>
> <submits IO>
> <completes IO>
> inode_unlock()
> 			<gets inode_lock()>
> 			inode_dio_wait()
> iomap_dio_complete()
>   generic_write_sync()
>     btrfs_file_fsync()
>       inode_lock()
>       <deadlock>

Can inode_dio_end() be called before generic_write_sync(), as it is done
in fs/direct-io.c:dio_complete()?

Christoph's solution is a clean approach and I would prefer to use it as
the final solution.
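
For context, the ordering Goldwyn refers to, paraphrased (not verbatim)
from fs/direct-io.c:dio_complete(); struct dio is private to that file:

static ssize_t dio_complete_sketch(struct dio *dio, ssize_t ret,
		unsigned int flags)
{
	/* ... invalidate the page cache over the written range ... */

	inode_dio_end(dio->inode);	/* wakes inode_dio_wait() waiters */

	if (flags & DIO_COMPLETE_ASYNC) {
		/* the O_DSYNC sync runs after the DIO count was dropped */
		if (ret > 0 && dio->op == REQ_OP_WRITE)
			ret = generic_write_sync(dio->iocb, ret);
		dio->iocb->ki_complete(dio->iocb, ret, 0);
	}
	return ret;
}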
Dave Chinner Sept. 17, 2020, 3:09 a.m. UTC | #27
On Tue, Sep 15, 2020 at 04:48:53PM -0500, Goldwyn Rodrigues wrote:
> On 10:04 07/09, Dave Chinner wrote:
> > On Thu, Sep 03, 2020 at 06:32:36PM +0200, Christoph Hellwig wrote:
> > > We could trivially do something like this to allow the file system
> > > to call iomap_dio_complete without i_rwsem:
> > 
> > That just exposes another deadlock vector:
> > 
> > P0			P1
> > inode_lock()		fallocate(FALLOC_FL_ZERO_RANGE)
> > __iomap_dio_rw()	inode_lock()
> > 			<block>
> > <submits IO>
> > <completes IO>
> > inode_unlock()
> > 			<gets inode_lock()>
> > 			inode_dio_wait()
> > iomap_dio_complete()
> >   generic_write_sync()
> >     btrfs_file_fsync()
> >       inode_lock()
> >       <deadlock>
> 
> Can inode_dio_end() be called before generic_write_sync(), as it is done
> in fs/direct-io.c:dio_complete()?

Don't think so.  inode_dio_wait() is supposed to indicate that all
DIO is complete, and having the "make it stable" parts of an O_DSYNC
DIO still running after inode_dio_wait() returns means that we still
have DIO running....

For some filesystems, ensuring the DIO data is stable may involve
flushing other data (perhaps we did EOF zeroing before the file
extending DIO) and/or metadata to the log, so we need to guarantee
these DIO related operations are complete and stable before we say
the DIO is done.

> Christoph's solution is a clean approach and would prefer to use it as
> the final solution.

/me shrugs

Christoph's solution simply means you can't use inode_dio_wait() in
the filesystem. btrfs would need its own DIO barrier....

Cheers,

Dave.
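
Such a btrfs-private barrier could look roughly like the sketch below,
with made-up example_* names: DIO submitters take a rwsem shared, and
the paths that would otherwise call inode_dio_wait() take it exclusive.

struct example_inode_dio {
	struct rw_semaphore	dio_sem;	/* hypothetical private barrier */
};

static inline void example_dio_start(struct example_inode_dio *ei)
{
	down_read(&ei->dio_sem);	/* many DIOs may be in flight at once */
}

static inline void example_dio_done(struct example_inode_dio *ei)
{
	up_read(&ei->dio_sem);
}

/* used where inode_dio_wait() would have been, e.g. in truncate */
static inline void example_dio_barrier(struct example_inode_dio *ei)
{
	down_write(&ei->dio_sem);
	up_write(&ei->dio_sem);
}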
Christoph Hellwig Sept. 17, 2020, 5:52 a.m. UTC | #28
On Thu, Sep 17, 2020 at 01:09:42PM +1000, Dave Chinner wrote:
> > > iomap_dio_complete()
> > >   generic_write_sync()
> > >     btrfs_file_fsync()
> > >       inode_lock()
> > >       <deadlock>
> > 
> > Can inode_dio_end() be called before generic_write_sync(), as it is done
> > in fs/direct-io.c:dio_complete()?
> 
> Don't think so.  inode_dio_wait() is supposed to indicate that all
> DIO is complete, and having the "make it stable" parts of an O_DSYNC
> DIO still running after inode_dio_wait() returns means that we still
> have DIO running....
> 
> For some filesystems, ensuring the DIO data is stable may involve
> flushing other data (perhaps we did EOF zeroing before the file
> extending DIO) and/or metadata to the log, so we need to guarantee
> these DIO related operations are complete and stable before we say
> the DIO is done.

inode_dio_wait really just waits for active I/O that writes to or reads
from the file.  It does not imply that the I/O is stable, just like
i_rwsem itself doesn't.  Various file systems have historically called
the syncing outside i_rwsem and inode_dio_wait (in fact that is what the
fs/direct-io.c code does, so XFS did as well until a few years ago), and
that isn't a problem at all - we just can't return to userspace (or call
ki_complete for in-kernel users) before the data is stable on disk.
Dave Chinner Sept. 17, 2020, 6:29 a.m. UTC | #29
On Thu, Sep 17, 2020 at 07:52:32AM +0200, Christoph Hellwig wrote:
> On Thu, Sep 17, 2020 at 01:09:42PM +1000, Dave Chinner wrote:
> > > > iomap_dio_complete()
> > > >   generic_write_sync()
> > > >     btrfs_file_fsync()
> > > >       inode_lock()
> > > >       <deadlock>
> > > 
> > > Can inode_dio_end() be called before generic_write_sync(), as it is done
> > > in fs/direct-io.c:dio_complete()?
> > 
> > Don't think so.  inode_dio_wait() is supposed to indicate that all
> > DIO is complete, and having the "make it stable" parts of an O_DSYNC
> > DIO still running after inode_dio_wait() returns means that we still
> > have DIO running....
> > 
> > For some filesystems, ensuring the DIO data is stable may involve
> > flushing other data (perhaps we did EOF zeroing before the file
> > extending DIO) and/or metadata to the log, so we need to guarantee
> > these DIO related operations are complete and stable before we say
> > the DIO is done.
> 
> inode_dio_wait really just waits for active I/O that writes to or reads
> from the file.  It does not imply that the I/O is stable, just like
> i_rwsem itself doesn't.

No, but iomap_dio_rw() considers an O_DSYNC write to be incomplete
until it is stable so that it presents consistent behaviour to
anything calling inode_dio_wait().

> Various file systems have historically called
> the syncing outside i_rwsem and inode_dio_wait (in fact that is what the
> fs/direct-io.c code does, so XFS did as well until a few years ago), and
> that isn't a problem at all - we just can't return to userspace (or call
> ki_complete for in-kernel users) before the data is stable on disk.

I'm really not caring about userspace here - we use inode_dio_wait()
as an IO completion notification for the purposes of synchronising
internal filesystem state before modifying user data via direct
metadata manipulation. Hence I want sane, consistent, predictable IO
completion notification behaviour regardless of the implementation
path it goes through.

Cheers,

Dave.
Christoph Hellwig Sept. 17, 2020, 6:42 a.m. UTC | #30
On Thu, Sep 17, 2020 at 04:29:23PM +1000, Dave Chinner wrote:
> > inode_dio_wait really just waits for active I/O that writes to or reads
> > from the file.  It does not imply that the I/O is stable, just like
> > i_rwsem itself doesn't.
> 
> No, but iomap_dio_rw() considers an O_DSYNC write to be incomplete
> until it is stable so that it presents consistent behaviour to
> anything calling inode_dio_wait().

But the point is that inode_dio_wait does not care about that
"consistency".  It cares about when the I/O is done.  I know because I
wrote it (and I regret that, as we should have stuck with the non-owner
release of the rwsem, which makes a whole lot more sense).

> 
> > Various file systems have historically called
> > the syncing outside i_rwsem and inode_dio_wait (in fact that is what the
> > fs/direct-io.c code does, so XFS did as well until a few years ago), and
> > that isn't a problem at all - we just can't return to userspace (or call
> > ki_complete for in-kernel users) before the data is stable on disk.
> 
> I'm really not caring about userspace here - we use inode_dio_wait()
> as an IO completion notification for the purposes of synchronising
> internal filesystem state before modifying user data via direct
> metadata manipulation. Hence I want sane, consistent, predictable IO
> completion notification behaviour regardless of the implementation
> path it goes through.

And none of that consistency matters.  Think of it:

 - an O_(D)SYNC write is nothing but a write plus a ranged fsync
   (sketched after this list), even if we do some optimizations to
   speed up the fsync by e.g. using the FUA flag
 - another fsync can come up at any time after we completed a write
   (with or without O_SYNC)
 - so any synchronization using inode_dio_wait (or i_rwsem for that
   matter) must not care if an fsync runs in parallel.
 - take a look at where we call inode_dio_wait to verify this - the
   prime original use case was truncate, as we can't have I/O in
   progress while truncating.  We then later extended it to all the
   more complicated truncate-like operations like hole punches, extent
   insertion and collapse, etc.  But in all those cases what matters is
   the actual I/O, not the sync.  By having done direct I/O the
   page cache side of the sync doesn't matter to start with (but
   the callers all invalidate it anyway), so what matters is the metadata
   flush, aka the log force in the XFS case.  And for that we absolutely
   do not need to complete before inode_dio_wait returns.
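
The first point is visible in generic_write_sync() itself; condensed
from include/linux/fs.h of that era (check your tree for the exact
version):

static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
{
	if (iocb->ki_flags & IOCB_DSYNC) {
		/* the "ranged fsync" half of an O_(D)SYNC write */
		int ret = vfs_fsync_range(iocb->ki_filp,
				iocb->ki_pos - count, iocb->ki_pos - 1,
				(iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
		if (ret)
			return ret;
	}

	return count;
}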

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
---end quoted text---
diff mbox series

Patch

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b62679382799..c75c0f2a5f72 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2023,6 +2023,7 @@  static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 		atomic_inc(&BTRFS_I(inode)->sync_writers);
 
 	if (iocb->ki_flags & IOCB_DIRECT) {
+		iocb->ki_flags &= ~IOCB_DSYNC;
 		num_written = __btrfs_direct_write(iocb, from);
 	} else {
 		num_written = btrfs_buffered_write(iocb, from);
@@ -2046,8 +2047,10 @@  static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	if (num_written > 0)
 		num_written = generic_write_sync(iocb, num_written);
 
-	if (sync)
+	if (sync) {
+		iocb->ki_flags |= IOCB_DSYNC;
 		atomic_dec(&BTRFS_I(inode)->sync_writers);
+	}
 out:
 	current->backing_dev_info = NULL;
 	return num_written ? num_written : err;
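
For reference, why clearing IOCB_DSYNC around __btrfs_direct_write()
suppresses the problematic callback: iomap latches the sync decision
from the iocb flags at submission time, and its completion path only
calls generic_write_sync(), and hence btrfs_sync_file(), when that flag
was latched.  Condensed from 5.9-era fs/iomap/direct-io.c, not verbatim:

/* in iomap_dio_rw(), at submission time: */
	if (iov_iter_rw(iter) == WRITE && (iocb->ki_flags & IOCB_DSYNC))
		dio->flags |= IOMAP_DIO_NEED_SYNC;

/* in iomap_dio_complete(): only syncs if the flag was latched above */
	if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
		ret = generic_write_sync(iocb, ret);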