[4/7] btrfs: Switch to iomap_dio_rw() for dio

Message ID	20200522123837.1196-5-rgoldwyn@suse.de (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=4/Cm=7E=vger.kernel.org=linux-btrfs-owner@kernel.org> From: Goldwyn Rodrigues <rgoldwyn@suse.de> To: linux-btrfs@vger.kernel.org Cc: hch@infradead.org, dsterba@suse.cz, Goldwyn Rodrigues <rgoldwyn@suse.com> Subject: [PATCH 4/7] btrfs: Switch to iomap_dio_rw() for dio Date: Fri, 22 May 2020 07:38:34 -0500 Message-Id: <20200522123837.1196-5-rgoldwyn@suse.de> In-Reply-To: <20200522123837.1196-1-rgoldwyn@suse.de> References: <20200522123837.1196-1-rgoldwyn@suse.de> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk
Series	btrfs direct-io using iomap \| expand [0/7,v8] btrfs direct-io using iomap [1/7] fs: Export generic_file_buffered_read() [2/7] iomap: add a filesystem hook for direct I/O bio submission [3/7] iomap: Remove lockdep_assert_held() [4/7] btrfs: Switch to iomap_dio_rw() for dio [5/7] fs: Remove dio_end_io() [6/7] btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK [7/7] btrfs: btrfs: split btrfs_direct_IO

Goldwyn Rodrigues May 22, 2020, 12:38 p.m. UTC

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Switch from __blockdev_direct_IO() to iomap_dio_rw().
Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it
as iomap_begin() for iomap direct I/O functions. This function
allocates and locks all the blocks required for the I/O.
btrfs_submit_direct() is used as the submit_io() hook for direct I/O
ops.

Since we need direct I/O reads to go through iomap_dio_rw(), we change
file_operations.read_iter() to a btrfs_file_read_iter() which calls
btrfs_direct_IO() for direct reads and falls back to
generic_file_buffered_read() for incomplete reads and buffered reads.

We don't need address_space.direct_IO() anymore so set it to noop.
Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
capable of direct I/O reads from a hole, so we don't need to return
-ENOENT.

BTRFS direct I/O is now done under i_rwsem, shared in case of
reads and exclusive in case of writes. This guards against simultaneous
truncates.

Use iomap->iomap_end() to check for failed or incomplete direct I/O:
For writes, call __endio_write_update_ordered()
For reads, Unlock extents

btrfs_dio_data is now hooked in iomap->private and not
current->journal_info. It carries the reservation variable and the
amount of data submitted, so we can calculate the amount of data to call
__endio_write_update_ordered in case of an error.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |   1 +
 fs/btrfs/file.c  |  21 +++-
 fs/btrfs/inode.c | 313 ++++++++++++++++++++++-------------------------
 3 files changed, 169 insertions(+), 166 deletions(-)

Johannes Thumshirn May 26, 2020, 3:03 p.m. UTC | #1

Just as a heads up, this one gives me lot's of Page cache invalidation
failure prints from dio_warn_stale_pagecache() on btrfs/004 with 
current misc-next:

[   16.607545] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   16.609328] File: /mnt/scratch/next/p0/d0/d77/de2/d5c/dc9/fee PID: 766 Comm: fsstress
[   16.743572] BTRFS info (device zram1): disk space caching is enabled
[   16.744620] BTRFS info (device zram1): has skinny extents
[   16.747458] BTRFS info (device zram1): enabling ssd optimizations
[   18.303877] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   18.305476] File: /mnt/scratch/bgnoise/p0/d5/d53/d21/f27 PID: 2064 Comm: fsstress
[   18.768426] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   18.770074] File: /mnt/scratch/bgnoise/p0/d9/de/f15 PID: 2490 Comm: fsstress
[   18.916118] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   18.917843] File: /mnt/scratch/bgnoise/p0/f0 PID: 2694 Comm: fsstress
[   21.170384] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   21.172375] File: /mnt/scratch/bgnoise/p0/f3 PID: 4325 Comm: fsstress
[   21.812452] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   21.814232] File: /mnt/scratch/bgnoise/p0/fb PID: 5000 Comm: fsstress
[   21.826027] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   21.827741] File: /mnt/scratch/bgnoise/p0/fb PID: 5000 Comm: fsstress
[   22.127966] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   22.129413] File: /mnt/scratch/bgnoise/p0/df/d28/d26/f3b PID: 5196 Comm: fsstress
[   22.160542] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   22.161972] File: /mnt/scratch/bgnoise/p0/df/d10/d5f/f64 PID: 5196 Comm: fsstress
[   23.696400] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   23.698115] File: /mnt/scratch/bgnoise/p0/f0 PID: 6562 Comm: fsstress

I have no idea yet why but I'm investigating.

Goldwyn Rodrigues May 26, 2020, 4:44 p.m. UTC | #2

On 15:03 26/05, Johannes Thumshirn wrote:
> Just as a heads up, this one gives me lot's of Page cache invalidation
> failure prints from dio_warn_stale_pagecache() on btrfs/004 with 
> current misc-next:
> 
<snip>

> [   23.696400] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> [   23.698115] File: /mnt/scratch/bgnoise/p0/f0 PID: 6562 Comm: fsstress
> 
> I have no idea yet why but I'm investigating.

This is caused because we are trying to release a page when the extent
has locked the page and release page returns false. To minimize the
effect, I had proposed a patch [1] in v6. However, this created
more extent locking issues and so was dropped.

[1] https://patchwork.kernel.org/patch/11275063/

Filipe Manana May 28, 2020, 3:13 p.m. UTC | #3

On Tue, May 26, 2020 at 5:47 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
>
> On 15:03 26/05, Johannes Thumshirn wrote:
> > Just as a heads up, this one gives me lot's of Page cache invalidation
> > failure prints from dio_warn_stale_pagecache() on btrfs/004 with
> > current misc-next:
> >
> <snip>
>
> > [   23.696400] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> > [   23.698115] File: /mnt/scratch/bgnoise/p0/f0 PID: 6562 Comm: fsstress
> >
> > I have no idea yet why but I'm investigating.
>
> This is caused because we are trying to release a page when the extent
> has locked the page and release page returns false.

By "we" you mean what exaclty, a direct IO read, a direct IO write?

And who locked the extent range before?

That seems alarming to me, specially if it's a direct IO write failing
to invalidate the page cache, since a subsequent buffered read could
get stale data (what's in the page cache), and not what the direct IO
write wrote.

Can you elaborate more on all those details?

Thanks.


> To minimize the
> effect, I had proposed a patch [1] in v6. However, this created
> more extent locking issues and so was dropped.
>
> [1] https://patchwork.kernel.org/patch/11275063/
>
> --
> Goldwyn

Goldwyn Rodrigues May 28, 2020, 4:34 p.m. UTC | #4

On 16:13 28/05, Filipe Manana wrote:
> On Tue, May 26, 2020 at 5:47 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> >
> > On 15:03 26/05, Johannes Thumshirn wrote:
> > > Just as a heads up, this one gives me lot's of Page cache invalidation
> > > failure prints from dio_warn_stale_pagecache() on btrfs/004 with
> > > current misc-next:
> > >
> > <snip>
> >
> > > [   23.696400] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> > > [   23.698115] File: /mnt/scratch/bgnoise/p0/f0 PID: 6562 Comm: fsstress
> > >
> > > I have no idea yet why but I'm investigating.
> >
> > This is caused because we are trying to release a page when the extent
> > has locked the page and release page returns false.
> 
> By "we" you mean what exaclty, a direct IO read, a direct IO write?
> 
> And who locked the extent range before?

This is usually locked by a previous buffered write or read.

> 
> That seems alarming to me, specially if it's a direct IO write failing
> to invalidate the page cache, since a subsequent buffered read could
> get stale data (what's in the page cache), and not what the direct IO
> write wrote.
> 
> Can you elaborate more on all those details?

The origin of the message is when iomap_dio_rw() tries to invalidate the
inode pages, but fails and calls dio_warn_stale_pagecache().

In the vanilla code, generic_file_direct_write() aborts direct writes
and returns 0 so that it may fallback to buffered I/O. Perhaps this
should be changed in iomap_dio_rw() as well. I will write a patch to
accomodate that.

Filipe Manana May 28, 2020, 4:45 p.m. UTC | #5

On Thu, May 28, 2020 at 5:34 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
>
> On 16:13 28/05, Filipe Manana wrote:
> > On Tue, May 26, 2020 at 5:47 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > >
> > > On 15:03 26/05, Johannes Thumshirn wrote:
> > > > Just as a heads up, this one gives me lot's of Page cache invalidation
> > > > failure prints from dio_warn_stale_pagecache() on btrfs/004 with
> > > > current misc-next:
> > > >
> > > <snip>
> > >
> > > > [   23.696400] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> > > > [   23.698115] File: /mnt/scratch/bgnoise/p0/f0 PID: 6562 Comm: fsstress
> > > >
> > > > I have no idea yet why but I'm investigating.
> > >
> > > This is caused because we are trying to release a page when the extent
> > > has locked the page and release page returns false.
> >
> > By "we" you mean what exaclty, a direct IO read, a direct IO write?
> >
> > And who locked the extent range before?
>
> This is usually locked by a previous buffered write or read.

A previous buffered write/read that has already finished or is still
in progress?

Because if it has finished we're not supposed to have the file range
locked anymore.

>
> >
> > That seems alarming to me, specially if it's a direct IO write failing
> > to invalidate the page cache, since a subsequent buffered read could
> > get stale data (what's in the page cache), and not what the direct IO
> > write wrote.
> >
> > Can you elaborate more on all those details?
>
> The origin of the message is when iomap_dio_rw() tries to invalidate the
> inode pages, but fails and calls dio_warn_stale_pagecache().
>
> In the vanilla code, generic_file_direct_write() aborts direct writes
> and returns 0 so that it may fallback to buffered I/O. Perhaps this
> should be changed in iomap_dio_rw() as well. I will write a patch to
> accomodate that.

On vanilla we have no problems of mixing buffered and direct
operations as long as they are done sequentially at least.
And even if done concurrently we take several measures to ensure that
are no surprises (locking ranges, waiting for any ordered extents in
progress, etc).

Thanks.

>
> --
> Goldwyn

Goldwyn Rodrigues May 28, 2020, 6:38 p.m. UTC | #6

On 17:45 28/05, Filipe Manana wrote:
> On Thu, May 28, 2020 at 5:34 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > > And who locked the extent range before?
> >
> > This is usually locked by a previous buffered write or read.
> 
> A previous buffered write/read that has already finished or is still
> in progress?
> 
> Because if it has finished we're not supposed to have the file range
> locked anymore.

In progress. Mixing buffered I/O with direct writes.

> 
> >
> > >
> > > That seems alarming to me, specially if it's a direct IO write failing
> > > to invalidate the page cache, since a subsequent buffered read could
> > > get stale data (what's in the page cache), and not what the direct IO
> > > write wrote.
> > >
> > > Can you elaborate more on all those details?
> >
> > The origin of the message is when iomap_dio_rw() tries to invalidate the
> > inode pages, but fails and calls dio_warn_stale_pagecache().
> >
> > In the vanilla code, generic_file_direct_write() aborts direct writes
> > and returns 0 so that it may fallback to buffered I/O. Perhaps this
> > should be changed in iomap_dio_rw() as well. I will write a patch to
> > accomodate that.
> 
> On vanilla we have no problems of mixing buffered and direct
> operations as long as they are done sequentially at least.
> And even if done concurrently we take several measures to ensure that
> are no surprises (locking ranges, waiting for any ordered extents in
> progress, etc).

Yes, it is because of the code in generic_file_direct_write(). Anyways,
I did some tests with the following patch, and it seems to work. I will
send a formal patch to so that it gets incorporated in iomap sequence as
well.

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index e4addfc58107..215315be6233 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
         */
        ret = invalidate_inode_pages2_range(mapping,
                        pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
-       if (ret)
-               dio_warn_stale_pagecache(iocb->ki_filp);
-       ret = 0;
+       /*
+        * If a page can not be invalidated, return 0 to fall back
+        * to buffered write.
+        */
+       if (ret) {
+               if (ret == -EBUSY)
+                       ret = 0;
+               goto out_free_dio;
+       }
 
        if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
            !inode->i_sb->s_dio_done_wq) {

Filipe Manana June 3, 2020, 11:35 a.m. UTC | #7

On Thu, May 28, 2020 at 7:38 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
>
> On 17:45 28/05, Filipe Manana wrote:
> > On Thu, May 28, 2020 at 5:34 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > > > And who locked the extent range before?
> > >
> > > This is usually locked by a previous buffered write or read.
> >
> > A previous buffered write/read that has already finished or is still
> > in progress?
> >
> > Because if it has finished we're not supposed to have the file range
> > locked anymore.
>
> In progress. Mixing buffered I/O with direct writes.
>
> >
> > >
> > > >
> > > > That seems alarming to me, specially if it's a direct IO write failing
> > > > to invalidate the page cache, since a subsequent buffered read could
> > > > get stale data (what's in the page cache), and not what the direct IO
> > > > write wrote.
> > > >
> > > > Can you elaborate more on all those details?
> > >
> > > The origin of the message is when iomap_dio_rw() tries to invalidate the
> > > inode pages, but fails and calls dio_warn_stale_pagecache().
> > >
> > > In the vanilla code, generic_file_direct_write() aborts direct writes
> > > and returns 0 so that it may fallback to buffered I/O. Perhaps this
> > > should be changed in iomap_dio_rw() as well. I will write a patch to
> > > accomodate that.
> >
> > On vanilla we have no problems of mixing buffered and direct
> > operations as long as they are done sequentially at least.
> > And even if done concurrently we take several measures to ensure that
> > are no surprises (locking ranges, waiting for any ordered extents in
> > progress, etc).
>
> Yes, it is because of the code in generic_file_direct_write(). Anyways,
> I did some tests with the following patch, and it seems to work. I will
> send a formal patch to so that it gets incorporated in iomap sequence as
> well.
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index e4addfc58107..215315be6233 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>          */
>         ret = invalidate_inode_pages2_range(mapping,
>                         pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> -       if (ret)
> -               dio_warn_stale_pagecache(iocb->ki_filp);
> -       ret = 0;
> +       /*
> +        * If a page can not be invalidated, return 0 to fall back
> +        * to buffered write.
> +        */
> +       if (ret) {
> +               if (ret == -EBUSY)
> +                       ret = 0;
> +               goto out_free_dio;
> +       }
>
>         if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
>             !inode->i_sb->s_dio_done_wq) {
>
>

Thanks. As I just replied on another thread for that patch, we
actually have a regression.
There's more than the annoying warning in dmesg, it also sets -EIO on
the inode's mapping and makes future fsyncs return that error despite
the fact that no actual errors or corruptions happened:

https://patchwork.kernel.org/patch/11576677/



>
> --
> Goldwyn

Filipe Manana June 5, 2020, 3:17 p.m. UTC | #8

On Wed, Jun 3, 2020 at 12:35 PM Filipe Manana <fdmanana@gmail.com> wrote:
>
> On Thu, May 28, 2020 at 7:38 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> >
> > On 17:45 28/05, Filipe Manana wrote:
> > > On Thu, May 28, 2020 at 5:34 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > > > > And who locked the extent range before?
> > > >
> > > > This is usually locked by a previous buffered write or read.
> > >
> > > A previous buffered write/read that has already finished or is still
> > > in progress?
> > >
> > > Because if it has finished we're not supposed to have the file range
> > > locked anymore.
> >
> > In progress. Mixing buffered I/O with direct writes.
> >
> > >
> > > >
> > > > >
> > > > > That seems alarming to me, specially if it's a direct IO write failing
> > > > > to invalidate the page cache, since a subsequent buffered read could
> > > > > get stale data (what's in the page cache), and not what the direct IO
> > > > > write wrote.
> > > > >
> > > > > Can you elaborate more on all those details?
> > > >
> > > > The origin of the message is when iomap_dio_rw() tries to invalidate the
> > > > inode pages, but fails and calls dio_warn_stale_pagecache().
> > > >
> > > > In the vanilla code, generic_file_direct_write() aborts direct writes
> > > > and returns 0 so that it may fallback to buffered I/O. Perhaps this
> > > > should be changed in iomap_dio_rw() as well. I will write a patch to
> > > > accomodate that.
> > >
> > > On vanilla we have no problems of mixing buffered and direct
> > > operations as long as they are done sequentially at least.
> > > And even if done concurrently we take several measures to ensure that
> > > are no surprises (locking ranges, waiting for any ordered extents in
> > > progress, etc).
> >
> > Yes, it is because of the code in generic_file_direct_write(). Anyways,
> > I did some tests with the following patch, and it seems to work. I will
> > send a formal patch to so that it gets incorporated in iomap sequence as
> > well.
> >
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index e4addfc58107..215315be6233 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >          */
> >         ret = invalidate_inode_pages2_range(mapping,
> >                         pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > -       if (ret)
> > -               dio_warn_stale_pagecache(iocb->ki_filp);
> > -       ret = 0;
> > +       /*
> > +        * If a page can not be invalidated, return 0 to fall back
> > +        * to buffered write.
> > +        */
> > +       if (ret) {
> > +               if (ret == -EBUSY)
> > +                       ret = 0;
> > +               goto out_free_dio;
> > +       }
> >
> >         if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> >             !inode->i_sb->s_dio_done_wq) {
> >
> >
>
> Thanks. As I just replied on another thread for that patch, we
> actually have a regression.
> There's more than the annoying warning in dmesg, it also sets -EIO on
> the inode's mapping and makes future fsyncs return that error despite
> the fact that no actual errors or corruptions happened:
>
> https://patchwork.kernel.org/patch/11576677/
>

There's also some deadlock/hang, I have triggered it twice today with
generic/113 on two different VMs:

[14621.297370] INFO: task kworker/1:117:15962 blocked for more than 120 seconds.
[14621.298491]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
[14621.299231] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[14621.300523] kworker/1:117   D    0 15962      2 0x80004000
[14621.301558] Workqueue: dio/sdb iomap_dio_complete_work
[14621.302389] Call Trace:
[14621.302877]  __schedule+0x384/0xa30
[14621.303555]  schedule+0x33/0xe0
[14621.304167]  rwsem_down_write_slowpath+0x2c2/0x750
[14621.305121]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[14621.306217]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[14621.307113]  iomap_dio_complete+0x11b/0x260
[14621.307888]  ? aio_fsync_work+0x5b0/0x5b0
[14621.308585]  iomap_dio_complete_work+0x17/0x30
[14621.309476]  process_one_work+0x275/0x6b0
[14621.310275]  worker_thread+0x4f/0x3e0
[14621.310869]  ? process_one_work+0x6b0/0x6b0
[14621.311403]  kthread+0x12a/0x170
[14621.311819]  ? kthread_create_worker_on_cpu+0x70/0x70
[14621.312460]  ret_from_fork+0x3a/0x50
[14621.312983] INFO: task kworker/1:199:16063 blocked for more than 120 seconds.
[14621.313921]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
[14621.314680] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[14621.315724] kworker/1:199   D    0 16063      2 0x80004000
[14621.316445] Workqueue: dio/sdb iomap_dio_complete_work
[14621.317101] Call Trace:
[14621.317437]  __schedule+0x384/0xa30
[14621.317928]  schedule+0x33/0xe0
[14621.318339]  rwsem_down_write_slowpath+0x2c2/0x750
[14621.318981]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[14621.319609]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[14621.320203]  iomap_dio_complete+0x11b/0x260
[14621.320721]  ? aio_fsync_work+0x5b0/0x5b0
[14621.321249]  iomap_dio_complete_work+0x17/0x30
[14621.321844]  process_one_work+0x275/0x6b0
[14621.322376]  worker_thread+0x4f/0x3e0
[14621.322871]  ? process_one_work+0x6b0/0x6b0
[14621.323408]  kthread+0x12a/0x170
[14621.323827]  ? kthread_create_worker_on_cpu+0x70/0x70
[14621.324473]  ret_from_fork+0x3a/0x50
[14621.324983] INFO: task aio-stress:16274 blocked for more than 120 seconds.
[14621.325896]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
[14621.326579] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[14621.327580] aio-stress      D    0 16274  14855 0x00004000
[14621.328280] Call Trace:
[14621.328602]  __schedule+0x384/0xa30
[14621.329056]  schedule+0x33/0xe0
[14621.329478]  rwsem_down_write_slowpath+0x2c2/0x750
[14621.330118]  ? btrfs_sync_file+0x219/0x4d0 [btrfs]
[14621.330747]  btrfs_sync_file+0x219/0x4d0 [btrfs]
[14621.331346]  iomap_dio_complete+0x11b/0x260
[14621.331886]  iomap_dio_rw+0x3bc/0x4c0
[14621.332372]  ? btrfs_file_write_iter+0x645/0x870 [btrfs]
[14621.333076]  btrfs_file_write_iter+0x645/0x870 [btrfs]
[14621.333749]  aio_write+0x148/0x1d0
[14621.334196]  ? lock_acquire+0xb1/0x4a0
[14621.334682]  ? __might_fault+0x3e/0x90
[14621.335172]  ? __fget_files+0x132/0x270
[14621.335668]  ? io_submit_one+0x946/0x1630
[14621.336184]  io_submit_one+0x946/0x1630
[14621.336680]  ? lock_acquire+0xb1/0x4a0
[14621.337175]  ? __might_fault+0x3e/0x90
[14621.337707]  ? __x64_sys_io_submit+0x9c/0x330
[14621.338269]  __x64_sys_io_submit+0x9c/0x330
[14621.338812]  ? do_syscall_64+0x5c/0x280
[14621.339303]  do_syscall_64+0x5c/0x280
[14621.339774]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
[14621.340416] RIP: 0033:0x7fb6cd395717
[14621.340875] Code: Bad RIP value.
[14621.341304] RSP: 002b:00007fb6bf7e1de8 EFLAGS: 00000202 ORIG_RAX:
00000000000000d1
[14621.342262] RAX: ffffffffffffffda RBX: 0000560d3d92ea60 RCX: 00007fb6cd395717
[14621.343180] RDX: 0000560d3d92ea60 RSI: 0000000000000008 RDI: 00007fb6cdb32000
[14621.344081] RBP: 0000000000000008 R08: 0000150e50ac6651 R09: 00000000003081a8
[14621.344981] R10: 00007fb6bf7e1df0 R11: 0000000000000202 R12: 0000560d3d8fe110
[14621.345897] R13: 00007fb6bf7e1e10 R14: 00007fb6bf7e1e00 R15: 0000560d3d8fe110
[14621.346820] INFO: lockdep is turned off.
[14742.125500] INFO: task kworker/1:117:15962 blocked for more than 241 seconds.
[14742.126456]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
[14742.127156] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[14742.128156] kworker/1:117   D    0 15962      2 0x80004000
[14742.128875] Workqueue: dio/sdb iomap_dio_complete_work
[14742.129633] Call Trace:
[14742.130010]  __schedule+0x384/0xa30
[14742.130494]  schedule+0x33/0xe0
[14742.131068]  rwsem_down_write_slowpath+0x2c2/0x750
[14742.131956]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[14742.132834]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[14742.133712]  iomap_dio_complete+0x11b/0x260
[14742.134475]  ? aio_fsync_work+0x5b0/0x5b0
[14742.135205]  iomap_dio_complete_work+0x17/0x30
[14742.136018]  process_one_work+0x275/0x6b0
[14742.136677]  worker_thread+0x4f/0x3e0
[14742.137154]  ? process_one_work+0x6b0/0x6b0
[14742.137805]  kthread+0x12a/0x170
[14742.138236]  ? kthread_create_worker_on_cpu+0x70/0x70
[14742.138901]  ret_from_fork+0x3a/0x50
[14742.139389] INFO: task kworker/1:199:16063 blocked for more than 241 seconds.
[14742.140305]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
[14742.140998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[14742.142056] kworker/1:199   D    0 16063      2 0x80004000
[14742.142877] Workqueue: dio/sdb iomap_dio_complete_work
[14742.143397] Call Trace:
[14742.143654]  __schedule+0x384/0xa30
[14742.144017]  schedule+0x33/0xe0
[14742.144352]  rwsem_down_write_slowpath+0x2c2/0x750
[14742.144859]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[14742.145386]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[14742.145863]  iomap_dio_complete+0x11b/0x260
[14742.146289]  ? aio_fsync_work+0x5b0/0x5b0
[14742.146701]  iomap_dio_complete_work+0x17/0x30
[14742.147168]  process_one_work+0x275/0x6b0
[14742.147579]  worker_thread+0x4f/0x3e0
[14742.147954]  ? process_one_work+0x6b0/0x6b0
[14742.148377]  kthread+0x12a/0x170
[14742.148722]  ? kthread_create_worker_on_cpu+0x70/0x70
[14742.149257]  ret_from_fork+0x3a/0x50
[14742.149671] INFO: task aio-stress:16274 blocked for more than 241 seconds.
[14742.150376]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
[14742.150948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[14742.151962] aio-stress      D    0 16274  14855 0x00004000
(...)

Thanks.

>
>
> >
> > --
> > Goldwyn
>
>
>
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”

Goldwyn Rodrigues June 5, 2020, 8:43 p.m. UTC | #9

On 16:17 05/06, Filipe Manana wrote:
> On Wed, Jun 3, 2020 at 12:35 PM Filipe Manana <fdmanana@gmail.com> wrote:
> >
> > On Thu, May 28, 2020 at 7:38 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > >
> > > On 17:45 28/05, Filipe Manana wrote:
> > > > On Thu, May 28, 2020 at 5:34 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > > > > > And who locked the extent range before?
> > > > >
> > > > > This is usually locked by a previous buffered write or read.
> > > >
> > > > A previous buffered write/read that has already finished or is still
> > > > in progress?
> > > >
> > > > Because if it has finished we're not supposed to have the file range
> > > > locked anymore.
> > >
> > > In progress. Mixing buffered I/O with direct writes.
> > >
> > > >
> > > > >
> > > > > >
> > > > > > That seems alarming to me, specially if it's a direct IO write failing
> > > > > > to invalidate the page cache, since a subsequent buffered read could
> > > > > > get stale data (what's in the page cache), and not what the direct IO
> > > > > > write wrote.
> > > > > >
> > > > > > Can you elaborate more on all those details?
> > > > >
> > > > > The origin of the message is when iomap_dio_rw() tries to invalidate the
> > > > > inode pages, but fails and calls dio_warn_stale_pagecache().
> > > > >
> > > > > In the vanilla code, generic_file_direct_write() aborts direct writes
> > > > > and returns 0 so that it may fallback to buffered I/O. Perhaps this
> > > > > should be changed in iomap_dio_rw() as well. I will write a patch to
> > > > > accomodate that.
> > > >
> > > > On vanilla we have no problems of mixing buffered and direct
> > > > operations as long as they are done sequentially at least.
> > > > And even if done concurrently we take several measures to ensure that
> > > > are no surprises (locking ranges, waiting for any ordered extents in
> > > > progress, etc).
> > >
> > > Yes, it is because of the code in generic_file_direct_write(). Anyways,
> > > I did some tests with the following patch, and it seems to work. I will
> > > send a formal patch to so that it gets incorporated in iomap sequence as
> > > well.
> > >
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index e4addfc58107..215315be6233 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > >          */
> > >         ret = invalidate_inode_pages2_range(mapping,
> > >                         pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > -       if (ret)
> > > -               dio_warn_stale_pagecache(iocb->ki_filp);
> > > -       ret = 0;
> > > +       /*
> > > +        * If a page can not be invalidated, return 0 to fall back
> > > +        * to buffered write.
> > > +        */
> > > +       if (ret) {
> > > +               if (ret == -EBUSY)
> > > +                       ret = 0;
> > > +               goto out_free_dio;
> > > +       }
> > >
> > >         if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> > >             !inode->i_sb->s_dio_done_wq) {
> > >
> > >
> >
> > Thanks. As I just replied on another thread for that patch, we
> > actually have a regression.
> > There's more than the annoying warning in dmesg, it also sets -EIO on
> > the inode's mapping and makes future fsyncs return that error despite
> > the fact that no actual errors or corruptions happened:
> >
> > https://patchwork.kernel.org/patch/11576677/
> >
> 
> There's also some deadlock/hang, I have triggered it twice today with
> generic/113 on two different VMs:
> 
> [14621.297370] INFO: task kworker/1:117:15962 blocked for more than 120 seconds.
> [14621.298491]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> [14621.299231] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [14621.300523] kworker/1:117   D    0 15962      2 0x80004000
> [14621.301558] Workqueue: dio/sdb iomap_dio_complete_work
> [14621.302389] Call Trace:
> [14621.302877]  __schedule+0x384/0xa30
> [14621.303555]  schedule+0x33/0xe0
> [14621.304167]  rwsem_down_write_slowpath+0x2c2/0x750
> [14621.305121]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> [14621.306217]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> [14621.307113]  iomap_dio_complete+0x11b/0x260
> [14621.307888]  ? aio_fsync_work+0x5b0/0x5b0
> [14621.308585]  iomap_dio_complete_work+0x17/0x30
> [14621.309476]  process_one_work+0x275/0x6b0
> [14621.310275]  worker_thread+0x4f/0x3e0
> [14621.310869]  ? process_one_work+0x6b0/0x6b0
> [14621.311403]  kthread+0x12a/0x170
> [14621.311819]  ? kthread_create_worker_on_cpu+0x70/0x70
> [14621.312460]  ret_from_fork+0x3a/0x50
> [14621.312983] INFO: task kworker/1:199:16063 blocked for more than 120 seconds.
> [14621.313921]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> [14621.314680] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [14621.315724] kworker/1:199   D    0 16063      2 0x80004000
> [14621.316445] Workqueue: dio/sdb iomap_dio_complete_work
> [14621.317101] Call Trace:
> [14621.317437]  __schedule+0x384/0xa30
> [14621.317928]  schedule+0x33/0xe0
> [14621.318339]  rwsem_down_write_slowpath+0x2c2/0x750
> [14621.318981]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> [14621.319609]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> [14621.320203]  iomap_dio_complete+0x11b/0x260
> [14621.320721]  ? aio_fsync_work+0x5b0/0x5b0
> [14621.321249]  iomap_dio_complete_work+0x17/0x30
> [14621.321844]  process_one_work+0x275/0x6b0
> [14621.322376]  worker_thread+0x4f/0x3e0
> [14621.322871]  ? process_one_work+0x6b0/0x6b0
> [14621.323408]  kthread+0x12a/0x170
> [14621.323827]  ? kthread_create_worker_on_cpu+0x70/0x70
> [14621.324473]  ret_from_fork+0x3a/0x50
> [14621.324983] INFO: task aio-stress:16274 blocked for more than 120 seconds.
> [14621.325896]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> [14621.326579] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [14621.327580] aio-stress      D    0 16274  14855 0x00004000
> [14621.328280] Call Trace:
> [14621.328602]  __schedule+0x384/0xa30
> [14621.329056]  schedule+0x33/0xe0
> [14621.329478]  rwsem_down_write_slowpath+0x2c2/0x750
> [14621.330118]  ? btrfs_sync_file+0x219/0x4d0 [btrfs]
> [14621.330747]  btrfs_sync_file+0x219/0x4d0 [btrfs]
> [14621.331346]  iomap_dio_complete+0x11b/0x260
> [14621.331886]  iomap_dio_rw+0x3bc/0x4c0
> [14621.332372]  ? btrfs_file_write_iter+0x645/0x870 [btrfs]
> [14621.333076]  btrfs_file_write_iter+0x645/0x870 [btrfs]
> [14621.333749]  aio_write+0x148/0x1d0
> [14621.334196]  ? lock_acquire+0xb1/0x4a0
> [14621.334682]  ? __might_fault+0x3e/0x90
> [14621.335172]  ? __fget_files+0x132/0x270
> [14621.335668]  ? io_submit_one+0x946/0x1630
> [14621.336184]  io_submit_one+0x946/0x1630
> [14621.336680]  ? lock_acquire+0xb1/0x4a0
> [14621.337175]  ? __might_fault+0x3e/0x90
> [14621.337707]  ? __x64_sys_io_submit+0x9c/0x330
> [14621.338269]  __x64_sys_io_submit+0x9c/0x330
> [14621.338812]  ? do_syscall_64+0x5c/0x280
> [14621.339303]  do_syscall_64+0x5c/0x280
> [14621.339774]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
> [14621.340416] RIP: 0033:0x7fb6cd395717
> [14621.340875] Code: Bad RIP value.
> [14621.341304] RSP: 002b:00007fb6bf7e1de8 EFLAGS: 00000202 ORIG_RAX:
> 00000000000000d1
> [14621.342262] RAX: ffffffffffffffda RBX: 0000560d3d92ea60 RCX: 00007fb6cd395717
> [14621.343180] RDX: 0000560d3d92ea60 RSI: 0000000000000008 RDI: 00007fb6cdb32000
> [14621.344081] RBP: 0000000000000008 R08: 0000150e50ac6651 R09: 00000000003081a8
> [14621.344981] R10: 00007fb6bf7e1df0 R11: 0000000000000202 R12: 0000560d3d8fe110
> [14621.345897] R13: 00007fb6bf7e1e10 R14: 00007fb6bf7e1e00 R15: 0000560d3d8fe110
> [14621.346820] INFO: lockdep is turned off.
> [14742.125500] INFO: task kworker/1:117:15962 blocked for more than 241 seconds.
> [14742.126456]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> [14742.127156] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [14742.128156] kworker/1:117   D    0 15962      2 0x80004000
> [14742.128875] Workqueue: dio/sdb iomap_dio_complete_work
> [14742.129633] Call Trace:
> [14742.130010]  __schedule+0x384/0xa30
> [14742.130494]  schedule+0x33/0xe0
> [14742.131068]  rwsem_down_write_slowpath+0x2c2/0x750
> [14742.131956]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> [14742.132834]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> [14742.133712]  iomap_dio_complete+0x11b/0x260
> [14742.134475]  ? aio_fsync_work+0x5b0/0x5b0
> [14742.135205]  iomap_dio_complete_work+0x17/0x30
> [14742.136018]  process_one_work+0x275/0x6b0
> [14742.136677]  worker_thread+0x4f/0x3e0
> [14742.137154]  ? process_one_work+0x6b0/0x6b0
> [14742.137805]  kthread+0x12a/0x170
> [14742.138236]  ? kthread_create_worker_on_cpu+0x70/0x70
> [14742.138901]  ret_from_fork+0x3a/0x50
> [14742.139389] INFO: task kworker/1:199:16063 blocked for more than 241 seconds.
> [14742.140305]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> [14742.140998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [14742.142056] kworker/1:199   D    0 16063      2 0x80004000
> [14742.142877] Workqueue: dio/sdb iomap_dio_complete_work
> [14742.143397] Call Trace:
> [14742.143654]  __schedule+0x384/0xa30
> [14742.144017]  schedule+0x33/0xe0
> [14742.144352]  rwsem_down_write_slowpath+0x2c2/0x750
> [14742.144859]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> [14742.145386]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> [14742.145863]  iomap_dio_complete+0x11b/0x260
> [14742.146289]  ? aio_fsync_work+0x5b0/0x5b0
> [14742.146701]  iomap_dio_complete_work+0x17/0x30
> [14742.147168]  process_one_work+0x275/0x6b0
> [14742.147579]  worker_thread+0x4f/0x3e0
> [14742.147954]  ? process_one_work+0x6b0/0x6b0
> [14742.148377]  kthread+0x12a/0x170
> [14742.148722]  ? kthread_create_worker_on_cpu+0x70/0x70
> [14742.149257]  ret_from_fork+0x3a/0x50
> [14742.149671] INFO: task aio-stress:16274 blocked for more than 241 seconds.
> [14742.150376]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> [14742.150948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [14742.151962] aio-stress      D    0 16274  14855 0x00004000
> (...)
> 

Seems like the btrfs_inode->dio_sem. Would you have more information on
which process is holding on to it, or if there was a failure?

I will try to reproduce at my end. Thanks for testing.

Filipe Manana June 6, 2020, 9:57 a.m. UTC | #10

On Fri, Jun 5, 2020 at 9:43 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
>
> On 16:17 05/06, Filipe Manana wrote:
> > On Wed, Jun 3, 2020 at 12:35 PM Filipe Manana <fdmanana@gmail.com> wrote:
> > >
> > > On Thu, May 28, 2020 at 7:38 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > > >
> > > > On 17:45 28/05, Filipe Manana wrote:
> > > > > On Thu, May 28, 2020 at 5:34 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > > > > > > And who locked the extent range before?
> > > > > >
> > > > > > This is usually locked by a previous buffered write or read.
> > > > >
> > > > > A previous buffered write/read that has already finished or is still
> > > > > in progress?
> > > > >
> > > > > Because if it has finished we're not supposed to have the file range
> > > > > locked anymore.
> > > >
> > > > In progress. Mixing buffered I/O with direct writes.
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > That seems alarming to me, specially if it's a direct IO write failing
> > > > > > > to invalidate the page cache, since a subsequent buffered read could
> > > > > > > get stale data (what's in the page cache), and not what the direct IO
> > > > > > > write wrote.
> > > > > > >
> > > > > > > Can you elaborate more on all those details?
> > > > > >
> > > > > > The origin of the message is when iomap_dio_rw() tries to invalidate the
> > > > > > inode pages, but fails and calls dio_warn_stale_pagecache().
> > > > > >
> > > > > > In the vanilla code, generic_file_direct_write() aborts direct writes
> > > > > > and returns 0 so that it may fallback to buffered I/O. Perhaps this
> > > > > > should be changed in iomap_dio_rw() as well. I will write a patch to
> > > > > > accomodate that.
> > > > >
> > > > > On vanilla we have no problems of mixing buffered and direct
> > > > > operations as long as they are done sequentially at least.
> > > > > And even if done concurrently we take several measures to ensure that
> > > > > are no surprises (locking ranges, waiting for any ordered extents in
> > > > > progress, etc).
> > > >
> > > > Yes, it is because of the code in generic_file_direct_write(). Anyways,
> > > > I did some tests with the following patch, and it seems to work. I will
> > > > send a formal patch to so that it gets incorporated in iomap sequence as
> > > > well.
> > > >
> > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > > index e4addfc58107..215315be6233 100644
> > > > --- a/fs/iomap/direct-io.c
> > > > +++ b/fs/iomap/direct-io.c
> > > > @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > > >          */
> > > >         ret = invalidate_inode_pages2_range(mapping,
> > > >                         pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > > -       if (ret)
> > > > -               dio_warn_stale_pagecache(iocb->ki_filp);
> > > > -       ret = 0;
> > > > +       /*
> > > > +        * If a page can not be invalidated, return 0 to fall back
> > > > +        * to buffered write.
> > > > +        */
> > > > +       if (ret) {
> > > > +               if (ret == -EBUSY)
> > > > +                       ret = 0;
> > > > +               goto out_free_dio;
> > > > +       }
> > > >
> > > >         if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> > > >             !inode->i_sb->s_dio_done_wq) {
> > > >
> > > >
> > >
> > > Thanks. As I just replied on another thread for that patch, we
> > > actually have a regression.
> > > There's more than the annoying warning in dmesg, it also sets -EIO on
> > > the inode's mapping and makes future fsyncs return that error despite
> > > the fact that no actual errors or corruptions happened:
> > >
> > > https://patchwork.kernel.org/patch/11576677/
> > >
> >
> > There's also some deadlock/hang, I have triggered it twice today with
> > generic/113 on two different VMs:
> >
> > [14621.297370] INFO: task kworker/1:117:15962 blocked for more than 120 seconds.
> > [14621.298491]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > [14621.299231] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [14621.300523] kworker/1:117   D    0 15962      2 0x80004000
> > [14621.301558] Workqueue: dio/sdb iomap_dio_complete_work
> > [14621.302389] Call Trace:
> > [14621.302877]  __schedule+0x384/0xa30
> > [14621.303555]  schedule+0x33/0xe0
> > [14621.304167]  rwsem_down_write_slowpath+0x2c2/0x750
> > [14621.305121]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > [14621.306217]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > [14621.307113]  iomap_dio_complete+0x11b/0x260
> > [14621.307888]  ? aio_fsync_work+0x5b0/0x5b0
> > [14621.308585]  iomap_dio_complete_work+0x17/0x30
> > [14621.309476]  process_one_work+0x275/0x6b0
> > [14621.310275]  worker_thread+0x4f/0x3e0
> > [14621.310869]  ? process_one_work+0x6b0/0x6b0
> > [14621.311403]  kthread+0x12a/0x170
> > [14621.311819]  ? kthread_create_worker_on_cpu+0x70/0x70
> > [14621.312460]  ret_from_fork+0x3a/0x50
> > [14621.312983] INFO: task kworker/1:199:16063 blocked for more than 120 seconds.
> > [14621.313921]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > [14621.314680] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [14621.315724] kworker/1:199   D    0 16063      2 0x80004000
> > [14621.316445] Workqueue: dio/sdb iomap_dio_complete_work
> > [14621.317101] Call Trace:
> > [14621.317437]  __schedule+0x384/0xa30
> > [14621.317928]  schedule+0x33/0xe0
> > [14621.318339]  rwsem_down_write_slowpath+0x2c2/0x750
> > [14621.318981]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > [14621.319609]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > [14621.320203]  iomap_dio_complete+0x11b/0x260
> > [14621.320721]  ? aio_fsync_work+0x5b0/0x5b0
> > [14621.321249]  iomap_dio_complete_work+0x17/0x30
> > [14621.321844]  process_one_work+0x275/0x6b0
> > [14621.322376]  worker_thread+0x4f/0x3e0
> > [14621.322871]  ? process_one_work+0x6b0/0x6b0
> > [14621.323408]  kthread+0x12a/0x170
> > [14621.323827]  ? kthread_create_worker_on_cpu+0x70/0x70
> > [14621.324473]  ret_from_fork+0x3a/0x50
> > [14621.324983] INFO: task aio-stress:16274 blocked for more than 120 seconds.
> > [14621.325896]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > [14621.326579] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [14621.327580] aio-stress      D    0 16274  14855 0x00004000
> > [14621.328280] Call Trace:
> > [14621.328602]  __schedule+0x384/0xa30
> > [14621.329056]  schedule+0x33/0xe0
> > [14621.329478]  rwsem_down_write_slowpath+0x2c2/0x750
> > [14621.330118]  ? btrfs_sync_file+0x219/0x4d0 [btrfs]
> > [14621.330747]  btrfs_sync_file+0x219/0x4d0 [btrfs]
> > [14621.331346]  iomap_dio_complete+0x11b/0x260
> > [14621.331886]  iomap_dio_rw+0x3bc/0x4c0
> > [14621.332372]  ? btrfs_file_write_iter+0x645/0x870 [btrfs]
> > [14621.333076]  btrfs_file_write_iter+0x645/0x870 [btrfs]
> > [14621.333749]  aio_write+0x148/0x1d0
> > [14621.334196]  ? lock_acquire+0xb1/0x4a0
> > [14621.334682]  ? __might_fault+0x3e/0x90
> > [14621.335172]  ? __fget_files+0x132/0x270
> > [14621.335668]  ? io_submit_one+0x946/0x1630
> > [14621.336184]  io_submit_one+0x946/0x1630
> > [14621.336680]  ? lock_acquire+0xb1/0x4a0
> > [14621.337175]  ? __might_fault+0x3e/0x90
> > [14621.337707]  ? __x64_sys_io_submit+0x9c/0x330
> > [14621.338269]  __x64_sys_io_submit+0x9c/0x330
> > [14621.338812]  ? do_syscall_64+0x5c/0x280
> > [14621.339303]  do_syscall_64+0x5c/0x280
> > [14621.339774]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > [14621.340416] RIP: 0033:0x7fb6cd395717
> > [14621.340875] Code: Bad RIP value.
> > [14621.341304] RSP: 002b:00007fb6bf7e1de8 EFLAGS: 00000202 ORIG_RAX:
> > 00000000000000d1
> > [14621.342262] RAX: ffffffffffffffda RBX: 0000560d3d92ea60 RCX: 00007fb6cd395717
> > [14621.343180] RDX: 0000560d3d92ea60 RSI: 0000000000000008 RDI: 00007fb6cdb32000
> > [14621.344081] RBP: 0000000000000008 R08: 0000150e50ac6651 R09: 00000000003081a8
> > [14621.344981] R10: 00007fb6bf7e1df0 R11: 0000000000000202 R12: 0000560d3d8fe110
> > [14621.345897] R13: 00007fb6bf7e1e10 R14: 00007fb6bf7e1e00 R15: 0000560d3d8fe110
> > [14621.346820] INFO: lockdep is turned off.
> > [14742.125500] INFO: task kworker/1:117:15962 blocked for more than 241 seconds.
> > [14742.126456]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > [14742.127156] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [14742.128156] kworker/1:117   D    0 15962      2 0x80004000
> > [14742.128875] Workqueue: dio/sdb iomap_dio_complete_work
> > [14742.129633] Call Trace:
> > [14742.130010]  __schedule+0x384/0xa30
> > [14742.130494]  schedule+0x33/0xe0
> > [14742.131068]  rwsem_down_write_slowpath+0x2c2/0x750
> > [14742.131956]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > [14742.132834]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > [14742.133712]  iomap_dio_complete+0x11b/0x260
> > [14742.134475]  ? aio_fsync_work+0x5b0/0x5b0
> > [14742.135205]  iomap_dio_complete_work+0x17/0x30
> > [14742.136018]  process_one_work+0x275/0x6b0
> > [14742.136677]  worker_thread+0x4f/0x3e0
> > [14742.137154]  ? process_one_work+0x6b0/0x6b0
> > [14742.137805]  kthread+0x12a/0x170
> > [14742.138236]  ? kthread_create_worker_on_cpu+0x70/0x70
> > [14742.138901]  ret_from_fork+0x3a/0x50
> > [14742.139389] INFO: task kworker/1:199:16063 blocked for more than 241 seconds.
> > [14742.140305]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > [14742.140998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [14742.142056] kworker/1:199   D    0 16063      2 0x80004000
> > [14742.142877] Workqueue: dio/sdb iomap_dio_complete_work
> > [14742.143397] Call Trace:
> > [14742.143654]  __schedule+0x384/0xa30
> > [14742.144017]  schedule+0x33/0xe0
> > [14742.144352]  rwsem_down_write_slowpath+0x2c2/0x750
> > [14742.144859]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > [14742.145386]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > [14742.145863]  iomap_dio_complete+0x11b/0x260
> > [14742.146289]  ? aio_fsync_work+0x5b0/0x5b0
> > [14742.146701]  iomap_dio_complete_work+0x17/0x30
> > [14742.147168]  process_one_work+0x275/0x6b0
> > [14742.147579]  worker_thread+0x4f/0x3e0
> > [14742.147954]  ? process_one_work+0x6b0/0x6b0
> > [14742.148377]  kthread+0x12a/0x170
> > [14742.148722]  ? kthread_create_worker_on_cpu+0x70/0x70
> > [14742.149257]  ret_from_fork+0x3a/0x50
> > [14742.149671] INFO: task aio-stress:16274 blocked for more than 241 seconds.
> > [14742.150376]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > [14742.150948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [14742.151962] aio-stress      D    0 16274  14855 0x00004000
> > (...)
> >
>
> Seems like the btrfs_inode->dio_sem. Would you have more information on
> which process is holding on to it, or if there was a failure?

This happens 100% of the time.

At btrfs_direct_write():

down_read(&BTRFS_I(inode)->dio_sem);
written = iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dops,
       is_sync_kiocb(iocb));
up_read(&BTRFS_I(inode)->dio_sem);

than through iomap_dio_rw(), we end up calling btrfs_sync_file(),
which tries to down_write() the dio_sem.

>
> I will try to reproduce at my end. Thanks for testing.

You should have no difficulty reproducing it.

And another lockdep warning:

[  422.202844] ============================================
[  422.204113] WARNING: possible recursive locking detected
[  422.205014] 5.7.0-rc7-btrfs-next-59 #1 Not tainted
[  422.205722] --------------------------------------------
[  422.206598] aio-stress/1665 is trying to acquire lock:
[  422.207776] ffff8fbc9a24d620
(&sb->s_type->i_mutex_key#14){++++}-{3:3}, at:
btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[  422.209538]
               but task is already holding lock:
[  422.210424] ffff8fbc9a24d620
(&sb->s_type->i_mutex_key#14){++++}-{3:3}, at:
btrfs_file_write_iter+0x80/0x870 [btrfs]
[  422.212075]
               other info that might help us debug this:
[  422.213089]  Possible unsafe locking scenario:

[  422.214141]        CPU0
[  422.214578]        ----
[  422.215098]   lock(&sb->s_type->i_mutex_key#14);
[  422.215990]   lock(&sb->s_type->i_mutex_key#14);
[  422.216940]
                *** DEADLOCK ***

[  422.218115]  May be due to missing lock nesting notation

[  422.219060] 2 locks held by aio-stress/1665:
[  422.219815]  #0: ffff8fbc9a24d620
(&sb->s_type->i_mutex_key#14){++++}-{3:3}, at:
btrfs_file_write_iter+0x80/0x870 [btrfs]
[  422.221516]  #1: ffff8fbc9a24d4a8 (&ei->dio_sem){++++}-{3:3}, at:
btrfs_file_write_iter+0x6b4/0x870 [btrfs]
[  422.223437]
               stack backtrace:
[  422.224335] CPU: 0 PID: 1665 Comm: aio-stress Not tainted
5.7.0-rc7-btrfs-next-59 #1
[  422.225960] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[  422.227728] Call Trace:
[  422.228046]  dump_stack+0x87/0xcb
[  422.228667]  __lock_acquire+0x17d0/0x21a0
[  422.229545]  lock_acquire+0xb1/0x4a0
[  422.230272]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[  422.231250]  down_write+0x38/0x70
[  422.231821]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[  422.232502]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[  422.233169]  iomap_dio_complete+0x11b/0x260
[  422.233838]  iomap_dio_rw+0x3bc/0x4c0
[  422.234404]  ? btrfs_file_write_iter+0x6d9/0x870 [btrfs]
[  422.235311]  btrfs_file_write_iter+0x6d9/0x870 [btrfs]
[  422.236250]  aio_write+0x148/0x1d0
[  422.236872]  ? lock_acquire+0xb1/0x4a0
[  422.237560]  ? __might_fault+0x3e/0x90
[  422.238291]  ? __might_fault+0x3e/0x90
[  422.239002]  ? io_submit_one+0x946/0x1630
[  422.239874]  io_submit_one+0x946/0x1630
[  422.240565]  ? lock_acquire+0xb1/0x4a0
[  422.241162]  ? __might_fault+0x3e/0x90
[  422.241711]  ? __x64_sys_io_submit+0x9c/0x330
[  422.242336]  __x64_sys_io_submit+0x9c/0x330
[  422.242943]  ? do_syscall_64+0x5c/0x280
[  422.243420]  do_syscall_64+0x5c/0x280
[  422.243878]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
[  422.244626] RIP: 0033:0x7f77e38aa717
[  422.245219] Code: 00 75 08 8b 47 0c 39 47 08 74 08 e9 c3 ff ff ff
0f 1f 00 31 c0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 b8 d1 00 00
00 0f 05 <c3> 0f 1f 84 00 00 00 00 00 b8 d2 00 00 00 0f 05 c3 0f 1f 84
00 00
[  422.247827] RSP: 002b:00007f77d6cf8de8 EFLAGS: 00000202 ORIG_RAX:
00000000000000d1
[  422.248924] RAX: ffffffffffffffda RBX: 000055c78674c400 RCX: 00007f77e38aa717
[  422.249959] RDX: 000055c78674c400 RSI: 0000000000000008 RDI: 00007f77e4059000
[  422.250924] RBP: 0000000000000008 R08: 000000a0afa4d2c9 R09: 0000000000009e34
[  422.251958] R10: 00007f77d6cf8df0 R11: 0000000000000202 R12: 000055c786720f10
[  422.252989] R13: 00007f77d6cf8e10 R14: 00007f77d6cf8e00 R15: 000055c786720f10

Thanks.

>
> --
> Goldwyn

Goldwyn Rodrigues June 8, 2020, 3:39 p.m. UTC | #11

On 10:57 06/06, Filipe Manana wrote:
> On Fri, Jun 5, 2020 at 9:43 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> >
> > On 16:17 05/06, Filipe Manana wrote:
> > > On Wed, Jun 3, 2020 at 12:35 PM Filipe Manana <fdmanana@gmail.com> wrote:
> > > >
> > > > On Thu, May 28, 2020 at 7:38 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > > > >
> > > > > On 17:45 28/05, Filipe Manana wrote:
> > > > > > On Thu, May 28, 2020 at 5:34 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > > > > > > > And who locked the extent range before?
> > > > > > >
> > > > > > > This is usually locked by a previous buffered write or read.
> > > > > >
> > > > > > A previous buffered write/read that has already finished or is still
> > > > > > in progress?
> > > > > >
> > > > > > Because if it has finished we're not supposed to have the file range
> > > > > > locked anymore.
> > > > >
> > > > > In progress. Mixing buffered I/O with direct writes.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > That seems alarming to me, specially if it's a direct IO write failing
> > > > > > > > to invalidate the page cache, since a subsequent buffered read could
> > > > > > > > get stale data (what's in the page cache), and not what the direct IO
> > > > > > > > write wrote.
> > > > > > > >
> > > > > > > > Can you elaborate more on all those details?
> > > > > > >
> > > > > > > The origin of the message is when iomap_dio_rw() tries to invalidate the
> > > > > > > inode pages, but fails and calls dio_warn_stale_pagecache().
> > > > > > >
> > > > > > > In the vanilla code, generic_file_direct_write() aborts direct writes
> > > > > > > and returns 0 so that it may fallback to buffered I/O. Perhaps this
> > > > > > > should be changed in iomap_dio_rw() as well. I will write a patch to
> > > > > > > accomodate that.
> > > > > >
> > > > > > On vanilla we have no problems of mixing buffered and direct
> > > > > > operations as long as they are done sequentially at least.
> > > > > > And even if done concurrently we take several measures to ensure that
> > > > > > are no surprises (locking ranges, waiting for any ordered extents in
> > > > > > progress, etc).
> > > > >
> > > > > Yes, it is because of the code in generic_file_direct_write(). Anyways,
> > > > > I did some tests with the following patch, and it seems to work. I will
> > > > > send a formal patch to so that it gets incorporated in iomap sequence as
> > > > > well.
> > > > >
> > > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > > > index e4addfc58107..215315be6233 100644
> > > > > --- a/fs/iomap/direct-io.c
> > > > > +++ b/fs/iomap/direct-io.c
> > > > > @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > > > >          */
> > > > >         ret = invalidate_inode_pages2_range(mapping,
> > > > >                         pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > > > -       if (ret)
> > > > > -               dio_warn_stale_pagecache(iocb->ki_filp);
> > > > > -       ret = 0;
> > > > > +       /*
> > > > > +        * If a page can not be invalidated, return 0 to fall back
> > > > > +        * to buffered write.
> > > > > +        */
> > > > > +       if (ret) {
> > > > > +               if (ret == -EBUSY)
> > > > > +                       ret = 0;
> > > > > +               goto out_free_dio;
> > > > > +       }
> > > > >
> > > > >         if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> > > > >             !inode->i_sb->s_dio_done_wq) {
> > > > >
> > > > >
> > > >
> > > > Thanks. As I just replied on another thread for that patch, we
> > > > actually have a regression.
> > > > There's more than the annoying warning in dmesg, it also sets -EIO on
> > > > the inode's mapping and makes future fsyncs return that error despite
> > > > the fact that no actual errors or corruptions happened:
> > > >
> > > > https://patchwork.kernel.org/patch/11576677/
> > > >
> > >
> > > There's also some deadlock/hang, I have triggered it twice today with
> > > generic/113 on two different VMs:
> > >
> > > [14621.297370] INFO: task kworker/1:117:15962 blocked for more than 120 seconds.
> > > [14621.298491]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > > [14621.299231] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [14621.300523] kworker/1:117   D    0 15962      2 0x80004000
> > > [14621.301558] Workqueue: dio/sdb iomap_dio_complete_work
> > > [14621.302389] Call Trace:
> > > [14621.302877]  __schedule+0x384/0xa30
> > > [14621.303555]  schedule+0x33/0xe0
> > > [14621.304167]  rwsem_down_write_slowpath+0x2c2/0x750
> > > [14621.305121]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > > [14621.306217]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > > [14621.307113]  iomap_dio_complete+0x11b/0x260
> > > [14621.307888]  ? aio_fsync_work+0x5b0/0x5b0
> > > [14621.308585]  iomap_dio_complete_work+0x17/0x30
> > > [14621.309476]  process_one_work+0x275/0x6b0
> > > [14621.310275]  worker_thread+0x4f/0x3e0
> > > [14621.310869]  ? process_one_work+0x6b0/0x6b0
> > > [14621.311403]  kthread+0x12a/0x170
> > > [14621.311819]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > [14621.312460]  ret_from_fork+0x3a/0x50
> > > [14621.312983] INFO: task kworker/1:199:16063 blocked for more than 120 seconds.
> > > [14621.313921]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > > [14621.314680] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [14621.315724] kworker/1:199   D    0 16063      2 0x80004000
> > > [14621.316445] Workqueue: dio/sdb iomap_dio_complete_work
> > > [14621.317101] Call Trace:
> > > [14621.317437]  __schedule+0x384/0xa30
> > > [14621.317928]  schedule+0x33/0xe0
> > > [14621.318339]  rwsem_down_write_slowpath+0x2c2/0x750
> > > [14621.318981]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > > [14621.319609]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > > [14621.320203]  iomap_dio_complete+0x11b/0x260
> > > [14621.320721]  ? aio_fsync_work+0x5b0/0x5b0
> > > [14621.321249]  iomap_dio_complete_work+0x17/0x30
> > > [14621.321844]  process_one_work+0x275/0x6b0
> > > [14621.322376]  worker_thread+0x4f/0x3e0
> > > [14621.322871]  ? process_one_work+0x6b0/0x6b0
> > > [14621.323408]  kthread+0x12a/0x170
> > > [14621.323827]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > [14621.324473]  ret_from_fork+0x3a/0x50
> > > [14621.324983] INFO: task aio-stress:16274 blocked for more than 120 seconds.
> > > [14621.325896]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > > [14621.326579] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [14621.327580] aio-stress      D    0 16274  14855 0x00004000
> > > [14621.328280] Call Trace:
> > > [14621.328602]  __schedule+0x384/0xa30
> > > [14621.329056]  schedule+0x33/0xe0
> > > [14621.329478]  rwsem_down_write_slowpath+0x2c2/0x750
> > > [14621.330118]  ? btrfs_sync_file+0x219/0x4d0 [btrfs]
> > > [14621.330747]  btrfs_sync_file+0x219/0x4d0 [btrfs]
> > > [14621.331346]  iomap_dio_complete+0x11b/0x260
> > > [14621.331886]  iomap_dio_rw+0x3bc/0x4c0
> > > [14621.332372]  ? btrfs_file_write_iter+0x645/0x870 [btrfs]
> > > [14621.333076]  btrfs_file_write_iter+0x645/0x870 [btrfs]
> > > [14621.333749]  aio_write+0x148/0x1d0
> > > [14621.334196]  ? lock_acquire+0xb1/0x4a0
> > > [14621.334682]  ? __might_fault+0x3e/0x90
> > > [14621.335172]  ? __fget_files+0x132/0x270
> > > [14621.335668]  ? io_submit_one+0x946/0x1630
> > > [14621.336184]  io_submit_one+0x946/0x1630
> > > [14621.336680]  ? lock_acquire+0xb1/0x4a0
> > > [14621.337175]  ? __might_fault+0x3e/0x90
> > > [14621.337707]  ? __x64_sys_io_submit+0x9c/0x330
> > > [14621.338269]  __x64_sys_io_submit+0x9c/0x330
> > > [14621.338812]  ? do_syscall_64+0x5c/0x280
> > > [14621.339303]  do_syscall_64+0x5c/0x280
> > > [14621.339774]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > [14621.340416] RIP: 0033:0x7fb6cd395717
> > > [14621.340875] Code: Bad RIP value.
> > > [14621.341304] RSP: 002b:00007fb6bf7e1de8 EFLAGS: 00000202 ORIG_RAX:
> > > 00000000000000d1
> > > [14621.342262] RAX: ffffffffffffffda RBX: 0000560d3d92ea60 RCX: 00007fb6cd395717
> > > [14621.343180] RDX: 0000560d3d92ea60 RSI: 0000000000000008 RDI: 00007fb6cdb32000
> > > [14621.344081] RBP: 0000000000000008 R08: 0000150e50ac6651 R09: 00000000003081a8
> > > [14621.344981] R10: 00007fb6bf7e1df0 R11: 0000000000000202 R12: 0000560d3d8fe110
> > > [14621.345897] R13: 00007fb6bf7e1e10 R14: 00007fb6bf7e1e00 R15: 0000560d3d8fe110
> > > [14621.346820] INFO: lockdep is turned off.
> > > [14742.125500] INFO: task kworker/1:117:15962 blocked for more than 241 seconds.
> > > [14742.126456]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > > [14742.127156] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [14742.128156] kworker/1:117   D    0 15962      2 0x80004000
> > > [14742.128875] Workqueue: dio/sdb iomap_dio_complete_work
> > > [14742.129633] Call Trace:
> > > [14742.130010]  __schedule+0x384/0xa30
> > > [14742.130494]  schedule+0x33/0xe0
> > > [14742.131068]  rwsem_down_write_slowpath+0x2c2/0x750
> > > [14742.131956]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > > [14742.132834]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > > [14742.133712]  iomap_dio_complete+0x11b/0x260
> > > [14742.134475]  ? aio_fsync_work+0x5b0/0x5b0
> > > [14742.135205]  iomap_dio_complete_work+0x17/0x30
> > > [14742.136018]  process_one_work+0x275/0x6b0
> > > [14742.136677]  worker_thread+0x4f/0x3e0
> > > [14742.137154]  ? process_one_work+0x6b0/0x6b0
> > > [14742.137805]  kthread+0x12a/0x170
> > > [14742.138236]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > [14742.138901]  ret_from_fork+0x3a/0x50
> > > [14742.139389] INFO: task kworker/1:199:16063 blocked for more than 241 seconds.
> > > [14742.140305]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > > [14742.140998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [14742.142056] kworker/1:199   D    0 16063      2 0x80004000
> > > [14742.142877] Workqueue: dio/sdb iomap_dio_complete_work
> > > [14742.143397] Call Trace:
> > > [14742.143654]  __schedule+0x384/0xa30
> > > [14742.144017]  schedule+0x33/0xe0
> > > [14742.144352]  rwsem_down_write_slowpath+0x2c2/0x750
> > > [14742.144859]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > > [14742.145386]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > > [14742.145863]  iomap_dio_complete+0x11b/0x260
> > > [14742.146289]  ? aio_fsync_work+0x5b0/0x5b0
> > > [14742.146701]  iomap_dio_complete_work+0x17/0x30
> > > [14742.147168]  process_one_work+0x275/0x6b0
> > > [14742.147579]  worker_thread+0x4f/0x3e0
> > > [14742.147954]  ? process_one_work+0x6b0/0x6b0
> > > [14742.148377]  kthread+0x12a/0x170
> > > [14742.148722]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > [14742.149257]  ret_from_fork+0x3a/0x50
> > > [14742.149671] INFO: task aio-stress:16274 blocked for more than 241 seconds.
> > > [14742.150376]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > > [14742.150948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [14742.151962] aio-stress      D    0 16274  14855 0x00004000
> > > (...)
> > >
> >
> > Seems like the btrfs_inode->dio_sem. Would you have more information on
> > which process is holding on to it, or if there was a failure?
> 
> This happens 100% of the time.
> 
> At btrfs_direct_write():
> 
> down_read(&BTRFS_I(inode)->dio_sem);
> written = iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dops,
>        is_sync_kiocb(iocb));
> up_read(&BTRFS_I(inode)->dio_sem);
> 
> than through iomap_dio_rw(), we end up calling btrfs_sync_file(),
> which tries to down_write() the dio_sem.

I see what is happening in the code. Also, we have generic_write_sync()
called  twice. I will work on this. Surprisingly generic/113 works just
fine for me.

Thanks,

[4/7] btrfs: Switch to iomap_dio_rw() for dio

Commit Message

Comments

Patch