
btrfs: fix stale page cache after race between readahead and direct IO write

Message ID: 24617a89550bed4ef0e0db12d17187940de551b0.1738685146.git.fdmanana@suse.com
State: New
Series: btrfs: fix stale page cache after race between readahead and direct IO write

Commit Message

Filipe Manana Feb. 4, 2025, 4:16 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

After commit ac325fc2aad5 ("btrfs: do not hold the extent lock for entire
read") we can now trigger a race between a task doing a direct IO write
and readahead. When this race is triggered it results in tasks getting
stale data when they attempt to do a buffered read (including the task that
did the direct IO write).

This race can be sporadically triggered with test case generic/418, failing
like this:

   $ ./check generic/418
   FSTYP         -- btrfs
   PLATFORM      -- Linux/x86_64 debian0 6.13.0-rc7-btrfs-next-185+ #17 SMP PREEMPT_DYNAMIC Mon Feb  3 12:28:46 WET 2025
   MKFS_OPTIONS  -- /dev/sdc
   MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

   generic/418 14s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad)
       --- tests/generic/418.out	2020-06-10 19:29:03.850519863 +0100
       +++ /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad	2025-02-03 15:42:36.974609476 +0000
       @@ -1,2 +1,5 @@
        QA output created by 418
       +cmpbuf: offset 0: Expected: 0x1, got 0x0
       +[6:0] FAIL - comparison failed, offset 24576
       +diotest -wp -b 4096 -n 8 -i 4 failed at loop 3
        Silence is golden
       ...
       (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/418.out /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad'  to see the entire diff)
   Ran: generic/418
   Failures: generic/418
   Failed 1 of 1 tests

The race happens like this:

1) A file has a prealloc extent for the range [16K, 28K);

2) Task A starts a direct IO write against file range [24K, 28K).
   At the start of the direct IO write it invalidates the page cache at
   __iomap_dio_rw() with kiocb_invalidate_pages() for the 4K page at file
   offset 24K;

3) Task A enters btrfs_dio_iomap_begin() and locks the extent range
   [24K, 28K);

4) Task B starts a readahead for file range [16K, 28K), entering
   btrfs_readahead().

   First it attempts to read the page at offset 16K by entering
   btrfs_do_readpage(), where it calls get_extent_map(), locks the range
   [16K, 20K) and gets the extent map for the range [16K, 28K), caching
   it into the 'em_cached' variable declared in the local stack of
   btrfs_readahead(), and then unlocks the range [16K, 20K).

   Since the extent map has the prealloc flag, at btrfs_do_readpage() we
   zero out the page's content and don't submit any bio to read the page
   from the extent.

   Then it attempts to read the page at offset 20K entering
   btrfs_do_readpage() where we reuse the previously cached extent map
   (decided by get_extent_map()) since it spans the page's range and
   it's still in the inode's extent map tree.

   Just like for the previous page, we zero out the page's content since
   the extent map has the prealloc flag set.

   Then it attempts to read the page at offset 24K entering
   btrfs_do_readpage() where we reuse the previously cached extent map
   (decided by get_extent_map()) since it spans the page's range and
   it's still in the inode's extent map tree.

   Just like for the previous pages, we zero out the page's content since
   the extent map has the prealloc flag set. Note that we didn't lock the
   extent range [24K, 28K), so we didn't synchronize with the ongoing
   direct IO write being performed by task A;

5) Task A enters btrfs_create_dio_extent() and creates an ordered extent
   for the range [24K, 28K), with the flags BTRFS_ORDERED_DIRECT and
   BTRFS_ORDERED_PREALLOC set;

6) Task A unlocks the range [24K, 28K) at btrfs_dio_iomap_begin();

7) The ordered extent enters btrfs_finish_one_ordered() and locks the
   range [24K, 28K);

8) Task A enters fs/iomap/direct-io.c:iomap_dio_complete() and it tries
   to invalidate the page at offset 24K by calling
   kiocb_invalidate_post_direct_write(), resulting in a call chain that
   ends up at btrfs_release_folio().

   The btrfs_release_folio() call ends up returning false because the range
   for the page at file offset 24K is currently locked by the task doing
   the ordered extent completion in the previous step (7), so we have:

   btrfs_release_folio() ->
      __btrfs_release_folio() ->
         try_release_extent_mapping() ->
            try_release_extent_state()

   This last function checks that the range is locked, returns false, and
   that result is propagated up to btrfs_release_folio().

   So this results in a failure to invalidate the page and
   kiocb_invalidate_post_direct_write() triggers this message logged in
   dmesg:

     Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!

   After this we leave the page cache with stale data for the file range
   [24K, 28K), filled with zeroes instead of the data written by direct IO
   write (all bytes with a 0x01 value), so any task attempting to read with
   buffered IO, including the task that did the direct IO write, will get
   all bytes in the range with a 0x00 value instead of the written data.
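
To make the failure concrete, here is a minimal user-space sketch of the
access pattern involved (an illustrative simplification, not the actual
generic/418 reproducer: the mount point and file name are hypothetical,
and actually hitting the race also requires concurrent readahead of the
file, e.g. from a second process, timed as in steps 2) to 8) above):

#define _GNU_SOURCE           /* for O_DIRECT and fallocate() */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const off_t off = 24 * 1024;   /* direct IO write target: [24K, 28K) */
	const size_t len = 4096;
	char rbuf[4096];
	void *wbuf;
	int fd, rfd;

	/* O_DIRECT requires an aligned buffer. */
	if (posix_memalign(&wbuf, 4096, len))
		return 1;
	memset(wbuf, 0x01, len);

	fd = open("/mnt/scratch/file", O_RDWR | O_CREAT | O_DIRECT, 0644);
	if (fd < 0)
		return 1;

	/* Step 1) of the race: a prealloc extent covering [16K, 28K). */
	if (fallocate(fd, 0, 16 * 1024, 12 * 1024))
		return 1;

	/* Direct IO write of 0x01 bytes into [24K, 28K). */
	if (pwrite(fd, wbuf, len, off) != (ssize_t)len)
		return 1;

	/* Buffered read back: must see 0x01, not stale zeroed page cache. */
	rfd = open("/mnt/scratch/file", O_RDONLY);
	if (rfd < 0 || pread(rfd, rbuf, len, off) != (ssize_t)len)
		return 1;
	for (size_t i = 0; i < len; i++) {
		if (rbuf[i] != 0x01) {
			fprintf(stderr, "stale byte at offset %lld\n",
				(long long)(off + (off_t)i));
			return 1;
		}
	}
	return 0;
}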

Fix this by locking the range, with btrfs_lock_and_flush_ordered_range(),
at the two callers of btrfs_do_readpage() instead of doing it at
get_extent_map(), just like we did before commit ac325fc2aad5 ("btrfs: do
not hold the extent lock for entire read"), and unlocking the range after
all the calls to btrfs_do_readpage(). This way we never reuse a cached
extent map without flushing any pending ordered extents from a concurrent
direct IO write.
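
For reference, btrfs_lock_and_flush_ordered_range() is what guarantees no
pending ordered extent is left inside the locked range. Its behaviour is
roughly the following loop (a paraphrase of the helper in
fs/btrfs/ordered-data.c, with error handling and cached-state details
elided, not the verbatim kernel code):

	while (1) {
		lock_extent(&inode->io_tree, start, end, cached_state);
		ordered = btrfs_lookup_ordered_range(inode, start,
						     end - start + 1);
		if (!ordered)
			break;	/* locked, no ordered extent inside: done */
		unlock_extent(&inode->io_tree, start, end, cached_state);
		/* Flush and wait for the ordered extent, then retry. */
		btrfs_start_ordered_extent(ordered);
		btrfs_put_ordered_extent(ordered);
	}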

Fixes: ac325fc2aad5 ("btrfs: do not hold the extent lock for entire read")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

Comments

Qu Wenruo Feb. 5, 2025, 5:15 a.m. UTC | #1
On 2025/2/5 02:46, fdmanana@kernel.org wrote:
> [...]
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 838ab2b6ed43..bdb9816bf125 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -898,7 +898,6 @@ static struct extent_map *get_extent_map(struct btrfs_inode *inode,
>   					 u64 len, struct extent_map **em_cached)
>   {
>   	struct extent_map *em;
> -	struct extent_state *cached_state = NULL;
>
>   	ASSERT(em_cached);
>
> @@ -914,14 +913,12 @@ static struct extent_map *get_extent_map(struct btrfs_inode *inode,
>   		*em_cached = NULL;
>   	}
>
> -	btrfs_lock_and_flush_ordered_range(inode, start, start + len - 1, &cached_state);
>   	em = btrfs_get_extent(inode, folio, start, len);
>   	if (!IS_ERR(em)) {
>   		BUG_ON(*em_cached);
>   		refcount_inc(&em->refs);
>   		*em_cached = em;
>   	}
> -	unlock_extent(&inode->io_tree, start, start + len - 1, &cached_state);
>
>   	return em;
>   }
> @@ -1078,11 +1075,18 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
>
>   int btrfs_read_folio(struct file *file, struct folio *folio)
>   {
> +	struct btrfs_inode *inode = folio_to_inode(folio);
> +	const u64 start = folio_pos(folio);
> +	const u64 end = start + folio_size(folio) - 1;

And it's great that you're already taking larger data folios into consideration.

> +	struct extent_state *cached_state = NULL;
>   	struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
>   	struct extent_map *em_cached = NULL;
>   	int ret;
>
> +	btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);

This change is going to conflict with the subpage + partial uptodate
page optimization (which will also be the foundation for larger data folios):
https://lore.kernel.org/linux-btrfs/1d230d53e92c4f4a7ea4ce036f226b8daa16e5ae.1736848277.git.wqu@suse.com/

I'm fine to update the patch on my side after yours got merged.

>   	ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
> +	unlock_extent(&inode->io_tree, start, end, &cached_state);
> +
>   	free_extent_map(em_cached);
>
>   	/*
> @@ -2377,12 +2381,20 @@ void btrfs_readahead(struct readahead_control *rac)
>   {
>   	struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ | REQ_RAHEAD };
>   	struct folio *folio;
> +	struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
> +	const u64 start = readahead_pos(rac);
> +	const u64 end = start + readahead_length(rac) - 1;
> +	struct extent_state *cached_state = NULL;
>   	struct extent_map *em_cached = NULL;
>   	u64 prev_em_start = (u64)-1;
>
> +	btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
> +

I'm not confident enough about this lock, as it can cross several pages.

E.g. if we have the following case, 4K page size 4K block size.

         0       4K      8K      12K     16K
         |       |///////|               |

And the range [4K, 8K) is uptodate; it has been submitted for writeback
and the writeback has finished.
But it still has an ordered extent, not yet finished.

Then we go into the following call chain:
btrfs_lock_and_flush_ordered_range()
|- btrfs_start_ordered_extent()
    |- filemap_fdatawrite_range()

Which will try to write back the range [4K, 8K) again.
But since the folio at [4K, 8K) is going to be passed to
btrfs_do_readpage(), it should already have been locked.

Thus the writeback will never be able to lock the folio, thus causing a
deadlock.

Or did I miss something specific to readahead that makes it avoid
already uptodate pages?


And even if it will not cause the above deadlock, I'm not confident the
mentioned subpage fix conflict can be properly fixed.
In the subpage fix, I can only add a folio parameter to
btrfs_lock_and_flush_ordered_range(), but in this case, we have multiple
folios, so how can we avoid the same subpage deadlock then?

Thanks,
Qu

> [...]
Filipe Manana Feb. 5, 2025, 8:24 a.m. UTC | #2
On Wed, Feb 5, 2025 at 5:15 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> [...]
>
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 838ab2b6ed43..bdb9816bf125 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -898,7 +898,6 @@ static struct extent_map *get_extent_map(struct btrfs_inode *inode,
> >                                        u64 len, struct extent_map **em_cached)
> >   {
> >       struct extent_map *em;
> > -     struct extent_state *cached_state = NULL;
> >
> >       ASSERT(em_cached);
> >
> > @@ -914,14 +913,12 @@ static struct extent_map *get_extent_map(struct btrfs_inode *inode,
> >               *em_cached = NULL;
> >       }
> >
> > -     btrfs_lock_and_flush_ordered_range(inode, start, start + len - 1, &cached_state);
> >       em = btrfs_get_extent(inode, folio, start, len);
> >       if (!IS_ERR(em)) {
> >               BUG_ON(*em_cached);
> >               refcount_inc(&em->refs);
> >               *em_cached = em;
> >       }
> > -     unlock_extent(&inode->io_tree, start, start + len - 1, &cached_state);
> >
> >       return em;
> >   }
> > @@ -1078,11 +1075,18 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> >
> >   int btrfs_read_folio(struct file *file, struct folio *folio)
> >   {
> > +     struct btrfs_inode *inode = folio_to_inode(folio);
> > +     const u64 start = folio_pos(folio);
> > +     const u64 end = start + folio_size(folio) - 1;
>
> And it's great that you're already taking larger data folios into consideration.
>
> > +     struct extent_state *cached_state = NULL;
> >       struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
> >       struct extent_map *em_cached = NULL;
> >       int ret;
> >
> > +     btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
>
> This change is going to conflict with the subpage + partial uptodate
> page optimization (which will also be the foundation for larger data folios):
> https://lore.kernel.org/linux-btrfs/1d230d53e92c4f4a7ea4ce036f226b8daa16e5ae.1736848277.git.wqu@suse.com/
>
> I'm fine to update the patch on my side after yours got merged.

I'm sorry about that, but it's a regression so it has to come first.

>
> >       ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
> > +     unlock_extent(&inode->io_tree, start, end, &cached_state);
> > +
> >       free_extent_map(em_cached);
> >
> >       /*
> > @@ -2377,12 +2381,20 @@ void btrfs_readahead(struct readahead_control *rac)
> >   {
> >       struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ | REQ_RAHEAD };
> >       struct folio *folio;
> > +     struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
> > +     const u64 start = readahead_pos(rac);
> > +     const u64 end = start + readahead_length(rac) - 1;
> > +     struct extent_state *cached_state = NULL;
> >       struct extent_map *em_cached = NULL;
> >       u64 prev_em_start = (u64)-1;
> >
> > +     btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
> > +
>
> I'm not confident enough about this lock, as it can cross several pages.
>
> E.g. if we have the following case, 4K page size 4K block size.
>
>          0       4K      8K      12K     16K
>          |       |///////|               |
>
> And the range [4K, 8K) is uptodate; it has been submitted for writeback
> and the writeback has finished.
> But it still has an ordered extent, not yet finished.
>
> Then we go into the following call chain:
> btrfs_lock_and_flush_ordered_range()
> |- btrfs_start_ordered_extent()
>     |- filemap_fdatawrite_range()
>
> Which will try to write back the range [4K, 8K) again.

But you just said above that the writeback has finished...

> But since the folio at [4K, 8K) is going to be passed to
> btrfs_do_readpage(), it should already have been locked.
>
> Thus the writeback will never be able to lock the folio, thus causing a
> deadlock.

readahead (and readpage) are never called for pages that are already
in the page cache,
so how can that happen?

And why do you think that is specific to the case where the range
covers more than one page?
I.e. why doesn't the current scenario of locking only [4K, 8K)
lead to that?

>
> Or did I miss something specific to readahead that makes it avoid
> already uptodate pages?

It will skip pages already in the page cache (whether uptodate, under
writeback, etc).

If you look closer, what this patch is doing is restoring the
behaviour we had before
commit ac325fc2aad5 ("btrfs: do not hold the extent lock for entire
read"), with the
exception of doing the unlock in readahead/readpage instead of when
submitted bios complete.

If there was a possibility for such a deadlock, we would almost
certainly have noticed it before,
as that has been the behaviour for ages until that commit that landed
last summer.


>
>
> And even if it will not cause the above deadlock, I'm not confident the
> mentioned subpage fix conflict can be properly fixed.
> In the subpage fix, I can only add a folio parameter to
> btrfs_lock_and_flush_ordered_range(), but in this case, we have multiple
> folios, so how can we avoid the same subpage deadlock then?

About the subpage case I have no idea.
But given that readahead/readpage is never called for pages already in
the page cache, I don't think there should be a problem even for
the subpage scenario.

We also have many places where we lock ranges wider than 1 page, so I
don't see why this specific place would be a problem.

Thanks.

> [...]
Qu Wenruo Feb. 5, 2025, 8:50 a.m. UTC | #3
On 2025/2/5 18:54, Filipe Manana wrote:
> On Wed, Feb 5, 2025 at 5:15 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
[...]
>> I'm not confident enough about this lock, as it can cross several pages.
>>
>> E.g. if we have the following case, 4K page size 4K block size.
>>
>>           0       4K      8K      12K     16K
>>           |       |///////|               |
>>
>> And the range [4K, 8K) is uptodate; it has been submitted for writeback
>> and the writeback has finished.
>> But it still has an ordered extent, not yet finished.
>>
>> Then we go into the following call chain:
>> btrfs_lock_and_flush_ordered_range()
>> |- btrfs_start_ordered_extent()
>>      |- filemap_fdatawrite_range()
>>
>> Which will try to write back the range [4K, 8K) again.
>
> But you just said above that the writeback has finished...

You're right, for the writeback finished case, the page is no longer
dirty and write path will not try to hold the lock.

So it's still only possible for subpage with partial uptodate page support.
Not possible for the current cases.

>
>> But since the folio at [4K, 8K) is going to be passed to
>> btrfs_do_readpage(), it should already have been locked.
>>
>> Thus the writeback will never be able to lock the folio, thus causing a
>> deadlock.
>
> readahead (and readpage) are never called for pages that are already
> in the page cache,
> so how can that happen?

Mind being more explicit about the readahead requirement?
E.g. if we have a folio that's dirty but not uptodate, what will happen?

I know readapge can be called on page that's already in the page cache,
especially for subpage cases (although not for now).

E.g. 16K page size, 4K block size.

    0       4K       8K      12K     16K
    |       |////////|               |

Where [4K, 8K) is dirtied by buffered write, which lands exactly at [4K,
8K).

It's not possible for now, because at buffered write we will read out
the whole page if the write is not page aligned.

But XFS/EXT4 allow skipping the full page read, and just dirtying the block
aligned range, and I'm making btrfs support that.
(https://lore.kernel.org/linux-btrfs/cover.1736848277.git.wqu@suse.com/)

So in that case, if we hit such a dirty but not uptodate page, will
readahead still try to read it, since the folio is not uptodate?

>
> And why do you think that is specific to the case where the range
> covers more than one page?
> I.e. why doesn't the current scenario of locking only [4K, 8K)
> lead to that?

Right, the deadlock is really limited to the subpage + partial uptodate
case.

But I really believe, without such subpage mindset, we are never going
to support larger data folios.

>
>>
>> Or did I miss something specific to readahead that makes it avoid
>> already uptodate pages?
>
> It will skip pages already in the page cache (whether uptodate, under
> writeback, etc).
>
> If you look closer, what this patch is doing is restoring the
> behaviour we had before
> commit ac325fc2aad5 ("btrfs: do not hold the extent lock for entire
> read"), with the
> exception of doing the unlock in readahead/readpage instead of when
> submitted bios complete.

I'm fine reverting to the old behavior, and so far it looks good to me.

So I can give it a reviewed-by:

Reviewed-by: Qu Wenruo <wqu@suse.com>



My concern is, with the multi-folio lock back, what is the proper way to
avoid the subpage deadlock for subpage partial uptodate folio, and
future larger data folio?


My current fix will not be feasible, as it relies on the single folio
pointer to determine if btrfs_start_ordered_extent() should skip
writeback, and now we have the multi-folio range back.

Maybe we should just make
btrfs_lock_and_flush_ordered_range()/btrfs_start_ordered_extent()
avoid writeback for those read page related call sites?

I know it's not your responsibility to bother with those subpage cases,
but if you can give some ideas/advice, it would be great.

Thanks,
Qu

> [...]
Filipe Manana Feb. 5, 2025, 10:43 a.m. UTC | #4
On Wed, Feb 5, 2025 at 8:51 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2025/2/5 18:54, Filipe Manana wrote:
> > On Wed, Feb 5, 2025 at 5:15 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> [...]
> >> I'm not confident enough about this lock, as it can cross several pages.
> >>
> >> E.g. if we have the following case, 4K page size 4K block size.
> >>
> >>           0       4K      8K      12K     16K
> >>           |       |///////|               |
> >>
> >> And the range [4K, 8K) is uptodate; it has been submitted for writeback
> >> and the writeback has finished.
> >> But it still has an ordered extent, not yet finished.
> >>
> >> Then we go into the following call chain:
> >> btrfs_lock_and_flush_ordered_range()
> >> |- btrfs_start_ordered_extent()
> >>      |- filemap_fdatawrite_range()
> >>
> >> Which will try to write back the range [4K, 8K) again.
> >
> > But you just said above that the writeback has finished...
>
> You're right, for the writeback finished case, the page is no longer
> dirty and write path will not try to hold the lock.
>
> So it's still only possible for subpage with partial uptodate page support.
> Not possible for the current cases.
>
> >
> >> But since the folio at [4K, 8K) is going to be passed to
> >> btrfs_do_readpage(), it should already have been locked.
> >>
> >> Thus the writeback will never be able to lock the folio, thus causing a
> >> deadlock.
> >
> > readahead (and readpage) are never called for pages that are already
> > in the page cache,
> > so how can that happen?
>
> Mind being more explicit about the readahead requirement?
> E.g. if we have a folio that's dirty but not uptodate, what will happen?

I would say flushing (start writeback and wait for it to complete) is
being triggered elsewhere outside btrfs,
otherwise we would always deadlock there, no matter if we range lock
only the range for that page or a wider range that includes the page.

And we haven't observed that for ages (for the non subpage case at least).

>
> I know readapge can be called on page that's already in the page cache,
> especially for subpage cases (although not for now).

So you're saying it can, after some patchset that is not yet merged.

>
> E.g. 16K page size, 4K block size.
>
>     0       4K       8K      12K     16K
>     |       |////////|               |
>
> Where [4K, 8K) is dirtied by buffered write, which lands exactly at [4K,
> 8K).
>
> It's not possible for now, because at buffered write we will read out
> the whole page if the write is not page aligned.
>
> But XFS/EXT4 allow skipping the full page read, and just dirtying the block
> aligned range, and I'm making btrfs support that.
> (https://lore.kernel.org/linux-btrfs/cover.1736848277.git.wqu@suse.com/)
>
> So in that case, if we hit such a dirty but not uptodate page, will
> readahead still try to read it, since the folio is not uptodate?
>
> >
> > And why do you think that is specific to the case where the range
> > covers more than one page?
> > I.e. why doesn't the current scenario of locking only [4K, 8K)
> > lead to that?
>
> Right, the deadlock is really limited to the subpage + partial uptodate
> case.
>
> But I really believe, without such subpage mindset, we are never going
> to support larger data folios.

I get it that it's frustrating and it's unfortunate, but I'm working
with the infrastructure we currently have and with a regression fix
and backport in mind.

>
> >
> >>
> >> Or did I miss something specific to readahead that makes it avoid
> >> already uptodate pages?
> >
> > It will skip pages already in the page cache (whether uptodate, under
> > writeback, etc).
> >
> > If you look closer, what this patch is doing is restoring the
> > behaviour we had before
> > commit ac325fc2aad5 ("btrfs: do not hold the extent lock for entire
> > read"), with the
> > exception of doing the unlock in readahead/readpage instead of when
> > submitted bios complete.
>
> I'm fine reverting to the old behavior, and so far it looks good to me.
>
> So I can give it a reviewed-by:
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
>
>
>
> My concern is, with the multi-folio lock back, what is the proper way to
> avoid the subpage deadlock for subpage partial uptodate folio, and
> future larger data folio?
>
>
> My current fix will not be feasible, as it relies on the single folio
> pointer to determine if btrfs_start_ordered_extent() should skip
> writeback, and now we have the multi-folio range back.
>
> Maybe we should just make
> btrfs_lock_and_flush_ordered_range()/btrfs_start_ordered_extent()
> avoid writeback for those read page related call sites?

Maybe.
I can't advise because I haven't been actively involved in large folio
support or subpage support,
and it would take quite a lot of time for me to grok all we have and
the current problems (plus I
don't have access to a machine to test subpage code).

> I know it's not your responsibility to bother with those subpage cases,
> but if you can give some ideas/advice, it would be great.
>
> Thanks,
> Qu
>
> [...]

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 838ab2b6ed43..bdb9816bf125 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -898,7 +898,6 @@  static struct extent_map *get_extent_map(struct btrfs_inode *inode,
 					 u64 len, struct extent_map **em_cached)
 {
 	struct extent_map *em;
-	struct extent_state *cached_state = NULL;
 
 	ASSERT(em_cached);
 
@@ -914,14 +913,12 @@  static struct extent_map *get_extent_map(struct btrfs_inode *inode,
 		*em_cached = NULL;
 	}
 
-	btrfs_lock_and_flush_ordered_range(inode, start, start + len - 1, &cached_state);
 	em = btrfs_get_extent(inode, folio, start, len);
 	if (!IS_ERR(em)) {
 		BUG_ON(*em_cached);
 		refcount_inc(&em->refs);
 		*em_cached = em;
 	}
-	unlock_extent(&inode->io_tree, start, start + len - 1, &cached_state);
 
 	return em;
 }
@@ -1078,11 +1075,18 @@  static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 
 int btrfs_read_folio(struct file *file, struct folio *folio)
 {
+	struct btrfs_inode *inode = folio_to_inode(folio);
+	const u64 start = folio_pos(folio);
+	const u64 end = start + folio_size(folio) - 1;
+	struct extent_state *cached_state = NULL;
 	struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
 	struct extent_map *em_cached = NULL;
 	int ret;
 
+	btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
 	ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
+	unlock_extent(&inode->io_tree, start, end, &cached_state);
+
 	free_extent_map(em_cached);
 
 	/*
@@ -2377,12 +2381,20 @@  void btrfs_readahead(struct readahead_control *rac)
 {
 	struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ | REQ_RAHEAD };
 	struct folio *folio;
+	struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
+	const u64 start = readahead_pos(rac);
+	const u64 end = start + readahead_length(rac) - 1;
+	struct extent_state *cached_state = NULL;
 	struct extent_map *em_cached = NULL;
 	u64 prev_em_start = (u64)-1;
 
+	btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
+
 	while ((folio = readahead_folio(rac)) != NULL)
 		btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
 
+	unlock_extent(&inode->io_tree, start, end, &cached_state);
+
 	if (em_cached)
 		free_extent_map(em_cached);
 	submit_one_bio(&bio_ctrl);
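
For reference, assembling the hunks above, the two read-path entry points
end up with the following locking shape (reconstructed from the diff;
unchanged surrounding code elided):

int btrfs_read_folio(struct file *file, struct folio *folio)
{
	struct btrfs_inode *inode = folio_to_inode(folio);
	const u64 start = folio_pos(folio);
	const u64 end = start + folio_size(folio) - 1;
	struct extent_state *cached_state = NULL;
	struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
	struct extent_map *em_cached = NULL;
	int ret;

	/* Flush ordered extents and hold the lock across the whole folio. */
	btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
	ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
	unlock_extent(&inode->io_tree, start, end, &cached_state);

	free_extent_map(em_cached);
	/* ... */
}

void btrfs_readahead(struct readahead_control *rac)
{
	struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ | REQ_RAHEAD };
	struct folio *folio;
	struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
	const u64 start = readahead_pos(rac);
	const u64 end = start + readahead_length(rac) - 1;
	struct extent_state *cached_state = NULL;
	struct extent_map *em_cached = NULL;
	u64 prev_em_start = (u64)-1;

	/* One lock-and-flush covering the entire readahead window. */
	btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);

	while ((folio = readahead_folio(rac)) != NULL)
		btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);

	unlock_extent(&inode->io_tree, start, end, &cached_state);

	if (em_cached)
		free_extent_map(em_cached);
	submit_one_bio(&bio_ctrl);
}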