| Message ID | 24617a89550bed4ef0e0db12d17187940de551b0.1738685146.git.fdmanana@suse.com (mailing list archive) |
|---|---|
| State | New |
| Series | btrfs: fix stale page cache after race between readahead and direct IO write |
On 2025/2/5 02:46, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> After commit ac325fc2aad5 ("btrfs: do not hold the extent lock for entire
> read") we can now trigger a race between a task doing a direct IO write
> and readahead. When this race is triggered it results in tasks getting
> stale data when they attempt to do a buffered read (including the task
> that did the direct IO write).
>
> This race can be sporadically triggered with test case generic/418,
> failing like this:
>
> $ ./check generic/418
> FSTYP -- btrfs
> PLATFORM -- Linux/x86_64 debian0 6.13.0-rc7-btrfs-next-185+ #17 SMP PREEMPT_DYNAMIC Mon Feb 3 12:28:46 WET 2025
> MKFS_OPTIONS -- /dev/sdc
> MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
>
> generic/418 14s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad)
> --- tests/generic/418.out 2020-06-10 19:29:03.850519863 +0100
> +++ /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad 2025-02-03 15:42:36.974609476 +0000
> @@ -1,2 +1,5 @@
> QA output created by 418
> +cmpbuf: offset 0: Expected: 0x1, got 0x0
> +[6:0] FAIL - comparison failed, offset 24576
> +diotest -wp -b 4096 -n 8 -i 4 failed at loop 3
> Silence is golden
> ...
> (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/418.out /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad' to see the entire diff)
> Ran: generic/418
> Failures: generic/418
> Failed 1 of 1 tests
>
> The race happens like this:
>
> 1) A file has a prealloc extent for the range [16K, 28K);
>
> 2) Task A starts a direct IO write against file range [24K, 28K).
>    At the start of the direct IO write it invalidates the page cache at
>    __iomap_dio_rw() with kiocb_invalidate_pages() for the 4K page at file
>    offset 24K;
>
> 3) Task A enters btrfs_dio_iomap_begin() and locks the extent range
>    [24K, 28K);
>
> 4) Task B starts a readahead for file range [16K, 28K), entering
>    btrfs_readahead().
>
>    First it attempts to read the page at offset 16K by entering
>    btrfs_do_readpage(), where it calls get_extent_map(), locks the range
>    [16K, 20K) and gets the extent map for the range [16K, 28K), caching
>    it into the 'em_cached' variable declared in the local stack of
>    btrfs_readahead(), and then unlocks the range [16K, 20K).
>
>    Since the extent map has the prealloc flag, at btrfs_do_readpage() we
>    zero out the page's content and don't submit any bio to read the page
>    from the extent.
>
>    Then it attempts to read the page at offset 20K entering
>    btrfs_do_readpage() where we reuse the previously cached extent map
>    (decided by get_extent_map()) since it spans the page's range and
>    it's still in the inode's extent map tree.
>
>    Just like for the previous page, we zero out the page's content since
>    the extent map has the prealloc flag set.
>
>    Then it attempts to read the page at offset 24K entering
>    btrfs_do_readpage() where we reuse the previously cached extent map
>    (decided by get_extent_map()) since it spans the page's range and
>    it's still in the inode's extent map tree.
>
>    Just like for the previous pages, we zero out the page's content since
>    the extent map has the prealloc flag set.
>    Note that we didn't lock the extent range [24K, 28K), so we didn't
>    synchronize with the ongoing direct IO write being performed by task A;
>
> 5) Task A enters btrfs_create_dio_extent() and creates an ordered extent
>    for the range [24K, 28K), with the flags BTRFS_ORDERED_DIRECT and
>    BTRFS_ORDERED_PREALLOC set;
>
> 6) Task A unlocks the range [24K, 28K) at btrfs_dio_iomap_begin();
>
> 7) The ordered extent enters btrfs_finish_one_ordered() and locks the
>    range [24K, 28K);
>
> 8) Task A enters fs/iomap/direct-io.c:iomap_dio_complete() and it tries
>    to invalidate the page at offset 24K by calling
>    kiocb_invalidate_post_direct_write(), resulting in a call chain that
>    ends up at btrfs_release_folio().
>
>    The btrfs_release_folio() call ends up returning false because the
>    range for the page at file offset 24K is currently locked by the task
>    doing the ordered extent completion in the previous step (7), so we
>    have:
>
>    btrfs_release_folio() ->
>      __btrfs_release_folio() ->
>        try_release_extent_mapping() ->
>          try_release_extent_state()
>
>    This last function checks that the range is locked and returns false,
>    which gets propagated up to btrfs_release_folio().
>
>    So this results in a failure to invalidate the page and
>    kiocb_invalidate_post_direct_write() triggers this message logged in
>    dmesg:
>
>    Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!
>
>    After this we leave the page cache with stale data for the file range
>    [24K, 28K), filled with zeroes instead of the data written by the
>    direct IO write (all bytes with a 0x01 value), so any task attempting
>    to read with buffered IO, including the task that did the direct IO
>    write, will get all bytes in the range with a 0x00 value instead of
>    the written data.
>
> Fix this by locking the range, with btrfs_lock_and_flush_ordered_range(),
> at the two callers of btrfs_do_readpage() instead of doing it at
> get_extent_map(), just like we did before commit ac325fc2aad5 ("btrfs: do
> not hold the extent lock for entire read"), and unlocking the range after
> all the calls to btrfs_do_readpage(). This way we never reuse a cached
> extent map without flushing any pending ordered extents from a concurrent
> direct IO write.
>
> Fixes: ac325fc2aad5 ("btrfs: do not hold the extent lock for entire read")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/extent_io.c | 18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 838ab2b6ed43..bdb9816bf125 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -898,7 +898,6 @@ static struct extent_map *get_extent_map(struct btrfs_inode *inode,
>  			       u64 len, struct extent_map **em_cached)
>  {
>  	struct extent_map *em;
> -	struct extent_state *cached_state = NULL;
>
>  	ASSERT(em_cached);
>
> @@ -914,14 +913,12 @@ static struct extent_map *get_extent_map(struct btrfs_inode *inode,
>  		*em_cached = NULL;
>  	}
>
> -	btrfs_lock_and_flush_ordered_range(inode, start, start + len - 1, &cached_state);
>  	em = btrfs_get_extent(inode, folio, start, len);
>  	if (!IS_ERR(em)) {
>  		BUG_ON(*em_cached);
>  		refcount_inc(&em->refs);
>  		*em_cached = em;
>  	}
> -	unlock_extent(&inode->io_tree, start, start + len - 1, &cached_state);
>
>  	return em;
>  }
> @@ -1078,11 +1075,18 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
>
>  int btrfs_read_folio(struct file *file, struct folio *folio)
>  {
> +	struct btrfs_inode *inode = folio_to_inode(folio);
> +	const u64 start = folio_pos(folio);
> +	const u64 end = start + folio_size(folio) - 1;

And great, you're already taking larger data folios into consideration.

> +	struct extent_state *cached_state = NULL;
>  	struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
>  	struct extent_map *em_cached = NULL;
>  	int ret;
>
> +	btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);

This change is going to conflict with the subpage + partial uptodate
page optimization (which will also be the foundation for larger data folios):
https://lore.kernel.org/linux-btrfs/1d230d53e92c4f4a7ea4ce036f226b8daa16e5ae.1736848277.git.wqu@suse.com/

I'm fine with updating the patch on my side after yours gets merged.

>  	ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
> +	unlock_extent(&inode->io_tree, start, end, &cached_state);
> +
>  	free_extent_map(em_cached);
>
>  	/*
> @@ -2377,12 +2381,20 @@ void btrfs_readahead(struct readahead_control *rac)
>  {
>  	struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ | REQ_RAHEAD };
>  	struct folio *folio;
> +	struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
> +	const u64 start = readahead_pos(rac);
> +	const u64 end = start + readahead_length(rac) - 1;
> +	struct extent_state *cached_state = NULL;
>  	struct extent_map *em_cached = NULL;
>  	u64 prev_em_start = (u64)-1;
>
> +	btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
> +

I'm not confident enough about this lock, as it can cross several pages.

E.g. if we have the following case, 4K page size, 4K block size:

  0        4K       8K       12K      16K
  |        |////////|        |        |

And the range [4K, 8K) is uptodate, it has been submitted for writeback,
and the writeback has finished.
But it still has an ordered extent, not yet finished.

Then we go into the following call chain:

btrfs_lock_and_write_flush()
|- btrfs_start_ordered_extent()
   |- filemap_fdatawrite_range()

Which will try to write back the range [4K, 8K) again.

But since the folio at [4K, 8K) is going to be passed to
btrfs_do_readpage(), it should already have been locked.

Thus the writeback will never be able to lock the folio, thus causing a
deadlock.

Or did I miss something specific to readahead so it will avoid readahead
on already uptodate pages?
And even if it will not cause the above deadlock, I'm not confident the
mentioned subpage fix conflict can be properly fixed.
In the subpage fix, I can only add a folio parameter to
btrfs_lock_and_flush_ordered_range(), but in this case we have multiple
folios, so how can we avoid the same subpage deadlock then?

Thanks,
Qu

>  	while ((folio = readahead_folio(rac)) != NULL)
>  		btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
>
> +	unlock_extent(&inode->io_tree, start, end, &cached_state);
> +
>  	if (em_cached)
>  		free_extent_map(em_cached);
>  	submit_one_bio(&bio_ctrl);
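[Editorial note] The race described in the quoted commit message can be sketched from user space. Below is an illustrative, timing-dependent reproducer (not from this thread): the file name, the offsets and the use of posix_fadvise() to trigger readahead are assumptions, and generic/418's diotest exercises the same pattern far more aggressively.

/*
 * Direct IO write into a preallocated range racing with readahead over the
 * same range, followed by a buffered read of the written block.
 *
 * Build: cc -O2 -o dio-ra-race dio-ra-race.c -lpthread
 * Run it from a directory on a btrfs mount; being a race, it may need many
 * iterations to trigger on an affected kernel.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PREALLOC_OFF 16384	/* prealloc extent for [16K, 28K) */
#define PREALLOC_LEN 12288
#define DIO_OFF      24576	/* direct IO write of [24K, 28K) */
#define DIO_LEN      4096

static void *trigger_readahead(void *arg)
{
	int fd = *(int *)arg;

	/* Ask the kernel to read ahead [16K, 28K) while the write runs. */
	posix_fadvise(fd, PREALLOC_OFF, PREALLOC_LEN, POSIX_FADV_WILLNEED);
	return NULL;
}

int main(void)
{
	int dio_fd = open("testfile", O_CREAT | O_RDWR | O_DIRECT, 0644);
	int buf_fd = open("testfile", O_RDONLY);
	unsigned char rbuf[DIO_LEN];
	pthread_t thread;
	void *wbuf;

	if (dio_fd < 0 || buf_fd < 0)
		return 1;

	/* Preallocate [16K, 28K); reads of that range are served as zeroes. */
	fallocate(dio_fd, 0, PREALLOC_OFF, PREALLOC_LEN);

	if (posix_memalign(&wbuf, 4096, DIO_LEN))
		return 1;
	memset(wbuf, 0x01, DIO_LEN);

	pthread_create(&thread, NULL, trigger_readahead, &buf_fd);
	/* Direct IO write of 0x01 bytes into [24K, 28K). */
	pwrite(dio_fd, wbuf, DIO_LEN, DIO_OFF);
	pthread_join(thread, NULL);

	/* Buffered read; with the bug the page cache may still hold zeroes. */
	pread(buf_fd, rbuf, DIO_LEN, DIO_OFF);
	printf("byte at offset 24K: 0x%02x (expected 0x01)\n", rbuf[0]);

	free(wbuf);
	close(buf_fd);
	close(dio_fd);
	return 0;
}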
On Wed, Feb 5, 2025 at 5:15 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > > 在 2025/2/5 02:46, fdmanana@kernel.org 写道: > > From: Filipe Manana <fdmanana@suse.com> > > > > After commit ac325fc2aad5 ("btrfs: do not hold the extent lock for entire > > read") we can now trigger a race between a task doing a direct IO write > > and readahead. When this race is triggered it results in tasks getting > > stale data when they attempt do a buffered read (including the task that > > did the direct IO write). > > > > This race can be sporadically triggered with test case generic/418, failing > > like this: > > > > $ ./check generic/418 > > FSTYP -- btrfs > > PLATFORM -- Linux/x86_64 debian0 6.13.0-rc7-btrfs-next-185+ #17 SMP PREEMPT_DYNAMIC Mon Feb 3 12:28:46 WET 2025 > > MKFS_OPTIONS -- /dev/sdc > > MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1 > > > > generic/418 14s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad) > > --- tests/generic/418.out 2020-06-10 19:29:03.850519863 +0100 > > +++ /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad 2025-02-03 15:42:36.974609476 +0000 > > @@ -1,2 +1,5 @@ > > QA output created by 418 > > +cmpbuf: offset 0: Expected: 0x1, got 0x0 > > +[6:0] FAIL - comparison failed, offset 24576 > > +diotest -wp -b 4096 -n 8 -i 4 failed at loop 3 > > Silence is golden > > ... > > (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/418.out /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad' to see the entire diff) > > Ran: generic/418 > > Failures: generic/418 > > Failed 1 of 1 tests > > > > The race happens like this: > > > > 1) A file has a prealloc extent for the range [16K, 28K); > > > > 2) Task A starts a direct IO write against file range [24K, 28K). > > At the start of the direct IO write it invalidates the page cache at > > __iomap_dio_rw() with kiocb_invalidate_pages() for the 4K page at file > > offset 24K; > > > > 3) Task A enters btrfs_dio_iomap_begin() and locks the extent range > > [24K, 28K); > > > > 4) Task B starts a readahead for file range [16K, 28K), entering > > btrfs_readahead(). > > > > First it attempts to read the page at offset 16K by entering > > btrfs_do_readpage(), where it calls get_extent_map(), locks the range > > [16K, 20K) and gets the extent map for the range [16K, 28K), caching > > it into the 'em_cached' variable declared in the local stack of > > btrfs_readahead(), and then unlocks the range [16K, 20K). > > > > Since the extent map has the prealloc flag, at btrfs_do_readpage() we > > zero out the page's content and don't submit any bio to read the page > > from the extent. > > > > Then it attempts to read the page at offset 20K entering > > btrfs_do_readpage() where we reuse the previously cached extent map > > (decided by get_extent_map()) since it spans the page's range and > > it's still in the inode's extent map tree. > > > > Just like for the previous page, we zero out the page's content since > > the extent map has the prealloc flag set. > > > > Then it attempts to read the page at offset 24K entering > > btrfs_do_readpage() where we reuse the previously cached extent map > > (decided by get_extent_map()) since it spans the page's range and > > it's still in the inode's extent map tree. > > > > Just like for the previous pages, we zero out the page's content since > > the extent map has the prealloc flag set. 
Note that we didn't lock the > > extent range [24K, 28K), so we didn't synchronize with the ongoig > > direct IO write being performed by task A; > > > > 5) Task A enters btrfs_create_dio_extent() and creates an ordered extent > > for the range [24K, 28K), with the flags BTRFS_ORDERED_DIRECT and > > BTRFS_ORDERED_PREALLOC set; > > > > 6) Task A unlocks the range [24K, 28K) at btrfs_dio_iomap_begin(); > > > > 7) The ordered extent enters btrfs_finish_one_ordered() and locks the > > range [24K, 28K); > > > > 8) Task A enters fs/iomap/direct-io.c:iomap_dio_complete() and it tries > > to invalidate the page at offset 24K by calling > > kiocb_invalidate_post_direct_write(), resulting in a call chain that > > ends up at btrfs_release_folio(). > > > > The btrfs_release_folio() call ends up returning false because the range > > for the page at file offset 24K is currently locked by the task doing > > the ordered extent completion in the previous step (7), so we have: > > > > btrfs_release_folio() -> > > __btrfs_release_folio() -> > > try_release_extent_mapping() -> > > try_release_extent_state() > > > > This last function checking that the range is locked and returning false > > and propagating it up to btrfs_release_folio(). > > > > So this results in a failure to invalidate the page and > > kiocb_invalidate_post_direct_write() triggers this message logged in > > dmesg: > > > > Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O! > > > > After this we leave the page cache with stale data for the file range > > [24K, 28K), filled with zeroes instead of the data written by direct IO > > write (all bytes with a 0x01 value), so any task attempting to read with > > buffered IO, including the task that did the direct IO write, will get > > all bytes in the range with a 0x00 value instead of the written data. > > > > Fix this by locking the range, with btrfs_lock_and_flush_ordered_range(), > > at the two callers of btrfs_do_readpage() instead of doing it at > > get_extent_map(), just like we did before commit ac325fc2aad5 ("btrfs: do > > not hold the extent lock for entire read"), and unlocking the range after > > all the calls to btrfs_do_readpage(). This way we never reuse a cached > > extent map without flushing any pending ordered extents from a concurrent > > direct IO write. 
> > > > Fixes: ac325fc2aad5 ("btrfs: do not hold the extent lock for entire read") > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > --- > > fs/btrfs/extent_io.c | 18 +++++++++++++++--- > > 1 file changed, 15 insertions(+), 3 deletions(-) > > > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > > index 838ab2b6ed43..bdb9816bf125 100644 > > --- a/fs/btrfs/extent_io.c > > +++ b/fs/btrfs/extent_io.c > > @@ -898,7 +898,6 @@ static struct extent_map *get_extent_map(struct btrfs_inode *inode, > > u64 len, struct extent_map **em_cached) > > { > > struct extent_map *em; > > - struct extent_state *cached_state = NULL; > > > > ASSERT(em_cached); > > > > @@ -914,14 +913,12 @@ static struct extent_map *get_extent_map(struct btrfs_inode *inode, > > *em_cached = NULL; > > } > > > > - btrfs_lock_and_flush_ordered_range(inode, start, start + len - 1, &cached_state); > > em = btrfs_get_extent(inode, folio, start, len); > > if (!IS_ERR(em)) { > > BUG_ON(*em_cached); > > refcount_inc(&em->refs); > > *em_cached = em; > > } > > - unlock_extent(&inode->io_tree, start, start + len - 1, &cached_state); > > > > return em; > > } > > @@ -1078,11 +1075,18 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached, > > > > int btrfs_read_folio(struct file *file, struct folio *folio) > > { > > + struct btrfs_inode *inode = folio_to_inode(folio); > > + const u64 start = folio_pos(folio); > > + const u64 end = start + folio_size(folio) - 1; > > And great you're already taking larger data folio into consideration. > > > + struct extent_state *cached_state = NULL; > > struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ }; > > struct extent_map *em_cached = NULL; > > int ret; > > > > + btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state); > > This change is going to conflict with the subpage + partial uptodate > page optimization (which will also be the foundation for larger data folio): > https://lore.kernel.org/linux-btrfs/1d230d53e92c4f4a7ea4ce036f226b8daa16e5ae.1736848277.git.wqu@suse.com/ > > I'm fine to update the patch on my side after yours got merged. I'm sorry about that, but it's a regression so it has to come first. > > > ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL); > > + unlock_extent(&inode->io_tree, start, end, &cached_state); > > + > > free_extent_map(em_cached); > > > > /* > > @@ -2377,12 +2381,20 @@ void btrfs_readahead(struct readahead_control *rac) > > { > > struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ | REQ_RAHEAD }; > > struct folio *folio; > > + struct btrfs_inode *inode = BTRFS_I(rac->mapping->host); > > + const u64 start = readahead_pos(rac); > > + const u64 end = start + readahead_length(rac) - 1; > > + struct extent_state *cached_state = NULL; > > struct extent_map *em_cached = NULL; > > u64 prev_em_start = (u64)-1; > > > > + btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state); > > + > > I'm not confident enough about this lock, as it can cross several pages. > > E.g. if we have the following case, 4K page size 4K block size. > > 0 4K 8K 12K 16K > | |///////| | > > And range [4K, 8K) is uptodate, and it has been submitted for writeback, > and finished writeback. > But it still have ordered extent, not yet finished. > > Then we go into the following call chain: > btrfs_lock_and_write_flush() > |- btrfs_start_ordered_extent() > |- filemap_fdatawrite_range() > > Which will try to writeback the range [4K, 8K) again. But you just said above that the writeback has finished... 
> But since the folio at [4K, 8K) is going to be passed to
> btrfs_do_readpage(), it should already have been locked.
>
> Thus the writeback will never be able to lock the folio, thus causing a
> deadlock.

readahead (and readpage) are never called for pages that are already
in the page cache, so how can that happen?

And why do you think that is specific to the case where the range
covers more than one page?
I.e. why doesn't the current scenario of locking only [4K, 8K) lead to
that?

>
> Or did I miss something specific to readahead so it will avoid readahead
> on already uptodate pages?

It will skip pages already in the page cache (whether uptodate, under
writeback, etc).

If you look closer, what this patch is doing is restoring the behaviour
we had before commit ac325fc2aad5 ("btrfs: do not hold the extent lock
for entire read"), with the exception of doing the unlock in
readahead/readpage instead of when submitted bios complete.

If there was a possibility for such a deadlock, we would almost
certainly have noticed it before, as that has been the behaviour for
ages until that commit that landed last summer.

>
> And even if it will not cause the above deadlock, I'm not confident the
> mentioned subpage fix conflict can be properly fixed.
> In the subpage fix, I can only add a folio parameter to
> btrfs_lock_and_flush_ordered_range(), but in this case, we have multiple
> folios, how can we avoid the same subpage deadlock then?

About the subpage case I have no idea.
But given that readahead/readpage is never called for pages already in
the page cache, I don't think there should be a problem even for the
subpage scenario.

We also have many places where we lock ranges wider than 1 page, so I
don't see why this specific place would be a problem.

Thanks.

>
> Thanks,
> Qu
>
> >  	while ((folio = readahead_folio(rac)) != NULL)
> >  		btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
> >
> > +	unlock_extent(&inode->io_tree, start, end, &cached_state);
> > +
> >  	if (em_cached)
> >  		free_extent_map(em_cached);
> >  	submit_one_bio(&bio_ctrl);
>
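[Editorial note] The behaviour relied on above — that readahead only ever hands brand-new, not-yet-cached folios to btrfs_do_readpage() — can be illustrated with a small user-space model. This is an editorial sketch, not kernel code; the real logic lives in mm/readahead.c and mm/filemap.c and is considerably more involved.

/*
 * Simplified model: readahead walks the requested index range and only
 * allocates and reads folios for indices not already present in the page
 * cache, so an already-cached (e.g. dirty or locked) folio is skipped and
 * never reaches the filesystem's readpage hook.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define NR_PAGES 8

static bool page_cached[NR_PAGES];	/* stand-in for the mapping's xarray */

/* Stand-in for btrfs_do_readpage(): only ever sees newly added folios. */
static void do_readpage(size_t index)
{
	printf("reading page %zu from disk\n", index);
}

static void readahead_range(size_t start, size_t nr)
{
	for (size_t i = start; i < start + nr && i < NR_PAGES; i++) {
		if (page_cached[i])
			continue;		/* already cached: skipped entirely */
		page_cached[i] = true;		/* add a new locked folio... */
		do_readpage(i);			/* ...and read it */
	}
}

int main(void)
{
	page_cached[1] = true;		/* e.g. a page dirtied by a buffered write */
	readahead_range(0, 4);		/* reads pages 0, 2 and 3; skips page 1 */
	return 0;
}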
On 2025/2/5 18:54, Filipe Manana wrote:
> On Wed, Feb 5, 2025 at 5:15 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
[...]
>> I'm not confident enough about this lock, as it can cross several pages.
>>
>> E.g. if we have the following case, 4K page size, 4K block size:
>>
>>   0        4K       8K       12K      16K
>>   |        |////////|        |        |
>>
>> And the range [4K, 8K) is uptodate, it has been submitted for writeback,
>> and the writeback has finished.
>> But it still has an ordered extent, not yet finished.
>>
>> Then we go into the following call chain:
>>
>> btrfs_lock_and_write_flush()
>> |- btrfs_start_ordered_extent()
>>    |- filemap_fdatawrite_range()
>>
>> Which will try to write back the range [4K, 8K) again.
>
> But you just said above that the writeback has finished...

You're right, for the writeback-finished case the page is no longer
dirty and the write path will not try to hold the lock.

So it's still only possible for subpage with partial uptodate page
support, not possible for the current cases.

>
>> But since the folio at [4K, 8K) is going to be passed to
>> btrfs_do_readpage(), it should already have been locked.
>>
>> Thus the writeback will never be able to lock the folio, thus causing a
>> deadlock.
>
> readahead (and readpage) are never called for pages that are already
> in the page cache, so how can that happen?

Mind being more explicit about the readahead requirement?
E.g. if we have a folio that's dirty but not uptodate, what will happen?

I know readpage can be called on a page that's already in the page cache,
especially for subpage cases (although not for now).

E.g. 16K page size, 4K block size:

  0        4K       8K       12K      16K
  |        |////////|        |        |

Where [4K, 8K) is dirtied by a buffered write, which lands exactly at
[4K, 8K).

It's not possible for now, because at buffered write time we will read
out the whole page if the write is not page aligned.

But XFS/ext4 allow skipping the full page read and just dirtying the
block-aligned range, and I'm making btrfs support that:
https://lore.kernel.org/linux-btrfs/cover.1736848277.git.wqu@suse.com/

So in that case, if we hit such a dirty but not uptodate page, will
readahead still try to read it, since the folio is not uptodate?

>
> And why do you think that is specific to the case where the range
> covers more than one page?
> I.e. why doesn't the current scenario of locking only [4K, 8K) lead to
> that?

Right, the deadlock is really limited to the subpage + partial uptodate
case.

But I really believe that, without such a subpage mindset, we are never
going to support larger data folios.

>
>>
>> Or did I miss something specific to readahead so it will avoid readahead
>> on already uptodate pages?
>
> It will skip pages already in the page cache (whether uptodate, under
> writeback, etc).
>
> If you look closer, what this patch is doing is restoring the behaviour
> we had before commit ac325fc2aad5 ("btrfs: do not hold the extent lock
> for entire read"), with the exception of doing the unlock in
> readahead/readpage instead of when submitted bios complete.

I'm fine reverting to the old behavior, and so far it looks good to me.

So I can give it a reviewed-by:

Reviewed-by: Qu Wenruo <wqu@suse.com>

My concern is: with the multi-folio lock back, what is the proper way to
avoid the subpage deadlock for subpage partial uptodate folios, and
future larger data folios?

My current fix will not be feasible, as it relies on the single folio
pointer to determine if btrfs_start_ordered_extent() should skip
writeback, and now we have a multi-folio range back.
Maybe we should just make btrfs_lock_and_flush_ordered_range() /
btrfs_start_ordered_extent() avoid writeback for those read-page-related
call sites?

I know it's not your responsibility to bother with those subpage cases,
but if you can give some ideas/advice, it would be great.

Thanks,
Qu

>
> If there was a possibility for such a deadlock, we would almost
> certainly have noticed it before, as that has been the behaviour for
> ages until that commit that landed last summer.
>
>>
>> And even if it will not cause the above deadlock, I'm not confident the
>> mentioned subpage fix conflict can be properly fixed.
>> In the subpage fix, I can only add a folio parameter to
>> btrfs_lock_and_flush_ordered_range(), but in this case, we have multiple
>> folios, how can we avoid the same subpage deadlock then?
>
> About the subpage case I have no idea.
> But given that readahead/readpage is never called for pages already in
> the page cache, I don't think there should be a problem even for the
> subpage scenario.
>
> We also have many places where we lock ranges wider than 1 page, so I
> don't see why this specific place would be a problem.
>
> Thanks.
>
>>
>> Thanks,
>> Qu
>>
>>>  	while ((folio = readahead_folio(rac)) != NULL)
>>>  		btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
>>>
>>> +	unlock_extent(&inode->io_tree, start, end, &cached_state);
>>> +
>>>  	if (em_cached)
>>>  		free_extent_map(em_cached);
>>>  	submit_one_bio(&bio_ctrl);
>>
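[Editorial note] To make the subpage concern above concrete, here is a tiny editorial model (user-space C, all names illustrative, not kernel code) of the self-deadlock Qu expects once a dirty but not fully uptodate folio can reach the read path: the read path already holds the folio lock when it flushes the ordered range, and if that flush has to write back the same folio, it needs the lock its own caller is holding. Trylock is used here so the program reports the situation instead of hanging.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t folio_lock = PTHREAD_MUTEX_INITIALIZER;
static bool folio_dirty = true;		/* dirty but not fully uptodate folio */

/* Models filemap_fdatawrite_range(): writing a folio back needs its lock. */
static void writeback_folio(void)
{
	if (pthread_mutex_trylock(&folio_lock) != 0) {
		printf("would deadlock: folio is already locked by the reader\n");
		return;
	}
	folio_dirty = false;
	pthread_mutex_unlock(&folio_lock);
}

/* Models the flush step done before locking the ordered range. */
static void lock_and_flush_range(void)
{
	if (folio_dirty)
		writeback_folio();
}

int main(void)
{
	/* The read path (readpage/readahead) already holds the folio lock. */
	pthread_mutex_lock(&folio_lock);
	lock_and_flush_range();
	pthread_mutex_unlock(&folio_lock);
	return 0;
}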
On Wed, Feb 5, 2025 at 8:51 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > > 在 2025/2/5 18:54, Filipe Manana 写道: > > On Wed, Feb 5, 2025 at 5:15 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > [...] > >> I'm not confident enough about this lock, as it can cross several pages. > >> > >> E.g. if we have the following case, 4K page size 4K block size. > >> > >> 0 4K 8K 12K 16K > >> | |///////| | > >> > >> And range [4K, 8K) is uptodate, and it has been submitted for writeback, > >> and finished writeback. > >> But it still have ordered extent, not yet finished. > >> > >> Then we go into the following call chain: > >> btrfs_lock_and_write_flush() > >> |- btrfs_start_ordered_extent() > >> |- filemap_fdatawrite_range() > >> > >> Which will try to writeback the range [4K, 8K) again. > > > > But you just said above that the writeback has finished... > > You're right, for the writeback finished case, the page is no longer > dirty and write path will not try to hold the lock. > > So it's still only possible for subpage with partial uptodate page support. > Not possible for the current cases. > > > > >> But since the folio at [4K, 8K) is going to be passed to > >> btrfs_do_readpage(), it should already have been locked. > >> > >> Thus the writeback will never be able to lock the folio, thus causing a > >> deadlock. > > > > readahead (and readpage) are never called for pages that are already > > in the page cache, > > so how can that happen? > > Mind to be more explicit on the readahead requirement? > E.g. if we have a folio that's dirty but not uptodate, what will happen? I would say flushing (start writeback and wait for it to complete) is being triggered elsewhere outside btrfs, otherwise we would always deadlock there, no matter if we range lock only the range for that page or a wider range that includes the page. And we haven't observed that for ages (for the non subpage case at least). > > I know readapge can be called on page that's already in the page cache, > especially for subpage cases (although not for now). So you're saying it can after some patchset that is not yet merged. > > E.g. 16K page size, 4K block size. > > 0 4K 8K 12K 16K > | |////////| | > > Where [4K, 8K) is dirtied by buffered write, which lands exactly at [4K, > 8K). > > It's not possible for now, because at buffered write we will read out > the whole page if the write is not page aligned. > > But XFS/EXT4 allows to skip the full page read, and just dirty the block > aligned range, and I'm making btrfs to support that. > (https://lore.kernel.org/linux-btrfs/cover.1736848277.git.wqu@suse.com/) > > So in that case, if we hit a such dirty but not uptodate page, will > readahead still try to read it, because the folio is not uptodate. > > > > > And why do you think that is specific to the case where the range > > covers more than one page? > > I.e. why doesn't the current scenario of locking only [4K, 8K) doesn't > > lead to that? > > Right, the dead lock is really limited to the subpage + partial uptodate > case. > > But I really believe, without such subpage mindset, we are never going > to support larger data folios. I get it that it's frustrating and it's unfortunate, but I'm working with the infrastructure we currently have and with a regression fix and backport in mind. > > > > >> > >> Or did I miss something specific to readahead so it will avoid readahead > >> on already uptodate pages? > > > > It will skip pages already in the page cache (whether uptodate, under > > writeback, etc). 
> > > > If you look closer, what this patch is doing is restoring the > > behaviour we had before > > commit ac325fc2aad5 ("btrfs: do not hold the extent lock for entire > > read"), with the > > exception of doing the unlock in readahead/readpage instead of when > > submitted bios complete. > > I'm fine reverting to the old behavior, and so far it looks good to me. > > So I can give it a reviewed-by: > > Reviewed-by: Qu Wenruo <wqu@suse.com> > > > > My concern is, with the multi-folio lock back, what is the proper way to > avoid the subpage deadlock for subpage partial uptodate folio, and > future larger data folio? > > > My current fix will not be feasible, as it relies on the single folio > pointer, to determine if btrfs_start_ordered_extent() should skip > writeback, and now we have multi-folio range back. > > Maybe we should just make those > btrfs_lock_and_flush_ordered_range()/btrfs_start_ordered_extent() to > avoid writeback for those read page related call sites? Maybe. I can't advise because I haven't been actively involved in large folio support or subpage support, and it would take quite a lot of time for me to grok all we have and the current problems (plus I don't have access to a machine to test subpage code). > > I know it's not your responsibility to bother those subpage cases, but > if you can give some ideas/advices, it would be great. > > Thanks, > Qu > > > > > If there was a possibility for such a deadlock, we would almost > > certainly have noticed it before, > > as that has been the behaviour for ages until that commit that landed > > last summer. > > > > > >> > >> > >> And even if it will not cause the above deadlock, I'm not confident the > >> mentioned subpage fix conflict can be proper fixed. > >> In the subpage fix, I can only add a folio parameter to > >> btrfs_lock_and_flush_ordered_range(), but in this case, we have multiple > >> folios, how can we avoid the same subpage dead lock then? > > > > About the subpage case I have no idea. > > But given that readahead/readpage is never called for pages already in > > the page cache, I don't think there should be a problem even for > > subpage scenario. > > > > We also have many places where we lock ranges wider than 1 page, so I > > don't see why this specific place would be a problem. > > > > Thanks. > > > >> > >> Thanks, > >> Qu > >> > >>> while ((folio = readahead_folio(rac)) != NULL) > >>> btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start); > >>> > >>> + unlock_extent(&inode->io_tree, start, end, &cached_state); > >>> + > >>> if (em_cached) > >>> free_extent_map(em_cached); > >>> submit_one_bio(&bio_ctrl); > >> >
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 838ab2b6ed43..bdb9816bf125 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -898,7 +898,6 @@ static struct extent_map *get_extent_map(struct btrfs_inode *inode,
 			       u64 len, struct extent_map **em_cached)
 {
 	struct extent_map *em;
-	struct extent_state *cached_state = NULL;
 
 	ASSERT(em_cached);
 
@@ -914,14 +913,12 @@ static struct extent_map *get_extent_map(struct btrfs_inode *inode,
 		*em_cached = NULL;
 	}
 
-	btrfs_lock_and_flush_ordered_range(inode, start, start + len - 1, &cached_state);
 	em = btrfs_get_extent(inode, folio, start, len);
 	if (!IS_ERR(em)) {
 		BUG_ON(*em_cached);
 		refcount_inc(&em->refs);
 		*em_cached = em;
 	}
-	unlock_extent(&inode->io_tree, start, start + len - 1, &cached_state);
 
 	return em;
 }
@@ -1078,11 +1075,18 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 
 int btrfs_read_folio(struct file *file, struct folio *folio)
 {
+	struct btrfs_inode *inode = folio_to_inode(folio);
+	const u64 start = folio_pos(folio);
+	const u64 end = start + folio_size(folio) - 1;
+	struct extent_state *cached_state = NULL;
 	struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
 	struct extent_map *em_cached = NULL;
 	int ret;
 
+	btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
 	ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
+	unlock_extent(&inode->io_tree, start, end, &cached_state);
+
 	free_extent_map(em_cached);
 
 	/*
@@ -2377,12 +2381,20 @@ void btrfs_readahead(struct readahead_control *rac)
 {
 	struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ | REQ_RAHEAD };
 	struct folio *folio;
+	struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
+	const u64 start = readahead_pos(rac);
+	const u64 end = start + readahead_length(rac) - 1;
+	struct extent_state *cached_state = NULL;
 	struct extent_map *em_cached = NULL;
 	u64 prev_em_start = (u64)-1;
 
+	btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
+
 	while ((folio = readahead_folio(rac)) != NULL)
 		btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
 
+	unlock_extent(&inode->io_tree, start, end, &cached_state);
+
 	if (em_cached)
 		free_extent_map(em_cached);
 	submit_one_bio(&bio_ctrl);