diff mbox series

btrfs: fix a rare race between metadata endio and eb freeing

Message ID 20210607090258.253660-1-wqu@suse.com (mailing list archive)
State New, archived
Headers show
Series btrfs: fix a rare race between metadata endio and eb freeing | expand

Commit Message

Qu Wenruo June 7, 2021, 9:02 a.m. UTC
[BUG]
There is a very rare ASSERT() triggering during full fstests run for
subpage rw support.

No extra reproduce so far.

The ASSERT() get triggered for metadata read in
btrfs_page_set_uptodate() inside end_page_read().

[CAUSE]
There is still a small race window for metadata only, the race could
happen like this:

                T1                  |              T2
------------------------------------+-----------------------------
end_bio_extent_readpage()           |
|- btrfs_validate_metadata_buffer() |
|  |- free_extent_buffer()          |
|     Still have 2 refs             |
|- end_page_read()                  |
   |- if (unlikely(PagePrivate())   |
   |  The page still has Private    |
   |                                | free_extent_buffer()
   |                                | |  Only one ref 1, will be
   |                                | |  released
   |                                | |- detach_extent_buffer_page()
   |                                |    |- btrfs_detach_subpage()
   |- btrfs_set_page_uptodate()     |
      The page no longer has Private|
      >>> ASSERT() triggered <<<    |

This race window is super small, thus pretty hard to hit, even with so
many runs of fstests.

But the race window is still there, we have to go another way to solve
it other than replying on random PagePrivate() check.

Data path is not affected, as data path will lock the page before read,
while unlock the page after the last read has finished, thus no race
window.

[FIX]
This patch will fix the bug by re-purpose btrfs_subpage::readers.

Now btrfs_subpage::readers will be a member shared by both metadata and
data.

For metadata path, we don't do the page unlock as metadata only relies on
extent locking.

At the same time, teach page_range_has_eb() to take into
btrfs_subpage::readers into consideration.

So that even if the last eb of a page get freed, page::private won't be
detached as long as there is still pending end_page_read() calls.

By this we eliminate the race window, this will slight increase the
metadata memory usage, as the page may not be released as frequently as
usual.
But it should not be a big deal.

Fixes: ***SPACEHOLDER**** ("btrfs: submit read time repair only for each corrupted sector")
Signed-off-by: Qu Wegruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 30 +++++++++---------------------
 fs/btrfs/subpage.c   | 19 +++++++++++++++----
 fs/btrfs/subpage.h   |  9 ++++++++-
 3 files changed, 32 insertions(+), 26 deletions(-)

Comments

David Sterba June 7, 2021, 5:29 p.m. UTC | #1
On Mon, Jun 07, 2021 at 05:02:58PM +0800, Qu Wenruo wrote:
> [BUG]
> There is a very rare ASSERT() triggering during full fstests run for
> subpage rw support.
> 
> No extra reproduce so far.
> 
> The ASSERT() get triggered for metadata read in
> btrfs_page_set_uptodate() inside end_page_read().
> 
> [CAUSE]
> There is still a small race window for metadata only, the race could
> happen like this:
> 
>                 T1                  |              T2
> ------------------------------------+-----------------------------
> end_bio_extent_readpage()           |
> |- btrfs_validate_metadata_buffer() |
> |  |- free_extent_buffer()          |
> |     Still have 2 refs             |
> |- end_page_read()                  |
>    |- if (unlikely(PagePrivate())   |
>    |  The page still has Private    |
>    |                                | free_extent_buffer()
>    |                                | |  Only one ref 1, will be
>    |                                | |  released
>    |                                | |- detach_extent_buffer_page()
>    |                                |    |- btrfs_detach_subpage()
>    |- btrfs_set_page_uptodate()     |
>       The page no longer has Private|
>       >>> ASSERT() triggered <<<    |
> 
> This race window is super small, thus pretty hard to hit, even with so
> many runs of fstests.
> 
> But the race window is still there, we have to go another way to solve
> it other than replying on random PagePrivate() check.
> 
> Data path is not affected, as data path will lock the page before read,
> while unlock the page after the last read has finished, thus no race
> window.
> 
> [FIX]
> This patch will fix the bug by re-purpose btrfs_subpage::readers.
> 
> Now btrfs_subpage::readers will be a member shared by both metadata and
> data.
> 
> For metadata path, we don't do the page unlock as metadata only relies on
> extent locking.
> 
> At the same time, teach page_range_has_eb() to take into
> btrfs_subpage::readers into consideration.
> 
> So that even if the last eb of a page get freed, page::private won't be
> detached as long as there is still pending end_page_read() calls.
> 
> By this we eliminate the race window, this will slight increase the
> metadata memory usage, as the page may not be released as frequently as
> usual.
> But it should not be a big deal.
> 
> Fixes: ***SPACEHOLDER**** ("btrfs: submit read time repair only for each corrupted sector")

This won't work, I won't refresh this patch each time I rebase
misc-next on a new rc. A reference by subject should be sufficient.
David Sterba June 7, 2021, 5:55 p.m. UTC | #2
On Mon, Jun 07, 2021 at 05:02:58PM +0800, Qu Wenruo wrote:
> [BUG]
> There is a very rare ASSERT() triggering during full fstests run for
> subpage rw support.
> 
> No extra reproduce so far.
> 
> The ASSERT() get triggered for metadata read in
> btrfs_page_set_uptodate() inside end_page_read().
> 
> [CAUSE]
> There is still a small race window for metadata only, the race could
> happen like this:
> 
>                 T1                  |              T2
> ------------------------------------+-----------------------------
> end_bio_extent_readpage()           |
> |- btrfs_validate_metadata_buffer() |
> |  |- free_extent_buffer()          |
> |     Still have 2 refs             |
> |- end_page_read()                  |
>    |- if (unlikely(PagePrivate())   |
>    |  The page still has Private    |
>    |                                | free_extent_buffer()
>    |                                | |  Only one ref 1, will be
>    |                                | |  released
>    |                                | |- detach_extent_buffer_page()
>    |                                |    |- btrfs_detach_subpage()
>    |- btrfs_set_page_uptodate()     |
>       The page no longer has Private|
>       >>> ASSERT() triggered <<<    |
> 
> This race window is super small, thus pretty hard to hit, even with so
> many runs of fstests.
> 
> But the race window is still there, we have to go another way to solve
> it other than replying on random PagePrivate() check.
> 
> Data path is not affected, as data path will lock the page before read,
> while unlock the page after the last read has finished, thus no race
> window.
> 
> [FIX]
> This patch will fix the bug by re-purpose btrfs_subpage::readers.
> 
> Now btrfs_subpage::readers will be a member shared by both metadata and
> data.
> 
> For metadata path, we don't do the page unlock as metadata only relies on
> extent locking.
> 
> At the same time, teach page_range_has_eb() to take into
> btrfs_subpage::readers into consideration.
> 
> So that even if the last eb of a page get freed, page::private won't be
> detached as long as there is still pending end_page_read() calls.
> 
> By this we eliminate the race window, this will slight increase the
> metadata memory usage, as the page may not be released as frequently as
> usual.
> But it should not be a big deal.
> 
> Fixes: ***SPACEHOLDER**** ("btrfs: submit read time repair only for each corrupted sector")
> Signed-off-by: Qu Wegruo <wqu@suse.com>

Added to the subpage patch series in misc-next, with a note about the
commit that introduced it. The reasoning that the race is rare and we
want the description separate in case the approach is revisited.
diff mbox series

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 77b59ca93419..2d3df12a2471 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2689,21 +2689,6 @@  static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 	ASSERT(page_offset(page) <= start &&
 	       start + len <= page_offset(page) + PAGE_SIZE);
 
-	/*
-	 * For subapge metadata case, all btrfs_page_* helpers need page to
-	 * have page::private populated.
-	 * But we can have rare case where the last eb in the page is only
-	 * referred by the IO, and it gets released immediately after it's
-	 * read and verified.
-	 *
-	 * This can detach the page private completely.
-	 * In that case, we can just skip the page status update completely,
-	 * as the page has no eb anymore.
-	 */
-	if (fs_info->sectorsize < PAGE_SIZE && unlikely(!PagePrivate(page))) {
-		ASSERT(!is_data_inode(page->mapping->host));
-		return;
-	}
 	if (uptodate) {
 		btrfs_page_set_uptodate(fs_info, page, start, len);
 	} else {
@@ -2713,11 +2698,7 @@  static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 
 	if (fs_info->sectorsize == PAGE_SIZE)
 		unlock_page(page);
-	else if (is_data_inode(page->mapping->host))
-		/*
-		 * For subpage data, unlock the page if we're the last reader.
-		 * For subpage metadata, page lock is not utilized for read.
-		 */
+	else
 		btrfs_subpage_end_reader(fs_info, page, start, len);
 }
 
@@ -5694,6 +5675,12 @@  static bool page_range_has_eb(struct btrfs_fs_info *fs_info, struct page *page)
 		subpage = (struct btrfs_subpage *)page->private;
 		if (atomic_read(&subpage->eb_refs))
 			return true;
+		/*
+		 * Even there is no eb refs here, we may still have
+		 * end_page_read() call relying on page::private.
+		 */
+		if (atomic_read(&subpage->readers))
+			return true;
 	}
 	return false;
 }
@@ -5754,7 +5741,7 @@  static void detach_extent_buffer_page(struct extent_buffer *eb, struct page *pag
 
 	/*
 	 * We can only detach the page private if there are no other ebs in the
-	 * page range.
+	 * page range and no unfinished IO.
 	 */
 	if (!page_range_has_eb(fs_info, page))
 		btrfs_detach_subpage(fs_info, page);
@@ -6472,6 +6459,7 @@  static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
 	check_buffer_tree_ref(eb);
 	btrfs_subpage_clear_error(fs_info, page, eb->start, eb->len);
 
+	btrfs_subpage_start_reader(fs_info, page, eb->start, eb->len);
 	ret = submit_extent_page(REQ_OP_READ | REQ_META, NULL, &bio_ctrl,
 				 page, eb->start, eb->len,
 				 eb->start - page_offset(page),
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 552410fba0bd..f8ebd1cfb025 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -3,6 +3,7 @@ 
 #include <linux/slab.h>
 #include "ctree.h"
 #include "subpage.h"
+#include "btrfs_inode.h"
 
 /*
  * Subpage (sectorsize < PAGE_SIZE) support overview:
@@ -185,12 +186,10 @@  void btrfs_subpage_start_reader(const struct btrfs_fs_info *fs_info,
 {
 	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
 	const int nbits = len >> fs_info->sectorsize_bits;
-	int ret;
 
 	btrfs_subpage_assert(fs_info, page, start, len);
 
-	ret = atomic_add_return(nbits, &subpage->readers);
-	ASSERT(ret == nbits);
+	atomic_add(nbits, &subpage->readers);
 }
 
 void btrfs_subpage_end_reader(const struct btrfs_fs_info *fs_info,
@@ -198,10 +197,22 @@  void btrfs_subpage_end_reader(const struct btrfs_fs_info *fs_info,
 {
 	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
 	const int nbits = len >> fs_info->sectorsize_bits;
+	bool is_data;
+	bool last;
 
 	btrfs_subpage_assert(fs_info, page, start, len);
+	is_data = is_data_inode(page->mapping->host);
 	ASSERT(atomic_read(&subpage->readers) >= nbits);
-	if (atomic_sub_and_test(nbits, &subpage->readers))
+	last = atomic_sub_and_test(nbits, &subpage->readers);
+
+	/*
+	 * For data we need to unlock the page if the last read has finished.
+	 *
+	 * And please don't replace @last with atomic_sub_and_test() call
+	 * inside if () condition.
+	 * As we want the atomic_sub_and_test() to be always executed.
+	 */
+	if (is_data && last)
 		unlock_page(page);
 }
 
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 7188e9d2fbea..c122a54c2ddb 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -22,6 +22,14 @@  struct btrfs_subpage {
 	u16 error_bitmap;
 	u16 dirty_bitmap;
 	u16 writeback_bitmap;
+	/*
+	 * Both data and metadata needs to trace how many readers are for the
+	 * page.
+	 * Data relies on @readers to unlock the page when last reader finished.
+	 * While metadata doesn't need page unlock, it needs to prevent
+	 * page::private get cleared before the last end_page_read().
+	 */
+	atomic_t readers;
 	union {
 		/*
 		 * Structures only used by metadata
@@ -32,7 +40,6 @@  struct btrfs_subpage {
 		atomic_t eb_refs;
 		/* Structures only used by data */
 		struct {
-			atomic_t readers;
 			atomic_t writers;
 
 			/* If a sector has pending ordered extent */