diff mbox series

[v2,3/8] btrfs: wait on async extents when flushing delalloc

Message ID 0ee87e54d0f14f0628d146e09fef34db2ce73e03.1624974951.git.josef@toxicpanda.com (mailing list archive)
State New, archived
Headers show
Series ENOSPC delalloc flushing fixes | expand

Commit Message

Josef Bacik June 29, 2021, 1:59 p.m. UTC
I've been debugging an early ENOSPC problem in production and finally
root caused it to this problem.  When we switched to the per-inode in
38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in
shrink_delalloc") I pulled out the async extent handling, because we
were doing the correct thing by calling filemap_flush() if we had async
extents set.  This would properly wait on any async extents by locking
the page in the second flush, thus making sure our ordered extents were
properly set up.

However when I switched us back to page based flushing, I used
sync_inode(), which allows us to pass in our own wbc.  The problem here
is that sync_inode() is smarter than the filemap_* helpers, it tries to
avoid calling writepages at all.  This means that our second call could
skip calling do_writepages altogether, and thus not wait on the pagelock
for the async helpers.  This means we could come back before any ordered
extents were created and then simply continue on in our flushing
mechanisms and ENOSPC out when we have plenty of space to use.

Fix this by putting back the async pages logic in shrink_delalloc.  This
allows us to bulk write out everything that we need to, and then we can
wait in one place for the async helpers to catch up, and then wait on
any ordered extents that are created.

Fixes: e076ab2a2ca7 ("btrfs: shrink delalloc pages instead of full inodes")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c      |  4 ----
 fs/btrfs/space-info.c | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+), 4 deletions(-)

Comments

Nikolay Borisov July 7, 2021, 11:09 a.m. UTC | #1
On 29.06.21 г. 16:59, Josef Bacik wrote:
> I've been debugging an early ENOSPC problem in production and finally
> root caused it to this problem.  When we switched to the per-inode in
> 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in
> shrink_delalloc") I pulled out the async extent handling, because we
> were doing the correct thing by calling filemap_flush() if we had async
> extents set.  This would properly wait on any async extents by locking
> the page in the second flush, thus making sure our ordered extents were
> properly set up.
> 
> However when I switched us back to page based flushing, I used
> sync_inode(), which allows us to pass in our own wbc.  The problem here
> is that sync_inode() is smarter than the filemap_* helpers, it tries to
> avoid calling writepages at all.  This means that our second call could
> skip calling do_writepages altogether, and thus not wait on the pagelock
> for the async helpers.  This means we could come back before any ordered
> extents were created and then simply continue on in our flushing
> mechanisms and ENOSPC out when we have plenty of space to use.
> 
> Fix this by putting back the async pages logic in shrink_delalloc.  This
> allows us to bulk write out everything that we need to, and then we can
> wait in one place for the async helpers to catch up, and then wait on
> any ordered extents that are created.
> 
> Fixes: e076ab2a2ca7 ("btrfs: shrink delalloc pages instead of full inodes")
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

This patch really depend on the next one in order for it to be correct.
Imo this dependency should be explicitly stated in the change log and
the patches re-ordered.
diff mbox series

Patch

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e6eb20987351..b1f02e3fea5d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9714,10 +9714,6 @@  static int start_delalloc_inodes(struct btrfs_root *root,
 					 &work->work);
 		} else {
 			ret = sync_inode(inode, wbc);
-			if (!ret &&
-			    test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
-				     &BTRFS_I(inode)->runtime_flags))
-				ret = sync_inode(inode, wbc);
 			btrfs_add_delayed_iput(inode);
 			if (ret || wbc->nr_to_write <= 0)
 				goto out;
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 0c539a94c6d9..f140a89a3cdd 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -534,9 +534,49 @@  static void shrink_delalloc(struct btrfs_fs_info *fs_info,
 	while ((delalloc_bytes || ordered_bytes) && loops < 3) {
 		u64 temp = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
 		long nr_pages = min_t(u64, temp, LONG_MAX);
+		int async_pages;
 
 		btrfs_start_delalloc_roots(fs_info, nr_pages, true);
 
+		/*
+		 * We need to make sure any outstanding async pages are now
+		 * processed before we continue.  This is because things like
+		 * sync_inode() try to be smart and skip writing if the inode is
+		 * marked clean.  We don't use filemap_fwrite for flushing
+		 * because we want to control how many pages we write out at a
+		 * time, thus this is the only safe way to make sure we've
+		 * waited for outstanding compressed workers to have started
+		 * their jobs and thus have ordered extents set up properly.
+		 *
+		 * This exists because we do not want to wait for each
+		 * individual inode to finish its async work, we simply want to
+		 * start the IO on everybody, and then come back here and wait
+		 * for all of the async work to catch up.  Once we're done with
+		 * that we know we'll have ordered extents for everything and we
+		 * can decide if we wait for that or not.
+		 *
+		 * If we choose to replace this in the future, make absolutely
+		 * sure that the proper waiting is being done in the async case,
+		 * as there have been bugs in that area before.
+		 */
+		async_pages = atomic_read(&fs_info->async_delalloc_pages);
+		if (!async_pages)
+			goto skip_async;
+
+		/*
+		 * We don't want to wait forever, if we wrote less pages in this
+		 * loop than we have outstanding, only wait for that number of
+		 * pages, otherwise we can wait for all async pages to finish
+		 * before continuing.
+		 */
+		if (async_pages > nr_pages)
+			async_pages -= nr_pages;
+		else
+			async_pages = 0;
+		wait_event(fs_info->async_submit_wait,
+			   atomic_read(&fs_info->async_delalloc_pages) <=
+			   async_pages);
+skip_async:
 		loops++;
 		if (wait_ordered && !trans) {
 			btrfs_wait_ordered_roots(fs_info, items, 0, (u64)-1);