[v4,4/5] btrfs: load block group size class when caching

Message ID	aea46ca71307f875c7ed4042c3c8d11def37dbbe.1671149056.git.boris@bur.io (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-btrfs-owner@kernel.org> Feedback-ID: i083147f8:Fastmail From: Boris Burkov <boris@bur.io> To: linux-btrfs@vger.kernel.org, kernel-team@fb.com Subject: [PATCH v4 4/5] btrfs: load block group size class when caching Date: Thu, 15 Dec 2022 16:06:34 -0800 Message-Id: <aea46ca71307f875c7ed4042c3c8d11def37dbbe.1671149056.git.boris@bur.io> In-Reply-To: <cover.1671149056.git.boris@bur.io> References: <cover.1671149056.git.boris@bur.io> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	btrfs: data block group size classes \| expand [v4,0/5] btrfs: data block group size classes [v4,1/5] btrfs: use ffe_ctl in btrfs allocator tracepoints [v4,2/5] btrfs: add more ffe tracepoints [v4,3/5] btrfs: introduce size class to block group allocator [v4,4/5] btrfs: load block group size class when caching [v4,5/5] btrfs: dont use size classes for zoned file systems

Message ID

aea46ca71307f875c7ed4042c3c8d11def37dbbe.1671149056.git.boris@bur.io (mailing list archive)

State

New, archived

Headers

Feedback-ID: i083147f8:Fastmail
From: Boris Burkov <boris@bur.io>
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH v4 4/5] btrfs: load block group size class when caching
Date: Thu, 15 Dec 2022 16:06:34 -0800
Message-Id: 
 <aea46ca71307f875c7ed4042c3c8d11def37dbbe.1671149056.git.boris@bur.io>
In-Reply-To: <cover.1671149056.git.boris@bur.io>
References: <cover.1671149056.git.boris@bur.io>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

btrfs: data block group size classes | expand

Commit Message

Boris Burkov Dec. 16, 2022, 12:06 a.m. UTC

Since the size class is an artifact of an arbitrary anti fragmentation
strategy, it doesn't really make sense to persist it. Furthermore, most
of the size class logic assumes fresh block groups. That is of course
not a reasonable assumption -- we will be upgrading kernels with
existing filesystems whose block groups are not classified.

To work around those issues, implement logic to compute the size class
of the block groups as we cache them in. To perfectly assess the state
of a block group, we would have to read the entire extent tree (since
the free space cache mashes together contiguous extent items) which
would be prohibitively expensive for larger file systems with more
extents.

We can do it relatively cheaply by implementing a simple heuristic of
sampling a handful of extents and picking the smallest one we see. In
the happy case where the block group was classified, we will only see
extents of the correct size. In the unhappy case, we will hopefully find
one of the smaller extents, but there is no perfect answer anyway.
Autorelocation will eventually churn up the block group if there is
significant free-ing anyway.

The work is done in the caching thread but after marking the block group
cached, as we tradeoff classification accuracy vs. slowing down
allocations.

There was no regression in mount performance at end state of the fsperf
test suite.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/block-group.c | 130 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 130 insertions(+)

Comments

kernel test robot Dec. 16, 2022, 8:07 a.m. UTC | #1

Hi Boris,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kdave/for-next]
[also build test ERROR on linus/master next-20221216]
[cannot apply to v6.1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Boris-Burkov/btrfs-data-block-group-size-classes/20221216-080728
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
patch link:    https://lore.kernel.org/r/aea46ca71307f875c7ed4042c3c8d11def37dbbe.1671149056.git.boris%40bur.io
patch subject: [PATCH v4 4/5] btrfs: load block group size class when caching
config: i386-randconfig-a014
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/ac6e5ea20edac6e086c09fb7ef45c15467ce23d9
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Boris-Burkov/btrfs-data-block-group-size-classes/20221216-080728
        git checkout ac6e5ea20edac6e086c09fb7ef45c15467ce23d9
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   ld: fs/btrfs/block-group.o: in function `sample_block_group_extent_item':
>> fs/btrfs/block-group.c:584: undefined reference to `__udivdi3'


vim +584 fs/btrfs/block-group.c

   542	
   543	/*
   544	 * Get an arbitrary extent item index / max_index through the block group
   545	 *
   546	 * @block_group: the block group to sample from
   547	 * @index: the integral step through the block group to grab from
   548	 * @max_index: the granularity of the sampling
   549	 * @key: return value parameter for the item we find
   550	 *
   551	 * pre-conditions on indices:
   552	 * 0 <= index <= max_index
   553	 * 0 < max_index
   554	 *
   555	 * Returns: 0 on success, 1 if the search didn't yield a useful item, negative
   556	 * error code on error.
   557	 */
   558	static int sample_block_group_extent_item(struct btrfs_block_group *block_group,
   559						  int index, int max_index,
   560						  struct btrfs_key *key)
   561	{
   562		struct btrfs_fs_info *fs_info = block_group->fs_info;
   563		struct btrfs_root *extent_root;
   564		int ret = 0;
   565		u64 search_offset;
   566		struct btrfs_path *path;
   567	
   568		ASSERT(index >= 0);
   569		ASSERT(index <= max_index);
   570		ASSERT(max_index > 0);
   571	
   572		path = btrfs_alloc_path();
   573		if (!path)
   574			return -ENOMEM;
   575	
   576		down_read(&fs_info->commit_root_sem);
   577		extent_root = btrfs_extent_root(fs_info, max_t(u64, block_group->start,
   578							       BTRFS_SUPER_INFO_OFFSET));
   579	
   580		path->skip_locking = 1;
   581		path->search_commit_root = 1;
   582		path->reada = READA_FORWARD;
   583	
 > 584		search_offset = index * (block_group->length / max_index);
   585		key->objectid = block_group->start + search_offset;
   586		key->offset = 0;
   587		key->type = BTRFS_EXTENT_ITEM_KEY;
   588	
   589		ret = btrfs_search_slot(NULL, extent_root, key, path, 0, 0);
   590		if (ret != 0)
   591			goto out;
   592		if (key->objectid < block_group->start ||
   593		    key->objectid > block_group->start + block_group->length) {
   594			ret = 1;
   595			goto out;
   596		}
   597		if (key->type != BTRFS_EXTENT_ITEM_KEY) {
   598			ret = 1;
   599			goto out;
   600		}
   601	out:
   602		btrfs_free_path(path);
   603		up_read(&fs_info->commit_root_sem);
   604		return ret;
   605	}
   606

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index fa1ab56fe6b3..75d5f952067a 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -540,6 +540,134 @@  u64 add_new_free_space(struct btrfs_block_group *block_group, u64 start, u64 end
 	return total_added;
 }
 
+/*
+ * Get an arbitrary extent item index / max_index through the block group
+ *
+ * @block_group: the block group to sample from
+ * @index: the integral step through the block group to grab from
+ * @max_index: the granularity of the sampling
+ * @key: return value parameter for the item we find
+ *
+ * pre-conditions on indices:
+ * 0 <= index <= max_index
+ * 0 < max_index
+ *
+ * Returns: 0 on success, 1 if the search didn't yield a useful item, negative
+ * error code on error.
+ */
+static int sample_block_group_extent_item(struct btrfs_block_group *block_group,
+					  int index, int max_index,
+					  struct btrfs_key *key)
+{
+	struct btrfs_fs_info *fs_info = block_group->fs_info;
+	struct btrfs_root *extent_root;
+	int ret = 0;
+	u64 search_offset;
+	struct btrfs_path *path;
+
+	ASSERT(index >= 0);
+	ASSERT(index <= max_index);
+	ASSERT(max_index > 0);
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	down_read(&fs_info->commit_root_sem);
+	extent_root = btrfs_extent_root(fs_info, max_t(u64, block_group->start,
+						       BTRFS_SUPER_INFO_OFFSET));
+
+	path->skip_locking = 1;
+	path->search_commit_root = 1;
+	path->reada = READA_FORWARD;
+
+	search_offset = index * (block_group->length / max_index);
+	key->objectid = block_group->start + search_offset;
+	key->offset = 0;
+	key->type = BTRFS_EXTENT_ITEM_KEY;
+
+	ret = btrfs_search_slot(NULL, extent_root, key, path, 0, 0);
+	if (ret != 0)
+		goto out;
+	if (key->objectid < block_group->start ||
+	    key->objectid > block_group->start + block_group->length) {
+		ret = 1;
+		goto out;
+	}
+	if (key->type != BTRFS_EXTENT_ITEM_KEY) {
+		ret = 1;
+		goto out;
+	}
+out:
+	btrfs_free_path(path);
+	up_read(&fs_info->commit_root_sem);
+	return ret;
+}
+
+/*
+ * Best effort attempt to compute a block group's size class while caching it.
+ *
+ * @block_group: the block group we are caching
+ *
+ * We cannot infer the size class while adding free space extents, because that
+ * logic doesn't care about contiguous file extents (it doesn't differentiate
+ * between a 100M extent and 100 contiguous 1M extents). So we need to read the
+ * file extent items. Reading all of them is quite wasteful, because usually
+ * only a handful are enough to give a good answer. Therefore, we just grab 5 of
+ * them at even steps through the block group and pick the smallest size class
+ * we see. Since size class is best effort, and not guaranteed in general,
+ * inaccuracy is acceptable.
+ *
+ * To be more explicit about why this algorithm makes sense:
+ *
+ * If we are caching in a block group from disk, then there are three major cases
+ * to consider:
+ * 1. the block group is well behaved and all extents in it are the same size
+ * class.
+ * 2. the block group is mostly one size class with rare exceptions for last
+ * ditch allocations
+ * 3. the block group was populated before size classes and can have a totally
+ * arbitrary mix of size classes.
+ *
+ * In case 1, looking at any extent in the block group will yield the correct
+ * result. For the mixed cases, taking the minimum size class seems like a good
+ * approximation, since gaps from frees will be usable to the size class. For
+ * 2., a small handful of file extents is likely to yield the right answer. For
+ * 3, we can either read every file extent, or admit that this is best effort
+ * anyway and try to stay fast.
+ *
+ * Returns: 0 on success, negative error code on error.
+ */
+static int load_block_group_size_class(struct btrfs_block_group *block_group)
+{
+	struct btrfs_key key;
+	int i;
+	u64 min_size = block_group->length;
+	enum btrfs_block_group_size_class size_class = BTRFS_BG_SZ_NONE;
+	int ret;
+
+	if (!btrfs_is_block_group_data_only(block_group))
+		return 0;
+
+	for (i = 0; i < 5; ++i) {
+		ret = sample_block_group_extent_item(block_group, i, 5, &key);
+		if (ret < 0)
+			goto out;
+		if (ret > 0)
+			continue;
+		min_size = min_t(u64, min_size, key.offset);
+		size_class = btrfs_calc_block_group_size_class(min_size);
+	}
+	if (size_class != BTRFS_BG_SZ_NONE) {
+		spin_lock(&block_group->lock);
+		block_group->size_class = size_class;
+		spin_unlock(&block_group->lock);
+	}
+
+out:
+	return ret;
+}
+
 static int load_extent_tree_free(struct btrfs_caching_control *caching_ctl)
 {
 	struct btrfs_block_group *block_group = caching_ctl->block_group;
@@ -739,6 +867,8 @@  static noinline void caching_thread(struct btrfs_work *work)
 
 	wake_up(&caching_ctl->wait);
 
+	load_block_group_size_class(block_group);
+
 	btrfs_put_caching_control(caching_ctl);
 	btrfs_put_block_group(block_group);
 }

[v4,4/5] btrfs: load block group size class when caching

Commit Message

Comments

Patch