[v2] btrfs: use atomic64_t for free_objectid

Message ID 20250303182139.256498-1-maharmstone@fb.com (mailing list archive)
State New
Series [v2] btrfs: use atomic64_t for free_objectid

Commit Message

Mark Harmstone March 3, 2025, 6:21 p.m. UTC
Currently btrfs_get_free_objectid() uses a mutex to protect
free_objectid; I'm guessing this was because of the inode cache that we
used to have. The inode cache is no more, so simplify things by
replacing it with an atomic.

There are no issues with ordering: free_objectid gets set to an initial
value, then calls to btrfs_get_free_objectid() return a monotonically
increasing value.

This change means that btrfs_get_free_objectid() will no longer
potentially sleep, which was a blocker for adding a non-blocking mode
for inode and subvol creation.
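
As a rough user-space analogue of the resulting allocation pattern (a
minimal sketch using C11 atomics in place of the kernel's
atomic64_inc_return(); the function name here is hypothetical):

	#include <stdatomic.h>
	#include <stdint.h>

	static _Atomic uint64_t free_objectid;

	/*
	 * Every caller gets back a unique, monotonically increasing id,
	 * and no caller can block another, so there is nothing to sleep
	 * on.
	 */
	static uint64_t get_free_objectid(void)
	{
		return atomic_fetch_add(&free_objectid, 1);
	}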

This change moves the warning in btrfs_get_free_objectid() out of the lock.
Integer overflow isn't a plausible problem here; there's no way to create
inodes with an arbitrary number, short of hex-editing the block device,
and incrementing a uint64_t a billion times a second would take >500 years
for overflow to happen.
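
For reference, the arithmetic behind that figure, assuming a constant
10^9 increments per second:

	2^64 / 10^9 per second ~= 1.8 * 10^10 seconds ~= 584 years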

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/ctree.h    |  4 +---
 fs/btrfs/disk-io.c  | 43 ++++++++++++++++++-------------------------
 fs/btrfs/qgroup.c   | 11 ++++++-----
 fs/btrfs/tree-log.c |  3 ---
 4 files changed, 25 insertions(+), 36 deletions(-)

Comments

David Sterba March 4, 2025, 9:52 a.m. UTC | #1
On Mon, Mar 03, 2025 at 06:21:16PM +0000, Mark Harmstone wrote:
> Currently btrfs_get_free_objectid() uses a mutex to protect
> free_objectid; I'm guessing this was because of the inode cache that we
> used to have. The inode cache is no more, so simplify things by
> replacing it with an atomic.
> 
> There are no issues with ordering: free_objectid gets set to an initial
> value, then calls to btrfs_get_free_objectid() return a monotonically
> increasing value.
> 
> This change means that btrfs_get_free_objectid() will no longer
> potentially sleep, which was a blocker for adding a non-blocking mode
> for inode and subvol creation.
> 
> This change moves the warning in btrfs_get_free_objectid() out of the lock.
> Integer overflow isn't a plausible problem here; there's no way to create
> inodes with an arbitrary number, short of hex-editing the block device,
> and incrementing a uint64_t a billion times a second would take >500 years
> for overflow to happen.
> 
> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> ---
>  fs/btrfs/ctree.h    |  4 +---
>  fs/btrfs/disk-io.c  | 43 ++++++++++++++++++-------------------------
>  fs/btrfs/qgroup.c   | 11 ++++++-----
>  fs/btrfs/tree-log.c |  3 ---
>  4 files changed, 25 insertions(+), 36 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 075a06db43a1..23adbce4c516 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -179,8 +179,6 @@ struct btrfs_root {
>  	struct btrfs_fs_info *fs_info;
>  	struct extent_io_tree dirty_log_pages;
>  
> -	struct mutex objectid_mutex;
> -
>  	spinlock_t accounting_lock;
>  	struct btrfs_block_rsv *block_rsv;
>  
> @@ -214,7 +212,7 @@ struct btrfs_root {
>  
>  	u64 last_trans;
>  
> -	u64 free_objectid;
> +	atomic64_t free_objectid;
>  
>  	struct btrfs_key defrag_progress;
>  	struct btrfs_key defrag_max;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 52c2335ef62f..1597a8945f4c 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -657,7 +657,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
>  	RB_CLEAR_NODE(&root->rb_node);
>  
>  	btrfs_set_root_last_trans(root, 0);
> -	root->free_objectid = 0;
> +	atomic64_set(&root->free_objectid, 0);
>  	root->nr_delalloc_inodes = 0;
>  	root->nr_ordered_extents = 0;
>  	xa_init(&root->inodes);
> @@ -676,7 +676,6 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
>  	spin_lock_init(&root->ordered_extent_lock);
>  	spin_lock_init(&root->accounting_lock);
>  	spin_lock_init(&root->qgroup_meta_rsv_lock);
> -	mutex_init(&root->objectid_mutex);
>  	mutex_init(&root->log_mutex);
>  	mutex_init(&root->ordered_extent_mutex);
>  	mutex_init(&root->delalloc_mutex);
> @@ -1132,16 +1131,12 @@ static int btrfs_init_fs_root(struct btrfs_root *root, dev_t anon_dev)
>  		}
>  	}
>  
> -	mutex_lock(&root->objectid_mutex);
>  	ret = btrfs_init_root_free_objectid(root);
> -	if (ret) {
> -		mutex_unlock(&root->objectid_mutex);
> +	if (ret)
>  		return ret;
> -	}
>  
> -	ASSERT(root->free_objectid <= BTRFS_LAST_FREE_OBJECTID);
> -
> -	mutex_unlock(&root->objectid_mutex);
> +	ASSERT((u64)atomic64_read(&root->free_objectid) <=
> +		BTRFS_LAST_FREE_OBJECTID);

I'm not sure if this was mentioned in the previous discussion. This
assert will always be true: the atomic is a signed 64-bit value that is
cast to unsigned, and under normal circumstances the atomic will not be
negative, so the cast won't turn it into a huge unsigned number.
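
For illustration, a minimal user-space sketch of the cast in question,
assuming BTRFS_LAST_FREE_OBJECTID is -256ULL as in the current UAPI
header (the macro name below is a local stand-in):

	#include <assert.h>
	#include <stdint.h>

	#define LAST_FREE_OBJECTID ((uint64_t)-256)

	int main(void)
	{
		int64_t v = 256; /* any small non-negative counter value */
		assert((uint64_t)v <= LAST_FREE_OBJECTID); /* passes */

		v = -1; /* only s64 values in -255..-1 can trip the assert */
		assert((uint64_t)v <= LAST_FREE_OBJECTID); /* this aborts */
		return 0;
	}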

What we need is an unsigned atomic type. The atomic64_t is a natural
choice and it probably has enough margin for the simple increment
allocation. But still I think we should make it a "u64".

The simplest implementation is to use a spin lock around the updates;
seqlock_t is also possible, but it effectively uses a spin lock too and
we don't need the read-side protection.

Sorry, it's another change right before the code freeze, so we may want
to postpone it to let us think it through. I'll prototype the idea and
do some tests; we can still target 6.15.
David Sterba March 5, 2025, 10:33 a.m. UTC | #2
On Tue, Mar 04, 2025 at 10:52:56AM +0100, David Sterba wrote:
> > -	ASSERT(root->free_objectid <= BTRFS_LAST_FREE_OBJECTID);
> > -
> > -	mutex_unlock(&root->objectid_mutex);
> > +	ASSERT((u64)atomic64_read(&root->free_objectid) <=
> > +		BTRFS_LAST_FREE_OBJECTID);
> 
> I'm not sure if this was mentioned in the previous discussion. This
> assert will always be true: the atomic is a signed 64-bit value that is
> cast to unsigned, and under normal circumstances the atomic will not be
> negative, so the cast won't turn it into a huge unsigned number.
> 
> What we need is an unsigned atomic type. The atomic64_t is a natural
> choice and it probably has enough margin for the simple increment
> allocation. But still I think we should make it a "u64".
> 
> The simplest implementation is to use a spin lock around the updates;
> seqlock_t is also possible, but it effectively uses a spin lock too and
> we don't need the read-side protection.
> 
> Sorry, it's another change right before the code freeze, so we may want
> to postpone it to let us think it through. I'll prototype the idea and
> do some tests; we can still target 6.15.

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 075a06db43a1..2620403fd4c9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -179,8 +179,6 @@ struct btrfs_root {
 	struct btrfs_fs_info *fs_info;
 	struct extent_io_tree dirty_log_pages;
 
-	struct mutex objectid_mutex;
-
 	spinlock_t accounting_lock;
 	struct btrfs_block_rsv *block_rsv;
 
@@ -214,7 +212,9 @@ struct btrfs_root {
 
 	u64 last_trans;
 
+	/* Locking is done only when incremented, the read side relies on u64. */
 	u64 free_objectid;
+	spinlock_t objectid_lock;
 
 	struct btrfs_key defrag_progress;
 	struct btrfs_key defrag_max;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0cb559448933..e5569cec0547 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -676,7 +676,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	spin_lock_init(&root->ordered_extent_lock);
 	spin_lock_init(&root->accounting_lock);
 	spin_lock_init(&root->qgroup_meta_rsv_lock);
-	mutex_init(&root->objectid_mutex);
+	spin_lock_init(&root->objectid_lock);
 	mutex_init(&root->log_mutex);
 	mutex_init(&root->ordered_extent_mutex);
 	mutex_init(&root->delalloc_mutex);
@@ -1132,17 +1132,12 @@ static int btrfs_init_fs_root(struct btrfs_root *root, dev_t anon_dev)
 		}
 	}
 
-	mutex_lock(&root->objectid_mutex);
 	ret = btrfs_init_root_free_objectid(root);
-	if (ret) {
-		mutex_unlock(&root->objectid_mutex);
+	if (ret)
 		return ret;
-	}
 
 	ASSERT(root->free_objectid <= BTRFS_LAST_FREE_OBJECTID);
 
-	mutex_unlock(&root->objectid_mutex);
-
 	return 0;
 }
 
@@ -2725,8 +2720,8 @@ static int __cold init_tree_roots(struct btrfs_fs_info *fs_info)
 		}
 
 		/*
-		 * No need to hold btrfs_root::objectid_mutex since the fs
-		 * hasn't been fully initialised and we are the only user
+		 * No need to lock btrfs_root::free_objectid since the fs
+		 * hasn't been fully initialised and we are the only user.
 		 */
 		ret = btrfs_init_root_free_objectid(tree_root);
 		if (ret < 0) {
@@ -4930,20 +4925,21 @@ int btrfs_init_root_free_objectid(struct btrfs_root *root)
 
 int btrfs_get_free_objectid(struct btrfs_root *root, u64 *objectid)
 {
-	int ret;
-	mutex_lock(&root->objectid_mutex);
+	u64 val;
 
-	if (unlikely(root->free_objectid >= BTRFS_LAST_FREE_OBJECTID)) {
+	spin_lock(&root->objectid_lock);
+	val = root->free_objectid;
+	if (unlikely(val >= BTRFS_LAST_FREE_OBJECTID)) {
+		spin_unlock(&root->objectid_lock);
 		btrfs_warn(root->fs_info,
 			   "the objectid of root %llu reaches its highest value",
 			   btrfs_root_id(root));
-		ret = -ENOSPC;
-		goto out;
+		return -ENOSPC;
 	}
+	root->free_objectid = val + 1;
+	spin_unlock(&root->objectid_lock);
 
-	*objectid = root->free_objectid++;
-	ret = 0;
-out:
-	mutex_unlock(&root->objectid_mutex);
-	return ret;
+	*objectid = val;
+
+	return 0;
 }
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index d6fa36674270..1ce84bf59a09 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -472,9 +472,9 @@ int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info)
 			 *
 			 * Ensure that we skip any such subvol ids.
 			 *
-			 * We don't need to lock because this is only called
-			 * during mount before we start doing things like creating
-			 * subvolumes.
+			 * We don't need to worry about updates to free_objectid,
+			 * this is only called during mount before we start
+			 * doing things like creating subvolumes.
 			 */
 			if (is_fstree(qgroup->qgroupid) &&
 			    qgroup->qgroupid > tree_root->free_objectid)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index fc5c761181eb..97e608b251fa 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -7325,9 +7325,6 @@ int btrfs_recover_log_trees(struct btrfs_root *log_root_tree)
 			 * We have just replayed everything, and the highest
 			 * objectid of fs roots probably has changed in case
 			 * some inode_item's got replayed.
-			 *
-			 * root->objectid_mutex is not acquired as log replay
-			 * could only happen during mount.
 			 */
 			ret = btrfs_init_root_free_objectid(root);
 			if (ret)

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 075a06db43a1..23adbce4c516 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -179,8 +179,6 @@  struct btrfs_root {
 	struct btrfs_fs_info *fs_info;
 	struct extent_io_tree dirty_log_pages;
 
-	struct mutex objectid_mutex;
-
 	spinlock_t accounting_lock;
 	struct btrfs_block_rsv *block_rsv;
 
@@ -214,7 +212,7 @@  struct btrfs_root {
 
 	u64 last_trans;
 
-	u64 free_objectid;
+	atomic64_t free_objectid;
 
 	struct btrfs_key defrag_progress;
 	struct btrfs_key defrag_max;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 52c2335ef62f..1597a8945f4c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -657,7 +657,7 @@  static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	RB_CLEAR_NODE(&root->rb_node);
 
 	btrfs_set_root_last_trans(root, 0);
-	root->free_objectid = 0;
+	atomic64_set(&root->free_objectid, 0);
 	root->nr_delalloc_inodes = 0;
 	root->nr_ordered_extents = 0;
 	xa_init(&root->inodes);
@@ -676,7 +676,6 @@  static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	spin_lock_init(&root->ordered_extent_lock);
 	spin_lock_init(&root->accounting_lock);
 	spin_lock_init(&root->qgroup_meta_rsv_lock);
-	mutex_init(&root->objectid_mutex);
 	mutex_init(&root->log_mutex);
 	mutex_init(&root->ordered_extent_mutex);
 	mutex_init(&root->delalloc_mutex);
@@ -1132,16 +1131,12 @@  static int btrfs_init_fs_root(struct btrfs_root *root, dev_t anon_dev)
 		}
 	}
 
-	mutex_lock(&root->objectid_mutex);
 	ret = btrfs_init_root_free_objectid(root);
-	if (ret) {
-		mutex_unlock(&root->objectid_mutex);
+	if (ret)
 		return ret;
-	}
 
-	ASSERT(root->free_objectid <= BTRFS_LAST_FREE_OBJECTID);
-
-	mutex_unlock(&root->objectid_mutex);
+	ASSERT((u64)atomic64_read(&root->free_objectid) <=
+		BTRFS_LAST_FREE_OBJECTID);
 
 	return 0;
 }
@@ -2725,8 +2720,9 @@  static int __cold init_tree_roots(struct btrfs_fs_info *fs_info)
 		}
 
 		/*
-		 * No need to hold btrfs_root::objectid_mutex since the fs
-		 * hasn't been fully initialised and we are the only user
+		 * No need to worry about atomic ordering of btrfs_root::free_objectid
+		 * since the fs hasn't been fully initialised and we are the
+		 * only user
 		 */
 		ret = btrfs_init_root_free_objectid(tree_root);
 		if (ret < 0) {
@@ -2734,7 +2730,8 @@  static int __cold init_tree_roots(struct btrfs_fs_info *fs_info)
 			continue;
 		}
 
-		ASSERT(tree_root->free_objectid <= BTRFS_LAST_FREE_OBJECTID);
+		ASSERT((u64)atomic64_read(&tree_root->free_objectid) <=
+			BTRFS_LAST_FREE_OBJECTID);
 
 		ret = btrfs_read_roots(fs_info);
 		if (ret < 0) {
@@ -4924,10 +4921,11 @@  int btrfs_init_root_free_objectid(struct btrfs_root *root)
 		slot = path->slots[0] - 1;
 		l = path->nodes[0];
 		btrfs_item_key_to_cpu(l, &found_key, slot);
-		root->free_objectid = max_t(u64, found_key.objectid + 1,
-					    BTRFS_FIRST_FREE_OBJECTID);
+		atomic64_set(&root->free_objectid,
+			     max_t(u64, found_key.objectid + 1,
+				   BTRFS_FIRST_FREE_OBJECTID));
 	} else {
-		root->free_objectid = BTRFS_FIRST_FREE_OBJECTID;
+		atomic64_set(&root->free_objectid, BTRFS_FIRST_FREE_OBJECTID);
 	}
 
 	return 0;
@@ -4935,20 +4933,15 @@  int btrfs_init_root_free_objectid(struct btrfs_root *root)
 
 int btrfs_get_free_objectid(struct btrfs_root *root, u64 *objectid)
 {
-	int ret;
-	mutex_lock(&root->objectid_mutex);
+	u64 val = atomic64_inc_return(&root->free_objectid) - 1;
 
-	if (unlikely(root->free_objectid >= BTRFS_LAST_FREE_OBJECTID)) {
+	if (unlikely(val >= BTRFS_LAST_FREE_OBJECTID)) {
 		btrfs_warn(root->fs_info,
 			   "the objectid of root %llu reaches its highest value",
 			   btrfs_root_id(root));
-		ret = -ENOSPC;
-		goto out;
+		return -ENOSPC;
 	}
 
-	*objectid = root->free_objectid++;
-	ret = 0;
-out:
-	mutex_unlock(&root->objectid_mutex);
-	return ret;
+	*objectid = val;
+	return 0;
 }
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index d6fa36674270..ebc08031b21d 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -472,18 +472,19 @@  int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info)
 			 *
 			 * Ensure that we skip any such subvol ids.
 			 *
-			 * We don't need to lock because this is only called
-			 * during mount before we start doing things like creating
-			 * subvolumes.
+			 * We don't need to worry about atomic ordering because
+			 * this is only called during mount before we start
+			 * doing things like creating subvolumes.
 			 */
 			if (is_fstree(qgroup->qgroupid) &&
-			    qgroup->qgroupid > tree_root->free_objectid)
+			    qgroup->qgroupid > (u64)atomic64_read(&tree_root->free_objectid))
 				/*
 				 * Don't need to check against BTRFS_LAST_FREE_OBJECTID,
 				 * as it will get checked on the next call to
 				 * btrfs_get_free_objectid.
 				 */
-				tree_root->free_objectid = qgroup->qgroupid + 1;
+				atomic64_set(&tree_root->free_objectid,
+					     qgroup->qgroupid + 1);
 		}
 		ret = btrfs_sysfs_add_one_qgroup(fs_info, qgroup);
 		if (ret < 0)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index fc5c761181eb..97e608b251fa 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -7325,9 +7325,6 @@  int btrfs_recover_log_trees(struct btrfs_root *log_root_tree)
 			 * We have just replayed everything, and the highest
 			 * objectid of fs roots probably has changed in case
 			 * some inode_item's got replayed.
-			 *
-			 * root->objectid_mutex is not acquired as log replay
-			 * could only happen during mount.
 			 */
 			ret = btrfs_init_root_free_objectid(root);
 			if (ret)