diff mbox series

btrfs: use atomic64_t for free_objectid

Message ID 20250122122459.1148668-1-maharmstone@fb.com (mailing list archive)
State New
Headers show
Series btrfs: use atomic64_t for free_objectid | expand

Commit Message

Mark Harmstone Jan. 22, 2025, 12:21 p.m. UTC
Currently btrfs_get_free_objectid() uses a mutex to protect
free_objectid; I'm guessing this was because of the inode cache that we
used to have. The inode cache is no more, so simplify things by
replacing it with an atomic.

There's no issues with ordering: free_objectid gets set to an initial
value, then calls to btrfs_get_free_objectid() return a monotonically
increasing value.

This change means that btrfs_get_free_objectid() will no longer
potentially sleep, which was a blocker for adding a non-blocking mode
for inode and subvol creation.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
 fs/btrfs/ctree.h    |  4 +---
 fs/btrfs/disk-io.c  | 43 ++++++++++++++++++-------------------------
 fs/btrfs/qgroup.c   | 11 ++++++-----
 fs/btrfs/tree-log.c |  3 ---
 4 files changed, 25 insertions(+), 36 deletions(-)

Comments

Filipe Manana Jan. 22, 2025, 12:39 p.m. UTC | #1
On Wed, Jan 22, 2025 at 12:25 PM Mark Harmstone <maharmstone@fb.com> wrote:
>
> Currently btrfs_get_free_objectid() uses a mutex to protect
> free_objectid; I'm guessing this was because of the inode cache that we
> used to have. The inode cache is no more, so simplify things by
> replacing it with an atomic.
>
> There's no issues with ordering: free_objectid gets set to an initial
> value, then calls to btrfs_get_free_objectid() return a monotonically
> increasing value.
>
> This change means that btrfs_get_free_objectid() will no longer
> potentially sleep, which was a blocker for adding a non-blocking mode
> for inode and subvol creation.
>
> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> ---
>  fs/btrfs/ctree.h    |  4 +---
>  fs/btrfs/disk-io.c  | 43 ++++++++++++++++++-------------------------
>  fs/btrfs/qgroup.c   | 11 ++++++-----
>  fs/btrfs/tree-log.c |  3 ---
>  4 files changed, 25 insertions(+), 36 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 1096a80a64e7..04175698525b 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -179,8 +179,6 @@ struct btrfs_root {
>         struct btrfs_fs_info *fs_info;
>         struct extent_io_tree dirty_log_pages;
>
> -       struct mutex objectid_mutex;
> -
>         spinlock_t accounting_lock;
>         struct btrfs_block_rsv *block_rsv;
>
> @@ -214,7 +212,7 @@ struct btrfs_root {
>
>         u64 last_trans;
>
> -       u64 free_objectid;
> +       atomic64_t free_objectid;
>
>         struct btrfs_key defrag_progress;
>         struct btrfs_key defrag_max;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index f09db62e61a1..0543d9c3f8c0 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -659,7 +659,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
>         RB_CLEAR_NODE(&root->rb_node);
>
>         btrfs_set_root_last_trans(root, 0);
> -       root->free_objectid = 0;
> +       atomic64_set(&root->free_objectid, 0);
>         root->nr_delalloc_inodes = 0;
>         root->nr_ordered_extents = 0;
>         xa_init(&root->inodes);
> @@ -678,7 +678,6 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
>         spin_lock_init(&root->ordered_extent_lock);
>         spin_lock_init(&root->accounting_lock);
>         spin_lock_init(&root->qgroup_meta_rsv_lock);
> -       mutex_init(&root->objectid_mutex);
>         mutex_init(&root->log_mutex);
>         mutex_init(&root->ordered_extent_mutex);
>         mutex_init(&root->delalloc_mutex);
> @@ -1133,16 +1132,12 @@ static int btrfs_init_fs_root(struct btrfs_root *root, dev_t anon_dev)
>                 }
>         }
>
> -       mutex_lock(&root->objectid_mutex);
>         ret = btrfs_init_root_free_objectid(root);
> -       if (ret) {
> -               mutex_unlock(&root->objectid_mutex);
> +       if (ret)
>                 goto fail;
> -       }
>
> -       ASSERT(root->free_objectid <= BTRFS_LAST_FREE_OBJECTID);
> -
> -       mutex_unlock(&root->objectid_mutex);
> +       ASSERT((u64)atomic64_read(&root->free_objectid) <=
> +               BTRFS_LAST_FREE_OBJECTID);
>
>         return 0;
>  fail:
> @@ -2730,8 +2725,9 @@ static int __cold init_tree_roots(struct btrfs_fs_info *fs_info)
>                 }
>
>                 /*
> -                * No need to hold btrfs_root::objectid_mutex since the fs
> -                * hasn't been fully initialised and we are the only user
> +                * No need to worry about atomic ordering of btrfs_root::free_objectid
> +                * since the fs hasn't been fully initialised and we are the
> +                * only user
>                  */
>                 ret = btrfs_init_root_free_objectid(tree_root);
>                 if (ret < 0) {
> @@ -2739,7 +2735,8 @@ static int __cold init_tree_roots(struct btrfs_fs_info *fs_info)
>                         continue;
>                 }
>
> -               ASSERT(tree_root->free_objectid <= BTRFS_LAST_FREE_OBJECTID);
> +               ASSERT((u64)atomic64_read(&tree_root->free_objectid) <=
> +                       BTRFS_LAST_FREE_OBJECTID);
>
>                 ret = btrfs_read_roots(fs_info);
>                 if (ret < 0) {
> @@ -4931,10 +4928,11 @@ int btrfs_init_root_free_objectid(struct btrfs_root *root)
>                 slot = path->slots[0] - 1;
>                 l = path->nodes[0];
>                 btrfs_item_key_to_cpu(l, &found_key, slot);
> -               root->free_objectid = max_t(u64, found_key.objectid + 1,
> -                                           BTRFS_FIRST_FREE_OBJECTID);
> +               atomic64_set(&root->free_objectid,
> +                            max_t(u64, found_key.objectid + 1,
> +                                  BTRFS_FIRST_FREE_OBJECTID));
>         } else {
> -               root->free_objectid = BTRFS_FIRST_FREE_OBJECTID;
> +               atomic64_set(&root->free_objectid, BTRFS_FIRST_FREE_OBJECTID);
>         }
>         ret = 0;
>  error:
> @@ -4944,20 +4942,15 @@ int btrfs_init_root_free_objectid(struct btrfs_root *root)
>
>  int btrfs_get_free_objectid(struct btrfs_root *root, u64 *objectid)
>  {
> -       int ret;
> -       mutex_lock(&root->objectid_mutex);
> +       u64 val = atomic64_inc_return(&root->free_objectid) - 1;
>
> -       if (unlikely(root->free_objectid >= BTRFS_LAST_FREE_OBJECTID)) {
> +       if (unlikely(val >= BTRFS_LAST_FREE_OBJECTID)) {
>                 btrfs_warn(root->fs_info,
>                            "the objectid of root %llu reaches its highest value",
>                            btrfs_root_id(root));
> -               ret = -ENOSPC;
> -               goto out;
> +               return -ENOSPC;
>         }
>
> -       *objectid = root->free_objectid++;
> -       ret = 0;

So this gives different semantics now.

Before we increment free_objectid only if it's less than
BTRFS_LAST_FREE_OBJECTID, so once we reach that value we can't assign
more object IDs and must return -ENOSPC.

But now we always increment and then do the check, so after some calls
to btrfs_get_free_objectid() we wrap around the counter due to
overflow and eventually allow reusing already assigned object IDs.

I'm afraid the lock is still needed because of that. To make it more
lightweight maybe switch the mutex to a spinlock.

Thanks.

> -out:
> -       mutex_unlock(&root->objectid_mutex);
> -       return ret;
> +       *objectid = val;
> +       return 0;
>  }
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index aaf16019d829..b41e06d5d2fb 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -472,18 +472,19 @@ int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info)
>                          *
>                          * Ensure that we skip any such subvol ids.
>                          *
> -                        * We don't need to lock because this is only called
> -                        * during mount before we start doing things like creating
> -                        * subvolumes.
> +                        * We don't need to worry about atomic ordering because
> +                        * this is only called during mount before we start
> +                        * doing things like creating subvolumes.
>                          */
>                         if (is_fstree(qgroup->qgroupid) &&
> -                           qgroup->qgroupid > tree_root->free_objectid)
> +                           qgroup->qgroupid > (u64)atomic64_read(&tree_root->free_objectid))
>                                 /*
>                                  * Don't need to check against BTRFS_LAST_FREE_OBJECTID,
>                                  * as it will get checked on the next call to
>                                  * btrfs_get_free_objectid.
>                                  */
> -                               tree_root->free_objectid = qgroup->qgroupid + 1;
> +                               atomic64_set(&tree_root->free_objectid,
> +                                            qgroup->qgroupid + 1);
>                 }
>                 ret = btrfs_sysfs_add_one_qgroup(fs_info, qgroup);
>                 if (ret < 0)
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 955d1677e865..9d19528fee17 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -7325,9 +7325,6 @@ int btrfs_recover_log_trees(struct btrfs_root *log_root_tree)
>                          * We have just replayed everything, and the highest
>                          * objectid of fs roots probably has changed in case
>                          * some inode_item's got replayed.
> -                        *
> -                        * root->objectid_mutex is not acquired as log replay
> -                        * could only happen during mount.
>                          */
>                         ret = btrfs_init_root_free_objectid(root);
>                         if (ret)
> --
> 2.45.2
>
>
Mark Harmstone Jan. 22, 2025, 12:59 p.m. UTC | #2
On 22/1/25 12:39, Filipe Manana wrote:
> On Wed, Jan 22, 2025 at 12:25 PM Mark Harmstone <maharmstone@fb.com> wrote:
>> @@ -4944,20 +4942,15 @@ int btrfs_init_root_free_objectid(struct btrfs_root *root)
>>
>>   int btrfs_get_free_objectid(struct btrfs_root *root, u64 *objectid)
>>   {
>> -       int ret;
>> -       mutex_lock(&root->objectid_mutex);
>> +       u64 val = atomic64_inc_return(&root->free_objectid) - 1;
>>
>> -       if (unlikely(root->free_objectid >= BTRFS_LAST_FREE_OBJECTID)) {
>> +       if (unlikely(val >= BTRFS_LAST_FREE_OBJECTID)) {
>>                  btrfs_warn(root->fs_info,
>>                             "the objectid of root %llu reaches its highest value",
>>                             btrfs_root_id(root));
>> -               ret = -ENOSPC;
>> -               goto out;
>> +               return -ENOSPC;
>>          }
>>
>> -       *objectid = root->free_objectid++;
>> -       ret = 0;
> 
> So this gives different semantics now.
> 
> Before we increment free_objectid only if it's less than
> BTRFS_LAST_FREE_OBJECTID, so once we reach that value we can't assign
> more object IDs and must return -ENOSPC.
> 
> But now we always increment and then do the check, so after some calls
> to btrfs_get_free_objectid() we wrap around the counter due to
> overflow and eventually allow reusing already assigned object IDs.
> 
> I'm afraid the lock is still needed because of that. To make it more
> lightweight maybe switch the mutex to a spinlock.
> 
> Thanks.

Thanks Filipe. Do we even need the check, really? Even a denial of 
service attack wouldn't be able to practically call the function ~2^64 
times. And there's no way to create an inode with an arbitrary number, 
short of manually hex-editing the disk.

Mark
Filipe Manana Jan. 22, 2025, 1:04 p.m. UTC | #3
On Wed, Jan 22, 2025 at 12:59 PM Mark Harmstone <maharmstone@meta.com> wrote:
>
> On 22/1/25 12:39, Filipe Manana wrote:
> > On Wed, Jan 22, 2025 at 12:25 PM Mark Harmstone <maharmstone@fb.com> wrote:
> >> @@ -4944,20 +4942,15 @@ int btrfs_init_root_free_objectid(struct btrfs_root *root)
> >>
> >>   int btrfs_get_free_objectid(struct btrfs_root *root, u64 *objectid)
> >>   {
> >> -       int ret;
> >> -       mutex_lock(&root->objectid_mutex);
> >> +       u64 val = atomic64_inc_return(&root->free_objectid) - 1;
> >>
> >> -       if (unlikely(root->free_objectid >= BTRFS_LAST_FREE_OBJECTID)) {
> >> +       if (unlikely(val >= BTRFS_LAST_FREE_OBJECTID)) {
> >>                  btrfs_warn(root->fs_info,
> >>                             "the objectid of root %llu reaches its highest value",
> >>                             btrfs_root_id(root));
> >> -               ret = -ENOSPC;
> >> -               goto out;
> >> +               return -ENOSPC;
> >>          }
> >>
> >> -       *objectid = root->free_objectid++;
> >> -       ret = 0;
> >
> > So this gives different semantics now.
> >
> > Before we increment free_objectid only if it's less than
> > BTRFS_LAST_FREE_OBJECTID, so once we reach that value we can't assign
> > more object IDs and must return -ENOSPC.
> >
> > But now we always increment and then do the check, so after some calls
> > to btrfs_get_free_objectid() we wrap around the counter due to
> > overflow and eventually allow reusing already assigned object IDs.
> >
> > I'm afraid the lock is still needed because of that. To make it more
> > lightweight maybe switch the mutex to a spinlock.
> >
> > Thanks.
>
> Thanks Filipe. Do we even need the check, really? Even a denial of
> service attack wouldn't be able to practically call the function ~2^64
> times. And there's no way to create an inode with an arbitrary number,
> short of manually hex-editing the disk.

Well, we do such limit checks everywhere; why would we ignore them only here?
Even if they are very unlikely to be hit in practice, if they happen,
we can get into serious trouble.

So I don't think it's wise to ignore the limit.

Thanks.

>
> Mark
Daniel Vacek Jan. 22, 2025, 1:51 p.m. UTC | #4
On Wed, 22 Jan 2025 at 13:40, Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Wed, Jan 22, 2025 at 12:25 PM Mark Harmstone <maharmstone@fb.com> wrote:
> >
> > Currently btrfs_get_free_objectid() uses a mutex to protect
> > free_objectid; I'm guessing this was because of the inode cache that we
> > used to have. The inode cache is no more, so simplify things by
> > replacing it with an atomic.
> >
> > There's no issues with ordering: free_objectid gets set to an initial
> > value, then calls to btrfs_get_free_objectid() return a monotonically
> > increasing value.
> >
> > This change means that btrfs_get_free_objectid() will no longer
> > potentially sleep, which was a blocker for adding a non-blocking mode
> > for inode and subvol creation.
> >
> > Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> > ---
> >  fs/btrfs/ctree.h    |  4 +---
> >  fs/btrfs/disk-io.c  | 43 ++++++++++++++++++-------------------------
> >  fs/btrfs/qgroup.c   | 11 ++++++-----
> >  fs/btrfs/tree-log.c |  3 ---
> >  4 files changed, 25 insertions(+), 36 deletions(-)
> >
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 1096a80a64e7..04175698525b 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -179,8 +179,6 @@ struct btrfs_root {
> >         struct btrfs_fs_info *fs_info;
> >         struct extent_io_tree dirty_log_pages;
> >
> > -       struct mutex objectid_mutex;
> > -
> >         spinlock_t accounting_lock;
> >         struct btrfs_block_rsv *block_rsv;
> >
> > @@ -214,7 +212,7 @@ struct btrfs_root {
> >
> >         u64 last_trans;
> >
> > -       u64 free_objectid;
> > +       atomic64_t free_objectid;
> >
> >         struct btrfs_key defrag_progress;
> >         struct btrfs_key defrag_max;
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index f09db62e61a1..0543d9c3f8c0 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -659,7 +659,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
> >         RB_CLEAR_NODE(&root->rb_node);
> >
> >         btrfs_set_root_last_trans(root, 0);
> > -       root->free_objectid = 0;
> > +       atomic64_set(&root->free_objectid, 0);
> >         root->nr_delalloc_inodes = 0;
> >         root->nr_ordered_extents = 0;
> >         xa_init(&root->inodes);
> > @@ -678,7 +678,6 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
> >         spin_lock_init(&root->ordered_extent_lock);
> >         spin_lock_init(&root->accounting_lock);
> >         spin_lock_init(&root->qgroup_meta_rsv_lock);
> > -       mutex_init(&root->objectid_mutex);
> >         mutex_init(&root->log_mutex);
> >         mutex_init(&root->ordered_extent_mutex);
> >         mutex_init(&root->delalloc_mutex);
> > @@ -1133,16 +1132,12 @@ static int btrfs_init_fs_root(struct btrfs_root *root, dev_t anon_dev)
> >                 }
> >         }
> >
> > -       mutex_lock(&root->objectid_mutex);
> >         ret = btrfs_init_root_free_objectid(root);
> > -       if (ret) {
> > -               mutex_unlock(&root->objectid_mutex);
> > +       if (ret)
> >                 goto fail;
> > -       }
> >
> > -       ASSERT(root->free_objectid <= BTRFS_LAST_FREE_OBJECTID);
> > -
> > -       mutex_unlock(&root->objectid_mutex);
> > +       ASSERT((u64)atomic64_read(&root->free_objectid) <=
> > +               BTRFS_LAST_FREE_OBJECTID);
> >
> >         return 0;
> >  fail:
> > @@ -2730,8 +2725,9 @@ static int __cold init_tree_roots(struct btrfs_fs_info *fs_info)
> >                 }
> >
> >                 /*
> > -                * No need to hold btrfs_root::objectid_mutex since the fs
> > -                * hasn't been fully initialised and we are the only user
> > +                * No need to worry about atomic ordering of btrfs_root::free_objectid
> > +                * since the fs hasn't been fully initialised and we are the
> > +                * only user
> >                  */
> >                 ret = btrfs_init_root_free_objectid(tree_root);
> >                 if (ret < 0) {
> > @@ -2739,7 +2735,8 @@ static int __cold init_tree_roots(struct btrfs_fs_info *fs_info)
> >                         continue;
> >                 }
> >
> > -               ASSERT(tree_root->free_objectid <= BTRFS_LAST_FREE_OBJECTID);
> > +               ASSERT((u64)atomic64_read(&tree_root->free_objectid) <=
> > +                       BTRFS_LAST_FREE_OBJECTID);
> >
> >                 ret = btrfs_read_roots(fs_info);
> >                 if (ret < 0) {
> > @@ -4931,10 +4928,11 @@ int btrfs_init_root_free_objectid(struct btrfs_root *root)
> >                 slot = path->slots[0] - 1;
> >                 l = path->nodes[0];
> >                 btrfs_item_key_to_cpu(l, &found_key, slot);
> > -               root->free_objectid = max_t(u64, found_key.objectid + 1,
> > -                                           BTRFS_FIRST_FREE_OBJECTID);
> > +               atomic64_set(&root->free_objectid,
> > +                            max_t(u64, found_key.objectid + 1,
> > +                                  BTRFS_FIRST_FREE_OBJECTID));
> >         } else {
> > -               root->free_objectid = BTRFS_FIRST_FREE_OBJECTID;
> > +               atomic64_set(&root->free_objectid, BTRFS_FIRST_FREE_OBJECTID);
> >         }
> >         ret = 0;
> >  error:
> > @@ -4944,20 +4942,15 @@ int btrfs_init_root_free_objectid(struct btrfs_root *root)
> >
> >  int btrfs_get_free_objectid(struct btrfs_root *root, u64 *objectid)
> >  {
> > -       int ret;
> > -       mutex_lock(&root->objectid_mutex);
> > +       u64 val = atomic64_inc_return(&root->free_objectid) - 1;
> >
> > -       if (unlikely(root->free_objectid >= BTRFS_LAST_FREE_OBJECTID)) {
> > +       if (unlikely(val >= BTRFS_LAST_FREE_OBJECTID)) {
> >                 btrfs_warn(root->fs_info,
> >                            "the objectid of root %llu reaches its highest value",
> >                            btrfs_root_id(root));
> > -               ret = -ENOSPC;
> > -               goto out;
> > +               return -ENOSPC;
> >         }
> >
> > -       *objectid = root->free_objectid++;
> > -       ret = 0;
>
> So this gives different semantics now.
>
> Before we increment free_objectid only if it's less than
> BTRFS_LAST_FREE_OBJECTID, so once we reach that value we can't assign
> more object IDs and must return -ENOSPC.
>
> But now we always increment and then do the check, so after some calls
> to btrfs_get_free_objectid() we wrap around the counter due to
> overflow and eventually allow reusing already assigned object IDs.
>
> I'm afraid the lock is still needed because of that. To make it more
> lightweight maybe switch the mutex to a spinlock.

How about this?

```
retry:  val = atomic64_read(&root->free_objectid);
        ....;
        if (atomic64_cmpxchg(&root->free_objectid, val, val+1) != val)
                goto retry;
        *objectid = val;
        return 0;
```

> Thanks.
>
> > -out:
> > -       mutex_unlock(&root->objectid_mutex);
> > -       return ret;
> > +       *objectid = val;
> > +       return 0;
> >  }
> > diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> > index aaf16019d829..b41e06d5d2fb 100644
> > --- a/fs/btrfs/qgroup.c
> > +++ b/fs/btrfs/qgroup.c
> > @@ -472,18 +472,19 @@ int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info)
> >                          *
> >                          * Ensure that we skip any such subvol ids.
> >                          *
> > -                        * We don't need to lock because this is only called
> > -                        * during mount before we start doing things like creating
> > -                        * subvolumes.
> > +                        * We don't need to worry about atomic ordering because
> > +                        * this is only called during mount before we start
> > +                        * doing things like creating subvolumes.
> >                          */
> >                         if (is_fstree(qgroup->qgroupid) &&
> > -                           qgroup->qgroupid > tree_root->free_objectid)
> > +                           qgroup->qgroupid > (u64)atomic64_read(&tree_root->free_objectid))
> >                                 /*
> >                                  * Don't need to check against BTRFS_LAST_FREE_OBJECTID,
> >                                  * as it will get checked on the next call to
> >                                  * btrfs_get_free_objectid.
> >                                  */
> > -                               tree_root->free_objectid = qgroup->qgroupid + 1;
> > +                               atomic64_set(&tree_root->free_objectid,
> > +                                            qgroup->qgroupid + 1);
> >                 }
> >                 ret = btrfs_sysfs_add_one_qgroup(fs_info, qgroup);
> >                 if (ret < 0)
> > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > index 955d1677e865..9d19528fee17 100644
> > --- a/fs/btrfs/tree-log.c
> > +++ b/fs/btrfs/tree-log.c
> > @@ -7325,9 +7325,6 @@ int btrfs_recover_log_trees(struct btrfs_root *log_root_tree)
> >                          * We have just replayed everything, and the highest
> >                          * objectid of fs roots probably has changed in case
> >                          * some inode_item's got replayed.
> > -                        *
> > -                        * root->objectid_mutex is not acquired as log replay
> > -                        * could only happen during mount.
> >                          */
> >                         ret = btrfs_init_root_free_objectid(root);
> >                         if (ret)
> > --
> > 2.45.2
> >
> >
>
David Sterba Jan. 22, 2025, 3:07 p.m. UTC | #5
On Wed, Jan 22, 2025 at 12:21:28PM +0000, Mark Harmstone wrote:
> Currently btrfs_get_free_objectid() uses a mutex to protect
> free_objectid; I'm guessing this was because of the inode cache that we
> used to have. The inode cache is no more, so simplify things by
> replacing it with an atomic.
> 
> There's no issues with ordering: free_objectid gets set to an initial
> value, then calls to btrfs_get_free_objectid() return a monotonically
> increasing value.
> 
> This change means that btrfs_get_free_objectid() will no longer
> potentially sleep, which was a blocker for adding a non-blocking mode
> for inode and subvol creation.
> 
> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> ---
>  fs/btrfs/ctree.h    |  4 +---
>  fs/btrfs/disk-io.c  | 43 ++++++++++++++++++-------------------------
>  fs/btrfs/qgroup.c   | 11 ++++++-----
>  fs/btrfs/tree-log.c |  3 ---
>  4 files changed, 25 insertions(+), 36 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 1096a80a64e7..04175698525b 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -179,8 +179,6 @@ struct btrfs_root {
>  	struct btrfs_fs_info *fs_info;
>  	struct extent_io_tree dirty_log_pages;
>  
> -	struct mutex objectid_mutex;
> -
>  	spinlock_t accounting_lock;
>  	struct btrfs_block_rsv *block_rsv;
>  
> @@ -214,7 +212,7 @@ struct btrfs_root {
>  
>  	u64 last_trans;
>  
> -	u64 free_objectid;
> +	atomic64_t free_objectid;

Besides the other things pointed out, this also changes the type from
u64 to s64 (or requires casts), so we should keep u64, as that is what
the on-disk format defines.
Mark Harmstone Jan. 22, 2025, 6:19 p.m. UTC | #6
On 22/1/25 15:07, David Sterba wrote:
> > 
> On Wed, Jan 22, 2025 at 12:21:28PM +0000, Mark Harmstone wrote:
>> Currently btrfs_get_free_objectid() uses a mutex to protect
>> free_objectid; I'm guessing this was because of the inode cache that we
>> used to have. The inode cache is no more, so simplify things by
>> replacing it with an atomic.
>>
>> There's no issues with ordering: free_objectid gets set to an initial
>> value, then calls to btrfs_get_free_objectid() return a monotonically
>> increasing value.
>>
>> This change means that btrfs_get_free_objectid() will no longer
>> potentially sleep, which was a blocker for adding a non-blocking mode
>> for inode and subvol creation.
>>
>> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
>> ---
>>   fs/btrfs/ctree.h    |  4 +---
>>   fs/btrfs/disk-io.c  | 43 ++++++++++++++++++-------------------------
>>   fs/btrfs/qgroup.c   | 11 ++++++-----
>>   fs/btrfs/tree-log.c |  3 ---
>>   4 files changed, 25 insertions(+), 36 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 1096a80a64e7..04175698525b 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -179,8 +179,6 @@ struct btrfs_root {
>>   	struct btrfs_fs_info *fs_info;
>>   	struct extent_io_tree dirty_log_pages;
>>   
>> -	struct mutex objectid_mutex;
>> -
>>   	spinlock_t accounting_lock;
>>   	struct btrfs_block_rsv *block_rsv;
>>   
>> @@ -214,7 +212,7 @@ struct btrfs_root {
>>   
>>   	u64 last_trans;
>>   
>> -	u64 free_objectid;
>> +	atomic64_t free_objectid;
> 
> Besides the other things pointed out, this also changes the type from
> u64 to s64 or requiring casts so we keep u64 as this is what the on-disk
> format defines.

It does, but there are casts to u64 every time it's read (which, asserts 
aside, is only in btrfs_get_free_objectid).
diff mbox series

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1096a80a64e7..04175698525b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -179,8 +179,6 @@  struct btrfs_root {
 	struct btrfs_fs_info *fs_info;
 	struct extent_io_tree dirty_log_pages;
 
-	struct mutex objectid_mutex;
-
 	spinlock_t accounting_lock;
 	struct btrfs_block_rsv *block_rsv;
 
@@ -214,7 +212,7 @@  struct btrfs_root {
 
 	u64 last_trans;
 
-	u64 free_objectid;
+	atomic64_t free_objectid;
 
 	struct btrfs_key defrag_progress;
 	struct btrfs_key defrag_max;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f09db62e61a1..0543d9c3f8c0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -659,7 +659,7 @@  static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	RB_CLEAR_NODE(&root->rb_node);
 
 	btrfs_set_root_last_trans(root, 0);
-	root->free_objectid = 0;
+	atomic64_set(&root->free_objectid, 0);
 	root->nr_delalloc_inodes = 0;
 	root->nr_ordered_extents = 0;
 	xa_init(&root->inodes);
@@ -678,7 +678,6 @@  static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	spin_lock_init(&root->ordered_extent_lock);
 	spin_lock_init(&root->accounting_lock);
 	spin_lock_init(&root->qgroup_meta_rsv_lock);
-	mutex_init(&root->objectid_mutex);
 	mutex_init(&root->log_mutex);
 	mutex_init(&root->ordered_extent_mutex);
 	mutex_init(&root->delalloc_mutex);
@@ -1133,16 +1132,12 @@  static int btrfs_init_fs_root(struct btrfs_root *root, dev_t anon_dev)
 		}
 	}
 
-	mutex_lock(&root->objectid_mutex);
 	ret = btrfs_init_root_free_objectid(root);
-	if (ret) {
-		mutex_unlock(&root->objectid_mutex);
+	if (ret)
 		goto fail;
-	}
 
-	ASSERT(root->free_objectid <= BTRFS_LAST_FREE_OBJECTID);
-
-	mutex_unlock(&root->objectid_mutex);
+	ASSERT((u64)atomic64_read(&root->free_objectid) <=
+		BTRFS_LAST_FREE_OBJECTID);
 
 	return 0;
 fail:
@@ -2730,8 +2725,9 @@  static int __cold init_tree_roots(struct btrfs_fs_info *fs_info)
 		}
 
 		/*
-		 * No need to hold btrfs_root::objectid_mutex since the fs
-		 * hasn't been fully initialised and we are the only user
+		 * No need to worry about atomic ordering of btrfs_root::free_objectid
+		 * since the fs hasn't been fully initialised and we are the
+		 * only user
 		 */
 		ret = btrfs_init_root_free_objectid(tree_root);
 		if (ret < 0) {
@@ -2739,7 +2735,8 @@  static int __cold init_tree_roots(struct btrfs_fs_info *fs_info)
 			continue;
 		}
 
-		ASSERT(tree_root->free_objectid <= BTRFS_LAST_FREE_OBJECTID);
+		ASSERT((u64)atomic64_read(&tree_root->free_objectid) <=
+			BTRFS_LAST_FREE_OBJECTID);
 
 		ret = btrfs_read_roots(fs_info);
 		if (ret < 0) {
@@ -4931,10 +4928,11 @@  int btrfs_init_root_free_objectid(struct btrfs_root *root)
 		slot = path->slots[0] - 1;
 		l = path->nodes[0];
 		btrfs_item_key_to_cpu(l, &found_key, slot);
-		root->free_objectid = max_t(u64, found_key.objectid + 1,
-					    BTRFS_FIRST_FREE_OBJECTID);
+		atomic64_set(&root->free_objectid,
+			     max_t(u64, found_key.objectid + 1,
+				   BTRFS_FIRST_FREE_OBJECTID));
 	} else {
-		root->free_objectid = BTRFS_FIRST_FREE_OBJECTID;
+		atomic64_set(&root->free_objectid, BTRFS_FIRST_FREE_OBJECTID);
 	}
 	ret = 0;
 error:
@@ -4944,20 +4942,15 @@  int btrfs_init_root_free_objectid(struct btrfs_root *root)
 
 int btrfs_get_free_objectid(struct btrfs_root *root, u64 *objectid)
 {
-	int ret;
-	mutex_lock(&root->objectid_mutex);
+	u64 val = atomic64_inc_return(&root->free_objectid) - 1;
 
-	if (unlikely(root->free_objectid >= BTRFS_LAST_FREE_OBJECTID)) {
+	if (unlikely(val >= BTRFS_LAST_FREE_OBJECTID)) {
 		btrfs_warn(root->fs_info,
 			   "the objectid of root %llu reaches its highest value",
 			   btrfs_root_id(root));
-		ret = -ENOSPC;
-		goto out;
+		return -ENOSPC;
 	}
 
-	*objectid = root->free_objectid++;
-	ret = 0;
-out:
-	mutex_unlock(&root->objectid_mutex);
-	return ret;
+	*objectid = val;
+	return 0;
 }
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index aaf16019d829..b41e06d5d2fb 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -472,18 +472,19 @@  int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info)
 			 *
 			 * Ensure that we skip any such subvol ids.
 			 *
-			 * We don't need to lock because this is only called
-			 * during mount before we start doing things like creating
-			 * subvolumes.
+			 * We don't need to worry about atomic ordering because
+			 * this is only called during mount before we start
+			 * doing things like creating subvolumes.
 			 */
 			if (is_fstree(qgroup->qgroupid) &&
-			    qgroup->qgroupid > tree_root->free_objectid)
+			    qgroup->qgroupid > (u64)atomic64_read(&tree_root->free_objectid))
 				/*
 				 * Don't need to check against BTRFS_LAST_FREE_OBJECTID,
 				 * as it will get checked on the next call to
 				 * btrfs_get_free_objectid.
 				 */
-				tree_root->free_objectid = qgroup->qgroupid + 1;
+				atomic64_set(&tree_root->free_objectid,
+					     qgroup->qgroupid + 1);
 		}
 		ret = btrfs_sysfs_add_one_qgroup(fs_info, qgroup);
 		if (ret < 0)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 955d1677e865..9d19528fee17 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -7325,9 +7325,6 @@  int btrfs_recover_log_trees(struct btrfs_root *log_root_tree)
 			 * We have just replayed everything, and the highest
 			 * objectid of fs roots probably has changed in case
 			 * some inode_item's got replayed.
-			 *
-			 * root->objectid_mutex is not acquired as log replay
-			 * could only happen during mount.
 			 */
 			ret = btrfs_init_root_free_objectid(root);
 			if (ret)