Message ID | 61e8946ae040075ce2fe378e39b500c4ac97e8a3.1681151504.git.sweettea-kernel@dorminy.me (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | btrfs: don't commit transaction for every subvol create | expand |
On Tue, Apr 11, 2023 at 3:22 PM Sweet Tea Dorminy <sweettea-kernel@dorminy.me> wrote: > > Recently a Meta-internal workload encountered subvolume creation taking > up to 2s each, significantly slower than directory creation. As they > were hoping to be able to use subvolumes instead of directories, and > were looking to create hundreds, this was a significant issue. After > Josef investigated, it turned out to be due to the transaction commit > currently performed at the end of subvolume creation. > > This change improves the workload by not doing transaction commit for every > subvolume creation, and merely requiring a transaction commit on fsync. > In the worst case, of doing a subvolume create and fsync in a loop, this > should require an equal amount of time to the current scheme; and in the > best case, the internal workload creating hundreds of subvols before > fsyncing is greatly improved. > > While it would be nice to be able to use the log tree and use the normal > fsync path, logtree replay can't deal with new subvolume inodes > presently. > > It's possible that there's some reason that the transaction commit is > necessary for correctness during subvolume creation; however, > git logs indicate that the commit dates back to the beginning of > subvolume creation, and there are no notes on why it would be necessary. > > Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> > --- > fs/btrfs/ioctl.c | 7 +++---- > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > index 25833b4eeaf5..a6f1ee2dc1b9 100644 > --- a/fs/btrfs/ioctl.c > +++ b/fs/btrfs/ioctl.c > @@ -647,6 +647,8 @@ static noinline int create_subvol(struct mnt_idmap *idmap, > } > trans->block_rsv = &block_rsv; > trans->bytes_reserved = block_rsv.size; > + /* tree log can't currently deal with an inode which is a new root */ > + btrfs_set_log_full_commit(trans); > > ret = btrfs_qgroup_inherit(trans, 0, objectid, inherit); > if (ret) > @@ -755,10 +757,7 @@ static noinline int create_subvol(struct mnt_idmap *idmap, > trans->bytes_reserved = 0; > btrfs_subvolume_release_metadata(root, &block_rsv); > > - if (ret) > - btrfs_end_transaction(trans); > - else > - ret = btrfs_commit_transaction(trans); > + btrfs_end_transaction(trans); > out_new_inode_args: > btrfs_new_inode_args_destroy(&new_inode_args); > out_inode: > -- > 2.40.0 > Wouldn't the consequence of this mean that we lose some guarantees around subvolume creation? That is, without it being committed, we would have a window in which the subvolume and data can be written without being written to disk? If so, how large is that window?
On 2023/4/12 03:10, Sweet Tea Dorminy wrote: > Recently a Meta-internal workload encountered subvolume creation taking > up to 2s each, significantly slower than directory creation. As they > were hoping to be able to use subvolumes instead of directories, and > were looking to create hundreds, this was a significant issue. After > Josef investigated, it turned out to be due to the transaction commit > currently performed at the end of subvolume creation. > > This change improves the workload by not doing transaction commit for every > subvolume creation, and merely requiring a transaction commit on fsync. > In the worst case, of doing a subvolume create and fsync in a loop, this > should require an equal amount of time to the current scheme; and in the > best case, the internal workload creating hundreds of subvols before > fsyncing is greatly improved. > > While it would be nice to be able to use the log tree and use the normal > fsync path, logtree replay can't deal with new subvolume inodes > presently. > > It's possible that there's some reason that the transaction commit is > necessary for correctness during subvolume creation; however, > git logs indicate that the commit dates back to the beginning of > subvolume creation, and there are no notes on why it would be necessary. For subvolume creation it looks fine. My main concern related to this topic is mostly around snapshots: - Snapshots can only be created during commit transaction - Snapshots qgroup accounting For now we only support quick path (aka, inherit the old numbers from the source, and that's no other qgroup level involved). But for subvolume creation, those are not involved at all. So the idea of skipping transaction commit looks solid to me. Reviewed-by: Qu Wenruo <wqu@suse.com> Thanks, Qu > > Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> > --- > fs/btrfs/ioctl.c | 7 +++---- > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > index 25833b4eeaf5..a6f1ee2dc1b9 100644 > --- a/fs/btrfs/ioctl.c > +++ b/fs/btrfs/ioctl.c > @@ -647,6 +647,8 @@ static noinline int create_subvol(struct mnt_idmap *idmap, > } > trans->block_rsv = &block_rsv; > trans->bytes_reserved = block_rsv.size; > + /* tree log can't currently deal with an inode which is a new root */ > + btrfs_set_log_full_commit(trans); > > ret = btrfs_qgroup_inherit(trans, 0, objectid, inherit); > if (ret) > @@ -755,10 +757,7 @@ static noinline int create_subvol(struct mnt_idmap *idmap, > trans->bytes_reserved = 0; > btrfs_subvolume_release_metadata(root, &block_rsv); > > - if (ret) > - btrfs_end_transaction(trans); > - else > - ret = btrfs_commit_transaction(trans); > + btrfs_end_transaction(trans); > out_new_inode_args: > btrfs_new_inode_args_destroy(&new_inode_args); > out_inode:
On 2023/4/12 09:45, Neal Gompa wrote: > On Tue, Apr 11, 2023 at 3:22 PM Sweet Tea Dorminy > <sweettea-kernel@dorminy.me> wrote: >> >> Recently a Meta-internal workload encountered subvolume creation taking >> up to 2s each, significantly slower than directory creation. As they >> were hoping to be able to use subvolumes instead of directories, and >> were looking to create hundreds, this was a significant issue. After >> Josef investigated, it turned out to be due to the transaction commit >> currently performed at the end of subvolume creation. >> >> This change improves the workload by not doing transaction commit for every >> subvolume creation, and merely requiring a transaction commit on fsync. >> In the worst case, of doing a subvolume create and fsync in a loop, this >> should require an equal amount of time to the current scheme; and in the >> best case, the internal workload creating hundreds of subvols before >> fsyncing is greatly improved. >> >> While it would be nice to be able to use the log tree and use the normal >> fsync path, logtree replay can't deal with new subvolume inodes >> presently. >> >> It's possible that there's some reason that the transaction commit is >> necessary for correctness during subvolume creation; however, >> git logs indicate that the commit dates back to the beginning of >> subvolume creation, and there are no notes on why it would be necessary. >> >> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> >> --- >> fs/btrfs/ioctl.c | 7 +++---- >> 1 file changed, 3 insertions(+), 4 deletions(-) >> >> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c >> index 25833b4eeaf5..a6f1ee2dc1b9 100644 >> --- a/fs/btrfs/ioctl.c >> +++ b/fs/btrfs/ioctl.c >> @@ -647,6 +647,8 @@ static noinline int create_subvol(struct mnt_idmap *idmap, >> } >> trans->block_rsv = &block_rsv; >> trans->bytes_reserved = block_rsv.size; >> + /* tree log can't currently deal with an inode which is a new root */ >> + btrfs_set_log_full_commit(trans); >> >> ret = btrfs_qgroup_inherit(trans, 0, objectid, inherit); >> if (ret) >> @@ -755,10 +757,7 @@ static noinline int create_subvol(struct mnt_idmap *idmap, >> trans->bytes_reserved = 0; >> btrfs_subvolume_release_metadata(root, &block_rsv); >> >> - if (ret) >> - btrfs_end_transaction(trans); >> - else >> - ret = btrfs_commit_transaction(trans); >> + btrfs_end_transaction(trans); >> out_new_inode_args: >> btrfs_new_inode_args_destroy(&new_inode_args); >> out_inode: >> -- >> 2.40.0 >> > > Wouldn't the consequence of this mean that we lose some guarantees > around subvolume creation? That is, without it being committed, we > would have a window in which the subvolume and data can be written > without being written to disk? If so, how large is that window? That can be avoided in btrfs-progs, by adding some -C|-c options just like subvolume deletion to ensure the case. If you're really concern about the window size, it's no smaller than commit interval (by default 30s). But that's not the full story, if you do fsync() on any inode of the new subvolume, it should cause a full commit transaction, which would write every needed metadata, including the subvolume itself, to the disk. So the end result is, you don't really need to bother the subvolume itself, just do your usual fsync() work on the inode you're really concerned, no need to bother the subvolume itself. Thanks, Qu > >
On 2023/4/12 10:34, Qu Wenruo wrote: > > > On 2023/4/12 09:45, Neal Gompa wrote: >> On Tue, Apr 11, 2023 at 3:22 PM Sweet Tea Dorminy >> <sweettea-kernel@dorminy.me> wrote: >>> >>> Recently a Meta-internal workload encountered subvolume creation taking >>> up to 2s each, significantly slower than directory creation. As they >>> were hoping to be able to use subvolumes instead of directories, and >>> were looking to create hundreds, this was a significant issue. After >>> Josef investigated, it turned out to be due to the transaction commit >>> currently performed at the end of subvolume creation. >>> >>> This change improves the workload by not doing transaction commit for >>> every >>> subvolume creation, and merely requiring a transaction commit on fsync. >>> In the worst case, of doing a subvolume create and fsync in a loop, this >>> should require an equal amount of time to the current scheme; and in the >>> best case, the internal workload creating hundreds of subvols before >>> fsyncing is greatly improved. >>> >>> While it would be nice to be able to use the log tree and use the normal >>> fsync path, logtree replay can't deal with new subvolume inodes >>> presently. >>> >>> It's possible that there's some reason that the transaction commit is >>> necessary for correctness during subvolume creation; however, >>> git logs indicate that the commit dates back to the beginning of >>> subvolume creation, and there are no notes on why it would be necessary. >>> >>> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> >>> --- >>> fs/btrfs/ioctl.c | 7 +++---- >>> 1 file changed, 3 insertions(+), 4 deletions(-) >>> >>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c >>> index 25833b4eeaf5..a6f1ee2dc1b9 100644 >>> --- a/fs/btrfs/ioctl.c >>> +++ b/fs/btrfs/ioctl.c >>> @@ -647,6 +647,8 @@ static noinline int create_subvol(struct >>> mnt_idmap *idmap, >>> } >>> trans->block_rsv = &block_rsv; >>> trans->bytes_reserved = block_rsv.size; >>> + /* tree log can't currently deal with an inode which is a new >>> root */ >>> + btrfs_set_log_full_commit(trans); >>> >>> ret = btrfs_qgroup_inherit(trans, 0, objectid, inherit); >>> if (ret) >>> @@ -755,10 +757,7 @@ static noinline int create_subvol(struct >>> mnt_idmap *idmap, >>> trans->bytes_reserved = 0; >>> btrfs_subvolume_release_metadata(root, &block_rsv); >>> >>> - if (ret) >>> - btrfs_end_transaction(trans); >>> - else >>> - ret = btrfs_commit_transaction(trans); >>> + btrfs_end_transaction(trans); >>> out_new_inode_args: >>> btrfs_new_inode_args_destroy(&new_inode_args); >>> out_inode: >>> -- >>> 2.40.0 >>> >> >> Wouldn't the consequence of this mean that we lose some guarantees >> around subvolume creation? That is, without it being committed, we >> would have a window in which the subvolume and data can be written >> without being written to disk? If so, how large is that window? > > That can be avoided in btrfs-progs, by adding some -C|-c options just > like subvolume deletion to ensure the case. > > > If you're really concern about the window size, it's no smaller than > commit interval (by default 30s). s/smaller/larger/ > > > But that's not the full story, if you do fsync() on any inode of the new > subvolume, it should cause a full commit transaction, which would write > every needed metadata, including the subvolume itself, to the disk. > > So the end result is, you don't really need to bother the subvolume > itself, just do your usual fsync() work on the inode you're really > concerned, no need to bother the subvolume itself. > > Thanks, > Qu > >> >>
On Tue, Apr 11, 2023 at 09:45:34PM -0400, Neal Gompa wrote: > On Tue, Apr 11, 2023 at 3:22 PM Sweet Tea Dorminy > <sweettea-kernel@dorminy.me> wrote: > > > > Recently a Meta-internal workload encountered subvolume creation taking > > up to 2s each, significantly slower than directory creation. As they > > were hoping to be able to use subvolumes instead of directories, and > > were looking to create hundreds, this was a significant issue. After > > Josef investigated, it turned out to be due to the transaction commit > > currently performed at the end of subvolume creation. > > > > This change improves the workload by not doing transaction commit for every > > subvolume creation, and merely requiring a transaction commit on fsync. > > In the worst case, of doing a subvolume create and fsync in a loop, this > > should require an equal amount of time to the current scheme; and in the > > best case, the internal workload creating hundreds of subvols before > > fsyncing is greatly improved. > > > > While it would be nice to be able to use the log tree and use the normal > > fsync path, logtree replay can't deal with new subvolume inodes > > presently. > > > > It's possible that there's some reason that the transaction commit is > > necessary for correctness during subvolume creation; however, > > git logs indicate that the commit dates back to the beginning of > > subvolume creation, and there are no notes on why it would be necessary. > > > > Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> > > --- > > fs/btrfs/ioctl.c | 7 +++---- > > 1 file changed, 3 insertions(+), 4 deletions(-) > > > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > > index 25833b4eeaf5..a6f1ee2dc1b9 100644 > > --- a/fs/btrfs/ioctl.c > > +++ b/fs/btrfs/ioctl.c > > @@ -647,6 +647,8 @@ static noinline int create_subvol(struct mnt_idmap *idmap, > > } > > trans->block_rsv = &block_rsv; > > trans->bytes_reserved = block_rsv.size; > > + /* tree log can't currently deal with an inode which is a new root */ > > + btrfs_set_log_full_commit(trans); > > > > ret = btrfs_qgroup_inherit(trans, 0, objectid, inherit); > > if (ret) > > @@ -755,10 +757,7 @@ static noinline int create_subvol(struct mnt_idmap *idmap, > > trans->bytes_reserved = 0; > > btrfs_subvolume_release_metadata(root, &block_rsv); > > > > - if (ret) > > - btrfs_end_transaction(trans); > > - else > > - ret = btrfs_commit_transaction(trans); > > + btrfs_end_transaction(trans); > > out_new_inode_args: > > btrfs_new_inode_args_destroy(&new_inode_args); > > out_inode: > > -- > > 2.40.0 > > > > Wouldn't the consequence of this mean that we lose some guarantees > around subvolume creation? That is, without it being committed, we > would have a window in which the subvolume and data can be written > without being written to disk? If so, how large is that window? It's the same as a normal directory, we create a subvol and then create a file in that subvol and start writing to that file, then lose power before the transaction commits, we lose everything. The same would happen for a new directory. The only concern here that's different is with the directory case you could fsync() the new file and it would all go into the tree log. As pointed out above the tree log stuff can't handle a subvolume that doesn't exist yet, so in the above case if the user fsync()'s that new file it'll commit the whole transaction and everything will be fine. Thanks, Josef
On Tue, Apr 11, 2023 at 03:10:53PM -0400, Sweet Tea Dorminy wrote: > Recently a Meta-internal workload encountered subvolume creation taking > up to 2s each, significantly slower than directory creation. As they > were hoping to be able to use subvolumes instead of directories, and > were looking to create hundreds, this was a significant issue. After > Josef investigated, it turned out to be due to the transaction commit > currently performed at the end of subvolume creation. > > This change improves the workload by not doing transaction commit for every > subvolume creation, and merely requiring a transaction commit on fsync. > In the worst case, of doing a subvolume create and fsync in a loop, this > should require an equal amount of time to the current scheme; and in the > best case, the internal workload creating hundreds of subvols before > fsyncing is greatly improved. > > While it would be nice to be able to use the log tree and use the normal > fsync path, logtree replay can't deal with new subvolume inodes > presently. > > It's possible that there's some reason that the transaction commit is > necessary for correctness during subvolume creation; however, > git logs indicate that the commit dates back to the beginning of > subvolume creation, and there are no notes on why it would be necessary. > > Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Thanks, Josef
On Tue, Apr 11, 2023 at 3:22 PM Sweet Tea Dorminy <sweettea-kernel@dorminy.me> wrote: > > Recently a Meta-internal workload encountered subvolume creation taking > up to 2s each, significantly slower than directory creation. As they > were hoping to be able to use subvolumes instead of directories, and > were looking to create hundreds, this was a significant issue. After > Josef investigated, it turned out to be due to the transaction commit > currently performed at the end of subvolume creation. > > This change improves the workload by not doing transaction commit for every > subvolume creation, and merely requiring a transaction commit on fsync. > In the worst case, of doing a subvolume create and fsync in a loop, this > should require an equal amount of time to the current scheme; and in the > best case, the internal workload creating hundreds of subvols before > fsyncing is greatly improved. > > While it would be nice to be able to use the log tree and use the normal > fsync path, logtree replay can't deal with new subvolume inodes > presently. > > It's possible that there's some reason that the transaction commit is > necessary for correctness during subvolume creation; however, > git logs indicate that the commit dates back to the beginning of > subvolume creation, and there are no notes on why it would be necessary. > > Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> > --- > fs/btrfs/ioctl.c | 7 +++---- > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > index 25833b4eeaf5..a6f1ee2dc1b9 100644 > --- a/fs/btrfs/ioctl.c > +++ b/fs/btrfs/ioctl.c > @@ -647,6 +647,8 @@ static noinline int create_subvol(struct mnt_idmap *idmap, > } > trans->block_rsv = &block_rsv; > trans->bytes_reserved = block_rsv.size; > + /* tree log can't currently deal with an inode which is a new root */ > + btrfs_set_log_full_commit(trans); > > ret = btrfs_qgroup_inherit(trans, 0, objectid, inherit); > if (ret) > @@ -755,10 +757,7 @@ static noinline int create_subvol(struct mnt_idmap *idmap, > trans->bytes_reserved = 0; > btrfs_subvolume_release_metadata(root, &block_rsv); > > - if (ret) > - btrfs_end_transaction(trans); > - else > - ret = btrfs_commit_transaction(trans); > + btrfs_end_transaction(trans); > out_new_inode_args: > btrfs_new_inode_args_destroy(&new_inode_args); > out_inode: > -- > 2.40.0 > Reviewed-by: Neal Gompa <neal@gompa.dev>
On Tue, Apr 11, 2023 at 03:10:53PM -0400, Sweet Tea Dorminy wrote: > Recently a Meta-internal workload encountered subvolume creation taking > up to 2s each, significantly slower than directory creation. As they > were hoping to be able to use subvolumes instead of directories, and > were looking to create hundreds, this was a significant issue. After > Josef investigated, it turned out to be due to the transaction commit > currently performed at the end of subvolume creation. > > This change improves the workload by not doing transaction commit for every > subvolume creation, and merely requiring a transaction commit on fsync. > In the worst case, of doing a subvolume create and fsync in a loop, this > should require an equal amount of time to the current scheme; and in the > best case, the internal workload creating hundreds of subvols before > fsyncing is greatly improved. > > While it would be nice to be able to use the log tree and use the normal > fsync path, logtree replay can't deal with new subvolume inodes > presently. > > It's possible that there's some reason that the transaction commit is > necessary for correctness during subvolume creation; however, > git logs indicate that the commit dates back to the beginning of > subvolume creation, and there are no notes on why it would be necessary. > > Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> For subvolume creation the situation is much easier than with deletion, as said by others, the semantics is similar to creating a directory, fsync works to make sure the data are persistent. So I think it's safe and allows the use case of creating many subvolumes without the full transaction commit overhead. And we don't need the commandline options. Added to misc-next, thanks.
> > So I think it's safe and allows the use case of creating many subvolumes > without the full transaction commit overhead. And we don't need the > commandline options. Added to misc-next, thanks. Just to check, have you pushed misc-next containing this to github? I can't find it in git log. Thanks!
On Fri, Apr 21, 2023 at 08:58:54AM -0400, Sweet Tea Dorminy wrote: > > > > > So I think it's safe and allows the use case of creating many subvolumes > > without the full transaction commit overhead. And we don't need the > > commandline options. Added to misc-next, thanks. > > Just to check, have you pushed misc-next containing this to github? I > can't find it in git log. Maybe I forgot to push the latest changes, now pushed.
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 25833b4eeaf5..a6f1ee2dc1b9 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -647,6 +647,8 @@ static noinline int create_subvol(struct mnt_idmap *idmap, } trans->block_rsv = &block_rsv; trans->bytes_reserved = block_rsv.size; + /* tree log can't currently deal with an inode which is a new root */ + btrfs_set_log_full_commit(trans); ret = btrfs_qgroup_inherit(trans, 0, objectid, inherit); if (ret) @@ -755,10 +757,7 @@ static noinline int create_subvol(struct mnt_idmap *idmap, trans->bytes_reserved = 0; btrfs_subvolume_release_metadata(root, &block_rsv); - if (ret) - btrfs_end_transaction(trans); - else - ret = btrfs_commit_transaction(trans); + btrfs_end_transaction(trans); out_new_inode_args: btrfs_new_inode_args_destroy(&new_inode_args); out_inode:
Recently a Meta-internal workload encountered subvolume creation taking up to 2s each, significantly slower than directory creation. As they were hoping to be able to use subvolumes instead of directories, and were looking to create hundreds, this was a significant issue. After Josef investigated, it turned out to be due to the transaction commit currently performed at the end of subvolume creation. This change improves the workload by not doing transaction commit for every subvolume creation, and merely requiring a transaction commit on fsync. In the worst case, of doing a subvolume create and fsync in a loop, this should require an equal amount of time to the current scheme; and in the best case, the internal workload creating hundreds of subvols before fsyncing is greatly improved. While it would be nice to be able to use the log tree and use the normal fsync path, logtree replay can't deal with new subvolume inodes presently. It's possible that there's some reason that the transaction commit is necessary for correctness during subvolume creation; however, git logs indicate that the commit dates back to the beginning of subvolume creation, and there are no notes on why it would be necessary. Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> --- fs/btrfs/ioctl.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-)