Message ID | e65d1d3fd413623f9d0c58614a296f0ab5422a05.1683057598.git.josef@toxicpanda.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | btrfs: don't free qgroup space unless specified | expand |
On 2023/5/3 04:00, Josef Bacik wrote: > Boris noticed in his simple quotas testing that he was getting a leak > with Sweet Tea's change to subvol create that stopped doing a > transaction commit. This was just a side effect of that change. > > In the delayed inode code we have an optimization that will free extra > reservations if we think we can pack a dir item into an already modified > leaf. Previously this wouldn't be triggered in the subvolume create > case because we'd commit the transaction, it was still possible but > much harder to trigger. It could actually be triggered if we did a > mkdir && subvol create with qgroups enabled. > > This occurs because in btrfs_insert_delayed_dir_index(), which gets > called when we're adding the dir item, we do the following > > btrfs_block_rsv_release(fs_info, trans->block_rsv, bytes, NULL); > > if we're able to skip reserving space. > > The problem here is that trans->block_rsv points at the temporary block > rsv for the subvolume create, which has qgroup reservations in the block > rsv. > > This is a problem because btrfs_block_rsv_release() will do the > following > > if (block_rsv->qgroup_rsv_reserved >= block_rsv->qgroup_rsv_size) { > qgroup_to_release = block_rsv->qgroup_rsv_reserved - > block_rsv->qgroup_rsv_size; > block_rsv->qgroup_rsv_reserved = block_rsv->qgroup_rsv_size; > } > > The temporary block rsv just has ->qgroup_rsv_reserved set, > ->qgroup_rsv_size == 0. The optimization in > btrfs_insert_delayed_dir_index() sets ->qgroup_rsv_reserved = 0. Then > later on when we call btrfs_subvolume_release_metadata() which has > > btrfs_block_rsv_release(fs_info, rsv, (u64)-1, &qgroup_to_release); > btrfs_qgroup_convert_reserved_meta(root, qgroup_to_release); > > qgroup_to_release is set to 0, and we do not convert the reserved > metadata space. > > The problem here is that the block rsv code has been unconditionally > messing with ->qgroup_rsv_reserved, because the main place this is used > is delalloc, and any time we call btrfs_block_rsv_release() we do it > with qgroup_to_release set, and thus do the proper accounting. > > The subvolume code is the only other code that uses the qgroup > reservation stuff, but it's intermingled with the above optimization, > and thus was getting its reservation freed out from underneath it and > thus leaking the reserved space. > > The solution is to simply not mess with the qgroup reservations if we > don't have qgroup_to_release set. This works with the existing code as > anything that messes with the delalloc reservations always have > qgroup_to_release set. This fixes the leak that Boris was observing. > > Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Thanks, Qu > --- > fs/btrfs/block-rsv.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c > index 3ab707e26fa2..ac18c43fadad 100644 > --- a/fs/btrfs/block-rsv.c > +++ b/fs/btrfs/block-rsv.c > @@ -124,7 +124,8 @@ static u64 block_rsv_release_bytes(struct btrfs_fs_info *fs_info, > } else { > num_bytes = 0; > } > - if (block_rsv->qgroup_rsv_reserved >= block_rsv->qgroup_rsv_size) { > + if (qgroup_to_release_ret && > + block_rsv->qgroup_rsv_reserved >= block_rsv->qgroup_rsv_size) { > qgroup_to_release = block_rsv->qgroup_rsv_reserved - > block_rsv->qgroup_rsv_size; > block_rsv->qgroup_rsv_reserved = block_rsv->qgroup_rsv_size;
On Tue, May 02, 2023 at 04:00:06PM -0400, Josef Bacik wrote: > Boris noticed in his simple quotas testing that he was getting a leak > with Sweet Tea's change to subvol create that stopped doing a > transaction commit. This was just a side effect of that change. > > In the delayed inode code we have an optimization that will free extra > reservations if we think we can pack a dir item into an already modified > leaf. Previously this wouldn't be triggered in the subvolume create > case because we'd commit the transaction, it was still possible but > much harder to trigger. It could actually be triggered if we did a > mkdir && subvol create with qgroups enabled. > > This occurs because in btrfs_insert_delayed_dir_index(), which gets > called when we're adding the dir item, we do the following > > btrfs_block_rsv_release(fs_info, trans->block_rsv, bytes, NULL); > > if we're able to skip reserving space. > > The problem here is that trans->block_rsv points at the temporary block > rsv for the subvolume create, which has qgroup reservations in the block > rsv. > > This is a problem because btrfs_block_rsv_release() will do the > following > > if (block_rsv->qgroup_rsv_reserved >= block_rsv->qgroup_rsv_size) { > qgroup_to_release = block_rsv->qgroup_rsv_reserved - > block_rsv->qgroup_rsv_size; > block_rsv->qgroup_rsv_reserved = block_rsv->qgroup_rsv_size; > } > > The temporary block rsv just has ->qgroup_rsv_reserved set, > ->qgroup_rsv_size == 0. The optimization in > btrfs_insert_delayed_dir_index() sets ->qgroup_rsv_reserved = 0. Then > later on when we call btrfs_subvolume_release_metadata() which has > > btrfs_block_rsv_release(fs_info, rsv, (u64)-1, &qgroup_to_release); > btrfs_qgroup_convert_reserved_meta(root, qgroup_to_release); > > qgroup_to_release is set to 0, and we do not convert the reserved > metadata space. > > The problem here is that the block rsv code has been unconditionally > messing with ->qgroup_rsv_reserved, because the main place this is used > is delalloc, and any time we call btrfs_block_rsv_release() we do it > with qgroup_to_release set, and thus do the proper accounting. > > The subvolume code is the only other code that uses the qgroup > reservation stuff, but it's intermingled with the above optimization, > and thus was getting its reservation freed out from underneath it and > thus leaking the reserved space. > > The solution is to simply not mess with the qgroup reservations if we > don't have qgroup_to_release set. This works with the existing code as > anything that messes with the delalloc reservations always have > qgroup_to_release set. This fixes the leak that Boris was observing. > > Signed-off-by: Josef Bacik <josef@toxicpanda.com> Added to misc-next, thanks.
diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c index 3ab707e26fa2..ac18c43fadad 100644 --- a/fs/btrfs/block-rsv.c +++ b/fs/btrfs/block-rsv.c @@ -124,7 +124,8 @@ static u64 block_rsv_release_bytes(struct btrfs_fs_info *fs_info, } else { num_bytes = 0; } - if (block_rsv->qgroup_rsv_reserved >= block_rsv->qgroup_rsv_size) { + if (qgroup_to_release_ret && + block_rsv->qgroup_rsv_reserved >= block_rsv->qgroup_rsv_size) { qgroup_to_release = block_rsv->qgroup_rsv_reserved - block_rsv->qgroup_rsv_size; block_rsv->qgroup_rsv_reserved = block_rsv->qgroup_rsv_size;
Boris noticed in his simple quotas testing that he was getting a leak with Sweet Tea's change to subvol create that stopped doing a transaction commit. This was just a side effect of that change. In the delayed inode code we have an optimization that will free extra reservations if we think we can pack a dir item into an already modified leaf. Previously this wouldn't be triggered in the subvolume create case because we'd commit the transaction, it was still possible but much harder to trigger. It could actually be triggered if we did a mkdir && subvol create with qgroups enabled. This occurs because in btrfs_insert_delayed_dir_index(), which gets called when we're adding the dir item, we do the following btrfs_block_rsv_release(fs_info, trans->block_rsv, bytes, NULL); if we're able to skip reserving space. The problem here is that trans->block_rsv points at the temporary block rsv for the subvolume create, which has qgroup reservations in the block rsv. This is a problem because btrfs_block_rsv_release() will do the following if (block_rsv->qgroup_rsv_reserved >= block_rsv->qgroup_rsv_size) { qgroup_to_release = block_rsv->qgroup_rsv_reserved - block_rsv->qgroup_rsv_size; block_rsv->qgroup_rsv_reserved = block_rsv->qgroup_rsv_size; } The temporary block rsv just has ->qgroup_rsv_reserved set, ->qgroup_rsv_size == 0. The optimization in btrfs_insert_delayed_dir_index() sets ->qgroup_rsv_reserved = 0. Then later on when we call btrfs_subvolume_release_metadata() which has btrfs_block_rsv_release(fs_info, rsv, (u64)-1, &qgroup_to_release); btrfs_qgroup_convert_reserved_meta(root, qgroup_to_release); qgroup_to_release is set to 0, and we do not convert the reserved metadata space. The problem here is that the block rsv code has been unconditionally messing with ->qgroup_rsv_reserved, because the main place this is used is delalloc, and any time we call btrfs_block_rsv_release() we do it with qgroup_to_release set, and thus do the proper accounting. The subvolume code is the only other code that uses the qgroup reservation stuff, but it's intermingled with the above optimization, and thus was getting its reservation freed out from underneath it and thus leaking the reserved space. The solution is to simply not mess with the qgroup reservations if we don't have qgroup_to_release set. This works with the existing code as anything that messes with the delalloc reservations always have qgroup_to_release set. This fixes the leak that Boris was observing. Signed-off-by: Josef Bacik <josef@toxicpanda.com> --- fs/btrfs/block-rsv.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)