Major qgroup regression in 4.2?
diff mbox

Message ID 20150814211448.GB1145@wotan.suse.de
State New
Headers show

Commit Message

Mark Fasheh Aug. 14, 2015, 9:14 p.m. UTC
On Thu, Aug 13, 2015 at 04:13:08PM -0700, Mark Fasheh wrote:
> If there *is* a plan to make this all work again, can I please hear it? The
> comment mentions something about adding those nodes to a dirty_extent_root.
> Why wasn't that done?

Ok so I had more time to look through the changes today and came up with
this naive patch, it simply inserts dirty extent records where we were doing
our qgroup refs before. This passes my micro-test but I'm unclear on whether
there's some pitfall I'm unaware of (I'm guessing there must be?). Please
advise.

Thanks,
	--Mark

--
Mark Fasheh


From: Mark Fasheh <mfasheh@suse.de>

btrfs: qgroup: account shared subtree during snapshot delete (again)

Commit 0ed4792 ('btrfs: qgroup: Switch to new extent-oriented qgroup
mechanism.') removed our qgroup accounting during
btrfs_drop_snapshot(). Predictably, this results in qgroup numbers
going bad shortly after a snapshot is removed.

Fix this by adding a dirty extent record when we encounter extents
during our shared subtree walk. This effectively restores the
functionality we had with the original shared subtree walkign code in
commit 1152651 (btrfs: qgroup: account shared subtrees during snapshot
delete)

This patch also moves the open coded allocation handling for
qgroup_extent_record into their own functions.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>

qdiff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index ac3e81d..8156f50 100644
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Qu Wenruo Aug. 17, 2015, 1:33 a.m. UTC | #1
Hi Mark,

Thanks for pointing out the problem.
But it's already known and we didn't have a good idea to solve it yet.

BTW, the old framework won't handle it well either.
Liu Bo's testcase (btrfs/017) can easily trigger it.
So it's not a regression.

Let me explain the problem a little and then introduce the fixing plan.

[Qgroup for subvolume delete]
Here are two subvolumes, sharing the whole tree.
The whole trees are consistent with 8 tree blocks.
A B are the tree root of subv 257 and 258 respectively.
C~H are all shared by the two subvolumes.
And I is one data extent which is recorded in tree block H.


subv 257(A)    subv258(B)
    |     \       /    |
    C                  D
   / \                / \
   E F                G H
                        :
                        :
                        I

Let's assume the tree block are all in 4K size(just for easy 
calculation), and I is in 16K size.

Now qgroup info should be:
257: excl: 4K, rfer: 44K (6 shared tree block + 1 data + 1 tree root)
258: excl: 4K, rfer: 44K

If we are going to delete subvolume 258, then expected qgroup should be:
257: excl 44K, rfer 44K

Which means we need to recalculate all the reference relation of tree 
block C~H and data extent I.

[[The real problem]]
Now the real problem is, we need to mark *ALL* metadata and data extent 
of a subvolume dirty to make subvolume deletion work.
Forgot to say, that's at transaction commit time, if we really do that,
commit time consumption is not acceptable if the subvolume is huge.

Let's take a extreme example, here is a 4 level tree, with 16K block size.
All their contents are file extents.
The math will be, for 3 level nodes, 16K node can contain at most 161 slots.

161 ^ 3 = 4173281

That's the number of leaves that 3 level nodes can contain.

And for file extent, it's 53 bytes large + 9 bytes of key
So one leaf can contain at most 16283 / 62 = 262 file extents.

At last, 4 level tree can contain 4173281 * 262 = 1093399622 file extents.

This means, at the worst case, a subvolume delete will need to mark
1,093,399,622 data extent dirty at commit time.
That's definitely unacceptable, as such operation won't only happen in 
memory, but also may kick read to find the correct ref number.

And that's just 4 level case, btrfs support max to 8 level.
And I just counts the data extent, don't forget to add tree blocks.

[[Possible fix plan]]
As you can see the biggest problem is the number of child metadata/data 
extents can be quite large, it's not possible to mark them all and 
account at commit transaction time.

But the good news is, we can keep the snapshot and account them in 
several commits.
The new extent-orient accounting is quite accurate as it happens at 
commit time.

So one possible fix is, if we want to delete a subvolume, we put the 
subvolume into an orphan status, and let qgroup account to ignore that 
qgroup from then on, and mark some extent dirty in one transaction, and 
continue marking in next transaction until we mark all extents in that 
subvolume dirty.
Then free the subvolume.

That's the ideal method for both snapshot creation and deletion, but 
takes the most time.

And we need to change existing snapshot deletion routine to delay it and 
add new facility to ignore given id(s) in new_roots accounting.
Anyway, it's not an easy work and it's already there before my qgroup 
rework.

[Hot fix idea]
So far, the best and easiest method I come up with is, if we found 
qgroup is enabled and the snapshot to delete is above level 1(level 
starts from 0), then mark the QGROUP_INCONSISTENT flag to info user to 
do a rescan.


Other comments will be inlined below.

Mark Fasheh wrote on 2015/08/14 14:14 -0700:
> On Thu, Aug 13, 2015 at 04:13:08PM -0700, Mark Fasheh wrote:
>> If there *is* a plan to make this all work again, can I please hear it? The
>> comment mentions something about adding those nodes to a dirty_extent_root.
>> Why wasn't that done?
>
> Ok so I had more time to look through the changes today and came up with
> this naive patch, it simply inserts dirty extent records where we were doing
> our qgroup refs before. This passes my micro-test but I'm unclear on whether
> there's some pitfall I'm unaware of (I'm guessing there must be?). Please
> advise.
Overall neat and easy fix, but only lacks scalability.

As we are doing qgroup accounting at commit transaction time, it will 
stuck for a long time.
BTW, if we don't do it at that time, qgroup number can easily go crazy 
like the old days.

>
> Thanks,
> 	--Mark
>
> --
> Mark Fasheh
>
>
> From: Mark Fasheh <mfasheh@suse.de>
>
> btrfs: qgroup: account shared subtree during snapshot delete (again)
>
> Commit 0ed4792 ('btrfs: qgroup: Switch to new extent-oriented qgroup
> mechanism.') removed our qgroup accounting during
> btrfs_drop_snapshot(). Predictably, this results in qgroup numbers
> going bad shortly after a snapshot is removed.
>
> Fix this by adding a dirty extent record when we encounter extents
> during our shared subtree walk. This effectively restores the
> functionality we had with the original shared subtree walkign code in
> commit 1152651 (btrfs: qgroup: account shared subtrees during snapshot
> delete)
>
> This patch also moves the open coded allocation handling for
> qgroup_extent_record into their own functions.
>
> Signed-off-by: Mark Fasheh <mfasheh@suse.de>
>
> qdiff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index ac3e81d..8156f50 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -486,7 +486,7 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
>   		qexisting = btrfs_qgroup_insert_dirty_extent(delayed_refs,
>   							     qrecord);
>   		if (qexisting)
> -			kfree(qrecord);
> +			btrfs_qgroup_free_extent_record(qrecord);
Good cleanup.

I hope it can be extracted into one standalone patch.
As the fix may still need some discussion.
>   	}
>
>   	spin_lock_init(&head_ref->lock);
> @@ -654,7 +654,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
>   		goto free_ref;
>
>   	if (fs_info->quota_enabled && is_fstree(ref_root)) {
> -		record = kmalloc(sizeof(*record), GFP_NOFS);
> +		record = btrfs_qgroup_alloc_extent_record();
>   		if (!record)
>   			goto free_head_ref;
>   	}
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 07204bf..ab81135 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7756,18 +7756,31 @@ reada:
>   	wc->reada_slot = slot;
>   }
>
> -/*
> - * TODO: Modify related function to add related node/leaf to dirty_extent_root,
> - * for later qgroup accounting.
> - *
> - * Current, this function does nothing.
> - */
> +static int record_one_item(struct btrfs_trans_handle *trans, u64 bytenr,
> +			   u64 num_bytes)
> +{
> +	struct btrfs_qgroup_extent_record *qrecord = btrfs_qgroup_alloc_extent_record();
> +	struct btrfs_delayed_ref_root *delayed_refs = &trans->transaction->delayed_refs;
> +
> +	if (!qrecord)
> +		return -ENOMEM;
> +
> +	qrecord->bytenr = bytenr;
> +	qrecord->num_bytes = num_bytes;
> +	qrecord->old_roots = NULL;
> +
> +	if (btrfs_qgroup_insert_dirty_extent(delayed_refs, qrecord))
> +		btrfs_qgroup_free_extent_record(qrecord);
> +
> +	return 0;
> +}
> +
>   static int account_leaf_items(struct btrfs_trans_handle *trans,
>   			      struct btrfs_root *root,
>   			      struct extent_buffer *eb)
>   {
>   	int nr = btrfs_header_nritems(eb);
> -	int i, extent_type;
> +	int i, extent_type, ret;
>   	struct btrfs_key key;
>   	struct btrfs_file_extent_item *fi;
>   	u64 bytenr, num_bytes;
> @@ -7790,6 +7803,10 @@ static int account_leaf_items(struct btrfs_trans_handle *trans,
>   			continue;
>
>   		num_bytes = btrfs_file_extent_disk_num_bytes(eb, fi);
> +
> +		ret = record_one_item(trans, bytenr, num_bytes);
> +		if (ret)
> +			return ret;
>   	}
>   	return 0;
>   }
> @@ -7858,8 +7875,6 @@ static int adjust_slots_upwards(struct btrfs_root *root,
>
>   /*
>    * root_eb is the subtree root and is locked before this function is called.
> - * TODO: Modify this function to mark all (including complete shared node)
> - * to dirty_extent_root to allow it get accounted in qgroup.
>    */
>   static int account_shared_subtree(struct btrfs_trans_handle *trans,
>   				  struct btrfs_root *root,
> @@ -7937,6 +7952,11 @@ walk_down:
>   			btrfs_tree_read_lock(eb);
>   			btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
>   			path->locks[level] = BTRFS_READ_LOCK_BLOCKING;
> +
> +			ret = record_one_item(trans, child_bytenr,
> +					      root->nodesize);
> +			if (ret)
> +				goto out;
The problem is scalability, see the extreme case I mentioned above.
So I didn't do it at that time.

Would you please try to make a complicated and large fs, whose root 
level is 3 (debug-tree can help you to determine if it's large enough),
and do a snapshot of it.

And then delete it, to see how much time it takes to commit transaction.
So from this respect, I'd like to rescan, as when rescanning, we can 
still do other things without "stuck" transaction.

Thanks,
Qu
>   		}
>
>   		if (level == 0) {
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 8a82029..27ec405 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -1452,6 +1452,17 @@ int btrfs_qgroup_prepare_account_extents(struct btrfs_trans_handle *trans,
>   	return ret;
>   }
>
> +struct btrfs_qgroup_extent_record *btrfs_qgroup_alloc_extent_record(void)
> +{
> +	return kmalloc(sizeof(struct btrfs_qgroup_extent_record), GFP_NOFS);
> +}
> +
> +void btrfs_qgroup_free_extent_record(struct btrfs_qgroup_extent_record *record)
> +{
> +	if (record)
> +		kfree(record);
> +}
> +
I'd like to move them into headers, as they are quite small, inline them 
should reduce the function call overhead.

Thanks,
Qu
>   struct btrfs_qgroup_extent_record
>   *btrfs_qgroup_insert_dirty_extent(struct btrfs_delayed_ref_root *delayed_refs,
>   				  struct btrfs_qgroup_extent_record *record)
> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
> index 6387dcf..dfba663 100644
> --- a/fs/btrfs/qgroup.h
> +++ b/fs/btrfs/qgroup.h
> @@ -59,6 +59,9 @@ int btrfs_qgroup_prepare_account_extents(struct btrfs_trans_handle *trans,
>   struct btrfs_qgroup_extent_record
>   *btrfs_qgroup_insert_dirty_extent(struct btrfs_delayed_ref_root *delayed_refs,
>   				  struct btrfs_qgroup_extent_record *record);
> +struct btrfs_qgroup_extent_record
> +*btrfs_qgroup_alloc_extent_record(void);
> +void btrfs_qgroup_free_extent_record(struct btrfs_qgroup_extent_record *record);
>   int
>   btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans,
>   			    struct btrfs_fs_info *fs_info,
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Fasheh Aug. 17, 2015, 9:13 p.m. UTC | #2
Hi Qu,

Firstly thanks for the response. I have a few new questions and comments
below,

On Mon, Aug 17, 2015 at 09:33:54AM +0800, Qu Wenruo wrote:
> Thanks for pointing out the problem.
> But it's already known and we didn't have a good idea to solve it yet.
> 
> BTW, the old framework won't handle it well either.
> Liu Bo's testcase (btrfs/017) can easily trigger it.
> So it's not a regression.

I don't understand your meaning here. btrfs/017 doesn't have anything to do
with subvolume deletion. The regression I am talking about is that deleting
a subvolume in btrfs on 4.2 breaks qgroup numbers, whereas it did not in
4.1. Btw, I do acknowledge that 017 *was* broken before and fixed by your
patches (thanks) but that's of course been at the expense of reintroducing
the subvol regression :(


> Let me explain the problem a little and then introduce the fixing plan.
> 
> [Qgroup for subvolume delete]
> Here are two subvolumes, sharing the whole tree.
> The whole trees are consistent with 8 tree blocks.
> A B are the tree root of subv 257 and 258 respectively.
> C~H are all shared by the two subvolumes.
> And I is one data extent which is recorded in tree block H.
> 
> 
> subv 257(A)    subv258(B)
>    |     \       /    |
>    C                  D
>   / \                / \
>   E F                G H
>                        :
>                        :
>                        I
> 
> Let's assume the tree block are all in 4K size(just for easy
> calculation), and I is in 16K size.
> 
> Now qgroup info should be:
> 257: excl: 4K, rfer: 44K (6 shared tree block + 1 data + 1 tree root)
> 258: excl: 4K, rfer: 44K
> 
> If we are going to delete subvolume 258, then expected qgroup should be:
> 257: excl 44K, rfer 44K

Ok, this is basically explaining how we expect the numbers from a subvolume
delete to work out. Everything there looks normal to me.


> Which means we need to recalculate all the reference relation of
> tree block C~H and data extent I.
>
> [[The real problem]]
> Now the real problem is, we need to mark *ALL* metadata and data
> extent of a subvolume dirty to make subvolume deletion work.

I don't understand why we have to mark all nodes of a subvol. The previous
implementation was able to get this correct without recalculating every
block in the tree. It only passed shared nodes and items through to the
qgroup accounting code. The accounting code in turn had very simple logic
for working it out - if the extent is now exclusively owned, we adjust
qgroup->excl for the (now only remaing) owner.

If you go back to qgroup_subtree_accounting() it, you would see
*that it never* changes ->rfer on a qgroup. That is because the rfer counts
on any related subvolumes don't change during subvol delete. Your example above in
fact illustrates this.

So to be clear,

- Why do we have to visit all tree nodes with the qgroup code given that
we only cared about shared ones before"


> Forgot to say, that's at transaction commit time, if we really do that,
> commit time consumption is not acceptable if the subvolume is huge.

This too is confusing to me. Even if we assume that qgroups will get every
node now instead of just the shared ones:

- Drop subvol already visits every node (though it doesn't pass them all to
the qgroup code). We were doing this before just fine to my knowledge - what
changed? The commit-time qgroup accounting?

- Also, btrfs_drop_snapshot() throttles transaction commits (via
btrfs_end_transaction_throttle()), so anything done at commit time looks
like it would be broken up into smaller chunks of work for us.

Am I mistaken in how I understand these functions?


Also I must ask... How did you discover this performance issue? Are we
talking about something theoretical here or was your implementation slow on
subvolume delete?


> [[Possible fix plan]]
> As you can see the biggest problem is the number of child
> metadata/data extents can be quite large, it's not possible to mark
> them all and account at commit transaction time.

... right so handling qgroups at commit time is performance sensitive in
that too many qgroup updates will cause the commit code to do a lot of work.
I think that actually answers one of my questions above, though my followup would be:

Would I be correct if I said this is a problem for any workload that creates
a large number of qgroup updates in a short period of time?


> But the good news is, we can keep the snapshot and account them in
> several commits.
> The new extent-orient accounting is quite accurate as it happens at
> commit time.

Btw, one thing I should say is that in general I really like your rewrite
grats on that - the code is far easier to read through and I like very much
that we've taken some of the open-coded qgroup code out of our extent
inc/dec code.


> So one possible fix is, if we want to delete a subvolume, we put the
> subvolume into an orphan status, and let qgroup account to ignore
> that qgroup from then on, and mark some extent dirty in one
> transaction, and continue marking in next transaction until we mark
> all extents in that subvolume dirty.
> Then free the subvolume.
> 
> That's the ideal method for both snapshot creation and deletion, but
> takes the most time.

It also sounds like a disk format change which would really suck to tell
users to do just so they can get accurate qgroup numbers.


> [Hot fix idea]
> So far, the best and easiest method I come up with is, if we found
> qgroup is enabled and the snapshot to delete is above level 1(level
> starts from 0), then mark the QGROUP_INCONSISTENT flag to info user
> to do a rescan.

This is exactly the kind of band-aid solution we wanted to avoid the first
time qgroups and subvolume handling were fixed.
	--Mark

--
Mark Fasheh
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Qu Wenruo Aug. 18, 2015, 1:42 a.m. UTC | #3
Mark Fasheh wrote on 2015/08/17 14:13 -0700:
> Hi Qu,
>
> Firstly thanks for the response. I have a few new questions and comments
> below,
>
> On Mon, Aug 17, 2015 at 09:33:54AM +0800, Qu Wenruo wrote:
>> Thanks for pointing out the problem.
>> But it's already known and we didn't have a good idea to solve it yet.
>>
>> BTW, the old framework won't handle it well either.
>> Liu Bo's testcase (btrfs/017) can easily trigger it.
>> So it's not a regression.
>
> I don't understand your meaning here. btrfs/017 doesn't have anything to do
> with subvolume deletion. The regression I am talking about is that deleting
> a subvolume in btrfs on 4.2 breaks qgroup numbers, whereas it did not in
> 4.1. Btw, I do acknowledge that 017 *was* broken before and fixed by your
> patches (thanks) but that's of course been at the expense of reintroducing
> the subvol regression :(
Oh, it seems that my memory went wrong about the test case.

But I did remember old implement has something wrong with subvolume 
deletion too.
(Or maybe I'm wrong again?)
>
>
>> Let me explain the problem a little and then introduce the fixing plan.
>>
>> [Qgroup for subvolume delete]
>> Here are two subvolumes, sharing the whole tree.
>> The whole trees are consistent with 8 tree blocks.
>> A B are the tree root of subv 257 and 258 respectively.
>> C~H are all shared by the two subvolumes.
>> And I is one data extent which is recorded in tree block H.
>>
>>
>> subv 257(A)    subv258(B)
>>     |     \       /    |
>>     C                  D
>>    / \                / \
>>    E F                G H
>>                         :
>>                         :
>>                         I
>>
>> Let's assume the tree block are all in 4K size(just for easy
>> calculation), and I is in 16K size.
>>
>> Now qgroup info should be:
>> 257: excl: 4K, rfer: 44K (6 shared tree block + 1 data + 1 tree root)
>> 258: excl: 4K, rfer: 44K
>>
>> If we are going to delete subvolume 258, then expected qgroup should be:
>> 257: excl 44K, rfer 44K
>
> Ok, this is basically explaining how we expect the numbers from a subvolume
> delete to work out. Everything there looks normal to me.
>
>
>> Which means we need to recalculate all the reference relation of
>> tree block C~H and data extent I.
>>
>> [[The real problem]]
>> Now the real problem is, we need to mark *ALL* metadata and data
>> extent of a subvolume dirty to make subvolume deletion work.
>
> I don't understand why we have to mark all nodes of a subvol. The previous
> implementation was able to get this correct without recalculating every
> block in the tree. It only passed shared nodes and items through to the
> qgroup accounting code. The accounting code in turn had very simple logic
> for working it out - if the extent is now exclusively owned, we adjust
> qgroup->excl for the (now only remaing) owner.
Oh, you're right here.
I'm stupid on this, as exclusive extent doesn't matter in the case.

So no need to iterate the whole tree, but only shared extents.
That's already a lot of extent we can skip.

>
> If you go back to qgroup_subtree_accounting() it, you would see
> *that it never* changes ->rfer on a qgroup. That is because the rfer counts
> on any related subvolumes don't change during subvol delete. Your example above in
> fact illustrates this.
>
> So to be clear,
>
> - Why do we have to visit all tree nodes with the qgroup code given that
> we only cared about shared ones before"

My fault, as I didn't jump out of the box and still using the idea of 
rescan whole tree to do it.

>
>
>> Forgot to say, that's at transaction commit time, if we really do that,
>> commit time consumption is not acceptable if the subvolume is huge.
>
> This too is confusing to me. Even if we assume that qgroups will get every
> node now instead of just the shared ones:
>
> - Drop subvol already visits every node (though it doesn't pass them all to
> the qgroup code). We were doing this before just fine to my knowledge - what
> changed? The commit-time qgroup accounting?

Nothing, as even old implement does qgroup accounting at 
run_delayed_refs time, which is also done at commit_transaction.

So that's just my meaningless worry.

>
> - Also, btrfs_drop_snapshot() throttles transaction commits (via
> btrfs_end_transaction_throttle()), so anything done at commit time looks
> like it would be broken up into smaller chunks of work for us.
>
> Am I mistaken in how I understand these functions?

No, you're right.
Overall my concern about commit-time qgroup accouting performance impact 
is meaningless now.

>
>
> Also I must ask... How did you discover this performance issue? Are we
> talking about something theoretical here or was your implementation slow on
> subvolume delete?
Mainly theoretical.
And theoretically, the new implement should be faster than old implement.

But I'm not quite sure about the old implement is fast enough for 
subvolume deletion.
Anyway, your fix seems no slower than old implement.
So I'm completely OK with it now.

>
>
>> [[Possible fix plan]]
>> As you can see the biggest problem is the number of child
>> metadata/data extents can be quite large, it's not possible to mark
>> them all and account at commit transaction time.
>
> ... right so handling qgroups at commit time is performance sensitive in
> that too many qgroup updates will cause the commit code to do a lot of work.
> I think that actually answers one of my questions above, though my followup would be:
>
> Would I be correct if I said this is a problem for any workload that creates
> a large number of qgroup updates in a short period of time?
I'm afraid that's true.

The typical one should be file clone.
And that may be one of the worst case.

As file clone will increase data extent backref, even inside the same 
subvolume, it will cause qgroup accounting.

And it won't cause transaction to be committed, so in one transaction it 
can do file clone a lot of times.
If all clone are happen on one extent, that's OK as qgroup only need to 
do one accounting.
(BTW, only implement will do a lot of accounting even cloning the same 
extent)

But if someone is doing file clone on different extents, that will be a 
disaster.

Unlike other normal write, file clone won't cause much dirty page, so 
transaction won't be triggered by dirty page threshold.

So after a lot of file cloning on different extents, next transaction 
commit will cost a lot of time.

But that's my theoretical assumption, it may be totally wrong just like 
what I thought about subvolume deletion.

>
>
>> But the good news is, we can keep the snapshot and account them in
>> several commits.
>> The new extent-orient accounting is quite accurate as it happens at
>> commit time.
>
> Btw, one thing I should say is that in general I really like your rewrite
> grats on that - the code is far easier to read through and I like very much
> that we've taken some of the open-coded qgroup code out of our extent
> inc/dec code.

It's my pleasure. :)

>
>
>> So one possible fix is, if we want to delete a subvolume, we put the
>> subvolume into an orphan status, and let qgroup account to ignore
>> that qgroup from then on, and mark some extent dirty in one
>> transaction, and continue marking in next transaction until we mark
>> all extents in that subvolume dirty.
>> Then free the subvolume.
>>
>> That's the ideal method for both snapshot creation and deletion, but
>> takes the most time.
>
> It also sounds like a disk format change which would really suck to tell
> users to do just so they can get accurate qgroup numbers.

With all your explain, I think your current fix is good enough, at least 
no worse than old implement.

So I'm OK with it.
As it fixes a regression without major performance drop.

We can improve performance later if it's really needed.

>
>
>> [Hot fix idea]
>> So far, the best and easiest method I come up with is, if we found
>> qgroup is enabled and the snapshot to delete is above level 1(level
>> starts from 0), then mark the QGROUP_INCONSISTENT flag to info user
>> to do a rescan.
>
> This is exactly the kind of band-aid solution we wanted to avoid the first
> time qgroups and subvolume handling were fixed.
> 	--Mark
With your fix, the quick and dirty one is not needed anyway.

Good job.

Thanks,
Qu
>
> --
> Mark Fasheh
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Qu Wenruo Aug. 18, 2015, 8:03 a.m. UTC | #4
Qu Wenruo wrote on 2015/08/18 09:42 +0800:
>
>
> Mark Fasheh wrote on 2015/08/17 14:13 -0700:
>> Hi Qu,
>>
>> Firstly thanks for the response. I have a few new questions and comments
>> below,
>>
>> On Mon, Aug 17, 2015 at 09:33:54AM +0800, Qu Wenruo wrote:
>>> Thanks for pointing out the problem.
>>> But it's already known and we didn't have a good idea to solve it yet.
>>>
>>> BTW, the old framework won't handle it well either.
>>> Liu Bo's testcase (btrfs/017) can easily trigger it.
>>> So it's not a regression.
>>
>> I don't understand your meaning here. btrfs/017 doesn't have anything
>> to do
>> with subvolume deletion. The regression I am talking about is that
>> deleting
>> a subvolume in btrfs on 4.2 breaks qgroup numbers, whereas it did not in
>> 4.1. Btw, I do acknowledge that 017 *was* broken before and fixed by your
>> patches (thanks) but that's of course been at the expense of
>> reintroducing
>> the subvol regression :(
> Oh, it seems that my memory went wrong about the test case.
>
> But I did remember old implement has something wrong with subvolume
> deletion too.
> (Or maybe I'm wrong again?)
>>
>>
>>> Let me explain the problem a little and then introduce the fixing plan.
>>>
>>> [Qgroup for subvolume delete]
>>> Here are two subvolumes, sharing the whole tree.
>>> The whole trees are consistent with 8 tree blocks.
>>> A B are the tree root of subv 257 and 258 respectively.
>>> C~H are all shared by the two subvolumes.
>>> And I is one data extent which is recorded in tree block H.
>>>
>>>
>>> subv 257(A)    subv258(B)
>>>     |     \       /    |
>>>     C                  D
>>>    / \                / \
>>>    E F                G H
>>>                         :
>>>                         :
>>>                         I
>>>
>>> Let's assume the tree block are all in 4K size(just for easy
>>> calculation), and I is in 16K size.
>>>
>>> Now qgroup info should be:
>>> 257: excl: 4K, rfer: 44K (6 shared tree block + 1 data + 1 tree root)
>>> 258: excl: 4K, rfer: 44K
>>>
>>> If we are going to delete subvolume 258, then expected qgroup should be:
>>> 257: excl 44K, rfer 44K
>>
>> Ok, this is basically explaining how we expect the numbers from a
>> subvolume
>> delete to work out. Everything there looks normal to me.
>>
>>
>>> Which means we need to recalculate all the reference relation of
>>> tree block C~H and data extent I.
>>>
>>> [[The real problem]]
>>> Now the real problem is, we need to mark *ALL* metadata and data
>>> extent of a subvolume dirty to make subvolume deletion work.
>>
>> I don't understand why we have to mark all nodes of a subvol. The
>> previous
>> implementation was able to get this correct without recalculating every
>> block in the tree. It only passed shared nodes and items through to the
>> qgroup accounting code. The accounting code in turn had very simple logic
>> for working it out - if the extent is now exclusively owned, we adjust
>> qgroup->excl for the (now only remaing) owner.
> Oh, you're right here.
> I'm stupid on this, as exclusive extent doesn't matter in the case.
>
> So no need to iterate the whole tree, but only shared extents.
> That's already a lot of extent we can skip.
Oh, I forgot one case involved with higher level qgroup.
            qg 1/0
            /    \
  subv 257(A)    subv258(B)
      |     \ /----    |
      C      D         J
     / \    / \ /------\
     E F    G H         L
              :
              :
              I

The picture is hard to see BTW...
Explain a little.
qgroup 257 and qgroup 258 are in the same higher level qgroup pool, 1/0
And this time, suvol 257 and subvol 258 only shares part of there tree.
         A (subvol 257)
          /     \
	C	D
        / \     / \
       E   F   G   H ..... I

         B (subvol 258)
          /     \
         C       J
        / \     / \
       E   F   G   L
Yes, one COW caused A->D->H route to be new B->J->L route in subvol 258.

And lets take a look at the qgroup numbers:
Still use 4K tree block and I is 16K.
qg 257: excl: 28K rfer: 44K
meta excl: A, D, H: 3 * 4 = 12K
data excl: I = 16K
meta rfer: all 7 tree blocks = 28K
data rfer: I = 16K

qg 258: excl: 12K rfer: 28K
meta excl: B, J, L: 3 * 4K = 12K
data excl: 0
meta rfer: all 7 tree blocks = 28K
data rfer: 0

qg: 1/0: excl 56K rfer: 56K
meta excl: A ~ H, J~L, 10 * 4K = 40K
data excl: I = 16K

Now, if only accounts shared nodes, which means only accounts
C, E, F, G tree blocks.
These 4 tree blocks will turn from shared to excl for subvol 257.
That's OK, as it won't cause anything wrong in qg 257.

But that will cause qg 1/0 wrong.
And B, J, L is now owned by nobody, excl of qg 1/0 should be reduced by 12K.
As they are not marked as dirty since they are not shared,
so the result will be:

qg 257: excl: 28 + 16 = 44K  rfer: 44K
qg 1/0: excl: 56K rfer: 56K
As related 4 tree blocks are still exel for qg 1/0.
Their change won't cause excl/rfer change.

So the result of qg 1/0 is wrong. Which should be excl: 44K: rfer 44K,
just the same as qg 257.

So, it is still needed to account all the nodes/leaves to ensure higher 
level qgroup get correct reading.


Sorry for not pointing this out in previous reply, as qgroup is quite 
complicated and I did forgot this case.

So, I'm afraid we need to change the fix to mark all extents in the 
subvolume to resolve the bug.

Thanks,
Qu

>
>>
>> If you go back to qgroup_subtree_accounting() it, you would see
>> *that it never* changes ->rfer on a qgroup. That is because the rfer
>> counts
>> on any related subvolumes don't change during subvol delete. Your
>> example above in
>> fact illustrates this.
>>
>> So to be clear,
>>
>> - Why do we have to visit all tree nodes with the qgroup code given that
>> we only cared about shared ones before"
>
> My fault, as I didn't jump out of the box and still using the idea of
> rescan whole tree to do it.
>
>>
>>
>>> Forgot to say, that's at transaction commit time, if we really do that,
>>> commit time consumption is not acceptable if the subvolume is huge.
>>
>> This too is confusing to me. Even if we assume that qgroups will get
>> every
>> node now instead of just the shared ones:
>>
>> - Drop subvol already visits every node (though it doesn't pass them
>> all to
>> the qgroup code). We were doing this before just fine to my knowledge
>> - what
>> changed? The commit-time qgroup accounting?
>
> Nothing, as even old implement does qgroup accounting at
> run_delayed_refs time, which is also done at commit_transaction.
>
> So that's just my meaningless worry.
>
>>
>> - Also, btrfs_drop_snapshot() throttles transaction commits (via
>> btrfs_end_transaction_throttle()), so anything done at commit time looks
>> like it would be broken up into smaller chunks of work for us.
>>
>> Am I mistaken in how I understand these functions?
>
> No, you're right.
> Overall my concern about commit-time qgroup accouting performance impact
> is meaningless now.
>
>>
>>
>> Also I must ask... How did you discover this performance issue? Are we
>> talking about something theoretical here or was your implementation
>> slow on
>> subvolume delete?
> Mainly theoretical.
> And theoretically, the new implement should be faster than old implement.
>
> But I'm not quite sure about the old implement is fast enough for
> subvolume deletion.
> Anyway, your fix seems no slower than old implement.
> So I'm completely OK with it now.
>
>>
>>
>>> [[Possible fix plan]]
>>> As you can see the biggest problem is the number of child
>>> metadata/data extents can be quite large, it's not possible to mark
>>> them all and account at commit transaction time.
>>
>> ... right so handling qgroups at commit time is performance sensitive in
>> that too many qgroup updates will cause the commit code to do a lot of
>> work.
>> I think that actually answers one of my questions above, though my
>> followup would be:
>>
>> Would I be correct if I said this is a problem for any workload that
>> creates
>> a large number of qgroup updates in a short period of time?
> I'm afraid that's true.
>
> The typical one should be file clone.
> And that may be one of the worst case.
>
> As file clone will increase data extent backref, even inside the same
> subvolume, it will cause qgroup accounting.
>
> And it won't cause transaction to be committed, so in one transaction it
> can do file clone a lot of times.
> If all clone are happen on one extent, that's OK as qgroup only need to
> do one accounting.
> (BTW, only implement will do a lot of accounting even cloning the same
> extent)
>
> But if someone is doing file clone on different extents, that will be a
> disaster.
>
> Unlike other normal write, file clone won't cause much dirty page, so
> transaction won't be triggered by dirty page threshold.
>
> So after a lot of file cloning on different extents, next transaction
> commit will cost a lot of time.
>
> But that's my theoretical assumption, it may be totally wrong just like
> what I thought about subvolume deletion.
>
>>
>>
>>> But the good news is, we can keep the snapshot and account them in
>>> several commits.
>>> The new extent-orient accounting is quite accurate as it happens at
>>> commit time.
>>
>> Btw, one thing I should say is that in general I really like your rewrite
>> grats on that - the code is far easier to read through and I like very
>> much
>> that we've taken some of the open-coded qgroup code out of our extent
>> inc/dec code.
>
> It's my pleasure. :)
>
>>
>>
>>> So one possible fix is, if we want to delete a subvolume, we put the
>>> subvolume into an orphan status, and let qgroup account to ignore
>>> that qgroup from then on, and mark some extent dirty in one
>>> transaction, and continue marking in next transaction until we mark
>>> all extents in that subvolume dirty.
>>> Then free the subvolume.
>>>
>>> That's the ideal method for both snapshot creation and deletion, but
>>> takes the most time.
>>
>> It also sounds like a disk format change which would really suck to tell
>> users to do just so they can get accurate qgroup numbers.
>
> With all your explain, I think your current fix is good enough, at least
> no worse than old implement.
>
> So I'm OK with it.
> As it fixes a regression without major performance drop.
>
> We can improve performance later if it's really needed.
>
>>
>>
>>> [Hot fix idea]
>>> So far, the best and easiest method I come up with is, if we found
>>> qgroup is enabled and the snapshot to delete is above level 1(level
>>> starts from 0), then mark the QGROUP_INCONSISTENT flag to info user
>>> to do a rescan.
>>
>> This is exactly the kind of band-aid solution we wanted to avoid the
>> first
>> time qgroups and subvolume handling were fixed.
>>     --Mark
> With your fix, the quick and dirty one is not needed anyway.
>
> Good job.
>
> Thanks,
> Qu
>>
>> --
>> Mark Fasheh
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Qu Wenruo Aug. 21, 2015, 12:30 a.m. UTC | #5
Hi Mark,

Any further comment on the reason why we should mark all nodes/leaves 
dirty during a subvolume deletion?

Thanks,
Qu

Qu Wenruo wrote on 2015/08/18 16:03 +0800:
>
>
> Qu Wenruo wrote on 2015/08/18 09:42 +0800:
>>
>>
>> Mark Fasheh wrote on 2015/08/17 14:13 -0700:
>>> Hi Qu,
>>>
>>> Firstly thanks for the response. I have a few new questions and comments
>>> below,
>>>
>>> On Mon, Aug 17, 2015 at 09:33:54AM +0800, Qu Wenruo wrote:
>>>> Thanks for pointing out the problem.
>>>> But it's already known and we didn't have a good idea to solve it yet.
>>>>
>>>> BTW, the old framework won't handle it well either.
>>>> Liu Bo's testcase (btrfs/017) can easily trigger it.
>>>> So it's not a regression.
>>>
>>> I don't understand your meaning here. btrfs/017 doesn't have anything
>>> to do
>>> with subvolume deletion. The regression I am talking about is that
>>> deleting
>>> a subvolume in btrfs on 4.2 breaks qgroup numbers, whereas it did not in
>>> 4.1. Btw, I do acknowledge that 017 *was* broken before and fixed by
>>> your
>>> patches (thanks) but that's of course been at the expense of
>>> reintroducing
>>> the subvol regression :(
>> Oh, it seems that my memory went wrong about the test case.
>>
>> But I did remember old implement has something wrong with subvolume
>> deletion too.
>> (Or maybe I'm wrong again?)
>>>
>>>
>>>> Let me explain the problem a little and then introduce the fixing plan.
>>>>
>>>> [Qgroup for subvolume delete]
>>>> Here are two subvolumes, sharing the whole tree.
>>>> The whole trees are consistent with 8 tree blocks.
>>>> A B are the tree root of subv 257 and 258 respectively.
>>>> C~H are all shared by the two subvolumes.
>>>> And I is one data extent which is recorded in tree block H.
>>>>
>>>>
>>>> subv 257(A)    subv258(B)
>>>>     |     \       /    |
>>>>     C                  D
>>>>    / \                / \
>>>>    E F                G H
>>>>                         :
>>>>                         :
>>>>                         I
>>>>
>>>> Let's assume the tree block are all in 4K size(just for easy
>>>> calculation), and I is in 16K size.
>>>>
>>>> Now qgroup info should be:
>>>> 257: excl: 4K, rfer: 44K (6 shared tree block + 1 data + 1 tree root)
>>>> 258: excl: 4K, rfer: 44K
>>>>
>>>> If we are going to delete subvolume 258, then expected qgroup should
>>>> be:
>>>> 257: excl 44K, rfer 44K
>>>
>>> Ok, this is basically explaining how we expect the numbers from a
>>> subvolume
>>> delete to work out. Everything there looks normal to me.
>>>
>>>
>>>> Which means we need to recalculate all the reference relation of
>>>> tree block C~H and data extent I.
>>>>
>>>> [[The real problem]]
>>>> Now the real problem is, we need to mark *ALL* metadata and data
>>>> extent of a subvolume dirty to make subvolume deletion work.
>>>
>>> I don't understand why we have to mark all nodes of a subvol. The
>>> previous
>>> implementation was able to get this correct without recalculating every
>>> block in the tree. It only passed shared nodes and items through to the
>>> qgroup accounting code. The accounting code in turn had very simple
>>> logic
>>> for working it out - if the extent is now exclusively owned, we adjust
>>> qgroup->excl for the (now only remaing) owner.
>> Oh, you're right here.
>> I'm stupid on this, as exclusive extent doesn't matter in the case.
>>
>> So no need to iterate the whole tree, but only shared extents.
>> That's already a lot of extent we can skip.
> Oh, I forgot one case involved with higher level qgroup.
>             qg 1/0
>             /    \
>   subv 257(A)    subv258(B)
>       |     \ /----    |
>       C      D         J
>      / \    / \ /------\
>      E F    G H         L
>               :
>               :
>               I
>
> The picture is hard to see BTW...
> Explain a little.
> qgroup 257 and qgroup 258 are in the same higher level qgroup pool, 1/0
> And this time, suvol 257 and subvol 258 only shares part of there tree.
>          A (subvol 257)
>           /     \
>      C    D
>         / \     / \
>        E   F   G   H ..... I
>
>          B (subvol 258)
>           /     \
>          C       J
>         / \     / \
>        E   F   G   L
> Yes, one COW caused A->D->H route to be new B->J->L route in subvol 258.
>
> And lets take a look at the qgroup numbers:
> Still use 4K tree block and I is 16K.
> qg 257: excl: 28K rfer: 44K
> meta excl: A, D, H: 3 * 4 = 12K
> data excl: I = 16K
> meta rfer: all 7 tree blocks = 28K
> data rfer: I = 16K
>
> qg 258: excl: 12K rfer: 28K
> meta excl: B, J, L: 3 * 4K = 12K
> data excl: 0
> meta rfer: all 7 tree blocks = 28K
> data rfer: 0
>
> qg: 1/0: excl 56K rfer: 56K
> meta excl: A ~ H, J~L, 10 * 4K = 40K
> data excl: I = 16K
>
> Now, if only accounts shared nodes, which means only accounts
> C, E, F, G tree blocks.
> These 4 tree blocks will turn from shared to excl for subvol 257.
> That's OK, as it won't cause anything wrong in qg 257.
>
> But that will cause qg 1/0 wrong.
> And B, J, L is now owned by nobody, excl of qg 1/0 should be reduced by
> 12K.
> As they are not marked as dirty since they are not shared,
> so the result will be:
>
> qg 257: excl: 28 + 16 = 44K  rfer: 44K
> qg 1/0: excl: 56K rfer: 56K
> As related 4 tree blocks are still exel for qg 1/0.
> Their change won't cause excl/rfer change.
>
> So the result of qg 1/0 is wrong. Which should be excl: 44K: rfer 44K,
> just the same as qg 257.
>
> So, it is still needed to account all the nodes/leaves to ensure higher
> level qgroup get correct reading.
>
>
> Sorry for not pointing this out in previous reply, as qgroup is quite
> complicated and I did forgot this case.
>
> So, I'm afraid we need to change the fix to mark all extents in the
> subvolume to resolve the bug.
>
> Thanks,
> Qu
>
>>
>>>
>>> If you go back to qgroup_subtree_accounting() it, you would see
>>> *that it never* changes ->rfer on a qgroup. That is because the rfer
>>> counts
>>> on any related subvolumes don't change during subvol delete. Your
>>> example above in
>>> fact illustrates this.
>>>
>>> So to be clear,
>>>
>>> - Why do we have to visit all tree nodes with the qgroup code given that
>>> we only cared about shared ones before"
>>
>> My fault, as I didn't jump out of the box and still using the idea of
>> rescan whole tree to do it.
>>
>>>
>>>
>>>> Forgot to say, that's at transaction commit time, if we really do that,
>>>> commit time consumption is not acceptable if the subvolume is huge.
>>>
>>> This too is confusing to me. Even if we assume that qgroups will get
>>> every
>>> node now instead of just the shared ones:
>>>
>>> - Drop subvol already visits every node (though it doesn't pass them
>>> all to
>>> the qgroup code). We were doing this before just fine to my knowledge
>>> - what
>>> changed? The commit-time qgroup accounting?
>>
>> Nothing, as even old implement does qgroup accounting at
>> run_delayed_refs time, which is also done at commit_transaction.
>>
>> So that's just my meaningless worry.
>>
>>>
>>> - Also, btrfs_drop_snapshot() throttles transaction commits (via
>>> btrfs_end_transaction_throttle()), so anything done at commit time looks
>>> like it would be broken up into smaller chunks of work for us.
>>>
>>> Am I mistaken in how I understand these functions?
>>
>> No, you're right.
>> Overall my concern about commit-time qgroup accouting performance impact
>> is meaningless now.
>>
>>>
>>>
>>> Also I must ask... How did you discover this performance issue? Are we
>>> talking about something theoretical here or was your implementation
>>> slow on
>>> subvolume delete?
>> Mainly theoretical.
>> And theoretically, the new implement should be faster than old implement.
>>
>> But I'm not quite sure about the old implement is fast enough for
>> subvolume deletion.
>> Anyway, your fix seems no slower than old implement.
>> So I'm completely OK with it now.
>>
>>>
>>>
>>>> [[Possible fix plan]]
>>>> As you can see the biggest problem is the number of child
>>>> metadata/data extents can be quite large, it's not possible to mark
>>>> them all and account at commit transaction time.
>>>
>>> ... right so handling qgroups at commit time is performance sensitive in
>>> that too many qgroup updates will cause the commit code to do a lot of
>>> work.
>>> I think that actually answers one of my questions above, though my
>>> followup would be:
>>>
>>> Would I be correct if I said this is a problem for any workload that
>>> creates
>>> a large number of qgroup updates in a short period of time?
>> I'm afraid that's true.
>>
>> The typical one should be file clone.
>> And that may be one of the worst case.
>>
>> As file clone will increase data extent backref, even inside the same
>> subvolume, it will cause qgroup accounting.
>>
>> And it won't cause transaction to be committed, so in one transaction it
>> can do file clone a lot of times.
>> If all clone are happen on one extent, that's OK as qgroup only need to
>> do one accounting.
>> (BTW, only implement will do a lot of accounting even cloning the same
>> extent)
>>
>> But if someone is doing file clone on different extents, that will be a
>> disaster.
>>
>> Unlike other normal write, file clone won't cause much dirty page, so
>> transaction won't be triggered by dirty page threshold.
>>
>> So after a lot of file cloning on different extents, next transaction
>> commit will cost a lot of time.
>>
>> But that's my theoretical assumption, it may be totally wrong just like
>> what I thought about subvolume deletion.
>>
>>>
>>>
>>>> But the good news is, we can keep the snapshot and account them in
>>>> several commits.
>>>> The new extent-orient accounting is quite accurate as it happens at
>>>> commit time.
>>>
>>> Btw, one thing I should say is that in general I really like your
>>> rewrite
>>> grats on that - the code is far easier to read through and I like very
>>> much
>>> that we've taken some of the open-coded qgroup code out of our extent
>>> inc/dec code.
>>
>> It's my pleasure. :)
>>
>>>
>>>
>>>> So one possible fix is, if we want to delete a subvolume, we put the
>>>> subvolume into an orphan status, and let qgroup account to ignore
>>>> that qgroup from then on, and mark some extent dirty in one
>>>> transaction, and continue marking in next transaction until we mark
>>>> all extents in that subvolume dirty.
>>>> Then free the subvolume.
>>>>
>>>> That's the ideal method for both snapshot creation and deletion, but
>>>> takes the most time.
>>>
>>> It also sounds like a disk format change which would really suck to tell
>>> users to do just so they can get accurate qgroup numbers.
>>
>> With all your explain, I think your current fix is good enough, at least
>> no worse than old implement.
>>
>> So I'm OK with it.
>> As it fixes a regression without major performance drop.
>>
>> We can improve performance later if it's really needed.
>>
>>>
>>>
>>>> [Hot fix idea]
>>>> So far, the best and easiest method I come up with is, if we found
>>>> qgroup is enabled and the snapshot to delete is above level 1(level
>>>> starts from 0), then mark the QGROUP_INCONSISTENT flag to info user
>>>> to do a rescan.
>>>
>>> This is exactly the kind of band-aid solution we wanted to avoid the
>>> first
>>> time qgroups and subvolume handling were fixed.
>>>     --Mark
>> With your fix, the quick and dirty one is not needed anyway.
>>
>> Good job.
>>
>> Thanks,
>> Qu
>>>
>>> --
>>> Mark Fasheh
>>>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Fasheh Aug. 21, 2015, 11:31 p.m. UTC | #6
On Fri, Aug 21, 2015 at 08:30:06AM +0800, Qu Wenruo wrote:
> Hi Mark,
> 
> Any further comment on the reason why we should mark all
> nodes/leaves dirty during a subvolume deletion?

No, I see what you're talking about so we'll just ahve to mark all nodes on
the way down the tree, not really a huge change from marking the shared
ones IMHO.

I'm currently trying to figure out how to get our deleted roots in the list
gathered by btrfs_qgroup_prepare_account_extents(). From what I can tell, at
some point btrfs_find_all_roots() will no longer be able to find the
subvolume being dropped, which results in an old_roots count that is off by
at least one. I'm pretty confident that if we can get the right number of
roots on our old_roots pointer, the math in qgroup_update_counters will work
out for us, for all levels of qgroups.
	--Mark


> 
> Thanks,
> Qu
> 
> Qu Wenruo wrote on 2015/08/18 16:03 +0800:
> >
> >
> >Qu Wenruo wrote on 2015/08/18 09:42 +0800:
> >>
> >>
> >>Mark Fasheh wrote on 2015/08/17 14:13 -0700:
> >>>Hi Qu,
> >>>
> >>>Firstly thanks for the response. I have a few new questions and comments
> >>>below,
> >>>
> >>>On Mon, Aug 17, 2015 at 09:33:54AM +0800, Qu Wenruo wrote:
> >>>>Thanks for pointing out the problem.
> >>>>But it's already known and we didn't have a good idea to solve it yet.
> >>>>
> >>>>BTW, the old framework won't handle it well either.
> >>>>Liu Bo's testcase (btrfs/017) can easily trigger it.
> >>>>So it's not a regression.
> >>>
> >>>I don't understand your meaning here. btrfs/017 doesn't have anything
> >>>to do
> >>>with subvolume deletion. The regression I am talking about is that
> >>>deleting
> >>>a subvolume in btrfs on 4.2 breaks qgroup numbers, whereas it did not in
> >>>4.1. Btw, I do acknowledge that 017 *was* broken before and fixed by
> >>>your
> >>>patches (thanks) but that's of course been at the expense of
> >>>reintroducing
> >>>the subvol regression :(
> >>Oh, it seems that my memory went wrong about the test case.
> >>
> >>But I did remember old implement has something wrong with subvolume
> >>deletion too.
> >>(Or maybe I'm wrong again?)
> >>>
> >>>
> >>>>Let me explain the problem a little and then introduce the fixing plan.
> >>>>
> >>>>[Qgroup for subvolume delete]
> >>>>Here are two subvolumes, sharing the whole tree.
> >>>>The whole trees are consistent with 8 tree blocks.
> >>>>A B are the tree root of subv 257 and 258 respectively.
> >>>>C~H are all shared by the two subvolumes.
> >>>>And I is one data extent which is recorded in tree block H.
> >>>>
> >>>>
> >>>>subv 257(A)    subv258(B)
> >>>>    |     \       /    |
> >>>>    C                  D
> >>>>   / \                / \
> >>>>   E F                G H
> >>>>                        :
> >>>>                        :
> >>>>                        I
> >>>>
> >>>>Let's assume the tree block are all in 4K size(just for easy
> >>>>calculation), and I is in 16K size.
> >>>>
> >>>>Now qgroup info should be:
> >>>>257: excl: 4K, rfer: 44K (6 shared tree block + 1 data + 1 tree root)
> >>>>258: excl: 4K, rfer: 44K
> >>>>
> >>>>If we are going to delete subvolume 258, then expected qgroup should
> >>>>be:
> >>>>257: excl 44K, rfer 44K
> >>>
> >>>Ok, this is basically explaining how we expect the numbers from a
> >>>subvolume
> >>>delete to work out. Everything there looks normal to me.
> >>>
> >>>
> >>>>Which means we need to recalculate all the reference relation of
> >>>>tree block C~H and data extent I.
> >>>>
> >>>>[[The real problem]]
> >>>>Now the real problem is, we need to mark *ALL* metadata and data
> >>>>extent of a subvolume dirty to make subvolume deletion work.
> >>>
> >>>I don't understand why we have to mark all nodes of a subvol. The
> >>>previous
> >>>implementation was able to get this correct without recalculating every
> >>>block in the tree. It only passed shared nodes and items through to the
> >>>qgroup accounting code. The accounting code in turn had very simple
> >>>logic
> >>>for working it out - if the extent is now exclusively owned, we adjust
> >>>qgroup->excl for the (now only remaing) owner.
> >>Oh, you're right here.
> >>I'm stupid on this, as exclusive extent doesn't matter in the case.
> >>
> >>So no need to iterate the whole tree, but only shared extents.
> >>That's already a lot of extent we can skip.
> >Oh, I forgot one case involved with higher level qgroup.
> >            qg 1/0
> >            /    \
> >  subv 257(A)    subv258(B)
> >      |     \ /----    |
> >      C      D         J
> >     / \    / \ /------\
> >     E F    G H         L
> >              :
> >              :
> >              I
> >
> >The picture is hard to see BTW...
> >Explain a little.
> >qgroup 257 and qgroup 258 are in the same higher level qgroup pool, 1/0
> >And this time, suvol 257 and subvol 258 only shares part of there tree.
> >         A (subvol 257)
> >          /     \
> >     C    D
> >        / \     / \
> >       E   F   G   H ..... I
> >
> >         B (subvol 258)
> >          /     \
> >         C       J
> >        / \     / \
> >       E   F   G   L
> >Yes, one COW caused A->D->H route to be new B->J->L route in subvol 258.
> >
> >And lets take a look at the qgroup numbers:
> >Still use 4K tree block and I is 16K.
> >qg 257: excl: 28K rfer: 44K
> >meta excl: A, D, H: 3 * 4 = 12K
> >data excl: I = 16K
> >meta rfer: all 7 tree blocks = 28K
> >data rfer: I = 16K
> >
> >qg 258: excl: 12K rfer: 28K
> >meta excl: B, J, L: 3 * 4K = 12K
> >data excl: 0
> >meta rfer: all 7 tree blocks = 28K
> >data rfer: 0
> >
> >qg: 1/0: excl 56K rfer: 56K
> >meta excl: A ~ H, J~L, 10 * 4K = 40K
> >data excl: I = 16K
> >
> >Now, if only accounts shared nodes, which means only accounts
> >C, E, F, G tree blocks.
> >These 4 tree blocks will turn from shared to excl for subvol 257.
> >That's OK, as it won't cause anything wrong in qg 257.
> >
> >But that will cause qg 1/0 wrong.
> >And B, J, L is now owned by nobody, excl of qg 1/0 should be reduced by
> >12K.
> >As they are not marked as dirty since they are not shared,
> >so the result will be:
> >
> >qg 257: excl: 28 + 16 = 44K  rfer: 44K
> >qg 1/0: excl: 56K rfer: 56K
> >As related 4 tree blocks are still exel for qg 1/0.
> >Their change won't cause excl/rfer change.
> >
> >So the result of qg 1/0 is wrong. Which should be excl: 44K: rfer 44K,
> >just the same as qg 257.
> >
> >So, it is still needed to account all the nodes/leaves to ensure higher
> >level qgroup get correct reading.
> >
> >
> >Sorry for not pointing this out in previous reply, as qgroup is quite
> >complicated and I did forgot this case.
> >
> >So, I'm afraid we need to change the fix to mark all extents in the
> >subvolume to resolve the bug.
> >
> >Thanks,
> >Qu
> >
> >>
> >>>
> >>>If you go back to qgroup_subtree_accounting() it, you would see
> >>>*that it never* changes ->rfer on a qgroup. That is because the rfer
> >>>counts
> >>>on any related subvolumes don't change during subvol delete. Your
> >>>example above in
> >>>fact illustrates this.
> >>>
> >>>So to be clear,
> >>>
> >>>- Why do we have to visit all tree nodes with the qgroup code given that
> >>>we only cared about shared ones before"
> >>
> >>My fault, as I didn't jump out of the box and still using the idea of
> >>rescan whole tree to do it.
> >>
> >>>
> >>>
> >>>>Forgot to say, that's at transaction commit time, if we really do that,
> >>>>commit time consumption is not acceptable if the subvolume is huge.
> >>>
> >>>This too is confusing to me. Even if we assume that qgroups will get
> >>>every
> >>>node now instead of just the shared ones:
> >>>
> >>>- Drop subvol already visits every node (though it doesn't pass them
> >>>all to
> >>>the qgroup code). We were doing this before just fine to my knowledge
> >>>- what
> >>>changed? The commit-time qgroup accounting?
> >>
> >>Nothing, as even old implement does qgroup accounting at
> >>run_delayed_refs time, which is also done at commit_transaction.
> >>
> >>So that's just my meaningless worry.
> >>
> >>>
> >>>- Also, btrfs_drop_snapshot() throttles transaction commits (via
> >>>btrfs_end_transaction_throttle()), so anything done at commit time looks
> >>>like it would be broken up into smaller chunks of work for us.
> >>>
> >>>Am I mistaken in how I understand these functions?
> >>
> >>No, you're right.
> >>Overall my concern about commit-time qgroup accouting performance impact
> >>is meaningless now.
> >>
> >>>
> >>>
> >>>Also I must ask... How did you discover this performance issue? Are we
> >>>talking about something theoretical here or was your implementation
> >>>slow on
> >>>subvolume delete?
> >>Mainly theoretical.
> >>And theoretically, the new implement should be faster than old implement.
> >>
> >>But I'm not quite sure about the old implement is fast enough for
> >>subvolume deletion.
> >>Anyway, your fix seems no slower than old implement.
> >>So I'm completely OK with it now.
> >>
> >>>
> >>>
> >>>>[[Possible fix plan]]
> >>>>As you can see the biggest problem is the number of child
> >>>>metadata/data extents can be quite large, it's not possible to mark
> >>>>them all and account at commit transaction time.
> >>>
> >>>... right so handling qgroups at commit time is performance sensitive in
> >>>that too many qgroup updates will cause the commit code to do a lot of
> >>>work.
> >>>I think that actually answers one of my questions above, though my
> >>>followup would be:
> >>>
> >>>Would I be correct if I said this is a problem for any workload that
> >>>creates
> >>>a large number of qgroup updates in a short period of time?
> >>I'm afraid that's true.
> >>
> >>The typical one should be file clone.
> >>And that may be one of the worst case.
> >>
> >>As file clone will increase data extent backref, even inside the same
> >>subvolume, it will cause qgroup accounting.
> >>
> >>And it won't cause transaction to be committed, so in one transaction it
> >>can do file clone a lot of times.
> >>If all clone are happen on one extent, that's OK as qgroup only need to
> >>do one accounting.
> >>(BTW, only implement will do a lot of accounting even cloning the same
> >>extent)
> >>
> >>But if someone is doing file clone on different extents, that will be a
> >>disaster.
> >>
> >>Unlike other normal write, file clone won't cause much dirty page, so
> >>transaction won't be triggered by dirty page threshold.
> >>
> >>So after a lot of file cloning on different extents, next transaction
> >>commit will cost a lot of time.
> >>
> >>But that's my theoretical assumption, it may be totally wrong just like
> >>what I thought about subvolume deletion.
> >>
> >>>
> >>>
> >>>>But the good news is, we can keep the snapshot and account them in
> >>>>several commits.
> >>>>The new extent-orient accounting is quite accurate as it happens at
> >>>>commit time.
> >>>
> >>>Btw, one thing I should say is that in general I really like your
> >>>rewrite
> >>>grats on that - the code is far easier to read through and I like very
> >>>much
> >>>that we've taken some of the open-coded qgroup code out of our extent
> >>>inc/dec code.
> >>
> >>It's my pleasure. :)
> >>
> >>>
> >>>
> >>>>So one possible fix is, if we want to delete a subvolume, we put the
> >>>>subvolume into an orphan status, and let qgroup account to ignore
> >>>>that qgroup from then on, and mark some extent dirty in one
> >>>>transaction, and continue marking in next transaction until we mark
> >>>>all extents in that subvolume dirty.
> >>>>Then free the subvolume.
> >>>>
> >>>>That's the ideal method for both snapshot creation and deletion, but
> >>>>takes the most time.
> >>>
> >>>It also sounds like a disk format change which would really suck to tell
> >>>users to do just so they can get accurate qgroup numbers.
> >>
> >>With all your explain, I think your current fix is good enough, at least
> >>no worse than old implement.
> >>
> >>So I'm OK with it.
> >>As it fixes a regression without major performance drop.
> >>
> >>We can improve performance later if it's really needed.
> >>
> >>>
> >>>
> >>>>[Hot fix idea]
> >>>>So far, the best and easiest method I come up with is, if we found
> >>>>qgroup is enabled and the snapshot to delete is above level 1(level
> >>>>starts from 0), then mark the QGROUP_INCONSISTENT flag to info user
> >>>>to do a rescan.
> >>>
> >>>This is exactly the kind of band-aid solution we wanted to avoid the
> >>>first
> >>>time qgroups and subvolume handling were fixed.
> >>>    --Mark
> >>With your fix, the quick and dirty one is not needed anyway.
> >>
> >>Good job.
> >>
> >>Thanks,
> >>Qu
> >>>
> >>>--
> >>>Mark Fasheh
> >>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
Mark Fasheh
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -486,7 +486,7 @@  add_delayed_ref_head(struct btrfs_fs_info *fs_info,
 		qexisting = btrfs_qgroup_insert_dirty_extent(delayed_refs,
 							     qrecord);
 		if (qexisting)
-			kfree(qrecord);
+			btrfs_qgroup_free_extent_record(qrecord);
 	}
 
 	spin_lock_init(&head_ref->lock);
@@ -654,7 +654,7 @@  int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 		goto free_ref;
 
 	if (fs_info->quota_enabled && is_fstree(ref_root)) {
-		record = kmalloc(sizeof(*record), GFP_NOFS);
+		record = btrfs_qgroup_alloc_extent_record();
 		if (!record)
 			goto free_head_ref;
 	}
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 07204bf..ab81135 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7756,18 +7756,31 @@  reada:
 	wc->reada_slot = slot;
 }
 
-/*
- * TODO: Modify related function to add related node/leaf to dirty_extent_root,
- * for later qgroup accounting.
- *
- * Current, this function does nothing.
- */
+static int record_one_item(struct btrfs_trans_handle *trans, u64 bytenr,
+			   u64 num_bytes)
+{
+	struct btrfs_qgroup_extent_record *qrecord = btrfs_qgroup_alloc_extent_record();
+	struct btrfs_delayed_ref_root *delayed_refs = &trans->transaction->delayed_refs;
+
+	if (!qrecord)
+		return -ENOMEM;
+
+	qrecord->bytenr = bytenr;
+	qrecord->num_bytes = num_bytes;
+	qrecord->old_roots = NULL;
+
+	if (btrfs_qgroup_insert_dirty_extent(delayed_refs, qrecord))
+		btrfs_qgroup_free_extent_record(qrecord);
+
+	return 0;
+}
+
 static int account_leaf_items(struct btrfs_trans_handle *trans,
 			      struct btrfs_root *root,
 			      struct extent_buffer *eb)
 {
 	int nr = btrfs_header_nritems(eb);
-	int i, extent_type;
+	int i, extent_type, ret;
 	struct btrfs_key key;
 	struct btrfs_file_extent_item *fi;
 	u64 bytenr, num_bytes;
@@ -7790,6 +7803,10 @@  static int account_leaf_items(struct btrfs_trans_handle *trans,
 			continue;
 
 		num_bytes = btrfs_file_extent_disk_num_bytes(eb, fi);
+
+		ret = record_one_item(trans, bytenr, num_bytes);
+		if (ret)
+			return ret;
 	}
 	return 0;
 }
@@ -7858,8 +7875,6 @@  static int adjust_slots_upwards(struct btrfs_root *root,
 
 /*
  * root_eb is the subtree root and is locked before this function is called.
- * TODO: Modify this function to mark all (including complete shared node)
- * to dirty_extent_root to allow it get accounted in qgroup.
  */
 static int account_shared_subtree(struct btrfs_trans_handle *trans,
 				  struct btrfs_root *root,
@@ -7937,6 +7952,11 @@  walk_down:
 			btrfs_tree_read_lock(eb);
 			btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
 			path->locks[level] = BTRFS_READ_LOCK_BLOCKING;
+
+			ret = record_one_item(trans, child_bytenr,
+					      root->nodesize);
+			if (ret)
+				goto out;
 		}
 
 		if (level == 0) {
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 8a82029..27ec405 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1452,6 +1452,17 @@  int btrfs_qgroup_prepare_account_extents(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+struct btrfs_qgroup_extent_record *btrfs_qgroup_alloc_extent_record(void)
+{
+	return kmalloc(sizeof(struct btrfs_qgroup_extent_record), GFP_NOFS);
+}
+
+void btrfs_qgroup_free_extent_record(struct btrfs_qgroup_extent_record *record)
+{
+	if (record)
+		kfree(record);
+}
+
 struct btrfs_qgroup_extent_record
 *btrfs_qgroup_insert_dirty_extent(struct btrfs_delayed_ref_root *delayed_refs,
 				  struct btrfs_qgroup_extent_record *record)
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 6387dcf..dfba663 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -59,6 +59,9 @@  int btrfs_qgroup_prepare_account_extents(struct btrfs_trans_handle *trans,
 struct btrfs_qgroup_extent_record
 *btrfs_qgroup_insert_dirty_extent(struct btrfs_delayed_ref_root *delayed_refs,
 				  struct btrfs_qgroup_extent_record *record);
+struct btrfs_qgroup_extent_record
+*btrfs_qgroup_alloc_extent_record(void);
+void btrfs_qgroup_free_extent_record(struct btrfs_qgroup_extent_record *record);
 int
 btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans,
 			    struct btrfs_fs_info *fs_info,