
[v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot

Message ID 1460711302-2478-1-git-send-email-quwenruo@cn.fujitsu.com (mailing list archive)
State Superseded

Commit Message

Qu Wenruo April 15, 2016, 9:08 a.m. UTC
Current btrfs qgroup design implies a requirement that after calling
btrfs_qgroup_account_extents() there must be a commit root switch.

Normally this is OK, as btrfs_qgroup_account_extents() is only called
inside btrfs_commit_transaction(), just before commit_cowonly_roots().

However there is an exception at create_pending_snapshot(), which calls
btrfs_qgroup_account_extents() without any commit root switch.

When creating a snapshot whose parent root is itself (i.e. a snapshot of
the fs tree), qgroup numbers get corrupted, as shown by the following trace:
(skipped unrelated data)
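
The mismatch can be reproduced with a short shell sequence along the
following lines (a minimal sketch; the device, mount point and file names
are placeholders):

    # Fresh filesystem with quotas enabled.
    mkfs.btrfs -f /dev/vdb
    mount /dev/vdb /mnt
    btrfs quota enable /mnt

    # Put some data into the fs tree (qgroup 0/5).
    dd if=/dev/zero of=/mnt/file bs=1M count=4
    sync

    # Snapshot the fs tree into itself, the case described above.
    btrfs subvolume snapshot /mnt /mnt/snap1
    sync

    # Numbers as accounted at snapshot creation time.
    btrfs qgroup show /mnt

    # A rescan recomputes the numbers from scratch; with the bug present,
    # rfer/excl of qgroup 0/5 may differ from the values shown above.
    btrfs quota rescan -w /mnt
    btrfs qgroup show /mnt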

Comments

sri April 15, 2016, 12:41 p.m. UTC | #1
Hi,

I have a couple of queries related to btrfs-image, btrfs send, and the
combination of the two.
1)
I would like to know: if a btrfs source file system is spread across more
than one disk, does btrfs-image require the same number of disks to create
the empty file system without file contents?


2) Could btrfs-image be modified to keep only a given subvolume's footprint
and related metadata, so that the file system can be brought back live on a
destination disk?

   To elaborate on this: let's say I have 5 subvolumes on the source btrfs
and I run btrfs-image with the output written to a destination disk, say
/dev/sdd. In this process, can btrfs-image be modified to keep only 1
subvolume, skip the other 4, and write to the destination (i.e. /dev/sdd),
so that when I mount /dev/sdd I will have a btrfs with only 1 subvolume and
no data?

3) If (2) can be done successfully, could btrfs-image be further changed to
include the data of the selected subvolume, so that file data is also
written to /dev/sdd? That would effectively be a backup of one subvolume
taken out of a btrfs file system that has more than one subvolume.



Hugo Mills April 15, 2016, 12:49 p.m. UTC | #2
On Fri, Apr 15, 2016 at 12:41:36PM +0000, sri wrote:
> Hi,
> 
> I have couple of queries related to btrfs-image, btrfs send and with 
> combination of two.
> 1)
> I would like to know if a btrfs source file system is spread across more 
> than 1 disks, does btrfs-image require same number of disks to create 
> empty file system without files content??

   I don't _think_ you need as many devices as there were originally.

> 2) would btrfs-image can be modified to keep only given subvolume foot 
> print and related meta data to bring back file system live on destination 
> disk?
> 
>    To elaborate more on this, Lets say I have 5 subvolumes on source btrfs 
> and i run btrfs-image written to destination disk say /dev/sdd. In this 
> process, can btrfs-image modified to just have only 1 subvolume and skipp 
> other 4 subvolumes and write to destination i.. /dev/sdd so that when I 
> mount /dev/sdd , I will have btrfs with only 1 subvolume with no data.

   For a first approximation, you could just drop any FS tree from the
image which wasn't the target one. After that, it turns into a
complicated accounting exercise to drop all of the back-refs to the
missing FS trees, and to drop all the extent records for the
non-shared data and the metadata for the missing FS trees.

   It's probably going to be complicated, and will basically involve
rewriting most of the image to avoid the metadata you didn't want.

> 3) If 3 can successful, can btrfs-image further changed to include data of 
> selected subvolume which gives files data also written to /dev/sdd which 
> would be kind of a backup of a subvolume taken out of a btrfs file system 
> which is having more than 1 subvolumes.

   If you're going to do all the hard work of (2), then (3) is a
reasonable logical(?) extension.

   On the other hand, what's wrong with simply using send/receive? It
gives you a data structure (a FAR-format send stream) which contains
everything you need to reconstruct a subvolume on a btrfs different
to the original.

   Hugo.
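
A minimal send/receive round trip of the kind Hugo describes looks roughly
like this (a sketch; paths and the snapshot name are illustrative, and the
source snapshot must be read-only):

    # Take a read-only snapshot of the subvolume to back up.
    btrfs subvolume snapshot -r /srcfs/subvol1 /srcfs/subvol1-ro

    # Either keep the send stream as a file...
    btrfs send /srcfs/subvol1-ro > /backup/subvol1.send

    # ...or pipe it straight into another (different) btrfs filesystem.
    btrfs send /srcfs/subvol1-ro | btrfs receive /dstfs/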
Henk Slager April 15, 2016, 4:21 p.m. UTC | #3
On Fri, Apr 15, 2016 at 2:49 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Fri, Apr 15, 2016 at 12:41:36PM +0000, sri wrote:
>> Hi,
>>
>> I have couple of queries related to btrfs-image, btrfs send and with
>> combination of two.
>> 1)
>> I would like to know if a btrfs source file system is spread across more
>> than 1 disks, does btrfs-image require same number of disks to create
>> empty file system without files content??
>
>    I don't _think_ you need as many devices as there were originally.

Indeed, if you run btrfs-image -r on a dump from a multi-device fs, you
get 1 big fs image. I once did that for a 4x4TB RAID10 fs (tools v4.3.x at
that time, I believe), resulting in a 17TB (sparse) file. I was expecting
that option -m would create multiple files; however, scanning the source
code revealed that there are things not implemented (it resulted in a 34T
file that was simply not a valid fs). Either I did something wrong or there
is a bug.

For my case, it was much quicker to patch the kernel so that it worked
with the 17T file. There are then still issues w.r.t. devices, but since
the data is missing, only a limited set of tool actions or issues can be
researched with such a generated image anyway. Still, for a multi-TB fs the
resulting data volume is acceptable (roughly on the order of 1G or 10G).

I think it would make sense for the btrfs-image restore output to be
splittable into multiple files, so that the multi-device aspects are better
represented (or modelled).
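
For reference, the dump/restore steps being discussed look roughly like the
following (a sketch with illustrative device and file names; the source
filesystem should be unmounted, and only metadata ends up in the image):

    # Dump the metadata of a (possibly multi-device) filesystem; any one
    # member device can be given as the source.
    btrfs-image -c 9 /dev/sda /tmp/fs-meta.img

    # Restore the metadata onto a single target; the result is one big
    # filesystem image regardless of how many devices the source had.
    btrfs-image -r /tmp/fs-meta.img /tmp/restored.img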

>> 2) would btrfs-image can be modified to keep only given subvolume foot
>> print and related meta data to bring back file system live on destination
>> disk?
>>
>>    To elaborate more on this, Lets say I have 5 subvolumes on source btrfs
>> and i run btrfs-image written to destination disk say /dev/sdd. In this
>> process, can btrfs-image modified to just have only 1 subvolume and skipp
>> other 4 subvolumes and write to destination i.. /dev/sdd so that when I
>> mount /dev/sdd , I will have btrfs with only 1 subvolume with no data.
>
>    For a first approximation, you could just drop any FS tree from the
> image which wasn't the target one. After that, it turns into a
> complicated accounting exercise to drop all of the back-refs to the
> missing FS trees, and to drop all the extent records for the
> non-shared data and the metadata for the missing FS trees.
>
>    It's probably going to be complicated, and will basically involve
> rewriting most of the image to avoid the metadata you didn't want.
>
>> 3) If 3 can successful, can btrfs-image further changed to include data of
>> selected subvolume which gives files data also written to /dev/sdd which
>> would be kind of a backup of a subvolume taken out of a btrfs file system
>> which is having more than 1 subvolumes.
>
>    If you're going to do all the hard work of (2), then (3) is a
> reasonable logical(?) extension.
>
>    On the other hand, what's wrong with simply using send/receive? It
> gives you a data structure (a FAR-format send stream) which contains
> everything you need to reconstruct a subvolume on a btrfs different
> to the original.
>
>    Hugo.
>
> --
> Hugo Mills             | Mary had a little lamb
> hugo@... carfax.org.uk | You've heard this tale before
> http://carfax.org.uk/  | But did you know she passed her plate
> PGP: E2AB1DE4          | And had a little more?
sri April 18, 2016, 7:24 a.m. UTC | #4
I tried btrfs-image, created an image file, and ran btrfs-image -r to a
different disk. Once recovered and mounted, I can see that the data is
not zeroed out as the btrfs-image man page says it should be.

I tried this on the same machine.

Henk Slager April 18, 2016, 2:13 p.m. UTC | #5
(your email keeps ending up in gmail spam folder)

On Mon, Apr 18, 2016 at 9:24 AM, sri <toyours_sridhar@yahoo.co.in> wrote:
> I tried btrfs-image and created image file and ran btrfs-image -r to a
> different disk. Once recovered and mounted, I can able to see data is
> not zeroed out as mentioned in btrfs-image man page.

"different disk"  you mention, that is important info. If you doe the
restore to a image file, that image file is sparse and all data blocks
are read as zeros.

However, if you restore to a block device, then you can assume it just
writes the device blocks for metadata and leaves the rest untouched.
So trim whole device first or brute-force overwrite completely with
zeros.

So maybe the man pages needs some correction / extra notes.

> I tried on same machine.
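
What that suggestion amounts to, roughly (a sketch; the device name and
image path are placeholders, and blkdiscard only helps on storage that
supports discard):

    # Discard the whole target device first...
    blkdiscard /dev/sdX

    # ...or, on devices without discard support, overwrite it with zeros.
    dd if=/dev/zero of=/dev/sdX bs=1M

    # Then restore the metadata-only image onto the clean device.
    btrfs-image -r /path/to/fs-meta.img /dev/sdX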
Roman Mamedov April 18, 2016, 2:26 p.m. UTC | #6
On Mon, 18 Apr 2016 16:13:28 +0200
Henk Slager <eye1tm@gmail.com> wrote:

> (your email keeps ending up in gmail spam folder)
> 
> On Mon, Apr 18, 2016 at 9:24 AM, sri <toyours_sridhar@yahoo.co.in> wrote:
> > I tried btrfs-image and created image file and ran btrfs-image -r to a
> > different disk. Once recovered and mounted, I can able to see data is
> > not zeroed out as mentioned in btrfs-image man page.
> 
> "different disk"  you mention, that is important info. If you doe the
> restore to a image file, that image file is sparse and all data blocks
> are read as zeros.
> 
> However, if you restore to a block device, then you can assume it just
> writes the device blocks for metadata and leaves the rest untouched.
> So trim whole device first or brute-force overwrite completely with
> zeros.
> 
> So maybe the man pages needs some correction / extra notes.
> 
> > I tried on same machine.

Does btrfs-image store/restore the FS UUID? If it does, then potentially both
the source FS and the restored one were visible at the same time to the kernel
with identical UUIDs, and maybe it was actually accessing/mounting the source
one.
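
A quick way to check for that situation (device names are illustrative;
btrfstune -u, which generates a new fsid, is an assumption about reasonably
recent btrfs-progs):

    # Two devices carrying the same btrfs filesystem UUID is the red flag:
    blkid /dev/sdb /dev/sdc
    btrfs filesystem show

    # If both copies really have to be visible on one machine, give one of
    # them a new fsid before mounting (unmounted filesystem required).
    btrfstune -u /dev/sdc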
Henk Slager April 18, 2016, 7:48 p.m. UTC | #7
On Mon, Apr 18, 2016 at 4:26 PM, Roman Mamedov <rm@romanrm.net> wrote:
> On Mon, 18 Apr 2016 16:13:28 +0200
> Henk Slager <eye1tm@gmail.com> wrote:
>
>> (your email keeps ending up in gmail spam folder)
>>
>> On Mon, Apr 18, 2016 at 9:24 AM, sri <toyours_sridhar@yahoo.co.in> wrote:
>> > I tried btrfs-image and created image file and ran btrfs-image -r to a
>> > different disk. Once recovered and mounted, I can able to see data is
>> > not zeroed out as mentioned in btrfs-image man page.
>>
>> "different disk"  you mention, that is important info. If you doe the
>> restore to a image file, that image file is sparse and all data blocks
>> are read as zeros.
>>
>> However, if you restore to a block device, then you can assume it just
>> writes the device blocks for metadata and leaves the rest untouched.
>> So trim whole device first or brute-force overwrite completely with
>> zeros.
>>
>> So maybe the man pages needs some correction / extra notes.
>>
>> > I tried on same machine.
>
> Does btrfs-image store/restore the FS UUID? If it does, then potentially both
> the source FS and the restored one were visible at the same time to the kernel
> with identical UUIDs, and maybe it was actually accessing/mounting the source
> one.

Very good point! The UUIDs are the same. I remember I used a VM to
separate the source FS from the restored FS.

Also, the assumption I made about restoring to a block device is not
correct: if you restore to a loopdev that has a file full of non-zero
data as backing store, the files in the mounted restored FS are still
read as zeros.
sri April 19, 2016, 9:39 a.m. UTC | #8
Henk Slager <eye1tm@gmail.com> writes:

> On Mon, Apr 18, 2016 at 4:26 PM, Roman Mamedov <rm@romanrm.net> wrote:
> > On Mon, 18 Apr 2016 16:13:28 +0200
> > Henk Slager <eye1tm@gmail.com> wrote:
> >
> >> (your email keeps ending up in gmail spam folder)
> >>
> >> On Mon, Apr 18, 2016 at 9:24 AM, sri <toyours_sridhar@yahoo.co.in> wrote:
> >> > I tried btrfs-image and created image file and ran btrfs-image -r to a
> >> > different disk. Once recovered and mounted, I can able to see data is
> >> > not zeroed out as mentioned in btrfs-image man page.
> >>
> >> "different disk"  you mention, that is important info. If you doe the
> >> restore to a image file, that image file is sparse and all data blocks
> >> are read as zeros.
> >>
> >> However, if you restore to a block device, then you can assume it just
> >> writes the device blocks for metadata and leaves the rest untouched.
> >> So trim whole device first or brute-force overwrite completely with
> >> zeros.
> >>
> >> So maybe the man pages needs some correction / extra notes.
> >>
> >> > I tried on same machine.
> >
> > Does btrfs-image store/restore the FS UUID? If it does, then potentially both
> > the source FS and the restored one were visible at the same time to the kernel
> > with identical UUIDs, and maybe it was actually accessing/mounting the source
> > one.
>
> Very good point! The UUID's are the same. I remember I used a VM to
> separate the source FS from the restored FS
>
> Also, the assumption I made about restoring to a block device is not
> correct: If you restore to a loopdev that has a file with all
> non-zeros as backing-store, the files in the mounted restored FS are
> read as zeros.
> 
Thank you for the inputs. Actually I tried it in the following way.
I have /dev/sdb and /dev/sdc. Using wipefs -fa I cleared both devices and
created btrfs on /dev/sdb. I mounted it, wrote some files and unmounted it.
Then I ran btrfs-image /dev/sdc /img1.img and got the dump.
Once the image was created I again ran wipefs -fa /dev/sdb.
Then I ran btrfs-image -r /img1.img /dev/sdc and mounted /dev/sdc.
ls on the dumped filesystem shows the original file sizes and no file
content. I guess btrfs-image doesn't modify the files' stat, so it shows
the sizes as original.
However, running cat on some of the files gives an i/o error:

qd67:/btrfs1 # cat shadow.h
cat: shadow.h: Input/output error

These errors are not on all files. On other files, since the dump doesn't
contain any data, cat just returns to the prompt as below:

qd67:/btrfs1 # cat stab.h
qd67:/btrfs1 # cat stdio_ext.h

Not sure why the i/o errors are coming.



Henk Slager April 19, 2016, 11:30 a.m. UTC | #9
> I have /dev/sdb , /dev/sdc. Using wipefs -fa I cleared both devices and
> created btrfs on /dev/sdb. Mounted and written some files and unmounted it.
>
> Then I ran btrfs-image /dev/sdc /img1.img and got the dump.

It looks like you imaged the wrong device; that might explain the IO
errors later on.

> Once image created I again ran wipefs -fa /dev/sdb.
>
> Then I ran btrfs-image -r /img1.img /dev/sdc and mounted /dev/sdc.
>
> ls to dumped filesystem shows the file size as original and no file content.
> I guess btrfs-image doesn't modify files stat so i feel it is showing size
> as original.
>
> However running cat on some of files give i/o error
>
> qd67:/btrfs1 # cat shadow.h
> cat: shadow.h: Input/output error
>
> These errors are not on all files on other files, since dump doesn't
> contains any data it just prompts for cat as below.
>
> qd67:/btrfs1 # cat stab.h
> qd67:/btrfs1 # cat stdio_ext.h
>
> Not sure why i/o errors are coming.
Mark Fasheh April 19, 2016, 10:19 p.m. UTC | #10
On Fri, Apr 15, 2016 at 05:08:22PM +0800, Qu Wenruo wrote:
> Current btrfs qgroup design implies a requirement that after calling
> btrfs_qgroup_account_extents() there must be a commit root switch.
> 
> Normally this is OK, as btrfs_qgroup_accounting_extents() is only called
> inside btrfs_commit_transaction() just be commit_cowonly_roots().
> 
> However there is a exception at create_pending_snapshot(), which will
> call btrfs_qgroup_account_extents() but no any commit root switch.
> 
> In case of creating a snapshot whose parent root is itself (create a
> snapshot of fs tree), it will corrupt qgroup by the following trace:
> (skipped unrelated data)
> ======
> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
> ======
> 
> The problem here is in first qgroup_account_extent(), the
> nr_new_roots of the extent is 1, which means its reference got
> increased, and qgroup increased its rfer and excl.
> 
> But at second qgroup_account_extent(), its reference got decreased, but
> between these two qgroup_account_extent(), there is no switch roots.
> This leads to the same nr_old_roots, and this extent just got ignored by
> qgroup, which means this extent is wrongly accounted.
> 
> Fix it by call commit_cowonly_roots() after qgroup_account_extent() in
> create_pending_snapshot(), with needed preparation.
> 
> Reported-by: Mark Fasheh <mfasheh@suse.de>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>

Ok, this version seems to be giving me the right numbers. I'll send a test
for it shortly. I'd still like to know whether this patch introduces any
unintended side effects, but otherwise, thanks Qu.
	--Mark

--
Mark Fasheh
sri April 20, 2016, 11:50 a.m. UTC | #11
Sorry, that was a typo. I actually used the original disk /dev/sdb, where
the filesystem was created, and I am seeing these errors.



Holger Hoffstätte April 20, 2016, 2:25 p.m. UTC | #12
On Tue, 19 Apr 2016 15:19:46 -0700, Mark Fasheh wrote:

> On Fri, Apr 15, 2016 at 05:08:22PM +0800, Qu Wenruo wrote:
>> Current btrfs qgroup design implies a requirement that after calling
>> btrfs_qgroup_account_extents() there must be a commit root switch.
>> 
>> Normally this is OK, as btrfs_qgroup_accounting_extents() is only called
>> inside btrfs_commit_transaction() just be commit_cowonly_roots().
>> 
>> However there is a exception at create_pending_snapshot(), which will
>> call btrfs_qgroup_account_extents() but no any commit root switch.
>> 
>> In case of creating a snapshot whose parent root is itself (create a
>> snapshot of fs tree), it will corrupt qgroup by the following trace:
>> (skipped unrelated data)
>> ======
>> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
>> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
>> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
>> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
>> ======
>> 
>> The problem here is in first qgroup_account_extent(), the
>> nr_new_roots of the extent is 1, which means its reference got
>> increased, and qgroup increased its rfer and excl.
>> 
>> But at second qgroup_account_extent(), its reference got decreased, but
>> between these two qgroup_account_extent(), there is no switch roots.
>> This leads to the same nr_old_roots, and this extent just got ignored by
>> qgroup, which means this extent is wrongly accounted.
>> 
>> Fix it by call commit_cowonly_roots() after qgroup_account_extent() in
>> create_pending_snapshot(), with needed preparation.
>> 
>> Reported-by: Mark Fasheh <mfasheh@suse.de>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> 
> Ok, this version seems to be giving me the right numbers. I'll send a test
> for it shortly. I'd still like to know if this patch introduces an
> unintended side effects but otherwise, thanks Qu.
> 	--Mark

Hi Mark,

Can't speak about other side effects since I have not observed any so far,
but I can confirm that the previously failing case of deleting a renamed
snapshot [1] now works properly with v4 without getting the commit roots
in a twist. So:

Tested-by: holger.hoffstaette@googlemail.com

cheers
Holger

[1] http://thread.gmane.org/gmane.comp.file-systems.btrfs/55052
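
Going by the summary above, the failing case in [1] boils down to deleting
a snapshot after renaming it; a rough sketch of that scenario (paths are
illustrative, quotas enabled on the filesystem):

    btrfs quota enable /mnt
    btrfs subvolume snapshot /mnt /mnt/snap
    mv /mnt/snap /mnt/snap-renamed
    btrfs subvolume delete /mnt/snap-renamed
    sync

    # Compare the live numbers against a full rescan.
    btrfs qgroup show /mnt
    btrfs quota rescan -w /mnt
    btrfs qgroup show /mnt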

Josef Bacik April 22, 2016, 6:12 p.m. UTC | #13
On 04/15/2016 05:08 AM, Qu Wenruo wrote:
> Current btrfs qgroup design implies a requirement that after calling
> btrfs_qgroup_account_extents() there must be a commit root switch.
>
> Normally this is OK, as btrfs_qgroup_accounting_extents() is only called
> inside btrfs_commit_transaction() just be commit_cowonly_roots().
>
> However there is a exception at create_pending_snapshot(), which will
> call btrfs_qgroup_account_extents() but no any commit root switch.
>
> In case of creating a snapshot whose parent root is itself (create a
> snapshot of fs tree), it will corrupt qgroup by the following trace:
> (skipped unrelated data)
> ======
> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
> ======
>
> The problem here is in first qgroup_account_extent(), the
> nr_new_roots of the extent is 1, which means its reference got
> increased, and qgroup increased its rfer and excl.
>
> But at second qgroup_account_extent(), its reference got decreased, but
> between these two qgroup_account_extent(), there is no switch roots.
> This leads to the same nr_old_roots, and this extent just got ignored by
> qgroup, which means this extent is wrongly accounted.
>
> Fix it by call commit_cowonly_roots() after qgroup_account_extent() in
> create_pending_snapshot(), with needed preparation.
>
> Reported-by: Mark Fasheh <mfasheh@suse.de>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
> v2:
>    Fix a soft lockup caused by missing switch_commit_root() call.
>    Fix a warning caused by dirty-but-not-committed root.
> v3:
>    Fix a bug which will cause qgroup accounting for dropping snapshot
>    wrong
> v4:
>    Fix a bug caused by non-cowed btree modification.
>
> To Filipe:
>    I'm sorry I didn't wait for your reply on the dropped roots.
>    I reverted back the version where we deleted dropped roots in
>    switch_commit_roots().
>
>    As I think as long as we called btrfs_qgroup_prepare_account_extents()
>    and btrfs_qgroup_account_extents(), it should have already accounted
>    extents for dropped roots, and then we are OK to drop them.
>
>    It would be very nice if you could point out what I missed.
>    Thanks
>    Qu
> ---
>   fs/btrfs/transaction.c | 117 +++++++++++++++++++++++++++++++++++++++----------
>   1 file changed, 93 insertions(+), 24 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 43885e5..92f8193 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -311,10 +311,11 @@ loop:
>    * when the transaction commits
>    */
>   static int record_root_in_trans(struct btrfs_trans_handle *trans,
> -			       struct btrfs_root *root)
> +			       struct btrfs_root *root,
> +			       int force)
>   {
> -	if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
> -	    root->last_trans < trans->transid) {
> +	if ((test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
> +	    root->last_trans < trans->transid) || force) {
>   		WARN_ON(root == root->fs_info->extent_root);
>   		WARN_ON(root->commit_root != root->node);
>
> @@ -331,7 +332,7 @@ static int record_root_in_trans(struct btrfs_trans_handle *trans,
>   		smp_wmb();
>
>   		spin_lock(&root->fs_info->fs_roots_radix_lock);
> -		if (root->last_trans == trans->transid) {
> +		if (root->last_trans == trans->transid && !force) {
>   			spin_unlock(&root->fs_info->fs_roots_radix_lock);
>   			return 0;
>   		}
> @@ -402,7 +403,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
>   		return 0;
>
>   	mutex_lock(&root->fs_info->reloc_mutex);
> -	record_root_in_trans(trans, root);
> +	record_root_in_trans(trans, root, 0);
>   	mutex_unlock(&root->fs_info->reloc_mutex);
>
>   	return 0;
> @@ -1311,6 +1312,80 @@ int btrfs_defrag_root(struct btrfs_root *root)
>   }
>
>   /*
> + * Do all special snapshot related qgroup dirty hack.
> + *
> + * Will do all needed qgroup inherit and dirty hack like switch commit
> + * roots inside one transaction and write all btree into disk, to make
> + * qgroup works.
> + */
> +static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
> +				   struct btrfs_root *src,
> +				   struct btrfs_root *parent,
> +				   struct btrfs_qgroup_inherit *inherit,
> +				   u64 dst_objectid)
> +{
> +	struct btrfs_fs_info *fs_info = src->fs_info;
> +	int ret;
> +
> +	/*
> +	 * We are going to commit transaction, see btrfs_commit_transaction()
> +	 * comment for reason locking tree_log_mutex
> +	 */
> +	mutex_lock(&fs_info->tree_log_mutex);
> +
> +	ret = commit_fs_roots(trans, src);
> +	if (ret)
> +		goto out;
> +	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
> +	if (ret < 0)
> +		goto out;
> +	ret = btrfs_qgroup_account_extents(trans, fs_info);
> +	if (ret < 0)
> +		goto out;
> +
> +	/* Now qgroup are all updated, we can inherit it to new qgroups */
> +	ret = btrfs_qgroup_inherit(trans, fs_info,
> +				   src->root_key.objectid, dst_objectid,
> +				   inherit);
> +	if (ret < 0)
> +		goto out;
> +
> +	/*
> +	 * Now we do a simplified commit transaction, which will:
> +	 * 1) commit all subvolume and extent tree
> +	 *    To ensure all subvolume and extent tree have a valid
> +	 *    commit_root to accounting later insert_dir_item()
> +	 * 2) write all btree blocks onto disk
> +	 *    This is to make sure later btree modification will be cowed
> +	 *    Or commit_root can be populated and cause wrong qgroup numbers
> +	 * In this simplified commit, we don't really care about other trees
> +	 * like chunk and root tree, as they won't affect qgroup.
> +	 * And we don't write super to avoid half committed status.
> +	 */
> +	ret = commit_cowonly_roots(trans, src);
> +	if (ret)
> +		goto out;
> +	switch_commit_roots(trans->transaction, fs_info);
> +	ret = btrfs_write_and_wait_transaction(trans, src);
> +	if (ret)
> +		btrfs_std_error(fs_info, ret,
> +			"Error while writing out transaction for qgroup");
> +
> +out:
> +	mutex_unlock(&fs_info->tree_log_mutex);
> +
> +	/*
> +	 * Force parent root to be updated, as we recorded it before so its
> +	 * last_trans == cur_transid.
> +	 * Or it won't be committed again onto disk after later
> +	 * insert_dir_item()
> +	 */
> +	if (!ret)
> +		record_root_in_trans(trans, parent, 1);
> +	return ret;
> +}

NACK, holy shit we aren't adding a special transaction commit only for 
qgroup snapshots.  Figure out a different way.  Thanks,

Josef

Mark Fasheh April 22, 2016, 6:21 p.m. UTC | #14
On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote:
> On 04/15/2016 05:08 AM, Qu Wenruo wrote:
> >+	/*
> >+	 * Force parent root to be updated, as we recorded it before so its
> >+	 * last_trans == cur_transid.
> >+	 * Or it won't be committed again onto disk after later
> >+	 * insert_dir_item()
> >+	 */
> >+	if (!ret)
> >+		record_root_in_trans(trans, parent, 1);
> >+	return ret;
> >+}
> 
> NACK, holy shit we aren't adding a special transaction commit only
> for qgroup snapshots.  Figure out a different way.  Thanks,

Yeah I saw that. To be fair, we run a whole lot of the transaction stuff
multiple times (at least from my reading) so I'm really unclear on what the
performance impact is.

Do you have any suggestion though? We've been banging our heads against this
for a while now and as slow as this patch might be, it actually works where
nothing else has so far.
	--Mark

--
Mark Fasheh
Josef Bacik April 22, 2016, 6:23 p.m. UTC | #15
On 04/22/2016 02:21 PM, Mark Fasheh wrote:
> On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote:
>> On 04/15/2016 05:08 AM, Qu Wenruo wrote:
>>> +	/*
>>> +	 * Force parent root to be updated, as we recorded it before so its
>>> +	 * last_trans == cur_transid.
>>> +	 * Or it won't be committed again onto disk after later
>>> +	 * insert_dir_item()
>>> +	 */
>>> +	if (!ret)
>>> +		record_root_in_trans(trans, parent, 1);
>>> +	return ret;
>>> +}
>>
>> NACK, holy shit we aren't adding a special transaction commit only
>> for qgroup snapshots.  Figure out a different way.  Thanks,
>
> Yeah I saw that. To be fair, we run a whole lot of the transaction stuff
> multiple times (at least from my reading) so I'm really unclear on what the
> performance impact is.
>
> Do you have any suggestion though? We've been banging our heads against this
> for a while now and as slow as this patch might be, it actually works where
> nothing else has so far.

I'm less concerned about committing another transaction and more 
concerned about the fact that it is a special variant of the 
transaction commit.  If this goes wrong, or at some point in the future 
we fail to update it along with btrfs_transaction_commit we suddenly are 
corrupting metadata.  If we have to commit a transaction then call 
btrfs_commit_transaction(), don't open code a stripped down version, 
here be dragons.  Thanks,

Josef

Mark Fasheh April 22, 2016, 6:29 p.m. UTC | #16
On Fri, Apr 22, 2016 at 02:23:59PM -0400, Josef Bacik wrote:
> On 04/22/2016 02:21 PM, Mark Fasheh wrote:
> >On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote:
> >>On 04/15/2016 05:08 AM, Qu Wenruo wrote:
> >>>+	/*
> >>>+	 * Force parent root to be updated, as we recorded it before so its
> >>>+	 * last_trans == cur_transid.
> >>>+	 * Or it won't be committed again onto disk after later
> >>>+	 * insert_dir_item()
> >>>+	 */
> >>>+	if (!ret)
> >>>+		record_root_in_trans(trans, parent, 1);
> >>>+	return ret;
> >>>+}
> >>
> >>NACK, holy shit we aren't adding a special transaction commit only
> >>for qgroup snapshots.  Figure out a different way.  Thanks,
> >
> >Yeah I saw that. To be fair, we run a whole lot of the transaction stuff
> >multiple times (at least from my reading) so I'm really unclear on what the
> >performance impact is.
> >
> >Do you have any suggestion though? We've been banging our heads against this
> >for a while now and as slow as this patch might be, it actually works where
> >nothing else has so far.
> 
> I'm less concerned about committing another transaction and more
> concerned about the fact that it is an special variant of the
> transaction commit.  If this goes wrong, or at some point in the
> future we fail to update it along with btrfs_transaction_commit we
> suddenly are corrupting metadata.  If we have to commit a
> transaction then call btrfs_commit_transaction(), don't open code a
> stripped down version, here be dragons.  Thanks,

Ok yeah that makes perfect sense - I thought you were telling me that this
would be a big performance problem.
	--Mark

--
Mark Fasheh
Qu Wenruo April 25, 2016, 12:56 a.m. UTC | #17
Josef Bacik wrote on 2016/04/22 14:23 -0400:
> On 04/22/2016 02:21 PM, Mark Fasheh wrote:
>> On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote:
>>> On 04/15/2016 05:08 AM, Qu Wenruo wrote:
>>>> +    /*
>>>> +     * Force parent root to be updated, as we recorded it before so
>>>> its
>>>> +     * last_trans == cur_transid.
>>>> +     * Or it won't be committed again onto disk after later
>>>> +     * insert_dir_item()
>>>> +     */
>>>> +    if (!ret)
>>>> +        record_root_in_trans(trans, parent, 1);
>>>> +    return ret;
>>>> +}
>>>
>>> NACK, holy shit we aren't adding a special transaction commit only
>>> for qgroup snapshots.  Figure out a different way.  Thanks,
>>
>> Yeah I saw that. To be fair, we run a whole lot of the transaction stuff
>> multiple times (at least from my reading) so I'm really unclear on
>> what the
>> performance impact is.
>>
>> Do you have any suggestion though? We've been banging our heads
>> against this
>> for a while now and as slow as this patch might be, it actually works
>> where
>> nothing else has so far.
>
> I'm less concerned about committing another transaction and more
> concerned about the fact that it is an special variant of the
> transaction commit.  If this goes wrong, or at some point in the future
> we fail to update it along with btrfs_transaction_commit we suddenly are
> corrupting metadata.  If we have to commit a transaction then call
> btrfs_commit_transaction(), don't open code a stripped down version,
> here be dragons.  Thanks,
>
> Josef
>
>
>
Yes, I also don't like the dirty hack.

The problem is, though, that we have no other good choice.

If we could call commit_transaction() that would be the best case, but the
problem is that in create_pending_snapshots() we are already inside
commit_transaction().

Or can commit_transaction() be called inside commit_transaction()?

Thanks,
Qu


Josef Bacik April 25, 2016, 2:24 p.m. UTC | #18
On 04/24/2016 08:56 PM, Qu Wenruo wrote:
>
>
> Josef Bacik wrote on 2016/04/22 14:23 -0400:
>> On 04/22/2016 02:21 PM, Mark Fasheh wrote:
>>> On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote:
>>>> On 04/15/2016 05:08 AM, Qu Wenruo wrote:
>>>>> +    /*
>>>>> +     * Force parent root to be updated, as we recorded it before so
>>>>> its
>>>>> +     * last_trans == cur_transid.
>>>>> +     * Or it won't be committed again onto disk after later
>>>>> +     * insert_dir_item()
>>>>> +     */
>>>>> +    if (!ret)
>>>>> +        record_root_in_trans(trans, parent, 1);
>>>>> +    return ret;
>>>>> +}
>>>>
>>>> NACK, holy shit we aren't adding a special transaction commit only
>>>> for qgroup snapshots.  Figure out a different way.  Thanks,
>>>
>>> Yeah I saw that. To be fair, we run a whole lot of the transaction stuff
>>> multiple times (at least from my reading) so I'm really unclear on
>>> what the
>>> performance impact is.
>>>
>>> Do you have any suggestion though? We've been banging our heads
>>> against this
>>> for a while now and as slow as this patch might be, it actually works
>>> where
>>> nothing else has so far.
>>
>> I'm less concerned about committing another transaction and more
>> concerned about the fact that it is an special variant of the
>> transaction commit.  If this goes wrong, or at some point in the future
>> we fail to update it along with btrfs_transaction_commit we suddenly are
>> corrupting metadata.  If we have to commit a transaction then call
>> btrfs_commit_transaction(), don't open code a stripped down version,
>> here be dragons.  Thanks,
>>
>> Josef
>>
>>
>>
> Yes, I also don't like the dirty hack.
>
> Although the problem is, we have no other good choice.
>
> If we can call commit_transaction() that's the best case, but the
> problem is, in create_pending_snapshots(), we are already inside
> commit_transaction().
>
> Or commit_transaction() can be called inside commit_transaction()?
>

No, figure out a different way.  IIRC I dealt with this with the 
no_quota flag for inc_ref/dec_ref since the copy root stuff does strange 
things with the reference counts, but all this code is gone now.  I 
looked around to see if I could figure out how the refs are ending up 
this way but it doesn't make sense to me and there isn't enough 
information in your changelog for me to be able to figure it out. 
You've created this mess, clean it up without making it messier.  Thanks,

Josef
Qu Wenruo April 26, 2016, 12:35 a.m. UTC | #19
Josef Bacik wrote on 2016/04/25 10:24 -0400:
> On 04/24/2016 08:56 PM, Qu Wenruo wrote:
>>
>>
>> Josef Bacik wrote on 2016/04/22 14:23 -0400:
>>> On 04/22/2016 02:21 PM, Mark Fasheh wrote:
>>>> On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote:
>>>>> On 04/15/2016 05:08 AM, Qu Wenruo wrote:
>>>>>> +    /*
>>>>>> +     * Force parent root to be updated, as we recorded it before so
>>>>>> its
>>>>>> +     * last_trans == cur_transid.
>>>>>> +     * Or it won't be committed again onto disk after later
>>>>>> +     * insert_dir_item()
>>>>>> +     */
>>>>>> +    if (!ret)
>>>>>> +        record_root_in_trans(trans, parent, 1);
>>>>>> +    return ret;
>>>>>> +}
>>>>>
>>>>> NACK, holy shit we aren't adding a special transaction commit only
>>>>> for qgroup snapshots.  Figure out a different way.  Thanks,
>>>>
>>>> Yeah I saw that. To be fair, we run a whole lot of the transaction
>>>> stuff
>>>> multiple times (at least from my reading) so I'm really unclear on
>>>> what the
>>>> performance impact is.
>>>>
>>>> Do you have any suggestion though? We've been banging our heads
>>>> against this
>>>> for a while now and as slow as this patch might be, it actually works
>>>> where
>>>> nothing else has so far.
>>>
>>> I'm less concerned about committing another transaction and more
>>> concerned about the fact that it is an special variant of the
>>> transaction commit.  If this goes wrong, or at some point in the future
>>> we fail to update it along with btrfs_transaction_commit we suddenly are
>>> corrupting metadata.  If we have to commit a transaction then call
>>> btrfs_commit_transaction(), don't open code a stripped down version,
>>> here be dragons.  Thanks,
>>>
>>> Josef
>>>
>>>
>>>
>> Yes, I also don't like the dirty hack.
>>
>> Although the problem is, we have no other good choice.
>>
>> If we can call commit_transaction() that's the best case, but the
>> problem is, in create_pending_snapshots(), we are already inside
>> commit_transaction().
>>
>> Or commit_transaction() can be called inside commit_transaction()?
>>
>
> No, figure out a different way.  IIRC I dealt with this with the
> no_quota flag for inc_ref/dec_ref since the copy root stuff does strange
> things with the reference counts, but all this code is gone now.  I
> looked around to see if I could figure out how the refs are ending up
> this way but it doesn't make sense to me and there isn't enough
> information in your changelog for me to be able to figure it out. You've
> created this mess, clean it up without making it messier.  Thanks,
>
> Josef
>
>
Unfortunately, your original no_quota flag just hid the bug, and hid it in
a bad way.

Originally, the no_quota flag was used for cases like this: to skip quota
at snapshot creation, and use quota_inherit() to hack the quota accounting.
It seems to work, but in fact, if the DIR_ITEM insert needs to create a new
cousin leaf, then quota is messed up.

Your quota rework doesn't really help, as it doesn't even account things
well; just check fstests btrfs/091 on a 4.1 kernel.

The only complete fix for this already nasty subvolume creation is to do a
full subtree rescan.
Otherwise no one knows when higher-level qgroups will be broken.



If you think splitting commit_transaction() into two variants can cause
problems, I can merge these two variants into one.

The commit process in btrfs_commit_transaction() is much the same as the
one used in create_pending_snapshot().

If there is only one __commit_roots() doing such a commit, then there is
nothing special just for quota.

Thanks,
Qu


Josef Bacik April 26, 2016, 2:26 p.m. UTC | #20
On 04/25/2016 08:35 PM, Qu Wenruo wrote:
>
>
> Josef Bacik wrote on 2016/04/25 10:24 -0400:
>> On 04/24/2016 08:56 PM, Qu Wenruo wrote:
>>>
>>>
>>> Josef Bacik wrote on 2016/04/22 14:23 -0400:
>>>> On 04/22/2016 02:21 PM, Mark Fasheh wrote:
>>>>> On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote:
>>>>>> On 04/15/2016 05:08 AM, Qu Wenruo wrote:
>>>>>>> +    /*
>>>>>>> +     * Force parent root to be updated, as we recorded it before so
>>>>>>> its
>>>>>>> +     * last_trans == cur_transid.
>>>>>>> +     * Or it won't be committed again onto disk after later
>>>>>>> +     * insert_dir_item()
>>>>>>> +     */
>>>>>>> +    if (!ret)
>>>>>>> +        record_root_in_trans(trans, parent, 1);
>>>>>>> +    return ret;
>>>>>>> +}
>>>>>>
>>>>>> NACK, holy shit we aren't adding a special transaction commit only
>>>>>> for qgroup snapshots.  Figure out a different way.  Thanks,
>>>>>
>>>>> Yeah I saw that. To be fair, we run a whole lot of the transaction
>>>>> stuff
>>>>> multiple times (at least from my reading) so I'm really unclear on
>>>>> what the
>>>>> performance impact is.
>>>>>
>>>>> Do you have any suggestion though? We've been banging our heads
>>>>> against this
>>>>> for a while now and as slow as this patch might be, it actually works
>>>>> where
>>>>> nothing else has so far.
>>>>
>>>> I'm less concerned about committing another transaction and more
>>>> concerned about the fact that it is an special variant of the
>>>> transaction commit.  If this goes wrong, or at some point in the future
>>>> we fail to update it along with btrfs_transaction_commit we suddenly
>>>> are
>>>> corrupting metadata.  If we have to commit a transaction then call
>>>> btrfs_commit_transaction(), don't open code a stripped down version,
>>>> here be dragons.  Thanks,
>>>>
>>>> Josef
>>>>
>>>>
>>>>
>>> Yes, I also don't like the dirty hack.
>>>
>>> Although the problem is, we have no other good choice.
>>>
>>> If we can call commit_transaction() that's the best case, but the
>>> problem is, in create_pending_snapshots(), we are already inside
>>> commit_transaction().
>>>
>>> Or commit_transaction() can be called inside commit_transaction()?
>>>
>>
>> No, figure out a different way.  IIRC I dealt with this with the
>> no_quota flag for inc_ref/dec_ref since the copy root stuff does strange
>> things with the reference counts, but all this code is gone now.  I
>> looked around to see if I could figure out how the refs are ending up
>> this way but it doesn't make sense to me and there isn't enough
>> information in your changelog for me to be able to figure it out. You've
>> created this mess, clean it up without making it messier.  Thanks,
>>
>> Josef
>>
>>
> Unfortunately, your original no_quota flag just hide the bug, and hide
> it in a bad method.
>
> Originally, no_quota flag is used for case like this, to skip quota at
> snapshot creation, and use quota_inherit() to hack the quota accounting.
> It seems work, but in fact, if the DIR_ITEM insert need to create a new
> cousin leaf, then quota is messed up.
>

No, and this is the problem, you fundamentally didn't understand what I 
wrote, and instead of trying to understand it and fix the bug you just 
threw it all away.  The no_quota stuff was not a hack, it was put in 
place to deal with refs we already knew were accounted for, such as 
when we converted to mixed refs or other such operations.

There were bugs in my rework, but now the situation is untenable.  What 
we now have is something that holds delayed refs in memory for the 
entire transaction, which is a non-starter for anybody who wants to use 
this in production.  On our gluster machines we get millions of delayed 
refs per transaction, and then there are multiple file systems.  Then 
you have to post-process all of that during the critical section of the 
commit?  So now suddenly I'm walking millions of delayed refs doing 
accounting all at once, that is going to cause commit latencies in the 
seconds which is completely unacceptable.

Anyway I spent some time this morning reading through the new stuff to 
figure out how it works now and I've got a patch to fix this problem 
that doesn't involve screwing with the transaction commit stuff at all. 
I sent it along separately:

Btrfs: fix qgroup accounting when snapshotting

This fixes the basic case that was described originally.  I haven't 
tested it other than that but I'm pretty sure it is correct.  Thanks,

Josef
Qu Wenruo April 27, 2016, 1:12 a.m. UTC | #21
Josef Bacik wrote on 2016/04/26 10:26 -0400:
> On 04/25/2016 08:35 PM, Qu Wenruo wrote:
>>
>>
>> Josef Bacik wrote on 2016/04/25 10:24 -0400:
>>> On 04/24/2016 08:56 PM, Qu Wenruo wrote:
>>>>
>>>>
>>>> Josef Bacik wrote on 2016/04/22 14:23 -0400:
>>>>> On 04/22/2016 02:21 PM, Mark Fasheh wrote:
>>>>>> On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik
>>>>>> wrote:
>>>>>>> On 04/15/2016 05:08 AM, Qu Wenruo wrote:
>>>>>>>> +    /*
>>>>>>>> +     * Force parent root to be updated, as we recorded it before so its
>>>>>>>> +     * last_trans == cur_transid.
>>>>>>>> +     * Or it won't be committed again onto disk after later
>>>>>>>> +     * insert_dir_item()
>>>>>>>> +     */
>>>>>>>> +    if (!ret)
>>>>>>>> +        record_root_in_trans(trans, parent, 1);
>>>>>>>> +    return ret;
>>>>>>>> +}
>>>>>>>
>>>>>>> NACK, holy shit we aren't adding a special transaction
>>>>>>> commit only for qgroup snapshots.  Figure out a different
>>>>>>> way.  Thanks,
>>>>>>
>>>>>> Yeah I saw that. To be fair, we run a whole lot of the
>>>>>> transaction stuff multiple times (at least from my reading)
>>>>>> so I'm really unclear on what the performance impact is.
>>>>>>
>>>>>> Do you have any suggestion though? We've been banging our
>>>>>> heads against this for a while now and as slow as this
>>>>>> patch might be, it actually works where nothing else has so
>>>>>> far.
>>>>>
>>>>> I'm less concerned about committing another transaction and
>>>>> more concerned about the fact that it is an special variant
>>>>> of the transaction commit.  If this goes wrong, or at some
>>>>> point in the future we fail to update it along with
>>>>> btrfs_transaction_commit we suddenly are corrupting metadata.
>>>>> If we have to commit a transaction then call
>>>>> btrfs_commit_transaction(), don't open code a stripped down
>>>>> version, here be dragons.  Thanks,
>>>>>
>>>>> Josef
>>>>>
>>>>>
>>>>>
>>>> Yes, I also don't like the dirty hack.
>>>>
>>>> Although the problem is, we have no other good choice.
>>>>
>>>> If we can call commit_transaction() that's the best case, but
>>>> the problem is, in create_pending_snapshots(), we are already
>>>> inside commit_transaction().
>>>>
>>>> Or commit_transaction() can be called inside
>>>> commit_transaction()?
>>>>
>>>
>>> No, figure out a different way.  IIRC I dealt with this with the
>>> no_quota flag for inc_ref/dec_ref since the copy root stuff does
>>> strange things with the reference counts, but all this code is
>>> gone now.  I looked around to see if I could figure out how the
>>> refs are ending up this way but it doesn't make sense to me and
>>> there isn't enough information in your changelog for me to be
>>> able to figure it out. You've created this mess, clean it up
>>> without making it messier.  Thanks,
>>>
>>> Josef
>>>
>>>
>> Unfortunately, your original no_quota flag just hide the bug, and
>> hide it in a bad method.
>>
>> Originally, no_quota flag is used for case like this, to skip quota
>> at snapshot creation, and use quota_inherit() to hack the quota
>> accounting. It seems work, but in fact, if the DIR_ITEM insert need
>> to create a new cousin leaf, then quota is messed up.
>>
>
> No, and this is the problem, you fundamentally didn't understand what
> I wrote, and instead of trying to understand it and fix the bug you
> just threw it all away.  The no_quota stuff was not a hack, it was
> put in place to deal with refs we already knew where accounted for,
> such as when we converted to mixed refs or other such operations.

Even if we know it has been accounted, it's still a hack to jump away from
the normal accounting routine.

It is just like what we do in qgroup_inherit(): we assume snapshot creation
will always make the exclusive count equal to nodesize.
But in fact, we didn't consider higher-level qgroups, and qgroup_inherit()
just messes up that case.


And yes, I DID throw the old qgroup code away,
as there were too many existing bugs and possible bugs, not only in qgroup
but also in backref walking.

Backref walking can't handle time_seq really well; that's one of the
reasons for btrfs/091.
And newcomers must handle the extra no_quota flag, which makes it easier to
introduce new regressions.

>
> There were bugs in my rework, but now the situation is untenable.
> What we now have is something that holds delayed refs in memory for
> the entire transaction, which is a non-starter for anybody who wants
> to use this in production.

Nope, we don't hold delayed refs in the current qgroup code.

Delayed refs can be flushed to disk at any time, just as before.
We only record which bytenrs will go through the qgroup routine, which has
nothing to do with delayed refs.

Yes, you can have millions of delayed refs, but there won't be millions
of different extents, or sync_fs() would have been triggered.

> On our gluster machines we get millions of delayed refs per
> transaction, and then there are multiple file systems.  Then you have
> to post-process all of that during the critical section of the
> commit?  So now suddenly I'm walking millions of delayed refs doing
> accounting all at once, that is going to cause commit latencies in
> the seconds which is completely unacceptable.

In fact the qgroup code never accounts or walks through delayed refs.

The qgroup code only walks through backrefs in the extent tree (the commit
tree, and the state just before the commit tree switch).
So, millions is already reduced to a much smaller amount.

With the old qgroup code, every inc/dec_extent_ref() call walked through
all the backrefs.

So in special cases like in-band dedupe, thousands of inc_extent_ref()
calls cause thousands of backref walks; no matter when you do the walk,
it hugely delays the transaction.

But in the new qgroup code, it is twice per dirty extent in one transaction.
Even in the best case, it is the same as the old code,
and much, much faster in the worst case.


And furthermore, since the qgroup code doesn't bother with the already
b**lsh*t delayed refs, it only needs to care about the extent tree.
This also avoids the backref bugs in walking delayed refs.


Although, the root problem is in backref walking.
It's too time consuming, and we need some cache for it.
Otherwise dedupe (in-band or out-of-band, or reflink) can easily screw it
up together with fiemap.

Just create a 1G file with all its extents pointing to a single 128K one.
Then fiemap will just hang your system.
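
A rough way to build such a file (a sketch; paths and sizes are
illustrative, and xfs_io's reflink command is assumed to be available):

    # One 128K extent as the shared source.
    xfs_io -f -c "pwrite 0 128k" -c fsync /mnt/src

    # Reference it back-to-back until the destination reaches 1G.
    for ((off = 0; off < 1073741824; off += 131072)); do
        xfs_io -f -c "reflink /mnt/src 0 $off 131072" /mnt/huge
    done

    # fiemap on the result (e.g. via filefrag) now has to resolve a huge
    # number of shared references to the same extent and becomes very slow.
    filefrag -v /mnt/huge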

>
> Anyway I spent some time this morning reading through the new stuff
> to figure out how it works now and I've got a patch to fix this
> problem that doesn't involve screwing with the transaction commit
> stuff at all. I sent it along separately
>
> Btrfs: fix qgroup accounting when snapshotting
>
> This fixes the basic case that was described originally.  I haven't
> tested it other than that but I'm pretty sure it is correct.
> Thanks,
>
> Josef
>
Unfortunately, it isn't correct and doesn't even fix the bug.

I'll comment in that thread.

Thanks,
Qu




Mark Fasheh May 11, 2016, 4:57 p.m. UTC | #22
Hi Josef, 

On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote:
> On 04/15/2016 05:08 AM, Qu Wenruo wrote:
> >Current btrfs qgroup design implies a requirement that after calling
> >btrfs_qgroup_account_extents() there must be a commit root switch.
> >
> >Normally this is OK, as btrfs_qgroup_accounting_extents() is only called
> >inside btrfs_commit_transaction() just be commit_cowonly_roots().
> >
> >However there is a exception at create_pending_snapshot(), which will
> >call btrfs_qgroup_account_extents() but no any commit root switch.
> >
> >In case of creating a snapshot whose parent root is itself (create a
> >snapshot of fs tree), it will corrupt qgroup by the following trace:
> >(skipped unrelated data)
> >======
> >btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
> >qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
> >qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
> >btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
> >======
> >
> >The problem here is in first qgroup_account_extent(), the
> >nr_new_roots of the extent is 1, which means its reference got
> >increased, and qgroup increased its rfer and excl.
> >
> >But at second qgroup_account_extent(), its reference got decreased, but
> >between these two qgroup_account_extent(), there is no switch roots.
> >This leads to the same nr_old_roots, and this extent just got ignored by
> >qgroup, which means this extent is wrongly accounted.
> >
> >Fix it by call commit_cowonly_roots() after qgroup_account_extent() in
> >create_pending_snapshot(), with needed preparation.
> >
> >Reported-by: Mark Fasheh <mfasheh@suse.de>
> >Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> >---
> >v2:
> >   Fix a soft lockup caused by missing switch_commit_root() call.
> >   Fix a warning caused by dirty-but-not-committed root.
> >v3:
> >   Fix a bug which will cause qgroup accounting for dropping snapshot
> >   wrong
> >v4:
> >   Fix a bug caused by non-cowed btree modification.
> >
> >To Filipe:
> >   I'm sorry I didn't wait for your reply on the dropped roots.
> >   I reverted back the version where we deleted dropped roots in
> >   switch_commit_roots().
> >
> >   As I think as long as we called btrfs_qgroup_prepare_account_extents()
> >   and btrfs_qgroup_account_extents(), it should have already accounted
> >   extents for dropped roots, and then we are OK to drop them.
> >
> >   It would be very nice if you could point out what I missed.
> >   Thanks
> >   Qu
> >---
> >  fs/btrfs/transaction.c | 117 +++++++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 93 insertions(+), 24 deletions(-)
> >
> >diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> >index 43885e5..92f8193 100644
> >--- a/fs/btrfs/transaction.c
> >+++ b/fs/btrfs/transaction.c
> >@@ -311,10 +311,11 @@ loop:
> >   * when the transaction commits
> >   */
> >  static int record_root_in_trans(struct btrfs_trans_handle *trans,
> >-			       struct btrfs_root *root)
> >+			       struct btrfs_root *root,
> >+			       int force)
> >  {
> >-	if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
> >-	    root->last_trans < trans->transid) {
> >+	if ((test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
> >+	    root->last_trans < trans->transid) || force) {
> >  		WARN_ON(root == root->fs_info->extent_root);
> >  		WARN_ON(root->commit_root != root->node);
> >
> >@@ -331,7 +332,7 @@ static int record_root_in_trans(struct btrfs_trans_handle *trans,
> >  		smp_wmb();
> >
> >  		spin_lock(&root->fs_info->fs_roots_radix_lock);
> >-		if (root->last_trans == trans->transid) {
> >+		if (root->last_trans == trans->transid && !force) {
> >  			spin_unlock(&root->fs_info->fs_roots_radix_lock);
> >  			return 0;
> >  		}
> >@@ -402,7 +403,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
> >  		return 0;
> >
> >  	mutex_lock(&root->fs_info->reloc_mutex);
> >-	record_root_in_trans(trans, root);
> >+	record_root_in_trans(trans, root, 0);
> >  	mutex_unlock(&root->fs_info->reloc_mutex);
> >
> >  	return 0;
> >@@ -1311,6 +1312,80 @@ int btrfs_defrag_root(struct btrfs_root *root)
> >  }
> >
> >  /*
> >+ * Do all special snapshot related qgroup dirty hack.
> >+ *
> >+ * Will do all needed qgroup inherit and dirty hack like switch commit
> >+ * roots inside one transaction and write all btree into disk, to make
> >+ * qgroup works.
> >+ */
> >+static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
> >+				   struct btrfs_root *src,
> >+				   struct btrfs_root *parent,
> >+				   struct btrfs_qgroup_inherit *inherit,
> >+				   u64 dst_objectid)
> >+{
> >+	struct btrfs_fs_info *fs_info = src->fs_info;
> >+	int ret;
> >+
> >+	/*
> >+	 * We are going to commit transaction, see btrfs_commit_transaction()
> >+	 * comment for reason locking tree_log_mutex
> >+	 */
> >+	mutex_lock(&fs_info->tree_log_mutex);
> >+
> >+	ret = commit_fs_roots(trans, src);
> >+	if (ret)
> >+		goto out;
> >+	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
> >+	if (ret < 0)
> >+		goto out;
> >+	ret = btrfs_qgroup_account_extents(trans, fs_info);
> >+	if (ret < 0)
> >+		goto out;
> >+
> >+	/* Now qgroup are all updated, we can inherit it to new qgroups */
> >+	ret = btrfs_qgroup_inherit(trans, fs_info,
> >+				   src->root_key.objectid, dst_objectid,
> >+				   inherit);
> >+	if (ret < 0)
> >+		goto out;
> >+
> >+	/*
> >+	 * Now we do a simplified commit transaction, which will:
> >+	 * 1) commit all subvolume and extent tree
> >+	 *    To ensure all subvolume and extent tree have a valid
> >+	 *    commit_root to accounting later insert_dir_item()
> >+	 * 2) write all btree blocks onto disk
> >+	 *    This is to make sure later btree modification will be cowed
> >+	 *    Or commit_root can be populated and cause wrong qgroup numbers
> >+	 * In this simplified commit, we don't really care about other trees
> >+	 * like chunk and root tree, as they won't affect qgroup.
> >+	 * And we don't write super to avoid half committed status.
> >+	 */
> >+	ret = commit_cowonly_roots(trans, src);
> >+	if (ret)
> >+		goto out;
> >+	switch_commit_roots(trans->transaction, fs_info);
> >+	ret = btrfs_write_and_wait_transaction(trans, src);
> >+	if (ret)
> >+		btrfs_std_error(fs_info, ret,
> >+			"Error while writing out transaction for qgroup");
> >+
> >+out:
> >+	mutex_unlock(&fs_info->tree_log_mutex);
> >+
> >+	/*
> >+	 * Force parent root to be updated, as we recorded it before so its
> >+	 * last_trans == cur_transid.
> >+	 * Or it won't be committed again onto disk after later
> >+	 * insert_dir_item()
> >+	 */
> >+	if (!ret)
> >+		record_root_in_trans(trans, parent, 1);
> >+	return ret;
> >+}
> 
> NACK, holy shit we aren't adding a special transaction commit only
> for qgroup snapshots.  Figure out a different way.  Thanks,


Unfortunately I think we're going to have to swallow our pride on this one :(

Per our conversations on irc, and my detailed observations in this e-mail:

https://www.marc.info/?l=linux-btrfs&m=146257186311897&w=2

It seems like we have a core problem in that root counting during snapshot
create is unreliable and leads to corrupted qgroups. Add to that the
poor assumptions made by the rest of the code (such as qgroup_inherit()
always marking dst->excl = node_size) and it looks like we won't have a
proper fix until another qgroup rewrite.
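
To make that failure mode concrete, here is a toy user-space model of the
counter update (my own sketch, not the kernel code -- the real logic lives
in qgroup_update_counters(), and the names below are made up):

#include <stdio.h>

struct toy_qgroup {
	unsigned long long rfer;
	unsigned long long excl;
};

/* Very rough model: only act when the extent's root count changed. */
static void toy_account_extent(struct toy_qgroup *qg, unsigned long long bytes,
			       int nr_old_roots, int nr_new_roots)
{
	if (nr_old_roots == 0 && nr_new_roots > 0) {
		qg->rfer += bytes;		/* extent became referenced */
		qg->excl += bytes;
	} else if (nr_old_roots > 0 && nr_new_roots == 0) {
		qg->rfer -= bytes;		/* extent no longer referenced */
		qg->excl -= bytes;
	}
	/* nr_old_roots == nr_new_roots: extent looks unchanged, do nothing */
}

int main(void)
{
	struct toy_qgroup qg5 = { 0, 0 };

	/* First pass at snapshot time: extent shows up as newly referenced. */
	toy_account_extent(&qg5, 16384, 0, 1);

	/*
	 * No commit root switch happens before the second pass, so the
	 * "old" count is computed as 0 again; the extent was actually
	 * dropped, but 0 -> 0 is a no-op and the 16384 bytes stay counted.
	 */
	toy_account_extent(&qg5, 16384, 0, 0);

	printf("qgid 5: rfer=%llu excl=%llu\n", qg5.rfer, qg5.excl);
	return 0;
}

Running that prints rfer=16384 excl=16384 for qgid 5, which matches the
bogus 16384/16384 left behind in the trace above.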

In the meantime, this would make qgroup numbers correct again. If we drop a
single check in there so it only runs when qgroups are enabled, we can
mitigate the performance impact. If I send that patch, would you be OK with
ACKing it this time around?
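
The shape of the check I have in mind, as a sketch only (the exact
"quota enabled" field name is from memory, so use whatever fs_info
actually carries at that point):

	/*
	 * At the top of qgroup_account_snapshot(): bail out early when
	 * qgroups are off, so plain snapshot creation keeps its old cost.
	 * ("quota_enabled" is my guess at the field name of this era.)
	 */
	if (!src->fs_info->quota_enabled)
		return 0;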

Thanks,
	--Mark

--
Mark Fasheh
Josef Bacik May 11, 2016, 4:59 p.m. UTC | #23
On 05/11/2016 09:57 AM, Mark Fasheh wrote:
> Hi Josef,
>
> On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote:
>> On 04/15/2016 05:08 AM, Qu Wenruo wrote:
>>> Current btrfs qgroup design implies a requirement that after calling
>>> btrfs_qgroup_account_extents() there must be a commit root switch.
>>>
>>> Normally this is OK, as btrfs_qgroup_accounting_extents() is only called
>>> inside btrfs_commit_transaction() just be commit_cowonly_roots().
>>>
>>> However there is a exception at create_pending_snapshot(), which will
>>> call btrfs_qgroup_account_extents() but no any commit root switch.
>>>
>>> In case of creating a snapshot whose parent root is itself (create a
>>> snapshot of fs tree), it will corrupt qgroup by the following trace:
>>> (skipped unrelated data)
>>> ======
>>> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
>>> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
>>> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
>>> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
>>> ======
>>>
>>> The problem here is in first qgroup_account_extent(), the
>>> nr_new_roots of the extent is 1, which means its reference got
>>> increased, and qgroup increased its rfer and excl.
>>>
>>> But at second qgroup_account_extent(), its reference got decreased, but
>>> between these two qgroup_account_extent(), there is no switch roots.
>>> This leads to the same nr_old_roots, and this extent just got ignored by
>>> qgroup, which means this extent is wrongly accounted.
>>>
>>> Fix it by call commit_cowonly_roots() after qgroup_account_extent() in
>>> create_pending_snapshot(), with needed preparation.
>>>
>>> Reported-by: Mark Fasheh <mfasheh@suse.de>
>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>> ---
>>> v2:
>>>   Fix a soft lockup caused by missing switch_commit_root() call.
>>>   Fix a warning caused by dirty-but-not-committed root.
>>> v3:
>>>   Fix a bug which will cause qgroup accounting for dropping snapshot
>>>   wrong
>>> v4:
>>>   Fix a bug caused by non-cowed btree modification.
>>>
>>> To Filipe:
>>>   I'm sorry I didn't wait for your reply on the dropped roots.
>>>   I reverted back the version where we deleted dropped roots in
>>>   switch_commit_roots().
>>>
>>>   As I think as long as we called btrfs_qgroup_prepare_account_extents()
>>>   and btrfs_qgroup_account_extents(), it should have already accounted
>>>   extents for dropped roots, and then we are OK to drop them.
>>>
>>>   It would be very nice if you could point out what I missed.
>>>   Thanks
>>>   Qu
>>> ---
>>>  fs/btrfs/transaction.c | 117 +++++++++++++++++++++++++++++++++++++++----------
>>>  1 file changed, 93 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>> index 43885e5..92f8193 100644
>>> --- a/fs/btrfs/transaction.c
>>> +++ b/fs/btrfs/transaction.c
>>> @@ -311,10 +311,11 @@ loop:
>>>   * when the transaction commits
>>>   */
>>>  static int record_root_in_trans(struct btrfs_trans_handle *trans,
>>> -			       struct btrfs_root *root)
>>> +			       struct btrfs_root *root,
>>> +			       int force)
>>>  {
>>> -	if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
>>> -	    root->last_trans < trans->transid) {
>>> +	if ((test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
>>> +	    root->last_trans < trans->transid) || force) {
>>>  		WARN_ON(root == root->fs_info->extent_root);
>>>  		WARN_ON(root->commit_root != root->node);
>>>
>>> @@ -331,7 +332,7 @@ static int record_root_in_trans(struct btrfs_trans_handle *trans,
>>>  		smp_wmb();
>>>
>>>  		spin_lock(&root->fs_info->fs_roots_radix_lock);
>>> -		if (root->last_trans == trans->transid) {
>>> +		if (root->last_trans == trans->transid && !force) {
>>>  			spin_unlock(&root->fs_info->fs_roots_radix_lock);
>>>  			return 0;
>>>  		}
>>> @@ -402,7 +403,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
>>>  		return 0;
>>>
>>>  	mutex_lock(&root->fs_info->reloc_mutex);
>>> -	record_root_in_trans(trans, root);
>>> +	record_root_in_trans(trans, root, 0);
>>>  	mutex_unlock(&root->fs_info->reloc_mutex);
>>>
>>>  	return 0;
>>> @@ -1311,6 +1312,80 @@ int btrfs_defrag_root(struct btrfs_root *root)
>>>  }
>>>
>>>  /*
>>> + * Do all special snapshot related qgroup dirty hack.
>>> + *
>>> + * Will do all needed qgroup inherit and dirty hack like switch commit
>>> + * roots inside one transaction and write all btree into disk, to make
>>> + * qgroup works.
>>> + */
>>> +static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
>>> +				   struct btrfs_root *src,
>>> +				   struct btrfs_root *parent,
>>> +				   struct btrfs_qgroup_inherit *inherit,
>>> +				   u64 dst_objectid)
>>> +{
>>> +	struct btrfs_fs_info *fs_info = src->fs_info;
>>> +	int ret;
>>> +
>>> +	/*
>>> +	 * We are going to commit transaction, see btrfs_commit_transaction()
>>> +	 * comment for reason locking tree_log_mutex
>>> +	 */
>>> +	mutex_lock(&fs_info->tree_log_mutex);
>>> +
>>> +	ret = commit_fs_roots(trans, src);
>>> +	if (ret)
>>> +		goto out;
>>> +	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
>>> +	if (ret < 0)
>>> +		goto out;
>>> +	ret = btrfs_qgroup_account_extents(trans, fs_info);
>>> +	if (ret < 0)
>>> +		goto out;
>>> +
>>> +	/* Now qgroup are all updated, we can inherit it to new qgroups */
>>> +	ret = btrfs_qgroup_inherit(trans, fs_info,
>>> +				   src->root_key.objectid, dst_objectid,
>>> +				   inherit);
>>> +	if (ret < 0)
>>> +		goto out;
>>> +
>>> +	/*
>>> +	 * Now we do a simplified commit transaction, which will:
>>> +	 * 1) commit all subvolume and extent tree
>>> +	 *    To ensure all subvolume and extent tree have a valid
>>> +	 *    commit_root to accounting later insert_dir_item()
>>> +	 * 2) write all btree blocks onto disk
>>> +	 *    This is to make sure later btree modification will be cowed
>>> +	 *    Or commit_root can be populated and cause wrong qgroup numbers
>>> +	 * In this simplified commit, we don't really care about other trees
>>> +	 * like chunk and root tree, as they won't affect qgroup.
>>> +	 * And we don't write super to avoid half committed status.
>>> +	 */
>>> +	ret = commit_cowonly_roots(trans, src);
>>> +	if (ret)
>>> +		goto out;
>>> +	switch_commit_roots(trans->transaction, fs_info);
>>> +	ret = btrfs_write_and_wait_transaction(trans, src);
>>> +	if (ret)
>>> +		btrfs_std_error(fs_info, ret,
>>> +			"Error while writing out transaction for qgroup");
>>> +
>>> +out:
>>> +	mutex_unlock(&fs_info->tree_log_mutex);
>>> +
>>> +	/*
>>> +	 * Force parent root to be updated, as we recorded it before so its
>>> +	 * last_trans == cur_transid.
>>> +	 * Or it won't be committed again onto disk after later
>>> +	 * insert_dir_item()
>>> +	 */
>>> +	if (!ret)
>>> +		record_root_in_trans(trans, parent, 1);
>>> +	return ret;
>>> +}
>>
>> NACK, holy shit we aren't adding a special transaction commit only
>> for qgroup snapshots.  Figure out a different way.  Thanks,
>
>
> Unfortunately I think we're going to have to swallow our pride on this one :(
>
> Per our conversations on irc, and my detailed observations in this e-mail:
>
> https://www.marc.info/?l=linux-btrfs&m=146257186311897&w=2
>
> It seems like we have a core problem in that root counting during snapshot
> create is unreliable and leads to corrupted qgroups. You add into that the
> poor assumptions made by the rest of the code (such as qgroup_inherit()
> always marking dst->excl = node_size) and ti looks like we won't have a
> proper fix until another qgroup rewrite.
>
> In the meantime, this would make qgroups numbers correct again. If we drop a
> single check in there to only run when qgroups are enabled, we can mitigate
> the performance impact. If I send that patch would you be ok to ACK it this
> time around?
>

Yeah, as long as it's limited to only the qgroup case then I'm fine with 
it for now.  Thanks,

Josef


Patch

======
btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
======

The problem here is that in the first qgroup_account_extent(), the
nr_new_roots of the extent is 1, which means its reference got
increased, so qgroup increased its rfer and excl.

But in the second qgroup_account_extent() its reference got decreased,
yet between these two qgroup_account_extent() calls there is no commit
root switch. This leads to the same nr_old_roots, so this extent just
gets ignored by qgroup, which means it is wrongly accounted.

Fix it by calling commit_cowonly_roots() after qgroup_account_extent()
in create_pending_snapshot(), with the needed preparation.

Reported-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
v2:
  Fix a soft lockup caused by missing switch_commit_root() call.
  Fix a warning caused by dirty-but-not-committed root.
v3:
  Fix a bug which would cause wrong qgroup accounting when dropping a
  snapshot
v4:
  Fix a bug caused by non-cowed btree modification.

To Filipe:
  I'm sorry I didn't wait for your reply on the dropped roots.
  I reverted back to the version where we deleted dropped roots in
  switch_commit_roots().

  I think that as long as we have called btrfs_qgroup_prepare_account_extents()
  and btrfs_qgroup_account_extents(), the extents for the dropped roots should
  already be accounted, and then we are OK to drop them.

  It would be very nice if you could point out what I missed.
  Thanks
  Qu
---
 fs/btrfs/transaction.c | 117 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 93 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 43885e5..92f8193 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -311,10 +311,11 @@  loop:
  * when the transaction commits
  */
 static int record_root_in_trans(struct btrfs_trans_handle *trans,
-			       struct btrfs_root *root)
+			       struct btrfs_root *root,
+			       int force)
 {
-	if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
-	    root->last_trans < trans->transid) {
+	if ((test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
+	    root->last_trans < trans->transid) || force) {
 		WARN_ON(root == root->fs_info->extent_root);
 		WARN_ON(root->commit_root != root->node);
 
@@ -331,7 +332,7 @@  static int record_root_in_trans(struct btrfs_trans_handle *trans,
 		smp_wmb();
 
 		spin_lock(&root->fs_info->fs_roots_radix_lock);
-		if (root->last_trans == trans->transid) {
+		if (root->last_trans == trans->transid && !force) {
 			spin_unlock(&root->fs_info->fs_roots_radix_lock);
 			return 0;
 		}
@@ -402,7 +403,7 @@  int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
 		return 0;
 
 	mutex_lock(&root->fs_info->reloc_mutex);
-	record_root_in_trans(trans, root);
+	record_root_in_trans(trans, root, 0);
 	mutex_unlock(&root->fs_info->reloc_mutex);
 
 	return 0;
@@ -1311,6 +1312,80 @@  int btrfs_defrag_root(struct btrfs_root *root)
 }
 
 /*
+ * Do all special snapshot related qgroup dirty hack.
+ *
+ * Will do all needed qgroup inherit and dirty hack like switch commit
+ * roots inside one transaction and write all btree into disk, to make
+ * qgroup works.
+ */
+static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
+				   struct btrfs_root *src,
+				   struct btrfs_root *parent,
+				   struct btrfs_qgroup_inherit *inherit,
+				   u64 dst_objectid)
+{
+	struct btrfs_fs_info *fs_info = src->fs_info;
+	int ret;
+
+	/*
+	 * We are going to commit transaction, see btrfs_commit_transaction()
+	 * comment for reason locking tree_log_mutex
+	 */
+	mutex_lock(&fs_info->tree_log_mutex);
+
+	ret = commit_fs_roots(trans, src);
+	if (ret)
+		goto out;
+	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
+	if (ret < 0)
+		goto out;
+	ret = btrfs_qgroup_account_extents(trans, fs_info);
+	if (ret < 0)
+		goto out;
+
+	/* Now qgroup are all updated, we can inherit it to new qgroups */
+	ret = btrfs_qgroup_inherit(trans, fs_info,
+				   src->root_key.objectid, dst_objectid,
+				   inherit);
+	if (ret < 0)
+		goto out;
+
+	/*
+	 * Now we do a simplified commit transaction, which will:
+	 * 1) commit all subvolume and extent tree
+	 *    To ensure all subvolume and extent tree have a valid
+	 *    commit_root to accounting later insert_dir_item()
+	 * 2) write all btree blocks onto disk
+	 *    This is to make sure later btree modification will be cowed
+	 *    Or commit_root can be populated and cause wrong qgroup numbers
+	 * In this simplified commit, we don't really care about other trees
+	 * like chunk and root tree, as they won't affect qgroup.
+	 * And we don't write super to avoid half committed status.
+	 */
+	ret = commit_cowonly_roots(trans, src);
+	if (ret)
+		goto out;
+	switch_commit_roots(trans->transaction, fs_info);
+	ret = btrfs_write_and_wait_transaction(trans, src);
+	if (ret)
+		btrfs_std_error(fs_info, ret,
+			"Error while writing out transaction for qgroup");
+
+out:
+	mutex_unlock(&fs_info->tree_log_mutex);
+
+	/*
+	 * Force parent root to be updated, as we recorded it before so its
+	 * last_trans == cur_transid.
+	 * Or it won't be committed again onto disk after later
+	 * insert_dir_item()
+	 */
+	if (!ret)
+		record_root_in_trans(trans, parent, 1);
+	return ret;
+}
+
+/*
  * new snapshots need to be created at a very specific time in the
  * transaction commit.  This does the actual creation.
  *
@@ -1383,7 +1458,7 @@  static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 	dentry = pending->dentry;
 	parent_inode = pending->dir;
 	parent_root = BTRFS_I(parent_inode)->root;
-	record_root_in_trans(trans, parent_root);
+	record_root_in_trans(trans, parent_root, 0);
 
 	cur_time = current_fs_time(parent_inode->i_sb);
 
@@ -1420,7 +1495,7 @@  static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 		goto fail;
 	}
 
-	record_root_in_trans(trans, root);
+	record_root_in_trans(trans, root, 0);
 	btrfs_set_root_last_snapshot(&root->root_item, trans->transid);
 	memcpy(new_root_item, &root->root_item, sizeof(*new_root_item));
 	btrfs_check_and_init_root_item(new_root_item);
@@ -1516,6 +1591,17 @@  static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 		goto fail;
 	}
 
+	/*
+	 * Do special qgroup accounting for snapshot, as we do some qgroup
+	 * snapshot hack to do fast snapshot.
+	 * To co-operate with that hack, we do hack again.
+	 * Or snapshot will be greatly slowed down by a subtree qgroup rescan
+	 */
+	ret = qgroup_account_snapshot(trans, root, parent_root,
+				      pending->inherit, objectid);
+	if (ret < 0)
+		goto fail;
+
 	ret = btrfs_insert_dir_item(trans, parent_root,
 				    dentry->d_name.name, dentry->d_name.len,
 				    parent_inode, &key,
@@ -1559,23 +1645,6 @@  static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 		goto fail;
 	}
 
-	/*
-	 * account qgroup counters before qgroup_inherit()
-	 */
-	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
-	if (ret)
-		goto fail;
-	ret = btrfs_qgroup_account_extents(trans, fs_info);
-	if (ret)
-		goto fail;
-	ret = btrfs_qgroup_inherit(trans, fs_info,
-				   root->root_key.objectid,
-				   objectid, pending->inherit);
-	if (ret) {
-		btrfs_abort_transaction(trans, root, ret);
-		goto fail;
-	}
-
 fail:
 	pending->error = ret;
 dir_item_existed: