diff mbox series

[RFC] btrfs: reflink: disable cross-subvolume clone/dedupe for simple quota

Message ID 74730c411b0fd87484c8d894878c5cd8bac1d434.1710992258.git.wqu@suse.com (mailing list archive)
State New, archived
Headers show
Series [RFC] btrfs: reflink: disable cross-subvolume clone/dedupe for simple quota | expand

Commit Message

Qu Wenruo March 21, 2024, 3:39 a.m. UTC
Unlike the full qgroup, simple quota no longer makes backref walk to
properly make accurate accounting for each subvolume.

Instead it goes a much faster and much simpler way, anything modified by
the subvolume would be accounted to that subvolume.

Although it brings some small accuracy problem, mostly related to shared
extents between different subvolumes, the reduced overhead is more than
good enough.

Considering there are only 2 ways to share extents between subvolumes:

- Snapshotting
- Cross-subvolume clone/dedupe

And since snapshotting is the core functionality of btrfs, we will never
disable that.

But on the other hand, cross-subvolume snapshotting is not so critical,
and disabling that for simple quota would improve the accuracy of it,
I'd say it's worthy to do that.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/reflink.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
---
[REASON FOR RFC]
I'm not sure how important the cross-subvolume clone functionality is in
real-world.

But considering squota is mostly designed for container usage, in that
case disabling cross-subvolume clone should be completely fine.

Comments

Roman Mamedov March 21, 2024, 7:34 a.m. UTC | #1
On Thu, 21 Mar 2024 14:09:38 +1030
Qu Wenruo <wqu@suse.com> wrote:

> Unlike the full qgroup, simple quota no longer makes backref walk to
> properly make accurate accounting for each subvolume.
> 
> Instead it goes a much faster and much simpler way, anything modified by
> the subvolume would be accounted to that subvolume.
> 
> Although it brings some small accuracy problem, mostly related to shared
> extents between different subvolumes, the reduced overhead is more than
> good enough.
> 
> Considering there are only 2 ways to share extents between subvolumes:
> 
> - Snapshotting
> - Cross-subvolume clone/dedupe
> 
> And since snapshotting is the core functionality of btrfs, we will never
> disable that.
> 
> But on the other hand, cross-subvolume snapshotting is not so critical,

I think you meant cross-subvolume clone/dedupe in this case ^

> and disabling that for simple quota would improve the accuracy of it,
> I'd say it's worthy to do that.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/reflink.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> ---
> [REASON FOR RFC]
> I'm not sure how important the cross-subvolume clone functionality is in
> real-world.
> 
> But considering squota is mostly designed for container usage, in that
> case disabling cross-subvolume clone should be completely fine.

Use case that comes to mind, create multiple LXC guests as snapshots from the
template container. They run for a few months, and then you upgrade OS and
software in each, using "apt dist-upgrade", to the latest releases.

Now it would be neat to be able to run dedupe on them to bring disk usage back
to about that of a single container, since the OS and newly installed files
would be mostly the same in all of them.

It also could be that the containers are managed by different users or
customers, in which case recreating them from template on every new OS releases
would not be feasible, and upgrading each separately is the only option.

> diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
> index 08d0fb46ceec..906cb6166b67 100644
> --- a/fs/btrfs/reflink.c
> +++ b/fs/btrfs/reflink.c
> @@ -15,6 +15,7 @@
>  #include "file-item.h"
>  #include "file.h"
>  #include "super.h"
> +#include "qgroup.h"
>  
>  #define BTRFS_MAX_DEDUPE_LEN	SZ_16M
>  
> @@ -350,6 +351,19 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
>  	u64 last_dest_end = destoff;
>  	u64 prev_extent_end = off;
>  
> +	/*
> +	 * If squota is enabled, disable cloning between different subvolumes.
> +	 *
> +	 * As clone/reflink/dedupe is the only other way to share data between
> +	 * different subvolumes other than snapshotting.
> +	 * With it disabled, squota can be way more accurate.
> +	 */
> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE) {
> +		if (BTRFS_I(src)->root->root_key.objectid !=
> +		    BTRFS_I(inode)->root->root_key.objectid)
> +			return -EXDEV;
> +	}
> +
>  	ret = -ENOMEM;
>  	buf = kvmalloc(fs_info->nodesize, GFP_KERNEL);
>  	if (!buf)
David Disseldorp March 21, 2024, 1:25 p.m. UTC | #2
On Thu, 21 Mar 2024 14:09:38 +1030, Qu Wenruo wrote:
...
> [REASON FOR RFC]
> I'm not sure how important the cross-subvolume clone functionality is in
> real-world.
> 
> But considering squota is mostly designed for container usage, in that
> case disabling cross-subvolume clone should be completely fine.

I think copy_file_range() is reasonably common nowadays, and this would
impact such workloads. I don't find the creator-subvol-pays simple quota
behaviour too confusing, so would vote too keep things as is.

Cheers, David
Josef Bacik March 21, 2024, 6:51 p.m. UTC | #3
On Thu, Mar 21, 2024 at 02:09:38PM +1030, Qu Wenruo wrote:
> Unlike the full qgroup, simple quota no longer makes backref walk to
> properly make accurate accounting for each subvolume.
> 
> Instead it goes a much faster and much simpler way, anything modified by
> the subvolume would be accounted to that subvolume.
> 
> Although it brings some small accuracy problem, mostly related to shared
> extents between different subvolumes, the reduced overhead is more than
> good enough.
> 
> Considering there are only 2 ways to share extents between subvolumes:
> 
> - Snapshotting
> - Cross-subvolume clone/dedupe
> 
> And since snapshotting is the core functionality of btrfs, we will never
> disable that.
> 
> But on the other hand, cross-subvolume snapshotting is not so critical,
> and disabling that for simple quota would improve the accuracy of it,
> I'd say it's worthy to do that.
> 

We did this on purpose, and absolutely want to leave this functionality in
place.  Boris made sure to document this behavior explicitly, because we are
absolutely taking advantage of this internally by having the package management
subvolume managed under a different quota, and then reflinking those packages
into their containers volume.  This is the price of squotas, you aren't getting
full tracking, but you're getting limits and speed.  Thanks,

Josef
Boris Burkov March 21, 2024, 7:08 p.m. UTC | #4
On Thu, Mar 21, 2024 at 02:51:35PM -0400, Josef Bacik wrote:
> On Thu, Mar 21, 2024 at 02:09:38PM +1030, Qu Wenruo wrote:
> > Unlike the full qgroup, simple quota no longer makes backref walk to
> > properly make accurate accounting for each subvolume.
> > 
> > Instead it goes a much faster and much simpler way, anything modified by
> > the subvolume would be accounted to that subvolume.
> > 
> > Although it brings some small accuracy problem, mostly related to shared
> > extents between different subvolumes, the reduced overhead is more than
> > good enough.
> > 
> > Considering there are only 2 ways to share extents between subvolumes:
> > 
> > - Snapshotting
> > - Cross-subvolume clone/dedupe
> > 
> > And since snapshotting is the core functionality of btrfs, we will never
> > disable that.
> > 
> > But on the other hand, cross-subvolume snapshotting is not so critical,
> > and disabling that for simple quota would improve the accuracy of it,
> > I'd say it's worthy to do that.
> > 
> 
> We did this on purpose, and absolutely want to leave this functionality in
> place.  Boris made sure to document this behavior explicitly, because we are
> absolutely taking advantage of this internally by having the package management
> subvolume managed under a different quota, and then reflinking those packages
> into their containers volume.  This is the price of squotas, you aren't getting
> full tracking, but you're getting limits and speed.  Thanks,
> 
> Josef

For a little extra context, if we hadn't wanted to support reflinking,
then squotas would have been yet simplER, and wouldn't have needed the
new inline owner refs. For better or worse, we decided it was in fact
quite important, so we added the owner refs. Now that we did the hard
work and committed to the incompat change, I certainly think it makes
sense to leave the support.

Boris
Qu Wenruo March 21, 2024, 8:17 p.m. UTC | #5
在 2024/3/21 23:55, David Disseldorp 写道:
> On Thu, 21 Mar 2024 14:09:38 +1030, Qu Wenruo wrote:
> ...
>> [REASON FOR RFC]
>> I'm not sure how important the cross-subvolume clone functionality is in
>> real-world.
>>
>> But considering squota is mostly designed for container usage, in that
>> case disabling cross-subvolume clone should be completely fine.
>
> I think copy_file_range() is reasonably common nowadays, and this would
> impact such workloads. I don't find the creator-subvol-pays simple quota
> behaviour too confusing, so would vote too keep things as is.
>
> Cheers, David
>
OK, thanks for all the feedback.

Please drop this patch.

Thanks,
Qu
Qu Wenruo March 21, 2024, 11:59 p.m. UTC | #6
在 2024/3/22 05:38, Boris Burkov 写道:
> On Thu, Mar 21, 2024 at 02:51:35PM -0400, Josef Bacik wrote:
>> On Thu, Mar 21, 2024 at 02:09:38PM +1030, Qu Wenruo wrote:
>>> Unlike the full qgroup, simple quota no longer makes backref walk to
>>> properly make accurate accounting for each subvolume.
>>>
>>> Instead it goes a much faster and much simpler way, anything modified by
>>> the subvolume would be accounted to that subvolume.
>>>
>>> Although it brings some small accuracy problem, mostly related to shared
>>> extents between different subvolumes, the reduced overhead is more than
>>> good enough.
>>>
>>> Considering there are only 2 ways to share extents between subvolumes:
>>>
>>> - Snapshotting
>>> - Cross-subvolume clone/dedupe
>>>
>>> And since snapshotting is the core functionality of btrfs, we will never
>>> disable that.
>>>
>>> But on the other hand, cross-subvolume snapshotting is not so critical,
>>> and disabling that for simple quota would improve the accuracy of it,
>>> I'd say it's worthy to do that.
>>>
>>
>> We did this on purpose, and absolutely want to leave this functionality in
>> place.  Boris made sure to document this behavior explicitly, because we are
>> absolutely taking advantage of this internally by having the package management
>> subvolume managed under a different quota, and then reflinking those packages
>> into their containers volume.  This is the price of squotas, you aren't getting
>> full tracking, but you're getting limits and speed.  Thanks,
>>
>> Josef
> 
> For a little extra context, if we hadn't wanted to support reflinking,
> then squotas would have been yet simplER,

I guess that's why my initial impression on squota, which can be even 
simpler without the extra extent tree change.

> and wouldn't have needed the
> new inline owner refs. For better or worse, we decided it was in fact
> quite important, so we added the owner refs. Now that we did the hard
> work and committed to the incompat change, I certainly think it makes
> sense to leave the support.

Considering you guys have the real world usage and have already 
committed so much to this feature, it's definitely worthy keeping.

The patch would be dropped.

Thanks,
Qu

> 
> Boris
diff mbox series

Patch

diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index 08d0fb46ceec..906cb6166b67 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -15,6 +15,7 @@ 
 #include "file-item.h"
 #include "file.h"
 #include "super.h"
+#include "qgroup.h"
 
 #define BTRFS_MAX_DEDUPE_LEN	SZ_16M
 
@@ -350,6 +351,19 @@  static int btrfs_clone(struct inode *src, struct inode *inode,
 	u64 last_dest_end = destoff;
 	u64 prev_extent_end = off;
 
+	/*
+	 * If squota is enabled, disable cloning between different subvolumes.
+	 *
+	 * As clone/reflink/dedupe is the only other way to share data between
+	 * different subvolumes other than snapshotting.
+	 * With it disabled, squota can be way more accurate.
+	 */
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE) {
+		if (BTRFS_I(src)->root->root_key.objectid !=
+		    BTRFS_I(inode)->root->root_key.objectid)
+			return -EXDEV;
+	}
+
 	ret = -ENOMEM;
 	buf = kvmalloc(fs_info->nodesize, GFP_KERNEL);
 	if (!buf)