Message ID | 74730c411b0fd87484c8d894878c5cd8bac1d434.1710992258.git.wqu@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] btrfs: reflink: disable cross-subvolume clone/dedupe for simple quota | expand |
On Thu, 21 Mar 2024 14:09:38 +1030 Qu Wenruo <wqu@suse.com> wrote: > Unlike the full qgroup, simple quota no longer makes backref walk to > properly make accurate accounting for each subvolume. > > Instead it goes a much faster and much simpler way, anything modified by > the subvolume would be accounted to that subvolume. > > Although it brings some small accuracy problem, mostly related to shared > extents between different subvolumes, the reduced overhead is more than > good enough. > > Considering there are only 2 ways to share extents between subvolumes: > > - Snapshotting > - Cross-subvolume clone/dedupe > > And since snapshotting is the core functionality of btrfs, we will never > disable that. > > But on the other hand, cross-subvolume snapshotting is not so critical, I think you meant cross-subvolume clone/dedupe in this case ^ > and disabling that for simple quota would improve the accuracy of it, > I'd say it's worthy to do that. > > Signed-off-by: Qu Wenruo <wqu@suse.com> > --- > fs/btrfs/reflink.c | 14 ++++++++++++++ > 1 file changed, 14 insertions(+) > --- > [REASON FOR RFC] > I'm not sure how important the cross-subvolume clone functionality is in > real-world. > > But considering squota is mostly designed for container usage, in that > case disabling cross-subvolume clone should be completely fine. Use case that comes to mind, create multiple LXC guests as snapshots from the template container. They run for a few months, and then you upgrade OS and software in each, using "apt dist-upgrade", to the latest releases. Now it would be neat to be able to run dedupe on them to bring disk usage back to about that of a single container, since the OS and newly installed files would be mostly the same in all of them. It also could be that the containers are managed by different users or customers, in which case recreating them from template on every new OS releases would not be feasible, and upgrading each separately is the only option. > diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c > index 08d0fb46ceec..906cb6166b67 100644 > --- a/fs/btrfs/reflink.c > +++ b/fs/btrfs/reflink.c > @@ -15,6 +15,7 @@ > #include "file-item.h" > #include "file.h" > #include "super.h" > +#include "qgroup.h" > > #define BTRFS_MAX_DEDUPE_LEN SZ_16M > > @@ -350,6 +351,19 @@ static int btrfs_clone(struct inode *src, struct inode *inode, > u64 last_dest_end = destoff; > u64 prev_extent_end = off; > > + /* > + * If squota is enabled, disable cloning between different subvolumes. > + * > + * As clone/reflink/dedupe is the only other way to share data between > + * different subvolumes other than snapshotting. > + * With it disabled, squota can be way more accurate. > + */ > + if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE) { > + if (BTRFS_I(src)->root->root_key.objectid != > + BTRFS_I(inode)->root->root_key.objectid) > + return -EXDEV; > + } > + > ret = -ENOMEM; > buf = kvmalloc(fs_info->nodesize, GFP_KERNEL); > if (!buf)
On Thu, 21 Mar 2024 14:09:38 +1030, Qu Wenruo wrote: ... > [REASON FOR RFC] > I'm not sure how important the cross-subvolume clone functionality is in > real-world. > > But considering squota is mostly designed for container usage, in that > case disabling cross-subvolume clone should be completely fine. I think copy_file_range() is reasonably common nowadays, and this would impact such workloads. I don't find the creator-subvol-pays simple quota behaviour too confusing, so would vote too keep things as is. Cheers, David
On Thu, Mar 21, 2024 at 02:09:38PM +1030, Qu Wenruo wrote: > Unlike the full qgroup, simple quota no longer makes backref walk to > properly make accurate accounting for each subvolume. > > Instead it goes a much faster and much simpler way, anything modified by > the subvolume would be accounted to that subvolume. > > Although it brings some small accuracy problem, mostly related to shared > extents between different subvolumes, the reduced overhead is more than > good enough. > > Considering there are only 2 ways to share extents between subvolumes: > > - Snapshotting > - Cross-subvolume clone/dedupe > > And since snapshotting is the core functionality of btrfs, we will never > disable that. > > But on the other hand, cross-subvolume snapshotting is not so critical, > and disabling that for simple quota would improve the accuracy of it, > I'd say it's worthy to do that. > We did this on purpose, and absolutely want to leave this functionality in place. Boris made sure to document this behavior explicitly, because we are absolutely taking advantage of this internally by having the package management subvolume managed under a different quota, and then reflinking those packages into their containers volume. This is the price of squotas, you aren't getting full tracking, but you're getting limits and speed. Thanks, Josef
On Thu, Mar 21, 2024 at 02:51:35PM -0400, Josef Bacik wrote: > On Thu, Mar 21, 2024 at 02:09:38PM +1030, Qu Wenruo wrote: > > Unlike the full qgroup, simple quota no longer makes backref walk to > > properly make accurate accounting for each subvolume. > > > > Instead it goes a much faster and much simpler way, anything modified by > > the subvolume would be accounted to that subvolume. > > > > Although it brings some small accuracy problem, mostly related to shared > > extents between different subvolumes, the reduced overhead is more than > > good enough. > > > > Considering there are only 2 ways to share extents between subvolumes: > > > > - Snapshotting > > - Cross-subvolume clone/dedupe > > > > And since snapshotting is the core functionality of btrfs, we will never > > disable that. > > > > But on the other hand, cross-subvolume snapshotting is not so critical, > > and disabling that for simple quota would improve the accuracy of it, > > I'd say it's worthy to do that. > > > > We did this on purpose, and absolutely want to leave this functionality in > place. Boris made sure to document this behavior explicitly, because we are > absolutely taking advantage of this internally by having the package management > subvolume managed under a different quota, and then reflinking those packages > into their containers volume. This is the price of squotas, you aren't getting > full tracking, but you're getting limits and speed. Thanks, > > Josef For a little extra context, if we hadn't wanted to support reflinking, then squotas would have been yet simplER, and wouldn't have needed the new inline owner refs. For better or worse, we decided it was in fact quite important, so we added the owner refs. Now that we did the hard work and committed to the incompat change, I certainly think it makes sense to leave the support. Boris
在 2024/3/21 23:55, David Disseldorp 写道: > On Thu, 21 Mar 2024 14:09:38 +1030, Qu Wenruo wrote: > ... >> [REASON FOR RFC] >> I'm not sure how important the cross-subvolume clone functionality is in >> real-world. >> >> But considering squota is mostly designed for container usage, in that >> case disabling cross-subvolume clone should be completely fine. > > I think copy_file_range() is reasonably common nowadays, and this would > impact such workloads. I don't find the creator-subvol-pays simple quota > behaviour too confusing, so would vote too keep things as is. > > Cheers, David > OK, thanks for all the feedback. Please drop this patch. Thanks, Qu
在 2024/3/22 05:38, Boris Burkov 写道: > On Thu, Mar 21, 2024 at 02:51:35PM -0400, Josef Bacik wrote: >> On Thu, Mar 21, 2024 at 02:09:38PM +1030, Qu Wenruo wrote: >>> Unlike the full qgroup, simple quota no longer makes backref walk to >>> properly make accurate accounting for each subvolume. >>> >>> Instead it goes a much faster and much simpler way, anything modified by >>> the subvolume would be accounted to that subvolume. >>> >>> Although it brings some small accuracy problem, mostly related to shared >>> extents between different subvolumes, the reduced overhead is more than >>> good enough. >>> >>> Considering there are only 2 ways to share extents between subvolumes: >>> >>> - Snapshotting >>> - Cross-subvolume clone/dedupe >>> >>> And since snapshotting is the core functionality of btrfs, we will never >>> disable that. >>> >>> But on the other hand, cross-subvolume snapshotting is not so critical, >>> and disabling that for simple quota would improve the accuracy of it, >>> I'd say it's worthy to do that. >>> >> >> We did this on purpose, and absolutely want to leave this functionality in >> place. Boris made sure to document this behavior explicitly, because we are >> absolutely taking advantage of this internally by having the package management >> subvolume managed under a different quota, and then reflinking those packages >> into their containers volume. This is the price of squotas, you aren't getting >> full tracking, but you're getting limits and speed. Thanks, >> >> Josef > > For a little extra context, if we hadn't wanted to support reflinking, > then squotas would have been yet simplER, I guess that's why my initial impression on squota, which can be even simpler without the extra extent tree change. > and wouldn't have needed the > new inline owner refs. For better or worse, we decided it was in fact > quite important, so we added the owner refs. Now that we did the hard > work and committed to the incompat change, I certainly think it makes > sense to leave the support. Considering you guys have the real world usage and have already committed so much to this feature, it's definitely worthy keeping. The patch would be dropped. Thanks, Qu > > Boris
diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c index 08d0fb46ceec..906cb6166b67 100644 --- a/fs/btrfs/reflink.c +++ b/fs/btrfs/reflink.c @@ -15,6 +15,7 @@ #include "file-item.h" #include "file.h" #include "super.h" +#include "qgroup.h" #define BTRFS_MAX_DEDUPE_LEN SZ_16M @@ -350,6 +351,19 @@ static int btrfs_clone(struct inode *src, struct inode *inode, u64 last_dest_end = destoff; u64 prev_extent_end = off; + /* + * If squota is enabled, disable cloning between different subvolumes. + * + * As clone/reflink/dedupe is the only other way to share data between + * different subvolumes other than snapshotting. + * With it disabled, squota can be way more accurate. + */ + if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE) { + if (BTRFS_I(src)->root->root_key.objectid != + BTRFS_I(inode)->root->root_key.objectid) + return -EXDEV; + } + ret = -ENOMEM; buf = kvmalloc(fs_info->nodesize, GFP_KERNEL); if (!buf)
Unlike the full qgroup, simple quota no longer makes backref walk to properly make accurate accounting for each subvolume. Instead it goes a much faster and much simpler way, anything modified by the subvolume would be accounted to that subvolume. Although it brings some small accuracy problem, mostly related to shared extents between different subvolumes, the reduced overhead is more than good enough. Considering there are only 2 ways to share extents between subvolumes: - Snapshotting - Cross-subvolume clone/dedupe And since snapshotting is the core functionality of btrfs, we will never disable that. But on the other hand, cross-subvolume snapshotting is not so critical, and disabling that for simple quota would improve the accuracy of it, I'd say it's worthy to do that. Signed-off-by: Qu Wenruo <wqu@suse.com> --- fs/btrfs/reflink.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) --- [REASON FOR RFC] I'm not sure how important the cross-subvolume clone functionality is in real-world. But considering squota is mostly designed for container usage, in that case disabling cross-subvolume clone should be completely fine.