diff mbox

[v5,9/9] btrfs: btrfs_copy_file_range() only supports reflinks

Message ID 1443634014-3026-10-git-send-email-Anna.Schumaker@Netapp.com (mailing list archive)
State New, archived
Headers show

Commit Message

Schumaker, Anna Sept. 30, 2015, 5:26 p.m. UTC
Reject copies that don't have the COPY_FR_REFLINK flag set.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/ioctl.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Christoph Hellwig Oct. 11, 2015, 2:29 p.m. UTC | #1
On Wed, Sep 30, 2015 at 01:26:53PM -0400, Anna Schumaker wrote:
> Reject copies that don't have the COPY_FR_REFLINK flag set.

I think a reflink actually is a perfectly valid copy, and I don't buy
the duplicate arguments in earlier threads.  We really need to think
more in terms of how this impacts a user and now how it's implemented
internally.  How does a user notice it's a reflink?  They don't as
implemented in btrfs and co.  Now on filesystem that don't always do
copy on write but might support reflinks (ocfs2, XFS in the future)
this becomes a bit more interesting - the difference he is that we
get an implicit fallocate when doing a real copy.  But if that's
something we have actual requests for that's how we should specify
it rather than in terms of arcane implementation details.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pádraig Brady Oct. 12, 2015, 10:23 a.m. UTC | #2
On 11/10/15 15:29, Christoph Hellwig wrote:
> On Wed, Sep 30, 2015 at 01:26:53PM -0400, Anna Schumaker wrote:
>> Reject copies that don't have the COPY_FR_REFLINK flag set.
> 
> I think a reflink actually is a perfectly valid copy, and I don't buy
> the duplicate arguments in earlier threads.  We really need to think
> more in terms of how this impacts a user and now how it's implemented
> internally.  How does a user notice it's a reflink?  They don't as
> implemented in btrfs and co. 

You're right that if the user doesn't notice, then there is no
point exposing this. However I think the user does notice as
there is a difference in the end state of the copy.  I.E. generally
if there is a different end state it would require an option,
while if only a different copying mechanism it would not.
I think the different end state of a reflink warrants an option for 3 reasons:

 - The user might want separate bits for resiliency. Now this is
   a weak argument due to possible deduplication in lower layers,
   but still valid is some setups.

 - The user might want to avoid CoW at a later time critical stage.

 - The user might want to avoid ENOSPC at a later critical stage.

> Now on filesystem that don't always do
> copy on write but might support reflinks (ocfs2, XFS in the future)
> this becomes a bit more interesting - the difference he is that we
> get an implicit fallocate when doing a real copy.  But if that's
> something we have actual requests for that's how we should specify
> it rather than in terms of arcane implementation details.

thanks,
Pádraig.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Oct. 12, 2015, 2:34 p.m. UTC | #3
On Mon, Oct 12, 2015 at 11:23:05AM +0100, P??draig Brady wrote:
> You're right that if the user doesn't notice, then there is no
> point exposing this. However I think the user does notice as
> there is a difference in the end state of the copy.  I.E. generally
> if there is a different end state it would require an option,
> while if only a different copying mechanism it would not.
> I think the different end state of a reflink warrants an option for 3 reasons:
> 
>  - The user might want separate bits for resiliency. Now this is
>    a weak argument due to possible deduplication in lower layers,
>    but still valid is some setups.

This one is completely bogus.  For one because literally every lower
layer can and increasinly will dedup or share in some form.  If we
prentend we could do this we actively mislead the user.

>  - The user might want to avoid CoW at a later time critical stage.
> 
>  - The user might want to avoid ENOSPC at a later critical stage.

These two are the same and would be the argument for the "falloc" flag
I mention before.  But we'd need to sit down and specify the exact
semantics for it.  For example one important question that comes to mind
is if it also applies for extents that are holes in the source range.

I'd much rather get the basic system call in ASAP and then let people
explain their use cases for this and only add it once we've made sure
we have consistent semantics that actually fit the users needs.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Darrick J. Wong Oct. 12, 2015, 11:41 p.m. UTC | #4
On Mon, Oct 12, 2015 at 07:34:44AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 12, 2015 at 11:23:05AM +0100, P??draig Brady wrote:
> > You're right that if the user doesn't notice, then there is no
> > point exposing this. However I think the user does notice as
> > there is a difference in the end state of the copy.  I.E. generally
> > if there is a different end state it would require an option,
> > while if only a different copying mechanism it would not.
> > I think the different end state of a reflink warrants an option for 3 reasons:
> > 
> >  - The user might want separate bits for resiliency. Now this is
> >    a weak argument due to possible deduplication in lower layers,
> >    but still valid is some setups.
> 
> This one is completely bogus.  For one because literally every lower
> layer can and increasinly will dedup or share in some form.  If we
> prentend we could do this we actively mislead the user.
> 
> >  - The user might want to avoid CoW at a later time critical stage.
> > 
> >  - The user might want to avoid ENOSPC at a later critical stage.
> 
> These two are the same and would be the argument for the "falloc" flag
> I mention before.  But we'd need to sit down and specify the exact
> semantics for it.  For example one important question that comes to mind
> is if it also applies for extents that are holes in the source range.

One of the patches in last week's XFS reflink patchbomb adds FALLOC_FL_UNSHARE
flag; at the moment it _only_ forces copy-on-write of shared blocks, and it
leaves holes alone.

Obviously we haven't yet figured out what are peoples' preferences in terms of
"fill the holes and unshare the shared" vs. "only unshare the shared" vs. "only
fill the holes".  It isn't that hard to add a FALLOC_FL_UNSHARE_FILL_HOLES flag
that fills the holes while unsharing is going on.

Personally I suspect that the most interest is in filling holes and unsharing,
because they don't want to pay for allocation at a critical stage for anywhere
in the file.  But I could be wrong, so allowing both goals to be expressed via
mode allows flexibility.

--D

> 
> I'd much rather get the basic system call in ASAP and then let people
> explain their use cases for this and only add it once we've made sure
> we have consistent semantics that actually fit the users needs.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Oct. 13, 2015, 7:29 a.m. UTC | #5
On Mon, Oct 12, 2015 at 04:41:06PM -0700, Darrick J. Wong wrote:
> One of the patches in last week's XFS reflink patchbomb adds FALLOC_FL_UNSHARE
> flag; at the moment it _only_ forces copy-on-write of shared blocks, and it
> leaves holes alone.

Yes, I've seen the implementation. 

> Obviously we haven't yet figured out what are peoples' preferences in terms of
> "fill the holes and unshare the shared" vs. "only unshare the shared" vs. "only
> fill the holes".  It isn't that hard to add a FALLOC_FL_UNSHARE_FILL_HOLES flag
> that fills the holes while unsharing is going on.
> 
> Personally I suspect that the most interest is in filling holes and unsharing,
> because they don't want to pay for allocation at a critical stage for anywhere
> in the file.  But I could be wrong, so allowing both goals to be expressed via
> mode allows flexibility.

Exactly.  And a normal falloc should do just that - fill holes and
ensure that we don't need to COW already allocated locks.  So I don't
think we need a new fallocate interface for that.  The question is if we
want a copy interface that gives you the same semantics as if you also
called an fallocate on the destination range.  For that case we'd
usually want to avoid doing the clone and instead do a in-kernel or
hardware assisted copy and then fill the holes with unwritten extents.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Darrick J. Wong Oct. 14, 2015, 6:46 p.m. UTC | #6
On Tue, Oct 13, 2015 at 12:29:59AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 12, 2015 at 04:41:06PM -0700, Darrick J. Wong wrote:
> > One of the patches in last week's XFS reflink patchbomb adds FALLOC_FL_UNSHARE
> > flag; at the moment it _only_ forces copy-on-write of shared blocks, and it
> > leaves holes alone.
> 
> Yes, I've seen the implementation. 
> 
> > Obviously we haven't yet figured out what are peoples' preferences in terms of
> > "fill the holes and unshare the shared" vs. "only unshare the shared" vs. "only
> > fill the holes".  It isn't that hard to add a FALLOC_FL_UNSHARE_FILL_HOLES flag
> > that fills the holes while unsharing is going on.
> > 
> > Personally I suspect that the most interest is in filling holes and unsharing,
> > because they don't want to pay for allocation at a critical stage for anywhere
> > in the file.  But I could be wrong, so allowing both goals to be expressed via
> > mode allows flexibility.
> 
> Exactly.  And a normal falloc should do just that - fill holes and
> ensure that we don't need to COW already allocated locks.  So I don't
> think we need a new fallocate interface for that.

The documentation for fallocate ought to be updated to include that as part of
guaranteeing that subsequent writes to the range won't fail due to ENOSPC,
shared blocks will be unshared.

Incidentally, btrfs leaves shared blocks alone.  OTOH, given that it's totally
COW it probably doesn't make sense to unshare blocks anyway... but maybe I
also don't want to dive into btrfs f-allocation behavior at this time. :)

Ok, so I'll rework the XFS funshare code into something that hangs off the
regular fallocate call, and get rid of the explicit 'funshare' bits.

> The question is if we
> want a copy interface that gives you the same semantics as if you also
> called an fallocate on the destination range.  For that case we'd
> usually want to avoid doing the clone and instead do a in-kernel or
> hardware assisted copy and then fill the holes with unwritten extents.

Probably; I can easily imagine people wanting to fill the holes and also
not wanting them filled.

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Oct. 15, 2015, 6 a.m. UTC | #7
On Wed, Oct 14, 2015 at 11:46:08AM -0700, Darrick J. Wong wrote:
> The documentation for fallocate ought to be updated to include that as part of
> guaranteeing that subsequent writes to the range won't fail due to ENOSPC,
> shared blocks will be unshared.
> 
> Incidentally, btrfs leaves shared blocks alone.  OTOH, given that it's totally
> COW it probably doesn't make sense to unshare blocks anyway... but maybe I
> also don't want to dive into btrfs f-allocation behavior at this time. :)
> 
> Ok, so I'll rework the XFS funshare code into something that hangs off the
> regular fallocate call, and get rid of the explicit 'funshare' bits.

Yes, that would be my preference.  I'd also like to understand what
exactly btrfs does in fallocate.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Oct. 15, 2015, 8:35 a.m. UTC | #8
On Wed, Oct 14, 2015 at 11:46:08AM -0700, Darrick J. Wong wrote:
> On Tue, Oct 13, 2015 at 12:29:59AM -0700, Christoph Hellwig wrote:
> > On Mon, Oct 12, 2015 at 04:41:06PM -0700, Darrick J. Wong wrote:
> > > One of the patches in last week's XFS reflink patchbomb adds FALLOC_FL_UNSHARE
> > > flag; at the moment it _only_ forces copy-on-write of shared blocks, and it
> > > leaves holes alone.
> > 
> > Yes, I've seen the implementation. 
> > 
> > > Obviously we haven't yet figured out what are peoples' preferences in terms of
> > > "fill the holes and unshare the shared" vs. "only unshare the shared" vs. "only
> > > fill the holes".  It isn't that hard to add a FALLOC_FL_UNSHARE_FILL_HOLES flag
> > > that fills the holes while unsharing is going on.
> > > 
> > > Personally I suspect that the most interest is in filling holes and unsharing,
> > > because they don't want to pay for allocation at a critical stage for anywhere
> > > in the file.  But I could be wrong, so allowing both goals to be expressed via
> > > mode allows flexibility.
> > 
> > Exactly.  And a normal falloc should do just that - fill holes and
> > ensure that we don't need to COW already allocated locks.  So I don't
> > think we need a new fallocate interface for that.
> 
> The documentation for fallocate ought to be updated to include that as part of
> guaranteeing that subsequent writes to the range won't fail due to ENOSPC,
> shared blocks will be unshared.
> 
> Incidentally, btrfs leaves shared blocks alone.  OTOH, given that it's totally
> COW it probably doesn't make sense to unshare blocks anyway... but maybe I
> also don't want to dive into btrfs f-allocation behavior at this time. :)
> 
> Ok, so I'll rework the XFS funshare code into something that hangs off the
> regular fallocate call, and get rid of the explicit 'funshare' bits.

Makes sense given we have the FALLOC_FL_ZERO_RANGE operation which
returns a zero to and preallocates all the holes in the range. I
would expect this operation on shared blocks to unshare blocks,
too...

> > The question is if we
> > want a copy interface that gives you the same semantics as if you also
> > called an fallocate on the destination range.  For that case we'd
> > usually want to avoid doing the clone and instead do a in-kernel or
> > hardware assisted copy and then fill the holes with unwritten extents.

If hole filling was required, then I'd do the operation the other
way around - prealloc the entire range, then do hardware assisted
copy of each separate data range in the source file with unwritten
extent conversion on offload completion...

> Probably; I can easily imagine people wanting to fill the holes and also
> not wanting them filled.

*nod*.

Cheers,

Dave.
Chris Mason Oct. 16, 2015, 11:49 a.m. UTC | #9
On Wed, Oct 14, 2015 at 11:00:45PM -0700, Christoph Hellwig wrote:
> On Wed, Oct 14, 2015 at 11:46:08AM -0700, Darrick J. Wong wrote:
> > The documentation for fallocate ought to be updated to include that as part of
> > guaranteeing that subsequent writes to the range won't fail due to ENOSPC,
> > shared blocks will be unshared.
> > 
> > Incidentally, btrfs leaves shared blocks alone.  OTOH, given that it's totally
> > COW it probably doesn't make sense to unshare blocks anyway... but maybe I
> > also don't want to dive into btrfs f-allocation behavior at this time. :)
> > 
> > Ok, so I'll rework the XFS funshare code into something that hangs off the
> > regular fallocate call, and get rid of the explicit 'funshare' bits.
> 
> Yes, that would be my preference.  I'd also like to understand what
> exactly btrfs does in fallocate.

For which part?  The answer changes based on how many references there
are to a given fallocated region.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Oct. 16, 2015, 12:25 p.m. UTC | #10
On Fri, Oct 16, 2015 at 07:49:19AM -0400, Chris Mason wrote:
> > Yes, that would be my preference.  I'd also like to understand what
> > exactly btrfs does in fallocate.
> 
> For which part?  The answer changes based on how many references there
> are to a given fallocated region.

Both cases.  With btrfs allocating new block on every write how do you
avoid that ENOSPC?  Is there a unassigned block preallocation that's
made persistent in some way?
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chris Mason Oct. 16, 2015, 1:19 p.m. UTC | #11
On Fri, Oct 16, 2015 at 05:25:44AM -0700, Christoph Hellwig wrote:
> On Fri, Oct 16, 2015 at 07:49:19AM -0400, Chris Mason wrote:
> > > Yes, that would be my preference.  I'd also like to understand what
> > > exactly btrfs does in fallocate.
> > 
> > For which part?  The answer changes based on how many references there
> > are to a given fallocated region.
> 
> Both cases.  With btrfs allocating new block on every write how do you
> avoid that ENOSPC?  Is there a unassigned block preallocation that's
> made persistent in some way?

So:

fallocate 1g -> foo

reflink foo foo2

We've now implicitly doubled the size of the fallocate, but at reflink
time btrfs doesn't account for the doubling.  It's actually much
better in this case to just use a hole because neither foo or foo2 can
use the preallocated space until the 1g is fully unshared.

When we're doing writes, it'll check the preallocated extents for extra
refs and force COW if any exist.  So writes into a preallocated region
can enospc.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Oct. 16, 2015, 9:44 p.m. UTC | #12
On Fri, Oct 16, 2015 at 09:19:50AM -0400, Chris Mason wrote:
> On Fri, Oct 16, 2015 at 05:25:44AM -0700, Christoph Hellwig wrote:
> > On Fri, Oct 16, 2015 at 07:49:19AM -0400, Chris Mason wrote:
> > > > Yes, that would be my preference.  I'd also like to understand what
> > > > exactly btrfs does in fallocate.
> > > 
> > > For which part?  The answer changes based on how many references there
> > > are to a given fallocated region.
> > 
> > Both cases.  With btrfs allocating new block on every write how do you
> > avoid that ENOSPC?  Is there a unassigned block preallocation that's
> > made persistent in some way?
> 
> So:
> 
> fallocate 1g -> foo
> 
> reflink foo foo2
> 
> We've now implicitly doubled the size of the fallocate, but at reflink

No, I don't think it implies that at all. the posix_fallocate()
"future writes will succeed" guarantee only applies to foo, not to
/copies/ such as foo2. At it's core, reflink is just an optimised
file copy mechanism - the resultant copy should have the same
behaviour as a file copied by read/write. Copies done by physically
copying data do not duplicate fallocate() regions or guarantees from
the source file to the destination file.

> time btrfs doesn't account for the doubling.  It's actually much
> better in this case to just use a hole because neither foo or foo2 can
> use the preallocated space until the 1g is fully unshared.

Right - this implies unwritten extents should not be shared by
reflink, instead either skipped (i.e. leave as a hole in foo2 as you
suggest) or duplicated so that the next write to the region of foo2
will also succeed. I'd suggest that COPY_FALLOC (or whatever it'll
get called) implies the latter behaviour, the default behaviour
being the former...

> When we're doing writes, it'll check the preallocated extents for extra
> refs and force COW if any exist.  So writes into a preallocated region
> can enospc.

This really seems like an btrfs interpretation/implementation
issue, not a problem for reflink in general.

Cheers,

Dave.
Chris Mason Oct. 17, 2015, 1:44 p.m. UTC | #13
On Sat, Oct 17, 2015 at 08:44:35AM +1100, Dave Chinner wrote:
> 
> > When we're doing writes, it'll check the preallocated extents for extra
> > refs and force COW if any exist.  So writes into a preallocated region
> > can enospc.
> 
> This really seems like an btrfs interpretation/implementation
> issue, not a problem for reflink in general.
> 

Right, now matter how we do it there are tradeoffs, and this one seemed
the least surprising to me.  I don't think it's a big problem at all.

Automatically replacing preallocated extents with holes during clone
seems like a better compromise though (at least for btrfs anyway).

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d3697e8..c1f115d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -44,6 +44,7 @@ 
 #include <linux/uuid.h>
 #include <linux/btrfs.h>
 #include <linux/uaccess.h>
+#include <linux/copy.h>
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -3848,6 +3849,9 @@  ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
 {
 	ssize_t ret;
 
+	if (!(flags & COPY_FR_REFLINK))
+		return -EOPNOTSUPP;
+
 	ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out);
 	if (ret == 0)
 		ret = len;