Message ID | 1443634014-3026-10-git-send-email-Anna.Schumaker@Netapp.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Wed, Sep 30, 2015 at 01:26:53PM -0400, Anna Schumaker wrote:
> Reject copies that don't have the COPY_FR_REFLINK flag set.
I think a reflink actually is a perfectly valid copy, and I don't buy
the duplicate arguments in earlier threads. We really need to think
more in terms of how this impacts a user and now how it's implemented
internally. How does a user notice it's a reflink? They don't as
implemented in btrfs and co. Now on filesystem that don't always do
copy on write but might support reflinks (ocfs2, XFS in the future)
this becomes a bit more interesting - the difference he is that we
get an implicit fallocate when doing a real copy. But if that's
something we have actual requests for that's how we should specify
it rather than in terms of arcane implementation details.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 11/10/15 15:29, Christoph Hellwig wrote: > On Wed, Sep 30, 2015 at 01:26:53PM -0400, Anna Schumaker wrote: >> Reject copies that don't have the COPY_FR_REFLINK flag set. > > I think a reflink actually is a perfectly valid copy, and I don't buy > the duplicate arguments in earlier threads. We really need to think > more in terms of how this impacts a user and now how it's implemented > internally. How does a user notice it's a reflink? They don't as > implemented in btrfs and co. You're right that if the user doesn't notice, then there is no point exposing this. However I think the user does notice as there is a difference in the end state of the copy. I.E. generally if there is a different end state it would require an option, while if only a different copying mechanism it would not. I think the different end state of a reflink warrants an option for 3 reasons: - The user might want separate bits for resiliency. Now this is a weak argument due to possible deduplication in lower layers, but still valid is some setups. - The user might want to avoid CoW at a later time critical stage. - The user might want to avoid ENOSPC at a later critical stage. > Now on filesystem that don't always do > copy on write but might support reflinks (ocfs2, XFS in the future) > this becomes a bit more interesting - the difference he is that we > get an implicit fallocate when doing a real copy. But if that's > something we have actual requests for that's how we should specify > it rather than in terms of arcane implementation details. thanks, Pádraig. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Oct 12, 2015 at 11:23:05AM +0100, P??draig Brady wrote: > You're right that if the user doesn't notice, then there is no > point exposing this. However I think the user does notice as > there is a difference in the end state of the copy. I.E. generally > if there is a different end state it would require an option, > while if only a different copying mechanism it would not. > I think the different end state of a reflink warrants an option for 3 reasons: > > - The user might want separate bits for resiliency. Now this is > a weak argument due to possible deduplication in lower layers, > but still valid is some setups. This one is completely bogus. For one because literally every lower layer can and increasinly will dedup or share in some form. If we prentend we could do this we actively mislead the user. > - The user might want to avoid CoW at a later time critical stage. > > - The user might want to avoid ENOSPC at a later critical stage. These two are the same and would be the argument for the "falloc" flag I mention before. But we'd need to sit down and specify the exact semantics for it. For example one important question that comes to mind is if it also applies for extents that are holes in the source range. I'd much rather get the basic system call in ASAP and then let people explain their use cases for this and only add it once we've made sure we have consistent semantics that actually fit the users needs. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Oct 12, 2015 at 07:34:44AM -0700, Christoph Hellwig wrote: > On Mon, Oct 12, 2015 at 11:23:05AM +0100, P??draig Brady wrote: > > You're right that if the user doesn't notice, then there is no > > point exposing this. However I think the user does notice as > > there is a difference in the end state of the copy. I.E. generally > > if there is a different end state it would require an option, > > while if only a different copying mechanism it would not. > > I think the different end state of a reflink warrants an option for 3 reasons: > > > > - The user might want separate bits for resiliency. Now this is > > a weak argument due to possible deduplication in lower layers, > > but still valid is some setups. > > This one is completely bogus. For one because literally every lower > layer can and increasinly will dedup or share in some form. If we > prentend we could do this we actively mislead the user. > > > - The user might want to avoid CoW at a later time critical stage. > > > > - The user might want to avoid ENOSPC at a later critical stage. > > These two are the same and would be the argument for the "falloc" flag > I mention before. But we'd need to sit down and specify the exact > semantics for it. For example one important question that comes to mind > is if it also applies for extents that are holes in the source range. One of the patches in last week's XFS reflink patchbomb adds FALLOC_FL_UNSHARE flag; at the moment it _only_ forces copy-on-write of shared blocks, and it leaves holes alone. Obviously we haven't yet figured out what are peoples' preferences in terms of "fill the holes and unshare the shared" vs. "only unshare the shared" vs. "only fill the holes". It isn't that hard to add a FALLOC_FL_UNSHARE_FILL_HOLES flag that fills the holes while unsharing is going on. Personally I suspect that the most interest is in filling holes and unsharing, because they don't want to pay for allocation at a critical stage for anywhere in the file. But I could be wrong, so allowing both goals to be expressed via mode allows flexibility. --D > > I'd much rather get the basic system call in ASAP and then let people > explain their use cases for this and only add it once we've made sure > we have consistent semantics that actually fit the users needs. > -- > To unsubscribe from this list: send the line "unsubscribe linux-api" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Oct 12, 2015 at 04:41:06PM -0700, Darrick J. Wong wrote: > One of the patches in last week's XFS reflink patchbomb adds FALLOC_FL_UNSHARE > flag; at the moment it _only_ forces copy-on-write of shared blocks, and it > leaves holes alone. Yes, I've seen the implementation. > Obviously we haven't yet figured out what are peoples' preferences in terms of > "fill the holes and unshare the shared" vs. "only unshare the shared" vs. "only > fill the holes". It isn't that hard to add a FALLOC_FL_UNSHARE_FILL_HOLES flag > that fills the holes while unsharing is going on. > > Personally I suspect that the most interest is in filling holes and unsharing, > because they don't want to pay for allocation at a critical stage for anywhere > in the file. But I could be wrong, so allowing both goals to be expressed via > mode allows flexibility. Exactly. And a normal falloc should do just that - fill holes and ensure that we don't need to COW already allocated locks. So I don't think we need a new fallocate interface for that. The question is if we want a copy interface that gives you the same semantics as if you also called an fallocate on the destination range. For that case we'd usually want to avoid doing the clone and instead do a in-kernel or hardware assisted copy and then fill the holes with unwritten extents. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Oct 13, 2015 at 12:29:59AM -0700, Christoph Hellwig wrote: > On Mon, Oct 12, 2015 at 04:41:06PM -0700, Darrick J. Wong wrote: > > One of the patches in last week's XFS reflink patchbomb adds FALLOC_FL_UNSHARE > > flag; at the moment it _only_ forces copy-on-write of shared blocks, and it > > leaves holes alone. > > Yes, I've seen the implementation. > > > Obviously we haven't yet figured out what are peoples' preferences in terms of > > "fill the holes and unshare the shared" vs. "only unshare the shared" vs. "only > > fill the holes". It isn't that hard to add a FALLOC_FL_UNSHARE_FILL_HOLES flag > > that fills the holes while unsharing is going on. > > > > Personally I suspect that the most interest is in filling holes and unsharing, > > because they don't want to pay for allocation at a critical stage for anywhere > > in the file. But I could be wrong, so allowing both goals to be expressed via > > mode allows flexibility. > > Exactly. And a normal falloc should do just that - fill holes and > ensure that we don't need to COW already allocated locks. So I don't > think we need a new fallocate interface for that. The documentation for fallocate ought to be updated to include that as part of guaranteeing that subsequent writes to the range won't fail due to ENOSPC, shared blocks will be unshared. Incidentally, btrfs leaves shared blocks alone. OTOH, given that it's totally COW it probably doesn't make sense to unshare blocks anyway... but maybe I also don't want to dive into btrfs f-allocation behavior at this time. :) Ok, so I'll rework the XFS funshare code into something that hangs off the regular fallocate call, and get rid of the explicit 'funshare' bits. > The question is if we > want a copy interface that gives you the same semantics as if you also > called an fallocate on the destination range. For that case we'd > usually want to avoid doing the clone and instead do a in-kernel or > hardware assisted copy and then fill the holes with unwritten extents. Probably; I can easily imagine people wanting to fill the holes and also not wanting them filled. --D > -- > To unsubscribe from this list: send the line "unsubscribe linux-api" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Oct 14, 2015 at 11:46:08AM -0700, Darrick J. Wong wrote: > The documentation for fallocate ought to be updated to include that as part of > guaranteeing that subsequent writes to the range won't fail due to ENOSPC, > shared blocks will be unshared. > > Incidentally, btrfs leaves shared blocks alone. OTOH, given that it's totally > COW it probably doesn't make sense to unshare blocks anyway... but maybe I > also don't want to dive into btrfs f-allocation behavior at this time. :) > > Ok, so I'll rework the XFS funshare code into something that hangs off the > regular fallocate call, and get rid of the explicit 'funshare' bits. Yes, that would be my preference. I'd also like to understand what exactly btrfs does in fallocate. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Oct 14, 2015 at 11:46:08AM -0700, Darrick J. Wong wrote: > On Tue, Oct 13, 2015 at 12:29:59AM -0700, Christoph Hellwig wrote: > > On Mon, Oct 12, 2015 at 04:41:06PM -0700, Darrick J. Wong wrote: > > > One of the patches in last week's XFS reflink patchbomb adds FALLOC_FL_UNSHARE > > > flag; at the moment it _only_ forces copy-on-write of shared blocks, and it > > > leaves holes alone. > > > > Yes, I've seen the implementation. > > > > > Obviously we haven't yet figured out what are peoples' preferences in terms of > > > "fill the holes and unshare the shared" vs. "only unshare the shared" vs. "only > > > fill the holes". It isn't that hard to add a FALLOC_FL_UNSHARE_FILL_HOLES flag > > > that fills the holes while unsharing is going on. > > > > > > Personally I suspect that the most interest is in filling holes and unsharing, > > > because they don't want to pay for allocation at a critical stage for anywhere > > > in the file. But I could be wrong, so allowing both goals to be expressed via > > > mode allows flexibility. > > > > Exactly. And a normal falloc should do just that - fill holes and > > ensure that we don't need to COW already allocated locks. So I don't > > think we need a new fallocate interface for that. > > The documentation for fallocate ought to be updated to include that as part of > guaranteeing that subsequent writes to the range won't fail due to ENOSPC, > shared blocks will be unshared. > > Incidentally, btrfs leaves shared blocks alone. OTOH, given that it's totally > COW it probably doesn't make sense to unshare blocks anyway... but maybe I > also don't want to dive into btrfs f-allocation behavior at this time. :) > > Ok, so I'll rework the XFS funshare code into something that hangs off the > regular fallocate call, and get rid of the explicit 'funshare' bits. Makes sense given we have the FALLOC_FL_ZERO_RANGE operation which returns a zero to and preallocates all the holes in the range. I would expect this operation on shared blocks to unshare blocks, too... > > The question is if we > > want a copy interface that gives you the same semantics as if you also > > called an fallocate on the destination range. For that case we'd > > usually want to avoid doing the clone and instead do a in-kernel or > > hardware assisted copy and then fill the holes with unwritten extents. If hole filling was required, then I'd do the operation the other way around - prealloc the entire range, then do hardware assisted copy of each separate data range in the source file with unwritten extent conversion on offload completion... > Probably; I can easily imagine people wanting to fill the holes and also > not wanting them filled. *nod*. Cheers, Dave.
On Wed, Oct 14, 2015 at 11:00:45PM -0700, Christoph Hellwig wrote: > On Wed, Oct 14, 2015 at 11:46:08AM -0700, Darrick J. Wong wrote: > > The documentation for fallocate ought to be updated to include that as part of > > guaranteeing that subsequent writes to the range won't fail due to ENOSPC, > > shared blocks will be unshared. > > > > Incidentally, btrfs leaves shared blocks alone. OTOH, given that it's totally > > COW it probably doesn't make sense to unshare blocks anyway... but maybe I > > also don't want to dive into btrfs f-allocation behavior at this time. :) > > > > Ok, so I'll rework the XFS funshare code into something that hangs off the > > regular fallocate call, and get rid of the explicit 'funshare' bits. > > Yes, that would be my preference. I'd also like to understand what > exactly btrfs does in fallocate. For which part? The answer changes based on how many references there are to a given fallocated region. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Oct 16, 2015 at 07:49:19AM -0400, Chris Mason wrote: > > Yes, that would be my preference. I'd also like to understand what > > exactly btrfs does in fallocate. > > For which part? The answer changes based on how many references there > are to a given fallocated region. Both cases. With btrfs allocating new block on every write how do you avoid that ENOSPC? Is there a unassigned block preallocation that's made persistent in some way? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Oct 16, 2015 at 05:25:44AM -0700, Christoph Hellwig wrote: > On Fri, Oct 16, 2015 at 07:49:19AM -0400, Chris Mason wrote: > > > Yes, that would be my preference. I'd also like to understand what > > > exactly btrfs does in fallocate. > > > > For which part? The answer changes based on how many references there > > are to a given fallocated region. > > Both cases. With btrfs allocating new block on every write how do you > avoid that ENOSPC? Is there a unassigned block preallocation that's > made persistent in some way? So: fallocate 1g -> foo reflink foo foo2 We've now implicitly doubled the size of the fallocate, but at reflink time btrfs doesn't account for the doubling. It's actually much better in this case to just use a hole because neither foo or foo2 can use the preallocated space until the 1g is fully unshared. When we're doing writes, it'll check the preallocated extents for extra refs and force COW if any exist. So writes into a preallocated region can enospc. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Oct 16, 2015 at 09:19:50AM -0400, Chris Mason wrote: > On Fri, Oct 16, 2015 at 05:25:44AM -0700, Christoph Hellwig wrote: > > On Fri, Oct 16, 2015 at 07:49:19AM -0400, Chris Mason wrote: > > > > Yes, that would be my preference. I'd also like to understand what > > > > exactly btrfs does in fallocate. > > > > > > For which part? The answer changes based on how many references there > > > are to a given fallocated region. > > > > Both cases. With btrfs allocating new block on every write how do you > > avoid that ENOSPC? Is there a unassigned block preallocation that's > > made persistent in some way? > > So: > > fallocate 1g -> foo > > reflink foo foo2 > > We've now implicitly doubled the size of the fallocate, but at reflink No, I don't think it implies that at all. the posix_fallocate() "future writes will succeed" guarantee only applies to foo, not to /copies/ such as foo2. At it's core, reflink is just an optimised file copy mechanism - the resultant copy should have the same behaviour as a file copied by read/write. Copies done by physically copying data do not duplicate fallocate() regions or guarantees from the source file to the destination file. > time btrfs doesn't account for the doubling. It's actually much > better in this case to just use a hole because neither foo or foo2 can > use the preallocated space until the 1g is fully unshared. Right - this implies unwritten extents should not be shared by reflink, instead either skipped (i.e. leave as a hole in foo2 as you suggest) or duplicated so that the next write to the region of foo2 will also succeed. I'd suggest that COPY_FALLOC (or whatever it'll get called) implies the latter behaviour, the default behaviour being the former... > When we're doing writes, it'll check the preallocated extents for extra > refs and force COW if any exist. So writes into a preallocated region > can enospc. This really seems like an btrfs interpretation/implementation issue, not a problem for reflink in general. Cheers, Dave.
On Sat, Oct 17, 2015 at 08:44:35AM +1100, Dave Chinner wrote: > > > When we're doing writes, it'll check the preallocated extents for extra > > refs and force COW if any exist. So writes into a preallocated region > > can enospc. > > This really seems like an btrfs interpretation/implementation > issue, not a problem for reflink in general. > Right, now matter how we do it there are tradeoffs, and this one seemed the least surprising to me. I don't think it's a big problem at all. Automatically replacing preallocated extents with holes during clone seems like a better compromise though (at least for btrfs anyway). -chris -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index d3697e8..c1f115d 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -44,6 +44,7 @@ #include <linux/uuid.h> #include <linux/btrfs.h> #include <linux/uaccess.h> +#include <linux/copy.h> #include "ctree.h" #include "disk-io.h" #include "transaction.h" @@ -3848,6 +3849,9 @@ ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in, { ssize_t ret; + if (!(flags & COPY_FR_REFLINK)) + return -EOPNOTSUPP; + ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out); if (ret == 0) ret = len;